Databricks has launched OfficeQA, an open-source benchmark for evaluating AI in realistic business scenarios.
In Brief
- OfficeQA tests AI models in practical, real-world enterprise situations
- GPT-5.1 answers only about 43% of the questions correctly
- The benchmark serves as a diagnostic tool for AI weaknesses
Databricks Launches OfficeQA Open-Source Benchmark
Databricks has introduced a new open-source benchmark called OfficeQA, aimed at testing artificial intelligence in realistic business scenarios. Where many existing tests focus on abstract reasoning, OfficeQA emphasizes practical business situations. AI models must answer questions based on extensive documents such as the U.S. Treasury Bulletins, a vast collection full of tables and historical information.
Benchmark Details and AI Performance
The benchmark comprises 246 questions, each with a clearly verifiable answer. The results show that AI models such as GPT-5.1 achieve only moderate scores: slightly more than 43 percent of the questions were answered correctly, and on particularly difficult tasks the success rate dropped below 25 percent. These results show that performing well on academic tests does not guarantee that a model works reliably in practical business contexts.
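Because each question has a verifiable answer, a score like the one reported above can be understood as a simple fraction of correct answers. The following is a minimal sketch of such a calculation, not the scoring code used by OfficeQA: the function names `normalize` and `accuracy`, the question IDs, and the sample answers are all hypothetical, and the normalization rule (trimming, lowercasing, dropping thousands separators) is an assumption made for illustration.

```python
def normalize(answer: str) -> str:
    """Trim, lowercase, and drop thousands separators so '1,234' matches '1234'.

    Assumed normalization for illustration only; the real benchmark may differ.
    """
    return answer.strip().lower().replace(",", "")


def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Return the fraction of gold questions answered correctly.

    A missing prediction counts as a wrong answer.
    """
    if not gold:
        raise ValueError("gold answers must not be empty")
    correct = sum(
        1
        for qid, expected in gold.items()
        if qid in predictions and normalize(predictions[qid]) == normalize(expected)
    )
    return correct / len(gold)


# Hypothetical sample data, not taken from the benchmark.
gold = {"q1": "1,234", "q2": "3.5%", "q3": "1947"}
predictions = {"q1": "1234", "q2": "4.1%", "q3": " 1947 "}

score = accuracy(predictions, gold)
print(f"{score:.1%}")
```

In this toy example two of three answers match after normalization. Applied to a 246-question set, the same calculation would yield a score just above 43 percent when 106 answers are correct.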
Purpose and Future Development of OfficeQA
OfficeQA is intended not only to measure AI performance but also to serve as a diagnostic tool that exposes model weaknesses, especially in handling complex tables and diagrams. Databricks plans to develop the benchmark further as part of a competition, the Grounded Reasoning Cup 2026. Researchers and companies are invited to contribute their ideas and approaches to establish OfficeQA as a useful tool for advancing AI in the business environment.
The benchmark is available as an open-source project on GitHub, inviting the research community to actively participate and drive development forward.
Sources
- Source: Databricks




