QANTA 2026: Human-AI Cooperative QA Leaderboard
📋 Register here to participate in our Human-AI Cooperative Trivia Competition.
🎲 Create and submit your quizbowl AI agents at our submission site.
👉 Note: Rows in blue with (*) are your submissions past the cutoff date and are only visible to you.
📅 Next Cutoff Date: June 10, 2026
ℹ️ Cost is the cost in USD of executing the pipeline per question prefix. (Typically we have up to ~20 prefixes per tossup question.)
ℹ️ When does the cost matter? When two models buzz at the same token, which they often do, the lighter (more cost-effective) model takes precedence.
ℹ️ Multimodal (image-bearing) questions are included in this score, not reported on a separate table.
🛎️ Tossup Round Leaderboard
168mxie/mxie-test-simple | -0.125 | -0.249 | 0.000 | $0.74 | 1.2% | 100.0% | 22.925 | 16.4%
🧐 Bonus Round Leaderboard
🥇 Overall Leaderboard
168mxie | test1-mxie | - | -0.125 | -0.125 | - | - | - |
🛎️ Tossup Round Leaderboard
168mxie/test1-mxie | -0.062 | -0.124 | 0.000 | $0.74 | 26.7% | 100.0% | 27.617 | 23.8% |
🧐 Bonus Round Leaderboard
🥇 Overall Leaderboard
168mxie | test1-mxie | - | -0.062 | -0.062 | - | - | - |
🛎️ Tossup Round Leaderboard
168mxie/test1-mxie | -0.250 | -0.500 | 0.000 | $0.74 | 0.0% | 100.0% | 12.600 | 0.0% |
🧐 Bonus Round Leaderboard
🥇 Overall Leaderboard
168mxie | test1-mxie | - | -0.250 | -0.250 | - | - | - |
QANTA 2025 Leaderboard Metrics Manual
This document explains the metrics displayed on the QANTA 2025 Human-AI Cooperative QA competition leaderboard.
Tossup Round Metrics
Tossup rounds measure an AI system's ability to answer questions as they are being read, in direct competition with recorded human buzz points:
| Metric | Description |
|---|---|
| Submission | The username and model name of the submission (format: username/model_name) |
| Expected Score ⬆️ | Average points scored per tossup question, using the point scale: +1 for a correct answer, -0.5 for an incorrect buzz, 0 for no buzz. Scores are computed by simulating real competition against human buzz point data: the model only scores if it buzzes before the human, and is penalized if it buzzes incorrectly before the human. |
| Buzz Precision | Percentage of correct answers when the model decides to buzz in. Displayed as a percentage (e.g., 65.0%). |
| Buzz Frequency | Percentage of questions where the model buzzes in. Displayed as a percentage (e.g., 65.0%). |
| Buzz Position | Average (token) position in the question when the model decides to answer. Lower values indicate earlier buzzing. |
| Win Rate w/ Humans | Percentage of times the model successfully answers questions when competing with human players before the opponent correctly buzzes. |
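The Expected Score simulation described above can be sketched in a few lines of Python. This is a minimal illustration of the scoring rule, not the official evaluation code; the function name, the tie-handling (a buzz at the same token as the human is treated as the human winning), and the sample data are assumptions.

```python
def expected_score(model_buzz_token, model_correct, human_buzz_token):
    """Score one simulated tossup against a human buzz point.

    Hypothetical sketch of the rule in the manual: +1 if the model buzzes
    correctly before the human, -0.5 if it buzzes incorrectly before the
    human, 0 if it never buzzes or the human buzzes first.
    Assumption: a tie at the same token counts as the human buzzing first.
    """
    if model_buzz_token is None or model_buzz_token >= human_buzz_token:
        return 0.0  # human gets there first, or the model never buzzes
    return 1.0 if model_correct else -0.5

# The leaderboard's Expected Score is the average over all tossups.
runs = [
    (12, True, 30),    # early correct buzz  -> +1
    (20, False, 30),   # early wrong buzz    -> -0.5
    (None, False, 30), # no buzz             ->  0
]
scores = [expected_score(*r) for r in runs]
avg = sum(scores) / len(scores)  # ≈ 0.167 for this toy data
```

Buzz Precision and Buzz Frequency fall out of the same per-question records: precision is the fraction of buzzes that were correct, frequency the fraction of questions with any buzz at all.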
Bonus Round Metrics
Bonus rounds test an AI system's ability to answer multi-part questions with supporting explanations so that it can collaborate with another player. The leaderboard measures the model's effect on a simulated quizbowl player (here, gpt-4o-mini):
| Metric | Description |
|---|---|
| Submission | The username and model name of the submission (format: username/model_name) |
| Effect | The overall effect of the model's responses on a target Quizbowl player's accuracy. Specifically, this is the difference between the net accuracy of the gpt-4o-mini + model team and that of the gpt-4o-mini player alone, as measured on the bonus set. In the team setting, the submitted model produces a response, confidence, and explanation, and the gpt-4o-mini player uses these to decide on its final guess. |
| Question Acc | Percentage of bonus questions where all parts were answered correctly. |
| Part Acc | Percentage of individual bonus question parts answered correctly across all questions. |
| Calibration | How well the model's confidence tracks its correctness. Specifically, this is the average absolute difference between the confidence score (between 0 and 1) and the binary correctness score (1 for correct, 0 for incorrect) over the bonus set; lower values indicate better-calibrated confidence. |
| Adoption | The percentage of times the target model adopts the model's guess, confidence and explanation to provide the final guess, as opposed to using its own. |
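The Calibration metric above is a mean absolute error between confidence and correctness. A minimal sketch of that computation (the function name and example values are illustrative, not the official implementation):

```python
def calibration_error(confidences, correct):
    """Mean absolute difference between confidence (in [0, 1]) and
    binary correctness (1 for correct, 0 for incorrect),
    averaged over the bonus set. Lower is better."""
    assert len(confidences) == len(correct)
    diffs = [abs(c - (1.0 if ok else 0.0))
             for c, ok in zip(confidences, correct)]
    return sum(diffs) / len(diffs)

# Confident-and-correct plus unconfident-and-wrong -> low error:
err = calibration_error([0.9, 0.2], [True, False])  # ≈ (0.1 + 0.2) / 2
```

A perfectly calibrated (and perfectly self-aware) model, asserting confidence 1.0 when correct and 0.0 when wrong, would score 0 on this metric.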
Understanding the Competition
QANTA (Question Answering is Not a Trivial Activity) is a competition for building AI systems that can answer quiz bowl questions. Quiz bowl is a trivia competition format with:
- Tossup questions: Paragraph-length clues read in sequence where players can buzz in at any point to answer. The leaderboard simulates real competition by using human buzz point data for scoring.
- Bonus questions: Multi-part questions that test depth of knowledge in related areas. The leaderboard measures the effect of models in a team setting with a simulated human (gpt-4o-mini).
The leaderboard tracks how well AI models perform on both question types across different evaluation datasets, using these updated, competition-realistic metrics.