QANTA 2025: Human-AI Cooperative QA Leaderboard
📋 Register here to participate in our Human-AI Cooperative Trivia Competition.
🎲 Create and submit your quizbowl AI agents at our submission site.
👉 Note: Rows in blue marked with (*) are your own submissions made after the cutoff date; they are visible only to you.
📅 Next Cutoff Date: June 10, 2025
ℹ️ Cost is the cost in USD of executing the pipeline per question prefix (typically up to ~20 prefixes per tossup question).
ℹ️ When does the cost matter? When two models buzz at the same token, which they often do, the lighter (more cost-effective) model takes precedence.
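To make the tie-breaking rule concrete, here is a minimal sketch. The costs are taken from the leaderboard below, but the data structures and logic are hypothetical illustrations, not the competition's actual scoring code:

```python
# Hypothetical sketch of the tie-breaking rule above: among models that
# buzz at the same token, the cheapest pipeline takes precedence.
buzzes = [
    {"model": "gpt-sloth-2", "buzz_token": 42, "cost_usd": 3.53},
    {"model": "GPT40_Tossup_Titan", "buzz_token": 42, "cost_usd": 0.63},
    {"model": "vote-for-the-answer", "buzz_token": 57, "cost_usd": 1.29},
]
# Sort by buzz position first, then by cost: the earliest buzz wins, and
# ties at the same token go to the lighter (cheaper) model.
winner = min(buzzes, key=lambda b: (b["buzz_token"], b["cost_usd"]))
print(winner["model"])  # -> GPT40_Tossup_Titan
```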
🛎️ Tossup Round Leaderboard
mgor/single-step-meticulous-gpt-4o | -0.193 | -0.239 | 0.022 | $0.04 | 88.8% | 96.2% | 69.870 | 36.8%
🧐 Bonus Round Leaderboard
jaimiec/jaimiec-bonus-test | $0.08 | 0.033 | 89.2% | 68.8% | 86.2% | 36.7%
Amanvir/simple-two-step | $2.12 | 0.192 | 89.2% | 67.5% | 86.2% | 36.2% |
Amanvir/naive-agent-1 | $1.18 | 0.183 | 89.2% | 67.5% | 87.7% | 34.6% |
Amanvir/naive-agent-3 | $0.13 | 0.179 | 88.8% | 68.8% | 83.6% | 36.7% |
jaimiec/jaimiec-bonus-test | $1.77 | 0.175 | 88.8% | 67.5% | 83.0% | 35.0% |
LeoJ-xy/clue-extraction | $2.15 | 0.171 | 88.3% | 67.5% | 80.3% | 34.2% |
houyu0930/simple-bonus | $0.07 | 0.058 | 75.8% | 43.8% | 72.2% | 32.1% |
nmokaria/Mini4o_BonusPlus | $0.06 | 0.037 | 74.2% | 40.0% | 71.3% | 31.7% |
mrshu/simple-two-step | $0.08 | 0.033 | 74.2% | 40.0% | 69.9% | 34.2% |
Amanvir/naive-agent-2 | $1.42 | 0.000 | 0.0% | 0.0% | 100.0% | 0.0% |
Amanvir/two-step-2 | $2.54 | 0.000 | 0.0% | 0.0% | 100.0% | 0.0% |
houyu0930/default-qb-bonus | $0.04 | -0.017 | 66.7% | 28.7% | 65.3% | 45.0% |
🥇 Overall Leaderboard
Parth-Dua | single-step-meticulous-gpt-4o | jaimiec-bonus-test | 0.033 | -0.193 | 0.033 | 89.2% | 31.7%
Amanvir | gpt-sloth-2 | simple-two-step | 0.827 | 0.636 | 0.192 | 89.2% | 36.2% |
jaimiec | jaimiec-test-3 | jaimiec-bonus-test | 0.818 | 0.643 | 0.175 | 88.8% | 35.0% |
nmokaria | GPT40_Tossup_Titan | Mini4o_BonusPlus | 0.721 | 0.684 | 0.037 | 74.2% | 31.7% |
LeoJ-xy | vote-for-the-answer | clue-extraction | 0.698 | 0.528 | 0.171 | 88.3% | 34.2% |
Parth-Dua | Sub5 | - | 0.548 | 0.548 | - | - | - |
mgor | single-step-meticulous-gpt-4o | - | 0.334 | 0.334 | - | - | - |
mrshu | - | simple-two-step | 0.033 | - | 0.033 | 74.2% | 34.2% |
spc2best | cosing-1 | - | 0.000 | 0.000 | - | - | - |
houyu0930 | simple-agent | simple-bonus | -0.135 | -0.193 | 0.058 | 75.8% | 32.1% |
🛎️ Tossup Round Leaderboard
mgor/single-step-meticulous-gpt-4o | -0.175 | -0.174 | -0.176 | $0.04 | 21.7% | 86.7% | 41.017 | 21.8%
nmokaria/GPT40_Tossup_Titan | 0.690 | 0.727 | 0.653 | $0.63 | 90.0% | 100.0% | 69.367 | 77.2% |
Amanvir/gpt-sloth-2 | 0.689 | 0.783 | 0.595 | $3.53 | 95.0% | 100.0% | 70.650 | 80.8% |
Amanvir/gpt-sloth-freq-fix | 0.626 | 0.723 | 0.528 | $3.05 | 90.0% | 100.0% | 74.067 | 76.8% |
Amanvir/pair-gpt-claude-1 | 0.589 | 0.609 | 0.568 | $2.38 | 76.7% | 100.0% | 42.367 | 72.7% |
Amanvir/gpt-sloth-3 | 0.560 | 0.617 | 0.503 | $0.33 | 78.3% | 100.0% | 63.683 | 71.6% |
Amanvir/gpt-sloth | 0.556 | 0.627 | 0.484 | $2.94 | 90.0% | 95.0% | 90.316 | 65.4% |
LeoJ-xy/vote-for-the-answer | 0.550 | 0.596 | 0.503 | $1.29 | 81.7% | 86.7% | 81.135 | 63.7% |
Amanvir/gpt-snail | 0.522 | 0.577 | 0.467 | $3.00 | 75.0% | 100.0% | 57.450 | 69.5% |
mgor/single-step-meticulous-gpt-4o | 0.394 | 0.410 | 0.378 | $0.79 | 61.7% | 100.0% | 41.017 | 59.8% |
houyu0930/simple-agent | -0.175 | -0.174 | -0.176 | $0.04 | 21.7% | 100.0% | 33.117 | 21.8% |
houyu0930/simple-qb-player | -0.279 | -0.277 | -0.282 | $0.39 | 15.0% | 100.0% | 15.467 | 14.9% |
🧐 Bonus Round Leaderboard
houyu0930/default-qb-bonus | $0.08 | 0.056 | 96.1% | 38.3% | 93.2% | 32.2%
Amanvir/two-step-2 | $2.54 | 0.206 | 96.1% | 88.3% | 93.2% | 32.2% |
Amanvir/naive-agent-2 | $1.42 | 0.194 | 96.1% | 88.3% | 90.9% | 29.4% |
Amanvir/naive-agent-1 | $1.18 | 0.178 | 87.8% | 66.7% | 86.1% | 30.0% |
Amanvir/simple-two-step | $2.12 | 0.172 | 88.3% | 68.3% | 86.4% | 28.3% |
LeoJ-xy/clue-extraction | $2.15 | 0.167 | 86.7% | 61.7% | 78.9% | 28.3% |
Amanvir/naive-agent-3 | $0.13 | 0.161 | 92.2% | 78.3% | 86.9% | 29.4% |
mrshu/simple-two-step | $0.08 | 0.061 | 75.6% | 38.3% | 70.8% | 26.7% |
houyu0930/simple-bonus | $0.07 | 0.056 | 77.2% | 43.3% | 73.9% | 17.8% |
houyu0930/default-qb-bonus | $0.04 | 0.017 | 65.0% | 21.7% | 63.3% | 33.9% |
🥇 Overall Leaderboard
houyu0930 | single-step-meticulous-gpt-4o | clue-extraction | 0.394 | -0.175 | 0.056 | 96.1% | 32.2%
Amanvir | gpt-sloth-2 | two-step-2 | 0.894 | 0.689 | 0.206 | 96.1% | 32.2% |
LeoJ-xy | vote-for-the-answer | clue-extraction | 0.716 | 0.550 | 0.167 | 86.7% | 28.3% |
nmokaria | GPT40_Tossup_Titan | - | 0.690 | 0.690 | - | - | - |
mgor | single-step-meticulous-gpt-4o | - | 0.394 | 0.394 | - | - | - |
mrshu | - | simple-two-step | 0.061 | - | 0.061 | 75.6% | 26.7% |
houyu0930 | simple-agent | simple-bonus | -0.119 | -0.175 | 0.056 | 77.2% | 17.8% |
🛎️ Tossup Round Leaderboard
mgor/single-step-meticulous-gpt-4o | -0.194 | -0.190 | -0.197 | $0.04 | 80.0% | 80.0% | 109.600 | 64.6%
Amanvir/gpt-sloth | 0.512 | 0.550 | 0.473 | $2.94 | 80.0% | 100.0% | 109.600 | 64.6% |
Amanvir/pair-gpt-claude-1 | 0.087 | 0.090 | 0.083 | $2.38 | 40.0% | 100.0% | 61.400 | 39.0% |
LeoJ-xy/vote-for-the-answer | 0.068 | 0.086 | 0.050 | $1.29 | 40.0% | 80.0% | 98.000 | 32.4% |
mgor/single-step-meticulous-gpt-4o | -0.194 | -0.190 | -0.197 | $0.79 | 20.0% | 100.0% | 40.400 | 20.0% |
houyu0930/simple-agent | -0.500 | -0.500 | -0.500 | $0.04 | 0.0% | 100.0% | 24.400 | 0.0% |
houyu0930/simple-qb-player | -0.500 | -0.500 | -0.500 | $0.39 | 0.0% | 100.0% | 14.800 | 0.0% |
🧐 Bonus Round Leaderboard
houyu0930/default-qb-bonus | $2.12 | 0.200 | 86.7% | 80.0% | 84.7% | 26.7%
Amanvir/simple-two-step | $2.12 | 0.200 | 86.7% | 80.0% | 84.7% | 33.3% |
Amanvir/naive-agent-1 | $1.18 | 0.133 | 80.0% | 60.0% | 79.3% | 46.7% |
LeoJ-xy/clue-extraction | $2.15 | 0.133 | 80.0% | 60.0% | 73.3% | 33.3% |
houyu0930/simple-bonus | $0.07 | 0.067 | 73.3% | 40.0% | 70.3% | 26.7% |
houyu0930/default-qb-bonus | $0.04 | -0.133 | 53.3% | 0.0% | 51.3% | 33.3% |
🥇 Overall Leaderboard
houyu0930 | single-step-meticulous-gpt-4o | simple-two-step | -0.194 | -0.194 | 0.200 | 86.7% | 26.7%
Amanvir | gpt-sloth | simple-two-step | 0.712 | 0.512 | 0.200 | 86.7% | 33.3% |
LeoJ-xy | vote-for-the-answer | clue-extraction | 0.201 | 0.068 | 0.133 | 80.0% | 33.3% |
mgor | single-step-meticulous-gpt-4o | - | -0.194 | -0.194 | - | - | - |
houyu0930 | simple-agent | simple-bonus | -0.433 | -0.500 | 0.067 | 73.3% | 26.7% |
QANTA 2025 Leaderboard Metrics Manual
This document explains the metrics displayed on the QANTA 2025 Human-AI Cooperative QA competition leaderboard.
Tossup Round Metrics
Tossup rounds measure an AI system's ability to answer questions as they are being read, in direct competition with recorded human buzz points:
Metric | Description |
---|---|
Submission | The username and model name of the submission (format: username/model_name) |
Expected Score ⬆️ | Average points scored per tossup question, using the point scale: +1 for a correct answer, -0.5 for an incorrect buzz, 0 for no buzz. Scores are computed by simulating real competition against recorded human buzz points: the model scores only if it buzzes before the human, and is penalized if it buzzes incorrectly before the human (see the sketch after this table). |
Buzz Precision | Percentage of correct answers when the model decides to buzz in. Displayed as a percentage (e.g., 65.0%). |
Buzz Frequency | Percentage of questions where the model buzzes in. Displayed as a percentage (e.g., 65.0%). |
Buzz Position | Average (token) position in the question when the model decides to answer. Lower values indicate earlier buzzing. |
Win Rate w/ Humans | Percentage of questions where the model, competing against recorded human players, answers correctly before the opponent buzzes correctly. |
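A minimal sketch of the Expected Score simulation described above, under the assumption that a tie with the human buzz point scores no points; the function and field names here are hypothetical, not the official evaluation harness:

```python
from typing import Optional

# Hypothetical sketch of the tossup scoring rule: +1 for a correct buzz
# before the human, -0.5 for an incorrect buzz before the human, 0 otherwise.

def tossup_score(model_buzz: Optional[int], model_correct: bool,
                 human_buzz: int) -> float:
    """Score one tossup against a recorded human buzz point.

    model_buzz:    token index where the model buzzes (None = never buzzes).
    model_correct: whether the model's answer at its buzz point is correct.
    human_buzz:    token index where the human buzzed in correctly.
    """
    if model_buzz is None or model_buzz >= human_buzz:
        # The human answers first (assumption: ties go to the human).
        return 0.0
    return 1.0 if model_correct else -0.5

def expected_score(records) -> float:
    """Average score over (model_buzz, model_correct, human_buzz) records."""
    return sum(tossup_score(m, c, h) for m, c, h in records) / len(records)

# A correct buzz at token 40 vs. a human buzz at token 55 scores +1; an early
# wrong buzz costs -0.5; never buzzing scores 0.
print(expected_score([(40, True, 55), (30, False, 55), (None, False, 20)]))  # -> 0.1666...
```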
Bonus Round Metrics
Bonus rounds test an AI system's ability to answer multi-part questions and to provide explanations that help a teammate. The leaderboard measures the model's effect on a simulated quizbowl player (here, gpt-4o-mini):
Metric | Description |
---|---|
Submission | The username and model name of the submission (format: username/model_name) |
Effect | The overall effect of the model's responses on the target quizbowl player's accuracy. Specifically, this is the difference between the net accuracy of a gpt-4o-mini + model team and the accuracy of the gpt-4o-mini player alone, measured on the bonus set. In the team setting, the submitted model provides a guess, confidence, and explanation, which the gpt-4o-mini player uses when deciding on its final guess (see the sketch after this table). |
Question Acc | Percentage of bonus questions where all parts were answered correctly. |
Part Acc | Percentage of individual bonus question parts answered correctly across all questions. |
Calibration | How well the model's confidence matches its correctness. Specifically, this is the average absolute difference between the confidence score (between 0 and 1) and the binary correctness score (1 for correct, 0 for incorrect) over the bonus set. |
Adoption | The percentage of times the gpt-4o-mini player adopts the model's guess, confidence, and explanation for its final answer, as opposed to using its own. |
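A minimal sketch of how Effect and Calibration could be computed from per-part records, assuming hypothetical field names (solo_correct, team_correct, confidence, model_correct); the official evaluation code is not shown in this document:

```python
# Hypothetical sketch of the bonus-round Effect and Calibration metrics.
# Each record describes one bonus part:
#   solo_correct:  1/0, the gpt-4o-mini player answering alone
#   team_correct:  1/0, the gpt-4o-mini player after seeing the model's
#                  guess, confidence, and explanation
#   confidence:    the submitted model's confidence in [0, 1]
#   model_correct: 1/0, whether the submitted model's own guess was right

def effect(records) -> float:
    """Team accuracy minus solo accuracy over the bonus set."""
    team = sum(r["team_correct"] for r in records) / len(records)
    solo = sum(r["solo_correct"] for r in records) / len(records)
    return team - solo

def calibration_error(records) -> float:
    """Mean absolute difference between confidence and binary correctness,
    per the definition above (lower means better-calibrated confidence)."""
    return sum(abs(r["confidence"] - r["model_correct"]) for r in records) / len(records)

records = [
    {"solo_correct": 0, "team_correct": 1, "confidence": 0.9, "model_correct": 1},
    {"solo_correct": 1, "team_correct": 1, "confidence": 0.7, "model_correct": 1},
    {"solo_correct": 0, "team_correct": 0, "confidence": 0.4, "model_correct": 0},
]
print(effect(records))             # -> (2/3) - (1/3) = 0.333...
print(calibration_error(records))  # -> (0.1 + 0.3 + 0.4) / 3 = 0.266...
```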
Understanding the Competition
QANTA (Question Answering is Not a Trivial Activity) is a competition for building AI systems that can answer quiz bowl questions. Quiz bowl is a trivia competition format with:
- Tossup questions: Paragraph-length questions whose clues are read in sequence; players can buzz in at any point to answer. The leaderboard simulates real competition by scoring against recorded human buzz points.
- Bonus questions: Multi-part questions that test depth of knowledge in related areas. The leaderboard measures the effect of models in a team setting with a simulated human (gpt-4o-mini).
The leaderboard tracks how well AI models perform on both question types across different evaluation datasets, using these updated, competition-realistic metrics.