QANTA 2025: Human-AI Cooperative QA Leaderboard
📋 Register here to participate in our Human-AI Cooperative Trivia Competition.
🎲 Create and submit your quizbowl AI agents at our submission site.
👉 Note: Rows in blue marked with (*) are your submissions made after the cutoff date; they are visible only to you.
📅 Next Cutoff Date: June 07, 2025
🛎️ Tossup Round Leaderboard
Submission | Expected Score ⬆️ | Buzz Precision | Buzz Frequency | Buzz Position | Win Rate w/ Humans |
---|---|---|---|---|---|
mgor/single-step-meticulous-gpt-4o | -0.1741 | 0.2167 | 0.8667 | 41.02 | 0.2177 |
🧐 Bonus Round Leaderboard
Submission | Effect ⬆️ | Part Acc | Question Acc | Calibration | Adoption |
---|---|---|---|---|---|
houyu0930/default-qb-bonus | 0.0556 | 0.9611 | 0.3833 | 0.9321 | 0.3222 |
Amanvir/two-step-2 | 0.2056 | 0.9611 | 0.8833 | 0.9321 | 0.3222 |
Amanvir/naive-agent-2 | 0.1944 | 0.9611 | 0.8833 | 0.9088 | 0.2944 |
Amanvir/naive-agent-1 | 0.1778 | 0.8778 | 0.6667 | 0.8606 | 0.3000 |
Amanvir/simple-two-step | 0.1722 | 0.8833 | 0.6833 | 0.8644 | 0.2833 |
LeoJ-xy/clue-extraction | 0.1667 | 0.8667 | 0.6167 | 0.7889 | 0.2833 |
Amanvir/naive-agent-3 | 0.1611 | 0.9222 | 0.7833 | 0.8692 | 0.2944 |
mrshu/simple-two-step | 0.0611 | 0.7556 | 0.3833 | 0.7083 | 0.2667 |
houyu0930/simple-bonus | 0.0556 | 0.7722 | 0.4333 | 0.7386 | 0.1778 |
houyu0930/default-qb-bonus | 0.0167 | 0.6500 | 0.2167 | 0.6328 | 0.3389 |
🥇 Overall Leaderboard
Username | Tossup Model | Bonus Model | Overall Score ⬆️ | Tossup Expected Score | Bonus Effect | Bonus Part Acc | Adoption |
---|---|---|---|---|---|---|---|
houyu0930 | single-step-meticulous-gpt-4o | clue-extraction | -0.1186 | -0.1741 | 0.0556 | 0.9611 | 0.3222 |
Amanvir | gpt-sloth-2 | two-step-2 | 0.9882 | 0.7826 | 0.2056 | 0.9611 | 0.3222 |
LeoJ-xy | vote-for-the-answer | clue-extraction | 0.7623 | 0.5957 | 0.1667 | 0.8667 | 0.2833 |
nmokaria | GPT40_Tossup_Titan | null | 0.7273 | 0.7273 | null | null | null |
mgor | single-step-meticulous-gpt-4o | null | 0.4104 | 0.4104 | null | null | null |
mrshu | null | simple-two-step | 0.0611 | null | 0.0611 | 0.7556 | 0.2667 |
houyu0930 | simple-agent | simple-bonus | -0.1186 | -0.1741 | 0.0556 | 0.7722 | 0.1778 |
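Across the rows above, the Overall Score column is consistent with summing a submission's Tossup Expected Score and Bonus Effect, with a missing side contributing nothing (e.g., for Amanvir, 0.7826 + 0.2056 = 0.9882). A hedged reading of the aggregate:

$$\text{Overall Score} = \text{Expected Score}_{\text{tossup}} + \text{Effect}_{\text{bonus}}$$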
🛎️ Tossup Round Leaderboard
Submission | Expected Score ⬆️ | Buzz Precision | Buzz Frequency | Buzz Position | Win Rate w/ Humans |
---|---|---|---|---|---|
mgor/single-step-meticulous-gpt-4o | -0.1762 | 0.8000 | 0.8000 | 109.6 | 0.1929 |
Amanvir/gpt-sloth | 0.5503 | 0.8000 | 1.0000 | 109.6 | 0.6457 |
Amanvir/pair-gpt-claude-1 | 0.0903 | 0.4000 | 1.0000 | 61.4 | 0.3905 |
LeoJ-xy/vote-for-the-answer | 0.0857 | 0.4000 | 0.8000 | 98.0 | 0.3238 |
houyu0930/two-step-agent | -0.1762 | 0.0000 | 0.4000 | 84.0 | 0.1929 |
mgor/single-step-meticulous-gpt-4o | -0.1905 | 0.2000 | 1.0000 | 40.4 | 0.2000 |
houyu0930/simple-agent | -0.5000 | 0.0000 | 1.0000 | 24.4 | 0.0000 |
houyu0930/simple-qb-player | -0.5000 | 0.0000 | 1.0000 | 14.8 | 0.0000 |
🧐 Bonus Round Leaderboard
Submission | Effect ⬆️ | Part Acc | Question Acc | Calibration | Adoption |
---|---|---|---|---|---|
houyu0930/default-qb-bonus | 0.2000 | 0.8667 | 0.8000 | 0.8467 | 0.2667 |
Amanvir/simple-two-step | 0.2000 | 0.8667 | 0.8000 | 0.8467 | 0.3333 |
Amanvir/naive-agent-1 | 0.1333 | 0.8000 | 0.6000 | 0.7933 | 0.4667 |
LeoJ-xy/clue-extraction | 0.1333 | 0.8000 | 0.6000 | 0.7333 | 0.3333 |
houyu0930/simple-bonus | 0.0667 | 0.7333 | 0.4000 | 0.7033 | 0.2667 |
houyu0930/default-qb-bonus | -0.1333 | 0.5333 | 0.0000 | 0.5133 | 0.3333 |
🥇 Overall Leaderboard
Username | Tossup Model | Bonus Model | Overall Score ⬆️ | Tossup Expected Score | Bonus Effect | Bonus Part Acc | Adoption |
---|---|---|---|---|---|---|---|
houyu0930 | single-step-meticulous-gpt-4o | simple-two-step | -0.1095 | -0.1762 | 0.2000 | 0.8667 | 0.2667 |
Amanvir | gpt-sloth | simple-two-step | 0.7503 | 0.5503 | 0.2000 | 0.8667 | 0.3333 |
LeoJ-xy | vote-for-the-answer | clue-extraction | 0.2191 | 0.0857 | 0.1333 | 0.8000 | 0.3333 |
houyu0930 | two-step-agent | simple-bonus | -0.1095 | -0.1762 | 0.0667 | 0.7333 | 0.2667 |
mgor | single-step-meticulous-gpt-4o | null | -0.1905 | -0.1905 | null | null | null |
QANTA 2025 Leaderboard Metrics Manual
This document explains the metrics displayed on the QANTA 2025 Human-AI Cooperative QA competition leaderboard.
Tossup Round Metrics
Tossup rounds measure an AI system's ability to answer questions as they are being read, in direct competition with recorded human buzz points:
Metric | Description |
---|---|
Submission | The username and model name of the submission (format: `username/model_name`) |
Expected Score ⬆️ | Average points scored per tossup question, on the scale +1 for a correct buzz, -0.5 for an incorrect buzz, and 0 for no buzz. Scores are computed by simulating real competition against recorded human buzz points: the model scores only if it buzzes correctly before the human, and is penalized if it buzzes incorrectly before the human (see the sketch after this table). |
Buzz Precision | Percentage of correct answers when the model decides to buzz in. Displayed as a percentage (e.g., 65.0%). |
Buzz Frequency | Percentage of questions where the model buzzes in. Displayed as a percentage (e.g., 65.0%). |
Buzz Position | Average token position in the question at which the model decides to answer. Lower values indicate earlier buzzing. |
Win Rate w/ Humans | Percentage of questions where the model, competing against recorded human players, answers correctly before the opponent's correct buzz. |
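To make the tossup scoring concrete, here is a minimal Python sketch of these metrics, assuming a per-question log of model and recorded human buzz points. All names (`runs`, `model_buzz`, `human_buzz`, `model_correct`) are illustrative, not the official QANTA evaluation code, and the tie-breaking rule when model and human buzz at the same token is an assumption:

```python
# Minimal sketch of the tossup metrics, under the assumptions stated above.

def tossup_metrics(runs):
    """runs: one dict per question with keys
         model_buzz:    token index where the model buzzed (None = no buzz)
         model_correct: whether the model's answer at that point was right
         human_buzz:    token index where the recorded human buzzed correctly
    """
    n = len(runs)
    buzzes = [r for r in runs if r["model_buzz"] is not None]

    frequency = len(buzzes) / n  # Buzz Frequency: share of questions with a buzz
    precision = sum(r["model_correct"] for r in buzzes) / max(len(buzzes), 1)
    position = sum(r["model_buzz"] for r in buzzes) / max(len(buzzes), 1)

    score = wins = 0
    for r in runs:
        b = r["model_buzz"]
        if b is None or b >= r["human_buzz"]:
            continue  # no buzz, or the human got there first: 0 points
        if r["model_correct"]:
            score += 1.0  # +1 for a correct buzz ahead of the human
            wins += 1
        else:
            score -= 0.5  # -0.5 for an incorrect buzz ahead of the human
    return {
        "expected_score": score / n,   # "Expected Score" column
        "buzz_precision": precision,
        "buzz_frequency": frequency,
        "buzz_position": position,
        "win_rate": wins / n,          # "Win Rate w/ Humans" column
    }
```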
Bonus Round Metrics
Bonus rounds test an AI system's ability to answer multi-part questions, with explanations good enough to support collaboration with another player. The leaderboard measures the model's effect on a simulated Quizbowl player (here, gpt-4o-mini):
Metric | Description |
---|---|
Submission | The username and model name of the submission (format: `username/model_name`) |
Effect | The overall effect of the model's responses on a target Quizbowl player's accuracy. Specifically, this is the difference between the net accuracy of a gpt-4o-mini + model team and that of the gpt-4o-mini player alone, as measured on the bonus set. In the team setting, the model provides a guess, a confidence, and an explanation, and the gpt-4o-mini player uses them to decide its final guess (see the sketch after this table). |
Question Acc | Percentage of bonus questions where all parts were answered correctly. |
Part Acc | Percentage of individual bonus question parts answered correctly across all questions. |
Calibration | The calibration of the model's confidence in its answers. Specifically, this is the average absolute difference between the confidence score (between 0 and 1) and the binary correctness score (1 for correct, 0 for incorrect), computed over the bonus set. |
Adoption | The percentage of times the gpt-4o-mini player adopts the model's guess, confidence, and explanation as its final guess, as opposed to using its own. |
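Similarly, a sketch of the bonus-round team metrics, assuming a per-part record of the solo gpt-4o-mini answer, the team answer, and the model's confidence; the field names are assumptions, not the official evaluation code:

```python
# Hedged sketch of Effect, Adoption, and Calibration, assuming boolean
# correctness flags and a model confidence in [0, 1] per bonus part.

def bonus_metrics(records):
    """records: one dict per bonus part with keys
         solo_correct:  gpt-4o-mini alone answered correctly
         team_correct:  gpt-4o-mini answered correctly after seeing the
                        model's guess, confidence, and explanation
         adopted:       gpt-4o-mini adopted the model's guess
         confidence:    the model's confidence in [0, 1]
         model_correct: the model's own guess was correct
    """
    n = len(records)
    solo_acc = sum(r["solo_correct"] for r in records) / n
    team_acc = sum(r["team_correct"] for r in records) / n
    effect = team_acc - solo_acc  # lift over the solo gpt-4o-mini player
    adoption = sum(r["adopted"] for r in records) / n
    # Calibration: mean |confidence - correctness|; 0 means perfectly calibrated.
    calibration = sum(abs(r["confidence"] - r["model_correct"]) for r in records) / n
    return effect, adoption, calibration
```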
Understanding the Competition
QANTA (Question Answering is Not a Trivial Activity) is a competition for building AI systems that can answer quiz bowl questions. Quiz bowl is a trivia competition format with:
- Tossup questions: Paragraph-length questions whose clues are read in sequence; players can buzz in at any point to answer. The leaderboard simulates real competition by using recorded human buzz points for scoring.
- Bonus questions: Multi-part questions that test depth of knowledge in related areas. The leaderboard measures the effect of models in a team setting with a simulated human (gpt-4o-mini).
The leaderboard tracks how well AI models perform on both question types across different evaluation datasets, using these updated, competition-realistic metrics.