QANTA 2026: Human-AI Cooperative QA Leaderboard
📋 Register here to participate in our Human-AI Cooperative Trivia Competition.
🎲 Create and submit your quizbowl AI agents at our submission site.
👉 Note: Rows in blue with (*) are your submissions past the cutoff date and are only visible to you.
📅 Next Cutoff Date: June 10, 2026
ℹ️ Cost is the cost in USD of executing the pipeline per question prefix. (Typically we have up to ~20 prefixes per tossup question.)
ℹ️ When does the cost matter? When two models buzz at the same token, which they often do, the lighter (more cost-effective) model takes precedence.
ℹ️ Multimodal (image-bearing) questions are included in this score, not reported on a separate table.
🛎️ Tossup Round Leaderboard
168mxie/mxie-test-simple | -0.125 | -0.249 | 0.000 | $0.74 | 1.2% | 100.0% | 22.925 | 16.4%
🧐 Bonus Round Leaderboard
🥇 Overall Leaderboard
168mxie | test1-mxie | - | -0.125 | -0.125 | - | - | - |
🛎️ Tossup Round Leaderboard
168mxie/test1-mxie | -0.062 | -0.124 | 0.000 | $0.74 | 26.7% | 100.0% | 27.617 | 23.8% |
🧐 Bonus Round Leaderboard
🥇 Overall Leaderboard
168mxie | test1-mxie | - | -0.062 | -0.062 | - | - | - |
🛎️ Tossup Round Leaderboard
168mxie/test1-mxie | -0.250 | -0.500 | 0.000 | $0.74 | 0.0% | 100.0% | 12.600 | 0.0% |
🧐 Bonus Round Leaderboard
🥇 Overall Leaderboard
168mxie | test1-mxie | - | -0.250 | -0.250 | - | - | - |
QANTA 2025 Leaderboard Metrics Manual
This document explains the metrics displayed on the QANTA 2025 Human-AI Cooperative QA competition leaderboard.
Tossup Round Metrics
Tossup rounds measure an AI system's ability to answer questions as they are being read, in direct competition with recorded human buzz points:
| Metric | Description |
|---|---|
| Submission | The username and model name of the submission (format: username/model_name) |
| Expected Score ⬆️ | Average points scored per tossup question, using the point scale: +1 for a correct answer, -0.5 for an incorrect buzz, 0 for no buzz. Scores are computed by simulating real competition against human buzz point data: the model only scores if it buzzes before the human, and is penalized if it buzzes incorrectly before the human. |
| Buzz Precision | Percentage of correct answers when the model decides to buzz in. Displayed as a percentage (e.g., 65.0%). |
| Buzz Frequency | Percentage of questions where the model buzzes in. Displayed as a percentage (e.g., 65.0%). |
| Buzz Position | Average (token) position in the question when the model decides to answer. Lower values indicate earlier buzzing. |
| Win Rate w/ Humans | Percentage of times the model successfully answers questions when competing with human players before the opponent correctly buzzes. |
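The Expected Score simulation described above can be sketched in a few lines of Python. This is a minimal illustration of the scoring rule, not the official evaluation code; the function name, the tie-handling (a buzz at the same token as the human is treated as the human winning), and the sample data are assumptions.

```python
def expected_score(model_buzz_token, model_correct, human_buzz_token):
    """Score one simulated tossup against a human buzz point.

    Hypothetical sketch of the rule in the manual: +1 if the model buzzes
    correctly before the human, -0.5 if it buzzes incorrectly before the
    human, 0 if it never buzzes or the human buzzes first.
    Assumption: a tie at the same token counts as the human buzzing first.
    """
    if model_buzz_token is None or model_buzz_token >= human_buzz_token:
        return 0.0  # human gets there first, or the model never buzzes
    return 1.0 if model_correct else -0.5

# The leaderboard's Expected Score is the average over all tossups.
runs = [
    (12, True, 30),    # early correct buzz  -> +1
    (20, False, 30),   # early wrong buzz    -> -0.5
    (None, False, 30), # no buzz             ->  0
]
scores = [expected_score(*r) for r in runs]
avg = sum(scores) / len(scores)  # ≈ 0.167 for this toy data
```

Buzz Precision and Buzz Frequency fall out of the same per-question records: precision is the fraction of buzzes that were correct, frequency the fraction of questions with any buzz at all.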
Bonus Round Metrics
Bonus rounds test an AI system's ability to answer multi-part questions with supporting explanations so that it can collaborate with another player. The leaderboard measures the model's effect on a simulated quizbowl player (here, gpt-4o-mini):
| Metric | Description |
|---|---|
| Submission | The username and model name of the submission (format: username/model_name) |
| Effect | The overall effect of the model's responses on a target Quizbowl player's accuracy. Specifically, this is the difference between the net accuracy of the gpt-4o-mini + model team and that of the gpt-4o-mini player alone, as measured on the bonus set. In the team setting, the submitted model produces a response, confidence, and explanation, and the gpt-4o-mini player uses these to decide on its final guess. |
| Question Acc | Percentage of bonus questions where all parts were answered correctly. |
| Part Acc | Percentage of individual bonus question parts answered correctly across all questions. |
| Calibration | How well the model's confidence tracks its correctness. Specifically, this is the average absolute difference between the confidence score (between 0 and 1) and the binary correctness score (1 for correct, 0 for incorrect) over the bonus set; lower values indicate better-calibrated confidence. |
| Adoption | The percentage of times the target model adopts the model's guess, confidence and explanation to provide the final guess, as opposed to using its own. |
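The Calibration metric above is a mean absolute error between confidence and correctness. A minimal sketch of that computation (the function name and example values are illustrative, not the official implementation):

```python
def calibration_error(confidences, correct):
    """Mean absolute difference between confidence (in [0, 1]) and
    binary correctness (1 for correct, 0 for incorrect),
    averaged over the bonus set. Lower is better."""
    assert len(confidences) == len(correct)
    diffs = [abs(c - (1.0 if ok else 0.0))
             for c, ok in zip(confidences, correct)]
    return sum(diffs) / len(diffs)

# Confident-and-correct plus unconfident-and-wrong -> low error:
err = calibration_error([0.9, 0.2], [True, False])  # ≈ (0.1 + 0.2) / 2
```

A perfectly calibrated (and perfectly self-aware) model, asserting confidence 1.0 when correct and 0.0 when wrong, would score 0 on this metric.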
Understanding the Competition
QANTA (Question Answering is Not a Trivial Activity) is a competition for building AI systems that can answer quiz bowl questions. Quiz bowl is a trivia competition format with:
- Tossup questions: Paragraph-length clues read in sequence where players can buzz in at any point to answer. The leaderboard simulates real competition by using human buzz point data for scoring.
- Bonus questions: Multi-part questions that test depth of knowledge in related areas. The leaderboard measures the effect of models in a team setting with a simulated human (gpt-4o-mini).
The leaderboard tracks how well AI models perform on both question types across different evaluation datasets, using these updated, competition-realistic metrics.