Ofir's Gelato Challenge
At NeurIPS 2024, I will buy gelato for the team that has the highest combined score on SWE-bench Lite, AssistantBench, CiteME, and SciCode. Final submissions are due by end of day, anywhere on Earth (AoE), on December 3, 2024.
- You may use any proprietary or open source model.
- All results must be pass@1. Agents are allowed to use the internet but you must make sure that they can’t accidentally browse to a site that contains the actual datasets, thereby revealing the right answers.
- For AssistantBench we will only consider the accuracy metric. For SciCode we’re only going to consider the Main Problem score.
- There must be some type of public posting about the system being submitted: a preprint, a paper, or a blog post.
- You’ll need to somehow show me that you’re not making up your numbers, either by submitting your agent trajectories or by letting me interact with your system.
- I’m going to get Princeton to reimburse me for the gelatos, so the limit on the amount of gelatos I buy is whatever the max reimbursement at Princeton is for a gelato party. If your team has 1000 people we’ll have to talk to the Trustees of Princeton and see what they can do.
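In case the pass@1 rule is unclear: each task gets exactly one attempt, and the score is the fraction of tasks solved on that attempt. Here is a minimal sketch of that calculation, not an official scoring script. The function name, the input format, and the example outcomes are all made up for illustration.

```python
def pass_at_1(first_attempt_results):
    """Fraction of tasks solved when each task gets exactly one attempt.

    `first_attempt_results` is a hypothetical list of booleans, one per task,
    where True means the single attempt on that task succeeded.
    """
    if not first_attempt_results:
        raise ValueError("need at least one task result")
    solved = sum(1 for ok in first_attempt_results if ok)
    return solved / len(first_attempt_results)


# Made-up outcomes for five tasks, one attempt each (not real benchmark data).
example_results = [True, False, True, True, False]
print(f"pass@1 = {pass_at_1(example_results):.2f}")
```

Retrying a failed task and keeping the best attempt would not be pass@1.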
To make a submission, email me with your scores.

Written on August 12, 2024