Ofir's Gelato Challenge

At NeurIPS 2024, I will buy gelato for the team that has the highest combined score on SWE-bench Lite, AssistantBench, CiteME, and SciCode. The submission deadline is end of day on December 3, 2024, anywhere on Earth (AoE).

Rules:

  1. You may use any proprietary or open-source model.
  2. All results must be pass@1. Agents are allowed to use the internet, but you must make sure that they can’t accidentally browse to a site that contains the actual datasets, which would reveal the right answers.
  3. For AssistantBench we will only consider the accuracy metric. For SciCode we’re only going to consider the Main Problem score.
  4. There must be some type of public posting about the system being submitted: a preprint, a paper, or a blog post.
  5. You’ll need to somehow show me that you’re not making up your numbers, either by submitting your agent trajectories or by letting me interact with your system.
  6. I’m going to get Princeton to reimburse me for the gelatos, so the limit on the number of gelatos I buy is whatever the max reimbursement at Princeton is for a gelato party. If your team has 1000 people, we’ll have to talk to the Trustees of Princeton and see what they can do.
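
For rule 2, one possible way to keep an agent away from pages that host the datasets is to check every URL against a domain blocklist before the agent fetches it. The following is only a minimal sketch: the `is_blocked` helper and the domains in `BLOCKED_DOMAINS` are hypothetical placeholders, not part of the challenge, and you would need to fill in the hosts that actually serve each benchmark's data.

```python
from urllib.parse import urlparse

# Hypothetical placeholder domains: replace these with the hosts that
# actually serve the benchmark datasets and their answers.
BLOCKED_DOMAINS = {"benchmark-data.example.com", "example.org"}


def is_blocked(url: str, blocked_domains=BLOCKED_DOMAINS) -> bool:
    """Return True if the agent should not be allowed to fetch `url`."""
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    if not host:
        # Fail closed: if no hostname can be parsed, refuse the request.
        return True
    # Block the listed domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in blocked_domains)
```

A browsing tool could call `is_blocked(url)` before every request and skip the page when it returns `True`. A blocklist like this only covers hosts you already know about, so mirrors and cached copies of the data would still need separate handling.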

To make a submission, email me with your scores.

Written on August 12, 2024