Ofir Press

I am a postdoc at Princeton University’s PLI. My aim is to develop language model systems that can autonomously complete tasks end-to-end, including software engineering, optimizing code runtime and finding cybersecurity vulnerabilities. My main focus area is coding, since I believe that it’s currently the most important and challenging area in AI, but I occasionally do research in other areas, including complex question answering.
I’m working towards this goal by doing the following 3 things:
1) Building frontier-pushing benchmarks for LM systems (i.e. SWE-bench (software engineering), SciCode (PhD-level physics coding), AlgoTune (software runtime optimization))
2) Developing LM-based agents to help frontier LMs do better on these benchmarks (i.e. SWE-agent)
3) Developing synthetic data pipelines and finetuning methods to help small LMs do better on these benchmarks (i.e. SWE-smith)

Our SWE-bench, which tests AI systems ability to solve real software issues from popular GitHub repositories, has been downloaded more than 2 million times, and there are teams at OpenAI, Meta, Google, Anthropic and many other academic & industry groups that develop systems for SWE-bench. Our SWE-agent was the first open-source agent to beat the 10% accuracy threshold on SWE-bench, and is used by mutliple academic and industry teams, including at Google, Anthropic, Meta, The Allen Institute for AI, ETH Zurich and others.

Check out my YouTube channel for videos that explain my research and language modeling in general.

Mentees:

During my PhD and Postdoc I’ve collaborated with over 30 amazing undergrad, masters, and PhD students. This list contains collaborators that have worked with me on more than a single project.

Carlos Jimenez (2023- , Princeton PhD)
John Yang (2023- , Princeton MSc/RA -> Stanford PhD)
Kilian Lieret (2023- , Princeton RE)
Ori Press (2024-, University of Tübingen PhD)
Minyang Tian (2024-, UIUC PhD)
Alex L. Zhang (2024-, Princeton Undergrad)
Talor Abramovich (2024-2024, Tel-Aviv University MSc)
Muru Zhang (2022-2023, University of Washington MSc)

People I’ve Worked With

During my postdoc I work with Karthik Narasimhan’s group at Princeton University.

I completed my PhD at the Paul G. Allen School for Computer Science & Engineering at the University of Washington, where I was very fortunate to be advised by Noah Smith. For the last year of my PhD I was also a visiting scholar at Kyunghyun Cho’s group at NYU.

During my PhD I spent two years as a visiting researcher at Facebook AI Research Labs on Luke Zettlemoyer’s team where I mainly worked with Mike Lewis. Prior to that, in the summer of 2019 I interned at Facebook AI Research with Omer Levy. Towards the end of my PhD I spent half a year as a visiting researcher at MosaicML on Jonathan Frankle’s team.

Before starting my PhD I completed my Bachelor’s and Master’s degrees in Computer Science at Tel Aviv University (where I was advised by Lior Wolf and also worked with Jonathan Berant). Between my Bachelor’s and Master’s degrees I was a software developer for a year.

My Background In Language Modeling

I’ve been writing papers on neural language modeling since 2016. In the first six years of my career I focused on improving LM architectures without increasing their size or runtime.

The weight tying method I developed is used today by many popular language and translation models, including OpenAI’s GPT, Google’s BERT, Apple’s on-device LM, and the translation models of Google, Microsoft, Meta and Amazon.

Our ALiBi method showed for the first time how to efficiently enable LMs to handle longer sequences at inference than the ones they were trained on. It has been adopted by BigScience’s 176 billion parameter BLOOM model, by the MPT series of models from MosaicML, by Replit’s models and many others. It is also used by multiple vision and speech models, including Meta’s Audiobox model.

In the final paper of my PhD we showed how to improve the ability of LMs to answer complex questions by simply using a better prompt. Our self-ask prompting method has the language model ask and answer sub-questions about the input question before generating the final answer. The structure of the self-ask prompt allows us to easily plug in Google Search to answer the sub-questions, which further improves performance. This system was an early precursor to the LM-based web-browsing agents that are popular today.

Misc.

My brother Ori Press is a machine learning researcher.

Contact Me

ofirp@princeton.edu
@ofirpress on Twitter

Selected Works (Google Scholar, Semantic Scholar)

AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Pérez-Piñeiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, Ofir Press
Under Review

VideoGameBench: Can Vision-Language Models complete popular video games?
Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press
Under Review

SWE-smith: Scaling Data for Software Engineering Agents
John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang
Under Review

Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities
Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, Ofir Press
ICML 2025

SWE-bench Multimodal
John Yang*, Carlos E. Jimenez*, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press
ICLR 2025

SciCode
Minyang Tian*, Luyu Gao*, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
NeurIPS 2024

AssistantBench
Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
EMNLP 2024

CiteME
Ori Press*, Andreas Hochlehnert*, Ameya Prabhu, Vishaal Udandarao, Ofir Press, Matthias Bethge
NeurIPS 2024

SWE-agent
John Yang*, Carlos E. Jimenez*, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press
NeurIPS 2024
[website]

SWE-bench
Carlos E. Jimenez*, John Yang*, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
ICLR 2024 (Oral)
[website]

How Language Model Hallucinations Can Snowball
Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith
ICML 2024
[paper]

Measuring and Narrowing the Compositionality Gap in Language Models
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis
Findings of EMNLP 2023
[paper] [code] [datasets (Compositional Celebrities, Bamboogle)] [bib]
[Self-ask & Self-ask + Google Search demo video, 2 min]
[The Compositionality Gap Explained (video), 2 min]
[Introducing the Bamboogle Dataset (video), 2 min]
[In-depth overview of Self-ask and the Compositionality Gap (video), 47 min]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, Mike Lewis
ICLR 2022
ALiBi is the position embedding method of BigScience’s BLOOM model, MosaicML’s LMs, Replit’s LMs, and many others.
[paper] [code] [FAQ] [bib]
[Yannic Kilcher’s video] [My video (in-depth overview, 47 min)] [ICLR video (summarizes the important bits, 5 min)]

Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press, Noah A. Smith, Mike Lewis
ACL 2021
[paper] [code] [bib]
[ACL video (summarizes the important bits, 12 min)] [video (detailed overview, 1 hour)]

Improving Transformer Models by Reordering their Sublayers
Ofir Press, Noah A. Smith, Omer Levy
ACL 2020
[paper] [summary] [code] [bib]
[ACL video (summarizes the important bits, 12 min)] [video (detailed overview, 35 min)]

Language Generation with Recurrent Generative Adversarial Networks without Pre-training
Ofir Press*, Amir Bar*, Ben Bogin*, Jonathan Berant, Lior Wolf
1st Workshop on Learning to Generate Natural Language at ICML 2017
[paper] [summary] [code] [bib]

Using the Output Embedding to Improve Language Models
Ofir Press, Lior Wolf
EACL 2017
Introduced the weight tying method which is now used in GPT, BERT and many other state of the art language & translation models.
[paper] [summary] [blog post] [code] [bib]

Technical Reports

Partially Shuffling the Training Data to Improve Language Models
Ofir Press
Preprint, 2019
[preprint] [code] [bib]

You May Not Need Attention
Ofir Press, Noah A. Smith
Preprint, 2018
[preprint] [summary] [code] [bib]

Reviewing:

NAACL: 2021, 2019 (secondary reviewer)
EMNLP: 2022, 2021, 2019 (secondary reviewer)
ACL: 2021, 2020 (secondary reviewer)
EACL: 2021
NeurIPS: 2025, 2024, 2022, 2021 (emergency reviewer)
ICLR: 2022
ICML: 2025, 2024
COLM: 2024
Journals: Harvard Data Science Review (2024)
Workshops: SustaiNLP 2020, NeuralGen 2019