I am a postdoc at Princeton’s PLI. I build tough benchmarks for LMs (e.g., SWE-bench, CiteMe, SciCode, AssistantBench) and then get LMs to solve them (e.g., SWE-agent).

Check out my YouTube channel for videos that explain my research and language modeling in general.

I completed my PhD at the Paul G. Allen School of Computer Science & Engineering at the University of Washington, where I was very fortunate to be advised by Noah Smith.

During my PhD I spent two years as a visiting researcher at Facebook AI Research Labs on Luke Zettlemoyer’s team, where I mainly worked with Mike Lewis. Prior to that, in the summer of 2019, I interned at Facebook AI Research with Omer Levy. Towards the end of my PhD I spent half a year as a visiting researcher at MosaicML on Jonathan Frankle’s team.

I’ve been writing papers on neural language modeling since 2016. My focus is on making language models more useful to humans. In the first six years of my career I accomplished this by improving LM architectures without increasing their size or runtime. I then moved to working on better prompting methods for improving LMs. I currently pursue this goal by constructing benchmarks that show us where there’s room for improvement in language modeling, and by building systems that use language models to try to solve those tough benchmarks.

The weight tying method I developed is used today by almost all big language and translation models, including OpenAI’s GPT, Google’s BERT, Apple’s on-device LM, and the translation models of Google, Microsoft, Meta and Amazon.

Our ALiBi method showed for the first time how to efficiently enable LMs to handle longer sequences at inference than the ones they were trained on. It has been adopted by BigScience’s 176-billion-parameter BLOOM model, by the MPT series of models from MosaicML, by Replit’s models, and by many others.
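
For readers who want a concrete picture, here is a minimal sketch of the bias ALiBi adds to causal attention scores, written in plain Python for illustration rather than taken from the released code. It assumes the number of heads is a power of two, and the function names `alibi_slopes` and `alibi_bias` are hypothetical. Each head penalizes a key in proportion to its distance from the query, and because the penalty depends only on that distance, the same function can build a bias for any sequence length.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two (illustrative simplification).
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias matrix added to the query-key scores of one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query
    position i, and -inf for future keys (the usual causal mask).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    # The bias can be built for any length, including lengths
    # longer than those seen during training.
    for row in alibi_bias(4, slopes[0]):
        print(row)
```

The bias is added to the scaled dot-product scores before the softmax; no position embeddings are added to the token embeddings.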

In the final paper of my PhD we showed how to improve the ability of LMs to answer complex questions by simply using a better prompt. Our self-ask prompt has the language model ask and answer sub-questions about the input question before generating the final answer. The structure of the self-ask prompt allows us to easily plug in Google Search to answer the sub-questions, which further improves performance.
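
The self-ask loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: `lm` and `search` are hypothetical callables (any function from a prompt string to the next line of text, and from a sub-question to an answer string), the marker strings approximate the prompt format, and the toy stand-ins below exist only so the example runs without a model or a search API.

```python
from typing import Callable, Optional

FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"


def self_ask(
    question: str,
    lm: Callable[[str], str],
    search: Optional[Callable[[str], str]] = None,
    max_steps: int = 8,
) -> Optional[str]:
    """Run a self-ask style loop and return the final answer, if any.

    `lm` is assumed to return one new line given the transcript so far.
    If `search` is given, it answers each sub-question instead of the LM.
    """
    lines = [f"Question: {question}", "Are follow up questions needed here: Yes."]
    for _ in range(max_steps):
        line = lm("\n".join(lines)).strip()
        if line.startswith(FINAL):
            return line[len(FINAL):].strip()
        lines.append(line)
        if search is not None and line.startswith(FOLLOW_UP):
            sub_question = line[len(FOLLOW_UP):].strip()
            lines.append(f"{INTERMEDIATE} {search(sub_question)}")
    return None  # no final answer within max_steps


def toy_lm(prompt: str) -> str:
    """Scripted stand-in for a language model (hypothetical)."""
    answered = prompt.count(INTERMEDIATE)
    if answered == 0:
        return "Follow up: When was superconductivity discovered?"
    if answered == 1:
        return "Follow up: Who was U.S. president in 1911?"
    return "So the final answer is: William Howard Taft"


def toy_search(sub_question: str) -> str:
    """Lookup-table stand-in for a search engine (hypothetical)."""
    answers = {
        "When was superconductivity discovered?": "1911",
        "Who was U.S. president in 1911?": "William Howard Taft",
    }
    return answers[sub_question]


if __name__ == "__main__":
    q = "Who was president of the U.S. when superconductivity was discovered?"
    print(self_ask(q, toy_lm, toy_search))
```

Because every sub-question sits on its own clearly marked line, swapping the LM's intermediate answers for search results only requires intercepting the `Follow up:` lines, which is what the optional `search` argument does here.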

Before starting my PhD I completed my Bachelor’s and Master’s degrees in Computer Science at Tel Aviv University (where I was advised by Lior Wolf and also worked with Jonathan Berant). Between my Bachelor’s and Master’s degrees I was a software developer for a year.

My brother Ori Press is a machine learning researcher.

Contact me

ofirp@princeton.edu
@ofirpress on Twitter

Mentees:

Carlos Jimenez (2023- , Princeton PhD)
John Yang (2023- , Princeton MSc)
Muru Zhang (2022-2023, UWashington MSc)

Selected Works (Google Scholar, Semantic Scholar)

New (July 2024): Check out our three new benchmarks: CiteMe, SciCode, and AssistantBench

SWE-agent
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press
[website]

SWE-bench
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan
ICLR 2024 (Oral)
[website]

How Language Model Hallucinations Can Snowball
Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith
ICML 2024
[paper]

Measuring and Narrowing the Compositionality Gap in Language Models
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis
Findings of EMNLP 2023
[paper] [code] [datasets (Compositional Celebrities, Bamboogle)] [bib]
[Self-ask & Self-ask + Google Search demo video, 2 min]
[The Compositionality Gap Explained (video), 2 min]
[Introducing the Bamboogle Dataset (video), 2 min]
[In-depth overview of Self-ask and the Compositionality Gap (video), 47 min]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, Mike Lewis
ICLR 2022
ALiBi is the position embedding method of BigScience’s BLOOM model, MosaicML’s LMs, Replit’s LMs, and many others.
[paper] [code] [FAQ] [bib]
[Yannic Kilcher’s video] [My video (in-depth overview, 47 min)] [ICLR video (summarizes the important bits, 5 min)]

Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press, Noah A. Smith, Mike Lewis
ACL 2021
[paper] [code] [bib]
[ACL video (summarizes the important bits, 12 min)] [video (detailed overview, 1 hour)]

Improving Transformer Models by Reordering their Sublayers
Ofir Press, Noah A. Smith, Omer Levy
ACL 2020
[paper] [summary] [code] [bib]
[ACL video (summarizes the important bits, 12 min)] [video (detailed overview, 35 min)]

Language Generation with Recurrent Generative Adversarial Networks without Pre-training
Ofir Press*, Amir Bar*, Ben Bogin*, Jonathan Berant, Lior Wolf
1st Workshop on Learning to Generate Natural Language at ICML 2017
[paper] [summary] [code] [bib]

Using the Output Embedding to Improve Language Models
Ofir Press, Lior Wolf
EACL 2017
Introduced the weight tying method, which is now used in GPT, BERT, and many other state-of-the-art language & translation models.
[paper] [summary] [blog post] [code] [bib]

Technical Reports

Partially Shuffling the Training Data to Improve Language Models
Ofir Press
Preprint, 2019
[preprint] [code] [bib]

You May Not Need Attention
Ofir Press, Noah A. Smith
Preprint, 2018
[preprint] [summary] [code] [bib]

Reviewing:

NAACL: 2021, 2019 (secondary reviewer)
EMNLP: 2022, 2021, 2019 (secondary reviewer)
ACL: 2021, 2020 (secondary reviewer)
EACL: 2021
NeurIPS: 2024, 2022, 2021 (emergency reviewer)
ICLR: 2022
ICML: 2024
COLM: 2024
Journals: Harvard Data Science Review (2024)
Workshops: SustaiNLP 2020, NeuralGen 2019