Ofir Press

5 Tips for Finding Research Topics

2024-02-13T00:00:00+00:00

I’ve spent the past eight years doing research on neural language models. While my top-level goal of making language models more useful to humans has remained stable, my path to achieving it has changed over time. I’ve learned that in research, you not only have to learn how to execute on a given idea, but you also have to learn how to pick which direction to focus your efforts on.

This post contains five patterns that I’ve noticed that both my peers and I use to think of new ideas for research projects. It’s important to note that this advice might only apply to the research sub-field that I’m in and might only apply to people who want to perform research in a manner that is similar to the one I use. There are many ways to succeed, some of which are orthogonal to the guidelines that I use.

1. Focus first on finding a problem, not a solution.

A common mistake I see people make is starting a project by thinking about a model they want to build, without considering whether it’s even necessary. They’ll get really excited about a certain method and try to just build something based on that. “Let’s build a retrieval-augmented language model!” “I want to build an LM agent that uses reinforcement learning!” But I don’t really think that research can be done with such a vague goal. In order to ground your research, I believe that it’s incredibly useful to find a specific issue that the community would care about, and then work from there. So for example “I’ve noticed that GPT has a really hard time answering questions that have a spatial component” or “I’ve noticed that GPT has a really hard time programming solutions to programming puzzles about graphs” are good starting points. First define an issue, then try to figure out if it would interest the community, then build a dataset of example problems and then try to build a model that solves them. While building a solution, if you figure out that you do need RL or retrieval, then integrate that into your solution. But you’ll always be grounded in the results you’ll get from running your various baselines and new models on the dataset that you built. That’ll tell you whether those components are actually improving your model or not. I used to think that the majority of the work for researchers in deep learning was in building solutions, but I’m now pretty convinced that the majority of the work is in finding good problems!

2. Once you find a good problem to work on, it’s better to iterate and experiment through many potential solutions that are maybe good rather than working through one or two solutions that are “definitely” correct.

The king of deep learning is empirical results. Once you’ve decided which problem to focus on, the only thing that you can do if you want to test a certain solution is to run it and see what happens. Therefore, for me it has been very beneficial to iterate through the idea-brainstorming-to-implementation-to-results cycle many times per project. I sometimes see people get stuck in endless pontification and over-analysis of ideas without ever opening an IDE. They’ll think of an idea and then analyze it on a whiteboard for weeks before they even consider sitting down to program it. In my view, that hypothetical analysis is not very useful. Since there’s not really a theory of deep learning yet, analyzing potential solutions on a whiteboard for more than a few hours has little utility. As ML practitioners, our source of truth is the results we get after running an experiment. Yes, you shouldn’t just program every single idea that comes into your head, but also, once you think of something and spend an hour or two thinking it through, just implement it and see how it does.

Only by running many empirical experiments will you build an intuition for what works and what doesn’t, and that intuition will help you define the path forward. Quick iteration through a trajectory of research ideas leads to progress in deep learning. I’ve also observed that one of the most important factors is the number of iterations on research ideas and not the magnitude of each idea. So I generally recommend thinking of and working on solutions that are as simple as possible, so that you can go through the idea-brainstorming-to-implementation-to-results cycle as many times as possible per project. Understanding which ideas to implement and which ones not to, how to prioritize different possible directions, and how to reject ideas before even implementing them are skills that you will learn as you spend more time doing research.

In the second year of my PhD I had an idea for a new type of retrieval-augmented LM. This model had two components, a retriever and an LM. The LM was pretty much off-the-shelf, and the retriever was the component that I wanted to innovate on. The full model was quite complex and so the first prototype took me about three months to build. I then ran it and it didn’t improve performance. So I then thought of a second version, which was also quite complex and ended up taking me two months to program. That model didn’t work either.

I then realized that I should first verify if this idea could even work by using an oracle retriever: basically cheating at the retrieval stage to make your retriever as good as it could be. This oracle-based model took a few days to program. When I ran it, it also didn’t work, which made my confidence in the overall idea tank. If the model couldn’t work with an impossibly-powerful (enabled by cheating) retriever, I didn’t think it could work with a less powerful, but possible, retriever. In hindsight, when I had the original idea, I wish I had noticed that the initial prototype would take three months to program. Instead of starting from it, it would’ve been much smarter to start from the much smaller oracle prototype and build my way up to the three-month prototype step by step, and only if the earlier prototypes showed promise. That would’ve saved me a lot of time.

3. Write a paper that many people would want to read:

Your paper should either teach us something new about an existing system, method or benchmark, or achieve better performance on an existing benchmark, or present a new benchmark. Good papers sometimes do two of those things.

What does it mean to achieve better performance?

If I have a model that can get 70% accuracy on SWE-bench, and you develop a new one that achieves 72% accuracy but is three times slower, then that’s probably not an interesting new system. Improving performance doesn’t just mean improving accuracy and ignoring all other factors. When we look at a system, we have to consider not only accuracy but also training time, inference time, memory usage, latency and disk space usage. For a new system to be better than an old one, it should improve on at least one of these metrics while keeping the other ones at the same level or better. If the improvement is substantial, it’s fine to disregard this rule and, for example, present a system that achieves 99% accuracy on SWE-bench while being 30% slower than the current state-of-the-art. But in that case, make sure to compare your method to the baseline when the baseline is given 30% more time. You should always be comparing your methods to the strongest baselines you can find, and you should strive to make that comparison as fair as possible.
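The rule of thumb above (improve at least one metric, keep the rest the same or better) can be written down as a small check. This is only a sketch: the `SystemReport` class, its field names, and the numbers other than the 70%/72% accuracies are made up for illustration, and a real comparison would use whichever metrics matter for your setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemReport:
    """Hypothetical summary of one system; the field names are illustrative."""
    accuracy: float           # higher is better
    train_hours: float        # lower is better
    inference_seconds: float  # lower is better
    memory_gb: float          # lower is better


def is_clear_improvement(new: SystemReport, old: SystemReport) -> bool:
    """True if `new` is at least as good as `old` on every metric
    and strictly better on at least one of them."""
    higher_is_better = [(new.accuracy, old.accuracy)]
    lower_is_better = [
        (new.train_hours, old.train_hours),
        (new.inference_seconds, old.inference_seconds),
        (new.memory_gb, old.memory_gb),
    ]
    no_worse = (all(n >= o for n, o in higher_is_better)
                and all(n <= o for n, o in lower_is_better))
    strictly_better = (any(n > o for n, o in higher_is_better)
                       or any(n < o for n, o in lower_is_better))
    return no_worse and strictly_better


baseline = SystemReport(accuracy=0.70, train_hours=100, inference_seconds=1.0, memory_gb=16)
slower = SystemReport(accuracy=0.72, train_hours=100, inference_seconds=3.0, memory_gb=16)
same_speed = SystemReport(accuracy=0.72, train_hours=100, inference_seconds=1.0, memory_gb=16)

print(is_clear_improvement(slower, baseline))      # False: more accurate, but three times slower
print(is_clear_improvement(same_speed, baseline))  # True: more accurate at no extra cost
```

A system that fails this check can still be worth publishing if the gain is large, but then the fair comparison is against a baseline given the same extra budget.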

Why is community excitement important?

You should definitely work on research that excites you, but you should also try to find a topic that would excite a large number of the other researchers in the field. Papers only reach a small percentage of their target audience. If you’re writing a paper about a niche topic that only has 50 researchers working on it, maybe 15 of them would actually hear about your work, 5 of them would read it and likely none of them would perform follow-up work. I therefore find it incredibly useful and rewarding to work on topics that have a wide interest in the community.

4. Write a paper that would be an interesting part of the research discourse that will be happening in nine months.

In my experience, for my style of work, research projects take nine months on average. That means that when you start thinking of ideas you shouldn’t think of ideas that would be interesting if they were published today or tomorrow, because it’s going to take you much longer than that to write a paper. You should definitely make sure that you’re not writing a paper that would have been relevant one or two years ago. One simple rule of thumb is to make sure that at least some of the related works that you’re citing are from the past year or two. If your latest citation is from five years ago, there’s a big chance that you may be writing a paper that would be irrelevant to the current research discourse.

The most important skill to learn here is observing the trajectories that occur within your research sub-community and being able to predict where they will go. As you spend more time doing research you’ll get better at predicting where your research field is heading and how to do work that will fit into that puzzle as well as possible. It sounds impossible at first but I believe that, at least in deep learning, it’s possible to predict with high accuracy where the field will be in nine months. It’s totally impossible to predict where the field will be in 18 months or longer, and that’s why I recommend not working on projects that would take that long.

5. Keep it simple.

And what do I mean by ‘it’? Everything. Try to work on problems that are easy to describe. Try to find simple solutions for those problems. Try to describe your solution in your paper in as simple a way as possible. Try to write the code for your method in a simple way, and make it easy for others to run and extend your code.

Think of the counterexample here. If you worked on a problem that took two pages to describe in a paper, would any reader stick around for that whole description, let alone would anyone stick around to read about your solution? I also feel like most of the most important problems in our field can be stated in a sentence, so if it takes you six paragraphs to describe the problem you’re working on, that might be a hint that you’re working on a problem that is too niche or contrived.

It makes me happy when lots of people get excited by the work I release. Simplicity is one of the main driving factors in finding ideas that lots of people might get excited by. If you work on a super complex topic, there’s a high chance that very few people would even understand the problem you’re trying to solve, let alone your solution.

As for keeping your solution simple: the ML community has proven time and time again that the methods with the most long-lasting impact are the simplest ones.

As with all rules, in some cases it does make sense to violate this rule. Tim Dettmers’ bitsandbytes is a super popular efficiency library. It’s made up of a lot of very non-simple CUDA code. But everything else about this library is extremely simple: it’s really easy to use, and the motivation behind it is very clear (“bitsandbytes helps you run big models on small GPUs”).

As a researcher that has to publish, it may be tempting to find complex solutions to complex problems, since reviewers are frequently impressed when they see many equations and proofs in a paper. But in my experience, while those types of papers may initially get accepted, over time, complex solutions do not have much lasting impact. Complex papers are harder to read, and their code is usually harder to extend; these properties substantially harm impact.

Closing note:

The strength of the research community is in its large size and diversity. There are many ways to do good research, some of which align with the tips above and some of which don’t. I hope that by sharing these lessons that I’ve learned over the years I’ve helped you improve your ability to do the best research you can.

If you enjoyed this post, you might also enjoy my post on tips for junior researchers, which focuses on how to do research and how to work with a mentor.

Thank you to Nelson Liu, Will Merrill, Samuel Ainsworth, Shunyu Yao and Naomi Saphra for comments on previous drafts of this post.

PhD Thesis Acknowledgments and Dedication

2023-07-06T00:00:00+00:00

The only important parts of a PhD thesis are the acknowledgments and dedication sections 😉 so I’ve uploaded the ones from my thesis here.

Acknowledgments

When Noah A. Smith accepted me into the Ph.D. program, he invited me into an environment that eventually produced a new and improved me: a version of me that is more open to new ideas, better at executing, smarter, and more patient. I will be forever grateful to Noah for believing in me and fully trusting me from day one and for always treating me with respect, patience, and love. Noah not only taught me how to do science, he also significantly improved my storytelling abilities, my ability to hold onto a reader’s attention, and my ability to frame my stories in an appealing way.

Although formally Mike Lewis wasn’t on my committee, he was on all of the papers in my thesis, and I consider him to have been my de facto co-advisor. Our ability to bounce ideas off of each other while improving them at each iteration is incredible.
Mike’s advising complemented my thinking style in a way that immensely improved the level of the work I did during the Ph.D. Working with Mike has shown me how much easier it is to tackle complex problems when you approach them with the right partner.

The end of my first academic year was rough for me, both academically and personally. It was then that Omer Levy invited me to be his intern at Facebook AI Research (FAIR). This totally changed the course of my Ph.D., and I am very grateful to Omer for that opportunity. Omer taught me many important skills for empirical machine learning research that I still carry to this day.

After spending half a year at FAIR as Omer’s intern, Luke Zettlemoyer invited me to stay at FAIR, as a visiting researcher, for two additional years. That allowed me to work with Mike and do resource-intensive research that would not have been possible without an industry affiliation. Throughout this time, Luke gave me total freedom to explore and do research on whatever I wanted to and always provided support. Being at FAIR for two and a half years made my research much stronger than it would have been if I hadn’t been there.

I’m grateful to Jonathan Frankle for inviting me to join MosaicML as a visiting researcher for six months and for always supporting my research.

I’m grateful to Kyunghyun Cho for hosting me at the wonderful Center for Data Science at NYU where I spent the last academic year of my PhD.

I’m grateful for my first mentee, Muru Zhang, for showing me how rewarding it can be to advise.

I’m grateful for the help of Elise deGoede Dorough and Sandy Kaplan from the Allen School at the University of Washington.

I’m grateful for all the other collaborators I’ve had during my Ph.D., who I learned so much from: Adi Haviv, Ori Ram, Peter Izsak, Sewon Min, Ludwig Schmidt, Will Merrill, Alisa Liu, and members of the BigScience project.

I’m grateful for my friend Samuel K. Ainsworth. I met Sam at visit days before starting graduate school where we bonded over our love for free food. Five years later we’re still eating free food together, hopefully for many more years.

I’m grateful for my friend Ivan Evtimov. We met during the beginning of the first year of school since Ivan would always show up to the board-game nights I organized, and we have been friends ever since.

I’m grateful for my friendship with Tim Dettmers and Gabriel Ilharco and for their smart comments and strong support for my work.

I’m grateful to my academic twin brother Jungo Kasai for five years of conversations, comments on papers, and reminders about administrative tasks that I had to complete.

I’m grateful to Ian Covert, Edward Misback, and Nathan Hatch for our countless days together playing tennis.

I’m grateful for my friends back home Ofek Doitch, Dean Stephansky, Andrey Shulika, Nimrod Fiat, Gili Yablonka, Nir Aviv, Tzvika Geft, Shai Kazaz, Gil Levi, Gregory Axler, and Lior Uzan.

I’m grateful to Ben K., Sammy, Danny, Kyra, Dasha, Ben A., Jamie, Paul, Jason, Lana, Carolyn, Shirley, Ivan, Pearly, Adi, Orville, Joe, Amitai, and everyone else who taught me how to stand on the shoulders of giants.

I’m grateful for James, Trixie, Andrea, Raven, Grini, Ian, Za, and everyone else who went to Brazil with me.

I’m grateful for my therapist who has taught me how to better understand myself and others.

I’m grateful for all the love and support I’ve received from Saba, Safta, Ima, Ori, Avia, Yossi, Liat, Idan, Ronnie, and Bara, without which I would not have been able to complete my Ph.D.

Dedication

Dedicated to my grandfather, Haim Yehiel (born in 1926 in Thessaloniki, Greece, died in 2022 in Ramat Hasharon, Israel), for inspiring me to be a nerd.

This is him at the Technion in Haifa in 1945 during an undergraduate class on welding.

The Use Case for Relative Position Embeddings

2022-11-08T00:00:00+00:00

We’re in 2022 but many of our most popular causal language models (LMs), including GPT-3, still use absolute positional embeddings. I believe we should stop using those and move to relative positional embeddings such as ALiBi. DeepMind’s Gopher and BigScience’s BLOOM already use relative positioning methods, and I’ve heard that multiple upcoming models also will, so hopefully this post will help encourage the remaining holdouts to follow suit.

Imagine you’re building the next version of a causal code prediction model like Codex. When we train an LM like this, due to GPU memory limitations, we must pick a finite sequence length, say 4,000 tokens, to train the model on. If at inference time, users only want to make predictions in code files shorter than 4,000 tokens, we’re good. But if a user wants to make a prediction for token 4,001, it would be impossible with absolute position embeddings. If you use learned position embeddings, feeding 4,001 tokens to your LM will simply throw a runtime exception (there is no 4,001 position representation). If you use sinusoidal position embeddings, the model will run given 4,001 inputs, but as we show in the ALiBi paper, it will produce really bad predictions (for any token beyond the first 4,000).

Relative positional methods like ALiBi solve this. The T5 bias is another good option, although personally I prefer ALiBi because our paper shows it obtains better results and also it’s faster and doesn’t require any trainable parameters.
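To make the difference concrete, here is a minimal pure-Python sketch of how ALiBi biases attention scores, based on my reading of the ALiBi paper: each head gets a fixed slope, and the score between a query at position i and a key at position j is penalized by slope * (i - j). The function names are my own, and the slope formula shown assumes the number of heads is a power of two.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).
    Assumes num_heads is a power of two."""
    start = 2 ** (-8 / num_heads)
    return [start ** (head + 1) for head in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Bias added to the attention scores of one head in a causal LM.
    Row i is the query position, column j is the key position.
    Future keys (j > i) are masked with -inf."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])  # 0.5 0.00390625

# The bias depends only on the distance i - j, so nothing is tied to a
# maximum trained position: the same function works for any seq_len.
short = alibi_bias(slopes[0], 3)
longer = alibi_bias(slopes[0], 5)
print(short[2][1] == longer[4][3])  # True: both are one token apart
```

Because there is no position table to index into, there is nothing that can run out at token 4,001, and there are no position parameters to train.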

The rotary method has shown some strong results when evaluating sequences that are shorter than or equal to train length, but in our paper we show that it is not able to extrapolate to longer sequences. In addition, it is slower than ALiBi and the absolute positional methods. Lastly, while some people consider it a relative position embedding method, in my opinion, that’s incorrect. Rotary simply element-wise multiplies position representations by the word representations (instead of adding position reps to word reps, as is done in the absolute methods). This means that Rotary still employs position embeddings, which in my view makes it an absolute position method, not a relative one. This thread has more details on why I believe absolute position methods are not the way to go.

Can’t I just use an absolute position method and a sliding window to extrapolate? Short answer: Depending on how you implement this, it either won’t work or will be very very inefficient.

Details: Absolute position embeddings are battle tested and so when engineers want to build LMs that can handle longer sequences, one of the first ideas they have is to use a sliding window with an absolute position embedding method. So if we go back to our Codex example from before, we would train the same 4,000 token LM, but at inference, we would limit the attention sublayer to only attend to the last 4,000 tokens. So when we input 4,001 tokens we would only attend to tokens 2 to 4,001 and when we input 4,002 tokens we would only attend to tokens 3 to 4,002 and so on.

There are two ways to do this:

The simpler approach is just to re-encode everything at every timestep. So in the first feedforward pass of the LM we encode the first 4,000 words, and then in the second feedforward pass when we’re looking at words 2 to 4,001, we discard everything from before and re-encode everything even though there’s an overlap of 3,999 of the words between the two runs. This will work, but is very inefficient. You have to re-encode everything during each forward pass beyond the first 4,000.
In the second approach, we don’t re-encode previously encoded tokens. This will just lead to really bad predictions (I’ve tested this). This is because the same token will be assigned to different positions during subsequent inference runs, which means that its cached representation is invalid. See visualization below.

In both of these approaches, when we’re not extrapolating (i.e. when we’re doing inference on tokens 1 to 4,000), we do normal LM inference. So for example, when token 600 comes in, we have already computed representations for tokens 1 to 599, so we attend to those representations and only have to construct new outputs for token 600. As mentioned above, if we don’t use a relative position method like ALiBi, continuing this inference beyond the first 4,000 tokens will either just produce really bad predictions or it could work but very slowly, if we re-encode the past tokens. Using ALiBi means we get to continue doing inference much beyond token 4,000 without needing to re-encode anything.

To visualize approach 2 from above, I have a toy input sentence here with a toy language model, whose context size is 4 tokens. We see two subsequent inference passes.

Let’s look at the token ‘jumped’ in these two subsequent forward passes with the sliding window + absolute embeddings approach. ‘Jumped’ was assigned position 4 in the initial forward pass, and then if we use this sliding window approach we would have to treat ‘jumped’ as position 3 in the next forward pass, even though we need to attend to the old cached representation in which it had position 4. This weirdness that the model definitely didn’t experience during training just leads it to produce very bad predictions. This approach does not work.
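The position shift described above can be reproduced in a few lines. This is a toy sketch: the five-token sentence is a made-up stand-in, and `window_positions` and `relative_offset` are hypothetical helpers, not part of any library.

```python
def window_positions(tokens, context_size):
    """Absolute positions (1-based) that a sliding window assigns to the
    last `context_size` tokens. Assumes the tokens are unique."""
    window = tokens[-context_size:]
    return {token: position for position, token in enumerate(window, start=1)}


def relative_offset(tokens, query, key):
    """Distance from `query` back to `key`, which is what a relative
    method looks at. It does not depend on where the window starts."""
    return tokens.index(query) - tokens.index(key)


sentence = ["the", "quick", "fox", "jumped", "over"]  # made-up toy input
context_size = 4

first_pass = window_positions(sentence[:4], context_size)
second_pass = window_positions(sentence[:5], context_size)

print(first_pass["jumped"])   # 4
print(second_pass["jumped"])  # 3: the cached representation was built with position 4

# The distance between two tokens is the same in both passes.
print(relative_offset(sentence[:4], "jumped", "fox"))  # 1
print(relative_offset(sentence[:5], "jumped", "fox"))  # 1
```

The mismatch between the cached position (4) and the position the window now implies (3) is exactly the situation the model never saw during training.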

Edit: Shortly after I wrote this blog post I was made aware of this new paper which reveals new evidence showing the weakness of absolute position embeddings.

To learn more about ALiBi and relative position embeddings in general, watch my lecture here:

7 Tips for Junior Researchers

2022-11-01T00:00:00+00:00

Research:

Research is hard and involves a tremendous amount of failure. It’s totally normal to feel like you didn’t progress in the past week, month, or year. When you read a paper, the finding is frequently presented as an obvious step that was the result of a seemingly short exploration, but when you actually do research you realize that good results never just fall into your hands. They’re the results of many months if not years of failed expeditions and many experiments that did not work. Failure seems extremely demoralizing at first but experienced researchers will attest that failure isn’t that bad: you learn a lot from failing (you learn what not to do, or what doesn’t work) and after enough failure, you achieve enough understanding to have a good enough idea for something that will work. Even as a senior Ph.D. candidate, I will still sometimes “fail” for months on end, running experiments and exploring ideas that lead to nothing interesting. It’s still tough for me to go through this and while I’m in that zone of trying things and constantly failing it feels depressing. But at some point things start working and it makes all that failure worthwhile.
Don’t ever lie, make up results or sweep negative findings under the rug. These things might help you in the short term but in the long term they never will, and good research is all about the long term. For example, if you run your new model with 4 different random seeds, and in one of those runs the improvement is 10%, and in the other three runs the improvements are -2%, 1% and -9%, that means that the 10% run was a fluke. If you wanted to, you could submit a paper where you don’t mention the other runs, but that would be deceitful. Sure, that paper might get accepted, but eventually someone will try your idea, and they’ll probably run it with a few different random seeds, and notice that your improvement is not statistically significant. Not only will they then not use your method, they’ll also be wary of your future papers. Junior researchers are anxious to get an initial publishable result and might put aside annoying things like statistical significance, but in the long run that will hurt them. There’s no rush: good research takes time, and it’s better to take one year to write a solid paper than it is to write four low-quality papers that each took three months to write.
Focusing is important, especially in the beginning. I’ve noticed that some junior researchers are afraid to pick one project and stick to it, instead preferring to simultaneously work on multiple projects. Their explanation is that they’re not sure which of the projects will pan out, so they want to try multiple projects at once so they have a better chance of one of the projects making it big. In my experience this logic is flawed. Good research requires intense focus and if you work on three projects at once you’re not really going to focus on any of them and so the outcome will probably not be as good as it could’ve been. Instead, I recommend picking the one direction that you’re most excited about and focusing on that. You’ll see that as you keep working on it, it will evolve, pivot, and change many times into many different directions. You may even get exhausted at some point and want to take a small break to work on something else, before coming back to the original direction. That’s ok. Just try to, at least for your first year in grad school, work on one direction at a time. Once you’ve submitted your first paper, you’ll be much wiser and have a much better understanding of the academic idea-to-submission cycle, and then if you feel towards your later years that you’d like to work on two directions at once, I believe that you’ll understand what that means and be able to handle that.
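The four-seed example from the second tip above is easy to check with a few lines of arithmetic. This sketch uses the improvements quoted there (10%, -2%, 1% and -9%); the `summarize_seeds` helper is hypothetical, and the mean and standard error are only a rough sanity check, not a substitute for a proper significance test.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_seeds(improvements):
    """Mean improvement across seeds and its standard error
    (sample standard deviation divided by sqrt(number of seeds))."""
    average = mean(improvements)
    standard_error = stdev(improvements) / sqrt(len(improvements))
    return average, standard_error


# Improvements (in percentage points) from the four random seeds.
improvements = [10.0, -2.0, 1.0, -9.0]
average, standard_error = summarize_seeds(improvements)

print(f"mean improvement: {average:.1f}")      # mean improvement: 0.0
print(f"standard error:   {standard_error:.1f}")  # standard error:   3.9
```

Averaged over the four seeds the improvement is zero, with a spread of several points between runs, which is why reporting only the 10% run would be misleading.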

Working with a mentor:

It’s ok (and even recommended) to say “I don’t understand you”. When we start doing research we usually do it with an advisor who is much more senior than us, a professor if you’re in grad school or a PhD student if you’re an undergrad. Sometimes the advisor will say something super complicated that is totally incomprehensible to the junior person. When this happened to me as a junior researcher I sometimes was too afraid to say “I don’t understand”, since I thought that the senior person would think that I’m dumb. Now I know that we all think in different ways and something that’s obvious to us might be hard to explain to a different person. Good researchers understand this and are very open to explaining and re-explaining and re-re-explaining their thoughts. Good senior researchers also know that explaining new ideas to different people helps us to better frame our thoughts and understand how to write them in a paper.
It’s ok (and even recommended) to say “I don’t know”. We’re all doing research since we don’t know a lot of things and we’re trying to figure them out together. Saying “I don’t know” when you don’t know something doesn’t make you seem stupid – it makes you seem honest. If you just start pretending to know things you don’t and have answers for the topics you don’t have answers to, it’s not going to be very constructive.
It’s ok (and even recommended) to say “I don’t agree with you”. Progress in research is partially driven by disagreements. At any given moment there are many different paths being explored to solve each issue, and that’s how we progress towards the solution. If everyone worked on the same ideas that would be horrible. So disagreement (even within the same research group or mentor-mentee pair, or even with yourself, over time) is totally acceptable in the research world. Just be nice about it! Never say something like “that’s a dumb idea” or anything even close to that. But if you disagree with a certain direction or idea, find a respectful way to voice your concern. Doing this will let the other person try to convince you why they believe their idea is good, which is helpful for both you (now you understand what they want to do and why they believe in it) and them (you might have uncovered a potential weakness which they can now try to remedy).
Your advisor doesn’t have all the answers. Doing research is a multi-faceted endeavor, with many different questions to answer: What problems are currently relevant and exciting to the community? Which of these is the best fit for me? What is the high-level plan for solving this problem? How can I best execute that plan (implementing the model, figuring out what/where to find hardware, and so on…)? Once I’ve found a solution, what is the best framing for it? How can I best market my paper? Your advisor is going to help you with some of these questions, but might not be able to give you all the answers. Some of them will have to come from you, but one thing that might significantly help is finding another senior collaborator. If you look at papers coming out of the UW NLP (and many other ML/NLP/Vision) groups you’ll notice that a lot of them have both a professor co-author and a co-author who is a postdoc or research scientist (in addition to the main author, who is usually a PhD student). I’ve found this advising style to be incredibly useful: the professor provides high-level feedback on the story and framing, while the postdoc/research scientist can provide more low-level feedback about code and other implementation details.

Closing note

After reading an earlier draft of this post one of my friends said that I didn’t mention the most important part of succeeding in research: being lucky enough to find good mentors to work with. I’m incredibly fortunate to have been able to work with some of the nicest and smartest people in the world and unfortunately not everyone is this lucky. Some mentors won’t let you say “I don’t know”, won’t respect your opinions when brainstorming, won’t give you the freedom to work on things that interest and spark joy in you and will pressure you to meet made-up deadlines. I hope that anyone with a toxic mentor can read this post, realize that there are better alternatives, and try to seek one that would work better for them.

Thank you Noah A. Smith, Gregory Axler, Gabriel Ilharco, Mitchell Wortsman, Samuel Ainsworth, Ori Press, and Gabriel Stanovsky for comments on previous drafts of this post.

The Bamboogle Dataset

2022-10-18T00:00:00+00:00

Bamboogle is a dataset that we constructed, made up only of questions that Google answers incorrectly. The leaderboard for it is here.

In our Compositionality Gap paper, we show that language models also struggle with these questions and that our self-ask prompting method substantially improves the ability of language models to answer these questions (better than Chain-of-Thought).

For more details, check out the video above.

Bamboogle was introduced in our Compositionality Gap paper which can be found here, and the dataset itself is here.

The Compositionality Gap and the Compositional Celebrities Dataset

2022-10-17T00:00:00+00:00

As language models grow in size they know more, but do they get better at reasoning? To test GPT-3, we generated lots of questions such as “What is the calling code of the birthplace of Adele?”. We show that as GPT-3 size grows, it does not improve its reasoning abilities on these types of questions.

Compositional Celebrities

To test the reasoning abilities of LMs we built Compositional Celebrities, a dataset of 8.6k questions in 17 different categories.

These questions all require first retrieving 2 facts and then conducting some basic reasoning about them. For example, to answer “What is the calling code of the birthplace of Adele?” a model must first know that Adele was born in the UK and then should figure out that it needs to return the calling code of the UK: +44.

The reasoning required to answer them is simple, and the basic facts appear frequently (since they are either related to celebrities or their birth country or year), but we believe that these 2-hop questions have never previously appeared in any text that could be in the training set of an LM.

The Compositionality Gap

We can check the accuracy on these compositional questions (blue) or the accuracy for each pair of sub-questions separately (i.e. “What is the birthplace of Adele?” & “What is the calling code of the U.K.?”).

The surprising result we uncovered is that the compositionality gap doesn’t narrow with scale!

The compositionality gap is the fraction of compositional questions that GPT-3 can’t answer even though it can separately answer the two sub-questions that make up the compositional question.
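The definition above can be written as a short computation. This is a minimal sketch, not the evaluation code from the paper: the function name and the three boolean lists are hypothetical, and it assumes the denominator is the set of compositional questions for which both sub-questions were answered correctly.

```python
from typing import Sequence


def compositionality_gap(
    sub1_correct: Sequence[bool],
    sub2_correct: Sequence[bool],
    comp_correct: Sequence[bool],
) -> float:
    """Fraction of compositional questions answered incorrectly, among
    those whose two sub-questions were both answered correctly."""
    # Keep only the questions where the model knows both underlying facts.
    eligible = [
        comp
        for first, second, comp in zip(sub1_correct, sub2_correct, comp_correct)
        if first and second
    ]
    if not eligible:
        raise ValueError("no question has both sub-questions answered correctly")
    # Count the failures to compose the two known facts.
    failures = sum(1 for comp in eligible if not comp)
    return failures / len(eligible)
```

With this definition, a model that learns more facts but composes them no better keeps the same gap: the denominator grows, but so does the number of failures.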

As GPT-3 gets larger it’s remembering more facts but it’s not able to compose ~40% of these fact pairs, at all model sizes between 1B and 175B parameters! Maybe scale can’t solve everything?

This surprising result also occurs in the InstructGPT-3 family of models! The compositionality gap stays around 40% no matter how much we increase model size.

In the table below, we zoom in on the results for the best GPT-3 model, davinci-002. Some compositional question categories are really easy for it, like Birthplace/Domain Name (80% acc), but some are super hard, like Birth Year/Lit. Nobel Winner (1% acc). We’re not quite sure why.

And finally, the figure below presents an interesting finding: when GPT-3 (davinci-002) is very confident about two facts, it will be able to answer the compositional 2-hop question about them with much higher probability!

Our paper is available here, and the Compositional Celebrities dataset is available on GitHub.

Self-ask Prompting

2022-10-10T00:00:00+00:00

Self-ask is a new prompting method which improves the ability of language models to answer complex questions.

Normally a question answering prompt looks like this:

Question: Who lived longer, Muhammad Ali or Alan Turing?
Answer: Muhammad Ali 

Question: When was the founder of craigslist born?
Answer: December 6, 1952

Question: Who was the maternal grandfather of George Washington?
Answer: Joseph Ball 

Question: Are both the directors of Jaws and Casino Royale from the same country? 
Answer: No

In self-ask, we first have the model generate and then answer sub-questions about the main input question, before answering the input question. So our prompt would look like so:

Question: Who lived longer, Muhammad Ali or Alan Turing?
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali 

Question: When was the founder of craigslist born?
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952

Question: Who was the maternal grandfather of George Washington?
Are follow up questions needed here: Yes.
Follow up: Who was the mother of George Washington?
Intermediate answer: The mother of George Washington was Mary Ball Washington.
Follow up: Who was the father of Mary Ball Washington?
Intermediate answer: The father of Mary Ball Washington was Joseph Ball.
So the final answer is: Joseph Ball 

Question: Are both the directors of Jaws and Casino Royale from the same country? 
Are follow up questions needed here: Yes. 
Follow up: Who is the director of Jaws? 
Intermediate Answer: The director of Jaws is Steven Spielberg. 
Follow up: Where is Steven Spielberg from? 
Intermediate Answer: The United States. 
Follow up: Who is the director of Casino Royale? 
Intermediate Answer: The director of Casino Royale is Martin Campbell. 
Follow up: Where is Martin Campbell from? 
Intermediate Answer: New Zealand. 
So the final answer is: No

This leads the model to solve test-time questions by first answering sub-questions about them, which improves performance.

The structure of our prompt allows us to easily parse out these sub-questions and have Google Search answer them instead of the LM. We show that this further improves performance. In our paper we call this system Self-ask + Search Engine. Note that Google Search does not have a publicly available API, and so we use SerpApi, which is a cloud service that provides an easy-to-use API to Google Search.
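The parsing loop described above might look roughly like the sketch below. It is an illustration under stated assumptions, not our released code: `generate` and `search` are hypothetical callables standing in for a language model (which continues the text and stops before writing an intermediate answer) and for a search backend. Neither is a real library API.

```python
from typing import Callable, Optional

FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"


def last_follow_up(text: str) -> Optional[str]:
    """Return the most recent follow-up question in `text`, if any."""
    for line in reversed(text.strip().splitlines()):
        line = line.strip()
        if line.startswith(FOLLOW_UP):
            return line[len(FOLLOW_UP):].strip()
    return None


def self_ask_with_search(
    question: str,
    generate: Callable[..., str],   # hypothetical LM: generate(text, stop=...)
    search: Callable[[str], str],   # hypothetical search backend
    max_hops: int = 5,
) -> Optional[str]:
    transcript = f"Question: {question}\nAre follow up questions needed here:"
    for _ in range(max_hops):
        # The LM continues the transcript but stops before answering
        # its own follow-up question.
        continuation = generate(transcript, stop=INTERMEDIATE)
        transcript += continuation
        if FINAL in continuation:
            return continuation.split(FINAL, 1)[1].strip()
        sub_question = last_follow_up(continuation)
        if sub_question is None:
            break
        # The search backend, not the LM, supplies the intermediate answer.
        transcript += f"{INTERMEDIATE} {search(sub_question)}\n"
    return None
```

In practice the few-shot self-ask examples would be prepended to the transcript. They are omitted here to keep the sketch short.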

Watch our demo (embedded above) for a deeper overview of Self-ask and Self-ask + Google Search.

To learn more about the Compositional Celebrities dataset that we created to evaluate self-ask, and about the surprising compositionality gap that we discovered, go to this post.

To learn more about the Bamboogle dataset that we created, made up only of questions that Google can’t answer, go here.

Our paper is available here, and our code for Self-ask + Google Search is on GitHub.

The following video provides an in-depth overview of all of the topics listed above:

Improving Transformer Models by Reordering their Sublayers

2020-05-04T00:00:00+00:00

The transformer layer is currently the primary component in natural language processing, playing a leading role in recent innovations such as BERT and GPT-3. Each transformer layer consists of a self-attention sublayer (s) followed by a feedforward sublayer (f), creating an interleaving pattern of self-attention and feedforward sublayers throughout a multilayer transformer model.

Is this interleaved pattern the best way to order these sublayers?

In this post, I’ll explain how we recently found a better way to order these sublayers. That ordering leads to superior performance on multiple language modeling benchmarks.

We started by generating random transformer models, varying the number of each type of sublayer, and their ordering, while keeping the overall model size (number of parameters) constant. Here are a few of these randomly generated models:

(Note that one self-attention sublayer has half the parameters of a feedforward sublayer, so you’ll notice that models that have more feedforward sublayers are shallower.)
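To make the fixed parameter budget concrete, here is one way such random orderings could be sampled. This is only a sketch and not necessarily the sampling procedure we used: the function name is hypothetical, and the budget is measured in units where a self-attention sublayer (s) costs 1 and a feedforward sublayer (f) costs 2, following the note above.

```python
import random


def random_ordering(budget: int, rng: random.Random) -> str:
    """Sample a string of 's' and 'f' sublayers whose total cost is `budget`.

    Cost model (an assumption taken from the note above): 's' = 1, 'f' = 2.
    """
    sublayers = []
    remaining = budget
    while remaining > 0:
        # Only 's' fits when a single unit of budget is left.
        choice = rng.choice("sf") if remaining >= 2 else "s"
        sublayers.append(choice)
        remaining -= 1 if choice == "s" else 2
    return "".join(sublayers)


# A baseline with 16 's' and 16 'f' sublayers costs 16 * 1 + 16 * 2 = 48 units.
rng = random.Random(0)
samples = [random_ordering(48, rng) for _ in range(3)]
```

Orderings with more f sublayers come out shorter, which is the "shallower" effect mentioned in the note.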

We trained these models on the standard WikiText-103 word-level language modeling benchmark. While most of these randomly generated models performed worse than the interleaved model, about a third of these random models outperformed it (mostly by a small margin). Our analysis shows that models with more self-attention toward the bottom and more feedforward sublayers toward the top tend to perform better in general:

We also observed that models that had an equal number of self-attention and feedforward sublayers tended to perform better than models that had an unequal number of self-attention and feedforward sublayers. Based on this insight, we design a new family of transformer models that follow a distinct sublayer ordering pattern: sandwich transformers. Sandwich transformers are made up of n self-attention sublayers, followed by the regular interleaved transformer pattern, followed by n feedforward sublayers:
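The sandwich pattern can be written down as a string of s and f sublayers. The sketch below assumes that the total number of each sublayer type stays the same as in the interleaved baseline, so that the result is a pure reordering. The function and parameter names are illustrative.

```python
def sandwich_ordering(num_layers: int, coefficient: int) -> str:
    """Build a sandwich ordering with `num_layers` of each sublayer type.

    `coefficient` self-attention sublayers come first, then the regular
    interleaved pattern, then `coefficient` feedforward sublayers.
    """
    if not 0 <= coefficient <= num_layers:
        raise ValueError("coefficient must be between 0 and num_layers")
    return (
        "s" * coefficient
        + "sf" * (num_layers - coefficient)
        + "f" * coefficient
    )


baseline = sandwich_ordering(16, 0)    # the interleaved baseline: sfsf...sf
sandwich = sandwich_ordering(16, 6)    # six s, ten sf pairs, six f
extreme = sandwich_ordering(16, 16)    # all s sublayers, then all f sublayers
```

A coefficient of 0 recovers the baseline, and every coefficient yields the same number of s and f sublayers, so no parameters are added.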

Our experiments demonstrate that a sandwich transformer outperforms the baseline interleaved transformer model. This result is made more interesting by the fact that our sandwich transformer is simply a reordering of the sublayers in the baseline model, and does not require more parameters, memory, or training time. On WikiText-103, sandwich transformers improve perplexity while also reducing the variance caused by selecting different random seeds:

Finally, we demonstrate that even though the sandwich transformer is motivated by random search experiments on WikiText-103, it can improve performance on additional domains and tasks. Sandwich transformers achieve state-of-the-art results on the enwik8 character-level language modeling dataset and on an additional word-level corpus. We conjecture that tuning transformer reorderings to specific tasks could yield even larger gains, and that further exploration of the ordering space may provide universally beneficial patterns.

Other conclusions and insights:

The transformer layer is not the smallest indivisible unit. The self-attention or feedforward sublayers can each function independently.
The transformer architecture is quite robust to sublayer order changes. A significant number of the random orderings that we trained performed just as well as (and sometimes better than) the baseline.
The ‘extreme sandwich’ ordering s¹⁶f¹⁶ (shown below) works almost as well as the baseline on WikiText-103.

The optimal transformer ordering is not identical across different datasets. For example, the best sandwiching coefficient for WikiText-103 is 6, but the best coefficient for the Toronto Book Corpus language modeling dataset is 7. For character level language modeling the optimal sandwiching coefficients were also different.

The paper is available here. We also have a video presentation available here.

Neural Language Models Explained

2017-09-07T00:00:00+00:00

Language models assign probability values to sequences of words. Those three words that appear right above your keyboard on your phone that try to predict the next word you’ll type are one of the uses of language modeling. In the case shown below, the language model is predicting that “from”, “on” and “it” have a high probability of being the next word in the given sentence. Internally, for each word in its vocabulary, the language model computes the probability that it will be the next word, but the user only gets to see the top three most probable words.

Neural language models are a fundamental part of many systems that attempt to solve natural language processing tasks such as machine translation and speech recognition. Currently, all state of the art language models are neural networks.

The first part of this post presents a simple feedforward neural network that solves this task. In the second part of the post, we will improve the simple model by adding to it a recurrent neural network (RNN). The final part will discuss two recently proposed regularization techniques for improving RNN based language models.

A simple model

To begin, we will build a simple model that, given a single word taken from some sentence, tries to predict the word following it.

We represent words using one-hot vectors: we decide on an arbitrary ordering of the words in the vocabulary and then represent the nth word as a vector of the size of the vocabulary (N), which is set to 0 everywhere except element n which is set to 1.

The model can be separated into two components:

We start by encoding the input word. This is done by taking the one-hot vector representing the input word (c in the diagram), and multiplying it by a matrix of size (N,200) which we call the input embedding (U). This multiplication results in a vector of size 200, which is also referred to as a word embedding. This embedding is a dense representation of the current input word. This representation is much smaller than the one-hot vector representing the same word, and it also has some other interesting properties. For example, while the distance between every two words represented by one-hot vectors is always the same, these dense representations have the property that words that are close in meaning will have representations that are close in the embedding space.
The second component can be seen as a decoder. After the encoding step, we have a representation of the input word. We multiply it by a matrix of size (200,N), which we call the output embedding (V). The resulting vector of size N is then passed through the softmax function, normalizing its values into a probability distribution (meaning each one of the values is between 0 and 1, and their sum is 1). This distribution is denoted by p in the diagram above.

The decoder is a simple function that takes a representation of the input word and returns a distribution which represents the model’s predictions for the next word: the model assigns to each word the probability that it will be the next word in the sequence.
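The two components described above amount to two matrix multiplications and a softmax. Here is a minimal NumPy sketch of the forward pass. The weights are random rather than trained, and the helper names are illustrative.

```python
import numpy as np

N, d = 10_000, 200                        # vocabulary size, embedding size
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(N, d))    # input embedding (untrained)
V = rng.normal(scale=0.1, size=(d, N))    # output embedding (untrained)


def one_hot(n: int, size: int) -> np.ndarray:
    """Vector of zeros with a single 1 at position n."""
    c = np.zeros(size)
    c[n] = 1.0
    return c


def softmax(z: np.ndarray) -> np.ndarray:
    """Normalize scores into a probability distribution."""
    z = z - z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def predict_next(word_index: int) -> np.ndarray:
    c = one_hot(word_index, N)
    embedding = c @ U                     # encoder: selects row word_index of U
    scores = embedding @ V                # decoder: one score per vocabulary word
    return softmax(scores)                # p: distribution over the next word


p = predict_next(42)
```

Multiplying a one-hot vector by U just selects a row of U, which is why real implementations replace that multiplication with a table lookup.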

To train this model, we need pairs of input and target output words. For the (input, target-output) pairs we use the Penn Treebank dataset which contains around 40K sentences from news articles, and has a vocabulary of exactly 10,000 words. To generate word pairs for the model to learn from, we will just take every pair of neighboring words from the text and use the first one as the input word and the second one as the target output word. So for example for the sentence “The cat is on the mat” we will extract the following word pairs for training: (The, cat), (cat, is), (is, on), and so on.
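Extracting the training pairs takes only a couple of lines. The sketch below uses whitespace tokenization, which is a simplification, and a hypothetical function name.

```python
def neighbor_pairs(sentence: str) -> list:
    """Return (input word, target word) pairs for every two neighboring words."""
    words = sentence.split()
    # Pair each word with the word that follows it.
    return list(zip(words, words[1:]))


pairs = neighbor_pairs("The cat is on the mat")
# pairs == [("The", "cat"), ("cat", "is"), ("is", "on"), ("on", "the"), ("the", "mat")]
```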

We use stochastic gradient descent to update the model during training, and the loss used is the cross-entropy loss. Intuitively, this loss measures the distance between the output distribution predicted by the model and the target distribution for each pair of training words. The target distribution for each pair is a one-hot vector representing the target word.

The metric used for reporting the performance of a language model is its perplexity on the test set. It is defined as $e^{-\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i}}$, where $p_{\text{target}_i}$ is the probability given by the model to the ith target word. Perplexity is a decreasing function of the average log probability that the model assigns to each target word. We want to maximize the probability that we give to each target word, which means that we want to minimize the perplexity (the optimal perplexity is 1).

The perplexity for the simple model¹ is about 183 on the test set, which means that on average it assigns a probability of about $0.005$ to the correct target word in each pair in the test set. It’s much better than a naive model which would assign an equal probability to each word (which would assign a probability of $\frac {1} {N} = \frac {1} {10,000} = 0.0001$ to the correct word), but we can do much better.
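The perplexity definition and the numbers above can be checked with a few lines of Python. This is a sketch: the function name is illustrative, and the input is simply the list of probabilities that a model assigned to each correct target word.

```python
import math


def perplexity(target_probs) -> float:
    """exp of the negative average log probability of the target words."""
    avg_log_prob = sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(-avg_log_prob)


# A model that always gives the correct word probability 1/183 has perplexity 183.
simple_model = perplexity([1 / 183] * 1000)

# A naive model that spreads probability uniformly over 10,000 words.
uniform_model = perplexity([1 / 10_000] * 1000)

# A perfect model assigns probability 1 to every target word.
perfect_model = perplexity([1.0] * 1000)
```

The average negative log probability inside the exponent is the average cross-entropy loss over the target words, so minimizing that loss during training also minimizes perplexity.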

Using RNNs to improve performance

The biggest problem with the simple model is that to predict the next word in the sentence, it only uses a single preceding word. If we could build a model that would remember even just a few of the preceding words there should be an improvement in its performance. To understand why adding memory helps, think of the following example: what words follow the word “drink”? You’d probably say that “coffee”, “beer” and “soda” have a high probability of following it. If I told you the word sequence was actually “Cows drink”, then you would completely change your answer.

We can add memory to our model by augmenting it with a recurrent neural network (RNN), as shown below.

This model is similar to the simple one, except that after encoding the current input word we feed the resulting representation (of size 200) into a two-layer LSTM, which then outputs a vector also of size 200 (at every time step the LSTM also receives a vector representing its previous state; this is not shown in the diagram). Then, just like before, we use the decoder to convert this output vector into a vector of probability values. (LSTM is just a fancier RNN that is better at remembering the past. Its “API” is identical to the “API” of an RNN: the LSTM at each time step receives an input and its previous state, and uses those two inputs to compute an updated state and an output vector².)
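To illustrate that “API”, here is a minimal NumPy sketch of one step of a single LSTM layer, using the standard gate equations. It is not the implementation referenced in this post: the weights are random, the names are illustrative, and only one layer is shown for brevity.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One time step: (input, previous state) -> (output, updated state).

    The state is the pair (h, c); the output vector is h.
    W has shape (input_size + hidden_size, 4 * hidden_size).
    """
    z = np.concatenate([x, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)          # input, forget, output gates; candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # updated cell state
    h = sigmoid(o) * np.tanh(c)                         # output / hidden state
    return h, c


d = 200                                   # embedding and hidden size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2 * d, 4 * d))
b = np.zeros(4 * d)

h, c = np.zeros(d), np.zeros(d)           # initial state
for x in rng.normal(size=(5, d)):         # five stand-in word embeddings
    h, c = lstm_step(x, h, c, W, b)       # the state carries over between steps
```

After the loop, h depends on all five inputs, which is what lets the decoder condition on more than the single preceding word.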

Now we have a model that at each time step gets not only the current word representation, but also the state of the LSTM from the previous time step, and uses this to predict the next word. The state of the LSTM is a representation of the previously seen words (note that words that we saw recently have a much larger impact on this state than words we saw a while ago).

As expected, performance improves and the perplexity of this model on the test set is about 114. An implementation of this model³, along with a detailed explanation, is available in Tensorflow.

The importance of regularization

114 perplexity is good but we can still do much better. In this section I’ll present some recent advances that improve the performance of RNN based language models.

Dropout

We could try improving the network by increasing the size of the embeddings and LSTM layers (until now the size we used was 200), but soon enough this stops increasing the performance because the network overfits the training data (it uses its increased capacity to remember properties of the training set which leads to inferior generalization, i.e. performance on the unseen test set). One way to counter this, by regularizing the model, is to use dropout.

The diagram below is a visualization of the RNN based model unrolled across three time steps. x and y are the input and output sequences, and the gray boxes represent the LSTM layers. Vertical arrows represent an input to the layer that is from the same time step, and horizontal arrows represent connections that carry information from previous time steps.

We can apply dropout on the vertical (same time step) connections:

The arrows are colored in places where we apply dropout. A dropout mask for a certain layer indicates which of that layer’s activations are zeroed. In this case, we use different dropout masks for the different layers (this is indicated by the different colors in the diagram).

Applying dropout to the recurrent connections harms the performance, and so in this initial use of dropout we use it only on connections within the same time step. Using two LSTM layers, with each layer containing 1500 LSTM units, we achieve a perplexity of 78 (we dropout activations with a probability of 0.65)⁴.

The recently introduced variational dropout solves this problem and improves the model’s performance even more (to 75 perplexity) by using the same dropout masks at each time step.
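The difference between the two schemes comes down to how often a mask is sampled. The sketch below contrasts them for a single connection, assuming that the standard scheme draws a fresh mask at every time step. The helper name and the sizes are illustrative, and the masks use the common “inverted dropout” scaling.

```python
import numpy as np


def dropout_mask(size: int, drop_prob: float, rng: np.random.Generator) -> np.ndarray:
    """Mask that zeroes each unit with probability drop_prob and rescales the rest."""
    keep_prob = 1.0 - drop_prob
    return (rng.random(size) < keep_prob) / keep_prob


rng = np.random.default_rng(0)
time_steps, units, drop_prob = 3, 8, 0.65

# Standard dropout: a different mask is sampled at each time step.
per_step_masks = np.stack(
    [dropout_mask(units, drop_prob, rng) for _ in range(time_steps)]
)

# Variational dropout: one mask is sampled once and reused at every time step.
shared_masks = np.tile(dropout_mask(units, drop_prob, rng), (time_steps, 1))

activations = np.ones((time_steps, units))
dropped = activations * shared_masks      # the same units are zeroed at every step
```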

Weight Tying

The input embedding and output embedding have a few properties in common. The first property they share is that they are both of the same size (in our RNN model with dropout they are both of size (10000,1500)).

The second property that they share in common is a bit more subtle. In the input embedding, words that have similar meanings are represented by similar vectors (similar in terms of cosine similarity). This is because the model learns that it needs to react to similar words in a similar fashion (the words that follow the word “quick” are similar to the ones that follow the word “rapid”).

This also occurs in the output embedding. The output embedding receives a representation of the RNN’s belief about the next output word (the output of the RNN) and has to transform it into a distribution. Given the representation from the RNN, the probability that the decoder assigns to a word depends mostly on its representation in the output embedding (the probability is exactly the softmax normalized dot product of this representation and the output of the RNN).

Given the RNN output at a certain time step, the model would like to assign similar probability values to similar words. Therefore, similar words are represented by similar vectors in the output embedding. (Again, if a certain RNN output results in a high probability for the word “quick”, we expect that the probability for the word “rapid” will be high as well.)

These two similarities led us to recently propose a very simple method, weight tying, to lower the model’s parameters and improve its performance. We simply tie its input and output embedding (i.e. we set U=V, meaning that we now have a single embedding matrix that is used both as an input and output embedding). This reduces the perplexity of the RNN model that uses dropout to 73, and its size is reduced by more than 20%⁵.
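The size reduction can be sanity-checked with some quick arithmetic. This is a rough sketch that assumes a standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and an output bias vector. The exact count depends on implementation details.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters in one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and a bias vector (assumed parameterization)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)


vocab_size, hidden_size, num_layers = 10_000, 1500, 2

lstm_params = num_layers * lstm_layer_params(hidden_size, hidden_size)
input_embedding = vocab_size * hidden_size     # U
output_embedding = vocab_size * hidden_size    # V
output_bias = vocab_size

untied = input_embedding + output_embedding + output_bias + lstm_params
tied = untied - output_embedding               # U = V: one matrix instead of two

reduction = 1 - tied / untied
print(f"untied: {untied:,}  tied: {tied:,}  reduction: {reduction:.1%}")
```

Under these assumptions the untied model has about 66M parameters and the tied one about 51M, a reduction of roughly 23%, consistent with the figure quoted above.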

Why does weight tying work?

The perplexity of the variational dropout RNN model on the test set is 75. The same model achieves 24 perplexity on the training set. So the model performs much better on the training set than it does on the test set. This means that it has started to remember certain patterns or sequences that occur only in the training set and do not help the model to generalize to unseen data. One of the ways to counter this overfitting is to reduce the model’s ability to ‘memorize’ by reducing its capacity (number of parameters). By applying weight tying, we remove a large number of parameters.

In addition to the regularizing effect of weight tying we presented another reason for the improved results. We showed that in untied language models the word representations in the output embedding are of much higher quality than the ones in the input embedding. This is shown using embedding evaluation benchmarks such as Simlex999. In a weight tied model, because the tied embedding’s parameter updates at each training iteration are very similar to the updates of the output embedding of the untied model, the tied embedding performs similarly to the output embedding of the untied model. So in the tied model, we use a single high quality embedding matrix in two places in the model. This contributes to the improved performance of the tied model⁶.

To summarize, this post presented how to improve a very simple feedforward neural network language model, by first adding an RNN, and then adding variational dropout and weight tying to it.

In recent months, we’ve seen further improvements to the state of the art in RNN language modeling. The current state of the art results are held by two recent papers by Melis et al. and Merity et al. These models make use of most, if not all, of the methods shown above, and extend them by using better optimization techniques, new regularization methods, and by finding better hyperparameters for existing models.

This model is the skip-gram word2vec model presented in Efficient Estimation of Word Representations in Vector Space. ↩
For a detailed explanation of this watch Edward Grefenstette’s Beyond Seq2Seq with Augmented RNNs lecture. ↩
This model is the small model presented in Recurrent Neural Network Regularization. ↩
This is the large model from Recurrent Neural Network Regularization. ↩
In parallel to our work, an explanation for weight tying based on Distilling the Knowledge in a Neural Network was presented in Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. ↩
Our paper explains this in detail. ↩

How to Start Learning Deep Learning

2016-06-26T00:00:00+00:00

Due to the recent achievements of artificial neural networks across many different tasks (such as face recognition, object detection and Go), deep learning has become extremely popular. This post aims to be a starting point for those interested in learning more about it.

If you already have a basic understanding of linear algebra, calculus, probability and programming: I recommend starting with Stanford’s CS231n. The course notes are comprehensive and well-written. The slides for each lesson are also available, and even though the accompanying videos were removed from the official site, re-uploads are quite easy to find online.

If you don’t have the relevant math background: There is an incredible amount of free material online that can be used to learn the required math knowledge. Gilbert Strang’s course on linear algebra is a great introduction to the field. For the other subjects, edX has courses from MIT on both calculus and probability.

If you are interested in learning more about machine learning: Andrew Ng’s Coursera class is a popular choice as a first class in machine learning. There are other great options available such as Yaser Abu-Mostafa’s machine learning course which focuses much more on theory than the Coursera class but it is still relevant for beginners. Knowledge in machine learning isn’t really a prerequisite to learning deep learning, but it does help. In addition, learning classical machine learning and not only deep learning is important because it provides a theoretical background and because deep learning isn’t always the correct solution.

CS231n isn’t the only deep learning course available online. Geoffrey Hinton’s course “Neural Networks for Machine Learning” covers a lot of different topics, and so does Hugo Larochelle’s “Neural Networks Class”. Both of these classes contain video lectures. Nando de Freitas also has a course available online which contains videos, slides and also a list of homework assignments.

If you prefer reading over watching video lectures: Neural Networks and Deep Learning is a free online book for beginners to the field. The Deep Learning Book is also a great free book, but it is slightly more advanced.

Where to go after you’ve got the basics:

Computer Vision is covered by most, if not all, of the deep learning resources mentioned above.
Recurrent Neural Networks (RNNs) are the basis of neural network based models that solve tasks related to sequences such as machine translation or speech recognition. Andrej Karpathy’s blog post on RNNs is a great place to start learning about them. Christopher Olah has a great blog where many deep learning concepts are explained in a very visual and easy to understand way. His post on LSTM networks is an introduction to LSTM networks, which are a widely used RNN variant.
Natural Language Processing (NLP): CS224n is an introduction to NLP with deep learning. Advanced courses are available from both Kyunghyun Cho (with lecture notes here) and Yoav Goldberg.
Reinforcement Learning: If you’d like to control robots or beat the human champion of Go, you should probably use reinforcement learning. Andrej Karpathy’s post on deep reinforcement learning is an excellent starting point. David Silver also recently published a short blog post introducing deep reinforcement learning.

Deep learning frameworks: There are many frameworks for deep learning but the top two are Tensorflow (by Google) and PyTorch (by Facebook). They are both great, but if I had to select just one to recommend I’d say that PyTorch is the best for beginners, mostly because of the great tutorials available and how simple its API is.

If you’d like to train neural networks you should probably do it on a GPU. You don’t have to, but it’s much faster if you do. NVIDIA cards are the industry standard, and while most research labs use $1000 graphics cards, there are a few affordable cards that can also get the work done. An even cheaper option is to rent a GPU-enabled instance from a cloud server provider like Amazon’s EC2 (short guide here).

Good luck!

Updated February 2018