How to Build Good Language Modeling Benchmarks

Building benchmarks is important because they shine a spotlight on the weaknesses of existing language models and so can guide the community on how to improve them.

5 Tips for Finding Research Topics

I’ve spent the past eight years doing research on neural language models. While my top-level goal of making language models more useful to humans has remained stable, my path to achieving it has changed over time. I’ve learned that in research, you not only have to learn how to execute on a given idea, but you also have to learn how to pick which direction to focus your efforts on.

PhD Thesis Acknowledgments and Dedication

The only important parts of a PhD thesis are the acknowledgments and dedication sections 😉 so I’ve uploaded the ones from my thesis here.

The Use Case for Relative Position Embeddings

We’re in 2022, but many of our most popular causal language models (LMs), including GPT-3, still use absolute positional embeddings. I believe we should stop using those and move to relative positional embeddings such as ALiBi. DeepMind’s Gopher and BigScience’s BLOOM already use relative positioning methods, and I’ve heard that multiple upcoming models will as well, so hopefully this post will help encourage the remaining holdouts to follow suit.

Research:

Research is hard and involves a tremendous amount of failure. It’s totally normal to feel like you didn’t progress in the past week, month, or year. When you read a paper, the finding is frequently presented as an obvious step that was the result of a seemingly short exploration, but when you actually do research you realize that good results never just fall into your hands. They’re the results of many months if not years of failed expeditions and many experiments that did not work. Failure seems extremely demoralizing at first, but experienced researchers will attest that failure isn’t that bad: you learn a lot from failing (you learn what not to do, or what doesn’t work) and after enough failure, you achieve enough understanding to have a good enough idea for something that will work. Even as a senior Ph.D. candidate, I will still sometimes “fail” for months on end, running experiments and exploring ideas that lead to nothing interesting. It’s still tough for me to go through this, and while I’m in that zone of trying things and constantly failing, it feels depressing. But at some point things start working and it makes all that failure worthwhile.
Don’t ever lie, make up results or sweep negative findings under the rug. These things might help you in the short term but in the long term they never will, and good research is all about the long term. For example, if you run your new model with 4 different random seeds, and in one of those runs the improvement is 10%, and in the other three runs the improvements are -2%, 1% and -9%, that means that the 10% run was a fluke. If you wanted to, you could submit a paper where you don’t mention the other runs, but that would be deceitful. Sure, that paper might get accepted, but eventually someone will try your idea, and they’ll probably run it with a few different random seeds, and notice that your improvement is not statistically significant. Not only will they then not use your method, they’ll also be wary of your future papers. Junior researchers are anxious to get an initial publishable result and might put aside annoying things like statistical significance, but in the long run that will hurt them. There’s no rush: good research takes time, and it’s better to take one year to write a solid paper than it is to write four low-quality papers that each took three months to write.
Focusing is important, especially in the beginning. I’ve noticed that some junior researchers are afraid to pick one project and stick to it, instead preferring to work on multiple projects simultaneously. Their explanation is that they’re not sure which of the projects will pan out, so they want to try multiple projects at once so they have a better chance of one of the projects making it big. In my experience this logic is flawed. Good research requires intense focus and if you work on three projects at once you’re not really going to focus on any of them and so the outcome will probably not be as good as it could’ve been. Instead, I recommend picking the one direction that you’re most excited about and focusing on that. You’ll see that as you keep working on it, it will evolve, pivot, and change many times into many different directions. You may even get exhausted at some point and want to take a small break to work on something else, before coming back to the original direction. That’s ok. Just try to, at least for your first year in grad school, work on one direction at a time. Once you’ve submitted your first paper, you’ll be much wiser and have a much better understanding of the academic idea-to-submission cycle, and then if you feel towards your later years that you’d like to work on two directions at once, I believe that you’ll understand what that means and be able to handle that.
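
To make the random-seed example above concrete, here is a minimal sketch that summarizes the four runs it mentions (10%, -2%, 1% and -9%). The variable names are just illustrative, and reporting the mean with a standard error is one simple choice among many, not a full significance test.

```python
import statistics

# Relative improvements (in %) from the four random seeds in the example above.
improvements = [10.0, -2.0, 1.0, -9.0]

# Average improvement across all seeds, not just the best one.
mean_improvement = statistics.mean(improvements)

# Sample standard deviation: how much the result varies from seed to seed.
std_improvement = statistics.stdev(improvements)

# Standard error of the mean: a rough measure of how uncertain the average is.
standard_error = std_improvement / len(improvements) ** 0.5

print(f"best run: {max(improvements):.1f}%")
print(f"mean: {mean_improvement:.1f}%  std: {std_improvement:.1f}%  stderr: {standard_error:.1f}%")
```

The average over the four runs is zero while the spread between seeds is several percentage points, which is why the single 10% run should not be reported on its own.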

The Bamboogle Dataset

Bamboogle is a dataset that we constructed, made up only of questions that Google answers incorrectly. The leaderboard for it is here.

The Compositionality Gap and the Compositional Celebrities Dataset

As language models grow in size they know more, but do they get better at reasoning? To test GPT-3, we generated lots of questions such as “What is the calling code of the birthplace of Adele?”. We show that as GPT-3 grows in size, its reasoning ability on these types of questions does not improve.

Self-ask Prompting

Self-ask is a new prompting method which improves the ability of language models to answer complex questions.

Improving Transformer Models by Reordering their Sublayers

The transformer layer is currently the primary component in natural language processing, playing a leading role in recent innovations such as BERT and GPT-3. Each transformer layer consists of a self-attention sublayer (s) followed by a feedforward sublayer (f), creating an interleaving pattern of self-attention and feedforward sublayers throughout a multilayer transformer model.
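
The interleaving pattern described above can be written down as a string, using the same s and f notation. This is only an illustrative sketch; the function name is hypothetical and not taken from the post.

```python
def interleaved_pattern(num_layers: int) -> str:
    """Return the sublayer order of a standard transformer with num_layers layers.

    Each layer contributes one self-attention sublayer ("s") followed by
    one feedforward sublayer ("f").
    """
    return "sf" * num_layers


pattern = interleaved_pattern(3)
print(pattern)  # sfsfsf
```

A model with 3 layers therefore has 6 sublayers, alternating between the two types.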

Neural Language Models Explained

Language models assign probability values to sequences of words. Those three words that appear right above your keyboard on your phone that try to predict the next word you’ll type are one of the uses of language modeling. In the case shown below, the language model is predicting that “from”, “on” and “it” have a high probability of being the next word in the given sentence. Internally, for each word in its vocabulary, the language model computes the probability that it will be the next word, but the user only gets to see the top three most probable words.
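
The keyboard example above can be sketched in a few lines. The scores below are made up for illustration, the five-word vocabulary is far smaller than a real one, and turning scores into probabilities with a softmax is an assumption about how the model works internally; a real language model would compute the scores with a neural network.

```python
import math


def softmax(scores):
    """Turn a dict of raw scores into a dict of probabilities that sum to 1."""
    max_score = max(scores.values())  # subtracted for numerical stability
    exps = {word: math.exp(s - max_score) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


# Hypothetical scores for the next word, one per word in a toy vocabulary.
toy_scores = {"from": 2.1, "on": 1.7, "it": 1.5, "the": 0.2, "banana": -3.0}

# The model assigns a probability to every word in the vocabulary...
probs = softmax(toy_scores)

# ...but the user only gets to see the three most probable words.
top_three = sorted(probs, key=probs.get, reverse=True)[:3]
print(top_three)  # ['from', 'on', 'it']
```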

How to Start Learning Deep Learning

Due to the recent achievements of artificial neural networks across many different tasks (such as face recognition, object detection and Go), deep learning has become extremely popular. This post aims to be a starting point for those interested in learning more about it.

Ofir Press