We’re in 2022 but many of our most popular causal language models (LMs), including GPT-3, still use absolute positional embeddings. I believe we should stop using those and move to relative positional embeddings such as ALiBi. Deepmind’s Gopher and BigScience’s BLOOM already use relative positioning methods, and I’ve heard that multiple upcoming models also will, and so hopefully this post will help in encouraging the remanining holdouts to follow suit.
- Research is hard and involves a tremendous amount of failure. It’s totally normal to feel like you didn’t progress in the past week, month, or year. When you read a paper, the finding is frequently presented as an obvious step that was the result of a seemingly short exploration, but when you actually do research you realize that good results never just fall into your hands. They’re the results of many months if not years of failed expeditions and many experiments that did not work. Failure seems extremely demoralizing at first but experienced researchers will attest that failure isn’t that bad- you learn a lot from failing (you learn what not to do, or what doesn’t work) and after enough failure, you achieve enough understanding to have a good enough idea for something that will work. Even as a senior Ph.D. candidate, I will still sometimes “fail” for months on end, running experiments and exploring ideas that lead to nothing interesting. It’s still tough for me to go through this and while I’m in that zone of trying things and constantly failing it feels depressing. But at some point things start working and it makes all that failure worthwhile.
- Don’t ever lie, make up results or sweep negative findings under the rug. These things might help you in the short term but in the long term they never will, and good research is all about the long term. For example, if you run your new model with 4 different random seeds, and in one of those runs the improvement is 10%, and in the other three runs the improvements are -2%, 1% and -9%, that means that the 10% run was a fluke. If you wanted to, you could submit a paper where you don’t mention the other runs, but that would be deceitful. Sure, that paper might get accepted, but eventually someone will try your idea, and they’ll probably run it with a few different random seeds, and notice that your improvement is not statistically significant. Not only will they then not use your method, they’ll also be wary of your future papers. Junior researchers are anxious to get an initial publishable result and might put aside annoying things like statistical significance, but in the long run that will hurt them. There’s no rush- good research takes time, and it’s better to take two years to write a solid paper than it is to write four low-quality papers that each took half a year to write.
Bamboogle is a dataset that we constructed, made up only of questions that Google answers incorrectly.
As language models grow in size they know more, but do they get better at reasoning? To test GPT-3, we generated lots of questions such as “What is the calling code of the birthplace of Adele?”. We show that as GPT-3 size grows, it does not improve its reasoning abilities on these types of questions.
Self-ask is a new prompting method which improves the ability of language models to answer complex questions.
The transformer layer is currently the primary component in natural language processing, playing a leading role in recent innovations such as BERT and GPT-3. Each transformer layer consists of a self-attention sublayer (s) followed by a feedforward sublayer (f), creating an interleaving pattern of self-attention and feedforward sublayers throughout a multilayer transformer model.
Language models assign probability values to sequences of words. Those three words that appear right above your keyboard on your phone that try to predict the next word you’ll type are one of the uses of language modeling. In the case shown below, the language model is predicting that “from”, “on” and “it” have a high probability of being the next word in the given sentence. Internally, for each word in its vocabulary, the language model computes the probability that it will be the next word, but the user only gets to see the top three most probable words.
Due to the recent achievements of artificial neural networks across many different tasks (such as face recognition, object detection and Go), deep learning has become extremely popular. This post aims to be a starting point for those interested in learning more about it.