Almost no data and no time? Unlock the true potential of GPT3! A case study.

In this post, I will explore how the advent of large pre-trained language models (such as GPT3 [1]) is giving rise to the new paradigm of 'prompt engineering' in the field of NLP. This new paradigm allows us to rapidly prototype complex NLP applications with little to no effort, based on very small amounts of data. I will present a case study where I used this technique during my summer internship at Waylay to create an application that makes industry-level automation accessible to everyone via voice and text input (think of something like Google Assistant, but for IoT and on steroids!). Finally, I will conclude with some remarks on this new and exciting trend.

If you don't feel like reading, you can watch this recording of the internal meeting where I presented my solution to the company.

There is a small bonus at the end that shows the true capabilities of this solution while shocking everyone present (it starts at 24:40).

Prompt-engineering: the new deep learning paradigm?

Back in the old days of machine learning, engineers had to spend countless hours creating informative and discriminative features to improve the quality of their models. With the advent of deep learning, models became able to extract their own quality features given heaps of data. Eventually, data scientists created techniques to apply a large pre-trained model to many downstream tasks via transfer learning, which allowed these models to attain strong performance on many tasks with only a seemingly small amount of data.

The field of Natural Language Processing (NLP) has undergone these evolutions as well. Interestingly, a new paradigm seems to be emerging in this field which could have a major impact on how we will use deep learning in the future.


Recently, large pre-trained generative models have been all the rage. GPT3, one of these models, was trained by predicting the continuation of a given sentence as accurately as possible (this task is called causal language modeling). Because this task requires no additional annotations (we call it self-supervised), researchers were able to gather large amounts of data (hundreds of gigabytes of text!) and train a ridiculously large deep learning model (175 billion parameters!). As a result, GPT3 is currently the best-in-class model for generating human-like language. Given an input sentence, GPT3 continues it in a natural way.

It turns out that being good at generating coherent natural language has many benefits. When continuing the sentence "While the cast was okay, the script of the movie was horrible. This movie is ...", GPT3 is much more likely to output the word "bad" than the word "good". So in essence, by learning to generate text, the model also learned how to do sentiment classification without ever needing access to sentiment annotations. A similar situation arises to some extent with many different NLP applications.

We can even guide GPT3 during the generation process. When we want to know what the capital of Belgium is, we can ask GPT3 to continue the following sentence: "The capital of France is Paris. The capital of Japan is Tokyo. The capital of Belgium is ...". If we just asked it to continue "The capital of Belgium is ...", we would likely get a continuation such as "a nice city" instead of "Brussels".
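To make this concrete, here is a minimal sketch of what such a few-shot completion call could look like against the OpenAI completion API; the engine name and sampling parameters are illustrative, not settings prescribed by the case study:

```python
import openai

openai.api_key = "sk-..."  # your OpenAI API key

# Few-shot prompt: two worked examples steer the model towards
# answering with a capital city instead of a free-form continuation.
prompt = (
    "The capital of France is Paris.\n"
    "The capital of Japan is Tokyo.\n"
    "The capital of Belgium is"
)

response = openai.Completion.create(
    engine="davinci",  # the original GPT3 engine
    prompt=prompt,
    max_tokens=5,
    temperature=0,     # deterministic output for factual completions
    stop="\n",         # stop at the end of the line
)

print(response.choices[0].text.strip())  # -> "Brussels"
```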


How we ask GPT3 these questions, what format we use and which examples we include is referred to as prompt engineering. Prompt engineering is a new paradigm in the world of NLP that can radically change how we interact with deep learning models. Instead of gathering large amounts of data and fine-tuning an existing model, we are now able to leverage only a few data points to get remarkable results. As an extra bonus, we don't even need to retrain our model when new data becomes available or when we want to change the task definition. We don't even need to host the model ourselves!

As with every new paradigm, researchers and practitioners are quickly exploring many different ways of leveraging prompt engineering. This can go from dynamically selecting which examples to feature in a prompt (which we will do in the case study) to optimizing the shape of the prompt via learning techniques.

The paper 'Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing' [2] provides a nice overview of the current research into prompting methods.


Using GPT3 for real-life value: a case study

IoT automation

Waylay is a low-code platform that allows developers to apply enterprise-level automation anywhere. Hook up sensors, push data and start enjoying the benefits of low-code automation.

Automation rules are the core of the Waylay platform. Developers write small code snippets (or use pre-existing ones) and chain these together with logical operators to define automation rules. Think of a rule that turns on the water sprinklers if it has been sunny without rain for 3 days, or one that schedules an inspection for an industrial machine if an anomaly is detected on one of its many sensors. By chaining these rules together, we can create arbitrarily complex automation software.

Making this automation technology accessible to everyone is one of Waylay's core values. Imagine if we could simply interact with this automation engine through voice or text, in a natural fashion. This is where NLP comes in. Instead of having to interact with a computer in the typical manner, we can foresee a factory worker asking their machine "What is the temperature of oven 5?" or telling it "Raise a critical warning if the temperature of the freezer rises above -10 degrees and the door is open".

Getting this right certainly isn't easy. Human-spoken rules can carry a lot of ambiguity and require a lot of intelligence to correctly parse and translate into Waylay automation rules.


The solution

If we wanted to build a deep learning solution based on 'traditional' methods, we would have a few problems to deal with. Primarily, we are dealing with a lack of data. In order to robustly parse human utterances and capture the necessary information to translate them into something the Waylay system can understand, we would need a large amount of data spanning different ways of speaking and different types of Waylay rules. This data is currently not available. And even if we had this data, our model would need to be retrained every time we want it to handle a new manner of speaking or a new type of Waylay rule.

We turn to prompt engineering to solve our problems. If we can use GPT3 to do the hard work for us, we can build a highly data-efficient system which does not need to be retrained to deal with new cases. How nice would that be?

The question now becomes "How can we leverage the capabilities of GPT3 to do the dirty work for us?". Unfortunately, it is very hard to teach GPT3 to output the correct internal data structure Waylay requires based on a natural language input. Luckily, we can work our way around this with a clever hack (for which we have to credit the smart folks over at Microsoft [3]). In our solution, we will let GPT3 output a canonical sentence. This sentence holds the same information as our natural language input, but in a more structured fashion. For example, the utterances 'send David a message telling him to drive safe when it is raining in Paris' and 'only when the weather in Paris is raining tell David "Drive safe!" via sms' can both be reduced to the canonical sentence 'if weather in Paris is raining, then send SMS to David with message "Drive safe!"'.




Turning these utterances into canonicals is more like a translation or summarisation task, which GPT3 can handle much more easily. Based on only a few examples (fewer than 10), GPT3 is already very good at translating natural-sounding language into this structured canonical. Once we have this canonical, we need some additional parsing to turn it into the internal data structure representing a Waylay rule. While this last step certainly requires some additional engineering effort, we have successfully delegated the 'intelligent part' of our problem to GPT3.
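As an illustration, such a translation prompt can be assembled by simply concatenating the example pairs and the new utterance. The sketch below assumes an 'Utterance:/Canonical:' layout and invents a second example pair for demonstration; the actual prompt format used in the project was tuned by hand:

```python
# (Utterance, canonical) example pairs shown to GPT3. The first pair is
# taken from the text above; the second is a made-up illustration.
EXAMPLES = [
    ("send David a message telling him to drive safe when it is raining in Paris",
     'if weather in Paris is raining, then send SMS to David with message "Drive safe!"'),
    ("water the garden once the soil has been dry for 2 days",
     "if soil has been dry for 2 days, then turn on sprinklers"),
    # ... fewer than 10 pairs already work well in practice
]

def build_prompt(utterance, examples=EXAMPLES):
    """Format the example pairs plus the new utterance as a completion prompt."""
    lines = [f"Utterance: {u}\nCanonical: {c}\n" for u, c in examples]
    lines.append(f"Utterance: {utterance}\nCanonical:")
    return "\n".join(lines)
```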

To get the maximal performance, we have an additional trick up our sleeve. Initially, we only have a small number of examples available. However, once more examples become available, we need to select which of them to include in the prompt, based on the utterance we want to transform. One simple way to do this is to take the examples that are most similar to the given utterance. This way, we can ensure GPT3 always has access to the most relevant examples when processing an utterance. To validate this idea we took the simplest approach possible: embed all the sentences using mean pooling of static word embeddings and use cosine similarity as the distance metric. While more advanced similarity metrics are certainly possible, this quick and easy solution already provides the bulk of the value: given a large dataset, we can now quickly extract the samples GPT3 will be able to use when processing a given utterance.
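Here is a minimal sketch of this selection step using spaCy's static word vectors (a spaCy Doc vector is exactly the mean pooling of its token vectors); the model choice, function names and the value of k are my assumptions, not the project's exact setup:

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # ships with static word vectors

def embed(sentence):
    """Mean-pooled static word embeddings for a sentence."""
    return nlp(sentence).vector

def cosine(a, b):
    """Cosine similarity between two vectors (epsilon avoids divide-by-zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_examples(utterance, examples, k=8):
    """Return the k stored (utterance, canonical) pairs whose utterance
    is most similar to the incoming one."""
    query = embed(utterance)
    return sorted(examples,
                  key=lambda ex: cosine(embed(ex[0]), query),
                  reverse=True)[:k]
```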

Implementing this solution was quite simple. A couple of Flask microservices, a small database and a Vue.js frontend did the trick just fine. No expertise in deploying deep learning models required!
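For a flavour of how little glue code this takes, here is a hypothetical sketch of the translation endpoint, reusing the build_prompt and select_examples helpers sketched above; the route, payload shape and completion parameters are all assumptions:

```python
from flask import Flask, jsonify, request
import openai

app = Flask(__name__)

@app.route("/canonical", methods=["POST"])
def canonical():
    utterance = request.json["utterance"]
    # Pick the most relevant stored examples, build the few-shot prompt,
    # and let GPT3 produce the canonical sentence.
    examples = select_examples(utterance, EXAMPLES)
    prompt = build_prompt(utterance, examples)
    response = openai.Completion.create(
        engine="davinci", prompt=prompt,
        max_tokens=64, temperature=0, stop="\n",
    )
    return jsonify({"canonical": response.choices[0].text.strip()})

if __name__ == "__main__":
    app.run(port=5000)
```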


Results

And now the fun part, exploring the results.

Based on fewer than 10 example utterances and their corresponding canonicals in the prompt, the solution already shows remarkable results. The model is able to handle kinds of utterances that are not represented in the 10 given examples. This is very important, because we can expect the solution to be invoked in many unforeseen ways when in production. The solution is robust against typos and grammatical errors. Additionally, the solution excels at leveraging the strong 'common sense' reasoning of GPT3. It knows to translate 'the capital of Belgium' to 'Brussels' and it maps 'a fact about unicorns' to 'Unicorns are mythical creatures!'. It can even tell jokes on demand (though they aren't always funny)!

The generalization towards unseen scenarios based on fewer than 10 examples is remarkable. If we input a rule asking to only do something on Monday, GPT3 dreams up a 'time of day' rule, even though nothing like it is represented in the given examples. We basically only need to link 'time of day' to the proper Waylay functionality in our parser and we are good to go.

During an internal demo, a Waylay employee asked me to input something in Dutch. What followed shocked everyone present at the demo... Having seen zero Dutch examples, GPT3 managed to correctly interpret the Dutch utterance and output the corresponding English canonical. If you have the time, I suggest watching the video recording of this demo; the look on all our faces is priceless.


Conclusion

By rephrasing our semantic parsing task to a translation task, we were able to leverage large pre-trained language models (GPT3) to do all the hard work for us. Our solution works with extremely few data points, can easily be adapted towards new situations without retraining and we don't even need to host the deep learning model ourselves. Because of the strong capabilities of GPT3, our solution shows remarkable generalization towards unseen scenarios (and even unseen languages!).

In the following years, this prompt engineering approach will allow data scientists to prototype many different natural language applications with previously unseen speed and convenience. These prototypes can be rolled out as experimental features, improving in performance as more and more data becomes available. Once a critical amount of data has been gathered, more conventional deep learning techniques can be explored.

With companies currently focusing on building even larger pre-trained language models and researchers open-sourcing their implementations of these models, we can only expect prompt engineering to grow in significance in the coming years.


The paper 'Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing' [2] summarizes this new paradigm very well. Not all prompting techniques described in the paper are very pragmatic, though. Some require you to host the language model yourself or to optimize the format of the prompt via a learning technique. While this can provide better results, you lose a lot of the convenience that makes this technique so attractive.

It will be interesting to see which direction the field of prompt engineering will take in the near future. One thing is for certain: with ever bigger language models being released (behind APIs but also open source), we are bound to see a plethora of interesting techniques that can be leveraged to build highly data-efficient NLP applications in no time.


References

[1] Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
[2] Liu, Pengfei, et al. "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." arXiv preprint arXiv:2107.13586 (2021).
[3] Shin, Richard, et al. "Constrained Language Models Yield Few-Shot Semantic Parsers." arXiv preprint arXiv:2104.08768 (2021).