Do LLMs Reign Supreme in Few-Shot NER? Part III

In the previous posts in this series, we described traditional methods for few-shot named entity recognition (NER) and discussed how large language models (LLMs) are being used to solve the NER task. In this post, we close the gap between these two areas and apply an LLM-based method to few-shot NER.
As a reminder, NER is the task of finding and categorizing named entities in text, for example, names of people, organizations, locations, etc. In a few-shot scenario, there are only a handful of labeled examples available for training or adapting an NER system, in contrast to the vast amounts of data typically needed to train a deep learning model.

Example of a labeled NER sentence
Using LLMs for few-shot NER
While Transformer-based models, such as BERT, have long served as backbones for models fine-tuned for NER, there has recently been growing interest in how effective it is to prompt pre-trained decoder-only LLMs with few-shot examples for a variety of tasks.
GPT-NER is a method of prompting LLMs to perform NER, proposed by Shuhe Wang et al. The language model is prompted to detect one class of named entities at a time. The prompt shows a few input and output examples, where in each output the entities are marked with special symbols: @@ marks the start and ## the end of a named entity.

A GPT-NER prompt. All event entities in the example outputs in the prompt are marked with “@@” (beginning of the named entity) and “##” (end of the named entity)
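The prompt format described above can be sketched roughly as follows. This is a minimal illustration, not the GPT-NER template itself: the instruction wording, the "Input:"/"Output:" labels, and the helper names mark_entities and build_prompt are assumptions made for this example. Entity spans are given as (start, end) token indices with an exclusive end.

```python
from typing import List, Tuple

# One few-shot example: (tokens, list of entity spans with exclusive end).
Example = Tuple[List[str], List[Tuple[int, int]]]


def mark_entities(tokens: List[str], spans: List[Tuple[int, int]]) -> str:
    """Wrap each entity span in @@ ... ## and join the tokens with spaces."""
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    marked = []
    for i, token in enumerate(tokens):
        piece = token
        if i in starts:
            piece = "@@" + piece
        if i + 1 in ends:
            piece = piece + "##"
        marked.append(piece)
    return " ".join(marked)


def build_prompt(entity_type: str, examples: List[Example],
                 query_tokens: List[str]) -> str:
    """Assemble a few-shot prompt for a single entity type."""
    # NOTE: this instruction wording is illustrative, not the original template.
    lines = [
        f"Mark every {entity_type} entity in the input sentence "
        "with @@ at its start and ## at its end."
    ]
    for tokens, spans in examples:
        lines.append("Input: " + " ".join(tokens))
        lines.append("Output: " + mark_entities(tokens, spans))
    lines.append("Input: " + " ".join(query_tokens))
    lines.append("Output:")
    return "\n".join(lines)


if __name__ == "__main__":
    support = [(["The", "Olympic", "Games", "began"], [(1, 3)])]
    print(build_prompt("event", support, ["She", "ran", "the", "marathon"]))
```

In a few-shot episode, the examples list would simply hold the labeled sentences available for the entity type in question.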
While Wang et al. evaluate their method in the low-resource setting, they imitate this scenario by selecting a random subset of a larger, general-purpose dataset (CoNLL-2003). They also put considerable emphasis on choosing the best possible few-shot examples to include in the prompt; however, in a truly few-shot scenario there is no wealth of examples to choose from.
To close this gap, we apply the prompting method in a true few-shot scenario, using a purposefully constructed dataset for few-shot NER, specifically, the Few-NERD dataset.
What is Few-NERD?
The task of few-shot NER has gained popularity in recent years, but there is not much benchmark data focused on this specific task. Often, data scarcity for the few-shot case is simulated by using a larger dataset and selecting a random subset of it to use for training. Few-NERD is one dataset that was designed specifically for the few-shot NER task. 
The few-shot dataset is organized in episodes. Each episode consists of a support set containing several few-shot examples (labeled sentences), and a query set for which labels need to be predicted using the information of the support set. The dataset has training, development, and test splits; however, as we are using a pre-trained LLM without any fine-tuning, we only use the test split in our experiments. The support sets serve as the few-shot examples provided in the prompt, and we predict the labels for the query sets.

Coarse- and fine-grained entity types in the Few-NERD dataset (Ding et al., 2021)
The types, or classes, of named entities in Few-NERD have two levels: coarse-grained (person, location, etc.) and fine-grained (e.g. actor is a subclass of person, island is a subclass of location, etc.). In our experiments described here, we only deal with the easier coarse-grained classification.
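Reducing a fine-grained label to its coarse-grained class is a small string operation. The sketch below assumes labels are written as "coarse-fine" (for example "person-actor") and that non-entity tokens carry the label "O"; both the label format for non-entities and the function name to_coarse are assumptions for illustration.

```python
def to_coarse(label: str) -> str:
    """Map a fine-grained label such as 'person-actor' to 'person'.

    Assumes 'O' (or any label without a hyphen) is returned unchanged.
    """
    return label.split("-", 1)[0]


if __name__ == "__main__":
    labels = ["O", "person-actor", "person-actor", "O", "location-island"]
    print([to_coarse(label) for label in labels])
```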
The full dataset supports several tasks. The supervised task is not few-shot and is not organized in episodes: the data is split into train (70% of all data), development (10%), and test (20%) sets. The few-shot task organizes data in episodes and comes in two variants, intra and inter. In the intra variant, each coarse-grained entity type is labeled in only one of the train, development, and test splits, and is completely unseen in the other two. We use the inter variant, where the same coarse-grained entity type may appear in all data splits, but any fine-grained type is labeled in only one of them. Furthermore, the dataset includes variants where either 5 or 10 entity types are present in an episode, and where either 1-2 or 5-10 examples per class are included in the support set of an episode.
How good are LLMs at few-shot NER?
In our experiments, we aimed to evaluate the GPT-NER prompting setup, but a) do that in a truly few-shot scenario using the Few-NERD dataset, and b) use LLMs from the Llama 2 family, which are available on the Clarifai platform, instead of the closed models used by the GPT-NER authors. Our code can be found in this GitHub repository.
We aim to answer these questions:

How can the prompting style of GPT-NER be applied to the truly few-shot NER setting?
How do differently sized open LLMs compare to each other on this task?
How does the number of examples affect few-shot performance?

Results
We compare the results along two dimensions: first, we compare the performance of different Llama 2 model sizes on the same dataset; then, we also compare the behavior of the models when a different number of few-shot input-output examples are shown in the prompt.
1) Model size
We compared the three different-sized Llama-2-chat models available on the Clarifai platform. As an example, let us look at the scores of 7B, 13B, and 70B models on the inter 5-way 1-2-shot Few-NERD test set.  
The largest model, 70B, has the best F1 scores, but the 13B model scores worse on this metric than the smallest model, 7B.

F1 scores of Llama 2 7B (blue), 13B (cyan), and 70B (black) models on the “inter” 5-way, 1~2-shot test set of Few-NERD
However, if we look at the precision and recall metrics which contribute to F1, the situation becomes even more nuanced. The 13B model turns out to have the best precision scores out of all three model sizes, and the 70B model is, in fact, the worst on precision for all classes.

Precision scores of Llama 2 7B (blue), 13B (cyan), and 70B (black) models on the “inter” 5-way, 1~2-shot test set of Few-NERD
This is compensated by recall, which is much higher for the 70B model than for the smaller ones. Thus, it seems that the largest model detects more named entities than the others, while the 13B model is more conservative and tags an entity only when it is more certain. From these results, we can expect the 13B model to produce the fewest false positives relative to its predictions and the 70B model the fewest false negatives, while the smallest, 7B model falls somewhere in between on both types of errors.

Recall scores of Llama 2 7B (blue), 13B (cyan), and 70B (black) models on the “inter” 5-way, 1~2-shot test set of Few-NERD
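As a reminder of how the three metrics relate, here is a minimal sketch of per-class precision, recall, and F1 computed over sets of predicted and gold entity spans. Representing a span as a (sentence index, start, end) tuple and requiring an exact match are assumptions made for this example, and the function name prf is hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, int]  # (sentence index, start token, end token)


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 for one entity class."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 1, 2), (0, 5, 6), (1, 0, 1)}
    pred = {(0, 1, 2), (1, 3, 4)}
    # One of two predictions is correct; one of three gold spans is found.
    print(prf(gold, pred))
```

A model that predicts few spans but gets most of them right scores high on precision and low on recall, which is the pattern seen for the 13B model; a model that predicts many spans shows the opposite pattern.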

2) Number of examples in prompt
We also compare differently sized Llama 2 models on datasets with different numbers of named entity examples in few-shot prompts: 1-2 or 5-10 examples per (fine-grained) class. 
As expected, all models do better when there are more few-shot examples in the prompt. At the same time, we notice that the difference in scores is much smaller for the 70B model than for the smaller ones, which suggests that the larger model can do well with fewer examples. The trend is not entirely consistent with model size though: for the medium-sized 13B model, the difference between seeing 1-2 or 5-10 examples in the prompt is the most drastic. 

F1 scores of Llama 2 7B (left), 13B (center), and 70B (right) models on the “inter” 5-way 1~2-shot (blue) and 5~10-shot (cyan) test sets of Few-NERD

Challenges with using LLMs for few-shot NER
A few issues need to be considered when we prompt LLMs to do NER in the GPT-NER style.

The GPT-NER prompt template only uses one set of tags in the output, and the model is only asked to find one specific type of named entity at a time. This means that, if we need to identify a few different classes, we need to query the model several times, asking about a different named entity class every time. This may become resource-intensive and slow, especially as the number of different classes grows.
A single sentence often contains more than one entity type, which means the LLM needs to be prompted separately for each type

The next issue is also related to the fact that the LLM is queried for each entity type separately. A traditional token classification system would typically predict one set of class probabilities for each token. However, in our case, if we are using the LLM as a black box (only looking at its text output and not internal token probabilities), we only get yes/no answers, but several of them for each token (as many as there are possible classes). This means that, if the model’s prediction for the same token is positive for more than one class, there is no easy way to know which of those classes is more probable. This fact also makes it hard to calculate overall metrics for a test set, and we have to make do with per-class evaluation only.
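To make the ambiguity concrete, the following sketch collects the spans predicted in separate per-class queries for one sentence and reports the tokens claimed by more than one class. The input layout (a dictionary from class name to token spans with exclusive end) and the name find_conflicts are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_conflicts(
    per_class_spans: Dict[str, List[Tuple[int, int]]]
) -> Dict[int, List[str]]:
    """Return {token index: classes} for tokens tagged by several classes."""
    token_classes = defaultdict(set)
    for entity_class, spans in per_class_spans.items():
        for start, end in spans:
            for i in range(start, end):
                token_classes[i].add(entity_class)
    return {
        i: sorted(classes)
        for i, classes in token_classes.items()
        if len(classes) > 1
    }


if __name__ == "__main__":
    predictions = {
        "person": [(0, 2)],    # tokens 0 and 1
        "art": [(1, 2)],       # token 1 again
        "location": [(4, 5)],
    }
    print(find_conflicts(predictions))
```

With only text outputs available, nothing in this dictionary says which of the conflicting classes the model prefers, which is why a single label per token cannot be recovered directly.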
The model-generated output is also not always well-formed. Sometimes, the model will generate the opening tag for an entity (@@), but not the closing one (##), or some other invalid combination. As with many applications of LLMs to formalized tasks, this requires an extra step of verifying the validity of the model’s free-form output and parsing it into structured predictions.
Sometimes, the model output is not well-formed: in output 1, there is the opening tag “@@”, but the closing tag “##” never appears; in output 2, the model used the opening tag instead of the closing one
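A validity check of this kind could look like the sketch below. It scans the output for the two markers, rejects any sequence that is not a strict alternation of @@ followed by ##, and otherwise returns the marked entity strings. The function name parse_marked_output and the choice to return None for malformed output are assumptions of this example.

```python
import re
from typing import List, Optional


def parse_marked_output(text: str) -> Optional[List[str]]:
    """Extract entities marked as @@entity##; return None if malformed."""
    entities = []
    open_at = None  # character offset just after an unmatched "@@"
    for match in re.finditer(r"@@|##", text):
        if match.group() == "@@":
            if open_at is not None:
                return None  # a second opening tag before any closing tag
            open_at = match.end()
        else:
            if open_at is None:
                return None  # a closing tag without an opening tag
            entities.append(text[open_at:match.start()].strip())
            open_at = None
    if open_at is not None:
        return None  # the last opening tag was never closed
    return entities


if __name__ == "__main__":
    print(parse_marked_output("The @@Olympic Games## began"))
    print(parse_marked_output("The @@Olympic Games began"))
    print(parse_marked_output("The @@Olympic Games@@ began"))
```

What to do with rejected outputs (discard them, treat them as "no entity found", or attempt a repair) is a separate design decision that affects the reported scores.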

There are a few other issues related to the model’s way of generating output. For instance, it tends to over-generate: when asked to only tag one input sentence according to the given format, it does that, but then continues creating its own input-output examples, continuing the pattern of the prompt, and sometimes also tries to provide explanations. Due to this, we found it best to limit the maximum length of the model’s output to avoid unnecessary computation.
After producing the output sentence, the LLM keeps inventing new input-output pairs
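In addition to capping the output length, the extra text can be cut away in post-processing. The sketch below keeps only the first non-empty line of the generation and drops a leading "Output:" label if the model repeats it. It assumes the answer fits on one line and that the prompt uses an "Output:" label; the function name first_output_line is hypothetical.

```python
def first_output_line(generation: str, label: str = "Output:") -> str:
    """Keep the first non-empty line and strip a repeated output label."""
    for line in generation.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(label):
            line = line[len(label):].strip()
        return line
    return ""


if __name__ == "__main__":
    generation = (
        "The @@Olympic Games## began\n"
        "Input: A sentence the model invented\n"
        "Output: A sentence the model invented\n"
    )
    print(first_output_line(generation))
```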

Moreover, the LLM’s output sentence does not have to exactly replicate the input. For example, although the input sentences in GPT-NER are tokenized, the model outputs de-tokenized texts, probably because it has learned to produce exclusively (or almost exclusively) well-formed, de-tokenized text. While this adds another extra step of tokenizing the output text again to do evaluation later, that step is easy to do. A bigger problem may appear when the model does not actually use all the same tokens as were given in the input. We have seen, for example, that the model may translate foreign words into English, which makes it harder to match output tokens to input ones. These issues related to output could potentially be mitigated by more sophisticated prompt engineering.
Sometimes the LLM may generate tokens which are different from those in the input, for example, translating foreign words into English
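One way to handle both problems is to compare input and output with all whitespace removed. The sketch below does that: if the output, stripped of markers and whitespace, does not match the input tokens character for character, it is rejected; otherwise the marker positions are mapped back to token indices. This is an illustrative approach rather than the procedure used in the experiments, and the name align_spans is hypothetical. It assumes the input tokens contain no whitespace and no @@ or ## sequences.

```python
import re
from typing import List, Optional, Tuple


def align_spans(input_tokens: List[str],
                output: str) -> Optional[List[Tuple[int, int]]]:
    """Map @@...## markers in the output to (start, end) token spans.

    The end index is exclusive. Returns None if the output text differs
    from the input or if the markers are malformed.
    """
    # Token index for every character of the whitespace-free input.
    char_to_token = [i for i, tok in enumerate(input_tokens) for _ in tok]
    plain_input = "".join(input_tokens)

    chars = []       # output characters without whitespace and markers
    boundaries = []  # (marker, offset into chars) in order of appearance
    for piece in re.split(r"(@@|##)", output):
        if piece in ("@@", "##"):
            boundaries.append((piece, len(chars)))
        else:
            chars.extend(c for c in piece if not c.isspace())
    if "".join(chars) != plain_input:
        return None  # the model changed, dropped, or added text

    spans, start = [], None
    for marker, offset in boundaries:
        if marker == "@@" and start is None:
            start = offset
        elif marker == "##" and start is not None and offset > start:
            spans.append((char_to_token[start], char_to_token[offset - 1] + 1))
            start = None
        else:
            return None  # malformed marker sequence
    return spans if start is None else None


if __name__ == "__main__":
    tokens = ["He", "visited", "St.", "Petersburg", "."]
    print(align_spans(tokens, "He visited @@St. Petersburg##."))
    print(align_spans(tokens, "He visited @@Saint Petersburg##."))
```

Because the comparison ignores whitespace, de-tokenized output such as "Petersburg." still aligns with the tokenized input, while a translated or paraphrased word causes the whole output to be rejected.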

Because only some entity classes are labeled in each split of the Few-NERD episode data, and annotations for all other classes are removed, the episode data by its nature does not give the model complete information about the coarse-grained classes. Only the data for the supervised task contains full labels, and some extra processing is needed if we want to match the episodes against those. For instance, in the example below only the character is labeled in the episode data, while the actors are not. This can cause problems for both prompting and evaluation, and it may be one of the reasons for the larger model’s low precision scores: if the LLM has enough prior knowledge to label all the person entities, some of its correct predictions may be counted as false positives.
Not all entities are labeled in the episode data of Few-NERD, only the supervised task contains full labels
The authors of GPT-NER put considerable emphasis on selecting the most useful few-shot examples to include in the prompt given to the LLM. However, in a truly few-shot scenario we do not have the luxury of extra labeled examples to choose from. Thus, we slightly modified the setup and simply included all support examples of a given test episode in the prompt.
Finally, even though the data in Few-NERD is human-annotated, the labeling is not always perfect and unambiguous, and some mistakes are present. But more importantly, Few-NERD is a rather hard dataset in general: for a human, it is not always easy to say what the correct class of some named entities should be!

The labels are not always obviously correct: for example, here the character Spider-Man is labeled as a painting, and a racehorse is labeled as a person

Future work
An important note is that in Few-NERD, the classes have two levels of granularity: for example, “person-actor”, where “person” is the coarse-grained, and “actor” the fine-grained class. For now, we only consider the broader coarse-grained classes, which are easier for the models to detect than the more specific fine-grained classes would be.
In the GPT-NER pre-print, there is some emphasis placed on the self-verification technique. After finding a named entity, the model is then prompted to reconsider its decision: given the sentence and the entity that the model found in that sentence, it has to answer whether that entity does indeed belong to the class in question. While we have replicated the basic GPT-NER setup with Few-NERD and Llama 2, we have not yet explored the self-verification technique in detail.
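A self-verification step along the lines described above might be sketched as follows. The prompt wording here is invented for illustration and is not taken from the GPT-NER pre-print; the function names build_verification_prompt and parse_yes_no are likewise hypothetical.

```python
from typing import Optional


def build_verification_prompt(sentence: str, entity: str,
                              entity_type: str) -> str:
    """Ask the model to confirm a previously extracted entity."""
    # NOTE: illustrative wording, not the template from the pre-print.
    return (
        f"Sentence: {sentence}\n"
        f'Is "{entity}" in the sentence above a {entity_type} entity? '
        "Answer yes or no.\n"
        "Answer:"
    )


def parse_yes_no(answer: str) -> Optional[bool]:
    """Interpret the model's reply; None means it could not be parsed."""
    word = answer.strip().lower()
    if word.startswith("yes"):
        return True
    if word.startswith("no"):
        return False
    return None


if __name__ == "__main__":
    print(build_verification_prompt(
        "The Olympic Games began", "Olympic Games", "event"))
    print(parse_yes_no(" Yes, it is."))
```

Entities for which the reply parses as a negative answer would then be dropped from the predictions, which trades some recall for precision.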
We focus on recreating the main setup of GPT-NER and use the prompts as shown in the pre-print. However, we think that the results could be improved and some of the issues described above could be fixed with more sophisticated prompt engineering. This is also something we leave for future experiments.
Finally, there are other exciting LLMs to experiment with, including the recently released Llama 3 models available on the Clarifai platform.
Summary
We applied the prompting approach of GPT-NER to the task of few-shot NER using the Few-NERD dataset and the Llama 2 models hosted by Clarifai. While there are a few issues to consider, we found that, as expected, the models do better when more few-shot examples are shown in the prompt. Less expectedly, the trends related to model size are mixed. There is still a lot to explore: better prompt engineering, more advanced techniques such as self-verification, how the models perform when detecting fine-grained instead of coarse-grained classes, and much more.
Try out one of the LLMs on the Clarifai platform today. Can’t find what you need? Consult our docs page or send us a message in our Community Discord channel.