Mastering Text Classification with BERT: A Practical Guide

Ederson Corbari
5 min read · Nov 26, 2023


Here at NeuroQuest AI, we use the pre-trained BERT model for text classification tasks. For training models in languages like Brazilian Portuguese, we employ BERTimbau, developed by NeuralMind, because it better captures the nuances of the language.

But what exactly is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google. It’s a transformer-based model that can be fine-tuned for a wide range of natural language processing tasks, such as sentiment analysis, question answering, language translation, etc.

One of BERT’s key features is its bidirectional nature. Unlike traditional language models that process input text unidirectionally (from left to right or vice versa), BERT processes input text in both directions, considering the context of words before and after a given word. This allows BERT to capture more contextual information, improving performance across a variety of tasks.
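
One quick way to see this bidirectionality in action is masked-word prediction, one of the tasks BERT was pre-trained on. Below is a minimal sketch using the Hugging Face fill-mask pipeline; it assumes the English bert-base-uncased checkpoint, chosen here only for illustration:

from transformers import pipeline

# BERT fills in the blank by reading the context on both sides of the mask.
unmasker = pipeline('fill-mask', model='bert-base-uncased')

for candidate in unmasker('The patient was rushed to the [MASK] by ambulance.')[:3]:
    print(candidate['token_str'], round(candidate['score'], 3))

The words to the right of the mask ("by ambulance") influence the prediction just as much as the words to the left, which is exactly what a purely left-to-right model cannot do.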

BERT has achieved state-of-the-art performance in various natural language processing benchmarks and is widely used in both industry and academia. It has been pre-trained on a large corpus of textual data and can be fine-tuned for specific tasks using a small amount of labeled data.

Regarding the difference between ChatGPT, BART, and BERT:

To clarify why these models are in the spotlight: ChatGPT and BART are generative models focused on producing text. They can be used for classification, but that is not their primary goal. BERT, on the other hand, is known for its effectiveness in specific NLP tasks like classification. All three models are based on the transformer architecture.

How does BERT work?

BERT is based on the transformer architecture, which employs self-attention mechanisms to process input data. The transformer takes in a sequence of input tokens (like words or subwords in a sentence) and produces a sequence of output representations, one embedding per token.

As noted above, this encoding is bidirectional: each token is represented using the context of the words both before and after it, which is what lets BERT capture richer contextual information than a purely left-to-right language model.

BERT uses a multi-layer transformer encoder to process input data. The input tokens are first embedded using a token embedding layer and then passed through the transformer encoder. The transformer encoder consists of multiple self-attention layers, enabling the model to attend to different parts of the input sequence and capture long-range dependencies. The output of the transformer encoder is a sequence of contextualized token embeddings, capturing the meaning of the input tokens in the context of the entire input sequence.
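
To make this concrete, here is a minimal sketch that runs a sentence through the plain BERT encoder (without any task head) and inspects the contextualized embeddings it produces. It assumes the English bert-base-uncased checkpoint, but any BERT checkpoint works the same way:

import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained encoder only (no classification head).
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()

inputs = tokenizer('BERT produces contextual embeddings.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional embedding per input token, each conditioned on the whole sentence.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)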

BERT has been pre-trained on a large corpus of textual data and can be fine-tuned for specific tasks using a small amount of labeled data. This allows the model to be used for a wide range of natural language processing tasks with minimal task-specific training data.

Text Classification using BERT

BERT can be used for text classification tasks by fine-tuning the pre-trained model on a labeled dataset. Here is a general outline of the process:

  1. Preprocess the text data: this may include light cleaning and tokenization (heavy preprocessing such as stop-word removal is usually unnecessary for BERT, which benefits from seeing the full context);
  2. Convert the text data into numerical input features: BERT operates on numerical input data, so the text must be converted with the BERT tokenizer into token IDs and an attention mask (see the sketch after this outline);
  3. Load the pre-trained BERT model and add a classification layer: the BERT model is loaded from a checkpoint, and a classification layer is added on top of it, which is responsible for making the final prediction;
  4. Fine-tune the model on the labeled dataset: the weights of the classification layer and the pre-trained layers are adjusted using gradient descent; a relatively small labeled dataset is usually enough;
  5. Evaluate the model on a test set: after fine-tuning, the model is evaluated on a held-out test set, using metrics such as accuracy and the F1 score.

BERT can be fine-tuned for text classification tasks using a small labeled dataset and has achieved state-of-the-art performance on a number of benchmarks.
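
Before moving on to the full example, here is a minimal sketch of step 2 above, showing how the BERT tokenizer turns raw text into the numerical features the model expects. The checkpoint name matches the BERTimbau model used in the next section:

from transformers import BertTokenizer

# The tokenizer handles step 2: text -> token IDs plus an attention mask.
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

encoded = tokenizer(
    'O cliente elogiou o nosso produto novamente.',
    padding='max_length',   # pad to a fixed length so examples can be batched
    truncation=True,
    max_length=32,
    return_tensors='pt',
)
print(encoded['input_ids'].shape)       # (1, 32)
print(encoded['attention_mask'].shape)  # (1, 32)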

Implementation of Text Classification using BERT

Here is the classic NLP Hello World, a sentiment analysis using two sentences in Brazilian Portuguese:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained BERTimbau encoder with a 2-class classification head.
# Note: the classification head is newly initialized, so its predictions only
# become reliable after fine-tuning on a labeled sentiment dataset.
model_name = 'neuralmind/bert-base-portuguese-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()

text = 'O cliente elogiou o nosso produto novamente.'
# text = 'O cliente está reclamando do nosso produto novamente.'

# Convert the text into token IDs and an attention mask.
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    output = model(**inputs)

# Pick the class with the highest logit: 0 = BAD, 1 = GOOD.
prediction = output.logits.argmax(dim=-1).item()

class_labels = ['BAD', 'GOOD']
predicted_class = class_labels[prediction]

print(f'Predict: {predicted_class}')  # GOOD or BAD

This example loads the pre-trained BERTimbau model and tokenizer from the 'neuralmind/bert-base-portuguese-cased' checkpoint, converts the input text into input features (token IDs and an attention mask) using the tokenizer, and makes a prediction with the model. The final prediction is a class label, with 0 corresponding to the negative class and 1 corresponding to the positive class. Keep in mind that the classification head on top of the encoder starts out untrained, so it should be fine-tuned on labeled data (as discussed below) before its predictions are relied upon.

  • The first sentence (O cliente elogiou o nosso produto novamente.) yields the positive label GOOD;
  • The second sentence (O cliente está reclamando do nosso produto novamente.) yields the negative label BAD.
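
If you also want a confidence score rather than just the arg-max label, the logits from the example above can be turned into probabilities with a softmax. This is a small continuation of the snippet, not a separate program:

import torch.nn.functional as F

# 'output.logits' comes from the inference example above.
probs = F.softmax(output.logits, dim=-1)
print(probs)  # a pair of probabilities, one for BAD and one for GOOD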

You can fine-tune the BERT model for text classification by updating the model weights with gradient descent on a labeled dataset. You can also start from other pre-trained models and fine-tune them for text classification in the same way.
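
As a rough illustration of what that fine-tuning looks like in code, here is a minimal sketch that uses two hypothetical labeled sentences in place of a real dataset; in practice you would use a proper training set, batching, and a validation split:

import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'neuralmind/bert-base-portuguese-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny illustrative dataset: 0 = BAD, 1 = GOOD.
texts = [
    'O cliente elogiou o nosso produto novamente.',
    'O cliente está reclamando do nosso produto novamente.',
]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # the model computes the cross-entropy loss
    outputs.loss.backward()
    optimizer.step()
    print(f'epoch {epoch}: loss={outputs.loss.item():.4f}')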

Conclusion

In conclusion, BERT (Bidirectional Encoder Representations from Transformers) is a powerful tool for text classification tasks. It is a pre-trained language model that has been trained on a large dataset of text, allowing it to capture the context and meaning of words in a way that is useful for natural language processing tasks. BERT can be fine-tuned for a specific text classification task by adding a classification layer on top of the pre-trained model and training it on the task-specific dataset.

One of the key advantages of using BERT for text classification is that its pre-training already leverages large amounts of unannotated data, so relatively little labeled data is needed to reach strong performance. BERT also handles long-range dependencies in text, which is important for many natural language processing tasks.

Overall, BERT has proven to be a very effective tool for text classification and has achieved state-of-the-art results on a number of benchmarks. It is a versatile model that can be fine-tuned for a variety of text classification tasks and is an important tool for natural language processing researchers and practitioners.
