Unveiling the Secrets of Natural Language Processing + Big Five

Nov 20, 2023

In this article, we look at how natural language processing (NLP) can be used to estimate Big Five personality scores from text.

At NeuroQuest AI, we introduce our latest product, Persona Predict, a machine learning model designed to estimate an author’s Big Five traits from their writing.

A Real-life Example

Let’s examine how the model interprets an actual passage:

My name is Alex, I am single, 30 years old, and I enjoy attending parties on the weekends. I am always eager and open to new experiences. I am quite forgetful and disorganized; people often refer to me as a ‘flash in the pan’.

How the model interprets this:

“I enjoy attending parties” = E+, suggesting higher Extraversion.
“I am always eager and open to new experiences” = O+, suggesting higher Openness.
“I am quite forgetful and disorganized; people often refer to me as a ‘flash in the pan’” = C-, suggesting lower Conscientiousness.

The whole text is analyzed in this manner, and a score is generated for each trait.
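The reading above can be imitated with a toy lookup. This is only a sketch for illustration: the cue table and the `tag_cues` function are hypothetical, and a trained model such as Persona Predict would learn associations from data rather than use a hand-written list.

```python
# Hypothetical cue table: phrase -> (trait letter, direction).
# A real model learns such associations from data; this lookup only illustrates the idea.
CUES = {
    "enjoy attending parties": ("E", +1),
    "open to new experiences": ("O", +1),
    "forgetful and disorganized": ("C", -1),
}

def tag_cues(text):
    """Return a dict mapping each trait to the summed direction of the cues found."""
    lowered = text.lower()
    totals = {}
    for phrase, (trait, direction) in CUES.items():
        if phrase in lowered:
            totals[trait] = totals.get(trait, 0) + direction
    return totals

passage = (
    "My name is Alex, I am single, 30 years old, and I enjoy attending parties "
    "on the weekends. I am always eager and open to new experiences. "
    "I am quite forgetful and disorganized."
)
print(tag_cues(passage))  # {'E': 1, 'O': 1, 'C': -1}
```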

Model Construction

Below are the main steps to build the prediction model.

Data

The first step is data collection, the foundation of our model. The texts are labeled by psychologists well-versed in the Big Five theory. We then apply data augmentation, a technique that expands the volume of training data and can improve the model’s performance.
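One common form of text augmentation is synonym replacement. The sketch below assumes that technique purely for illustration; the article does not say which augmentation method is used, and the `SYNONYMS` table and `augment` function are hypothetical.

```python
import random

# Hypothetical synonym table; a real pipeline would draw on a much larger resource.
SYNONYMS = {
    "enjoy": ["like", "love"],
    "parties": ["gatherings", "get-togethers"],
    "forgetful": ["absent-minded"],
}

def augment(text, rng):
    """Return a copy of text with each known word swapped for a random synonym."""
    words = text.split()
    swapped = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(swapped)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
labeled = [("I enjoy parties", {"E": 1})]
# Each augmented text keeps the label of the text it was derived from.
augmented = labeled + [(augment(text, rng), label) for text, label in labeled]
print(augmented)
```

The label is copied unchanged, which is only safe when the swap does not alter what the sentence says about the author.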

Model Training

The labeled data are then used to train the model, so that it learns patterns relating text content to personality scores.
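As a minimal sketch of what "learning a pattern" means, the example below fits a straight line to made-up data with ordinary least squares. The single feature (a count of social words), the 1-5 score scale, and the numbers are all assumptions for illustration; the actual model is not described at this level of detail.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Made-up training data: number of social words per text, paired with an
# expert-assigned Extraversion score on an assumed 1-5 scale.
social_word_counts = [0, 1, 2, 3, 4]
extraversion_labels = [1.0, 2.0, 3.0, 4.0, 5.0]

slope, intercept = fit_line(social_word_counts, extraversion_labels)

def predict(count):
    """Estimate the Extraversion score for a text with the given social-word count."""
    return slope * count + intercept

print(predict(2))  # 3.0
```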

Tokenization

The tokenization stage divides the text into smaller units called “tokens”, which are easier to process and analyze than raw text.
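A very simple word-level tokenizer can be sketched with a regular expression. This is an assumption for illustration: production models often use subword tokenizers instead, and the `tokenize` function here is hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("I am always eager and open to new experiences."))
# ['i', 'am', 'always', 'eager', 'and', 'open', 'to', 'new', 'experiences']
```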

Feature Extraction

When analyzing a new text, the model extracts relevant features, such as word frequency, word choice, and sentence length.
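The features mentioned above could be computed roughly as follows. The function name, the sentence-splitting rule, and the use of a distinct-word ratio as a stand-in for word choice are assumptions made for this sketch.

```python
import re
from collections import Counter

def extract_features(text):
    """Compute a few simple surface features of a text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        # How often each word occurs.
        "word_counts": Counter(words),
        # Average number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Share of distinct words, a rough proxy for vocabulary variety.
        "distinct_word_ratio": len(set(words)) / len(words) if words else 0.0,
    }

features = extract_features("I enjoy parties. I am forgetful.")
print(features["avg_sentence_length"])  # 3.0
print(features["word_counts"]["i"])     # 2
```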

Association with Big Five Scores

Based on the extracted features, the model maps the text to an estimated score for each of the five personality traits.
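One simple way to turn features into trait scores is a weighted sum passed through a logistic function, giving a value between 0 and 1 per trait. The sketch below assumes that approach; the weights are invented for illustration, whereas a trained model would learn them, and the article does not state which kind of model is used.

```python
import math

TRAITS = ["O", "C", "E", "A", "N"]

# Hypothetical per-trait word weights; a trained model would learn these values.
WEIGHTS = {
    "O": {"new": 0.8, "experiences": 0.6},
    "C": {"forgetful": -0.9, "disorganized": -0.9},
    "E": {"parties": 0.9},
    "A": {},
    "N": {},
}

def score(word_counts):
    """Map word counts to a 0-1 score per trait using a logistic function."""
    scores = {}
    for trait in TRAITS:
        z = sum(w * word_counts.get(word, 0) for word, w in WEIGHTS[trait].items())
        scores[trait] = 1 / (1 + math.exp(-z))
    return scores

counts = {"parties": 1, "new": 1, "experiences": 1, "forgetful": 1, "disorganized": 1}
result = score(counts)
print(result)
```

With these weights, Extraversion and Openness come out above 0.5, Conscientiousness below 0.5, and the traits with no evidence stay at exactly 0.5.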

Fine-tuning

To improve accuracy, the model undergoes a fine-tuning process, in which its parameters are adjusted based on its performance on validation data.
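One way to use validation data is to try several settings and keep the one with the lowest validation error. The sketch below assumes this kind of tuning (selecting a regularization strength for a one-feature ridge regression) on made-up numbers; it is one possible reading of the step described above, not the product's actual procedure.

```python
def fit_ridge_slope(xs, ys, lam):
    """One-feature ridge regression through the origin: sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(slope, xs, ys):
    """Mean squared error of the line y = slope * x on the given data."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: the training labels are slightly noisy, the validation labels are clean.
train_x, train_y = [1, 2, 3], [1.3, 2.2, 3.2]
val_x, val_y = [4, 5], [4.0, 5.0]

# Candidate regularization strengths (hypothetical values).
candidates = [0.0, 1.0, 10.0]

def validation_error(lam):
    """Fit on the training set with this strength, then measure error on validation."""
    return mse(fit_ridge_slope(train_x, train_y, lam), val_x, val_y)

best_lam = min(candidates, key=validation_error)
print(best_lam)  # 1.0
```

Here a small amount of regularization wins because it offsets the noise in the training labels, while too much shrinks the slope and hurts the fit.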

Final Result

After training and fine-tuning, the model is ready to analyze new texts, providing an estimated score for each Big Five trait.

Keep in mind that such models are statistical tools operating on probabilities. Predictions are based on patterns learned from training data, and results may vary. Context, culture, and other factors can also influence how the results should be interpreted.