Are large language models (LLMs) good at doing stats?

That's the question that led to this blog post.

The answer is not straightforward: their primary objective is generating human-like text by leveraging patterns learnt through training on diverse datasets, so they excel in tasks such as text creation, language translation, sentiment analysis, and text completion — but it does not automatically mean they are adept at statistical analysis.

So I took a small dataset with observations on brain size, weight, IQ and gender to run an experiment and find out. To do this, I used some of Defog’s internal functions, which were powered by `gpt-3.5-turbo`

at the time.

I learnt four things.

## One, LLMs are good at understanding data variables.

The dataset had seven columns: Gender, FSIQ, VIQ, PIQ, Weight, Height and MRI Count — I didn't provide column descriptions.

Despite encountering unfamiliar abbreviated columns — like FSIQ, VIQ, and PIQ — LLM was able to decipher their meanings accurately. It identified FSIQ as the "Full-Scale IQ score of the individual," VIQ as the "Verbal IQ score of the individual," and PIQ as the "Performance IQ score of the individual."

So cool.

I asked a function powered by ChatGPT how it was able to do this. Hear from the bot itself:

Full marks on language interpretation.

## Second, they get descriptive statistics right!

Broadly, there are two types of statistical analysis: descriptive statistics and inferential statistics.

Descriptive statistics involves summarizing and interpreting data using mean, median, and standard deviation measures. LLMs can effortlessly handle these tasks as they involve straightforward calculations with clear methods and the right answers.

For instance, I asked the model, "What is the average verbal intelligence of men and women?" it accurately calculated the average. When I asked, "How many males and females participated in the study?" it accurately provided the count.

These seemingly simple questions are the building blocks of data analysis. Basic counts and averages help identify trends and patterns in large datasets. They enable businesses to track customer demographics, understand market share, measure product performance, and evaluate campaign effectiveness.

Adding LLMs to the number-crunching workflow can be powerful. It can free up time to focus on more complex and nuanced aspects of statistical analysis.

However, when left unchecked, the performance of LLMs may vary when it comes to inferential statistics, where the goal is to make inferences, establish relationships, or explore causal models. That's the next point.

## Third, they use different approaches for the same prompt (conditions apply).

Consider this question: "Are men more intelligent than women?"

This involves reasoning. Think about it. How would you answer this?

Look at what GPT did.

In the first run, the model calculated averages.

question = "are men more intelligent than women? explain your reasoning" # this first prints out the python function it has created to answer the question # and then prints out its answer ask_question(question, df, schema_df) # Generated code to answer the question: def answer_question(df): import pandas as pd # Subset the dataframe to only include the relevant columns relevant_cols = ['Gender', 'FSIQ', 'VIQ', 'PIQ'] subset_df = df[relevant_cols] # Get the mean IQ for each gender gender_means = subset_df.groupby('Gender').mean() # Compare the mean IQ of males and females if gender_means.loc['Male', 'VIQ'] > gender_means.loc['Female', 'VIQ']: print("Men are more intelligent than women.") else: print("Women are more intelligent than men or there is no significant difference.") # Generated answer: # Men are more intelligent than women.

It found that the average IQ of men is higher than that of women in the given dataset, concluding that 'men are more intelligent than women'.

Just out of curiosity, I re-ran the same prompt. This time, though, it did something else: it ran a statistical test — a 't-test'.

question = "are men more intelligent than women? explain your reasoning" # this first prints out the python function it has created to answer the question # and then prints out its answer ask_question(question, df, schema_df) # Generated code to answer the question: def answer_question(df): import pandas as pd import scipy.stats as stats df = df. drop(['Unnamed: 0'], axis=1) df ['Gender'] = df ['Gender']. replace({'Male': 0, 'Female': 1}) viq = df[df ['Gender'] == 1]['VIQ'] piq = df[df ['Gender'] == 1] ['PIQ'] female_intelligence = stats.ttest_ind(viq, piq, equal_var=False) viq = df[df['Gender'] == 0]['VIQ'] piq = df[df['Gender'] == 0]['PIQ'] male_intelligence = stats. test_ind(viq, piq, equal_var=False) if male_intelligence.pvalue < 0.05: return "Men are more intelligent" elif female_intelligence.pvalue < 0.05: return "Women are more intelligent" else: return "There is not enough evidence to prove which gender is more intelligent" # Generated answer: # There is not enough evidence to prove which gender is more intelligent

And now it says, "There is not enough evidence to prove which gender is more intelligent'. To be clear, this is the correct approach when answering such questions.

Same model, same question, two different responses.

Why?

Because of the value of the "temperature" parameter I had set for the model.

In simple terms, "temperature" controls how creative and unpredictable the LLM's responses are. It ranges from 0 to 1.0.

When we set the temperature to a higher value, like 1.0, the model adds a dash of randomness, leading to more diverse responses, which is what I had done to explore the different analytical techniques that LLMs employ for answering statistical questions. Setting the value low leads to more focused and deterministic responses.

So when I asked the model the same question multiple times, it gave slightly different answers each time. The model predicts what comes next in the text based on patterns it has learned while making random choices.

The discrepancy between the LLM's responses on gender IQ differences arises from the interplay of random sampling and the sophistication of statistical analysis. The contrasting answers reflect the difference between a simplistic approach based on averages and a more rigorous statistical assessment.

## Fourth, giving specific prompts is a step towards unbiasing the model.

To be clear, the 'average' method is not correct.

When comparing IQ scores between different groups — such as males and females — most people default to using the average as a quick and easy method, but it oversimplifies the complexity of intelligence comparisons and often leads to misleading conclusions.

Averages may seem like a reliable indicator of group performance, but they can be deceiving.

Firstly, outliers, those exceptional individuals with extremely high or low IQ scores, have an outsized impact on the average. Their presence can skew the results and give a false impression of the group's intelligence.

Secondly, averages overlook the variability within each group. Even if two groups have similar average IQ scores, the range of scores within each group may differ significantly. This variability highlights individual differences and should not be ignored in making meaningful comparisons.

Lastly, sample size plays a critical role. If one group has a smaller sample size than the other, the average becomes less representative, leading to unreliable conclusions.

Statistical tests are indispensable to overcoming the limitations of averages and gaining accurate insights, especially so because it helps us assess whether observed differences in the two groups are statistically significant or simply due to chance.

In the second run, the model used a 't-test' to conclude that evidence is insufficient to say which gender is more intelligent.

This is the right way: t-tests consider both the data's averages and variability. They provide a robust and objective evaluation, helping us discern whether the observed disparities in IQ scores are statistically meaningful or merely the result of random variation.

The lesson here is to give specific prompts: If you explicitly instruct the model to perform a statistical test, it will generate results based on statistical inference, which is the appropriate approach. However, if you do not provide that instruction, the model may default to delivering results based on averages, which is a common but not consistently accurate approach.

Providing more specific prompts and nudging the language model towards using a specific statistical method can help reduce the chances of bias in the generated responses. You can guide the model's output towards a particular direction by offering clearer instructions and explicitly mentioning the desired statistical approach.

Here is an example:

I asked: "Should I hire males or females for a scientist role based on this data?"

And…

question = "should I hire males or females for a scientist role based on this data?" # this first prints out the python function it has created to answer the question # and then prints out its answer ask_question(question, df, schema_df) # Generated code to answer the question: def answer_question(df): import pandas as pd import numpy as np # Select columns of interest df_info = df[["Gender", "FSIQ", "VIQ", "PIQ"]] # Group by Gender and compute mean IQ scores gender_mean_iq = df_info.groupby('Gender').mean()[["FSIQ", "VIQ", "PIQ"]] # Determine which gender has higher mean IQ scores if gender_means.loc['Female'].mean() > gender_means.loc['Male'].mean(): return "Hire females for the scientist role." else: return "Hire males for the scientist role." # Generated answer: # Hire males for the scientist role

…it used averages. And said, hire males. So biased!

Now I explicitly asked it to run a stat test. And it did.

question = "should i hire males or females for a scientist role based on this data? run a stat test to answer this" # this first prints out the python function it has created to answer the question # and then prints out its answer ask_question(question, df, schema_df) # Generated code to answer the question: def answer_question(df): import scipy.stats as stats genders = df['Gender'].unique() gender_counts = [df[df['Gender'] == gender]['Gender'].count() for gender in genders] iq_means = [df[df['Gender'] == gender]['FSIQ'].mean() for gender in genders] iq_stds = [df[df['Gender'] == gender]['FSIQ'].std() for gender in genders] stat, p_value = stats.ttest_ind_from_stats(iq_means[0], iq_stds[0], gender_counts[0], iq_means[1], iq_stds[1], gender_counts[1], equal_var=False) if p_value < 0.05: print('Based on the data, there is a statistically significant difference in intelligence between males and females.') if iq_means[0] > iq_means[1]: print('Therefore, males may be more suited for the scientist role.') else: print('Therefore, females may be more suited for the scientist role.') else: print("Based on the data, there is not enough evidence to conclude that there is a difference in intelligence between males and females.") # Generated answer: # Based on the data, there is not enough evidence to conclude that there is a difference in intelligence between males and females.

And got to the right conclusion: evidence not enough. Smart.

Based on my exploration so far, I am cautious about solely relying on a language model to choose a statistical method. While they excel at mechanical calculations, it is unclear to me whether they possess intelligent judgment or explicit knowledge to select the right approach for a given scenario. But if I give the model the proper direction, it can effectively perform the calculations.

That’s the key lesson: LLMs are incredibly useful for basic statistical tasks, offering a great starting point and assisting in generating code snippets. However, human validation and fine-tuning remain essential for aligning the insights generated.

In the new era of LLMs, individuals who deeply understand statistical theory will be highly valued since models can efficiently perform formulaic methods.

Time to revisit Stats 101, perhaps!

💡 Editor’s note: in this post, a wrapper around OpenAI’s gpt-3.5-turbo was used for exploratory purposes. The post does not use Defog’s production model.

← More blogs