Introducing SQL-Eval
SQL-Eval is a framework that evaluates the correctness of LLM-generated SQL. As we built Defog, we recognized the need to measure how many of the queries we generated were correct. But evaluating the "correctness" of these queries proved difficult. We considered using GPT-4 as a judge, but ran into multiple issues with accuracy, latency, and cost.
We also realized that two different SQL queries can both be "correct". For the question "Who are the 10 most recent users from Toronto?", both of the following are correct in their own way:
-- query 1
SELECT userid, username, created_at FROM users WHERE city = 'Toronto' ORDER BY created_at DESC LIMIT 10;
-- query 2
SELECT userid, firstname || ' ' || lastname, created_at FROM users WHERE city = 'Toronto' ORDER BY created_at DESC LIMIT 10;
SQL-Eval was our effort to fix this. We have open-sourced this to the community, and you can find our Github repo at https://github.com/defog-ai/sql-eval.
Goals
Before writing a single line of code, we put down a set of non-negotiable requirements for our evaluation framework, chief among them reproducibility, along with a few other desiderata.
Existing Options and Our Approach
Before deciding to build our own, we surveyed the landscape of existing SQL generation datasets and found a few. Here they are, along with their strengths and why we decided not to use them:
Spider
This dataset + evaluation framework was released by the LILY Lab at Yale back in 2018, and has seen multiple iterations on top of its original questions/dataset, like SParC for more conversational benchmarks. In its raw form, the Spider dataset comprises a collection of sqlite files describing its various databases, together with question/query pairs.
They were the first to release evaluation code that executed the queries against the provided data, instead of just performing SQL string / AST matching, which would result in a lot of false negatives. We really liked the complexity of the database schemas (the atis database has 131 columns) and the way each query was classified with its own "difficulty" score based on the parsed expected SQL.
WikiSQL
This dataset was released by Salesforce and encompasses a wide variety of question contexts, owing to its curation method (tables from Wikipedia). It was used to develop and benchmark Seq2SQL, one of the early state-of-the-art models for text-to-SQL generation and a predecessor of our own work.
We liked the wide variety of question contexts, but the SQL to be generated was often quite simple, since most questions' context involved no more than a dozen columns and seldom more than a single table.
SQL-Create-Context
This was created on top of Spider + WikiSQL, using sqlglot to parse and clean the queries. Its structure and content take after WikiSQL more closely, with the helpful addition of a context column containing the actual SQL DDL (CREATE TABLE …), which we found very useful for actually testing and running the queries.
OpenAI
We considered getting GPT-4 to generate our question/query pairs, as well as the data, given the schema we provide. But this was expensive, slow, and non-deterministic.
We also considered using GPT-4 to evaluate the correctness of generated SQL, especially since using GPT-4 for self-grading in some form of an actor-critic loop seems to be a commonplace solution (e.g. LangChain’s benchmarks).
However, the API's rate limits, the cost (each complete eval run could cost in the three digits, excluding inference), the potential for model drift, and the diminishing marginal information added by the loop of using GPT-4 to generate queries and then using GPT-4 to evaluate its own responses became deal-breakers that failed our first and foremost criterion: reproducibility.
The simplest analogy underscoring the dangers of the self-grading approach is: would you trust the grade that a student gives to his/her own work, without external feedback?
Dataset Creation
Populating SQL Databases
We decided to fork off 8 initial database schemas from Spider and manually fix some data discrepancies in them, but we ran into significant data quality issues with the underlying data.
With this in mind, we decided to turn to OpenAI to assist us with the first draft of our dataset generation. Similar to WikiSQL, we generated just a dozen or more rows that were semantically meaningful and related, and edited the generated data to incorporate various constraints (e.g. join keys, and specific names and substrings we wanted to use in our queries). That data has been exported as a .sql file for each schema for easy importing here.
Creating Question / Query Pairs
We generated a set of question and “gold” query pairs, where the “gold” query would produce the desired results. In contrast to other benchmarks that use string/token-based matching (e.g. ROUGE), we care about the results generated, not the exact SQL string.
We also acknowledge that there are often multiple correct ways to answer a given question, all of which we would like to accept and mark as correct. Here are 4 acceptable queries for the question "What are the top 3 restaurants in New York?":
-- query 1
SELECT name FROM restaurants GROUP BY name ORDER BY AVG(rating) DESC LIMIT 3
-- query 2
SELECT id, name FROM restaurants GROUP BY name ORDER BY AVG(rating) DESC LIMIT 3
-- query 3
SELECT name, AVG(rating) FROM restaurants GROUP BY 1 ORDER BY 2 DESC LIMIT 3
-- query 4
SELECT id, AVG(rating) FROM restaurants GROUP BY 1 ORDER BY 2 DESC LIMIT 3
All 4 queries are acceptable, and we want to ensure that we do not reject any of them as false negatives. As such, we specify the minimal set of columns in our gold queries that a user asking the question would expect.
Practically, we specify multiple acceptable columns using curly braces in our gold queries. For example, SELECT {id,user_name} FROM users means that the results from running SELECT id FROM users, SELECT user_name FROM users, or SELECT id, user_name FROM users are all acceptable.
Revisiting our earlier example, the gold query for the top 3 restaurants in New York can now be concisely represented as SELECT {id,name}, AVG(rating) FROM restaurants ORDER BY 2 DESC LIMIT 3
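To make the notation concrete, here is a minimal sketch of how such brace expansion could be implemented; the helper name expand_braces and its exact behavior are our own illustration, not necessarily the function used in our repo.

import itertools
import re

def expand_braces(query: str) -> list[str]:
    # Find the first {col1,col2,...} group in the query, if any.
    match = re.search(r"\{([^}]+)\}", query)
    if match is None:
        return [query]
    columns = [c.strip() for c in match.group(1).split(",")]
    expanded = []
    # Substitute every non-empty combination of the bracketed columns, then
    # recurse in case the query contains further brace groups.
    for r in range(1, len(columns) + 1):
        for combo in itertools.combinations(columns, r):
            substituted = query[:match.start()] + ", ".join(combo) + query[match.end():]
            expanded.extend(expand_braces(substituted))
    return expanded

expand_braces("SELECT {id,user_name} FROM users")
# ['SELECT id FROM users',
#  'SELECT user_name FROM users',
#  'SELECT id, user_name FROM users']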
Finally, we ensure a uniform representation across the different SQL syntax types.
Evaluation Method
Big-Picture Approach
We decided to go down the hard path of deterministically validating the results of the executed queries against the results of a set of “gold” queries. As alluded to in the previous section, we need a way to check for “equivalence” to the “gold” query, given the minimal set of alternatives specified in the question/query pairs, while handling a host of edge cases. The detailed logic described below can be found in this Python function in our repository, and we'll explain it using a simple sample question: "return me our users and if they like movies". Here’s the sample schema we’ll be working with:
CREATE TABLE users (
  uid BIGINT,
  name TEXT,
  likes_movies BOOLEAN,
  likes_plays BOOLEAN
)
The gold query for this would be SELECT {uid,name}, likes_movies FROM users, since the question is ambiguous about whether the user's uid or name or both would be required, and any of these permutations is acceptable.
1. We start by expanding the braces in the gold query to generate every possible combination of the columns within. This gives us the following 2^2 - 1 = 3 combinations (excluding the empty set):
SELECT uid, likes_movies FROM users
SELECT name, likes_movies FROM users
SELECT uid, name, likes_movies FROM users
2. We then run each of these queries against our database and get a dataframe for each of them; a sketch of how this could be done follows the results below. We show just 2 rows here for brevity:
dfs_gold = [
    pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]}),
    pd.DataFrame({"name": ["alice", "bob"], "likes_movies": [True, False]}),
    pd.DataFrame({"uid": [1, 2], "name": ["alice", "bob"], "likes_movies": [True, False]}),
]
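For illustration, here is a minimal sketch of how these dataframes could be obtained with pandas and SQLAlchemy; the connection string and variable names are placeholders for this example rather than the repo's actual setup.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection to the evaluation database; the credentials and
# database name here are placeholders.
engine = create_engine("postgresql://user:password@localhost:5432/sql_eval")

gold_queries = [
    "SELECT uid, likes_movies FROM users",
    "SELECT name, likes_movies FROM users",
    "SELECT uid, name, likes_movies FROM users",
]

# Execute each expanded gold query and collect the results as dataframes.
dfs_gold = [pd.read_sql(query, engine) for query in gold_queries]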
3. Given the results of running the acceptable “gold” queries and our generated query, we compare the result of each gold query against the result of the generated query. If there is an exact match of the dataframes’ contents (ignoring data types), the result is immediately marked as correct. The function responsible for this is compare_df (a sketch of which follows the snippet below). Comparing the result of the generated query df_generated with each of the dataframes in dfs_gold, only the first should match, since both dataframes contain exactly the same columns (uid, likes_movies) with exactly the same values.
# result of executing a generated query:
# `SELECT u.uid, u.likes_movies FROM users u`
df_generated = pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]})

compare_df(df_generated, dfs_gold[0])  # True
compare_df(df_generated, dfs_gold[1])  # False
compare_df(df_generated, dfs_gold[2])  # False
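To make the exact-match check concrete, here is a minimal sketch of what a compare_df-style comparison could look like; the actual compare_df in our repo handles more edge cases and normalization than shown here.

import pandas as pd

def compare_df_sketch(df1: pd.DataFrame, df2: pd.DataFrame) -> bool:
    # Dataframes of different shapes can never be an exact match.
    if df1.shape != df2.shape:
        return False
    # Cast all values to strings so that data type differences (e.g. int64 vs
    # object) do not cause spurious mismatches, then compare the contents
    # positionally, ignoring column names, which may be aliased.
    return bool((df1.astype(str).values == df2.astype(str).values).all())

Column names are deliberately ignored in this sketch, since the same data may be returned under different aliases.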
4. However, sometimes we also want certain harmless variations to still be flagged as correct. Here are some examples:
a. Column aliases that do not match the original column name.
E.g. SELECT u.uid AS id, likes_movies FROM users u

id   likes_movies
1    True
2    False
b. Selecting additional columns.
E.g. SELECT uid, name, likes_movies FROM users u

uid   name    likes_movies
1     alice   True
2     bob     False
c. Different row order.
E.g. SELECT uid, likes_movies FROM users u ORDER BY likes_movies
uid   likes_movies
2     False
1     True
These variations are harmless, and can occasionally even be helpful, depending on the question. Hence, we came up with a relaxed form of comparison called subset_df, which we fall back to when the exact match in compare_df fails: it checks whether the first dataframe (from the gold query) is a subset of the second (from the generated query).
Subset Evaluation
Here’s how subset_df works:
1. For each column in df1, we check to see if the same column of values exists in df2. We ignore data types, column names (since these could be aliased) and row order while doing so.
2. After picking out the relevant columns from df2 and renaming them with the names from df1, we check that the overall dataframe matches df1. This ensures that we don't accidentally match on shuffled data columns (which is quite common for column types with low cardinality, like booleans or enums) that have the same unordered list of values but come from the wrong column. A sketch of this check follows below.
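Here is a minimal sketch of what such a subset check could look like; the actual subset_df in our repo is more careful about edge cases (e.g. ambiguous column matches) than this simplified version.

import pandas as pd

def subset_df_sketch(df_gold: pd.DataFrame, df_generated: pd.DataFrame) -> bool:
    # Step 1: for each gold column, find a generated column with the same
    # multiset of values, ignoring column names, data types and row order.
    matched_cols = []
    for gold_col in df_gold.columns:
        gold_vals = sorted(df_gold[gold_col].astype(str))
        match = None
        for gen_col in df_generated.columns:
            if gen_col in matched_cols:
                continue  # each generated column can be matched at most once
            if sorted(df_generated[gen_col].astype(str)) == gold_vals:
                match = gen_col
                break
        if match is None:
            return False
        matched_cols.append(match)

    # Step 2: take the matched columns from the generated dataframe, rename
    # them to the gold column names, and check that the rows as a whole match
    # the gold dataframe (after sorting), so that shuffled values from the
    # wrong columns are not accepted.
    candidate = df_generated[matched_cols].astype(str)
    candidate.columns = df_gold.columns
    gold = df_gold.astype(str)
    sort_cols = list(gold.columns)
    candidate = candidate.sort_values(by=sort_cols).reset_index(drop=True)
    gold = gold.sort_values(by=sort_cols).reset_index(drop=True)
    return candidate.equals(gold)

The two-step structure mirrors the description above: per-column matching first, then a whole-dataframe check to rule out matches on shuffled columns.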
Conclusion
Our intention in releasing this dataset is to diversify the available benchmarks to help researchers and engineers better understand their model performance for text-to-SQL generation, specifically one that is robust to the harmless variations in the results returned, like column renames, additional columns, and reordering.
By extension, this also means that you can evaluate your own prompt-engineered solutions with OpenAI's API (example code here), since fine-tuning a model might not be within everyone's reach. By rigorously, openly and reproducibly testing the models we release alongside other state-of-the-art solutions, we hope to advance the frontier of OSS text-to-SQL solutions.
Separately, this benchmark is not meant to function as an authoritative leaderboard, since it is trivial to game by training on this dataset (see Goodhart’s law). If you are considering a text-to-SQL model/product for your own needs, do consider the nature of this benchmark and whether it correlates with your own requirements (e.g. semantic diversity, SQL complexity, specificity in a narrow domain).
You can find our repo at https://github.com/defog-ai/sql-eval. Please open an issue or drop us a note on Twitter with your feedback!
About the author
Wong Jing Ping is a Software Engineer at Defog. He works on building Defog’s Machine Learning and engineering stack. Before joining Defog, he worked on account and tweet recommendations at Twitter for six years.
Outside of work, he enjoys recreational climbing and perfecting that single-origin pourover 👌🏻 He also builds small keyboards.