Open-sourcing SQLEval: our framework for evaluating LLM-generated SQL

This post details the design choices we made when building our SQL evaluation tools, and the underlying motivations behind them.

Aug 20, 2023

Wong Jing PingSoftware Engineer

Introducing SQL-Eval

SQL-Eval is a framework that evaluates the correctness of LLM generated SQL. As we built Defog, we recognized the need for evaluating how many of the queries we generated were correct. But evaluating the “correctness” of these queries proved difficult. We considered using GPT-4 as a judge, but ran into multiple issues with accuracy, latency, and costs.

We also realized that two different SQL queries can both be "correct". For the question, who are the 10 most recent users from Toronto, both of the following are correct in their own ways:

1-- query 1
2SELECT userid, username, created_at
3from users
4where city='Toronto'
5order by created_at
8-- query 2
9SELECT userid, firstname || ' ' || lastname, created_at
10from users
11where city='Toronto'
12order by created_at

SQL-Eval was our effort to fix this. We have open-sourced this to the community, and you can find our Github repo at https://github.com/defog-ai/sql-eval.


Before writing out a single line of code, we put down the following non-negotiable requirements for our evaluation framework:

  • Reproducible / Repeatable, as per the scientific method
  • Measures both SQL complexity (think nested queries and multiple joins) and semantic diversity (uses language in different settings)
  • Other desiderata were:

  • Fast (shouldn’t have to spend hours to run a single eval)
  • Cheap (shouldn’t have to shell out $$$ for each eval, eg via human annotators)
  • Easy to debug / inspect where the model went wrong
  • Rigorous (should have really hard questions with multiple joins, outer joins and nested queries)
  • Robustly grade and handle multiple correct answers. Some questions are inherently ambiguous (eg asking for the top 5 restaurants could return the id of the restaurants or the names) and sometimes returning a little more info than expected (eg both ids and names) is still “acceptable”
  • Existing Options and our approach

    Before deciding to build our own, we surveyed the landscape of SQL data generation datasets, and found a few. Here they are, their strengths and why we decided not to use them:


    This dataset + evaluation framework was released by the Lily Lab from Stanford back in 2018, and has seen multiple iterations on top of its original question/dataset, like SParC for more conversational benchmarks. In its raw form, the spider dataset comprises a bunch of sqlite files describing 8 different databases, together with question/query pairs.

    They were the first to release evaluation code which ran the queries alongside the data provided, instead of just performing sql string / AST matching, which would result in a lot of false negatives. We really liked the complexity of the database schema (atis has 131 columns) and the way each query was classified with its own “difficulty” score based on the parsed expected SQL.


    This dataset was released by Salesforce and encompasses a wide variety of question contexts, owing to its curation method (tables from wikipedia). It was used to develop and benchmark Seq2SQL, one of our predecessors and state-of-the-art models for text-to-SQL generation.

    We liked the wide variety of question contexts, but the SQL to be generated was often quite simple, given how most questions’ context involved not more than a dozen columns and seldom more than a table.


    This was created on top of Spider + WikiSQL, using sqlglot to parse and clean the queries. Its structure and content take more closely to WikiSQL, with a helpful addition of the context column containing the actual SQL DDL (CREATE TABLE …), which we found to be very useful for actually testing and running the queries.


    We considered getting GPT-4 to generate our question/query pairs, as well as the data, given the schema we provide. But this was expensive, slow, and non-deterministic.

    We also considered using GPT-4 to evaluate the correctness of generated SQL, especially since GPT-4 for self-grading in some form of a actor-critic loop seems to a commonplace solution (e.g. LangChain’s benchmarks).

    However, the rate limits of the API, cost (each complete eval run could cost 3 digits excluding inference), potential for model drift and the decreasing marginal information added in the information loop (using GPT4 to generate queries + using GPT4 to evaluate its own responses) became deal-breakers that failed our first and foremost criterion: reproducibility.

    The simplest analogy underscoring the dangers of the self-grading approach is: would you trust the grade that a student gives to his/her own work, without external feedback?

    Dataset Creation

    Populating SQL Databases

    We decided to fork off 8 initial database schemas from Spider, manually fixing some data discrepancies in them. But we ran into significant data quality issues.

    With this in mind, we decided to turn to OpenAI to assist us for the first draft of our dataset generation. Similar to WikiSQL, we generated just a dozen or more rows that were semantically meaningful and related, and edited the generated data to incorporate various constraints (e.g. join keys, specific names and substrings we wanted to incorporate in our queries). That data has now been exported in a .sql file for each schema for easy importing here.

    Creating Question / Query Pairs

    We generated a set of question and “gold” query pairs, where the “gold” query would produce the desired results. In contrast to other benchmarks which use string/token-based matching (eg ROUGE), we care more about the results generated, not the exact SQL string.

    We also acknowledge that many times, there are multiple correct ways to answer a given question, which we would like to accept / mark as correct. Here are 4 acceptable examples for the question: What are the top 3 restaurants in New York?

    1-- query 1
    2SELECT name
    3FROM restaurants
    4GROUP BY name
    5ORDER BY AVG(rating) DESC LIMIT 3
    7-- query 2
    8SELECT id, name
    9FROM restaurants
    10GROUP BY name
    11ORDER BY AVG(rating) DESC LIMIT 3
    13-- query 3
    14SELECT name, AVG(rating)
    15FROM restaurants
    16GROUP BY 1
    19-- query 4
    20SELECT id, AVG(rating)
    21FROM restaurants
    22GROUP BY 1
    23ORDER BY 2
    24DESC LIMIT 3

    All 4 queries are acceptable, and we would like to ensure that we do not exclude any of these as false positives. As such, we have specified a minimal set of columns in our gold queries that a user asking a question would expect.

    Practically, we specify multiple acceptable columns using curly brackets in our gold queries, e.g. SELECT {id,user_name} FROM users would mean that the results from running SELECT id FROM users or SELECT user_name FROM users or SELECT id, user_name FROM users are all acceptable.

    Revisiting our earlier example, the gold query for the top 3 restaurants in New York can now be concisely represented as SELECT {id,name}, AVG(rating) FROM restaurants ORDER BY 2 DESC LIMIT 3

    Finally, we ensure a uniform representation across the different SQL syntax types, specifically

  • joins (JOIN … ON …). we test for queries involving multiple joins and joins in nested queries as well.
  • aggregations (GROUP BY)
  • ordered results (ORDER BY)
  • ratios (SELECT <some quantity> / <another quantity>). important to ensure we don’t divide by 0 and pre-aggregate meaningfully if necessary.
  • filters (WHERE)
  • Evaluation Method

    Big-Picture Approach

    We decided to go down the hard path of validating the results of the executed queries deterministically, against the results of a set of “gold” queries. As alluded to in the previous section, we’d need some way to check for “equivalence” to the “gold” query, given the minimal set of alternatives specified in the question/query pairs, while handling the host of edge cases. The detailed logic of what we’ll be describing below can be found in this python function in our repository, and we’ll attempt to explain that using a simple sample question: return me our users and if they like movies. Here’s the sample schema we’ll be working with:

    1CREATE TABLE users (
    2    uid BIGINT,
    3    name TEXT,
    4    likes_movies BOOLEAN,
    5    likes_plays BOOLEAN

    The gold query for this would be SELECT {uid,name}, likes_movies FROM users, since the query is ambiguous about whether the user’s uid or name or both would be required, and any of these permutations are acceptable given the question.

    1. We start by expanding the braces in the gold query to generate every possible combination of the columns within. This gives us the following 2^2 - 1 (no empty set) combinations: SELECT uid, likes_movies FROM users SELECT name, likes_movies FROM users SELECT uid, name, likes_movies FROM users

    2. We then run each of these queries against our database and get a dataframe for each of them. We use 2 rows here for brevity:

    1dfs_gold = [
    2	pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]}),
    3	pd.DataFrame({"name": ["alice", "bob"], "likes_movies": [True, False]}),
    4	pd.DataFrame({"uid": [1, 2], "name": ["alice", "bob"], "likes_movies": [True, False]}),

    3. Given the results of running the acceptable “gold” queries and our generated query, for each result of an acceptable gold query, we compare and see if the result of the given gold query matches the result of generated query. If there is an exact match of the dataframes’ contents (ignoring data types), the result is immediately marked as correct. The function responsible for this is compare_df. Comparing the result of the generated query df_generated with each of the dataframe’s in dfs_gold, only the first df should match as both dataframes contain the exact same columns (id,likes_movies) containing the exact same values.

    1# result of executing a generated query:
    2# `SELECT u.id, u.likes_movies FROM users u`
    3df_generated = pd.DataFrame({"uid": [1, 2], "likes_movies": [True, False]})
    4compare_df(df_generated, dfs_gold[0]) # True
    5compare_df(df_generated, dfs_gold[1]) # False
    6compare_df(df_generated, dfs_gold[2]) # False

    4. However, sometimes we might also want to allow certain alternatives to be flagged as correct still. Here are some examples:

    a. Column aliases that do not match the original column name. E.g. SELECT u id AS id, likes_movies FROM users u

    1id   likes_movies
    21    True
    32    False

    b. Selecting additional columns. E.g. SELECT users, name, likes_movies FROM users u

    1id    name   likes_movies
    21     alice  True
    32     bob    False

    c. Different row order. E.g. SELECT uid, likes_movies FROM users u ORDER BY likes_movies

    1id    likes_movies
    22     False
    31     True

    These variations are harmless, and could be helpful occasionally, depending on the question. Hence, we came up with a relaxed form of comparison called subset_df, which checks if the 1st dataframe (the gold query) is a subset of the 2nd (the generated query), if it does not pass the exact match in compare_df.

    Subset Evaluation

    Here’s how subset_df works:

    1. For each column in df1, we check to see if the same column of values exists in df2. We ignore data types, column names (since these could be aliased) and row order while doing so.

    2. After picking out the relevant columns from df2, renaming them with the names from df1, we check that the overall dataframe matches df1. This is to ensure that we don’t accidentally match on shuffled data columns (which is quite common for column data types with low cardinality like booleans or enums) with the same unordered list of values but from the wrong column.


    Our intention in releasing this dataset is to diversify the available benchmarks to help researchers and engineers better understand their model performance for text-to-SQL generation, specifically one that is robust to the harmless variations in the results returned, like column renames, additional columns, and reordering.

    By extension, this also means that you can evaluate your own prompt-engineered solutions with OpenAI’s api (example code here), as fine-tuning a model might not be an option within reach of everyone. By rigorously, openly and reproducibly testing the models we release alongside other state-of-the-art solutions, we hope to advance the frontier of OSS text-to-SQL solutions.

    Separately, this benchmark is not meant to function as an authoritative leaderboard, since it is trivial to game it by training on this dataset (see Goodhart’s law). If you are considering a text-to-SQL model/product for your own needs, do consider the nature of this benchmark, and if it correlates with your own utility (e.g. semantic diversity, SQL complexity, specificity in a narrow domain).

    You can find our repo at https://github.com/defog-ai/sql-eval. Please open an issue or drop us a note on Twitter with your feedback!

    About the author

    Wong Jing Ping is a Software Engineer at Defog. He works on building Defog’s Machine Learning and engineering stack. Before joining Defog, he worked on account and tweet recommendations at Twitter for six years.

    Outside of work, he enjoys recreational climbing and perfecting that single-origin pourover 👌🏻 He also builds small keyboards.

    ← More blogs

    Fine-tune our LLM on your enterprise data