
SQLCoder 2: technical insights for fine-tuning and dataset generation

How we got a double-digit percentage-point improvement over the original SQLCoder model

Wong Jing Ping, Wendy Aw

Defog Staff

Oct 4, 2023

TL;DR

To significantly surpass the performance of the original SQLCoder, we boosted both the quantity and quality of our training data and optimized our fine-tuning methods.

Improving Training Data Quality

We systematically expanded and improved our training data through an overhaul of our data generation process.

Curriculum Learning

Noticing that our initial set of training data did not cover the full range of SQL patterns and syntax, we designed a comprehensive SQL curriculum to guide data generation, much as a teacher uses an academic curriculum to shape a student's learning objectives.

The curriculum outlined the core SQL concepts we wanted represented, ranging from basic clauses to underrepresented categories like date functions and advanced operations like window functions and CTEs. We also added three new database schemas to our existing set of 10, exposing the model to more varied vocabularies and schema relationships that would promote generalization across domains and improve its robustness to novel concepts.
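As a rough illustration (not our exact curriculum), a concept list along these lines can be encoded as a simple spec that orders generation targets by difficulty; the concept names and example counts below are placeholders:

```python
# Illustrative SQL-concept curriculum used to steer data generation.
# Concept names and per-concept example counts are hypothetical placeholders.
CURRICULUM = {
    "basic_select_where": {"difficulty": 1, "target_examples": 500},
    "group_by_having":    {"difficulty": 2, "target_examples": 400},
    "joins_multi_table":  {"difficulty": 2, "target_examples": 400},
    "date_functions":     {"difficulty": 3, "target_examples": 300},
    "window_functions":   {"difficulty": 4, "target_examples": 250},
    "ctes":               {"difficulty": 4, "target_examples": 250},
}

def generation_plan(curriculum: dict) -> list[tuple[str, int]]:
    """Order concepts from easiest to hardest so data generation (and later
    training stages) can follow the curriculum."""
    return sorted(
        ((name, spec["target_examples"]) for name, spec in curriculum.items()),
        key=lambda item: curriculum[item[0]]["difficulty"],
    )

for concept, n_examples in generation_plan(CURRICULUM):
    print(f"generate {n_examples} question/SQL pairs covering: {concept}")
```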

Paraphrasing questions for syntactic diversity

One technique we used to expand the training data efficiently was to take existing SQL query examples and rephrase the accompanying plain-language questions in multiple ways while keeping the underlying SQL query unchanged. In this way, we could expose the model to far greater syntactic and semantic diversity in a faster, more scalable way.
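A minimal sketch of this expansion step, assuming a hypothetical llm_paraphrase helper that stands in for whichever paraphrasing model is actually used:

```python
# Sketch of question paraphrasing: the SQL label stays fixed while the
# natural-language question is reworded N ways. `llm_paraphrase` is a
# hypothetical stand-in for a real paraphrasing model or LLM call.
def llm_paraphrase(question: str, n: int) -> list[str]:
    # Placeholder implementation; in practice this would call a model.
    return [f"{question} (variant {i + 1})" for i in range(n)]

def expand_example(question: str, sql: str, n_variants: int = 3) -> list[dict]:
    variants = [question] + llm_paraphrase(question, n_variants)
    return [{"question": q, "sql": sql} for q in variants]

examples = expand_example(
    "How many orders were placed last month?",
    "SELECT COUNT(*) FROM orders "
    "WHERE order_date >= date_trunc('month', CURRENT_DATE) - INTERVAL '1 month' "
    "AND order_date < date_trunc('month', CURRENT_DATE);",
)
print(len(examples))  # 1 original question + 3 paraphrases, all sharing one SQL label
```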

Normalizing Whitespace

In preprocessing the SQL query labels, we normalized all whitespace formatting, removing arbitrary indentation and line breaks that did not affect query functionality. Standardizing the whitespace allowed SQLCoder to focus on learning meaningful SQL syntax and structure rather than expend effort predicting inconsequential spacing, which could account for as much as half of the completion's token count.
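A simple normalizer along these lines (the exact preprocessing may differ; a production version would also preserve whitespace inside string literals):

```python
import re

def normalize_sql_whitespace(sql: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    so the label carries no formatting the model has to learn to predict.
    Note: this naive version also collapses whitespace inside string literals."""
    return re.sub(r"\s+", " ", sql).strip()

raw = """
SELECT   customer_id,
         SUM(amount)   AS total
FROM     payments
GROUP BY customer_id;
"""
print(normalize_sql_whitespace(raw))
# SELECT customer_id, SUM(amount) AS total FROM payments GROUP BY customer_id;
```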

Infinite WikiSQL variations

In addition to our core training data, we leveraged the WikiSQL dataset of query-question pairs. To expand the diversity of these examples at scale, we implemented a novel generative technique to automatically produce infinite variations of the WikiSQL data. As the WikiSQL questions were predominantly simple, single-table queries, we utilized these synthetic variations primarily during the early phases of training to prime the model for more complex training data in the later stages.
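The generative technique itself isn't detailed here, but as a toy sketch, WikiSQL-style single-table variations can be produced endlessly by re-sampling columns, aggregates and values into templates; the schema and templates below are invented for illustration:

```python
import random

# Hypothetical single-table schema and question/SQL templates. WikiSQL-style
# variations are produced by re-sampling columns, aggregates and thresholds.
COLUMNS = ["revenue", "units_sold", "rating"]
AGGS = [("average", "AVG"), ("total", "SUM"), ("maximum", "MAX")]

def random_wikisql_style_pair(table: str = "products") -> dict:
    col = random.choice(COLUMNS)
    agg_word, agg_fn = random.choice(AGGS)
    threshold = random.randint(1, 100)
    return {
        "question": f"What is the {agg_word} {col} for rows where rating is above {threshold}?",
        "sql": f"SELECT {agg_fn}({col}) FROM {table} WHERE rating > {threshold};",
    }

for _ in range(3):
    print(random_wikisql_style_pair())
```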

Creating new training data based on failure modes

To further improve performance on challenging queries, we also identified common failure patterns during evaluation. Based on these problematic examples, we created new training data with similarly complex characteristics, such as advanced datetime logic or multi-step ratios, and assigned these a higher weightage in subsequent training runs. This approach of augmenting the data with hard examples allowed us to iteratively improve performance on difficult cases that would previously trip up the model.
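One common way to give such examples more weight, shown here only as a sketch, is to oversample them when building the training pool; the is_hard flag and the 3x factor are illustrative, not the exact weighting used:

```python
import random

# Sketch of upweighting hard examples by oversampling them in the training pool.
def build_training_pool(examples: list[dict], hard_weight: int = 3) -> list[dict]:
    pool = []
    for ex in examples:
        copies = hard_weight if ex.get("is_hard") else 1  # hard cases appear multiple times
        pool.extend([ex] * copies)
    random.shuffle(pool)
    return pool
```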

These datasets were carefully balanced and interleaved during training in a staged progression, ensuring that the model saw a breadth of data at every stage while the overall difficulty steadily increased over time. We observed that thoughtfully evolving the data complexity led to improved SQL proficiency and generalization capabilities.
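A sketch of what such a staged schedule might look like, with every stage drawing from all sources while the mix shifts toward harder data; the proportions and source names are illustrative:

```python
import random

# Illustrative staged data schedule: every stage sees all sources, but the mix
# shifts toward harder, more complex data over time. Proportions are invented.
STAGES = [
    {"wikisql_variants": 0.6, "core_curriculum": 0.35, "hard_failure_cases": 0.05},
    {"wikisql_variants": 0.3, "core_curriculum": 0.55, "hard_failure_cases": 0.15},
    {"wikisql_variants": 0.1, "core_curriculum": 0.60, "hard_failure_cases": 0.30},
]

def sample_stage_batch(datasets: dict[str, list], stage: int, batch_size: int) -> list:
    """Draw a training batch whose source mix follows the given stage's weights."""
    weights = STAGES[stage]
    batch = []
    for _ in range(batch_size):
        source = random.choices(list(weights), weights=list(weights.values()))[0]
        batch.append(random.choice(datasets[source]))
    return batch
```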

Optimizing fine-tuning

To improve model training and increase accuracy on the text-to-SQL task, we implemented several key changes to the fine-tuning process.

Ignoring Prompt Loss

When training our model, we initially calculated the loss over all tokens in both the prompt and the completion. We then revised the loss calculation to consider only the completion tokens. This let the model focus on generating the right SQL completion rather than on reproducing the provided prompt text, which served solely as context.
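With Hugging Face-style causal LM training, this is typically done by setting the prompt positions of the labels to -100 so that cross-entropy ignores them; a minimal sketch, not necessarily the exact implementation:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch cross-entropy

def build_labels(prompt_ids: list[int], completion_ids: list[int]) -> dict:
    """Mask out prompt tokens so the loss is computed only on the completion."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```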

Avoiding packing of prompt-completion pairs

Previously, we would pack multiple prompt-completion pairs together into fixed token lengths in order to maximize use of the model's context window. Now, we keep each prompt-completion pair within a single input sequence, avoiding any splitting that could lose essential contextual information between a prompt and its completion. This approach also helps reduce the risk of hallucination in the model's responses.
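In practice, this amounts to tokenizing each pair as its own padded sequence rather than concatenating several pairs up to the context length; a sketch assuming a Hugging Face tokenizer with a pad token set:

```python
# Sketch: one prompt-completion pair per padded sequence, with no cross-example
# packing. Assumes a Hugging Face tokenizer whose pad_token is already set.
def encode_unpacked(tokenizer, prompt: str, completion: str, max_length: int = 2048):
    text = prompt + completion + tokenizer.eos_token
    return tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        padding="max_length",  # pad each example individually instead of packing
        return_tensors="pt",
    )
```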

Fine-tuning all available layers

Unlike typical fine-tuning runs in which only the attention layers are updated, this time we fine-tuned all of the model's trainable parameters on our text-to-SQL task. Updating all of the weights, rather than just those in the attention layers, allowed the model to better specialize to our particular task and improved overall task performance.
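Concretely, this just means leaving every parameter trainable rather than freezing everything except the attention projections; a sketch using transformers, with the model id shown only as an example:

```python
from transformers import AutoModelForCausalLM

# Full fine-tuning simply leaves every parameter trainable. The model id below
# is an example placeholder; substitute whichever base checkpoint is being tuned.
model = AutoModelForCausalLM.from_pretrained("defog/sqlcoder2")

for param in model.parameters():
    param.requires_grad = True  # update all weights, not just the attention layers

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```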

Hyperparameter Sweeping

To find the optimal hyperparameters for maximizing model accuracy, we conducted a hyperparameter search, sweeping through different combinations of learning rate profiles and batch sizes during training. Tuning these key hyperparameters through search allowed the model to train more effectively and achieve better final accuracy.
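A simple grid sweep along these lines; the specific values and the train_and_evaluate helper are hypothetical rather than our actual setup:

```python
import itertools

# Illustrative grid over learning-rate schedules and batch sizes.
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
SCHEDULES = ["cosine", "linear"]
BATCH_SIZES = [16, 32, 64]

def sweep(train_and_evaluate) -> dict:
    """Run every combination and keep the configuration with the best accuracy.
    `train_and_evaluate` is a hypothetical callback that trains a model with the
    given hyperparameters and returns its evaluation accuracy."""
    best = {"accuracy": float("-inf")}
    for lr, schedule, bs in itertools.product(LEARNING_RATES, SCHEDULES, BATCH_SIZES):
        accuracy = train_and_evaluate(lr=lr, schedule=schedule, batch_size=bs)
        if accuracy > best["accuracy"]:
            best = {"lr": lr, "schedule": schedule, "batch_size": bs, "accuracy": accuracy}
    return best
```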

Explore the model

You can explore our model at https://defog.ai/sqlcoder-demo

About the authors

Wendy Aw is a Machine Learning Engineer at Defog, working on model fine-tuning and dataset curation. Before joining Defog, Wendy spent most of the last decade as a copywriter, where she helped build some of the world’s biggest brands.

Wong Jing Ping is a Software Engineer at Defog. He works on building Defog’s Machine Learning and engineering stack. Before joining Defog, he worked on account and tweet recommendations at Twitter for six years. Outside of work, he enjoys recreational climbing and perfecting that single-origin pourover 👌🏻. He also builds small keyboards.

