Text-2-Sql Generation with Private LLMs

Sep 7, 2024

Introduction

What is the best way to generate SQL queries from natural language prompts? This question has long been a subject of intense interest and debate within the AI community. The ability to convert natural language commands into structured SQL queries is a critical capability for a wide range of applications, from virtual assistants to data analytics tools. However, this task is far from trivial, as it requires a deep understanding of both natural language and SQL syntax.

Large Language Models (LLMs) are a powerful tool for addressing this challenge. These models have demonstrated remarkable capabilities in understanding and generating natural language text. We have received multiple requests on how private LLMs can be leveraged to generate these SQL queries and enable natural language data analysis for CXOs and upper management. Enterprises are wary of using closed-source LLMs due to privacy and security concerns. They are looking for a private LLM that can be fine-tuned on their proprietary data and deployed within their secure environment.

In this blog, we go over how Genloop helps develop and maintain private LLMs that can generate SQL queries for your private data, with continued performance improvements and the inherent advantages of private LLMs.

The Challenge

The critical challenge with generating SQL queries from natural language prompts is the complexity of the task. SQL is a highly structured language with specific syntax and grammar. Translating natural language commands into SQL requires a deep understanding of both the natural language and the SQL syntax. This task is further complicated by the wide variety of SQL queries that can be generated from a single natural language prompt. For example, a single natural language command could correspond to multiple SQL queries depending on the context and the underlying data. This makes evaluation and fine-tuning of the model challenging.
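To make the evaluation difficulty concrete, here is a hypothetical illustration of two semantically equivalent queries that an exact string match would score as different, which motivates logic-level comparison:

```python
# Two queries with identical results but different text: a naive exact-match
# metric would count the second one as wrong.
q1 = "SELECT name FROM users WHERE age >= 18"
q2 = "SELECT name FROM users WHERE NOT (age < 18)"
print(q1 == q2)  # False, yet both return the same rows
```

This is why evaluation has to reason about query intent rather than surface form.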

Moreover, the context size of the SQL query can be large, and the model needs to understand the context to generate the correct SQL query. This requires the model to have a large memory capacity and the ability to retain and recall information over long sequences.
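One common way to give the model this context is to include the table schema directly in the prompt. The sketch below is a hypothetical prompt builder (the function name and wording are illustrative, not Genloop's actual format):

```python
# Hypothetical sketch: build a text-2-SQL prompt that carries the table
# schema as context, so the model can ground table and column names.
def build_prompt(schema: str, question: str) -> str:
    """Combine a CREATE TABLE context with the user's question."""
    return (
        "Given the following database schema:\n"
        f"{schema}\n\n"
        f"Translate this request into a SQL query: {question}\n"
        "SQL:"
    )

prompt = build_prompt(
    schema="CREATE TABLE employees (id INT, name TEXT, salary INT)",
    question="Which employees earn more than 50000?",
)
print(prompt)
```

With large schemas, this context can run to thousands of tokens, which is where the memory-capacity requirement above comes from.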

Another challenge is the latency involved in generating SQL queries from natural language prompts. The model needs to process the entire natural language prompt to generate the SQL query, which can be time-consuming. This latency can be a significant bottleneck in real-time applications where quick responses are required.

And finally, any text-2-sql solution needs to work perfectly for each client, free of hallucination and incomplete information. Establishing trust in the model is crucial for its adoption in real-world applications.

Any private LLM system needs to mitigate these challenges while abstracting away the operational challenges of maintaining and managing private LLMs. No enterprise has the bandwidth to manage the complexities of data refinement, model training, model upgrades, deployment, and monitoring.

Genloop is the perfect partner for all these challenges. It is a platform that enables enterprises to create, manage, and improve their private LLMs with the operational abstraction of a GPT. The following sections go over how a text-2-sql LLM flywheel can be generated with Genloop.

Initial Fine-Tuning with Genloop

The first fine-tuning with Genloop needs some seed data from GPT4 calls or enterprise records. In this case, we use the Sql-Create-Context dataset, available on Hugging Face.

When tested with the Llama2 7B chat model without fine-tuning, the results were pretty miserable. The model was unable to understand the context and generate the correct SQL query. However, fine-tuning is known to significantly improve LLM performance on SQL generation, as the model can learn through behaviour cloning on reference datasets.

We first pre-processed the dataset into an Alpaca format that could be used for fine-tuning. Genloop provides an AutoML-like interface that starts fine-tuning multiple candidate models with feasible configurations and presents the best-performing ones for further testing and evaluation. At the time of this experiment, the most feasible options were Mistral-7B v0.1, Mistral-7B v0.2, and Llama2-7B. Genloop auto-triggered parallel training on A100 40GB GPU machines for each of these models.
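The pre-processing step can be sketched as follows. This is a minimal illustration assuming the dataset's "question", "context", and "answer" fields; the instruction wording is hypothetical, not the exact text used in the experiment:

```python
# Hypothetical sketch: convert a sql-create-context record (fields
# "question", "context", "answer") into an Alpaca-style instruction record.
def to_alpaca(record: dict) -> dict:
    return {
        "instruction": (
            "Write a SQL query that answers the question, "
            "using the given table schema."
        ),
        "input": f"Schema: {record['context']}\nQuestion: {record['question']}",
        "output": record["answer"],
    }

example = {
    "question": "How many heads of the departments are older than 56?",
    "context": "CREATE TABLE head (age INTEGER)",
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}
print(to_alpaca(example))
```

Each record then pairs an instruction and input with the target SQL as the output, which is the shape Alpaca-style fine-tuning expects.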

Evaluation is a fundamental part of model training, and Genloop provides a variety of offline and online evaluation measures. Text-2-Sql evaluation is notoriously tricky because a single natural language prompt can have multiple correct SQL answers. While we will go deeper into evaluation in a later blog, our experiments have shown LLM-as-a-judge to have the highest correlation with human judgements, and that is what we used to evaluate our candidate models. The prompt for the LLM judge is presented below.

Compare two SQL queries and provide a matching score from 0 to 1, focusing on the overall logic and purpose. The comparison should assess the underlying intent and function of each query, disregarding syntactic variations or minor differences in implementation details.
Queries: 
Query 1: 
{query_1} 

Query 2: 
{query_2} 

Instructions: 
    1. Analyze the overarching purpose of each query, considering the desired outcome or action. 
    2. Evaluate the logic flow of the queries, including the sequence of operations and conditions. 
    3. Disregard syntactic differences such as whitespace, comments, or variations in naming conventions. 
    4. Assess whether the queries address similar tasks or objectives, even if implemented differently. 
Assign a matching score between 0 and 1, where 1 indicates that the queries serve identical or highly similar purposes. 
Values closer to 0 indicate significant differences in the overall logic or purpose of the queries. 
Ensure that your comparison focuses solely on the fundamental intent and function of the queries, rather than their specific implementation details. 
  
Return the output in the JSON format: 
  { 
      'score' : [Matching Score] 
  }
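A minimal sketch of how this judge can be wired up: filling the template with the two queries and parsing the score out of the judge's reply. The template below abbreviates the full prompt above, and the parser tolerates the single-quoted JSON the prompt asks for; the actual model call is left out:

```python
import json

# Abbreviated stand-in for the full judge prompt shown above.
JUDGE_TEMPLATE = (
    "Compare two SQL queries and provide a matching score from 0 to 1, "
    "focusing on the overall logic and purpose.\n"
    "Query 1:\n{query_1}\n\nQuery 2:\n{query_2}\n"
)

def fill_prompt(query_1: str, query_2: str) -> str:
    """Insert the candidate and reference queries into the judge prompt."""
    return JUDGE_TEMPLATE.format(query_1=query_1, query_2=query_2)

def parse_score(reply: str) -> float:
    """Extract the numeric score, tolerating single-quoted JSON."""
    data = json.loads(reply.replace("'", '"'))
    return float(data["score"])

print(parse_score("{ 'score' : 0.9 }"))  # → 0.9
```

Scores from the judge can then be averaged across the benchmark set to rank candidate models.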

The LLM judge ranked fine-tuned Llama2-7B as the best-performing model: it delivered a 5% improvement over GPT4 itself, while being 14x cheaper and 4x faster.

4x faster responses with private LLM over GPT4

14x cost benefit with private LLM over GPT4

Auto-Aligning and Auto-Upgrading Private LLMs

However, owning a private LLM involves much more than a one-time fine-tuning. You wouldn't want to lock your performance to a specific model that will become outdated within months. Development in the LLM field is so rapid that new models are released every week.

And what about drift? Model and data drift are real production problems that can render your system inaccurate and operationally useless.

One-Time Fine-tuning is insufficient! Private LLMs are much better!

Genloop provides solutions to both with an LLM flywheel. Using previous production records, its agentic workflows surface the examples that may be wrong and route them for human review. From this process, Genloop builds a higher-quality, refined dataset, which in turn yields a higher-quality model. Moreover, it curates a golden benchmark dataset from these records, giving a more unbiased view of model performance.
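The surfacing step can be sketched as a simple filter over production records annotated with judge scores. This is a hypothetical illustration (the field names and threshold are assumptions), not Genloop's actual workflow:

```python
# Hypothetical sketch: flag production records whose judge score falls below
# a threshold, so humans review only the likely-wrong generations.
def flag_for_review(records: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return records whose judge score suggests the generated SQL is wrong."""
    return [r for r in records if r["judge_score"] < threshold]

production_log = [
    {"question": "Total sales in 2023?",
     "sql": "SELECT SUM(amount) FROM sales WHERE year = 2023",
     "judge_score": 0.95},
    {"question": "Top customer by revenue?",
     "sql": "SELECT name FROM customers LIMIT 1",
     "judge_score": 0.2},
]
print(flag_for_review(production_log))  # only the low-score record
```

Human-corrected versions of the flagged records then flow back into the fine-tuning and benchmark sets, closing the flywheel.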

In this case, refining the production records produced a better dataset, which was then used to fine-tune new candidate models. By this time, Mistral-7B v0.2 Base had also been publicly released and was added to the candidate set of models.

The results were pretty interesting. Mistral-7B v0.2 Base performed well, but the auto-aligned Llama2 fared better. This shows that performance on academic benchmarks does not fully predict which LLM will be best for a given use case. One needs to get their hands dirty with multiple options to find the optimal result. Gladly, Genloop does not require you to get your hands dirty. It does all the heavy lifting for you.

This iteration improved the performance further by 1%. The latency and cost remained the same, as the underlying model remained the same.

Once we have the best-performing model, we can deploy it in your secure environment. Genloop manages the deployment on your private compute and provides ops features like serverless GPU scaling, health monitoring, and alerting.
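Applications can then query the private deployment the same way they would a hosted API. The sketch below assumes an OpenAI-compatible endpoint, as exposed by common serving stacks such as vLLM; the URL and model name are placeholders, not Genloop's actual interface:

```python
import json

# Hypothetical sketch: build a chat-completions request against a privately
# deployed text-2-SQL model. URL and model name are illustrative placeholders.
def build_request(question: str, schema: str) -> dict:
    return {
        "url": "https://llm.internal.example.com/v1/chat/completions",
        "payload": {
            "model": "private-llama2-7b-sql",
            "messages": [
                {"role": "system", "content": f"Schema: {schema}"},
                {"role": "user", "content": question},
            ],
            "temperature": 0.0,  # deterministic output for SQL generation
        },
    }

req = build_request(
    "List all orders above 100",
    "CREATE TABLE orders (id INT, amount INT)",
)
print(json.dumps(req["payload"], indent=2))
```

Because the endpoint lives inside your network, neither the schema nor the question ever leaves your environment.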

Continuous performance improvement with Genloop Private LLMs

Conclusion

In this blog, we have demonstrated that owning private LLMs is rewarding, but it is not a one-time tuning, nor a trivial process. There is a tangible benefit to completing the LLM flywheel that keeps refining your private LLM toward perfection. Your model + data + guidance >> GPTn.

Genloop is a platform that streamlines the entire process. It abstracts away the operational complexities of managing private LLMs, enabling enterprises to focus on building and deploying their applications on top of a private runtime layer powered by Genloop.

Create, manage, and improve your private LLMs with Genloop, with the operational abstraction of a GPT. Get in touch for a free fine-tuned model.

About Genloop

Genloop is a US-based company with expertise in delivering fine-tuned private LLMs that outperform GPT4. Genloop saves enterprises $100K every month in LLM costs by upgrading their GPT4 to a private specialized LLM without any development effort from the enterprise. Please visit genloop.ai or email founder@genloop.ai for more details.

Ready to Elevate Your Business with Personalized LLMs?

Genloop

Santa Clara, California, United States 95051

© 2025 Genloop™. All Rights Reserved.
