Apr 16, 2025
Introduction
Enterprise applications use classification problems throughout their technology stack. These can include intent detection to route user requests to the right service, product category classification in retail, or document type identification when extracting data from unstructured files. The number of classes in these tasks typically ranges from 10 to 100. During our work with enterprises, we've frequently observed that public LLMs like GPT become less effective as the number of classes increases. This experiment aims to validate that performance decline and evaluate whether customized LLMs provide better results.
Objective
While public LLMs like OpenAI's GPT-4o and Claude 3.5/3.7 Sonnet are attractive for their seamless integration, they often underperform in real-world production environments. At scale, these models become prohibitively expensive to use. For example, a marketing technology company processing 2.7M NLP classifications on marketing documents chose to continue using classic NLP models despite GenAI's accuracy benefits, as OpenAI's estimated costs would reach $0.5M per month. These models also present challenges in regulated industries like fintech and healthcare due to data privacy concerns, and they typically respond more slowly than lighter, optimized alternatives. A viable alternative is to customize LLMs by fine-tuning smaller open-weight models and embedding business knowledge to achieve accurate classification.
Since the privacy and cost advantages of customized LLMs are well established, this experiment focuses on evaluating how classification accuracy changes as the number of classes increases. We also compare the performance of proprietary, general-purpose models against customized, fine-tuned LLMs to determine if domain-specific training provides meaningful advantages.
Data Preparation
For this experiment, we used an open-source dataset available on Hugging Face: gretelai/synthetic_text_to_sql. This dataset contains high-quality, synthetic Text-to-SQL samples created using Gretel Navigator and spans 100 distinct domains or verticals, with each vertical containing roughly 900-1,000 samples.
Out of the full dataset, we focused on two key columns:
domain: This column includes around 100 unique domain values, which we used as classification labels.
sql_prompt: This contains the actual user input. This serves as the input for our classification task.
To build a meaningful and scalable classification problem, we needed a subset of domains that were closely related. For this, we applied a clustering-based approach:
We embedded the sql_prompt texts using a Sentence Transformer model, "all-MiniLM-L6-v2", an open-source, lightweight model suitable for semantic understanding.
We then applied K-Means clustering with n_clusters=3 on the embedded representations to group similar domains based on prompt semantics.
From the resulting clusters, we selected one cluster containing 58 domains with strong semantic overlap. After manually reviewing the cluster, we removed 8 loosely connected classes to finalize a set of 50 well-aligned classes for our experiments.
Using this refined group, we created benchmark datasets with varying numbers of classes: 5, 10, 20, 25, and 50. For each class count, we sampled 10 examples per class. This means the dataset size scaled as follows:
5 classes → 50 samples
10 classes → 100 samples
20 classes → 200 samples
and so on.
Importantly, we ensured class continuity across groups. The classes included in the 5-class set are also part of the 10-class set, and so forth. This approach helps maintain the experimental integrity and allows us to reliably assess the impact of increasing class diversity on model performance.
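The nested benchmark sets can be built along the following lines, continuing from the previous sketch. Here, FINAL_50_CLASSES is a hypothetical placeholder for the 50 manually curated domains (the manual removal of the 8 loosely connected classes is not shown), and the sampling details are assumptions.

```python
# A sketch of building nested benchmark sets with 10 samples per class.
# FINAL_50_CLASSES is a hypothetical placeholder for the 50 curated domains,
# ordered so that smaller class sets are prefixes of larger ones (continuity).
import random

random.seed(42)
selected_classes = FINAL_50_CLASSES  # placeholder: curated list of 50 domain names

# Group dataset rows by domain
rows_by_domain = {d: [] for d in selected_classes}
for row in ds:
    if row["domain"] in rows_by_domain:
        rows_by_domain[row["domain"]].append(row)

# Fix 10 benchmark samples per domain once, then reuse them across groupings
bench_by_domain = {d: random.sample(rows_by_domain[d], 10) for d in selected_classes}

# Nested benchmarks: 5 classes -> 50 samples, 10 -> 100, ..., 50 -> 500
benchmarks = {}
for k in (5, 10, 20, 25, 50):
    classes_k = selected_classes[:k]  # class continuity across groups
    benchmarks[k] = [s for d in classes_k for s in bench_by_domain[d]]
```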
The remaining data for each of the selected classes, excluding the benchmark samples, was used to prepare the training dataset. Since each class had around 1,000 examples, we created a total training set of around 40,000 samples to fine-tune the Llama model and compare its performance against the GPT model.
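One way to turn these remaining rows into instruction-style training records for supervised fine-tuning is sketched below, continuing from the previous snippets; the prompt template and JSON answer format are illustrative assumptions, not the exact template we used.

```python
# A sketch of converting the remaining rows into chat-style training records
# (the prompt wording and answer format are illustrative assumptions).
import json

benchmark_prompts = {row["sql_prompt"] for row in benchmarks[50]}

train_records = []
for domain in selected_classes:
    for row in rows_by_domain[domain]:
        if row["sql_prompt"] in benchmark_prompts:
            continue  # keep benchmark samples out of the training set
        train_records.append({
            "messages": [
                {"role": "user", "content": (
                    "Classify the following SQL-related question into one of the "
                    "known business domains.\n\nQuestion: " + row["sql_prompt"]
                )},
                {"role": "assistant", "content": json.dumps({"domain": domain})},
            ]
        })

with open("train_records.jsonl", "w") as f:
    for rec in train_records:
        f.write(json.dumps(rec) + "\n")
```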
GPT-4o Evaluation
Using benchmark datasets with varying class sizes, we crafted a clear and consistent prompt format to evaluate GPT-4o's classification performance. The prompt provided unambiguous instructions for the model to select the most appropriate domain from a provided list of domain names for a given SQL-related question. We structured the output as a simple JSON format to ensure reliable evaluation.
Here’s an example of the prompt structure:
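The sketch below is a representative reconstruction of that structure together with the GPT-4o call, assuming the OpenAI Python SDK; the wording is illustrative, not the verbatim prompt used in the experiment.

```python
# A representative classification prompt and GPT-4o call (the wording is
# illustrative, not the verbatim prompt used in the experiment).
import json

from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are a domain classifier.
Given a SQL-related question, select the single most appropriate domain
from the list below. Respond only with JSON of the form {{"domain": "<name>"}}.

Domains:
{domain_list}

Question: {question}
"""

def classify_with_gpt4o(question: str, domains: list[str]) -> str:
    prompt = PROMPT_TEMPLATE.format(
        domain_list="\n".join(f"- {d}" for d in domains),
        question=question,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)["domain"]
```

Accuracy for each benchmark is then simply the fraction of samples whose predicted domain matches the ground-truth label.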
For each benchmark dataset (with 5, 10, 20, 25, and 50 classes), the corresponding list of domains in the prompt was updated to match the set of classes used. The same structure was maintained across all experiments to ensure consistency.
The results with GPT-4o showed a clear trend: as the number of classes increased, the model's accuracy in identifying the appropriate domain dropped to the point of being impractical. This decline highlights the challenge of scaling general-purpose models for fine-grained classification tasks.
The performance drop is visualized in the graph below.

Fig 1: Accuracy of GPT-4o Model Varying Across Increasing Number of Classes
Fine-tuning and Evaluating Meta-Llama-3.1-8B-Instruct
In the second experiment, we evaluated how a customized model, Meta-Llama-3.1-8B-Instruct, performs when fine-tuned on our domain classification task. We first trained this base model on the dataset with 50 classes, and then ran inference with it across the different class groupings.
We trained three versions of the model, each with a different amount of training data, to study the impact of training data volume on accuracy (a fine-tuning sketch follows the list):
Model A: Fine-tuned with 2,000 samples
Model B: Fine-tuned with 5,000 samples
Model C: Fine-tuned with 40,000 samples
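We do not reproduce the exact training configuration here, but a typical LoRA-based supervised fine-tuning setup for Meta-Llama-3.1-8B-Instruct, using Hugging Face TRL on the chat-style records prepared earlier, might look like the following; the hyperparameters are illustrative assumptions and a recent TRL version is assumed.

```python
# One common LoRA-based SFT setup for Llama-3.1-8B-Instruct using Hugging Face
# TRL (hyperparameters are illustrative; not the exact configuration used).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Chat-style records produced in the data-preparation step
train_dataset = load_dataset("json", data_files="train_records.jsonl", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="llama31-8b-domain-classifier",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=50,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    args=training_args,
    train_dataset=train_dataset,  # TRL applies the chat template to "messages"
    peft_config=peft_config,
)
trainer.train()
```

Models A, B, and C would differ only in the number of training records passed as train_dataset.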
All models were evaluated using the same benchmark datasets and inference format as in the GPT-4o experiment. This ensured a fair comparison of performance across both approaches.
The table below summarizes the accuracy scores for GPT-4o and all three fine-tuned models across the benchmark datasets:

Table 1: Accuracy Comparison of Various Experiments with Increasing Number of Classes
The following graph presents a visual representation of the accuracy trends highlighted in the comparison table above:

Fig 2: Accuracy Trend Across Varying Class Distributions During Inference
As the graph and the comparison table show, the fine-tuned models performed significantly better than the general-purpose model and maintained consistent accuracy as the number of classes increased. As expected, accuracy improved with more training data: Model C (trained on 40K samples) consistently achieved the highest accuracy across all class groupings (5, 10, 20, 25, 50).
Note that while we fine-tuned a single model and evaluated it across multiple class groupings, training separate models for specific class ranges could potentially yield even higher accuracy.
Conclusion
Our experiments clearly highlight a key insight: general-purpose LLMs struggle with accuracy as task complexity increases, especially in classification problems with a growing number of classes. This limitation becomes critical in enterprise applications where precision drives downstream decisions.
In contrast, customized LLMs, fine-tuned on domain-specific data, consistently outperform general models, making them a more reliable choice for production use. While fine-tuning requires initial development effort, tools like Genloop make this process more accessible and scalable.
As complexity grows, so does the performance gap, favoring tailored solutions over generic ones. For enterprises, the path forward is clear: invest in quality data and model customization to unlock the true potential of AI in production systems.
About Genloop
Genloop delivers customized LLMs that provide unmatched cost, control, simplicity, and performance for production enterprise applications. Please visit genloop.ai or email founder@genloop.ai for more details. Schedule a free consultation call with our GenAI experts for personalized guidance and recommendations.