Apr 16, 2025
Introduction
Enterprise applications use classification problems throughout their technology stack. These can include intent detection to route user requests to the right service, product category classification in retail, or document type identification when extracting data from unstructured files. The number of classes in these tasks typically ranges from 10 to 100. During our work with enterprises, we've frequently observed that public LLMs like GPT become less effective as the number of classes increases. This experiment aims to validate that performance decline and evaluate whether customized LLMs provide better results.
Objective
While public LLMs like OpenAI's GPT-4o and Claude 3.5/3.7 Sonnet are attractive for their seamless integration, they often underperform in real-world production environments. At scale, these models become prohibitively expensive to use. For example, a marketing technology company processing 2.7M NLP classifications on marketing documents chose to continue using classic NLP models despite GenAI's accuracy benefits, as OpenAI's estimated costs would reach $0.5M per month. These models also present challenges in regulated industries like fintech and healthcare due to data privacy concerns, and they typically respond more slowly than lighter, optimized alternatives. A viable alternative is to customize LLMs by fine-tuning smaller open-weight models and embedding business knowledge to achieve accurate classification.
Since the privacy and cost advantages of customized LLMs are well established, this experiment focuses on evaluating how classification accuracy changes as the number of classes increases. We also compare the performance of proprietary, general-purpose models against customized, fine-tuned LLMs to determine if domain-specific training provides meaningful advantages.
Data Preparation
For this experiment, we used an open-source dataset available on Hugging Face: gretelai/synthetic_text_to_sql. This dataset contains high-quality, synthetic Text-to-SQL samples created using Gretel Navigator and spans 100 distinct domains or verticals, with each vertical containing roughly 900-1,000 samples.
Out of the full dataset, we focused on two key columns:
domain: This column includes around 100 unique domain values, which we used as classification labels.
sql_prompt: This contains the actual user input. This serves as the input for our classification task.
To build a meaningful and scalable classification problem, we needed a subset of domains that were closely related. For this, we applied a clustering-based approach:
We embedded the sql_prompt texts using a Sentence Transformer model, "all-MiniLM-L6-v2", an open-source, lightweight model suitable for semantic understanding.
We then applied K-Means clustering with n_clusters=3 on the embedded representations to group similar domains based on prompt semantics.
From the resulting clusters, we selected one cluster containing 58 domains with strong semantic overlap. After manually reviewing the cluster, we removed 8 loosely connected classes to finalize a set of 50 well-aligned classes for our experiments.
Using this refined group, we created benchmark datasets with varying numbers of classes: 5, 10, 20, 25, and 50. For each class count, we sampled 10 examples per class. This means the dataset size scaled as follows:
5 classes → 50 samples
10 classes → 100 samples
20 classes → 200 samples
and so on.
Importantly, we ensured class continuity across groups. The classes included in the 5-class set are also part of the 10-class set, and so forth. This approach helps maintain the experimental integrity and allows us to reliably assess the impact of increasing class diversity on model performance.
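The nested benchmark sets can be built along the following lines, continuing from the previous sketch. Here, FINAL_50_CLASSES is a hypothetical placeholder for the 50 manually curated domains (the manual removal of the 8 loosely connected classes is not shown), and the sampling details are assumptions.

```python
# A sketch of building nested benchmark sets with 10 samples per class.
# FINAL_50_CLASSES is a hypothetical placeholder for the 50 curated domains,
# ordered so that smaller class sets are prefixes of larger ones (continuity).
import random

random.seed(42)
selected_classes = FINAL_50_CLASSES  # placeholder: curated list of 50 domain names

# Group dataset rows by domain
rows_by_domain = {d: [] for d in selected_classes}
for row in ds:
    if row["domain"] in rows_by_domain:
        rows_by_domain[row["domain"]].append(row)

# Fix 10 benchmark samples per domain once, then reuse them across groupings
bench_by_domain = {d: random.sample(rows_by_domain[d], 10) for d in selected_classes}

# Nested benchmarks: 5 classes -> 50 samples, 10 -> 100, ..., 50 -> 500
benchmarks = {}
for k in (5, 10, 20, 25, 50):
    classes_k = selected_classes[:k]  # class continuity across groups
    benchmarks[k] = [s for d in classes_k for s in bench_by_domain[d]]
```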
The remaining data for each of the selected classes, excluding the benchmark samples, was used to prepare the training dataset. Since each class had around 1,000 examples, we created a total training set of around 40,000 samples to fine-tune the Llama model and compare its performance against the GPT model.
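One way to turn these remaining rows into instruction-style training records for supervised fine-tuning is sketched below, continuing from the previous snippets; the prompt template and JSON answer format are illustrative assumptions, not the exact template we used.

```python
# A sketch of converting the remaining rows into chat-style training records
# (the prompt wording and answer format are illustrative assumptions).
import json

benchmark_prompts = {row["sql_prompt"] for row in benchmarks[50]}

train_records = []
for domain in selected_classes:
    for row in rows_by_domain[domain]:
        if row["sql_prompt"] in benchmark_prompts:
            continue  # keep benchmark samples out of the training set
        train_records.append({
            "messages": [
                {"role": "user", "content": (
                    "Classify the following SQL-related question into one of the "
                    "known business domains.\n\nQuestion: " + row["sql_prompt"]
                )},
                {"role": "assistant", "content": json.dumps({"domain": domain})},
            ]
        })

with open("train_records.jsonl", "w") as f:
    for rec in train_records:
        f.write(json.dumps(rec) + "\n")
```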
GPT-4o Evaluation
Using benchmark datasets with varying class sizes, we crafted a clear and consistent prompt format to evaluate GPT-4o's classification performance. The prompt provided unambiguous instructions for the model to select the most appropriate domain from a provided list of domain names for a given SQL-related question. We structured the output as a simple JSON format to ensure reliable evaluation.
Here’s an example of the prompt structure:
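The sketch below is a representative reconstruction of that structure together with the GPT-4o call, assuming the OpenAI Python SDK; the wording is illustrative, not the verbatim prompt used in the experiment.

```python
# A representative classification prompt and GPT-4o call (the wording is
# illustrative, not the verbatim prompt used in the experiment).
import json

from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are a domain classifier.
Given a SQL-related question, select the single most appropriate domain
from the list below. Respond only with JSON of the form {{"domain": "<name>"}}.

Domains:
{domain_list}

Question: {question}
"""

def classify_with_gpt4o(question: str, domains: list[str]) -> str:
    prompt = PROMPT_TEMPLATE.format(
        domain_list="\n".join(f"- {d}" for d in domains),
        question=question,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)["domain"]
```

Accuracy for each benchmark is then simply the fraction of samples whose predicted domain matches the ground-truth label.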
For each benchmark dataset (with 5, 10, 20, 25, and 50 classes), the corresponding list of domains in the prompt was updated to match the set of classes used. The same structure was maintained across all experiments to ensure consistency.
The results with GPT-4o showed a clear trend: as the number of classes increased, the model's accuracy in identifying the appropriate domain dropped to the point of being impractical. This decline highlights the challenge of scaling general-purpose models for fine-grained classification tasks.
The performance drop is visualized in the graph below.

Fig 1: Accuracy of GPT-4o Model Varying Across Increasing Number of Classes
Fine-tuning and Evaluating Meta-Llama-3.1-8B-Instruct
In the second experiment, we evaluated how a customized model, Meta-Llama-3.1-8B-Instruct, performs when fine-tuned on our domain classification task. We first trained this base model on the dataset with 50 classes, and then ran inference with it across the different class groupings.
We trained three versions of the model, each with a different amount of training data, to study the impact of training data volume on accuracy (a fine-tuning sketch follows the list):
Model A: Fine-tuned with 2,000 samples
Model B: Fine-tuned with 5,000 samples
Model C: Fine-tuned with 40,000 samples
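We do not reproduce the exact training configuration here, but a typical LoRA-based supervised fine-tuning setup for Meta-Llama-3.1-8B-Instruct, using Hugging Face TRL on the chat-style records prepared earlier, might look like the following; the hyperparameters are illustrative assumptions and a recent TRL version is assumed.

```python
# One common LoRA-based SFT setup for Llama-3.1-8B-Instruct using Hugging Face
# TRL (hyperparameters are illustrative; not the exact configuration used).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Chat-style records produced in the data-preparation step
train_dataset = load_dataset("json", data_files="train_records.jsonl", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="llama31-8b-domain-classifier",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=50,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    args=training_args,
    train_dataset=train_dataset,  # TRL applies the chat template to "messages"
    peft_config=peft_config,
)
trainer.train()
```

Models A, B, and C would differ only in the number of training records passed as train_dataset.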
All models were evaluated using the same benchmark datasets and inference format as in the GPT-4o experiment. This ensured a fair comparison of performance across both approaches.
The table below summarizes the accuracy scores for GPT-4o and all three fine-tuned models across the benchmark datasets:

Table 1: Accuracy Comparison of Various Experiments with Increasing Number of Classes
The following graph presents a visual representation of the accuracy trends highlighted in the comparison table above:

Fig 2: Accuracy Trend Across Varying Class Distributions During Inference
As the graph and the comparison table show, the fine-tuned models performed significantly better than the general-purpose model and maintained consistent accuracy as the number of classes increased. As expected, accuracy improved with more training data: Model C (trained on 40K samples) consistently achieved the highest accuracy across all class groupings (5, 10, 20, 25, 50).
Note that while we fine-tuned a single model and evaluated it across multiple class groupings, training separate models for specific class ranges could potentially yield even higher accuracy.
Conclusion
Our experiments clearly highlight a key insight: general-purpose LLMs struggle with accuracy as task complexity increases, especially in classification problems with a growing number of classes. This limitation becomes critical in enterprise applications where precision drives downstream decisions.
In contrast, customized LLMs, fine-tuned on domain-specific data, consistently outperform general models, making them a more reliable choice for production use. While fine-tuning requires initial development effort, tools like Genloop make this process more accessible and scalable.
As complexity grows, so does the performance gap, favoring tailored solutions over generic ones. For enterprises, the path forward is clear: invest in quality data and model customization to unlock the true potential of AI in production systems.
About Genloop
Genloop delivers customized LLMs that provide unmatched cost, control, simplicity, and performance for production enterprise applications. Please visit genloop.ai or email founder@genloop.ai for more details. Schedule a free consultation call with our GenAI experts for personalized guidance and recommendations.