Genloop is #1 on LiveSQLBench

Ayush Gupta

CEO, Genloop

We just ranked first on LiveSQLBench, a live text-to-SQL benchmark built around real end-user databases and tasks that actually reflect how messy SQL problems look in practice.

Agent Sentinel V1 scored 68.15%, putting us ahead of:

  • o3-mini from OpenAI (50%)

  • Claude Opus 4.6 (49.60%)

  • Claude Sonnet 4.5 (47.40%)

That is a lead of more than 18 percentage points over the next-best model listed above.

What LiveSQLBench actually tests

LiveSQLBench-Base-Lite runs 270 tasks across 18 real-world databases, including HKB-JSON and JSON operations in SQL. JSON handling in SQL is one of those areas where most text-to-SQL systems quietly fall apart. The queries get structurally complex fast, and systems that rely on pattern-matching over tidy tabular schemas have no easy path through them.
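To give a feel for why JSON changes the shape of a query, here is a minimal sketch using Python's built-in sqlite3 module and SQLite's JSON functions. The `orders` table, its `payload` column, and the data are made up for illustration and are not taken from LiveSQLBench, which may use a different SQL dialect. Even a simple "total quantity per region" question needs a path extraction plus an unnesting of an array before any aggregation can happen.

```python
import sqlite3

# Hypothetical table: each order stores its details as a JSON document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO orders (id, payload) VALUES (?, ?)",
    [
        (1, '{"customer": {"region": "EU"}, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}'),
        (2, '{"customer": {"region": "US"}, "items": [{"sku": "A", "qty": 5}]}'),
        (3, '{"customer": {"region": "EU"}, "items": [{"sku": "C", "qty": 4}]}'),
    ],
)

# json_extract pulls a nested field; json_each unnests the items array
# into one row per element so the quantities can be summed.
query = """
SELECT json_extract(o.payload, '$.customer.region') AS region,
       SUM(json_extract(item.value, '$.qty'))       AS total_qty
FROM orders AS o, json_each(o.payload, '$.items') AS item
GROUP BY region
ORDER BY region
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('EU', 7), ('US', 5)]
```

Nothing in the table schema says that `region` or `qty` exist. That knowledge lives inside the documents, which is the part a schema-only approach cannot see.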

Of the 270 tasks, 180 are read-based operations and 90 are CRUD manipulations. Genloop isn't built for manipulating data sources, so we're counting the 180 read tasks only. Sentinel scored 162/180 on those, which is 90%.
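The read-only figure is simple arithmetic on the task counts above, sketched here for anyone who wants to check it. The variable names are ours. The numbers come straight from this post.

```python
# Task counts as stated in this post.
READ_TASKS = 180   # read-based operations
CRUD_TASKS = 90    # CRUD manipulations
READ_SOLVED = 162  # read tasks Sentinel solved

total_tasks = READ_TASKS + CRUD_TASKS
read_accuracy = 100 * READ_SOLVED / READ_TASKS

print(f"{total_tasks} tasks in total")             # 270 tasks in total
print(f"read-only accuracy: {read_accuracy:.1f}%")  # read-only accuracy: 90.0%
```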

That's what makes this result meaningful: not just the score, but where the benchmark pushes.

We've been here before

Earlier this year, in March, Sentinel Agent v2 Pro scored 96.70 on Spider 2.0-Snow, putting us ahead of teams from Tencent, AT&T, ByteDance, and Snowflake. That benchmark tested reasoning across 150+ enterprise databases with messy schemas and multi-step queries.

LiveSQLBench is a different kind of hard. Real user-submitted tasks, live leaderboard, no advance look at what's coming.

We're first on both.

How Genloop reasons differently

We don't translate questions into SQL. We build a context graph of your data environment and reason against that. Genloop's Living Context Graph holds your business logic, metric definitions, join paths, and team-specific context in one governed layer. When Sentinel processes a question, it already understands what your data means, not just what your schema says.
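As a rough illustration of the idea, and not Genloop's actual implementation, the sketch below models a context layer as three small lookups: metric definitions, dimensions, and join paths. Every name in it (`ContextGraph`, `build_query`, the tables and columns) is hypothetical. The point is only that once business definitions live in one governed place, the SQL is assembled from them rather than guessed from column names.

```python
from dataclasses import dataclass, field


@dataclass
class ContextGraph:
    """Toy stand-in for a governed context layer (hypothetical structure)."""
    metrics: dict = field(default_factory=dict)     # name -> (table, SQL expression)
    dimensions: dict = field(default_factory=dict)  # name -> (table, column)
    joins: dict = field(default_factory=dict)       # (left, right) -> ON clause

    def build_query(self, metric: str, dimension: str) -> str:
        """Assemble SQL for one metric grouped by one dimension."""
        m_table, m_expr = self.metrics[metric]
        d_table, d_col = self.dimensions[dimension]
        sql = f"SELECT {d_table}.{d_col}, {m_expr} AS {metric} FROM {m_table}"
        if d_table != m_table:
            # The join path comes from the graph, not from name matching.
            sql += f" JOIN {d_table} ON {self.joins[(m_table, d_table)]}"
        return sql + f" GROUP BY {d_table}.{d_col}"


graph = ContextGraph(
    metrics={"net_revenue": ("orders", "SUM(orders.amount - orders.refund)")},
    dimensions={"region": ("customers", "region")},
    joins={("orders", "customers"): "orders.customer_id = customers.id"},
)

sql = graph.build_query("net_revenue", "region")
print(sql)
```

Here the definition of "net revenue" (amount minus refunds) is something no schema states on its own. It has to be recorded somewhere before a question like "net revenue by region" can be answered consistently.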

That's the difference between 68% and 50%.

If you want to see how this performs on your own data, try it for free at app.genloop.ai.

Give Every Team the Analyst They've Been Waiting For