We just ranked first on LiveSQLBench, a live text-to-SQL benchmark built around real end-user databases and tasks that actually reflect how messy SQL problems look in practice.
Agent Sentinel V1 scored 68.15%, putting us ahead of:
o3-mini from OpenAI (50%)
Claude Opus 4.6 (49.60%)
Claude Sonnet 4.5 (47.40%)
That gap is not small: it is a lead of more than 18 percentage points over the next-best model listed.
What LiveSQLBench actually tests
LiveSQLBench-Base-Lite runs 270 tasks across 18 real-world databases, and its tasks include HKB-JSON and JSON operations in SQL. JSON handling in SQL is one of those areas where most text-to-SQL systems quietly fall apart. The queries get structurally complex fast, and systems that rely on pattern-matching over clean tabular schemas have no reliable path through them.
Of the 270 tasks, 180 are read-based operations and 90 are CRUD manipulations. Genloop isn't built for manipulating data sources, so we're counting the 180 read tasks only. Sentinel scored 162/180 on those, which is 90%.
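For readers who want to check the arithmetic, here is a minimal sketch that recomputes the read-task accuracy and the lead over the other listed models. It uses only the figures quoted in this post; the variable names are illustrative.

```python
# Figures quoted in this post; variable names are illustrative.
read_tasks = 180      # read-based tasks in LiveSQLBench-Base-Lite
read_correct = 162    # read tasks Sentinel answered correctly

read_accuracy = 100 * read_correct / read_tasks

sentinel_score = 68.15  # headline score reported above
other_scores = {
    "o3-mini": 50.00,
    "Claude Opus 4.6": 49.60,
    "Claude Sonnet 4.5": 47.40,
}

# Lead over each listed model, in percentage points.
leads = {name: round(sentinel_score - score, 2) for name, score in other_scores.items()}

print(f"Read-task accuracy: {read_accuracy:.1f}%")
for name, lead in leads.items():
    print(f"Lead over {name}: {lead:.2f} points")
```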
That's what makes this result meaningful. Not just the score, but where the benchmark pushes.
We've been here before
In March this year, Sentinel Agent v2 Pro scored 96.70 on Spider 2.0-Snow, putting us ahead of teams from Tencent, AT&T, ByteDance, and Snowflake. That benchmark tested reasoning across 150+ enterprise databases with messy schemas and multi-step queries.
LiveSQLBench is a different kind of hard. Real user-submitted tasks, live leaderboard, no advance look at what's coming.
We're first on both.
How Genloop reasons differently
We don't translate questions into SQL. We build a context graph of your data environment and reason against that. Genloop's Living Context Graph holds your business logic, metric definitions, join paths, and team-specific context in one governed layer. When Sentinel processes a question, it already understands what your data means, not just what your schema says.
That's the difference between 68% and 50%.
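To make the idea of reasoning against a context graph more concrete, here is a toy sketch. It is not Genloop's implementation: the `ContextGraph` class, its methods, and the example tables and metric are all hypothetical, chosen only to illustrate the general technique. Join paths and metric definitions live in one structure, and a query is assembled by looking them up rather than guessing from column names.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ContextGraph:
    """Toy context layer: join paths plus governed metric definitions (hypothetical)."""
    # table -> {neighbouring table: join condition}
    joins: dict = field(default_factory=dict)
    # metric name -> (home table, SQL expression)
    metrics: dict = field(default_factory=dict)

    def add_join(self, left, right, condition):
        # Joins are stored in both directions so paths can be walked either way.
        self.joins.setdefault(left, {})[right] = condition
        self.joins.setdefault(right, {})[left] = condition

    def add_metric(self, name, table, expression):
        self.metrics[name] = (table, expression)

    def join_path(self, start, goal):
        """Breadth-first search for the shortest chain of tables from start to goal."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in self.joins.get(path[-1], {}):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        raise ValueError(f"no join path from {start} to {goal}")

    def build_query(self, metric, group_table, group_column):
        """Assemble SQL from the stored metric definition and join path."""
        table, expression = self.metrics[metric]
        path = self.join_path(table, group_table)
        lines = [
            f"SELECT {group_table}.{group_column}, {expression} AS {metric}",
            f"FROM {path[0]}",
        ]
        for prev, nxt in zip(path, path[1:]):
            lines.append(f"JOIN {nxt} ON {self.joins[prev][nxt]}")
        lines.append(f"GROUP BY {group_table}.{group_column}")
        return "\n".join(lines)


# Hypothetical data environment, for illustration only.
graph = ContextGraph()
graph.add_join("orders", "customers", "orders.customer_id = customers.id")
graph.add_join("customers", "regions", "customers.region_id = regions.id")
graph.add_metric("net_revenue", "orders", "SUM(orders.amount - orders.refunded)")

sql = graph.build_query("net_revenue", "regions", "name")
print(sql)
```

In this sketch, a question like "net revenue by region" resolves to the team's stored definition of `net_revenue` and a known join path through `customers`, so neither has to be inferred from the schema alone.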
If you want to see how this performs on your own data, try it for free at app.genloop.ai.





