What Anthropic got right about agentic analytics, and got wrong for everyone else

Ayush Gupta

CEO, Genloop

Anthropic's self-service analytics post is the clearest public case any AI lab has made that you cannot point a model at a warehouse and get reliable answers. Five thousand words in, they say it directly: pointing Claude at a warehouse "can create a false sense of precision." The rest of the post is the infrastructure required to undo that. It is the most honest public account of what production-grade agentic analytics actually demands. It is also a near-perfect illustration of why most companies will not be able to follow the same path.

Three approaches to agentic analytics

Every team building AI on data is choosing between three approaches, whether they have named them or not.

Approach 1: Connect the model directly to the warehouse. Give Claude SQL access, let it write queries, return the answer. Anthropic confirms this does not work. The model has no way to choose between forty plausible definitions of "active user" or detect that a table was deprecated last quarter. The same structural problem is why MCPs are a dead end for talking to data on their own.

Approach 2: Bring the data to the AI. Curate canonical datasets. Author a semantic layer. Write skill documentation. Co-locate everything with the transformation code. Run offline evals. This is what Anthropic did, and it works at Anthropic.

Approach 3: Bring the AI to the data. Build a continuous system that holds business context, governance, and provenance over time, and lets the model query against it without forcing the underlying data estate to be rebuilt for the model's convenience. This is the structural alternative we explore in what agentic analytics actually needs.

The three failure modes Anthropic named

The post frames the problem cleanly. Three modes account for nearly every failure in production.

Concept-to-entity ambiguity. A user asks for "active users." The model sees hundreds of candidate fields across dozens of tables and picks one. There is usually no signal in the schema itself that distinguishes the marketing team's definition from the finance team's.

Data staleness. Schemas drift, definitions evolve, tables get deprecated. The model's understanding of the warehouse is a snapshot, and the snapshot is wrong within weeks.

Retrieval failure. The right answer exists in the documentation, the past queries, the dbt models. The model fails to find it, or finds it and ignores it.

Every honest assessment of agentic analytics has to start here.
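
The first failure mode is easy to see in code. Below is a minimal Python sketch of a resolver that refuses to guess when a business term maps to more than one field. The catalog contents, table and column names, and the `resolve_term` helper are all hypothetical, invented for illustration; none of them come from Anthropic's post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    table: str   # hypothetical table name
    column: str  # hypothetical column name
    owner: str   # team whose definition this field encodes


# Hypothetical catalog: one business term, several plausible fields.
CATALOG = {
    "active users": [
        Candidate("marketing.engagement_daily", "is_active_30d", "marketing"),
        Candidate("finance.billing_accounts", "is_paying_active", "finance"),
    ],
    "signups": [
        Candidate("growth.signups", "signup_id", "growth"),
    ],
}


def resolve_term(term: str) -> dict:
    """Return one field, or a clarifying question when the term is ambiguous."""
    candidates = CATALOG.get(term.lower().strip(), [])
    if not candidates:
        return {"status": "unknown",
                "question": f"No known definition of '{term}'. Who owns it?"}
    if len(candidates) == 1:
        c = candidates[0]
        return {"status": "resolved", "field": f"{c.table}.{c.column}"}
    # More than one candidate: ask instead of silently picking one.
    options = ", ".join(f"{c.owner} ({c.table}.{c.column})" for c in candidates)
    return {"status": "ambiguous",
            "question": f"'{term}' has {len(candidates)} definitions: {options}. Which one?"}
```

Calling `resolve_term("active users")` on this toy catalog returns an `ambiguous` status with a question naming both teams, rather than a query built on whichever field happened to rank first.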

What Anthropic got right

1. Context engineering matters more than model selection. Without skill files, accuracy on Anthropic's internal evals sits at 21%. With them, it consistently exceeds 95%. The skill files are the system. The model is interchangeable.

2. Raw retrieval over a query corpus does almost nothing. Giving the agent direct access to thousands of prior queries moved accuracy less than a single point. Volume is not context. Patterns distilled into reference documentation do the actual work.

3. The semantic layer has to be the mandatory first stop, not a fallback. Anything else, and the model picks the wrong field at the start of the query and never recovers.

4. Maintenance is the silent killer. Without active upkeep, accuracy drifts from 95% to 65% in a single month. The system rots faster than most teams will notice.

5. The trade-off accounting is honest. Adversarial review buys 6% more accuracy at 32% more tokens and 72% higher latency. Anthropic publishes both sides of that ledger. That is the right way to report this work.
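
To see what the adversarial-review overheads in point 5 mean per query, here is a small arithmetic sketch. The 32% and 72% figures are the ones quoted above; the baseline of 10,000 tokens and 5 seconds per query is a hypothetical number chosen only to make the math visible.

```python
def with_adversarial_review(base_tokens: float, base_latency_s: float,
                            token_overhead: float = 0.32,
                            latency_overhead: float = 0.72) -> tuple[float, float]:
    """Scale a baseline query by the token and latency overheads quoted above."""
    return (base_tokens * (1 + token_overhead),
            base_latency_s * (1 + latency_overhead))


# Hypothetical baseline: 10,000 tokens and 5 seconds per query.
tokens, latency = with_adversarial_review(10_000, 5.0)
print(round(tokens), round(latency, 2))  # 13200 8.6
```

On that assumed baseline, every reviewed query costs 3,200 extra tokens and 3.6 extra seconds. Whether that is acceptable depends on query volume and on whether the answer is needed synchronously.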

What they got wrong for everyone else

1. The 95% number hides months of senior data work. It is the steady-state accuracy of a system actively curated for months on a known eval set. Cold-start accuracy on a new domain is 21%. The seventy-four-point gap is bridged by humans, not by the model, and the cost of that bridge does not appear in the post.

2. Dimensional modeling and concept-to-entity mapping are not a context layer. They are an ETL rewrite. For a small or greenfield warehouse, this is feasible. For any established business, the relevant context lives in people's heads, in Slack threads, in three years of analyst conventions that were never written down. You cannot retroactively run an ontology project to formalize all of it. It is not a sprint. It is not a quarter. It is not a scalable production move. The Anthropic blueprint, applied to a fifteen-year-old data estate, becomes a permanent engineering program with no shipping date.

3. Co-located skills and shared review queues assume one team. Most companies have separate data engineering and analytics organizations on different release cadences, often with different definitions for the same metric. Co-located maintenance breaks at the team boundary. The skill file goes stale a sprint after it is written, and the analyst who would have caught it never sees the pull request. This is the same structural reason most AI rollouts die from two sources of resistance, not one.

4. "Mandatory but bypassable" semantic enforcement is a prompt instruction, not an access control. The post describes the semantic layer as the first stop the agent is told to consult. For data with row-level governance or regulated fields, a prompt is not a security boundary. The same model that is asked nicely to use the semantic layer can be asked nicely to skip it.

5. The cost of running this at scale is not addressed anywhere in the post. Adversarial review buys six points of accuracy at thirty-two percent more tokens and seventy-two percent higher latency. Having a system recreate analyses from scratch repeatedly, without deterministic reasoning, invites huge token burn. That budget exists at Anthropic and almost nowhere else.

6. Banning LLM-drafted metric definitions misses the real failure. Anthropic declares auto-generated definitions a failure mode because they encode existing ambiguities. The actual failure is shipping them without a review loop where every drafted definition is stamped or rejected by a named owner on a specific date.
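
The review loop described in point 6 can be sketched in a few lines of Python. This is an illustration under our own assumptions, not a description of any existing system: the `MetricDefinition` class, its field names, and the `publishable` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MetricDefinition:
    name: str
    sql: str
    drafted_by: str                      # e.g. "llm" or a person's name
    status: str = "draft"                # draft -> approved | rejected
    reviewed_by: Optional[str] = None    # named owner who stamped it
    reviewed_on: Optional[date] = None   # date of the stamp

    def review(self, owner: str, approve: bool, on: date) -> None:
        """Record a decision; an anonymous review is rejected outright."""
        if not owner.strip():
            raise ValueError("a review needs a named owner")
        self.status = "approved" if approve else "rejected"
        self.reviewed_by = owner
        self.reviewed_on = on


def publishable(defs: list[MetricDefinition]) -> list[MetricDefinition]:
    """Only definitions approved by a named owner on a known date may ship."""
    return [d for d in defs
            if d.status == "approved" and d.reviewed_by and d.reviewed_on]
```

In this sketch an LLM can draft as many definitions as it likes. Nothing reaches the agent until `review` has been called with an owner and a date, which is the gate the argument above asks for.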

When each approach actually works

| Situation | Approach 1: Plug & Pray | Approach 2: Reshape Warehouse | Approach 3: Data Intelligence Layer |
| --- | --- | --- | --- |
| Small warehouse, single data team | ❌ Wrong fields selected; no recovery. | ✅ Works with ongoing maintenance. | ⚠️ Usually overkill. |
| Knowledge trapped in people's heads | ❌ Missing context cannot be prompted into existence. | ❌ Context can't be fully formalized. | ✅ Learns conventions from usage. |
| Regulated or access-controlled data | ❌ No governance. | ❌ Semantic layer ≠ access control. | ✅ Governance built into the system. |
| Separate engineering and analyst teams | ❌ Same failure mode. | ❌ Ownership breaks down. | ✅ Shared context product. |
| Frequent schema and business changes | ❌ Breaks immediately. | ❌ Manual curation can't keep up. | ✅ Drift detection and continuous review. |


The right mental model is onboarding a new analyst, not rebuilding the database

When a company hires a new data analyst, no one rewrites the warehouse to suit them. The analyst learns the schema over weeks. They ask clarifying questions. They remember which definition of "active user" the marketing team uses. They pick up that the finance team excludes a specific account. They build a mental model that evolves as the business does, and the mental model is the work.

Anthropic's stack assumes the inverse. The AI is the fixed point, and the organization reshapes itself around it. New canonical datasets. New semantic layer entries. New skill files co-located with the dbt models. New maintenance discipline gated into every pull request. This is feasible for a company willing to spend a year of senior data engineering on it. It is not the shape of a system most companies can adopt.

The right trajectory is the opposite. The system learns the conventions, retains them, surfaces them for human review, and treats governance as a structural property rather than a prompt instruction. Context is harvested from real usage over time, not authored upfront. The system trends toward the way a senior analyst already operates, because that is the only model of analytics work that scales without rewriting the company's data estate first. The architecture this requires is laid out in what agentic analytics actually needs.
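
The difference between governance as a structural property and governance as a prompt instruction is that the check runs in code the model cannot talk its way around. A minimal sketch, assuming a hypothetical column-level policy keyed by role (the `POLICY` table, the role names, and the `authorize` helper are invented for illustration):

```python
# Hypothetical policy: which columns each role may read.
POLICY = {
    "analyst": {"orders.order_id", "orders.amount", "orders.region"},
    "finance": {"orders.order_id", "orders.amount", "orders.region",
                "orders.card_last4"},
}


def authorize(role: str, requested_columns: set[str]) -> set[str]:
    """Deterministic gate: runs before any query executes, whatever the prompt said."""
    allowed = POLICY.get(role, set())  # unknown roles get no access
    denied = requested_columns - allowed
    if denied:
        raise PermissionError(f"role '{role}' may not read: {sorted(denied)}")
    return requested_columns
```

Here a request from the `analyst` role for `orders.card_last4` raises `PermissionError` no matter how the question was phrased. The model never gets the chance to be asked nicely.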

What production-grade agentic analytics actually requires

Six properties separate a system that works from a system that demos.

  • Cost discipline that survives scale. Token budgets that do not double every time the eval set grows.

  • Latency low enough for synchronous answers. Adversarial review on every query is not a production setting.

  • Governance enforced deterministically. Access controls live in the system, not in the prompt.

  • Trust grounded in provenance. Every answer carries its source tier, its freshness, and its named owner.

  • Ambiguity resolved through clarifying workflows. A best-guess answer to the wrong reading of the question is the worst outcome.

  • Continuous learning with drift detection built in. The system flags its own decay and queues fixes for human review.
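
The last property, drift detection, is the most mechanical of the six. One simple way to approximate it is to compare recent eval accuracy against a baseline and flag the system for review when the gap exceeds a tolerance. The sketch below makes that assumption explicit; the `detect_drift` helper, the tolerance, the window size, and the weekly values are all hypothetical.

```python
from statistics import mean


def detect_drift(weekly_accuracy: list[float], baseline: float,
                 tolerance: float = 0.05, window: int = 2) -> bool:
    """True when the mean of the last `window` runs falls more than
    `tolerance` below the baseline."""
    if len(weekly_accuracy) < window:
        return False  # not enough runs to judge
    recent = mean(weekly_accuracy[-window:])
    return (baseline - recent) > tolerance


# Hypothetical weekly eval results for a system that starts at 95%
# and ends the month at 65%; the intermediate values are invented.
history = [0.95, 0.90, 0.80, 0.65]
if detect_drift(history, baseline=0.95):
    print("accuracy drift detected: queue for human review")
```

A production system would want more than a rolling mean, but even this much turns silent decay into a ticket with a date on it.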

Anthropic's post is the strongest argument yet that agentic analytics is a product, not a prompt. The intelligence layer is the product. They just spent five thousand words proving it.