Lessons from Anthropic’s implementation of agentic self-service BI

Key Takeaways

Anthropic case study: Anthropic published an article about their success with agentic self-service business intelligence, with lessons relevant to Microsoft BI, Copilot, data agents, and agentic development.
Analytics versus code: Self-service analytics is different from writing code with agents, because business data usually needs one correct answer rather than one of many acceptable implementations.
Data and context quality: Bad options are reduced by ensuring quality and completeness of both data and context through dimensional modeling, semantic modeling, testing, and CI/CD.
Metadata and skills: Good options become more likely when metadata, context, and skills are treated as first-class, versioned products.
Investment areas: The article points toward future investment in curated metadata, skills, governance, source control, tests, evals, and CI/CD.

This summary is produced by the author, and not by AI.

How Anthropic’s learnings can apply to Microsoft BI

Anthropic recently published an article sharing their success in enabling self-service business intelligence (SSBI), with some impressive marketing numbers. More importantly, they shared some key insights that align with our own thoughts and experiences over the last year. In this article, we want to highlight two things:

The importance of solid fundamentals, which are necessary for the success of BI irrespective of whether you are using AI or not.
Why metadata (like that of semantic models and reports) and context (like documentation and skills) should be a priority.

Why analytics is harder than coding

A distinction made in the article is important to keep in mind: analytics is harder to get right than coding. When an agent writes code, many paths work, and the wrong ones are caught cheaply by a type checker, a test, or a compiler. When it tries to answer a business question, many potential paths are wrong, and the single right answer depends on definitions that only your business users hold. There is no objective technical process that can guarantee correctness out-of-the-box. Inherently, this means that using agents with conversational BI and analytics presents some unique challenges.

Our colleague Eugene provided an analogy, summarized below:

Failure states of conversational BI

Anthropic’s focus is enabling conversational BI and agentic analytics. In this pursuit, they describe three “dangerous paths”, or failure states:

Ambiguity. The agent can’t map the user’s wording to the right fields.
Staleness. The agent can’t tell its context has gone stale, and it answers from outdated assumptions.
Retrieval. The agent can’t find the right field in a vast space of tables and schemas, even when it exists and is documented.

KB039 Figure 2 - Three failure states for agentic self-service BI described in Anthropic's article: ambiguity, where the agent cannot map the user's wording to the right fields; staleness, where outdated context leads the agent to answer from stale assumptions; and retrieval, where a documented field exists but is not retrieved

In the article, Anthropic describes how they built four layers to address these failures: data foundations, sources of truth, skills, and validation. In short, these layers did one of two things:

Eliminated bad options from being available, by:
- Applying the fundamentals with dimensional and semantic modelling.
- Ensuring high and consistent data quality.
- Setting up automated tests and deployment of code, metadata, and context.
Increased the likelihood of good options being available, by:
- Ensuring good documentation in metadata (descriptions, definitions, etc.).
- Ensuring a way to trace and discover lineages.
- Providing quality agent-first context in skills.

In this article, we want to highlight how you can do these things in Microsoft Power BI and Fabric.

Getting the fundamentals right helps you with or without AI

Before you can effectively use AI, you need a solid foundation. This is nothing new or surprising, and it’s a drum that many of us have been beating ever since interest in Copilot and AI first emerged. If you get the fundamentals right, then you are more likely to not only have success with AI, but BI in general. These are things that you should be doing and investing in, anyway.

While this is not a new insight, it’s an important reminder that these are all worthwhile investments, since they help across the board. It's also a reminder not to re-invent the wheel… for instance, by pointing an agent at raw tables with a context file. A properly governed semantic model already encodes those definitions, relationships, and rules.

KB039 Figure 3 - Six key learnings for agentic self-service BI: strong dimensional modeling and a semantic layer, governance and oversight with enforcement, treating metadata and context as first-class citizens, co-locating data and artifacts, designing tests and evals to measure performance and accuracy, and managing skills rather than only instructions or ephemeral prompts

In the Microsoft stack, that base you need to get right includes:

Knowing and applying best practices, enforced across the lifecycle rather than left to convention.
Using endorsement (the process to promote and certify trusted data and models) as an enforced process, so an agent searching for a concept finds one governed answer, not five plausible ones.
Having oversight: auditing tenant inventory and user activity, for example scheduled processes hitting APIs with the Fabric CLI, or async agents that audit and flag areas needing attention.
Leveraging tooling like Microsoft Purview to reduce the burden of governance and oversight.

NOTE

Interestingly, Anthropic’s article leaves out the importance of user training and adoption entirely. Inside an AI company, it’s a given that people know how to use agents and analyze the data. However, in most organizations, this requires a lot of effort. You need to make sure that you educate users about how to use AI effectively, when to use it, and when not to use it.

Remember that success with BI is not measured by a technical KPI; it’s not a technical problem. Rather, it’s whether people in your business actually use and derive value from the BI solutions and analytics you produce.

You should invest in testing and CI/CD

Throughout the article, Anthropic talked about the importance of source/version control and testing. They emphasized this not only with data artifacts (like a semantic model) but also context (like AI instructions or agent skills). A few of their more nuanced findings are worth carrying over:

We should invest (more) in testing: Testing ensures the quality and stability of the things we make, not just before release, but also as they change over time. In Fabric and Power BI, we need to admit that this is a rather weak area. There are very few real frameworks for testing aside from basic assertions. The lesson is that we all need to invest much, much more in this area if we want to scale our BI with or without agents.
Regression testing and result telemetry: Data should be snapshotted and then tracked over time for drift against a “ground truth”. Test results themselves should also not be ephemeral, but stored. In Fabric, it would make sense to store the results in OneLake after they’re run from a notebook, for instance. That way, they can be analyzed to catch slow regressions over time.
Use tests as gates: You should make sure that when tests fail, artifacts (and skills) are not deployed or used. Make the amount of testing proportional to the complexity and criticality of the business case, and require a certain pass rate before announcing or sharing anything with stakeholders. This kind of discipline ensures objective trust and can be applied to any consumption item.
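
To make the last two points concrete, here is a minimal Python sketch of a test gate that also persists its results. It is an illustration under assumptions, not something taken from Anthropic's article: the `CheckResult` shape, the 95% default threshold, and the JSON Lines file are all hypothetical. In a Fabric notebook, the output path could point to a lakehouse folder in OneLake.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class CheckResult:
    """One test outcome (hypothetical shape, for illustration only)."""
    name: str
    passed: bool
    critical: bool = False  # critical checks must always pass


def pass_rate(results):
    """Fraction of checks that passed (0.0 when there are no checks)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gate(results, min_pass_rate=0.95):
    """Allow deployment only if every critical check passed and the
    overall pass rate meets the (assumed) threshold."""
    critical_ok = all(r.passed for r in results if r.critical)
    return critical_ok and pass_rate(results) >= min_pass_rate


def append_results(results, path, run_at=None):
    """Append one JSON line per check so results persist across runs
    and can be analyzed later for slow regressions."""
    run_at = run_at or datetime.now(timezone.utc).isoformat()
    with Path(path).open("a", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps({"run_at": run_at, **asdict(r)}) + "\n")


if __name__ == "__main__":
    results = [
        CheckResult("total_sales_matches_ledger", True, critical=True),
        CheckResult("no_blank_customer_keys", True),
        CheckResult("measure_descriptions_present", False),
    ]
    print("deploy allowed:", gate(results))
```

Separating critical checks from the overall pass rate is one way to keep testing proportional to criticality: a single failed critical check blocks deployment, while less important checks only count toward the threshold.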

TIP

The Tabular Editor CLI provides several tools to help you automatically test your semantic model. From best practice rules to data quality checks and even full test suites, you can use the te-cli in GitHub Actions and Azure DevOps Pipelines to ensure that your semantic models are ready to deploy and use. You can download and try the CLI yourself (or with an agent).

There needs to be governance, monitoring, and oversight

You don’t know what you’re doing wrong or right if you don’t know what you’re doing at all.

Surfacing query context (where an answer came from) helped users judge whether to trust it. Copilot and Fabric Data Agents expose this, but make sure it's meaningful for a business user who doesn't know or care about Fabric icons, item types, and technical taxonomy.
Tracking "correction language" from users ("that's the wrong table", "you're missing the fraud filter") helped close loops on quality and even auto-improve docs and skills. For Copilot and Fabric Data Agents you need Purview or the Microsoft 365 unified audit log to get at this.
Accuracy is not a solved problem. Agents still give wrong answers, and users act on them anyway. Anthropic keeps human sign-off on anything reaching leadership, and tests against "golden dashboards". In Power BI and Fabric this is something one might handle with ground-truth queries, hand-written or extracted automatically from existing Power BI reports.
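
As a rough illustration of tracking "correction language", the sketch below scans user messages for correction phrases and counts them by type. It assumes you have already exported the chat messages as plain strings; the pattern names and regular expressions are hypothetical and would need tuning to how your own users phrase corrections.

```python
import re
from collections import Counter

# Hypothetical patterns, loosely based on the example phrases above.
CORRECTION_PATTERNS = {
    "wrong_object": re.compile(
        r"\bthat(?:['’]s| is) the wrong (?:table|column|measure|field|filter)\b",
        re.IGNORECASE,
    ),
    "missing_filter": re.compile(
        r"\byou(?:['’]re| are) missing\b.*\bfilter\b",
        re.IGNORECASE,
    ),
    "generic_wrong": re.compile(
        r"\bthat(?:['’]s| is) (?:not right|incorrect|wrong)\b",
        re.IGNORECASE,
    ),
}


def classify_correction(message):
    """Return the labels of all correction patterns found in a message."""
    return [
        label
        for label, pattern in CORRECTION_PATTERNS.items()
        if pattern.search(message)
    ]


def summarize(messages):
    """Count how often each kind of correction appears across messages."""
    counts = Counter()
    for message in messages:
        counts.update(classify_correction(message))
    return counts


if __name__ == "__main__":
    log = [
        "that's the wrong table",
        "you're missing the fraud filter",
        "Thanks, this looks good",
    ]
    print(summarize(log))
```

Counts like these are only a signal. A person still needs to review the flagged conversations and decide which documentation or skill to update.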

Treat metadata, skills, and other context as a first-class citizen

Metadata includes not only the calculations and structures of your semantic models and reports, but also the descriptions, annotations, and AI instructions you add to them. This metadata is not just important for AI; it also helps both users and other developers to use your semantic models and reports. Again, this was stuff we should have been doing anyway, but with AI, we have even more incentive to do it right.

Skills, though, are specific to AI. A skill is a collection of instruction files, scripts, and examples (among other things) that follows a specific structure and can be used well by AI. A skill can be as simple as a single Markdown file that describes a process, or a complex collection of dozens of documents, examples, scripts, and even full programs. The purpose of a skill is – in layman’s terms – to teach the AI about a particular process or concept so that it performs better than if it did not have or use that skill. An agent can choose to invoke a skill, but you can also invoke it explicitly yourself when you talk with it. Skills are thus switched on and off when needed, unlike memory files, which are “always available”.

NOTE

You need to own your context. It is not sufficient to rely on out-of-the-box skills from Microsoft or the community; you need to write your own skills that are tailored to your business, processes, and workflows. This can be quite challenging to get started, so we’re preparing some articles, videos, and trainings to show you how to do this.

Clearly, for agentic development and conversational BI, metadata and context matter a lot. For AI-generated queries, without skills, Anthropic saw 21% accuracy; with them, over 95%. Note that these skills are not technical in nature, but contain information about the specific business and data context. Metadata can't just be visible; it must also be well-annotated and curated with clear definitions, descriptions, and documentation. In practice for actual artifacts:

Follow good practices for organization and documentation. This includes things like having good naming conventions, setting clear, meaningful descriptions, using the proper format strings, and organizing the model with display folders.
Reconsider relying on PBIX and legacy metadata formats. The newer PBIP, TMDL, and PBIR formats make Power BI items legible and diffable for humans and agents alike. They are also necessary to make changes to reports and models when they aren’t loaded in Power BI Desktop or deployed to Fabric.
Prepare your data for AI: Set proper AI schema, instructions, and descriptions. Since Anthropic found AI doing this for itself wasn't successful, developers, data owners and stewards must own this curation.
Use source control, so changes to both code and descriptions are visible and reviewable. This implies shifting away from binaries and OneDrive version history, and possibly rethinking deployment pipelines as the primary promotion mechanism.
Use CI/CD: CI tests items automatically before automated deployment (CD). For instance you can use the Tabular Editor CLI to test semantic models, and deploy only once those tests pass.
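
As a small illustration of checking documentation practices automatically, the Python sketch below audits measure metadata for missing descriptions, format strings, and display folders. It assumes the metadata has already been extracted into plain dictionaries; the dictionary keys and the minimum description length are hypothetical. In practice, rules like these are typically expressed as best practice rules in your modeling tool rather than in a standalone script.

```python
def check_measure(measure, min_description_length=20):
    """Return a list of documentation issues for one measure.

    `measure` is a plain dict with hypothetical keys:
    name, description, format_string, display_folder.
    """
    issues = []
    description = (measure.get("description") or "").strip()
    if len(description) < min_description_length:
        issues.append("description missing or too short")
    if not measure.get("format_string"):
        issues.append("no format string")
    if not measure.get("display_folder"):
        issues.append("no display folder")
    return issues


def audit(measures):
    """Map each measure name to its issues, skipping clean measures."""
    report = {}
    for measure in measures:
        issues = check_measure(measure)
        if issues:
            report[measure["name"]] = issues
    return report


if __name__ == "__main__":
    measures = [
        {
            "name": "Total Sales",
            "description": "Sum of invoiced sales amount in EUR, excluding tax.",
            "format_string": "#,0.00",
            "display_folder": "Sales",
        },
        {
            "name": "Margin %",
            "description": "",
            "format_string": None,
            "display_folder": "Sales",
        },
    ]
    for name, issues in audit(measures).items():
        print(name, "->", ", ".join(issues))
```

A check like this can only tell you that a description exists, not that it is good. Whether it is clear and correct still needs review by the data owner or steward.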

This is relevant not just for the artifacts you create, but also for the context that steers the agents:

Create skills, not just AI instructions and ephemeral prompts. Skills are durable, versioned, and ideally testable procedural knowledge that you maintain like code.
Scaffold skills in an explicit and intentional way, tailored to your business. Anthropic describes a pattern of pairwise skills to route to more details. Importantly, these are tailored skills for their business, and not generic. This highlights the importance of owning your own context, and ensuring that your tools support the ability to provide these custom skills. You can’t rely only on templates from vendors and community, but really need to provide information for your business and team.
Automate testing of skills with evals. Anthropic tested performance before and after every change, since well-intentioned additions often made things worse. Test skill and context changes atomically, like code, or you can't tell whether a change helped. We've experienced this ourselves; designing a good eval is hard, which only emphasizes the need to invest in good tests. They pay off for both artifacts and their supporting context.
Treat context as a recurring and not one-off task: Context must be continuously curated to move with your business and users, just like your semantic models and reports. As an example, Anthropic describes a drift from 95% accuracy at launch to 65% over a month. If you just do a big one-off audit-and-fix project, you’ll be back at square one in weeks to months.
Re-use skills across your teams and organizations: Another good pattern was ensuring that skills are used everywhere, not just in one tool or context. There are many places where you need relevant context about your business, so coming up with solutions to centralize and distribute these documents is imperative. This is analogous to ensuring centralized distribution of a data asset… except now, it’s a context asset.
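
The idea of testing a skill change before and after can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not Anthropic's eval setup: an "agent" here is any function that takes a question and returns a number, and the golden set is a hypothetical list of questions with known correct answers.

```python
import math


def is_correct(answer, expected, rel_tol=1e-6):
    """Compare a numeric answer to the expected value within a tolerance."""
    return math.isclose(answer, expected, rel_tol=rel_tol)


def accuracy(agent, golden_set):
    """Share of golden questions the agent answers correctly.

    `agent` is a callable: question -> number.
    `golden_set` is a list of (question, expected_answer) pairs.
    """
    if not golden_set:
        return 0.0
    hits = sum(is_correct(agent(q), expected) for q, expected in golden_set)
    return hits / len(golden_set)


def compare_change(baseline_agent, candidate_agent, golden_set, max_regression=0.0):
    """Evaluate one atomic change: accept it only if accuracy does not
    drop by more than `max_regression` versus the baseline."""
    before = accuracy(baseline_agent, golden_set)
    after = accuracy(candidate_agent, golden_set)
    return {"before": before, "after": after, "accept": after >= before - max_regression}


if __name__ == "__main__":
    golden = [("revenue 2024", 100.0), ("active customers", 250.0), ("churn %", 4.0)]

    # Stub agents standing in for "agent without the skill change" and
    # "agent with the skill change"; unknown questions return NaN.
    baseline = {"revenue 2024": 100.0, "active customers": 250.0, "churn %": 4.1}
    candidate = {"revenue 2024": 100.0, "active customers": 200.0, "churn %": 4.1}

    result = compare_change(
        lambda q: baseline.get(q, float("nan")),
        lambda q: candidate.get(q, float("nan")),
        golden,
    )
    print(result)
```

Running the same comparison on a schedule, rather than only when a skill changes, is one way to notice the kind of accuracy drift described above.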

Some things we may need to change

Not everything aligns with current dogma and practices. Several findings push against how many of us work and plan our tenants today, and they're worth a rethink rather than a reflex:

Self-service simplicity vs enterprise complexity: In Fabric and Power BI, there’s been this idea that we should shield self-service or decentralized users from the complexity of things like metadata and source control, for example by using the PBIX format and OneDrive instead of PBIP and Git. With both conversational BI and agents, though, it may be time to retire these ideas and consider metadata-first formats and source control a prerequisite for agent use.
Workspace planning: Historically, many teams split Fabric and Power BI items into separate workspaces by item type. Anthropic instead argues in favor of co-locating data and artifacts so that they’re easier for an agent to find and use.
Query stores aren’t a solution as raw context: Some teams and organizations have experimented with logging queries (to the semantic model, for instance with Azure Log Analytics) and providing this information to data agents. Anthropic noted that providing the agent a history of raw queries doesn’t seem to help. Instead, treating this like additional data to distill into more helpful context and skills or identify patterns may produce better results.
Agent-generated documentation and context is also not a (total) solution: Many people are deferring to AI to create documentation, skills, metadata, and other context. Anthropic emphasizes several times in their article the importance of human-owned and curated context. AI can be involved in the process (to flag gaps, make informed edits or drafts) but the core ownership must lie with a person. They do describe ways to better leverage LLMs in this process, though, with templates of reference documentation, for instance.

In conclusion

Anthropic's results reinforce what good BI teams already should value: solid dimensional modelling, a governed semantic layer with human-curated definitions, data-quality checks, and endorsement. They stress the importance of a mature data lifecycle with source control and CI/CD. In Microsoft Power BI and Fabric, tools like Tabular Editor can help you achieve that with your semantic models. The new part is treating metadata, context, and skills as a first-class, versioned product, with rigorous discipline and in source control, testing changes before use.

