Why LLM tracking is tougher than you think

With traditional SEO we monitor keyword rankings and backlinks as proxies for visibility and authority. New tools are starting to offer AI visibility tracking, but we’re still early. Building something that goes beyond brand sentiment and gives you actionable data to improve mentions and LLM visibility is not as straightforward as it sounds.
Here’s a look at why tracking your presence in LLMs is challenging, and why most of the solutions available today offer only a partial picture.
The limits of current tracking tools
I’ve researched over two dozen AI search ranking tools, including Semrush’s AI Toolkit, Ahrefs Brand Radar, Profound, FalconRank, Geostar, Writesonic, ZipTie, Nightwatch, AthenaHQ, Otterly, HubSpot’s AI share of voice tool, SE Ranking, and more. Most are expensive, some exorbitantly so for what they do, and even proven tools like Semrush (which we use for 100+ agency clients and internal projects) are still early on AI tracking features. I’ve used traditional rank trackers for more than 15 years, and the gap between what we could track in SEO and what we can track with LLMs is significant.
My conclusion is that the real challenge isn’t just the pricing or polish of the tools. It’s that LLMs don’t return consistent or deterministic results. They generate answers based on probabilistic models that shift depending on training data, prompt phrasing, context windows, and model versioning.
So what does that mean for tracking?
There’s no fixed “ranking” for a given keyword. What you get back is a sampled snapshot, not a stable leaderboard.
Prompt phrasing matters. A small change in wording can completely change the answer and the brands mentioned.
You don’t know the original prompt. Unlike Google, where you can target specific queries, LLMs are opaque. You’re guessing what users might have asked and hoping your brand shows up.
The LLM tracking tools currently available can give you a rough sense of your presence. Some results may even reflect hallucinated citations from the model. That isn’t a problem with the tracking tools themselves. It’s just what happens when you try to monitor a system that builds responses dynamically and can’t show you what the original question was.
What’s actually being measured?
Take Nightwatch’s LLM tracking feature as an example. It lets you enter keywords and view how your brand performs in outputs from models like GPT-4o mini or Claude 3.5 Haiku. You get metrics like:
Average Rank: An average position across all generated outputs tied to your tracked prompts
Rank Distribution: How often you appear in different visibility tiers
Keyword Movement: Shifts in visibility over time
These rankings are the result of simulated prompts, not real user interactions. You’re not measuring what actual users are asking. You’re measuring how your brand shows up across a fixed set of test queries that approximate user behavior.
This kind of tracking gives you valuable directional data, especially over time. But you won’t get access to the original prompts, and you can’t tell how often users were actually exposed to your brand in a real-world setting. The visibility is estimated, not observed.
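To make the “sampled snapshot” idea concrete, here’s a rough sketch of how aggregate numbers like these could be computed from repeated test runs. The tier thresholds and the sample data below are invented for illustration, and this isn’t how Nightwatch actually calculates its metrics.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: the position at which the brand appeared in each
# generated answer for a tracked prompt (None = not mentioned at all).
sampled_positions = [1, 3, None, 2, 1, None, 5, 2]

def average_rank(positions):
    """Mean position across the runs where the brand was mentioned."""
    mentioned = [p for p in positions if p is not None]
    return round(mean(mentioned), 2) if mentioned else None

def rank_distribution(positions):
    """Bucket each run into a rough visibility tier."""
    def tier(p):
        if p is None:
            return "not mentioned"
        if p == 1:
            return "top"
        if p <= 3:
            return "high"
        return "low"
    return Counter(tier(p) for p in positions)

print(average_rank(sampled_positions))       # 2.33
print(rank_distribution(sampled_positions))  # counts per tier across the sampled runs
```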
Impressions ≠ Influence
One of the few true data signals you can reliably measure is the referral traffic from AI tools that pass referrer data. If someone clicks through to your site from an AI-generated response, it will show up in your analytics.
This creates a simple but useful framework in GA4:
Impressions = Visibility: How often your brand is referenced or cited in AI responses
Clicks = Action: Whether those mentions led to someone visiting your site
If you’re getting lots of impressions but no clicks, it might be a positioning issue. Your brand is showing up but not compelling enough to act on. If you’re getting neither, it likely means you’re not present in the model’s training data or retrieved sources at all.
It’s not perfect, but it’s one of the few signals that ties visibility to behavior. And that makes it actionable.
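If you want to pull the click side of this out of your own data, one low-tech option is to classify sessions by referrer hostname before counting them. The hostname list below is my own assumption; AI tools change their referrer behavior over time, and some strip it entirely, so treat it as a starting point rather than a definitive list.

```python
from urllib.parse import urlparse

# Assumed referrer hostnames for AI tools. These change over time and some
# tools strip the referrer entirely, so expect gaps.
AI_REFERRER_HOSTS = {
    "chatgpt.com",
    "chat.openai.com",
    "perplexity.ai",
    "www.perplexity.ai",
    "copilot.microsoft.com",
    "gemini.google.com",
}

def is_ai_referral(referrer_url: str) -> bool:
    """True if a session's referrer points at a known AI assistant."""
    host = urlparse(referrer_url).netloc.lower()
    return host in AI_REFERRER_HOSTS

# Example: tag exported sessions so AI-driven clicks can be counted separately.
sessions = [
    {"referrer": "https://chatgpt.com/", "landing_page": "/pricing"},
    {"referrer": "https://www.google.com/", "landing_page": "/"},
]
ai_clicks = [s for s in sessions if is_ai_referral(s["referrer"])]
print(len(ai_clicks))  # 1
```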
Why the model matters
Not all AI search experiences are the same.
ChatGPT with browsing behaves differently than models that rely only on internal knowledge
Claude leans on Google, ChatGPT on Bing, and Perplexity uses a blend
Some LLMs use retrieval-augmented generation. Others hallucinate citations outright
This means your visibility isn’t just about content or links. It’s about how the LLM is built, what data it has access to, and how it decides to surface information.
With AI-powered tools like Perplexity, you can sometimes reverse-engineer your presence by checking which source URLs are cited. With pure LLM outputs, like GPT-4o mini without browsing, you’re flying blind. There’s no way to know why you were or weren’t mentioned. And the next time you ask the same question, the answer may be different.
What's missing from current LLM tracking capabilities
Current LLM tracking tools can’t show you what users are actually asking or how the model interpreted the prompt. This is not a shortcoming of the tools. It’s a limitation of how LLMs work.
Outputs are fluid. Prompt phrasing, timing, and model updates all affect what’s returned. Tools can track whether your brand is mentioned in responses to test prompts, but they can’t surface deeper intent or sentiment unless it’s explicitly stated in the response.
Most tools rely on detecting direct citations. That works for some brands, but not all. More nuanced references or indirect mentions may be missed. Only a few platforms pass referrer data that lets you tie AI-generated impressions to on-site behavior.
What you’re left with is directionally useful data. You can see how often your brand is mentioned across test queries, whether you’re gaining or losing visibility, and which prompts you’re associated with. But you won’t see real prompts, real users, or real intent.
A potential way forward
One approach is to triangulate visibility using multiple signals:
Run scheduled prompt tests to monitor directional presence (a minimal sketch follows this list)
Track AI referral traffic from tools that pass referrer headers
Use branded search queries and on-site behavior to identify what users are asking
Optimize third-party citations and mentions that LLMs might ingest
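As a minimal sketch of the first item, here’s what a scheduled prompt test could look like using the OpenAI Python SDK. The prompts and brand terms are placeholders, and a real version would run on a schedule (cron, a CI job) and append results to a log you can chart over time.

```python
from datetime import date
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

# Placeholder prompt set and brand terms; swap in your own.
PROMPTS = [
    "What are the best tools for tracking AI search visibility?",
    "Which SEO platforms should a small agency consider?",
]
BRAND_TERMS = ("acme analytics", "acme")  # hypothetical brand

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_prompt_test(prompt: str) -> dict:
    """Ask one test prompt and record whether the brand shows up in the answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = (response.choices[0].message.content or "").lower()
    return {
        "date": date.today().isoformat(),
        "prompt": prompt,
        "mentioned": any(term in answer for term in BRAND_TERMS),
    }

if __name__ == "__main__":
    for result in map(run_prompt_test, PROMPTS):
        print(result)  # in practice, append to a CSV or database and chart the trend
```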
We’re seeing a wave of visibility dashboards from established tools like Semrush and from new VC-backed startups. Some offer SERP overlays. Some focus on prompt simulation. Others monitor AI citations across articles and knowledge panels.
For now, I use my own lightweight dashboard that tracks visibility with a defined set of prompts across LLMs. It tells us what’s being cited, what’s getting clicked, and what’s trending up or down. It doesn’t try to reverse-engineer the models. It helps us observe what’s actually happening and lets us adjust accordingly.