How to Run Prompt-Level SEO Experiments to Diagnose Your AI Search Visibility in 2026

A small-business methodology for designing, running, and reading 5–10 prompt-level experiments per month against ChatGPT, Perplexity, and Google AI Mode — without enterprise tooling.

Haley C.R. Button-Smith - Content Creator / Digital Marketing Specialist at Button Block
Published: May 9, 2026 · 15 min read

For most of the last two years, small business owners have been told to “track AI search visibility” without anyone really explaining what to track or how to test it. The advice usually arrives wrapped in enterprise tooling — Profound dashboards, Otterly.ai monitors, Peec daily snapshots — that costs more per month than many Fort Wayne businesses spend on their entire SEO program. The result has been a strange middle ground where small businesses either ignore AI visibility entirely or pay for tracking software they barely use.

There is a third option, and it is the methodology a growing number of in-house teams are quietly adopting in 2026: run your own prompt-level experiments. Pick a hypothesis, change one thing on your site, run the same set of prompts against the AI engine before and after, and read the results yourself. Done thoughtfully, this gets you most of the diagnostic value of the enterprise tools at roughly the cost of a notebook and an hour a week. Done thoughtlessly, it gets you noise that will lead you to make bad changes to your site with confidence.

This guide walks through how to design and run a prompt-level experiment that you can actually trust on a small-business sample size. It draws on Jason Tabeling's recent piece for Search Engine Land outlining the methodology Further uses with brand clients, the AirOps citation study published in April, and what we have been seeing across the small Fort Wayne and Northeast Indiana businesses we work with. We've included a Fort Wayne example near the end so the abstract framework lands on something concrete.

Key Takeaways

  • Prompt-level SEO experiments are structured before/after tests where you change one thing on your site, then re-run the same set of prompts against ChatGPT, Perplexity, or Google AI Mode.
  • Identical prompts return different answers across days — “prompt drift” — which is why running each prompt on multiple consecutive days is non-negotiable for trustworthy results.
  • The standard small-business cadence is 5–10 prompts daily for seven days as a baseline, then the same prompts for seven days after a single change.
  • With 35–70 observations per condition, you cannot detect tiny effects, but you can detect meaningful ones in inclusion rate or position.
  • Most prompt-level experiments fail by changing too many things at once or comparing across different LLM versions — neither produces a usable signal.
  • The single highest-value experiment for most small businesses is an FAQ-schema or H2-promotion test on a high-intent service page.

What Are Prompt-Level SEO Experiments?

Prompt-level SEO experiments are structured tests where you treat individual AI prompts the way an SEO would treat individual keyword queries. Instead of asking “are we more visible in AI search this month?” you ask “did this specific change to this specific page move our inclusion rate on these specific prompts?” That shift from broad question to narrow, testable claim is most of what makes the methodology work.

Jason Tabeling's Search Engine Land piece frames this with an “if, then, because” structure that is genuinely useful for keeping experiments honest. If we add FAQ schema to pages that already have Q&A sections in their HTML, then we should see those sections more often included in ChatGPT responses, because explicit schema markup makes the structure easier for the LLM to ingest. Each clause does work: “if” forces you to define the change, “then” forces a measurable prediction, and “because” forces you to articulate why it should work — which is what stops you from running tests with no theory and finding random results.
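
If that FAQ-schema hypothesis is the one you end up testing, the single change is a schema.org FAQPage block built from the Q&A pairs already visible in the page's HTML. Here is a minimal sketch in Python, so the JSON stays valid; the question and answer text are hypothetical, and the printed output belongs inside a <script type="application/ld+json"> tag on the page.

```python
import json

# Hypothetical Q&A pair taken from an existing on-page FAQ section; the wording
# is illustrative, not from a real client page.
faq_items = [
    {
        "question": "How often should I service my furnace in Indiana?",
        "answer": "Once a year, ideally in early fall before the first sustained cold snap.",
    },
]

# Build a schema.org FAQPage block from the Q&A pairs already in the HTML.
faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": item["question"],
            "acceptedAnswer": {"@type": "Answer", "text": item["answer"]},
        }
        for item in faq_items
    ],
}

print(json.dumps(faq_schema, indent=2))
```

The markup should mirror Q&A text that is already visible on the page, which is exactly the condition in the hypothesis: the content is there, the schema just makes the structure explicit.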

The practical scope is narrower than enterprise GEO tools. You are not trying to measure your share of voice across 500 prompts in real time; you are trying to answer a single question about a single change to a single page or content block. That is the right altitude for a small business. We've made the broader case for narrow measurement in why prompt volume is the wrong GEO metric in 2026 — tracking 500 prompts where your brand “appears” is a vanity metric. Running 10 prompts where you can prove a change moved the needle is a real signal.

The metric layer matters too. Search Engine Land's GEO metrics roundup lists eight things worth tracking — citation frequency, share of model voice, answer inclusion rate, entity recognition, sentiment, prompt coverage, retrieval success, and downstream conversion influence — and prompt-level experiments are how you actually move the first three. The diagnostic value comes from pairing this with a structural understanding of where AI search visibility breaks down. Our 10-gate AI search pipeline diagnostic maps the failure points — crawl, parse, ingestion, retrieval, ranking, citation. Prompt-level experiments are how you test which gate a specific page is failing at. The pipeline tells you where to look; the experiment tells you whether your fix worked.


Why Run Experiments Instead of Just Watching Visibility Tools?

Three reasons we keep coming back to. The first is that LLM outputs are noisy in ways that watching dashboards hides. Tabeling's piece names this directly: “prompt drift” means the same prompt yields different results across days, even within hours. A daily monitoring tool that screenshots one response per prompt per day is averaging out variance you should know about. Run the same prompt three times in a row and you may get three meaningfully different answers — a brand cited once, position three the next time, and not mentioned at all on the third. If your visibility tool only saw the first one, you would draw a wrong conclusion.

Otterly.ai's published methodology tracks citations and brand mentions across ChatGPT, Perplexity, Google AI Overviews, Google AI Mode, Gemini, and Copilot, and one of their large-scale studies analyzed “1+ million AI citations across ChatGPT, Perplexity, and Google AI Overviews from January-February 2026.” That kind of sample size beats noise by sheer volume. Most small businesses cannot run that volume, so they have to beat noise the other way — by repetition on a small set of prompts.

The second reason is causal isolation. Visibility tools tell you that your AI inclusion rate went up or down, but they cannot tell you why because too many things change between snapshots. The website was updated. The model version changed. Your competitors published new content. A news event shifted the answer space. Prompt-level experiments hold most of those variables constant by design — same prompts, same model, same week — so a change in inclusion rate has a much shorter list of plausible causes.

The third reason is cost. Otterly.ai, Profound, and Peec are real products that real teams use; we have used them. They start at hundreds of dollars per month and scale with prompt volume. For a Fort Wayne dentist or a regional manufacturing company, that is a hard line item. A prompt-level experiment notebook and an hour a week is a $0 start, and it gets you the answers that matter most for content decisions. We make the case for cost-aware AI tooling in our piece on AEO tools for Fort Wayne small businesses, and the experiment methodology is the free baseline that decides whether you ever need the paid tools at all.


How Do You Design a Prompt-Level Experiment That Actually Works?

The standard structure that holds up across the methodologies I have read is straightforward and worth running through every time:

  1. Pick a single hypothesis. Use the if/then/because frame. Not “improve our AI visibility” — that's not testable. “If we add a 60-word direct-answer paragraph at the top of /service/hvac-tune-up, then our inclusion rate on the prompt ‘best HVAC tune-up service Fort Wayne’ will rise from baseline, because LLMs preferentially extract content from the top of pages.” (Steps 1–4 together form a small plan record; see the sketch after this list.)
  2. Choose 5–10 target prompts. Tabeling's guidance: “Execute a set of 5-10 target prompts daily for seven consecutive days.” Pick prompts that a real customer would actually use, not the ones you wish they would. Mix exact-intent prompts (“HVAC tune-up Fort Wayne”) with broader research prompts (“how often should I service my furnace in Indiana”) and at least one comparison prompt (“HVAC companies in Fort Wayne”).
  3. Lock the model and version. Tabeling is explicit: record the “specific model and version used for testing.” Models are updated constantly and a version change between baseline and measurement is a fatal flaw. If you are testing on ChatGPT, write down whether it was GPT-5.1, the date, the time, and whether you were logged in. The same applies to Claude, Perplexity, or Google AI Mode.
  4. Define the testing environment. Cleared browser cache, no login state where possible, same device, same IP location range. Log everything. The whole methodology is built on holding things constant; one variable change invalidates the comparison.
  5. Run baseline for seven consecutive days. 5 prompts × 7 days = 35 observations. 10 prompts × 7 days = 70 observations. Each observation is one run of one prompt. Capture the entire response, the citation list if visible, and a position-in-response number for any mention of your brand.
  6. Implement one change. One. The “single-paragraph swap” methodology Tabeling describes — modifying only one targeted text element — is what makes the result interpretable. If you simultaneously add schema, rewrite an H2, and update internal links, you cannot tell which change moved the needle.
  7. Wait for re-indexing. This is the step most small businesses skip. AI search engines do not re-ingest your page the moment you publish; depending on the engine, the lag can be days to weeks. Tabeling's protocol assumes a meaningful gap between change and re-measurement. We typically wait 7–14 days.
  8. Run the measurement phase. Same 5–10 prompts, same model and version (or as close as you can get), same environment, seven consecutive days.
  9. Compare averages, not single observations. Average inclusion rate baseline vs. measurement. Average position-in-response baseline vs. measurement. Sentiment shifts if you tracked them. Single-day swings tell you nothing.
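
Steps 1 through 4 amount to a plan record you write down once and never touch mid-experiment. Here is a minimal sketch of what that record can look like in Python; the field names and the HVAC example values are ours, not part of Tabeling's protocol.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the plan must not change once the baseline starts
class ExperimentPlan:
    hypothesis_if: str        # the single change you will make
    hypothesis_then: str      # the measurable prediction
    hypothesis_because: str   # why the change should work
    prompts: tuple[str, ...]  # 5-10 prompts, exact wording, never edited mid-phase
    engine: str               # ChatGPT, Perplexity, or Google AI Mode
    model_version: str        # also record the specific model and version per run
    environment: str          # browser state, login state, device, time window
    baseline_days: int = 7
    measurement_days: int = 7
    reindex_wait_days: int = 10  # gap between the change and re-measurement

plan = ExperimentPlan(
    hypothesis_if="Add a 60-word direct-answer paragraph at the top of /service/hvac-tune-up",
    hypothesis_then="Inclusion rate on the target prompts rises from baseline",
    hypothesis_because="LLMs preferentially extract content from the top of pages",
    prompts=(
        "best HVAC tune-up service Fort Wayne",
        "how often should I service my furnace in Indiana",
        "HVAC companies in Fort Wayne",
    ),
    engine="ChatGPT",
    model_version="GPT-5.1",
    environment="Chrome incognito, cleared cache, logged out, 10-11am EDT",
)
```

Freezing the record is the point: if you find yourself wanting to edit a prompt mid-baseline, that is a new experiment, not a revision of the current one.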

Two specific traps to avoid. First, don't use prompts you control the answer to (your brand name alone). Those are always going to cite you and don't measure anything useful. Second, don't pick prompts where you already rank #1 on Google — the AirOps study covered by Search Engine Land found pages in the top organic position were cited 58.4% of the time versus 14.2% for position ten, so you have very little headroom on prompts where you are already top-ranked. Pick prompts where you are mid-pack and citations are genuinely at stake.


What Should You Capture and How Long Should You Run It?

A small spreadsheet is enough infrastructure. The columns we use:

  • Prompt ID: Stable identifier so you can re-run the exact wording
  • Prompt text: Exact wording, no edits between runs
  • Model + version: e.g., “ChatGPT GPT-5.1, 2026-05-12 10:14 EDT”
  • Run number: 1–7 for baseline, 1–7 for measurement
  • Cited (Y/N): Was your domain or brand named in the response?
  • Position: If cited, where (paragraph number, citation index, etc.)
  • Sentiment: Positive / neutral / cautious — quick eyeball, not a model
  • Competitor citations: Which other brands were named, in what order
  • Notes: Anything off-pattern — model refused, mentioned a wrong fact, etc.

That is nine columns. A spreadsheet of 70 rows (10 prompts × 7 days) takes about 60–90 minutes per phase to fill out by hand and is genuinely usable. Tabeling's guidance also recommends maintaining “an organized, time-stamped repository of the exact prompt queries used for baseline and measurement phases” with inclusion rate, position-in-response, and sentiment/framing data — that is what the spreadsheet is.
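
If you would rather keep the log as a CSV file than a spreadsheet tab, a minimal Python sketch of the same nine columns looks like this; the file name, helper function, and example row are ours and purely illustrative.

```python
import csv
from pathlib import Path

# The nine columns from the list above, in the same order.
COLUMNS = [
    "prompt_id", "prompt_text", "model_version", "run_number",
    "cited", "position", "sentiment", "competitor_citations", "notes",
]

LOG_FILE = Path("prompt_experiment_log.csv")  # hypothetical file name

def log_observation(row: dict) -> None:
    """Append one observation (one run of one prompt) to the log."""
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Example: one baseline run of one prompt, filled in by hand after reading the response.
log_observation({
    "prompt_id": "P1",
    "prompt_text": "water heater repair Fort Wayne",
    "model_version": "ChatGPT GPT-5.1, 2026-05-12 10:14 EDT",
    "run_number": 3,
    "cited": "Y",
    "position": 2,
    "sentiment": "neutral",
    "competitor_citations": "Competitor A; Competitor B",
    "notes": "",
})
```

One call per prompt per day; a full baseline phase is the 35–70 rows the methodology calls for.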

Two honest notes on statistics. With 35–70 observations per condition, you can reliably detect large effects (inclusion rate moving from 20% to 50%) but not small ones (20% to 28%). If your measurement-phase inclusion rate is just a few percentage points above baseline, the change might be drift, not a real effect. The fix is repetition — run the same experiment again the following month. If it replicates, the effect is real; if it doesn't, it was noise.
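
A rough way to sanity-check whether a gap at this sample size is worth trusting is a plain two-proportion z-test. The sketch below uses only the Python standard library and the 20%-to-50% and 20%-to-28% examples from the paragraph above; repeated runs of the same prompt are not truly independent observations, so treat it as a back-of-the-envelope check, not a rigorous power analysis.

```python
import math

def two_proportion_p(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Two-sided p-value (normal approximation) for a change in inclusion rate."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_b / n_b - hits_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 49 observations per phase (7 prompts x 7 days), as in the example later in this guide.
print(two_proportion_p(10, 49, 25, 49))  # ~20% -> ~50%: p around 0.002, a gap worth trusting
print(two_proportion_p(10, 49, 14, 49))  # ~20% -> ~28%: p around 0.35, could easily be drift
```

Even when the p-value is small, replication the following month is still the better confirmation; the test only tells you the gap is unlikely to be pure drift.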

You also cannot directly compare experiments run on different models. ChatGPT inclusion rates and Perplexity inclusion rates measure different things on different retrieval pipelines; treating them as the same metric is a common rookie mistake. If you want a multi-engine view, run parallel experiments — same change, separate baseline and measurement on each engine — and report each separately.

How Do You Read the Results Without Fooling Yourself?

Three honest interpretation rules. First, look at the change in inclusion rate, not the absolute level. If your baseline was 18% inclusion and your measurement was 27%, that's a real signal. If your baseline was 18% inclusion and your measurement was 19%, that's noise.

Second, separate inclusion from position. A prompt where you used to be cited in position 4 and are now cited in position 1 is a meaningful win even if your raw inclusion rate didn't move. Conversely, an inclusion-rate gain on prompts where you are buried at position 8 is worth less than it looks.

Third, look for consistency across the 5–10 prompts in your set. A change that moves three out of ten prompts and barely touches the others is real but narrow — your improvement applies to the kind of query that resembles those three prompts, not to the whole topic. A change that moves seven out of ten in the same direction is broad.
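
Here is a minimal sketch of that per-prompt read, assuming you have transcribed the Cited (Y/N) column from the log into per-prompt lists; the prompt IDs and flag values are illustrative, not client data.

```python
# Hypothetical cited/not-cited flags (1 = cited, 0 = not) per run, keyed by
# prompt ID, seven runs per phase. Values are illustrative only.
baseline = {
    "P1": [1, 0, 0, 1, 0, 0, 0],
    "P2": [0, 0, 0, 0, 0, 1, 0],
    "P3": [1, 1, 0, 1, 0, 1, 0],
}
measurement = {
    "P1": [1, 1, 0, 1, 1, 0, 1],
    "P2": [0, 0, 1, 0, 0, 0, 0],
    "P3": [1, 1, 1, 1, 0, 1, 1],
}

improved = 0
for pid in baseline:
    base_rate = sum(baseline[pid]) / len(baseline[pid])
    meas_rate = sum(measurement[pid]) / len(measurement[pid])
    if meas_rate > base_rate:
        improved += 1
    print(f"{pid}: {base_rate:.0%} -> {meas_rate:.0%} ({meas_rate - base_rate:+.0%})")

# Breadth check: a change that lifts most prompts is broad; one that lifts only
# a couple is real but narrow.
print(f"{improved} of {len(baseline)} prompts improved")
```

In a real run you would pull these flags straight out of the Cited (Y/N) column rather than typing them in.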

There is a useful structural reminder from Search Engine Land's recent piece on intent alignment vs. technical SEO: once a site has reached technical parity with competitors, the next gain comes from intent alignment, not more technical work. The author quotes the threshold directly — “once a site reaches technical parity with its competitors — the point at which a proper infrastructure no longer gives you an advantage — Google shifts its ranking criteria toward relevance.” The same logic applies inside AI search. Most prompt-level experiments where the result is “no change” are really telling you the issue is intent — your content is technically fine, but it is answering a slightly different question than the prompt. The fix in those cases is rarely more schema; it is rewriting the answer to match the prompt's actual ask. We covered the broader Google Search Console version of this in intent gap analysis with Google Search Console.

For prompts where you do see a clear lift, the next move is to figure out why so you can replicate it. A “changed nothing structural, just rewrote the first paragraph as a direct answer, gained 14 points of inclusion rate” result is actionable across your whole site. A “added FAQ schema and the result is unclear” might warrant another month of testing on a different page.


What Does This Look Like for a Fort Wayne Small Business?

Concretely, here's how a typical Fort Wayne service business — say, a residential plumber serving Allen and DeKalb counties — would run their first experiment. This is illustrative, not a case study from a specific client.

The hypothesis. Adding a 70-word direct-answer paragraph at the top of /services/water-heater-repair will increase inclusion rate on prompts about water heater repair in Northeast Indiana, because LLMs disproportionately extract content from the first 200 words of a service page.

The 7 prompts.

  1. Water heater repair Fort Wayne
  2. How much does water heater repair cost in Fort Wayne
  3. Best plumber for water heater repair in Allen County
  4. How long does water heater replacement take
  5. Plumbing companies near Auburn Indiana water heater
  6. When should I replace vs repair my water heater
  7. Same-day water heater repair Northeast Indiana

The model. ChatGPT GPT-5.1, logged out, Chrome incognito, run between 10 and 11am EDT each day, cache cleared between runs.

The baseline. 7 days × 7 prompts = 49 observations. The plumber records: cited or not, citation position, which competitors were named.

The change. A single 70-word direct-answer paragraph at the top of the service page that names a price range, the typical timeline, and the service area in plain prose. No schema added, no other content changed.

The wait. 10 days for re-indexing.

The measurement. Same 7 prompts × 7 days = another 49 observations.

The honest read. Suppose baseline inclusion was 12 of 49 (24%) and measurement was 19 of 49 (39%). That's a 15-point gain across 49 observations — a meaningful signal, especially if the gain is concentrated on prompts 1, 3, and 7 (the explicit-intent prompts). If the same change shows no movement on the research prompts (4 and 6), that tells the plumber what kind of content to add next: research-stage content with explanatory paragraphs, not just service-page direct answers.

This is roughly the workflow we walk Fort Wayne and Auburn clients through, paired with the broader competitive context of a Fort Wayne AI competitor analysis so the client knows whether their competitors have been moving on the same prompts. The hyper-local angle matters because the competitive landscape on a Fort Wayne service prompt is much narrower than the landscape on a national prompt — meaningful gains are easier to detect because there are fewer brands fighting for citation slots.


Need Help Designing Your First Prompt-Level Experiment?

The methodology in this guide is intentionally do-it-yourself. A small business can absolutely run their first three experiments without paid tools or outside help. We've laid out the structure that way on purpose, because we'd rather see ten Fort Wayne businesses run honest small experiments than two of them paying for enterprise dashboards they don't have time to read.

Where we do help is when the experiments stop telling a clear story — when results are mixed, when sample sizes need to grow, or when an experiment surfaces a deeper content or positioning issue that needs structural work. Our AEO services cover that next layer: experiment design support, results interpretation, and the content rewrites that follow when an experiment reveals an intent gap. If you want to start with a free 30-minute review of your current AI visibility, contact us — we'll send back a one-page recommendation on the single experiment we'd run first.

For broader context, our answer engine optimization guide covers the full AEO stack, and our 10-gate AI search pipeline diagnostic is the structural complement to the experiment methodology in this guide — together they cover where to look and how to test.

Want a free read on your AI search visibility?

Button Block runs free 30-minute AI visibility reviews for Fort Wayne and Northeast Indiana small businesses and recommends the single prompt-level experiment we'd run first.

Frequently Asked Questions

What is a prompt-level SEO experiment?
A prompt-level SEO experiment is a structured before/after test where you run a fixed set of prompts against an AI search engine (ChatGPT, Perplexity, Google AI Mode, etc.), make a single change to your site, wait for re-indexing, and re-run the same prompts to measure the effect on inclusion rate, citation position, and sentiment. Jason Tabeling's Search Engine Land piece frames the methodology with an "if, then, because" hypothesis structure.
How many prompts and how many days do I need?
The widely cited methodology is 5–10 prompts run daily for 7 consecutive days as baseline, and the same prompts for 7 consecutive days after a change. That is 35–70 observations per condition, enough to detect meaningful effects but not tiny ones. If a measurement effect is small, repeat the experiment the following month — replication is how you separate signal from noise on small samples.
Why does the same prompt return different answers on different days?
This is "prompt drift," and it is a property of how LLMs generate responses, not a bug. Models include controlled randomness in their output, retrieval pipelines re-rank sources differently across runs, and content can move in or out of the model's working context. The countermeasure is running each prompt across multiple consecutive days and comparing averages instead of single answers.
Can a Fort Wayne or Northeast Indiana small business actually run these experiments without expensive tooling?
For the question "did this specific change move our inclusion rate on these specific prompts," yes — and the small-budget reality of most Fort Wayne, Auburn, and Allen County service businesses is exactly why this DIY methodology fits. A spreadsheet, a clean test environment, and 60–90 minutes a week are enough to start. Enterprise tools like Profound, Otterly.ai, and Peec become valuable when you need real-time monitoring across hundreds of prompts or a published methodology your team can audit; they are not required for an NE Indiana SMB to begin.
What is the most common mistake small businesses make running these experiments?
Changing more than one thing between baseline and measurement. Adding schema, rewriting H2s, updating internal links, and publishing new pages all at once produces a result you cannot interpret. The "single-paragraph swap" discipline — change exactly one element, hold everything else constant — is what makes prompt-level experiments useful at small sample sizes.
Should I run experiments on ChatGPT, Perplexity, or Google AI Mode?
All three answer different questions, so the right choice depends on where your customers actually are. ChatGPT and Google AI Mode dominate broad-research and how-to queries; Perplexity skews toward citation-driven research; Google's AI Overviews and AI Mode show up directly in search results. Most small businesses we work with start with whichever engine their best customers most often mention having used, then expand once the methodology is proven.
How does this fit with traditional Google Search Console analysis?
The two are complementary, not competing. Google Search Console tells you what queries your site already ranks for in traditional search; prompt-level experiments tell you whether your content is being pulled into AI responses on those same queries (or related ones). Use Search Console to find prompts worth testing — the queries where you have impressions but the customer also asked an AI — and use experiments to test whether your content actually shows up in the AI answer.

Sources & Further Reading

  1. Search Engine Land: searchengineland.com/prompt-level-seo-experiments-ai-search-476813 — Prompt-level SEO: How to run experiments that move AI search visibility.
  2. Search Engine Land: searchengineland.com/geo-metrics-to-track-476642 — 8 GEO metrics to track in 2026.
  3. Search Engine Land: searchengineland.com/chatgpt-citations-ranking-precision-length-study-474538 — ChatGPT citations favor ranking and precision over length.
  4. Search Engine Land: searchengineland.com/intent-alignment-technical-seo-476823 — Intent alignment beats technical SEO once parity is reached.
  5. Otterly.ai: otterly.ai/blog/llm-monitoring-methodology — Otterly.ai LLM visibility monitoring methodology.