6.7 Wizard of Oz

[Figure: a user interacts with a laptop while, behind a curtain, another figure manually operates the controls.]

At a Glance

~1–3 weeks. AI builds the front-end in an afternoon (vibe-code it) and drafts every wizard response, so construction and typing are both fast. The time cost is human: an operator must review, edit, or rewrite each response, and you have to recruit 10 to 30 real users and run them live for a few days. Operator review time plus recruiting set the calendar, not the build.
$40–$400. AI removes the build labor: the front-end and the operator dashboard (a no-code internal-tool builder, a spreadsheet/database tool, a team-chat bot) are near-free to stand up, and LLM API calls are a rounding error. The out-of-pocket spend is recruiting: getting 10 to 30 of the right users to engage live usually means a modest incentive per user, which lands the bill around a hundred dollars. One or two operators also have to be on call during test hours to review and override the drafted responses; their time is a real cost, but not an out-of-pocket one.

Other names WoZ Test · Manual Backend MVP · Hybrid Wizard of Oz

In Brief

A Wizard of Oz test is a product experience that looks fully built to the user, but the response on the other end is generated by an operator behind the curtain — usually an LLM under human approval today, sometimes a person typing manually. The user does not know. You collect real engagement data and qualitative feedback on a finished-feeling experience without committing to the underlying system. The output is two things at once: signal on whether the value proposition lands, and a transcript of what the “wizard” actually had to do, which becomes your spec for the real build.

Common Use Case

You have a clear hypothesis about what the product should do, but the system that would deliver it is expensive or risky to build. Instead, you stand up a polished front-end and let an operator — often an LLM under human review — generate the responses for the first 10–30 users. You learn whether the experience changes their behavior before you commit to the engineering.

Helps Answer

  • Will users come back, pay, or refer once they actually use this?
  • What does the system actually need to do — and how often does each request type come up?
  • Where does the experience break down: the inputs, the outputs, or the latency?
  • Is this a real product, or is the magic actually the human in the loop?

Description

A Wizard of Oz test is a product experiment in which a human manually performs the work behind an interface the user believes is automated. The user interacts with what looks like a finished product; an operator hidden behind the interface produces the responses. You measure how the user behaves on a finished-feeling experience without building the system that would deliver it.

The common form today is a hybrid. The user sees a real-looking interface. Behind it, an LLM generates the candidate response, and a human operator approves it, edits it, or rewrites it from scratch before it goes out. The operator is the wizard, even when they are no longer typing every word.

This matters because it changes what you’re testing. A pure-AI prototype tests the model. A pure-human prototype tests the value proposition. A hybrid Wizard of Oz tests the value proposition and shows you exactly where the model would have failed if you had shipped the AI-only version — every override the operator makes is a labeled defect in the eventual automated build.

When to reach for Wizard of Oz vs. just shipping a real LLM-backed prototype:

  • Use Wizard of Oz when getting a wrong answer to a real user would be expensive or harmful, when the failure modes of the underlying system are unknown, when you want to count override rates as a quality bar, or when the system you eventually want is materially harder than what an off-the-shelf LLM can do.
  • Use a Vibe-Coded Disposable MVP when an off-the-shelf LLM with a tight prompt is good enough that you don’t need a human in the loop, and you’d rather watch real users in a real product than read transcripts.
  • Use a Concierge Test when you don’t yet know the right shape of the product. In a Concierge Test the user knows a human is delivering the service, so you can ask “is this even useful?” before you hide the work behind a fake automation.

The pure-human wizard — no LLM involved — still fits two situations: the system you’re testing is fundamentally not a language task (logistics, physical fulfillment, hardware control) and the user-visible “magic” is a non-AI service; or you want to evaluate a target capability that current models cannot reach, and you need a human ceiling to define what success would look like.

Ethics — The user believes they are using an automated product when a human is doing the work, and if you charge them, they are paying real money on that belief. Keep the deception proportionate: do not let the wizard give advice a real person would act on in high-stakes domains (medical, legal, financial), refund anyone who paid for a capability you have not built, and never collect or retain data you would not be allowed to collect in the shipped product.

How to

Prep

1. State the value proposition you are testing.

Write it as a single sentence. “We help [user] do [task] so they get [outcome] without [the part that’s expensive to build].” If you cannot finish that sentence, you are not ready to run a Wizard of Oz — you are still in problem discovery, and a Concierge Test is the better method.

2. Decide what’s faked and what’s real.

The wizard’s job is the one capability that would be expensive or risky to build. Everything else can be real or stitched together with no-code tools. List the parts of the experience and mark each: real, faked by wizard, or stubbed (placeholder). The fewer pieces the wizard is responsible for, the cleaner your signal.

3. Pick the wizard mode.

Choose one of three:

  • LLM-assisted (default). An LLM drafts every response. The human operator reviews, edits, or rewrites before sending. Best for language-heavy products (advisors, summarizers, schedulers, agents).
  • LLM-free. The operator types every response. Use this when the task is not a language task, when you want a clean human-ceiling baseline, or when LLM responses would be obviously wrong in this domain.
  • Tool-augmented LLM. The LLM plus tool calls (search, calendar, database) drafts a response; the operator reviews. Use this when the eventual product needs to take actions, not just generate text.

4. Build the front-end.

Create whatever the user will interact with: a web form, a chat interface, an app screen, a chat-workspace bot, an email address. It should look like a finished product. Vibe-code it, use no-code form and app builders, or wire up a bot in your team chat tool. The user should not be able to tell that there’s a person in the loop — that includes latency cues. If your real product would respond instantly and your wizard takes three minutes, build in an explicit “thinking…” or “results within 5 minutes” affordance so the delay reads as design, not as a tell.

5. Build the operator console.

The operator needs three things on one screen: the incoming request, the LLM-drafted response (if applicable), and a single Approve / Edit / Reject control. Log every approval, every edit (with the diff), and every rejection. The edit log is the most valuable artifact this test produces — it’s a labeled dataset of where the eventual automation will fail.
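
As a minimal sketch of what the console could log per response, the snippet below classifies the operator's action and captures the edit diff with Python's standard `difflib`. The `Decision` record and `record_decision` function are hypothetical names for illustration, not part of any particular tool.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Decision:
    request: str
    draft: str      # LLM draft; empty string in LLM-free mode
    final: str      # the response actually sent to the user
    action: str     # "approve", "edit", or "reject"
    diff: str       # unified diff from draft to final ("" when unchanged)
    logged_at: str  # UTC timestamp, ISO 8601


def record_decision(request: str, draft: str, final: str,
                    rejected: bool = False) -> Decision:
    """Classify the operator's action and capture the edit diff."""
    if rejected:
        action = "reject"    # draft discarded; final is the operator's own text
    elif final == draft:
        action = "approve"   # sent unchanged
    else:
        action = "edit"
    diff_lines = difflib.unified_diff(
        draft.splitlines(), final.splitlines(),
        fromfile="draft", tofile="final", lineterm="",
    )
    return Decision(request, draft, final, action, "\n".join(diff_lines),
                    datetime.now(timezone.utc).isoformat())


decision = record_decision(
    request="Summarize my week",
    draft="You had 3 meetings.\nNothing urgent.",
    final="You had 3 meetings.\nOne item is overdue.",
)
print(decision.action)
print(decision.diff)
```

Storing the diff at decision time, rather than reconstructing it later, keeps the edit log usable even if drafts are regenerated or overwritten.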

6. Write the wizard playbook.

A two-page document the operator can read in five minutes. It defines:

  • The persona the wizard is responding as (tone, voice, length, what they will and will not do).
  • A response template for each request type you expect to see.
  • An escalation rule: when does the wizard say “I don’t know” or “let me get back to you” rather than make something up?
  • A safety rule: what categories of request must be rejected outright (medical advice, legal advice, anything outside the tested value proposition)?

If two operators will share shifts, the playbook is what keeps their behavior consistent.

Execution

1. Recruit 10–30 real users.

Recruit from the same audience you would target with the real product. Free users will give you fluffy, low-commitment data; people who pay (or who give you something costly, such as time, data, or a referral) will give you signal. Set the test window: one to three weeks is typical.

2. Run the test live.

The operator works requests as they come in — drafting (or reviewing the LLM draft), editing, sending. Cover the actual hours your users are active. Do not batch responses overnight unless the real product would also respond overnight; latency is part of the value proposition you are testing.

3. Log everything.

For each interaction, capture: incoming request, LLM draft (if any), final response, edit diff, response latency, operator notes (one line: “had to look up the answer,” “model hallucinated, rewrote from scratch,” “out of scope, escalated”). This log is the test.
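
One way to capture those fields is one JSON object per line, appended as each response goes out. This is a sketch under assumptions: the field names and the `log_interaction` helper are illustrative, and an in-memory buffer stands in for a real log file.

```python
import io
import json
from typing import Optional, TextIO


def log_interaction(
    out: TextIO,
    request: str,
    llm_draft: Optional[str],
    final_response: str,
    edit_diff: str,
    latency_seconds: float,
    operator_note: str,
) -> dict:
    """Append one interaction to the log as a single JSON line."""
    record = {
        "request": request,
        "llm_draft": llm_draft,              # None in LLM-free mode
        "final_response": final_response,
        "edit_diff": edit_diff,
        "latency_seconds": latency_seconds,  # request received -> response sent
        "operator_note": operator_note,      # one line, free text
    }
    out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# Usage: a StringIO buffer stands in for an open log file.
buffer = io.StringIO()
log_interaction(
    buffer,
    request="Can you move my 3pm?",
    llm_draft="Done, moved to 4pm.",
    final_response="I can offer 4pm or 5pm. Which works?",
    edit_diff="(diff omitted in this example)",
    latency_seconds=142.0,
    operator_note="model hallucinated, rewrote from scratch",
)
print(buffer.getvalue())
```

A line-per-record file is easy to append to during live shifts and easy to load into a spreadsheet or notebook for the analysis phase.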

4. Watch for drift.

When two operators share shifts, the wizard’s voice will drift. When an LLM generates the draft, the operator can become a rubber stamp (“looks fine, send”). Both kill the data. Spot-check one in five responses against the playbook — if the operator is approving without reading or editing without restraint, retrain mid-test.
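
The one-in-five spot check can be drawn at random so operators cannot predict which responses get reviewed. A minimal sketch, assuming each logged response has an ID; `pick_spot_checks` is a hypothetical helper.

```python
import random
from typing import List, Optional, Sequence


def pick_spot_checks(response_ids: Sequence[int], fraction: float = 0.2,
                     seed: Optional[int] = None) -> List[int]:
    """Randomly choose a fraction of responses (default: one in five) to review."""
    if not response_ids:
        return []
    rng = random.Random(seed)  # pass a seed to make the draw reproducible
    k = max(1, round(len(response_ids) * fraction))
    return sorted(rng.sample(list(response_ids), k))


# Usage: choose which of today's 50 responses to check against the playbook.
todays_ids = list(range(1, 51))
to_review = pick_spot_checks(todays_ids, seed=7)
print(len(to_review), "responses to review:", to_review)
```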

5. Run the same diagnostic the real product would run.

If the eventual product is supposed to drive a behavior — a return visit, a payment, a referral — measure that behavior, not just whether the user enjoyed the response. A Wizard of Oz that produces five-star reviews and zero return visits is telling you the magic was the human, not the value proposition.

Analysis

1. Read the override log first.

Every edit and rejection is a labeled defect. Cluster them: which request types did the LLM consistently get wrong? Which did the human have to write from scratch? Which would have been actively harmful if the user had received the LLM’s first draft? This is the gap between “we have a working prototype” and “we have a real product.”

2. Compute the override rate.

Tally Approve / Edit / Reject as percentages of total responses; the override rate is the Edit and Reject shares combined. A 90% approve rate (a 10% override rate) means an off-the-shelf LLM is probably enough: drop the wizard and run a Vibe-Coded Disposable MVP instead. A 30% approve rate (a 70% override rate) means the human is doing most of the work: the eventual product is a much harder build than you thought, and you need to decide if the value proposition is strong enough to justify it.
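
The arithmetic is simple enough to run straight off the log. The sketch below assumes each logged response carries an action label of "approve", "edit", or "reject" (hypothetical label names) and treats edits plus rejections as overrides.

```python
from collections import Counter
from typing import Dict, Iterable

ACTIONS = ("approve", "edit", "reject")


def action_rates(actions: Iterable[str]) -> Dict[str, float]:
    """Return each action's share of total responses, plus the override rate."""
    counts = Counter(actions)
    unknown = set(counts) - set(ACTIONS)
    if unknown:
        raise ValueError(f"unexpected action labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses logged")
    rates = {a: counts[a] / total for a in ACTIONS}
    rates["override"] = rates["edit"] + rates["reject"]  # anything not sent as-is
    return rates


# Usage with made-up numbers: 18 approvals, 9 edits, 3 rejections.
rates = action_rates(["approve"] * 18 + ["edit"] * 9 + ["reject"] * 3)
for name, value in rates.items():
    print(f"{name}: {value:.0%}")
```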

3. Measure the real behavior.

Did users come back? Did they pay? Did they refer? Compare to the assumption you wrote down in Prep step 1. Surface-level satisfaction does not validate the value proposition; behavioral commitment does.

4. Estimate the cost of automation.

For each cluster of requests in the override log, estimate what it would take to build that capability: a better prompt, a fine-tune, a tool call, a custom model, or “we cannot build this in the next 12 months.” Multiply by frequency. The clusters that are common AND expensive to automate are where your engineering risk lives.
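
The "multiply by frequency" step can be sketched as a simple ranking. Everything here is illustrative: the `Cluster` fields, the cluster names, and the cost figures are invented for the example, and cost is expressed in engineer-days only as one possible unit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    name: str
    occurrences: int   # how often this request type appears in the override log
    fix: str           # e.g. "better prompt", "tool call", "custom model"
    cost_days: float   # rough estimate of the effort to automate it


def rank_by_risk(clusters: List[Cluster]) -> List[Cluster]:
    """Order clusters by frequency x cost, highest engineering risk first."""
    return sorted(clusters, key=lambda c: c.occurrences * c.cost_days, reverse=True)


# Made-up clusters for illustration only.
clusters = [
    Cluster("tone too formal", occurrences=20, fix="better prompt", cost_days=0.5),
    Cluster("calendar conflicts", occurrences=14, fix="tool call", cost_days=5),
    Cluster("multi-party negotiation", occurrences=3, fix="custom model", cost_days=30),
]
for c in rank_by_risk(clusters):
    print(f"{c.name}: risk score {c.occurrences * c.cost_days:g} ({c.fix})")
```

A cluster you judge unbuildable in the next 12 months does not fit a numeric score well; it is probably clearer to list it separately than to give it an arbitrarily large cost.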

5. Decide the next move.

One of three things should happen:

  • The override rate is low, users came back, the real behavior is there → ship a real build, no wizard.
  • The override rate is high but the value proposition lands → keep the wizard, scale operators, treat the human-in-the-loop as a feature for now.
  • The override rate is irrelevant because the real behavior didn’t show up → the magic isn’t real. Stop. Go back to a Concierge Test and figure out what users actually want.
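
The three outcomes above can be written as a small decision rule. This is a sketch, not a standard: the `next_move` function is hypothetical, and the 10% cutoff for a "low" override rate is an assumption you should set for your own product before the test starts.

```python
def next_move(override_rate: float, behavior_showed_up: bool,
              low_override_max: float = 0.10) -> str:
    """Map test results to one of the three next moves.

    low_override_max is an assumed cutoff for a "low" override rate.
    """
    if not behavior_showed_up:
        # Override rate is irrelevant if users did not return, pay, or refer.
        return "stop: go back to a Concierge Test"
    if override_rate <= low_override_max:
        return "ship a real build, no wizard"
    return "keep the wizard and scale operators"


print(next_move(override_rate=0.08, behavior_showed_up=True))
print(next_move(override_rate=0.55, behavior_showed_up=True))
print(next_move(override_rate=0.08, behavior_showed_up=False))
```

Checking the behavioral result first mirrors the logic of the list: no override rate, however low, rescues a test where the real behavior never appeared.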

Biases & Tips

  • Observer effect: Users behave differently when they suspect they are being watched or that the system is not what it appears. Make sure the front-end and the latency are good enough that the user is not actively wondering whether a human is on the other end.
  • Rubber-stamp bias (LLM-assisted mode): When the LLM drafts every response, the operator stops reviewing carefully. Flag a random 10% of responses for blind double-review by a second person, or the override-rate metric becomes meaningless.
  • Anchoring on the first generation: Operators tend to keep most of what the LLM produced even when a fresh response would be better. Build “regenerate from scratch” into the console as a one-click option, and track how often it gets used.
  • Automation bias: Operators under time pressure defer to plausible-sounding LLM outputs rather than evaluating them critically. A confidently wrong answer that reaches the user contaminates both the response data and the behavioral outcome you are trying to measure.
  • Human-attention false positive: Users respond warmly to anything that feels personally attended to. A high satisfaction score with no return visits usually means you measured the human, not the product. Pair satisfaction with a behavioral metric.
  • Deception debt: A Wizard of Oz test deceives users into believing an automated system is real. Plan to debrief participants afterward where feasible, never ship anything that could harm them on the strength of that belief, and honor any commitment the wizard made during the test.

Next Steps

  • If override rate is low, ship a Vibe-Coded Disposable MVP without the human in the loop and see if the behavioral metrics hold.
  • If override rate is high but the value proposition landed, productize the wizard: hire or contract operators, formalize the playbook, treat the human-in-the-loop as a feature, and automate the largest override clusters first — each cluster becoming a Single-Feature MVP.
  • If the value proposition did not land, drop the front-end and run a Concierge Test to re-discover what users actually need.
  • Run a Product-Market Fit Survey with the Wizard of Oz cohort once you have enough returning users to measure whether they would be disappointed without the product.

Learn more

Case Studies

Strella: Camera-off AI moderator

According to Bessemer’s account, co-founder Priya Krishnan joined Zoom interviews with her camera off and a deliberately robotic voice, pausing after each response to simulate AI latency, validating demand before any model was wired up.

Aardvark: Human-routed social search

The social-search service routed questions to friends and friends-of-friends in the asker’s network who answered them live, validating a knowledge-market interaction pattern before Google acquired it, reportedly for around $50M, in early 2010.

Zappos: Manually-fulfilled shoe orders

Nick Swinmurn photographed shoes in local stores and posted them online, buying and shipping each pair himself when an order came in — a non-language Wizard of Oz validating an inventory-and-fulfillment system that did not exist; Amazon later acquired the company in a deal reportedly valued at around $1.2B in 2009.
