6.5 Usability Testing

An observer with a clipboard watching a user interact with a laptop

At a Glance

~1 day: Usability tests with 5 users can be finished in half a working day with minimal resources. Tests typically require no more than 5–7 users, unless the tasks are complex and involve several parties collaborating simultaneously to complete the test. Some teams run tests in full usability labs with cameras, eye-tracking software, and one-way mirrors, but none of that equipment is strictly necessary. AI-powered tools can auto-generate session transcripts, highlight moments of frustration, and produce summary reports, reducing analysis time from hours to minutes.
Free: Usability testing can be done at low cost with just a laptop, a note-taking template, and a quiet room. Remote unmoderated tools like Maze and Lyssna offer free tiers. Full usability labs with eye-tracking and screen recording are available but not required for most early-stage tests.

In Brief

Usability testing is a qualitative method where you watch real users attempt specific tasks with your product and record what happens. The product can be anything from a paper sketch to a fully built application. You give each participant a task, observe whether they can complete it, and note where they hesitate, get lost, or express frustration. The output is a prioritized list of usability problems — the points in your design where real people struggle — along with insight into why those problems occur.

Common Use Case

You have built or redesigned a key flow in your product and want to see whether real users can complete it without getting lost or giving up. You sit with a handful of participants, give them a task, and watch where they stumble so you can fix the friction before you ship.

Helps Answer

  • How do people actually use this product or feature?
  • Can users complete the key tasks without help?
  • Where do users get confused or frustrated?
  • What do users experience at each step of the process?

Description

Usability testing is a qualitative method for observing real people as they attempt realistic tasks with your product. ISO 9241-11:2018 defines usability as the extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use. The test makes that definition operational: you put a representative user in front of the artifact, give them a goal, and watch where reality diverges from the design intent. Jakob Nielsen’s foundational Usability Engineering established that you do not need a large sample to find most usability problems — five carefully chosen testers, run iteratively, will surface the dominant defects in a flow.
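
One common justification for the five-tester heuristic is the problem-discovery model published by Nielsen and Landauer in 1993, which estimates the share of problems seen by n testers as 1 - (1 - L)^n. The sketch below is illustrative only: the function name is made up for this example, and the default L = 0.31 is the average those authors reported across their own projects, not a figure from this guide. The rate for your product may be quite different.

```python
def share_of_problems_found(n_testers: int, per_tester_rate: float = 0.31) -> float:
    """Expected share of usability problems seen at least once by n testers.

    per_tester_rate is the assumed probability that a single tester exposes
    any given problem (L in the Nielsen-Landauer model).
    """
    if n_testers < 0:
        raise ValueError("n_testers must be non-negative")
    if not 0.0 <= per_tester_rate <= 1.0:
        raise ValueError("per_tester_rate must be between 0 and 1")
    return 1.0 - (1.0 - per_tester_rate) ** n_testers


if __name__ == "__main__":
    # Print the discovery curve for a few sample sizes.
    for n in (1, 3, 5, 10, 15):
        print(f"{n:>2} testers: {share_of_problems_found(n):.0%}")
```

With the default rate, five testers come out at roughly 84%, and the curve flattens quickly after that. This is the arithmetic behind running several small rounds instead of one large one.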

The method works at any fidelity. A paper sketch tested with three people can expose a confusing label as effectively as a fully built product tested with twenty. Steve Krug’s Don’t Make Me Think, Revisited makes the case for treating usability testing as an everyday DIY activity rather than a heavy lab exercise: small, frequent tests done by the team that built the product close the loop faster than commissioning quarterly studies. Usability testing is qualitative — it tells you what breaks and why — and complements quantitative methods (analytics, A/B tests, surveys) that tell you how often.

AI now sits inside the workflow rather than alongside it. Platforms like Maze, Lookback, Lyssna, and UserTesting auto-generate session transcripts, flag moments of hesitation or frustration from screen and audio signals, cluster patterns across sessions, and produce draft summaries. This compresses analysis time from hours to minutes and makes it practical to run more tests with more users. AI does not replace observation — it accelerates synthesis. A flagged frustration moment still needs a human to decide whether the design is wrong or the task scenario was unfair, and the empathy gain that comes from watching a user struggle live is something a transcript summary cannot reproduce.

How to

Prep

  1. Define the task scenarios. Write 3–5 realistic tasks that cover the critical paths in your product. Each task is a goal, not a click-by-click instruction. Example: “You just signed up. Find the feature that lets you invite a teammate.” Avoid leading the user toward the answer. Set a reasonable time limit per task (2–5 minutes).
  2. Recruit 5 testers who match your target user. Five is the heuristic Nielsen established in Usability Engineering — past five, you mostly re-confirm problems you’ve already found. Recruit from existing users, social channels, or panels (UserTesting, Lyssna, Prolific). For B2B, ask customers directly; most will say yes if you make it short.
  3. Decide moderated vs. unmoderated, then pick the tool. Moderated sessions (Zoom, Lookback, in-person) let you ask “what made you click that?” in real time but cost more researcher hours. Unmoderated sessions (Maze, Lyssna, UserTesting) scale cheaply, and AI-assisted platforms can analyze think-aloud audio and behavioral signals across many sessions without a researcher watching each one live. Mix both when budget allows: a few moderated sessions to build empathy, a larger unmoderated batch to confirm patterns.
  4. Write a short intro script. Two or three sentences that frame the session. The standard reassurance: “We’re testing the product, not you. If you get stuck, that’s exactly the feedback we need.” Pre-writing the script keeps facilitators consistent across sessions.
  5. Pilot once internally before the first real session. Run the full protocol with a teammate. You’re checking whether tasks are intelligible, whether the recording setup works, and whether the timing fits. Half the issues you find in piloting would have wasted a real participant slot.
  6. Decide who moderates and who takes notes. One person facilitates and asks the questions; a second person observes and captures notes. Trying to do both at once degrades both. For unmoderated tests, the “observer” role becomes whoever reviews the AI-generated session summaries and watches the flagged moments.
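
The prep steps above can be captured as a small, checkable plan. The following is a minimal sketch and assumes Python 3.9 or later; `Task`, `StudyPlan`, and `check_plan` are hypothetical names invented for this example, and the thresholds simply restate the heuristics in the list (3–5 tasks, 2–5 minutes each, 5 testers, a written intro script).

```python
from dataclasses import dataclass


@dataclass
class Task:
    goal: str              # a goal in the user's words, not click-by-click steps
    time_limit_min: float  # minutes allowed before the facilitator moves on


@dataclass
class StudyPlan:
    intro_script: str
    tasks: list[Task]
    n_testers: int = 5
    moderated: bool = True


def check_plan(plan: StudyPlan) -> list[str]:
    """Return warnings where the plan departs from the prep heuristics."""
    warnings = []
    if not 3 <= len(plan.tasks) <= 5:
        warnings.append(f"{len(plan.tasks)} tasks planned; aim for 3-5")
    for number, task in enumerate(plan.tasks, start=1):
        if not 2 <= task.time_limit_min <= 5:
            warnings.append(
                f"task {number}: {task.time_limit_min} min limit is outside 2-5 min"
            )
    if plan.n_testers < 5:
        warnings.append(f"only {plan.n_testers} testers recruited; aim for 5")
    if not plan.intro_script.strip():
        warnings.append("no intro script written")
    return warnings


if __name__ == "__main__":
    plan = StudyPlan(
        intro_script="We're testing the product, not you.",
        tasks=[
            Task("You just signed up. Find the feature that lets you invite a teammate.", 3),
            Task("Change the email address on your account.", 4),    # hypothetical task
            Task("Find out what your current plan costs.", 2),       # hypothetical task
        ],
    )
    print(check_plan(plan))  # an empty list means no warnings
```

Running the check before the internal pilot is a cheap way to catch a plan that has drifted, for example a sixth task added at the last minute.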

Execution

  1. Frame the session. The facilitator reads the intro script: explain the purpose, reassure the participant that any difficulty is feedback about the product (not the user), and confirm consent to record. Reading from the pre-written script keeps sessions consistent across facilitators.
  2. Explain the first task. Read the task scenario aloud and hand control to the user. Give context, not instructions. Do not tell them where to click or what to look for.
  3. Observe and ask the user to think aloud. Ask the participant to narrate impressions, intentions, and expectations as they work. The think-aloud protocol is what makes usability testing diagnostic rather than just a pass/fail score — it surfaces the why behind the behavior. Do not explain, coach, or interpret. Interject only to ask “what are you thinking right now?” when the user goes silent at a moment of hesitation.
  4. Capture the session. Record audio, screen, and (where available) video. For unmoderated remote sessions, AI-assisted platforms (Maze, Lookback, UserTesting, Lyssna) auto-flag moments of hesitation, backtracking, repeated clicks, and emotional cues — useful triage when you cannot watch every session in real time.
  5. Repeat for each task. Move through the rest of the task scenarios. Watch for fatigue; if the participant is flagging, cut a task rather than push through with degraded data.
  6. Run a short exit interview. Thank the user and ask 2–3 open-ended follow-up questions to clarify their experience: “What was the most frustrating part?” “If you could change one thing, what would it be?” These prompts often surface issues the user did not articulate during the task itself.

Analysis

  1. Synthesize observation notes across sessions. The facilitator and any observers compare notes, focusing on moments where users hesitated, backtracked, or expressed frustration. Even usability experts sometimes disagree on interpretation, so multiple observers reduce the chance that a single experimenter’s bias filters the findings. Identify the functional issues that affected most or all participants — those are the ones worth fixing first.
  2. Prioritize by frequency and severity. Given the small sample size, treat consistent problems (3+ of 5 users hit the same wall) as confirmed defects. One-off issues might be real or might be participant-specific — note them but do not over-weight them. Severity is independent of frequency: a problem that happens once but blocks the user from completing the task scores higher than a cosmetic issue that everyone hit.
  3. Use AI tools to triage what to watch closely. AI-powered session analysis tools (Maze, Lookback, Lyssna) auto-flag moments of confusion, generate heatmaps, calculate task-completion metrics, and produce session summaries. Treat these as a multiplier on researcher bandwidth — use them to triage which recordings to watch closely, not as a substitute for watching sessions yourself. Automated heuristic checks (contrast ratios, touch-target sizes, navigation depth) can also catch surface-level issues before you spend a real participant slot on them.
  4. Distinguish usable from desirable. If all users complete the tasks, that tells you the product is usable. It does not tell you whether the value proposition lands or whether anyone would actually pay for it. Usability testing is a necessary but insufficient validation step — pair it with desirability and value-proposition tests before you assume you are clear to ship.
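
The frequency-and-severity step can be sketched in a few lines. This is an illustration under stated assumptions, not a standard scoring scheme: the four-level `SEVERITY` scale, the `Observation` record, and the `prioritize` function are all hypothetical. The only rules taken from the text are that an issue hit by 3 or more of 5 participants counts as confirmed, and that a task-blocking issue seen once outranks a cosmetic issue everyone hit.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity scale; higher means worse.
SEVERITY = {"cosmetic": 1, "minor": 2, "major": 3, "blocker": 4}


@dataclass(frozen=True)
class Observation:
    participant: str
    task: str
    issue: str
    severity: str  # one of the SEVERITY keys


def prioritize(observations, n_participants, confirm_at=3):
    """Group observations by (task, issue) and rank by severity, then frequency."""
    hit_by = defaultdict(set)   # (task, issue) -> participants who hit it
    worst = defaultdict(int)    # (task, issue) -> worst severity recorded
    for obs in observations:
        key = (obs.task, obs.issue)
        hit_by[key].add(obs.participant)
        worst[key] = max(worst[key], SEVERITY[obs.severity])

    rows = []
    for (task, issue), people in hit_by.items():
        rows.append({
            "task": task,
            "issue": issue,
            "participants": len(people),
            "of": n_participants,
            "severity": worst[(task, issue)],
            "confirmed": len(people) >= confirm_at,
        })
    # Severity first, so a one-off blocker still ranks above a common cosmetic issue.
    rows.sort(key=lambda row: (-row["severity"], -row["participants"]))
    return rows
```

One-off issues stay in the output with `confirmed` set to `False`, which matches the advice to note them without over-weighting them.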

Biases & Tips

  • Hawthorne effect (the observer effect): Users behave differently when they know they’re being watched. Frame the session as testing the product, not the user, and use unmoderated remote tests for a less observed signal where appropriate.
  • Social desirability bias: Users may try to complete tasks (or answer questions) in a way that makes them look good to the experimenter. Reinforce that struggling is the desired output, not a failure.
  • Confirmation bias: Experimenters can frame tasks or questions in ways that confirm their preconceptions. Have someone outside the design team review the task scenarios before you run sessions.
  • Selection bias: Testing usability only with existing power users hides the issues new users hit. Recruit a mix that matches the segment you’re actually trying to serve.
  • AI analysis is a multiplier, not a replacement: Automated tools catch surface-level issues and flag frustration signals, but they cannot tell you whether the product solves the user’s problem in a way that feels natural. Make sure at least some sessions are observed live by a team member who can ask follow-up questions and notice cues that AI still misses.

Next Steps

  • Prioritize the most critical usability issues by frequency and severity.
  • Fix the top issues and re-test with new participants to verify improvements.
  • Run Competitor Usability Testing to compare your usability scores against direct competitors.
  • Share highlight clips with stakeholders to build empathy for user struggles.
  • Use A/B Testing to measure whether usability fixes translate into improved conversion or engagement metrics at scale.
  • Run a Net Promoter Score Survey after shipping usability improvements to track whether satisfaction increases over time.

Learn more

Case Studies

Fidelity: Remote unmoderated testing

A Fidelity researcher reports running 40+ unmoderated remote studies, including a 117-participant comparison where Wikipedia users completed 71% of Apollo-program tasks correctly against NASA’s 58%.

USF: Virtual library interface study

Maryellen Allen’s academic study tested the University of South Florida’s virtual library interface with real students.

Metisa: AI dashboard redesign

Interviews and heuristic evaluation found the AI e-commerce dashboard confused users; the team rebuilt around a drag-and-drop email editor and clearer navigation.

Maze: AI usability testing platform

Vendor case write-up describes Maze’s Feedback Engine analyzing open-ended responses at scale with auto-transcripts, thematic sentiment filters, and dynamic follow-ups.

Maze: User research report 2026

Maze’s 2026 industry report finds 69% of researchers use AI in projects (up 19 points), with humans still essential for nuance (82%), ethics (80%), and framing research questions (76%).

Figma Make: AI prototype generator

Figma Make lets designers generate interactive prototypes from natural-language prompts, aimed at producing usability-test-ready flows without hand-built screens.

Nielsen Norman Group: Iterative homepage redesign

NN/g documents three rounds of testing on interactive homepage prototypes with new and returning visitors across desktop and mobile, sorting feedback into copy, architecture, layout, and visual style.
