
AI Usage Playbook — QA

Introduction & Purpose

The way we build and test software is changing. Generative AI is becoming a part of the development lifecycle - not as a future possibility, but as a present reality. What exactly this shift will look like in a year or two, nobody can say with certainty. But one thing is clear: QA professionals who learn to work effectively with AI will have a significant advantage - in speed, in depth, and in the kind of work they can take on.

This playbook is written for you - a QA Specialist at Xebia - as a practical companion for that journey. It won't tell you everything there is to know (nobody has that figured out yet), but it will give you a solid foundation: the right habits, the right techniques, and enough structure to start exploring with confidence.

You won't find hype here. What you will find is an honest take on where AI helps, where it doesn't, and how to integrate it into your daily work in a way that actually makes a difference. Some of that difference will be efficiency - getting through repetitive, time-consuming tasks faster. Some of it will be capability - tackling work that was previously impractical or out of reach. Both are worth pursuing.

What This Playbook Covers

This document focuses on the QA-specific application of Generative AI. It builds on two foundational resources that you should be familiar with before diving in:

Core Principles for AI - the company-wide guidelines on how we approach AI responsibly and effectively.

Best Practices in Prompting - the shared techniques for communicating with AI tools to get quality outputs.

Everything in this playbook assumes you've read those. We won't repeat them here - instead, we'll build on top of them with guidance tailored to QA work: how to prepare the right context for AI, how to integrate it into your workflows, what to watch out for, and how to measure whether it's actually helping.

The Mindset

Think of this as the beginning of an exploration, not a finished map. The field is moving fast, and the best way to keep up is to stay curious, experiment often, and share what you learn with the team. This playbook will evolve alongside your experience - and your contributions will make it better for everyone.

Let's get started.

Want to go straight to the practical starting guide? Jump ahead to the Quick-Start Guide for QA.

1. Working with AI in QA – Key Principles

AI is responsible for generation and analysis. You are responsible for decisions and quality.

This holds across every step of the testing lifecycle. The efficiency gains are real - but they come from AI accelerating your work, not from AI replacing your judgment.

AI-Enhanced QA Practices

Use AI Effectively

  1. Separate stable knowledge from story-specific input: stable context (regression scope, system architecture, test techniques etc.) persists across stories; story details are session-only
  2. Build stable knowledge incrementally: start with core references (e.g., system context, test format, definition-of-done), then add regression scope and NFR attributes as the team matures
  3. Treat AI context as a long-term team asset: version, review, and maintain reference files like any shared team artifact
  4. Follow team conventions in all AI-generated output: AI extracts patterns from example files and produces artifacts in the team's agreed format
  5. Focus on WHAT to test, let AI generate detailed cases: provide acceptance criteria, test objective and additional context; AI produces detailed test scenarios, test steps, data needs, and coverage mapping
  6. Use AI to cross-reference, map, and verify consistency across artifacts: AI excels at comparing acceptance criteria against test cases, existing coverage against new changes - provided the input is complete and context is accurate
  7. Leverage AI to analyze results and failure patterns: AI can help identify patterns, correlate them with system context, and support root-cause analysis

Maintain Human Oversight

  1. Never delegate quality decisions to AI: Go/No-Go, risk acceptance and final approval are human responsibilities
  2. Never accept AI output without human review at decision points: always validate risk priorities, technique selection, regression impact, gap judgement, and final test sets
  3. Always validate AI output against domain knowledge: AI may miss architectural constraints, legacy behavior, and domain-specific nuance
  4. Avoid using AI when input quality is insufficient: unstable requirements, undocumented business logic, or missing system context produce unreliable output

2. Building Effective AI Context for QA

AI output is only as good as the context behind it. Most of the information it needs already exists - in project docs, wikis, sprint trackers, team conventions. The problem is that it is scattered. This section describes how to structure it, build it, and keep it useful.

The layered model

Separate context by how often it changes. Mixing stable and volatile information in one place creates maintenance overhead and makes it hard to include only what a given task actually needs.

L1 - Project Foundation.
Stable. Established at kickoff, updated infrequently. Team conventions, test process, defect management, definition of done, automation scope. Every AI interaction draws on this layer.

L2 - Product and System.
Semi-stable. Changes as the product evolves, not sprint to sprint. Business context, environments, regression approach, NFRs, compliance constraints, cross-system dependencies.

L3 - Active Invocation.
Live. Assembled at the moment of use from whatever is currently happening: the story in scope, the build state, recent risk signals. Not stored - composed per session.

L1 and L2 are investments you make once and reuse. L3 is what you add each time.

What belongs where

L1 - Project Foundation

  • Team conventions: work framework, roles, task types, how requirements are delivered and stored, estimation approach
  • Definition of ready and definition of done
  • Test process: how a task moves from dev-complete to QA-done
  • Test documentation standards: what gets written, how detailed, in what format
  • Defect management: bug lifecycle, priority and severity definitions, required fields, handling of unresolved bugs
  • Automation strategy: scope, tooling, CI/CD integration

L2 - Product and System

  • Business context: project goals, client, end-user profile, project stage
  • Test environments: what exists, what deploys where, access rules, data constraints
  • Regression scope: frequency, selection approach, automation coverage
  • Non-functional requirements: performance thresholds, security scope, browser and device matrix, localisation
  • Compliance constraints: data sensitivity, applicable standards, client-specific requirements

L3 - Active Invocation

  • The story, bug, or change currently in scope
  • Relevant acceptance criteria
  • Current environment state, if it matters
  • Risk signals from the sprint or recent releases

Structuring context as files

Store L1 and L2 as plain-text Markdown files - one file per topic area, accessible to the whole team. Split by how context is used, not just how it is organised. If you would include something in some tasks but not others, it deserves its own file.

Some files are nearly always relevant (team conventions, test process). Others are task-specific (automation strategy matters when writing test code; it does not matter when reviewing a story for testability). Which files pair with which tasks is covered in the Efficiency Patterns section.

Building context for the first time

The first pass is mostly a packaging exercise - existing material is scattered across docs, wikis, and team knowledge. A structured interview at project kickoff is the most reliable way to gather it. An hour or two at the start avoids reconstructing everything from memory later, under pressure.

Keeping context current

Outdated context is worse than no context

Outdated context produces confident-sounding output based on stale information, which is harder to catch. A few habits keep it honest:

  • Pair context updates with process changes. When the bug lifecycle changes, update the file the same day.
  • Review L2 at the start of each release cycle. Check whether environments, regression scope, or NFR thresholds have shifted.
  • Treat degrading AI output as a signal. When responses start missing obvious things, the context file is usually the cause.

L3 does not need maintenance - you assemble it fresh each time from live sources.
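To make the layered model concrete, here is a minimal sketch of how a session's context could be assembled from files. The filenames and layout are hypothetical - adapt them to your own context kit - and in practice a skill or repo-aware tool does this loading for you:

```python
from pathlib import Path

# Hypothetical file layout -- adjust the names to your own context kit.
L1_FILES = ["team-conventions.md", "test-process.md", "definition-of-done.md"]
L2_FILES = ["system-context.md", "regression-scope.md"]

def build_session_context(context_dir: str, story_text: str) -> str:
    """Compose a session prompt: stable L1/L2 files first, live L3 input last."""
    base = Path(context_dir)
    sections = []
    for name in L1_FILES + L2_FILES:
        path = base / name
        if path.exists():  # missing files are skipped, never fabricated
            sections.append(f"## {name}\n{path.read_text()}")
    sections.append(f"## Story in scope (L3 - session only)\n{story_text}")
    return "\n\n".join(sections)
```

The ordering mirrors the model: the stable layers frame every session, and the story is appended fresh each time rather than stored.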

3. Efficiency Patterns & Workflows

This section maps AI-assisted workflows to concrete QA activities across the testing lifecycle. Each pattern follows the operating model from the previous sections: AI generates and analyses, you decide and are accountable. The workflows depend on stable context - if you haven't set up your L1 and L2 files yet, start with the Building Effective AI Context for QA section first.

3.1 Tools and where workflows run

Most workflows in this section will eventually ship as skills — packaged instructions that tell the AI what context to load, what to ask, and how to format the output. You won't build prompts from scratch; you'll invoke a skill and review what it produces.

Where skills run

Skills need a tool that can see your project files — your test code, your context files, your repo structure. Any AI-capable IDE or CLI tool with workspace access supports this. The two we recommend:

  • GitHub Copilot in VS Code — open Copilot Chat (Ctrl+Alt+I on Windows, Cmd+Ctrl+I on Mac), switch to Agent mode using the mode picker at the top of the chat panel. In Agent mode, Copilot can read files across your workspace. This is the recommended starting point if your team already uses VS Code.
  • Claude Code — a command-line tool you run from your project directory. It sees everything in the repo. Better suited for workflows that involve large inputs (test result files, scan reports) or longer multi-step generation.

Other tools with similar capabilities — Cursor, Windsurf, JetBrains AI Assistant, among others — will work too. What matters is that the tool can read your project files without you pasting them in manually.

When you don't have a skill yet

If neither a skill nor a repo-aware tool is available, every workflow also works manually — you paste your context files and story into any AI chat interface (Copilot Chat in its standard mode, Claude.ai, ChatGPT, Gemini) and follow the workflow steps yourself. The output won't be as consistent, but the thinking is the same. The Quick-Start guide in this playbook uses this manual approach on purpose — it teaches you how context loading works before skills handle it for you.

Which model to use

Not every model handles every task equally well. Code generation, long document analysis, and structured reasoning each favour different models. The Core Principles document covers model selection criteria — refer to it when choosing. This playbook's default recommendation (Claude, GPT or Gemini models) applies to the Quick-Start and works adequately across most workflows. As skills ship, each skill will specify which model it's optimised for.

Skill status per workflow

Each workflow below will have a note when its skill is already available. Where it is, use it — the output will be more consistent than manual prompting. When it's not there yet, the workflow description gives you enough to work manually until it ships.

3.2 Refinement & Story Analysis

Refinement is where testing starts, and where AI delivers some of its most immediate value. The three practices below cover analysis of acceptance criteria, generation of test scenarios, and identification of non-functional requirements. All three share a pattern: you feed the AI a story with context, it finds what's missing or unclear, and you bring those findings to the team.

Acceptance Criteria Analysis

AI reviews acceptance criteria for completeness - positive and negative paths, boundary conditions, error handling, and state transitions. When backed by system context, it also flags cross-system impact: does this change affect upstream or downstream flows? Are other in-flight stories touching the same area?

Workflow:

  1. Provide the user story and acceptance criteria.
  2. AI analyses completeness - positive and negative paths, boundaries, error handling, state transitions.
  3. AI cross-references against system context: upstream/downstream flows, in-flight stories in the same area, known regression hotspots.
  4. AI flags gaps, ambiguities, and dependencies on other teams or services, with rationale.
  5. Ask follow-up questions, challenge the output, and discuss it with the AI.
  6. You review, decide what's relevant, and bring findings to refinement.
  7. Optionally re-run after AC updates.

Input: User story and acceptance criteria.
Output: Completeness assessment, suggested AC additions, ambiguity flags, edge case list, cross-system impact warnings.
Your job: Decide relevance. Validate domain fit. Reject out-of-scope suggestions. Verify cross-system flags against current knowledge - AI may have stale context on what's actually deployed or in flight.
Watch out: Cross-system analysis quality depends heavily on your L2 context. Without a maintained system context file, the AI will either miss cross-system impact entirely or flag things that aren't applicable.

High-Level Test Scenario Generation

AI generates structured, categorised test scenarios from refined acceptance criteria: happy path, negative, boundary, error handling, and integration scenarios. Each scenario maps back to the AC it validates, and the AI flags test data needs and overlap with existing regression coverage.

Workflow

  1. Provide the refined story with acceptance criteria and any available API contracts or wireframes.
  2. AI generates scenarios by category: happy path, negative, boundary, error handling, integration.
  3. AI maps each scenario to AC for traceability and flags test data needs.
  4. AI cross-references the existing coverage map - flags overlap with regression suite, identifies gaps.
  5. You review, prioritize by risk, add tribal-knowledge scenarios, and decide manual versus automation.

Input: Refined acceptance criteria, story description, API contracts or wireframes if available.
Output: Categorized scenario list in team format, traceability matrix, test data requirements, flagged unknowns, regression impact scenarios.
Your job: Validate completeness. Prioritize by risk. Add scenarios that come from tribal knowledge - edge cases you've seen before that aren't documented anywhere. Confirm that integration scenarios are actually feasible in available environments.
Watch out: AI-generated scenarios can look thorough while missing domain-specific edge cases entirely. The traceability matrix gives a false sense of completeness if the AC themselves have gaps.

NFR Identification

AI flags non-functional requirements from story content and system-level impact - performance, security, accessibility, compliance. It classifies against NFR categories, checks whether the change touches high-throughput paths or exposes new attack surface and suggests NFR acceptance criteria with a recommended test approach.

Workflow

  1. Provide the user story, affected services or components, and feature type.
  2. AI classifies against NFR categories and flags with specific rationale.
  3. AI cross-references the system integration map - high-throughput paths, critical flow latency, new attack surface.
  4. AI checks historical NFR defects in the affected area.
  5. AI suggests NFR acceptance criteria with a recommended test approach per flag.
  6. You confirm applicability and engage specialists where needed.

Input: User story, affected services and components, feature type.
Output: NFR flags with category, rationale, and severity. Suggested NFR acceptance criteria. Test approach recommendations.
Your job: Confirm applicability - not every flag warrants action. Engage security or performance specialists when findings are beyond your expertise. Ensure NFRs are tracked, not just noted.
Watch out: Without architecture context, NFR flags will be generic and mostly noise. This practice only becomes useful once your system context file includes service dependencies and known bottlenecks.

3.3 Test Design

Test design is where AI transitions from analysis to generation - producing detailed, low-level test cases from refined stories. The value compounds over time: as your reference files mature, AI output increasingly matches your team's conventions without manual correction.

You provide a refined story with acceptance criteria, the AI loads your persistent reference files, generates test cases in your team's format, and flags anything ambiguous. You review and approve.

Workflow

  1. Provide a refined story with acceptance criteria.
  2. AI loads persistent reference files - test design techniques, regression scope, system context, test case format, definition of done.
  3. AI analyses the story, cross-references regression scope, and applies test design techniques.
  4. AI generates low-level test cases in the team's format.
  5. AI produces a coverage summary and flags ambiguities.
  6. You review, adjust, and approve.

Input: Refined story with acceptance criteria.
Output: Test cases in team format, coverage summary, flagged ambiguities.
Your job: Review coverage against your understanding of the feature - not just against the AC. Add cases the AI missed because they require context it doesn't have. Approve the final set.
Watch out: AI generates what you ask for. If your test case format file is vague, output will be inconsistent. If you don't specify what to test - risk areas, priority boundaries - you'll get broad coverage instead of targeted coverage.

3.4 Test Review

AI as review partner: it maps acceptance criteria to test cases, checks traceability, evaluates coverage, and flags gaps, duplicates, and inconsistencies. This doesn't replace peer review - it front-loads the mechanical checks so the human reviewer can focus on judgment calls: clarity, reproducibility, and whether the tests actually validate what matters.

Workflow

  1. Provide your test cases, the acceptance criteria, identified risks, and the refined user story.
  2. AI maps AC to test cases and checks traceability.
  3. AI checks coverage against risk areas, identifies missing scenarios and edge cases.
  4. AI flags gaps, inconsistent steps, and unclear expected results.
  5. A peer reviewer focuses on clarity and reproducibility.
  6. Updates are made based on review feedback.
  7. Final validation confirms all review comments are addressed.

Input: Test cases or scenarios, acceptance criteria, identified risks, refined user story.
Output: Approved test set, review comments, test coverage summary.
Your job: The AI catches structural issues - missing traceability, duplicate cases, gaps in negative paths. You catch what it can't: whether the test will work in practice, whether the expected results are meaningful, whether the test intensity matches the real risk.
Watch out: AI may suggest irrelevant edge cases or miss domain-specific constraints. It may misinterpret ambiguous acceptance criteria. It does not understand architectural constraints, legacy behaviour, or historical project decisions.
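The mechanical part of this review - mapping AC to test cases and spotting uncovered or unknown references - can be sketched in a few lines. The dict shape used for test cases here is an assumed example; adapt it to whatever your test management tool exports:

```python
from collections import defaultdict

def traceability_report(acceptance_criteria: list[str], test_cases: list[dict]) -> dict:
    """Map AC ids to the test cases that claim to cover them.

    Each test case is assumed to look like {"id": "TC-1", "covers": ["AC-1"]}
    -- a hypothetical shape, not a real tool's export format.
    """
    coverage = defaultdict(list)
    for tc in test_cases:
        for ac in tc["covers"]:
            coverage[ac].append(tc["id"])
    return {
        # AC with no test case at all -- the gaps the reviewer must resolve
        "uncovered": [ac for ac in acceptance_criteria if ac not in coverage],
        # References to AC ids that don't exist -- usually stale or mistyped
        "unknown_refs": sorted(set(coverage) - set(acceptance_criteria)),
        # AC covered by several cases -- informational, possible duplication
        "multi_covered": {ac: ids for ac, ids in coverage.items() if len(ids) > 1},
    }
```

A report like this handles the traceability check; the human reviewer still owns clarity, reproducibility, and whether the tests validate what matters.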

3.5 Integration Testing

Integration testing with AI splits into three phases - planning, implementation, and execution - each with its own context needs and human checkpoints. The value increases as your reference material matures: a well-maintained system context file and a set of representative test examples let the AI generate integration tests that follow your team's patterns from the start.

Planning

AI maps the change to integration boundaries - affected endpoints, services, events, database tables, and third-party calls. It categorizes test objectives per boundary, cross-references existing coverage to flag overlap and gaps, and assesses environment and data prerequisites.

Workflow

  1. Provide the test cases and feature context (affected services, change scope).
  2. AI maps integration boundaries - affected endpoints, services, events, database tables, third-party calls.
  3. AI categorizes test objectives per boundary: contract validation, data flow integrity, error propagation, auth flow, async flow, database interaction, third-party integration.
  4. AI cross-references the existing coverage map - flags overlap, identifies gaps, flags tests that may need updating.
  5. AI assesses environment and data prerequisites - which services must be deployed, what test data is needed, where mocks are required.
  6. AI produces a prioritized test plan in team format with AC traceability.
  7. You review, select which objectives to implement, confirm mock-versus-real strategy, and fill gaps.

Input: Test cases, feature or change context with affected services.
Output: Integration boundary map, categorized test objectives with rationale, existing coverage overlap analysis, environment and data prerequisites checklist.
Your job: Select which test objectives to implement - not everything flagged needs a test. Confirm mock-versus-real decisions. Fill gaps the AI identified but couldn't resolve.

Implementation

AI generates integration test code following your team's patterns - extracted from the examples you provide. It also generates supporting code for test data setup, mock configuration, and environment config. Every gap it fills with an assumption gets logged explicitly.

Workflow

  1. Provide the approved test plan and select which objectives to implement.
  2. AI loads reference files and example tests.
  3. AI analyses examples to extract team patterns - test structure, framework conventions, naming, assertion style, mock configuration.
  4. AI generates integration test code for each selected objective, following team patterns.
  5. AI generates supporting code - test data setup/teardown, mock configurations, environment config.
  6. AI maps each test to its plan objective and AC, and produces an assumptions log.
  7. You review both the tests and the assumptions.

Input: Approved test plan, selected objectives.
Output: Executable integration test code in team framework and style, test data setup and teardown code, mock and stub configurations, traceability comments, assumptions log.
Your job: Review the assumptions log first - that's where the AI's guesses are. Validate that generated tests actually test what they claim to test, not just that they follow the right structure.

Execution

AI classifies failures from test runs into actionable categories, then proposes a specific response for each: a fix for test code issues, a defect report for possible bugs, or a resolution path for environment and data problems.

Workflow

  1. Confirm environment readiness and run the tests.
  2. AI collects results and classifies each failure into one of five categories:
    • Environment issue - service unavailable, timeout, stale deployment, known transient issue
    • Test data issue - missing precondition data, data consumed by another test, shared data conflict
    • Test code issue - incorrect assertion, wrong endpoint, malformed request, missing setup step
    • Possible actual defect - unexpected response with correct test logic
    • Ambiguous - AI cannot determine root cause with available information
  3. AI presents classifications with evidence and reasoning.
  4. You review and confirm or override each classification.
  5. AI proposes actions: fixes for test code issues, defect reports for possible bugs, resolution paths for environment and data problems.
  6. You work in a loop until all failures are resolved, reclassified, or escalated.

Your job: Confirm or override every classification. The AI is guessing based on patterns - you know whether the environment was actually stable, whether the test data was set up correctly, and whether the unexpected response is really unexpected.
Watch out: Failure classification depends on the quality of your troubleshooting references and flaky test registry. Without these, the AI will default to 'ambiguous' for most failures.
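The classification step can be approximated with a simple pattern registry - a hedged sketch, assuming you seed it from your own troubleshooting notes and flaky-test history. The regexes and labels below are illustrative only, not a real rule set:

```python
import re

# Illustrative pattern registry -- seed a real one from your own
# troubleshooting references; these regexes are examples, not a rule set.
CLASSIFICATION_RULES = [
    ("environment", re.compile(r"connection refused|503 service unavailable|timed? ?out", re.I)),
    ("test_data",   re.compile(r"no rows found|duplicate key|precondition.*missing", re.I)),
    ("test_code",   re.compile(r"assertionerror|keyerror|attributeerror", re.I)),
]

def classify_failure(log_excerpt: str) -> str:
    """First matching rule wins; anything unmatched stays 'ambiguous' for human review."""
    for label, pattern in CLASSIFICATION_RULES:
        if pattern.search(log_excerpt):
            return label
    return "ambiguous"
```

Note the deliberate default: a failure that matches nothing is surfaced as ambiguous rather than force-fitted into a category, which mirrors how the AI should behave when your references are thin.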

3.6 Functional & E2E Testing

In functional and end-to-end testing, AI assists across the execution cycle: analyzing change scope for regression prioritization, detecting failure patterns across test runs, clustering and deduplicating defects, suggesting root causes, generating execution reports, and structuring exploratory testing charters. The human runs the tests - AI helps you make sense of the results.

Workflow

  1. Confirm environment readiness.
  2. AI analyses change scope and identifies potentially impacted regression areas.
  3. AI suggests test and regression prioritisation.
  4. You execute automated E2E scenarios, manual test cases, targeted regression, and exploratory testing sessions.
  5. AI analyses execution output - identifies failure patterns, detects duplicate defects, suggests root cause areas.
  6. AI generates defect reports and a test execution report.
  7. You make the Go/No-Go decision.

Input: Feature deployed to QA or staging environment, approved test cases, updated regression suite, environment configuration, test data, risk context.
Output: Test execution results, defect reports, exploratory testing notes, regression impact summary, execution report.
Your job: Go/No-Go is always yours. AI can summarise defect status and regression impact, but the decision factors in business priorities, stakeholder risk tolerance, and organisational context that AI doesn't have.
Watch out: AI may misinterpret the technical cause of similar failures or detect coincidental patterns. It may underestimate business impact or overestimate technical severity. Check what you're feeding into log analysis - large outputs may contain sensitive information.
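As a sanity check on AI-suggested duplicate defects, a rough similarity pass over defect titles is easy to run yourself - a sketch using Python's standard difflib, with an arbitrary 0.8 threshold:

```python
from difflib import SequenceMatcher

def likely_duplicates(defect_titles: list[str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Pair up defect titles whose similarity ratio exceeds the threshold.

    A crude stand-in for the AI's duplicate detection -- useful for
    cross-checking its suggestions, not for replacing a read of the reports.
    """
    pairs = []
    for i, a in enumerate(defect_titles):
        for b in defect_titles[i + 1:]:
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs
```

Title similarity only catches near-identical wording; two reports of the same underlying bug phrased differently still need human judgement.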

3.7 Non-Functional Testing

Non-functional testing spans multiple specialist domains. The practices below cover performance, security, and accessibility - the areas where AI assistance is most practical today. Each has a different maturity level, and the value you get depends on the specificity of your context files.

Performance

Script generation

  1. AI generates executable load test scripts from API specifications.
  2. Provide the API spec, target SLAs, load profile, and test data requirements.
  3. AI generates a script with parameterised requests, authentication handling, think times, and SLA assertions.
  4. AI produces test data templates and environment configuration.
  5. You validate that request flows match real user behaviour, configure credentials, and run in the appropriate environment.

Input: API spec (OpenAPI or Swagger), target SLAs, load profile, test data requirements.
Output: Executable load test script, test data templates, environment configuration.
Watch out: Works especially well with structured API specs - the more complete the spec, the less manual correction needed.

Result analysis

  1. AI analyses performance test output, compares against baselines and SLAs, and identifies bottlenecks with system-level correlation.
  2. Provide results (CSV or JSON), baseline data, and optionally infrastructure metrics.
  3. AI calculates key metrics (p50, p95, p99, throughput, error rates).
  4. AI flags SLA violations and regressions, identifies patterns (degraded endpoints, load correlation, error spikes).
  5. AI correlates findings with system context.
  6. You confirm clean test execution, correlate with infrastructure data, and decide severity and priority.

Input: Test results (CSV or JSON), baseline data, infrastructure metrics (optional).
Output: Metrics summary, SLA violation flags, regression analysis, bottleneck identification, system correlation report.
Your job: Confirm clean test execution before handing results to AI. Correlate AI findings with infrastructure data. Severity and priority decisions are yours.
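The percentile calculation in the result-analysis workflow is straightforward to verify independently - a sketch using Python's standard statistics module, assuming latency samples in milliseconds and a single p95 SLA threshold:

```python
import statistics

def analyse_latencies(samples_ms: list[float], sla_p95_ms: float) -> dict:
    """Compute p50/p95/p99 and flag an SLA breach.

    statistics.quantiles(n=100) returns 99 cut points; indices 49, 94, and 98
    correspond to the 50th, 95th, and 99th percentiles.
    """
    cuts = statistics.quantiles(samples_ms, n=100)
    summary = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    summary["sla_p95_violated"] = summary["p95"] > sla_p95_ms
    return summary
```

Running the same numbers yourself is a cheap way to confirm the AI's metrics summary before trusting its regression and bottleneck conclusions.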

Security

Scan triage

  1. AI reduces scan noise - classifying findings as true positive, false positive, or needs investigation, then prioritising by exploitability, data sensitivity, and exposure.
  2. Provide scan results (SARIF or JSON format), scan scope, and application context.
  3. AI deduplicates and classifies findings with reasoning.
  4. AI maps findings to compliance requirements.
  5. AI suggests remediation for critical findings.
  6. You make the final triage decision, override where needed, and route vulnerabilities to the right teams.

Input: Scan results (SARIF or JSON), scan scope, application context.
Output: Deduplicated findings, classifications with reasoning, compliance mapping, remediation suggestions for critical items.
Your job: Final triage decision is yours. Override AI classifications where your domain knowledge says otherwise.

Test case generation

  1. AI generates security test cases per feature based on threat context.
  2. Provide the feature description, affected endpoints, authentication requirements, and data classification.
  3. AI maps against threat categories (injection, broken authentication, data exposure).
  4. AI generates test cases with steps, expected results, and sample payloads.
  5. AI maps to compliance requirements.
  6. You validate applicability, decide which cases are already covered by automated scans, and execute manual tests.

Input: Feature description, affected endpoints, authentication requirements, data classification.
Output: Security test cases with steps and payloads, compliance mapping.
Your job: Validate applicability. Human expertise remains critical for real-world exploitability assessment.
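The deduplication step in scan triage can be cross-checked mechanically. The sketch below assumes the common SARIF 2.1.0 layout (runs containing results, each with locations); real scanner output sometimes omits fields, hence the defensive .get() calls:

```python
import json

def dedupe_sarif(sarif_text: str) -> dict[tuple, int]:
    """Group SARIF results by (ruleId, file, line); values are occurrence counts.

    Assumes the common SARIF 2.1.0 object layout. Fields a scanner omits
    fall back to placeholder values instead of raising.
    """
    sarif = json.loads(sarif_text)
    groups: dict[tuple, int] = {}
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = (result.get("locations") or [{}])[0].get("physicalLocation", {})
            key = (
                result.get("ruleId", "unknown"),
                loc.get("artifactLocation", {}).get("uri", "unknown"),
                loc.get("region", {}).get("startLine", 0),
            )
            groups[key] = groups.get(key, 0) + 1
    return groups
```

Comparing counts like these against the AI's deduplicated list is a quick way to spot findings it silently dropped or merged too aggressively.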

Accessibility

Audit

  1. AI reviews page structure against WCAG criteria - semantic structure, ARIA usage, colour contrast, keyboard patterns, form labels, error handling.
  2. Provide HTML, DOM structure, or screenshots with UI description, WCAG target level, and user flow context.
  3. AI identifies violations with WCAG criterion references and fix suggestions.
  4. AI generates manual test cases for assistive technology testing.
  5. You test with actual assistive technology and validate visual aspects the AI can't assess.

Input: HTML or DOM structure (or screenshots with UI description), WCAG target level, user flow context.
Output: WCAG violation report with fix suggestions, manual test cases for assistive technology testing.

AI cannot replace assistive technology testing. It can check structural compliance and flag likely violations, but it cannot tell you what the experience is actually like with a screen reader, switch control, or voice navigation. Treat AI audit output as a pre-filter - it catches the mechanical issues so you can focus testing time on the real user experience with actual assistive technology.

Your job: Test with actual assistive technology. Validate visual aspects the AI can't assess. Confirm that suggested fixes don't break functionality elsewhere.
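One example of a mechanical issue such a pre-filter can catch - images without alt attributes (WCAG 1.1.1, Non-text Content) - can even be checked deterministically. A sketch with Python's standard html.parser; note that an explicitly empty alt="" is a valid decorative-image signal and is deliberately not flagged:

```python
from html.parser import HTMLParser

class AltTextChecker(HTMLParser):
    """Collect <img> tags that have no alt attribute at all.

    An empty alt="" marks a decorative image and is valid, so only a
    genuinely missing attribute counts as a violation.
    """
    def __init__(self):
        super().__init__()
        self.violations: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if "alt" not in attr_map:
                self.violations.append(attr_map.get("src", "<unknown src>"))

def missing_alt(html: str) -> list[str]:
    checker = AltTextChecker()
    checker.feed(html)
    return checker.violations
```

Checks like this are exactly the mechanical layer worth automating, so your manual sessions with a screen reader spend time on experience, not markup.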

4. Quick-Start Guide for QA

You've read the principles. You understand the model. Now do something with it.

This guide gets you from zero to a working AI-assisted QA session in two steps: building your context kit, then running your first workflow. Neither step requires permission, tooling setup, or team buy-in to start. You need a project you're currently working on and an AI tool you have access to.

Follow these steps

Step 1: Build your context kit

Before any workflow produces useful output, the AI needs to know how your project works. This is a one-time investment - once the files exist, you reuse them across every session.

Start with three files. Not all eleven. Three.

business-context.md - Who the users are, what the product does, and what the business is trying to achieve.

team-conventions.md - How the team works.

definition-of-ready.md - What must be true before a story enters a sprint.

Write them in plain Markdown, store them somewhere the team can access. Don't aim for perfect — aim for accurate. You can add the remaining L1 and L2 files as you go.

Once you have them, paste all three into your AI tool at the start of a session before you ask it anything. That's your context load.

Step 2: Run your first workflow

Pick something from the current sprint. The best starting point is a story that's in refinement or just been refined - the closer to the actual work, the more immediately useful the output.

Try this: Acceptance Criteria Analysis

It's the lowest-risk starting point. You're not generating tests yet, not writing code, not touching anything that goes into the product. You're asking AI to review a story for completeness and tell you what's missing. The output is a list of questions and gaps you bring to refinement - you were going to do this analysis anyway.

The recommended tool is GitHub Copilot with Claude Sonnet (or Opus for more complex tasks). Open Copilot Chat in Visual Studio Code with Ctrl+Alt+I (Windows) or Cmd+Ctrl+I (Mac), open the model picker at the bottom of the chat panel, and select Claude Sonnet (highest version available). If it isn't visible, your organisation administrator needs to enable the Claude policy in Copilot settings first. If that isn't resolved yet, GPT models (the current Copilot default) should be available in the same picker without additional configuration and should handle this workflow adequately - the difference in output quality may be noticeable, but it won't block a first session.

  1. In the Copilot Chat input, type # and attach your three context files:

    1. business-context.md,
    2. team-conventions.md,
    3. definition-of-ready.md.
  2. Prompt: 
    "You are a senior QA engineer with deep experience in [your domain, e.g. insurance / e-commerce / SaaS]. You are reviewing a user story before sprint planning. Using the project context provided, analyse the acceptance criteria below for completeness. Identify missing negative paths, boundary conditions, ambiguous statements, and edge cases a developer might interpret differently. Flag any cross-system dependencies or integration risks based on what you know about the system. Present your findings as a prioritised list with a short rationale for each item — the output will be used in a refinement session with the development team.

User story: [paste story]
Acceptance criteria: [paste AC]"

  3. Iteratively ask follow-up questions; critique and discuss the output.
  4. Take the output. Discard what's irrelevant. Keep what's useful.
  5. Bring the findings to refinement.

That's it. The first session won't be perfect. The AI will flag some things that aren't relevant to your project - that's normal, and it improves as your context files get more specific. What you're learning in this first session is how the AI interprets your context and where it needs more information.

What comes next

After your first session, you'll have a sense of where the output fell short. That feedback tells you what to add to your context files - not everything at once, but the specific gaps the session exposed. This is how the kit matures: real usage, targeted improvement.

Once Acceptance Criteria Analysis feels natural, the next step is High-Level Test Scenario Generation - you're feeding the same story back in and asking for test scenarios from the AC you've just validated. The workflow is in the Efficiency Patterns section.

5. Measuring Impact

The hardest question to answer honestly is also the one you'll be asked most: "What does it actually give you?"

The temptation is to reach for a time-saving number. The problem is that without tracked baselines - which most teams don't have - any number you produce is an estimate dressed up as a measurement. A sceptical client will spot that quickly. The more defensible approach is to focus on quality signals: observable outcomes that don't require a baseline, that speak directly to what testing is supposed to achieve, and that clients already care about.

Below are the two areas where measurement is most practical right now.

AC Analysis: acceptance criteria gap rate

The signal to track: how many stories get their acceptance criteria updated after QA review.

If AI-assisted AC analysis is finding real gaps, stories get revised. That's a binary, observable outcome per story - no timer required. Start counting it now. Within a sprint or two you have a before-and-after comparison.

The downstream metric is stronger still. Most defect trackers have a root cause field that teams fill in inconsistently or not at all. Start using it consistently - specifically, tag defects that trace back to incomplete or incorrect acceptance criteria. Over a few sprints, that rate should drop. A client understands "X% of our defects previously traced back to AC gaps; now it's Y%" without any explanation. It connects directly to something they care about: defects that shouldn't have made it through.

Track both. The gap rate shows the analysis is working. The defect root cause trend shows it's making a difference downstream.
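As a sketch of how these two counts turn into trackable rates - the record shapes, field names, and root-cause tag below are hypothetical, not taken from any particular tracker, so map them to whatever your own export produces:

```python
# Sketch: AC gap rate and AC-related defect rate from tracker exports.
# All field names ("ac_revised_after_review", "root_cause") are assumptions.

def ac_gap_rate(stories):
    """Share of stories whose acceptance criteria were revised after QA review."""
    revised = sum(1 for s in stories if s["ac_revised_after_review"])
    return revised / len(stories)

def ac_defect_rate(defects):
    """Share of defects whose root cause traces back to AC gaps."""
    ac_related = sum(1 for d in defects if d["root_cause"] == "acceptance-criteria")
    return ac_related / len(defects)

stories = [
    {"id": "S-101", "ac_revised_after_review": True},
    {"id": "S-102", "ac_revised_after_review": False},
    {"id": "S-103", "ac_revised_after_review": True},
    {"id": "S-104", "ac_revised_after_review": False},
]
defects = [
    {"id": "D-1", "root_cause": "acceptance-criteria"},
    {"id": "D-2", "root_cause": "regression"},
    {"id": "D-3", "root_cause": "acceptance-criteria"},
    {"id": "D-4", "root_cause": "environment"},
]

print(f"AC gap rate: {ac_gap_rate(stories):.0%}")        # revised / total stories
print(f"AC-related defects: {ac_defect_rate(defects):.0%}")
```

A shared spreadsheet with one row per story and one row per defect gives you exactly these two columns; the arithmetic is deliberately trivial so the numbers stay explainable to a client.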

Test Design: time spent and defect escape rate

Time spent is worth measuring here, even if imprecisely. You don't need formal time-tracking - ask engineers to note how long test design takes for a given story for two sprints before AI is introduced. A quick note in the story or a shared sheet is enough. Self-reported numbers aren't perfect, but they're consistent enough to be useful, and the difference tends to be visible enough that you don't need precision to make the point.

Defect escape rate is the metric that matters to a client. Defects that slipped through testing and were found in UAT, staging, or production - most teams can extract this from their tracker right now using environment fields on defect records. It speaks directly to test coverage quality, which is what AI-assisted test design is supposed to improve.

Together they tell a coherent story: test design takes less time and produces fewer escapes. That combination is concrete, business-relevant, and hard to argue with.
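The escape-rate calculation can be sketched the same way - again, the "found_in" field and the environment names are placeholders for whatever your tracker records:

```python
# Sketch: defect escape rate from the environment field on defect records.
# Field name and environment values are assumptions; adapt to your tracker.

ESCAPE_ENVIRONMENTS = {"uat", "staging", "production"}

def escape_rate(defects):
    """Share of defects that slipped past testing into UAT, staging, or production."""
    escaped = sum(1 for d in defects if d["found_in"] in ESCAPE_ENVIRONMENTS)
    return escaped / len(defects)

sprint_defects = [
    {"id": "D-10", "found_in": "test"},
    {"id": "D-11", "found_in": "test"},
    {"id": "D-12", "found_in": "uat"},
    {"id": "D-13", "found_in": "production"},
    {"id": "D-14", "found_in": "test"},
]

print(f"Escape rate: {escape_rate(sprint_defects):.0%}")  # escaped / total defects
```

Computed per sprint, this gives the trend line you need: flat or falling escape rate alongside reduced test design time is the combination described above.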

What to be honest about

These metrics show direction, not precision. Story complexity varies, sprint load varies, team composition changes. Treat the signals as evidence, not proof - and frame them that way when presenting to clients. "We observe these trends consistently across the last three sprints" is a credible claim. "We save exactly 2.5 hours per story" is not, unless you can show the working.

The team that tracks these signals consistently for a quarter will have something genuinely useful to say. Start now, before AI is fully embedded — the baseline you build in the next few sprints is the only honest one you'll ever have.