AI-Assisted Development: A Practical Guide

How to produce reliable software when AI writes most of the code.

AI Collaboration

This guide was written by Claude (Opus 4.5) with input from Gemini 3 and GPT-5.2, based on conversations with Will Worth. It reflects practices for a rapidly evolving landscape and should be updated as tools and capabilities change.

Executive Summary

AI can now write functional code faster than humans. The bottleneck has shifted from "can we produce code?" to "can we verify it actually works?" This guide provides concrete practices for maintaining quality and avoiding common failure modes.

Do

  • Review AI-generated tests as carefully as you'd review code — tests define what "correct" means
  • Use adversarial review — have one AI model try to break another's output
  • Verify dependencies manually — AI hallucinates package names that attackers squat on
  • Generate documentation alongside code — future debugging depends on it
  • Focus human attention on architecture, not implementation — state boundaries, failure modes, data flow
  • Use property-based testing — let machines explore edge cases you wouldn't think of
  • Require observability from the start — if you can't monitor it, you can't verify it works
  • Test your tests with mutation testing — prove your tests actually catch failures

Don't

  • Trust that passing tests mean correct behaviour — AI can write tests that assert the wrong thing
  • Assume AI-generated code follows your architectural patterns — verify structure, not just function
  • Skip reading package.json, go.mod, or dependency files — supply chain attacks are real and increasing
  • Rely on AI to explain systems it built — summarisation tools hallucinate connections
  • Treat dynamic verification as free — balance thoroughness against infrastructure costs
  • Merge code without the dashboards to monitor it — observability is part of "done"

Context: Why This Approach

Traditional code review — a human reading every line — doesn't scale when AI generates code faster than humans can read it. But abandoning verification entirely leads to systems that look correct while being fundamentally broken.

The solution is to shift what humans verify:

  • Before: Humans inspect implementation details (how it's built)
  • After: Humans verify behaviour and constraints (what it does, what it must not do)

This means reading tests, traces, metrics, and architecture — not diffs. The practices below operationalise this shift.


Practice 1: Review Tests, Not Just Code

When AI generates both implementation and tests, a dangerous loop emerges: the tests might assert incorrect behaviour, and the implementation passes them perfectly.

The Problem

```javascript
// AI-generated test that "passes" but verifies nothing useful
describe('calculateOrderTotal', () => {
  it('should calculate the total', () => {
    const result = calculateOrderTotal(mockOrder);
    expect(result).toBeDefined(); // This tells us nothing
  });
});
```

```javascript
// AI-generated test that encodes the wrong requirement
describe('applyDiscount', () => {
  it('should apply 10% discount for premium users', () => {
    const result = applyDiscount(100, { isPremium: true });
    expect(result).toBe(90); // But the business requirement was 15%
  });
});
```

The second example is more dangerous. The test passes. The code works exactly as tested. But the test encodes the wrong business rule, and no amount of dynamic verification will catch this — because the verification itself is wrong.
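
The fix is to derive the test from the requirement rather than from the implementation's current behaviour. A minimal sketch, assuming the correct rule really is the 15% discount mentioned above and that non-premium users get no discount:

```typescript
// Hypothetical corrected test: the assertion encodes the stated business rule
// (15% off for premium users), so an implementation of the wrong rule can no longer pass.
describe('applyDiscount', () => {
  it('applies the 15% premium discount from the pricing spec', () => {
    const result = applyDiscount(100, { isPremium: true });
    expect(result).toBe(85); // 100 less 15%, per the requirement
  });

  it('applies no discount for non-premium users', () => {
    expect(applyDiscount(100, { isPremium: false })).toBe(100);
  });
});
```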

The Practice

When reviewing AI-generated tests, ask:

  1. Do these tests encode the actual requirements? Cross-reference with specs, tickets, or stakeholder expectations.
  2. What's not being tested? Look for gaps — error cases, edge conditions, integration points.
  3. Are assertions meaningful? toBeDefined(), toBeTruthy(), or checking that a function "doesn't throw" are often useless.
  4. Would a broken implementation still pass? If yes, the test has no value.

Human review time shifts from implementation to specification. This is the new bottleneck.


Practice 2: Use Adversarial Review

Have one AI model attempt to break, critique, or find flaws in another model's output.

The Practice

Prompt for adversarial review:

```text
"Review the following pull request. Your goal is to find:
1. Logic errors or incorrect behaviour
2. Security vulnerabilities
3. Missing error handling
4. Violations of the architectural patterns described below
5. Edge cases that aren't handled
6. Dependencies that look suspicious or unnecessary

Be adversarial. Assume the code has bugs and find them.

[paste code here]
[paste architectural context here]"
```

This doesn't guarantee catching all problems, but it catches a meaningful percentage that a single generation pass misses. Different models have different blind spots.
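
In practice this can run as a CI step: collect the diff, prepend the prompt and architectural context, and send it to a model other than the one that generated the code. A rough sketch, with `reviewClient` as a hypothetical stand-in for whichever model SDK you use and `docs/adversarial-review.md` as an assumed location for the prompt text:

```typescript
import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';

// Hypothetical client for the "second opinion" model; swap in a real SDK.
declare const reviewClient: { complete(prompt: string): Promise<string> };

async function adversarialReview(baseBranch = 'main'): Promise<void> {
  // Review the diff rather than the whole repo: cheaper and more focused.
  const diff = execSync(`git diff ${baseBranch}...HEAD`, { encoding: 'utf8' });

  // Assumed file containing the adversarial prompt plus architectural context.
  const prompt = readFileSync('docs/adversarial-review.md', 'utf8');

  const findings = await reviewClient.complete(`${prompt}\n\n${diff}`);
  console.log(findings); // surface in CI output or post as a PR comment
}
```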

Limitations

Adversarial review between AI models won't reliably catch:

  • Requirements that are fundamentally misunderstood (both models share the same wrong context)
  • Novel security vulnerabilities outside training distribution
  • Subtle architectural drift over time

It's a layer of defence, not a complete solution.


Practice 3: Verify Dependencies Manually

AI models frequently hallucinate package names. Attackers register these hallucinated names and inject malware.

The Problem

```jsonc
// AI might generate this in package.json
{
  "dependencies": {
    "fast-json-sanitizer": "^2.1.0"  // This package doesn't exist
  }                                  // Or worse: an attacker registered it
}
```

When you run npm install, you're now executing attacker-controlled code.

The Practice

For every PR that adds or modifies dependencies:

  1. Verify the package exists and is legitimate — check npm/PyPI/crates.io directly
  2. Check the publisher — is this the expected maintainer?
  3. Look at download counts and maintenance activity — newly registered packages with low downloads are suspicious
  4. Review what the package actually does — does it match what you need?

This is one area where static review (reading the dependency file) remains essential. Dynamic verification won't catch a malicious package that's designed to pass your tests.
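
Parts of this check are scriptable. A minimal sketch against the public npm registry (registry.npmjs.org), assuming Node 18+ for the global fetch; it automates only the "does this exist, and how new is it" questions, not the judgement:

```typescript
// Flags dependencies that don't exist on npm or were published very recently.
// The 90-day threshold is an arbitrary example; tune it to your risk tolerance.
async function checkPackage(name: string): Promise<void> {
  const res = await fetch(`https://registry.npmjs.org/${encodeURIComponent(name)}`);

  if (res.status === 404) {
    console.warn(`${name}: not found on npm (possibly hallucinated, possibly squattable)`);
    return;
  }

  const meta = await res.json();
  const createdIso = meta.time?.created; // registry metadata includes publish timestamps
  if (!createdIso) {
    console.log(`${name}: exists (no publish timestamp available)`);
    return;
  }

  const ageDays = Math.round((Date.now() - new Date(createdIso).getTime()) / 86_400_000);
  if (ageDays < 90) {
    console.warn(`${name}: first published ${ageDays} days ago; review with extra care`);
  } else {
    console.log(`${name}: exists, first published ${createdIso.slice(0, 10)}`);
  }
}
```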

Tooling

  • npm audit / yarn audit catch known vulnerabilities but not malicious new packages
  • Socket.dev and similar tools can flag suspicious dependency patterns
  • Lock files (package-lock.json, go.sum) should always be reviewed for unexpected changes

Practice 4: Generate Documentation Alongside Code

When AI generates code you haven't read in detail, debugging becomes nearly impossible without documentation.

The Problem

At 3 AM, the system fails. You need to understand a module you've never read. The AI that generated it isn't available (or hallucinates when you ask it to explain). You're blind.

The Practice

For every significant feature or module, generate and maintain:

  1. Purpose statement — what does this module do and why does it exist?
  2. Data flow diagram — what comes in, what goes out, what's the happy path?
  3. Failure modes — what can go wrong and how should it be handled?
  4. Dependencies — what does this rely on and what relies on it?
  5. Key decisions — why was it built this way and not another way?

```markdown
# OrderProcessor Module

## Purpose
Transforms raw cart data into validated orders, applying pricing rules and inventory checks.

## Data Flow
1. Receives CartDTO from checkout service
2. Validates inventory via InventoryService
3. Calculates final pricing via PricingEngine
4. Persists to OrderRepository
5. Emits OrderCreated event

## Failure Modes
- Inventory unavailable: Returns 409 Conflict, cart remains intact
- Pricing service timeout: Retries 3x with exponential backoff, then fails with 503
- Database write failure: Logs to dead letter queue for manual recovery

## Dependencies
- InventoryService (synchronous call)
- PricingEngine (synchronous call)
- OrderRepository (PostgreSQL)
- EventBus (async, Kafka)

## Key Decisions
- Synchronous inventory check was chosen over eventual consistency because
  overselling has higher business cost than occasional checkout failures
- Pricing is calculated server-side (not trusted from client) for security
```

This documentation is your "just-in-time understanding" when something breaks. Generate it as you build, not after.
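
One way to get this consistently is to ask for the documentation in the same prompt that requests the code, using the structure above as the template. A sketch of such a prompt (module name and wording are illustrative):

```text
"Implement the OrderProcessor module described in the attached spec.
Alongside the code, produce a README covering:
- Purpose: what the module does and why it exists
- Data flow: inputs, outputs, and the happy path
- Failure modes: what can go wrong and how each case is handled
- Dependencies: what it calls and what calls it
- Key decisions: why it was built this way and not another way"
```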


Practice 5: Focus Human Review on Architecture

You can't read every line. Focus on what matters most: system shape, not implementation details.

What to Review

| Review This | Not This |
| --- | --- |
| Service boundaries and APIs | Internal function implementations |
| State management approach | Individual state updates |
| Error handling strategy | Every try/catch block |
| Data flow between components | Data transformations within components |
| Dependency choices | How dependencies are used |
| Security boundaries | Every input validation |

Questions for Architectural Review

  1. Does this introduce new dependencies? Are they justified?
  2. Does this change service boundaries? Will it affect other teams?
  3. Does this create new state? Where does it live? How is it managed?
  4. Does this change failure modes? What happens when it breaks?
  5. Does this violate existing patterns? Consistency matters for maintainability.

Implementation details can be wrong and still get fixed easily. Architectural mistakes compound and become expensive.
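
Some architectural constraints can also be checked mechanically, so drift gets caught even when no human reads the diff. A sketch of a cheap "fitness" test, assuming a Node project with Jest; the boundary rule, paths, and the provider SDK name are all examples to replace with your own:

```typescript
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

// Recursively list files under a directory.
function listFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
    entry.isDirectory() ? listFiles(join(dir, entry.name)) : [join(dir, entry.name)]
  );
}

// Example boundary rule: only src/payments may import the payment provider SDK.
test('only the payments module imports the payment provider SDK', () => {
  const offenders = listFiles('src')
    .filter((file) => file.endsWith('.ts') && !file.startsWith(join('src', 'payments')))
    .filter((file) => readFileSync(file, 'utf8').includes("from 'stripe'"));

  expect(offenders).toEqual([]);
});
```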


Practice 6: Use Property-Based Testing

Standard unit tests check specific examples. Property-based testing generates thousands of random inputs and verifies that properties hold across all of them.

What Is Property-Based Testing?

Instead of writing:

```javascript
test('sort returns sorted array', () => {
  expect(sort([3, 1, 2])).toEqual([1, 2, 3]);
  expect(sort([5, 4])).toEqual([4, 5]);
});
```

You write:

```javascript
import fc from 'fast-check';

test('sort returns sorted array', () => {
  fc.assert(
    fc.property(fc.array(fc.integer()), (arr) => {
      const sorted = sort(arr);

      // Property 1: Output has same length as input
      expect(sorted.length).toBe(arr.length);

      // Property 2: Output contains same elements
      expect(sorted.slice().sort()).toEqual(arr.slice().sort());

      // Property 3: Output is actually sorted
      for (let i = 0; i < sorted.length - 1; i++) {
        expect(sorted[i]).toBeLessThanOrEqual(sorted[i + 1]);
      }
    })
  );
});
```

The framework generates hundreds of random arrays and verifies your properties hold for all of them. It finds edge cases you wouldn't think to test.

Why This Matters for AI-Generated Code

AI tends to handle the happy path well. Property-based testing automatically explores the edges — empty inputs, huge inputs, negative numbers, unicode strings, null values — without you having to enumerate every case.
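
With fast-check, that exploration can be pointed directly at those awkward inputs. A sketch, where `normalizeLineItem` is a hypothetical function under test:

```typescript
import fc from 'fast-check';

// Arbitraries that deliberately cover what example-based tests tend to skip:
// empty strings, huge and negative numbers, and missing values.
const lineItemInput = fc.record({
  name: fc.oneof(fc.constant(''), fc.string()),
  quantity: fc.integer({ min: -1_000_000, max: 1_000_000 }),
  note: fc.option(fc.string(), { nil: null }),
});

test('normalizeLineItem never produces a NaN total', () => {
  fc.assert(
    fc.property(lineItemInput, (input) => {
      const result = normalizeLineItem(input); // hypothetical function under test
      expect(Number.isNaN(result.total)).toBe(false);
    })
  );
});
```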

Tools

  • JavaScript/TypeScript: fast-check
  • Python: Hypothesis
  • Go: gopter, rapid
  • Rust: proptest, quickcheck

Practice 7: Require Observability From the Start

If you can't monitor a feature in production, you can't verify it works. Observability is part of the definition of "done."

The Practice

Every feature ships with:

  1. Structured logs for key operations
  2. Metrics for success/failure rates, latency, throughput
  3. Traces connecting requests across services
  4. Alerts for anomalous behaviour
  5. Dashboards visualising the above

Example: Minimum Observability for a New Endpoint

```typescript
async function processOrder(req: Request, res: Response) {
  const span = tracer.startSpan('processOrder');
  const startTime = Date.now();

  try {
    logger.info('Processing order', {
      orderId: req.body.orderId,
      userId: req.user.id
    });

    const result = await orderService.process(req.body);

    metrics.increment('orders.processed', { status: 'success' });
    metrics.histogram('orders.latency', Date.now() - startTime);

    span.setStatus({ code: SpanStatusCode.OK });
    res.json(result);

  } catch (error) {
    logger.error('Order processing failed', {
      orderId: req.body.orderId,
      error: error.message
    });

    metrics.increment('orders.processed', { status: 'failure' });
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });

    res.status(500).json({ error: 'Processing failed' });
  } finally {
    span.end();
  }
}
```

When AI generates code, include observability requirements in the prompt:

1"Generate an order processing endpoint. Include:
2- Structured logging for all operations
3- Metrics for success/failure and latency
4- OpenTelemetry tracing
5- Error handling that preserves context for debugging"

Practice 8: Test Your Tests with Mutation Testing

If you're not reading the implementation, you need to know your tests actually catch bugs. Mutation testing proves this.

What Is Mutation Testing?

Mutation testing automatically modifies your code (introduces "mutants") and checks whether your tests fail. If the tests still pass when the code is broken, your tests aren't catching what they should.

Example

Your code:

```typescript
function isEligibleForDiscount(user: User): boolean {
  return user.orderCount >= 5 && user.accountAge > 30;
}
```

Mutation testing might create:

```typescript
// Mutant 1: Changed >= to >
return user.orderCount > 5 && user.accountAge > 30;

// Mutant 2: Changed && to ||
return user.orderCount >= 5 || user.accountAge > 30;

// Mutant 3: Changed 5 to 6
return user.orderCount >= 6 && user.accountAge > 30;
```

If your tests pass with any of these mutants, you have a gap. A user with exactly 5 orders should be eligible, but Mutant 1 would reject them — and if your tests don't catch that, you'd never know.

Tools

  • JavaScript/TypeScript: Stryker
  • Python: mutmut, cosmic-ray
  • Java: PITest
  • Go: go-mutesting

When to Use

Mutation testing is computationally expensive — it runs your entire test suite many times. Use it:

  • On critical business logic (payment, eligibility, pricing)
  • When you're delegating test generation to AI
  • As a periodic check rather than on every commit
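
One way to apply that scoping with StrykerJS is to point it only at the critical modules and run it on a schedule rather than per commit. A minimal config sketch, assuming a Jest-based TypeScript project; option names follow StrykerJS's documented configuration, but verify them against the version you install:

```javascript
// stryker.conf.mjs -- paths and thresholds are examples, not recommendations
export default {
  mutate: ['src/billing/**/*.ts', 'src/pricing/**/*.ts'], // critical paths only
  testRunner: 'jest',
  coverageAnalysis: 'perTest',
  reporters: ['clear-text', 'html'],
  thresholds: { high: 90, low: 75, break: 60 }, // fail the run if the mutation score drops below 60
};
```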

Practice 9: Identify and Protect Critical Paths

Not all code is equally important. Identify the paths where failure is catastrophic and apply disproportionate verification there.

The Practice

For your system, identify:

  1. Money paths — anything involving payments, pricing, billing
  2. Security boundaries — authentication, authorisation, data access
  3. Data integrity — writes to persistent storage, especially irreversible ones
  4. External commitments — anything that triggers real-world actions (emails, shipments, API calls to third parties)

These paths get:

  • More thorough testing (property-based, not just example-based)
  • Human review even when other code doesn't
  • Additional runtime checks and monitoring
  • Feature flags and gradual rollouts
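
For the last point, a gradual rollout on a critical path often pairs a flag with a shadow comparison against the old implementation. A sketch, where `flags`, `metrics`, `logger`, the two pricing engines, and the `Cart` and `Money` types are hypothetical stand-ins for your own infrastructure:

```typescript
// Route flagged traffic through the new pricing engine, but compare its result
// against the legacy engine and serve the legacy price on any mismatch, so the
// rollout stays reversible and users never see an unverified total.
async function calculateCheckoutTotal(cart: Cart): Promise<Money> {
  const useNewEngine = await flags.isEnabled('pricing-engine-v2', { userId: cart.userId });

  if (!useNewEngine) {
    return legacyPricingEngine.total(cart);
  }

  const [candidate, reference] = await Promise.all([
    newPricingEngine.total(cart),
    legacyPricingEngine.total(cart),
  ]);

  if (!candidate.equals(reference)) {
    metrics.increment('pricing.v2.mismatch');
    logger.warn('Pricing mismatch during rollout; serving legacy price', { cartId: cart.id });
    return reference;
  }

  return candidate;
}
```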

Example: Identifying Critical Paths

```text
E-commerce system critical paths:
├── Checkout
│   ├── Price calculation (must match displayed price exactly)
│   ├── Payment processing (must not double-charge)
│   └── Inventory decrement (must not oversell)
├── Authentication
│   ├── Login (must not leak whether email exists)
│   ├── Password reset (must expire tokens correctly)
│   └── Session management (must invalidate on logout)
└── Order fulfillment
    ├── Address validation (must be shippable)
    └── Inventory reservation (must handle concurrent orders)
```

AI can write the code for these paths. AI can write the tests. But human review of the tests and architecture remains essential because the cost of failure is high.


Summary: The Verification Stack

When AI writes most of the code, verification happens in layers:

| Layer | What | Who |
| --- | --- | --- |
| Generation | AI writes code | AI (Claude, GPT, Gemini, etc.) |
| Adversarial Review | Different AI tries to break it | AI |
| Automated Testing | Unit, integration, property-based, mutation | Machines |
| Dependency Verification | Check packages are legitimate | Human (assisted by tools) |
| Test Review | Verify tests encode correct requirements | Human |
| Architectural Review | Verify system shape is right | Human |
| Observability | Verify behaviour in production | Machines + Human interpretation |
| Accountability | Decide what risks are acceptable | Human |

The goal is not to eliminate human involvement but to focus it where it produces the most value: defining correctness, reviewing verification, and accepting responsibility.

