AI-Assisted Development: A Practical Guide
How to produce reliable software when AI writes most of the code.
13 min read
AI Collaboration
This guide was written by Claude (Opus 4.5) with input from Gemini 3 and GPT-5.2, based on conversations with Will Worth. It reflects practices for a rapidly evolving landscape and should be updated as tools and capabilities change.
Executive Summary
AI can now write functional code faster than humans. The bottleneck has shifted from "can we produce code?" to "can we verify it actually works?" This guide provides concrete practices for maintaining quality and avoiding common failure modes.
Do
- Review AI-generated tests as carefully as you'd review code — tests define what "correct" means
- Use adversarial review — have one AI model try to break another's output
- Verify dependencies manually — AI hallucinates package names that attackers squat on
- Generate documentation alongside code — future debugging depends on it
- Focus human attention on architecture, not implementation — state boundaries, failure modes, data flow
- Use property-based testing — let machines explore edge cases you wouldn't think of
- Require observability from the start — if you can't monitor it, you can't verify it works
- Test your tests with mutation testing — prove your tests actually catch failures
Don't
- Trust that passing tests mean correct behaviour — AI can write tests that assert the wrong thing
- Assume AI-generated code follows your architectural patterns — verify structure, not just function
- Skip reading `package.json`, `go.mod`, or other dependency files — supply chain attacks are real and increasing
- Rely on AI to explain systems it built — summarisation tools hallucinate connections
- Treat dynamic verification as free — balance thoroughness against infrastructure costs
- Merge code without the dashboards to monitor it — observability is part of "done"
Context: Why This Approach
Traditional code review — a human reading every line — doesn't scale when AI generates code faster than humans can read it. But abandoning verification entirely leads to systems that look correct while being fundamentally broken.
The solution is to shift what humans verify:
- Before: Humans inspect implementation details (how it's built)
- After: Humans verify behaviour and constraints (what it does, what it must not do)
This means reading tests, traces, metrics, and architecture — not diffs. The practices below operationalise this shift.
Practice 1: Review Tests, Not Just Code
When AI generates both implementation and tests, a dangerous loop emerges: the tests might assert incorrect behaviour, and the implementation passes them perfectly.
The Problem
```javascript
// AI-generated test that "passes" but verifies nothing useful
describe('calculateOrderTotal', () => {
  it('should calculate the total', () => {
    const result = calculateOrderTotal(mockOrder);
    expect(result).toBeDefined(); // This tells us nothing
  });
});
```

```javascript
// AI-generated test that encodes the wrong requirement
describe('applyDiscount', () => {
  it('should apply 10% discount for premium users', () => {
    const result = applyDiscount(100, { isPremium: true });
    expect(result).toBe(90); // But the business requirement was 15%
  });
});
```

The second example is more dangerous. The test passes. The code works exactly as tested. But the test encodes the wrong business rule, and no amount of dynamic verification will catch this — because the verification itself is wrong.
The Practice
When reviewing AI-generated tests, ask:
- Do these tests encode the actual requirements? Cross-reference with specs, tickets, or stakeholder expectations.
- What's not being tested? Look for gaps — error cases, edge conditions, integration points.
- Are assertions meaningful? `toBeDefined()`, `toBeTruthy()`, or checking that a function "doesn't throw" are often useless.
- Would a broken implementation still pass? If yes, the test has no value.
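As a concrete illustration of the last two questions, the weak `applyDiscount` test above can be rewritten so each assertion pins a specific business rule (using the 15% premium discount mentioned earlier; the non-premium and zero-total cases are illustrative assumptions, not stated requirements):

```typescript
// Strengthened tests: each assertion pins a business rule, so a broken
// implementation cannot pass. The non-premium and zero-total cases are
// illustrative assumptions added for this sketch.
describe('applyDiscount', () => {
  it('applies the agreed 15% discount for premium users', () => {
    expect(applyDiscount(100, { isPremium: true })).toBe(85);
  });

  it('applies no discount for non-premium users', () => {
    expect(applyDiscount(100, { isPremium: false })).toBe(100);
  });

  it('never produces a negative total', () => {
    expect(applyDiscount(0, { isPremium: true })).toBeGreaterThanOrEqual(0);
  });
});
```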
Human review time shifts from implementation to specification. This is the new bottleneck.
Practice 2: Use Adversarial Review
Have one AI model attempt to break, critique, or find flaws in another model's output.
The Practice
```text
Prompt for adversarial review:

"Review the following pull request. Your goal is to find:
1. Logic errors or incorrect behaviour
2. Security vulnerabilities
3. Missing error handling
4. Violations of the architectural patterns described below
5. Edge cases that aren't handled
6. Dependencies that look suspicious or unnecessary

Be adversarial. Assume the code has bugs and find them.

[paste code here]
[paste architectural context here]"
```

This doesn't guarantee catching all problems, but it catches a meaningful percentage that a single generation pass misses. Different models have different blind spots.
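This second pass can also be scripted into CI. A minimal sketch, assuming a hypothetical `ModelClient` wrapper around whichever model SDK you actually use (the interface and function names are illustrative, not a real API):

```typescript
// Sketch of an adversarial second pass. ModelClient is a hypothetical
// stand-in for whichever model SDK you actually use — not a real API.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

const ADVERSARIAL_PROMPT = `Review the following pull request. Be adversarial.
Assume the code has bugs and find them: logic errors, security vulnerabilities,
missing error handling, architectural violations, unhandled edge cases, and
suspicious or unnecessary dependencies.`;

// Ideally `reviewer` is a different model than the one that generated the
// code, since different models have different blind spots.
async function adversarialReview(
  reviewer: ModelClient,
  diff: string,
  architectureNotes: string
): Promise<string> {
  return reviewer.complete(
    `${ADVERSARIAL_PROMPT}\n\n--- Code ---\n${diff}\n\n--- Architecture ---\n${architectureNotes}`
  );
}
```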
Limitations
Adversarial review between AI models won't reliably catch:
- Requirements that are fundamentally misunderstood (both models share the same wrong context)
- Novel security vulnerabilities outside training distribution
- Subtle architectural drift over time
It's a layer of defence, not a complete solution.
Practice 3: Verify Dependencies Manually
AI models frequently hallucinate package names. Attackers register these hallucinated names and inject malware.
The Problem
```jsonc
// AI might generate this in package.json
{
  "dependencies": {
    "fast-json-sanitizer": "^2.1.0" // This package doesn't exist
  }                                 // Or worse: an attacker registered it
}
```

When you run `npm install`, you're now executing attacker-controlled code.
The Practice
For every PR that adds or modifies dependencies:
- Verify the package exists and is legitimate — check npm/PyPI/crates.io directly
- Check the publisher — is this the expected maintainer?
- Look at download counts and maintenance activity — newly registered packages with low downloads are suspicious
- Review what the package actually does — does it match what you need?
This is one area where static review (reading the dependency file) remains essential. Dynamic verification won't catch a malicious package that's designed to pass your tests.
Tooling
- `npm audit` / `yarn audit` catch known vulnerabilities but not malicious new packages
- Socket.dev and similar tools can flag suspicious dependency patterns
- Lock files (`package-lock.json`, `go.sum`) should always be reviewed for unexpected changes
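The manual checks above can be partly scripted against npm's public registry and downloads endpoints. A rough sketch (Node 18+ for global `fetch`; the age and download thresholds are arbitrary examples, and anything flagged still gets reviewed by hand):

```typescript
// Rough sketch: pull basic legitimacy signals from npm's public APIs before
// approving a new dependency. Thresholds are arbitrary illustrations; scoped
// package names (@scope/pkg) may need URL-encoding.
async function checkNpmPackage(name: string): Promise<void> {
  const meta = await fetch(`https://registry.npmjs.org/${name}`);
  if (meta.status === 404) {
    throw new Error(`${name} does not exist on npm — likely hallucinated`);
  }
  const packument = await meta.json();
  const created = new Date(packument.time.created);
  const ageDays = (Date.now() - created.getTime()) / 86_400_000;

  const dl = await fetch(`https://api.npmjs.org/downloads/point/last-week/${name}`);
  const { downloads } = await dl.json();

  console.log(`${name}: first published ${created.toISOString()}, ${downloads} downloads last week`);
  if (ageDays < 90 || downloads < 1_000) {
    console.warn(`${name} is new or low-traffic — inspect it by hand before merging`);
  }
}

// Example: the hallucinated package from the snippet above would fail the existence check.
checkNpmPackage("fast-json-sanitizer").catch((err) => {
  console.error(err.message);
  process.exit(1);
});
```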
Practice 4: Generate Documentation Alongside Code
When AI generates code you haven't read in detail, debugging becomes nearly impossible without documentation.
The Problem
At 3 AM, the system fails. You need to understand a module you've never read. The AI that generated it isn't available (or hallucinates when you ask it to explain). You're blind.
The Practice
For every significant feature or module, generate and maintain:
- Purpose statement — what does this module do and why does it exist?
- Data flow diagram — what comes in, what goes out, what's the happy path?
- Failure modes — what can go wrong and how should it be handled?
- Dependencies — what does this rely on and what relies on it?
- Key decisions — why was it built this way and not another way?
```markdown
# OrderProcessor Module

## Purpose
Transforms raw cart data into validated orders, applying pricing rules and inventory checks.

## Data Flow
1. Receives CartDTO from checkout service
2. Validates inventory via InventoryService
3. Calculates final pricing via PricingEngine
4. Persists to OrderRepository
5. Emits OrderCreated event

## Failure Modes
- Inventory unavailable: Returns 409 Conflict, cart remains intact
- Pricing service timeout: Retries 3x with exponential backoff, then fails with 503
- Database write failure: Logs to dead letter queue for manual recovery

## Dependencies
- InventoryService (synchronous call)
- PricingEngine (synchronous call)
- OrderRepository (PostgreSQL)
- EventBus (async, Kafka)

## Key Decisions
- Synchronous inventory check was chosen over eventual consistency because
  overselling has higher business cost than occasional checkout failures
- Pricing is calculated server-side (not trusted from client) for security
```

This documentation is your "just-in-time understanding" when something breaks. Generate it as you build, not after.
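One lightweight way to keep this honest is a CI check that fails when a module ships without its doc. A minimal sketch, assuming a `src/<module>/README.md` convention (the layout is an assumption for illustration, not something the guide prescribes):

```typescript
// Minimal CI check: every top-level module under src/ must ship a README.md.
// The src/<module>/README.md layout is an assumed convention for this sketch.
import { existsSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

const missing = readdirSync("src")
  .filter((name) => statSync(join("src", name)).isDirectory())
  .filter((dir) => !existsSync(join("src", dir, "README.md")));

if (missing.length > 0) {
  console.error("Modules missing documentation:", missing.join(", "));
  process.exit(1);
}
```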
Practice 5: Focus Human Review on Architecture
You can't read every line. Focus on what matters most: system shape, not implementation details.
What to Review
| Review This | Not This |
|---|---|
| Service boundaries and APIs | Internal function implementations |
| State management approach | Individual state updates |
| Error handling strategy | Every try/catch block |
| Data flow between components | Data transformations within components |
| Dependency choices | How dependencies are used |
| Security boundaries | Every input validation |
Questions for Architectural Review
- Does this introduce new dependencies? Are they justified?
- Does this change service boundaries? Will it affect other teams?
- Does this create new state? Where does it live? How is it managed?
- Does this change failure modes? What happens when it breaks?
- Does this violate existing patterns? Consistency matters for maintainability.
Implementation details can be wrong and still get fixed easily. Architectural mistakes compound and become expensive.
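Some of these structural rules can also be checked mechanically, so drift is caught even when no human reads the diff. A minimal sketch, assuming a hypothetical rule that nothing outside `src/payments` may import from `src/payments/internal` (the layout and the rule itself are illustrative):

```typescript
// Sketch of an automated architecture check: scan source files for imports
// that cross an assumed boundary (src/payments/internal). Fails the build
// when the rule is violated.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

const FORBIDDEN_IMPORT = /from\s+['"][^'"]*\/payments\/internal\//;

function walk(dir: string): string[] {
  return readdirSync(dir).flatMap((name) => {
    const path = join(dir, name);
    return statSync(path).isDirectory() ? walk(path) : [path];
  });
}

const violations = walk("src")
  .filter((file) => file.endsWith(".ts") && !file.startsWith(join("src", "payments")))
  .filter((file) => FORBIDDEN_IMPORT.test(readFileSync(file, "utf8")));

if (violations.length > 0) {
  console.error("Architectural rule violated (payments internals imported):", violations);
  process.exit(1);
}
```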
Practice 6: Use Property-Based Testing
Standard unit tests check specific examples. Property-based testing generates thousands of random inputs and verifies that properties hold across all of them.
What Is Property-Based Testing?
Instead of writing:
```javascript
test('sort returns sorted array', () => {
  expect(sort([3, 1, 2])).toEqual([1, 2, 3]);
  expect(sort([5, 4])).toEqual([4, 5]);
});
```

You write:
```javascript
import { fc } from 'fast-check';

test('sort returns sorted array', () => {
  fc.assert(
    fc.property(fc.array(fc.integer()), (arr) => {
      const sorted = sort(arr);

      // Property 1: Output has same length as input
      expect(sorted.length).toBe(arr.length);

      // Property 2: Output contains same elements
      expect(sorted.slice().sort()).toEqual(arr.slice().sort());

      // Property 3: Output is actually sorted
      for (let i = 0; i < sorted.length - 1; i++) {
        expect(sorted[i]).toBeLessThanOrEqual(sorted[i + 1]);
      }
    })
  );
});
```

The framework generates hundreds of random arrays and verifies your properties hold for all of them. It finds edge cases you wouldn't think to test.
Why This Matters for AI-Generated Code
AI tends to handle the happy path well. Property-based testing automatically explores the edges — empty inputs, huge inputs, negative numbers, unicode strings, null values — without you having to enumerate every case.
Tools
- JavaScript/TypeScript: fast-check
- Python: Hypothesis
- Go: gopter, rapid
- Rust: proptest, quickcheck
Practice 7: Require Observability From the Start
If you can't monitor a feature in production, you can't verify it works. Observability is part of the definition of "done."
The Practice
Every feature ships with:
- Structured logs for key operations
- Metrics for success/failure rates, latency, throughput
- Traces connecting requests across services
- Alerts for anomalous behaviour
- Dashboards visualising the above
Example: Minimum Observability for a New Endpoint
```typescript
async function processOrder(req: Request, res: Response) {
  const span = tracer.startSpan('processOrder');
  const startTime = Date.now();

  try {
    logger.info('Processing order', {
      orderId: req.body.orderId,
      userId: req.user.id
    });

    const result = await orderService.process(req.body);

    metrics.increment('orders.processed', { status: 'success' });
    metrics.histogram('orders.latency', Date.now() - startTime);

    span.setStatus({ code: SpanStatusCode.OK });
    res.json(result);

  } catch (error) {
    logger.error('Order processing failed', {
      orderId: req.body.orderId,
      error: error.message
    });

    metrics.increment('orders.processed', { status: 'failure' });
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });

    res.status(500).json({ error: 'Processing failed' });
  } finally {
    span.end();
  }
}
```

When AI generates code, include observability requirements in the prompt:
1"Generate an order processing endpoint. Include:
2- Structured logging for all operations
3- Metrics for success/failure and latency
4- OpenTelemetry tracing
5- Error handling that preserves context for debugging"Practice 8: Test Your Tests with Mutation Testing
If you're not reading the implementation, you need to know your tests actually catch bugs. Mutation testing proves this.
What Is Mutation Testing?
Mutation testing automatically modifies your code (introduces "mutants") and checks whether your tests fail. If the tests still pass when the code is broken, your tests aren't catching what they should.
Example
Your code:
```typescript
function isEligibleForDiscount(user: User): boolean {
  return user.orderCount >= 5 && user.accountAge > 30;
}
```

Mutation testing might create:
```typescript
// Mutant 1: Changed >= to >
return user.orderCount > 5 && user.accountAge > 30;

// Mutant 2: Changed && to ||
return user.orderCount >= 5 || user.accountAge > 30;

// Mutant 3: Changed 5 to 6
return user.orderCount >= 6 && user.accountAge > 30;
```

If your tests pass with any of these mutants, you have a gap. A user with exactly 5 orders should be eligible, but Mutant 1 would reject them — and if your tests don't catch that, you'd never know.
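As a hedged illustration, boundary-value tests like the following would kill all three mutants (the thresholds come from the snippet above; the inline object literals are cast to `User` purely for brevity):

```typescript
// Boundary-value tests that kill the example mutants: exactly 5 orders must
// qualify (kills Mutants 1 and 3) and both conditions must be required
// (kills Mutant 2).
describe('isEligibleForDiscount', () => {
  it('accepts exactly 5 orders with an account older than 30 days', () => {
    expect(isEligibleForDiscount({ orderCount: 5, accountAge: 31 } as User)).toBe(true);
  });

  it('rejects enough orders on a 30-day-old account', () => {
    expect(isEligibleForDiscount({ orderCount: 5, accountAge: 30 } as User)).toBe(false);
  });

  it('rejects 4 orders even on an old account', () => {
    expect(isEligibleForDiscount({ orderCount: 4, accountAge: 31 } as User)).toBe(false);
  });
});
```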
Tools
- JavaScript/TypeScript: Stryker
- Python: mutmut, cosmic-ray
- Java: PITest
- Go: go-mutesting
When to Use
Mutation testing is computationally expensive — it runs your entire test suite many times. Use it:
- On critical business logic (payment, eligibility, pricing)
- When you're delegating test generation to AI
- As a periodic check rather than on every commit
Practice 9: Identify and Protect Critical Paths
Not all code is equally important. Identify the paths where failure is catastrophic and apply disproportionate verification there.
The Practice
For your system, identify:
- Money paths — anything involving payments, pricing, billing
- Security boundaries — authentication, authorisation, data access
- Data integrity — writes to persistent storage, especially irreversible ones
- External commitments — anything that triggers real-world actions (emails, shipments, API calls to third parties)
These paths get:
- More thorough testing (property-based, not just example-based)
- Human review even when other code doesn't
- Additional runtime checks and monitoring
- Feature flags and gradual rollouts
Example: Identifying Critical Paths
```text
E-commerce system critical paths:
├── Checkout
│   ├── Price calculation (must match displayed price exactly)
│   ├── Payment processing (must not double-charge)
│   └── Inventory decrement (must not oversell)
├── Authentication
│   ├── Login (must not leak whether email exists)
│   ├── Password reset (must expire tokens correctly)
│   └── Session management (must invalidate on logout)
└── Order fulfillment
    ├── Address validation (must be shippable)
    └── Inventory reservation (must handle concurrent orders)
```

AI can write the code for these paths. AI can write the tests. But human review of the tests and architecture remains essential because the cost of failure is high.
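For the money path, one such additional runtime check is an idempotency key, so a retried request cannot double-charge. A rough sketch with hypothetical store and gateway interfaces (a real store would need an atomic check-and-set to close the race between `get` and `put`):

```typescript
// Sketch of one money-path protection: an idempotency key so a retried
// request returns the original result instead of charging again.
// The interfaces are hypothetical stand-ins, not a real payment SDK.
interface ChargeResult {
  chargeId: string;
  amount: number;
}

interface IdempotencyStore {
  get(key: string): Promise<ChargeResult | null>;
  put(key: string, result: ChargeResult): Promise<void>;
}

interface PaymentGateway {
  charge(amount: number, idempotencyKey: string): Promise<ChargeResult>;
}

async function chargeOnce(
  gateway: PaymentGateway,
  store: IdempotencyStore,
  idempotencyKey: string,
  amount: number
): Promise<ChargeResult> {
  const existing = await store.get(idempotencyKey);
  if (existing) return existing; // retry: return the prior result, no new charge

  const result = await gateway.charge(amount, idempotencyKey);
  await store.put(idempotencyKey, result);
  return result;
}
```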
Summary: The Verification Stack
When AI writes most of the code, verification happens in layers:
| Layer | What | Who |
|---|---|---|
| Generation | AI writes code | AI (Claude, GPT, Gemini, etc.) |
| Adversarial Review | Different AI tries to break it | AI |
| Automated Testing | Unit, integration, property-based, mutation | Machines |
| Dependency Verification | Check packages are legitimate | Human (assisted by tools) |
| Test Review | Verify tests encode correct requirements | Human |
| Architectural Review | Verify system shape is right | Human |
| Observability | Verify behaviour in production | Machines + Human interpretation |
| Accountability | Decide what risks are acceptable | Human |
The goal is not to eliminate human involvement but to focus it where it produces the most value: defining correctness, reviewing verification, and accepting responsibility.