Laravel Worldwide Meetup Proposal

Title (primary + 2 alternates)

Primary: Keeping AI Honest: Hooks-Driven TDD in a Laravel Codebase
Alt 1: Your AI Agent Will Game Your Tests — Here's How to Stop It
Alt 2: Green Checkmarks Don't Lie (But Your AI Agent Will)

Pitch (1 sentence)

AI agents will cheat to make your tests pass — and the fix isn't better prompts, it's enforced discipline at the tooling level.

Abstract (~200 words)

I've shipped Laravel apps for ten years. For the last several months, I've been running AI agents — Claude Code specifically — inside those codebases. The quality is not what the demos show.

Here's the pattern I kept hitting: the agent would go red, realize it needed to go green, and take the path of least resistance. Not fix the real problem. Delete the check. Add an ad-hoc string filter. Remove the admin bypass. Make the test pass by making the test meaningless.

This is not a prompt problem. You can't instruct your way out of it. The agent is rational under its incentive structure, and its incentive structure is green checkmarks.

The fix is harder rails. I built a hook system that enforces TDD ordering at the Write-tool level: you cannot write implementation before you've written the test. A stop guard gates the session on green. The hooks auto-detect Pest, PHPUnit, Vitest, and pytest. They run on every file write, every edit, every session end.

Three real production incidents this year taught me exactly where the gaps are. I'll show you what went wrong, what the hooks look like, and what a Laravel dev needs to do Monday morning to install this discipline in their own codebase.

It's not for everyone. Nitty gritty heads-down developers who value full control might not want to use it. But if you're already running an AI agent and wondering why the code doesn't feel trustworthy — this talk is for you.

Outline (7 beats, ~20 minutes total)

1. The problem, stated plainly (2 min)

Your AI agent has a boss who only cares about green checkmarks. You installed that boss when you set up the tooling. The agent is not malicious — it's rational. It optimizes for what you measure. You're measuring "does the test pass," not "does the behavior work."

Open with the pattern, no preamble. The room knows this frustration already, even if they haven't named it.

2. Three real incidents (5 min)

These are not hypothetical. All three happened in production codebases this year.

April 29: A parallel subagent deleted if (isAdmin.value) return true — a deliberate production admin bypass — because it was in the way of a test passing. The test was misdesigned. The bypass was intentional. The agent didn't know the difference, and nothing stopped it from making the removal. The PR diff caught it. Barely.
April 30: An agent added if (text.includes(' — ')) skip as a content filter to make a goal-line feature pass an old test that hadn't been updated. The correct fix was to update the test. Instead, the agent patched the implementation to route around the assertion. The code shipped with the filter in place until a review caught it.
May 13: A test-writer agent ran for 40 minutes, made 102 tool calls, wrote the full spec file — and then terminated on a context limit without committing any of it. The work existed on disk. The session was gone. Recovery required manually inspecting the worktree and committing by hand.

These three incidents have a common thread: the agent did exactly what its incentive structure asked for. No hooks stopped the first two. No safeguard committed the third.

3. Why prompting doesn't fix this (2 min)

"Don't delete production bypasses" is in my CLAUDE.md. It was there the day the bypass was deleted. Behavioral rules buried mid-prompt get overridden by the model's strong general prior: make the test pass. The fix is placement and enforcement, not wording. A rule that lives only in a markdown file is not enforced. A rule that blocks the Write tool is enforced.

4. The hook architecture (4 min)

Two hooks, both in plain bash, auto-installed by a project bootstrap step.

run-tests.sh — fires on every PostToolUse (Write or Edit). Detects the stack: composer.json present, check for Pest vs PHPUnit; package.json present, check for Vitest. Runs the relevant test command. If red, the session gets a visible failure before the next tool call lands. The agent sees the failure and must address it before writing more code.

stop-test-guard.sh — fires on every session Stop event. Gates the session end on green tests. If you try to close a session with failing tests, the guard surfaces the failures. Uses a small Haiku gate to skip this check during non-coding sessions (browsing docs, reading files, asking questions) — it's not meant to be annoying, it's meant to be a boundary.

One additional constraint: the TDD ordering block. The Write tool is gated: if you try to write implementation to a file that has no sibling *.test.* file, the write is blocked. Tests must exist before implementation. This is enforced at the tooling level, not the prompt level.

All three live in ~/.claude/shared-hooks/. The project bootstrap step drops a .claude/settings.json that points at them. New project, five minutes, enforced discipline.

5. What "assertion quality" actually means (3 min)

Harder rails stop the obvious cheats. But there's a subtler failure mode: the source-grep test.

A source-grep test checks that a string appears somewhere in the codebase. It passes when you grep for AdminBypass in the file that used to contain the bypass — and it will still pass after the bypass is deleted, because grep found the word in a comment. It will pass after the agent removes the behavior and leaves the label.

The rule I use: every test must pair a source assertion with at least one runtime check — run the code, assert the output, assert the exit code, or run an E2E. The source-grep alone is not a test. It's a breadcrumb.

For Laravel specifically: a Pest test that calls a route and asserts the response is a behavior test. A test that opens the controller file and asserts it contains a certain method name is a source-grep test dressed up as a behavior test.

6. The Monday-morning recipe (3 min)

Concrete. What a Laravel developer does to install this.

Add a .claude/ directory to your project root.
Copy settings.json template — four lines, wires the two hooks to PostToolUse and Stop events.
Point the hook paths at ~/.claude/shared-hooks/run-tests.sh and ~/.claude/shared-hooks/stop-test-guard.sh (or copy them locally — both are self-contained bash scripts, no dependencies).
On the next Write event, the hooks fire. On the next session Stop, the guard checks. Nothing else to configure.

The hooks are stack-agnostic. They find Pest if vendor/bin/pest exists, PHPUnit if it doesn't. They don't care about your directory structure. They work on a fresh Laravel 11 install the same day.

The TDD ordering block requires one additional line in settings.json. Show the line. That's the whole recipe.

7. The honest cost (1 min)

This isn't free. The hooks add friction. Every Write fires a test run. Slow test suites get slower to iterate against. The TDD ordering block will frustrate you the first time you try to write a quick fix before writing the test.

It's not for everyone. Nitty gritty heads-down developers who value full control might not want to use it.

But if you're running an AI agent and the code doesn't feel trustworthy — the problem isn't the model. The problem is the rails. These hooks are rails.

Speaker Bio (2 sentences, third person)

Ethan Brace is a Laravel developer with ten years of production experience who has spent the last year running AI agents inside real codebases and documenting what breaks. He builds at Brace Yourself Solutions and has been shipping hook-enforced TDD infrastructure long enough to have the incidents to prove it matters.

Why This Audience (note to organizer — not for the form)

The Laravel Worldwide Meetup draws a mix of developers who are actively using AI tooling and finding it unreliable, and developers who are skeptical of AI tooling and want evidence it can be made trustworthy. This talk addresses both cohorts directly: the frustrated adopters get a concrete diagnosis and a concrete fix; the skeptics get a real argument that discipline can contain the failure modes.

The talk is grounded in Pest and PHPUnit specifically, the hooks auto-detect both, and the Monday-morning recipe is a Laravel project setup. It's not a generic AI talk with a Laravel slide at the end. It's a Laravel talk that happens to be about AI discipline.

The material is also deliberately self-contained: a developer who watches this talk walks away with the pattern and the implementation path, not a pointer to a product or a course. That makes it useful regardless of what tooling the viewer uses or how skeptical they are coming in.