What the research says, what we've seen first-hand, and what we think a guided evaluation should actually look like.
OpenAI has asked partners to run Codex Guided Evaluations because they can't do it themselves. Enterprise accounts try Codex, engineers like it in the demo, and then nothing happens. Adoption stalls. Licenses go unused. The pilot doesn't convert.
This isn't unique to Codex. It's the same pattern across every AI coding tool.
The paradox: 84% of developers use AI coding tools. But only 29% trust them. Usage is going up while confidence is going down. That's a problem you can't solve with more licenses.
The instinct is to treat this as a training problem: teach people how to use the tool and they'll use it more. That's the Human Productivity model, and it works for ChatGPT adoption in non-technical teams.
Engineering is different. The bottleneck isn't "can the developer use Codex." It's that speeding up one part of the delivery pipeline breaks everything downstream.
Writing code is 25-35% of the delivery timeline. Speed it up without fixing everything else and you just create a bigger pile of unreviewed work.
More code, slower reviews, no change in delivery speed. In some cases, delivery actually gets worse. The tool made the coding faster but the system didn't keep up.
The DORA 2025 finding: AI is an amplifier. It makes good engineering teams better and struggling teams worse. Teams without strong practices, platform maturity, and a clear way of working see no improvement or actual regression when they adopt AI coding tools.
DORA (DevOps Research and Assessment) is Google's annual study measuring software delivery performance across thousands of teams. The four key metrics are deployment frequency, lead time for changes, change failure rate, and time to restore service.
The organisations seeing real results aren't just giving engineers a tool and hoping for the best. They're changing how the engineering environment itself works.
OpenAI's own internal team published a blog post in March 2026 about building a product with zero lines of human-written code: 1,500 PRs in five months, 3.5 PRs per engineer per day. Their key finding:
"Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The primary job of our engineering team became enabling the agents to do useful work." — Ryan Lopopolo, OpenAI, "Harness Engineering"
They call this work "harness engineering" — making a codebase and its tooling legible and useful to AI agents. Not writing code. Building the environment that lets agents write good code.
A file committed to the repository (AGENTS.md, CLAUDE.md) that tells the agent: here's the architecture, here's how to build, here are the coding standards, here's what not to touch. Without this, every session starts cold. With it, every engineer gets a pre-configured agent that already understands their codebase.
OpenAI's blog takes this further: the context file should be a table of contents, not an encyclopedia. A short map with pointers to deeper sources of truth. Progressive disclosure — the agent starts small and is taught where to look.
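A minimal sketch of what a table-of-contents style AGENTS.md might look like (the paths, commands, and directory names below are illustrative, not taken from OpenAI's post):

```markdown
# AGENTS.md

## Architecture
Monorepo: `services/` (backend, Go), `web/` (frontend, TypeScript).
Deeper docs: `docs/architecture.md`.

## Build & test
- Build: `make build`
- Tests: `make test` (run before every commit)

## Standards
Follow `docs/style.md`. No new dependencies without approval.

## Do not touch
`migrations/` and `vendor/` are generated or managed elsewhere.
```

Note the shape: every section is a short pointer to a deeper source of truth, not the truth itself. That's the progressive disclosure the blog describes.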
Teams without proper context setup see 60% lower productivity gains from AI coding tools.
Can the agent run the tests? Can it validate its own output? Can it see the logs? If not, every piece of generated code needs manual human verification, which is often slower than writing it in the first place.
The endgame (from OpenAI's internal team): Chrome DevTools wired into the agent, a local observability stack per worktree, agents that reproduce bugs and validate fixes without human involvement. Most teams won't start there, but the direction matters.
This is the single biggest reason ROI disappears. AI generates more code. Humans can't review it fast enough. The pipeline chokes.
The answer isn't "review faster." It's building towards agent-assisted review: agents handle routine checks (style, patterns, missing tests, known anti-patterns), humans handle judgment calls. OpenAI's internal team pushed almost all review to agent-to-agent. Most enterprise orgs aren't ready for that yet, but the journey from "every PR needs a human" to "agents handle the routine, humans handle the hard stuff" is where the real velocity unlock lives.
None of these examples use Codex specifically — the product is too new for mature case studies. But the patterns are consistent regardless of which tool is used. The harness engineering work is the same whether the agent is Codex, Claude Code, or Copilot. That's the point.
80% organic adoption. No cost limit on AI tokens. Built an internal proxy housing all agents. Champions wrote weekly updates assisted by AI. CEO made it non-optional. Added AI usage to performance reviews. Managers must prove AI can't do a task before requesting headcount.
Got legal involved early as collaborators, not blockers. "How can we do it safely?" rather than "can we do it?"
1,000+ fully AI-authored PRs merged per week. Each agent gets only ~15 curated tools (from 400+ available) to prevent confusion. Max 2 CI rounds per agent. Mandatory human review on every merge.
"The walls matter more than the model." Years of investment in human developer tooling made AI integration smooth.
10x increase in release velocity. From shipping every 2-3 weeks to 600 features per month. Adoption grew from 20% to 80%+ of engineers through 2025.
Projects that previously took weeks now completed in a single day.
180+ engineers onboarded onto AI coding agents with a 90% adoption rate. Multi-track programme: 2-hour Bootstrap Challenge hackathons to build trust and an understanding of what agents actually are, then hands-on workflow integration across real codebases.
Same harness engineering principles, different tool. The onboarding format and environment design work are what we're bringing to Codex engagements.
The most important decision in the whole engagement is which teams and codebases go first. Get this wrong and the pilot fails regardless of how good the enablement is.
The instinct is to pick the team that's most excited about AI. That's the wrong filter. The right filter is: whose engineering environment is already in a state where an agent can do good work?
Think about it from the agent's perspective. It needs to understand the codebase, verify its own work, and get clear feedback when something is wrong. That means the best starting teams are the ones who already have their house in order — not because they're "best in class," but because agents amplify whatever state the codebase is already in.
This doesn't mean struggling teams can never use agents. It means they shouldn't go first. Start with teams whose environment is ready, prove the value, then use those results to justify the investment in getting other codebases into shape. The teams that go first become the proof point and the champion network for the teams that follow.
We deliver this with two engineers, mostly on-site with the client. That's a deliberate constraint. It means we don't spread across five teams and hope something sticks — we pick 1-2 teams and 1-2 workflows where the screening criteria are met, and go deep enough to create a real, lasting result.
Lives in the repos. Does the codebase deep dives. Writes the AGENTS.md. Builds custom commands. Configures verification infrastructure. Pairs with engineers on real tasks in their code.
Focused on making the environment better for the agent. Understands the codebase deeply enough to build the right scaffolding.
Runs the Bootstrap Challenge. Manages the champion network. Does office hours and show & tells. Handles leadership and security briefings. Tracks metrics and friction patterns.
Focused on making the people effective. Keeps momentum, removes blockers, translates technical wins into the story leadership needs to hear.
Both are technical. Both pair with engineers. But the split means one person is always making the environment better while the other is making the people better. They compound on each other — the harness work makes the adoption sessions more effective, and the adoption feedback tells the harness engineer what to fix next.
The selection filter matters even more with two people. We can't recover from picking the wrong team. The screening criteria above aren't nice-to-haves — they're how we protect the engagement. One initial screen (tests? CI? architecture? willing champion?) determines whether we proceed. If no team passes, the honest answer is to fix the foundations first.
OpenAI's blueprint covers the process well. What's missing is the engineering substance — the compounding skill-building that turns a trial into a lasting capability.
Working with agents is a new skill. Like any skill, it has to be built from scratch — mental model first, then muscle memory through practice. Each week should compound on the last, with the golden thread being: how do we make this environment increasingly legible and useful to agents?
No slides. Everything hands-on, in real repos, on real problems. The format changes to match the job:
Before anyone touches Codex. Baseline delivery metrics. Walk the repos with the team — not to audit them, but to understand: how does work actually flow here? Where does it stall? What does the agent need to know about this codebase to be useful?
Identify 2-3 champions within the target team. Select 1-2 workflows to focus on. Agree on what good looks like at the end of four weeks.
Format: Codebase deep dives (1:1 or small group with each team). One leadership briefing to align on goals.
The Bootstrap Challenge: a 2-hour hackathon. Engineers start with an agent that can only read files. They use it to add write capability, then use read + write to add shell access. By the end, they've built an AI agent from nothing.
The point isn't to teach prompting. It's to kill the mystique. An agent is an LLM in a loop with shell tools. Once engineers see that, everything else clicks — they start reasoning about what the agent needs instead of just typing instructions at it.
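That mental model fits in a few lines of code. Here's a minimal sketch of the loop (the `fake_llm` function is a scripted stand-in for a real model API call, so the example is self-contained and runnable):

```python
import subprocess

# Hypothetical stand-in for a real model call: given the transcript so
# far, it returns either a shell command to run or a final answer. In
# the hackathon this would be an actual LLM API call; here it is
# scripted so the loop can run without any external service.
def fake_llm(transcript):
    if len(transcript) == 1 and "disk usage" in transcript[-1]:
        return {"action": "shell", "command": "echo 42"}
    return {"action": "done", "answer": transcript[-1]}

def run_agent(task, llm=fake_llm, max_steps=5):
    """An agent is just a model in a loop with tools: ask the model
    what to do, execute the tool it requests, feed the result back,
    repeat until it declares the task done."""
    transcript = [task]
    for _ in range(max_steps):
        step = llm(transcript)
        if step["action"] == "shell":
            out = subprocess.run(step["command"], shell=True,
                                 capture_output=True, text=True)
            transcript.append(out.stdout.strip())
        else:
            return step["answer"]
    return None  # gave up after max_steps
```

Swap `fake_llm` for a real model call and `subprocess.run` for a sandboxed tool dispatcher and you have the skeleton every coding agent is built on. Once engineers have written this themselves, the mystique is gone.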
After the hackathon, move into codebase deep dives: sit with the team in their repo with Codex open. Try real tasks together. See where it struggles and why. This builds intuition for what good agent context looks like — which feeds directly into week 2.
Format: Hackathon (Bootstrap Challenge, full cohort). Then codebase deep dives (1:1 / small group per team). Office hours open from this point onwards.
Now engineers understand what agents are and have felt the friction first-hand. This week is about removing that friction systematically.
Pairing sessions with the team to write the AGENTS.md for their repo — not a generic template, but the actual context their agent needs: architecture, build commands, coding standards, module boundaries, what not to touch. This is the single highest-impact thing we do.
Build custom commands for the workflows selected in week 0. Set up verification infrastructure — can the agent run the tests? Can it check its own output? If not, fix that. Every piece of harness we build here is something the agent can use from this point forward.
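The verification piece can be as simple as a script the agent is told to run after every change. A minimal sketch, assuming the repo uses commands like pytest and ruff (the specific checks are placeholders for whatever the codebase actually runs):

```python
import subprocess

def verify(checks):
    """Run each (name, command) check and return a pass/fail map the
    agent can read to decide whether its change is done or needs
    rework. A machine-readable result is the whole point: it closes
    the loop without a human in the middle."""
    results = {}
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results

# Illustrative wiring: a real repo's actual commands would go here,
# and AGENTS.md would tell the agent to run this script before
# declaring any task complete.
REPO_CHECKS = [
    ("tests", ["python", "-m", "pytest", "-q"]),
    ("lint", ["ruff", "check", "."]),
]
```

The value isn't the script itself, it's the contract: the agent always has a cheap, unambiguous way to check its own work.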
Format: Pairing sessions (1:1 or pair per team, working in their repos). One show & tell at end of week — engineers demo what they've figured out to each other.
Engineers are now working with Codex daily on real tasks. The harness is in place. This week is about building the muscle — more reps, harder problems — and addressing the system-level issues that the first two weeks surfaced.
Pairing sessions shift to advanced workflows: multi-step tasks, parallel agent runs, using agents for code review, test generation, refactoring. This is where engineers start to feel the compounding — the harness they built in week 2 makes the agent noticeably better than it was in week 1.
Start the review conversation: the data from weeks 1-2 will show the bottleneck. Introduce agent-assisted review patterns — agents handling routine checks, humans on judgment calls. This is the beginning, not the end, of solving the review problem.
Leadership and security briefings happen this week — once there's real data and real examples to discuss, not hypotheticals.
Format: Pairing sessions on advanced workflows. Office hours. Champion check-ins. Show & tell (second round). Leadership briefing. Security/governance briefing.
Compare delivery metrics to baseline. Run sentiment surveys. But more importantly: can the team maintain and improve the harness without us? Do the champions know how to update the context files, write new commands, extend verification?
The readout isn't just "did it work." It's a maturity assessment: where is this org on the journey from "agents as autocomplete" to "agents as reliable team members"? What comes next? Which teams should scale first? What infrastructure and governance changes are still needed?
Format: Executive readout (leadership). Final show & tell (full cohort — engineers present their own wins). Handover session with champions.
The whole point of going narrow and deep is that the engagement leaves behind things the team can maintain and build on, not a report that goes in a drawer.
The deliverables are the point.
If the only thing that leaves with us is knowledge, the engagement failed. Everything we build should be committed to the repo, documented for the champions, and usable the day after we're gone.
If the codebase foundations aren't there — no tests, broken CI, tangled architecture — the honest recommendation is to fix those first. That could be a separate, shorter engagement focused on test coverage, CI reliability, and codebase structure. It's not a Codex engagement yet. It's getting ready for one. We should be prepared to say that.
This isn't Human Productivity with a different audience. The buyer is different (CTO / CIO / Head of Engineering). The metrics are different (DORA, not task completion). The work is different (repo configuration, verification infrastructure, delivery pipeline analysis — not prompt training).
It's an engineering engagement that happens to be about AI adoption. The closest analogy is DevOps transformation or platform engineering — you're changing how the team delivers software, not just giving them a new tool.
This engagement is built around Codex — that's what OpenAI is asking partners to deliver. But the harness engineering principles are the same regardless of which agent is running. AGENTS.md works for Codex. CLAUDE.md works for Claude Code. The verification infrastructure, the custom commands, the review pipeline — all transferable. That's a strength for the client (they're building a capability, not a vendor lock-in) and it's honest to say so when asked.