What the research says, what we've seen first-hand, and what we think a guided evaluation should actually look like.
OpenAI has asked partners to run Codex Guided Evaluations because they can't do it themselves. Enterprise accounts try Codex, engineers like it in the demo, and then nothing happens. Adoption stalls. Licenses go unused. The pilot doesn't convert.
This isn't unique to Codex. It's the same pattern across every AI coding tool.
The paradox: 84% of developers use AI coding tools. But only 29% trust them. Usage is going up while confidence is going down. That's a problem you can't solve with more licenses.
The instinct is to treat this as a training problem: teach people how to use the tool and they'll use it more. That's the Human Productivity model, and it works for ChatGPT adoption in non-technical teams.
Engineering is different. The bottleneck isn't "can the developer use Codex." It's that speeding up one part of the delivery pipeline breaks everything downstream.
Writing code is 25-35% of the delivery timeline. Speed it up without fixing everything else and you just create a bigger pile of unreviewed work.
More code, slower reviews, no change in delivery speed. In some cases, delivery actually gets worse. The tool made the coding faster but the system didn't keep up.
The DORA 2025 finding: AI is an amplifier. It makes good engineering teams better and struggling teams worse. Teams without strong practices, platform maturity, and a clear way of working see no improvement or actual regression when they adopt AI coding tools.
DORA (DevOps Research and Assessment) is Google's annual study measuring software delivery performance across thousands of teams. The four key metrics are deployment frequency, lead time for changes, change failure rate, and time to restore service.
The organisations seeing real results aren't just giving engineers a tool and hoping for the best. They're changing how the engineering environment itself works.
OpenAI's own internal team published a blog post in March 2026 about building a product with zero lines of human-written code: 1,500 PRs in five months, 3.5 PRs per engineer per day. Their key finding:
"Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The primary job of our engineering team became enabling the agents to do useful work." — Ryan Lopopolo, OpenAI, "Harness Engineering"
They call this work "harness engineering" — making a codebase and its tooling legible and useful to AI agents. Not writing code. Building the environment that lets agents write good code.
A file committed to the repository (AGENTS.md, CLAUDE.md) that tells the agent: here's the architecture, here's how to build, here are the coding standards, here's what not to touch. Without this, every session starts cold. With it, every engineer gets a pre-configured agent that already understands their codebase.
OpenAI's blog takes this further: the context file should be a table of contents, not an encyclopedia. A short map with pointers to deeper sources of truth. Progressive disclosure — the agent starts small and is taught where to look.
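A minimal sketch of what a table-of-contents style AGENTS.md might look like (the paths, commands, and directory names below are illustrative, not taken from OpenAI's post):

```markdown
# AGENTS.md

## Architecture
Monorepo: `services/` (backend, Go), `web/` (frontend, TypeScript).
Deeper docs: `docs/architecture.md`.

## Build & test
- Build: `make build`
- Tests: `make test` (run before every commit)

## Standards
Follow `docs/style.md`. No new dependencies without approval.

## Do not touch
`migrations/` and `vendor/` are generated or managed elsewhere.
```

Note the shape: every section is a short pointer to a deeper source of truth, not the truth itself. That's the progressive disclosure the blog describes.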
Teams without proper context setup see 60% lower productivity gains from AI coding tools.
Can the agent run the tests? Can it validate its own output? Can it see the logs? If not, every piece of generated code needs manual human verification, which is often slower than writing it in the first place.
The endgame (from OpenAI's internal team): Chrome DevTools wired into the agent, a local observability stack per worktree, agents that reproduce bugs and validate fixes without human involvement. Most teams won't start there, but the direction matters.
This is the single biggest reason ROI disappears. AI generates more code. Humans can't review it fast enough. The pipeline chokes.
The answer isn't "review faster." It's building towards agent-assisted review: agents handle routine checks (style, patterns, missing tests, known anti-patterns), humans handle judgment calls. OpenAI's internal team pushed almost all review to agent-to-agent. Most enterprise orgs aren't ready for that yet, but the journey from "every PR needs a human" to "agents handle the routine, humans handle the hard stuff" is where the real velocity unlock lives.
None of these examples use Codex specifically — the product is too new for mature case studies. But the patterns are consistent regardless of which tool is used. The harness engineering work is the same whether the agent is Codex, Claude Code, or Copilot. That's the point.
80% organic adoption. No cost limit on AI tokens. Built an internal proxy housing all agents. Champions wrote weekly updates assisted by AI. CEO made it non-optional. Added AI usage to performance reviews. Managers must prove AI can't do a task before requesting headcount.
Got legal involved early as collaborators, not blockers. "How can we do it safely?" rather than "can we do it?"
1,000+ fully AI-authored PRs merged per week. Each agent gets only ~15 curated tools (from 400+ available) to prevent confusion. Max 2 CI rounds per agent. Mandatory human review on every merge.
"The walls matter more than the model." Years of investment in human developer tooling made AI integration smooth.
10x increase in release velocity. From shipping every 2-3 weeks to 600 features per month. Adoption grew from 20% to 80%+ of engineers through 2025.
Projects that previously took weeks now completed in a single day.
180+ engineers onboarded onto AI coding agents with a 90% adoption rate. Multi-track programme: 2-hour Bootstrap Challenge hackathons to build trust and an understanding of what agents actually are, then hands-on workflow integration across real codebases.
Same harness engineering principles, different tool. The onboarding format and environment design work are what we're bringing to Codex engagements.
The most important decision in the whole engagement is which teams and codebases go first. Get this wrong and the pilot fails regardless of how good the enablement is.
The instinct is to pick the team that's most excited about AI. That's the wrong filter. The right filter is: whose engineering environment is already in a state where an agent can do good work?
Think about it from the agent's perspective. It needs to understand the codebase, verify its own work, and get clear feedback when something is wrong. That means the best starting teams are the ones who already have their house in order — not because they're "best in class," but because agents amplify whatever state the codebase is already in.
This doesn't mean struggling teams can never use agents. It means they shouldn't go first. Start with teams whose environment is ready, prove the value, then use those results to justify the investment in getting other codebases into shape. The teams that go first become the proof point and the champion network for the teams that follow.
We deliver this with two engineers, mostly on-site with the client. That's a deliberate constraint. It means we don't spread across five teams and hope something sticks — we pick 1-2 teams and 1-2 workflows where the screening criteria are met, and go deep enough to create a real, lasting result.
Lives in the repos. Does the codebase deep dives. Writes the AGENTS.md. Builds custom commands. Configures verification infrastructure. Pairs with engineers on real tasks in their code.
Focused on making the environment better for the agent. Understands the codebase deeply enough to build the right scaffolding.
Runs the Bootstrap Challenge. Manages the champion network. Does office hours and show & tells. Handles leadership and security briefings. Tracks metrics and friction patterns.
Focused on making the people effective. Keeps momentum, removes blockers, translates technical wins into the story leadership needs to hear.
Both are technical. Both pair with engineers. But the split means one person is always making the environment better while the other is making the people better. They compound on each other — the harness work makes the adoption sessions more effective, and the adoption feedback tells the harness engineer what to fix next.
The selection filter matters even more with two people. We can't recover from picking the wrong team. The screening criteria above aren't nice-to-haves — they're how we protect the engagement. One initial screen (tests? CI? architecture? willing champion?) determines whether we proceed. If no team passes, the honest answer is to fix the foundations first.
OpenAI's blueprint covers the process well. What's missing is the engineering substance — the compounding skill-building that turns a trial into a lasting capability.
Working with agents is a new skill. Like any skill, it has to be built from scratch — mental model first, then muscle memory through practice. Each week should compound on the last, with the golden thread being: how do we make this environment increasingly legible and useful to agents?
No slides. Everything hands-on, in real repos, on real problems. The format changes to match the job:
Before anyone touches Codex. Baseline delivery metrics. Walk the repos with the team — not to audit them, but to understand: how does work actually flow here? Where does it stall? What does the agent need to know about this codebase to be useful?
Identify 2-3 champions within the target team. Select 1-2 workflows to focus on. Agree on what good looks like at the end of four weeks.
Format: Codebase deep dives (1:1 or small group with each team). One leadership briefing to align on goals.
The Bootstrap Challenge: a 2-hour hackathon. Engineers start with an agent that can only read files. They use it to add write capability, then use read + write to add shell access. By the end, they've built an AI agent from nothing.
The point isn't to teach prompting. It's to kill the mystique. An agent is an LLM in a loop with shell tools. Once engineers see that, everything else clicks — they start reasoning about what the agent needs instead of just typing instructions at it.
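That mental model fits in a few lines of code. Here's a minimal sketch of the loop (the `fake_llm` function is a scripted stand-in for a real model API call, so the example is self-contained and runnable):

```python
import subprocess

# Hypothetical stand-in for a real model call: given the transcript so
# far, it returns either a shell command to run or a final answer. In
# the hackathon this would be an actual LLM API call; here it is
# scripted so the loop can run without any external service.
def fake_llm(transcript):
    if len(transcript) == 1 and "disk usage" in transcript[-1]:
        return {"action": "shell", "command": "echo 42"}
    return {"action": "done", "answer": transcript[-1]}

def run_agent(task, llm=fake_llm, max_steps=5):
    """An agent is just a model in a loop with tools: ask the model
    what to do, execute the tool it requests, feed the result back,
    repeat until it declares the task done."""
    transcript = [task]
    for _ in range(max_steps):
        step = llm(transcript)
        if step["action"] == "shell":
            out = subprocess.run(step["command"], shell=True,
                                 capture_output=True, text=True)
            transcript.append(out.stdout.strip())
        else:
            return step["answer"]
    return None  # gave up after max_steps
```

Swap `fake_llm` for a real model call and `subprocess.run` for a sandboxed tool dispatcher and you have the skeleton every coding agent is built on. Once engineers have written this themselves, the mystique is gone.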
After the hackathon, move into codebase deep dives: sit with the team in their repo with Codex open. Try real tasks together. See where it struggles and why. This builds intuition for what good agent context looks like — which feeds directly into week 2.
Format: Hackathon (Bootstrap Challenge, full cohort). Then codebase deep dives (1:1 / small group per team). Office hours open from this point onwards.
Now engineers understand what agents are and have felt the friction first-hand. This week is about removing that friction systematically.
Pairing sessions with the team to write the AGENTS.md for their repo — not a generic template, but the actual context their agent needs: architecture, build commands, coding standards, module boundaries, what not to touch. This is the single highest-impact thing we do.
Build custom commands for the workflows selected in week 0. Set up verification infrastructure — can the agent run the tests? Can it check its own output? If not, fix that. Every piece of harness we build here is something the agent can use from this point forward.
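The verification piece can be as simple as a script the agent is told to run after every change. A minimal sketch, assuming the repo uses commands like pytest and ruff (the specific checks are placeholders for whatever the codebase actually runs):

```python
import subprocess

def verify(checks):
    """Run each (name, command) check and return a pass/fail map the
    agent can read to decide whether its change is done or needs
    rework. A machine-readable result is the whole point: it closes
    the loop without a human in the middle."""
    results = {}
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results

# Illustrative wiring: a real repo's actual commands would go here,
# and AGENTS.md would tell the agent to run this script before
# declaring any task complete.
REPO_CHECKS = [
    ("tests", ["python", "-m", "pytest", "-q"]),
    ("lint", ["ruff", "check", "."]),
]
```

The value isn't the script itself, it's the contract: the agent always has a cheap, unambiguous way to check its own work.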
Format: Pairing sessions (1:1 or pair per team, working in their repos). One show & tell at end of week — engineers demo what they've figured out to each other.
Engineers are now working with Codex daily on real tasks. The harness is in place. This week is about building the muscle — more reps, harder problems — and addressing the system-level issues that the first two weeks surfaced.
Pairing sessions shift to advanced workflows: multi-step tasks, parallel agent runs, using agents for code review, test generation, refactoring. This is where engineers start to feel the compounding — the harness they built in week 2 makes the agent noticeably better than it was in week 1.
Start the review conversation: the data from weeks 1-2 will show the bottleneck. Introduce agent-assisted review patterns — agents handling routine checks, humans on judgment calls. This is the beginning, not the end, of solving the review problem.
Leadership and security briefings happen this week — once there's real data and real examples to discuss, not hypotheticals.
Format: Pairing sessions on advanced workflows. Office hours. Champion check-ins. Show & tell (second round). Leadership briefing. Security/governance briefing.
Compare delivery metrics to baseline. Run sentiment surveys. But more importantly: can the team maintain and improve the harness without us? Do the champions know how to update the context files, write new commands, extend verification?
The readout isn't just "did it work." It's a maturity assessment: where is this org on the journey from "agents as autocomplete" to "agents as reliable team members"? What comes next? Which teams should scale first? What infrastructure and governance changes are still needed?
Format: Executive readout (leadership). Final show & tell (full cohort — engineers present their own wins). Handover session with champions.
The whole point of going narrow and deep is that the engagement leaves behind things the team can maintain and build on, not a report that goes in a drawer.
The deliverables are the point.
If the only thing that leaves with us is knowledge, the engagement failed. Everything we build should be committed to the repo, documented for the champions, and usable the day after we're gone.
If the codebase foundations aren't there — no tests, broken CI, tangled architecture — the honest recommendation is to fix those first. That could be a separate, shorter engagement focused on test coverage, CI reliability, and codebase structure. It's not a Codex engagement yet. It's getting ready for one. We should be prepared to say that.
This isn't Human Productivity with a different audience. The buyer is different (CTO / CIO / Head of Engineering). The metrics are different (DORA, not task completion). The work is different (repo configuration, verification infrastructure, delivery pipeline analysis — not prompt training).
It's an engineering engagement that happens to be about AI adoption. The closest analogy is DevOps transformation or platform engineering — you're changing how the team delivers software, not just giving them a new tool.
This engagement is built around Codex — that's what OpenAI is asking partners to deliver. But the harness engineering principles are the same regardless of which agent is running. AGENTS.md works for Codex. CLAUDE.md works for Claude Code. The verification infrastructure, the custom commands, the review pipeline — all transferable. That's a strength for the client (they're building a capability, not a vendor lock-in) and it's honest to say so when asked.