Neurogrid CTF: The Ultimate AI Security Showdown - Agent of 0ca / BoxPwnr Write-up
On November 20-24, 2024, I participated with BoxPwnr in Neurogrid CTF - the first AI-only CTF competition hosted by Hack The Box with $50k in AI credits for the top 3 teams. This wasnât a typical CTF where humans solve challenges; instead, AI agents competed autonomously in a hyper-realistic cyber arena. My autonomous agent secured 5th place, solving 38/45 flags (84.4% completion) across 36 challenges without any manual intervention.

Note: I didnât solve any challenges manually. My role was purely supervisory - launching BoxPwnr instances, monitoring when Claude Code or Codex would stop, and encouraging them to try new approaches when they gave up.
Results & Traces
All solving attempts are published with:
- Full conversation traces showing LLM reasoning
- Interactive replay UI for each solved challenge (via BoxPwnr)
- Detailed reports linking to relevant turns
View Full Statistics & All Attempts â
Approach: Five Phases
Phase 1: BoxPwnr with Grok-4.1-fast (Free Tier)
I already had HTB CTF support in BoxPwnr for a while, allowing autonomous challenge solving with a simple command. However, this CTF required MCP (Model Context Protocol) for container management and flag submission, so I adapted BoxPwnrâs htb_ctf_client.py to support MCP calls for starting containers and submitting flags. When the CTF started, I distributed 36 challenges across 6 parallel runners (EC2 machines), each with a different OpenRouter account using the free x-ai/grok-4.1-fast model.
8 challenges solved (all free):
- The Paper Generalâs Army - Coding - Very Easy (Report) (Replay)
- Drumming Shrine - Coding - Easy (Report) (Replay)
- Fivefold Door - Coding - Medium (Report) (Replay)
- Blade Master - Coding - Hard (Report) (Replay)
- Stones - Crypto - Medium (Report) (Replay)
- IronheartEcho - Reversing - Very Easy (Report) (Replay)
- ForgottenVault - Reversing - Easy (Report) (Replay)
- Odayaka Waters - Secure Coding - Easy (Report) (Replay)
Phase 2: SOTA Models via BoxPwnr
10 more challenges solved with frontier models:
gemini-3-pro-preview:
- Markov Scrolls - AI - Very Easy (Report) (Replay)
- FuseJi book - AI - Easy (Report) (Replay)
- Elliptic Contribution - Crypto - Very Easy (Report) (Replay)
- Dathash or Not Dathash - Crypto - Easy (Report) (Replay)
- Rice Field - Pwn - Very Easy (Report) (Replay)
- Lanternfall - Web - Very Easy (Report) (Replay)
claude-sonnet-4-5-20250929:
- The Claim That Broke The Oath - Blockchain - Very Easy (Report) (Replay)
- Manual - Forensics - Very Easy (Report) (Replay)
- Whisper Vault - Pwn - Easy (Report) (Replay)
- kuromind - Web - Hard (trace missing)
Phase 3: Human-in-the-Loop with Cursor
Using Cursor with Claude Sonnet 4.5 in a more interactive approach, I solved 5 more challenges:
- Triage of the Broken Shrine - Forensics - Easy (Solved with Cursor - trace missing)
- The Debt That Hunts the Poor - Blockchain - Easy (Cursor Session)
- Sakuraâs Embrace - Secure Coding - Very Easy (Cursor Session)
- Yugenâs Choice - Secure Coding - Medium (Cursor Session)
- ashenvault - Web - Medium (Cursor Session)
This became expensive, so I switched to a different approach.
Phase 4: Claude Code
Claude Codeâs superior harness around Sonnet 4.5 helped solve 3 more:
- The Contribution That Undid The Harbor - Blockchain - Medium (Claude Session)
- Secret Meeting - Forensics - Hard (4/5 flags) (Claude Session)
- SilentOracle - Reversing - Medium (Claude Session)
Phase 5: Codex
First time trying Codex with GPT 5.1-codex model - solved 4 more challenges:
- Hai tsukemono - AI - Medium (Codex Session)
- Ink Vaults - AI - Hard (Codex Session)
- Codex of Failures - Reversing - Hard (Codex Session)
- Shugo No Michiâs System - Secure Coding - Medium (Codex Session)
Example: Attack Graph Visualization
BoxPwnr generates attack graphs showing the solution path for each challenge. Hereâs an example from Blade Master (Coding - Hard):

âśď¸ Watch the full Blade Master replay
Unsolved Challenges
Unsolved by me:
- Coordinator - Crypto - Hard
- The Bank That Breathed Numbers - Blockchain - Hard
- Secret Meeting - Forensics - Hard (1/5 flags remaining)
- MirrorPort - Web - Easy (solved by other teams)
Unsolved by all AI agents:
- Gemsmith - Pwn - Medium
- Mantra - Pwn - Hard
- The Bank That Breathed Numbers - Blockchain - Hard
These three were solved by the top human team in the human-only version of the CTF:

What Could I Have Done Better?
I was constrained by resources (money and time/health). I took a cautious, progressive approach to avoid burning through budget without significant ROI.
I wasnât expecting LLMs to solve Hard challenges based on my BoxPwnr experience with HackTheBox machines. But frontier models are improving rapidly - Sonnet 3.7 solved 20% of Cybench 10 months ago, Opus 4.5 now solves 86%. I was too cautious initially. It wasnât until the last day that I used Codex with GPT 5.1, which solved 2 Hard and 2 Medium challenges.
If I were to do this again, I would:
- Start all challenge containers at the beginning
- Run as many parallel runners as there are challenges
- Execute BoxPwnr, Claude Code, Codex, and Gemini CLI in parallel from the start
- Be more aggressive with frontier models on Hard challenges
Conclusions
Frontier models are exploding in cyber capabilities. Theyâre proficient across all CTF challenge types. While they might struggle with Pwn challenges due to inadequate tooling (verbose gdb output consuming context), models are getting better at context management - being smart about extracting information efficiently without burning through tokens.
The gap between AI and human capabilities in offensive security is closing faster than expected.
All traces, reports, and interactive replays are available in the BoxPwnr-Attempts repository.