
Neurogrid CTF: The Ultimate AI Security Showdown - Agent of 0ca / BoxPwnr Write-up

On November 20-24, 2025, I participated with BoxPwnr in Neurogrid CTF - the first AI-only CTF competition hosted by Hack The Box, with $50k in AI credits for the top 3 teams. This wasn’t a typical CTF where humans solve challenges; instead, AI agents competed autonomously in a hyper-realistic cyber arena. My autonomous agent secured 5th place, solving 38/45 flags (84.4% completion) across 36 challenges without any manual intervention.

Leaderboard

Note: I didn’t solve any challenges manually. My role was purely supervisory - launching BoxPwnr instances, monitoring when Claude Code or Codex would stop, and encouraging them to try new approaches when they gave up.

Results & Traces

All solving attempts are published with:

  • Full conversation traces showing LLM reasoning
  • Interactive replay UI for each solved challenge (via BoxPwnr)
  • Detailed reports linking to relevant turns

View Full Statistics & All Attempts →

Approach: Five Phases

Phase 1: BoxPwnr with Grok-4.1-fast (Free Tier)

BoxPwnr has had HTB CTF support for a while, allowing autonomous challenge solving with a simple command. However, this CTF required MCP (Model Context Protocol) for container management and flag submission, so I adapted BoxPwnr’s htb_ctf_client.py to make MCP calls for starting containers and submitting flags. When the CTF started, I distributed the 36 challenges across 6 parallel runners (EC2 machines), each with a different OpenRouter account using the free x-ai/grok-4.1-fast model.
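
The sketch below shows roughly what that plumbing looks like with the official MCP Python SDK - a minimal example, not the actual htb_ctf_client.py code. The server URL and the tool names ("start_container", "submit_flag") are hypothetical placeholders; the real ones come from the CTF’s MCP server schema.

```python
# Minimal sketch of talking to a CTF's MCP server with the official Python SDK.
# URL and tool names are hypothetical - the real schema comes from the event.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client


async def start_and_submit(challenge_id: int, flag: str) -> None:
    async with sse_client("https://ctf.example.com/mcp") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Spin up the challenge's Docker container.
            started = await session.call_tool(
                "start_container", {"challenge_id": challenge_id}
            )
            print(started.content)

            # Submit a candidate flag once the agent finds one.
            verdict = await session.call_tool(
                "submit_flag", {"challenge_id": challenge_id, "flag": flag}
            )
            print(verdict.content)


asyncio.run(start_and_submit(1337, "HTB{example_flag}"))
```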

8 challenges solved (all free):

Phase 2: SOTA Models via BoxPwnr

10 more challenges solved with frontier models:

gemini-3-pro-preview:

claude-sonnet-4-5-20250929:

  • The Claim That Broke The Oath - Blockchain - Very Easy (Report) (Replay)
  • Manual - Forensics - Very Easy (Report) (Replay)
  • Whisper Vault - Pwn - Easy (Report) (Replay)
  • kuromind - Web - Hard (trace missing)

Phase 3: Human-in-the-Loop with Cursor

Using Cursor with Claude Sonnet 4.5 in a more interactive approach, I solved 5 more challenges:

  • Triage of the Broken Shrine - Forensics - Easy (Solved with Cursor - trace missing)
  • The Debt That Hunts the Poor - Blockchain - Easy (Cursor Session)
  • Sakura’s Embrace - Secure Coding - Very Easy (Cursor Session)
  • Yugen’s Choice - Secure Coding - Medium (Cursor Session)
  • ashenvault - Web - Medium (Cursor Session)

This became expensive, so I switched to a different approach.

Phase 4: Claude Code

Claude Code’s superior harness around Sonnet 4.5 helped solve 3 more:

Phase 5: Codex

First time trying Codex, with the GPT-5.1-Codex model - it solved 4 more challenges:

Example: Attack Graph Visualization

BoxPwnr generates attack graphs showing the solution path for each challenge. Here’s an example from Blade Master (Coding - Hard):

Attack Graph Example

▶️ Watch the full Blade Master replay
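
For readers curious how such a graph could be built, here is an illustrative sketch (not BoxPwnr’s actual implementation, which may differ) that renders a linear solution path as a directed graph with the graphviz Python package.

```python
# Illustrative sketch: render a solution path as a directed graph.
# The step labels are made up; BoxPwnr derives its graphs from the real trace.
from graphviz import Digraph

steps = [
    "Recon: inspect challenge files",
    "Identify the vulnerability",
    "Build the exploit",
    "Run it against the container",
    "Extract the flag",
]

g = Digraph("attack_graph", format="svg")
g.attr(rankdir="LR")  # left-to-right layout, like a timeline

for i, step in enumerate(steps):
    g.node(str(i), step, shape="box")
    if i > 0:
        g.edge(str(i - 1), str(i))

g.render("attack_graph", cleanup=True)  # writes attack_graph.svg
```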

Unsolved Challenges

Unsolved by me:

  • Coordinator - Crypto - Hard
  • The Bank That Breathed Numbers - Blockchain - Hard
  • Secret Meeting - Forensics - Hard (1/5 flags remaining)
  • MirrorPort - Web - Easy (solved by other teams)

Unsolved by all AI agents:

  • Gemsmith - Pwn - Medium
  • Mantra - Pwn - Hard
  • The Bank That Breathed Numbers - Blockchain - Hard

These three were solved by the top human team in the human-only version of the CTF:

Human Team Performance

What Could I Have Done Better?

I was constrained by resources (money and time/health). I took a cautious, progressive approach to avoid burning through budget without significant ROI.

Based on my BoxPwnr experience with Hack The Box machines, I wasn’t expecting LLMs to solve Hard challenges. But frontier models are improving rapidly: Sonnet 3.7 solved 20% of Cybench 10 months ago, while Opus 4.5 now solves 86%. I was too cautious initially. It wasn’t until the last day that I tried Codex with GPT 5.1, which solved 2 Hard and 2 Medium challenges.

If I were to do this again, I would:

  • Start all challenge containers at the beginning
  • Run as many parallel runners as there are challenges
  • Execute BoxPwnr, Claude Code, Codex, and Gemini CLI in parallel from the start
  • Be more aggressive with frontier models on Hard challenges
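
A minimal sketch of the parallel-runner idea above, assuming a hypothetical solve_challenge.sh wrapper around whichever agent (BoxPwnr, Claude Code, Codex, Gemini CLI) handles a single challenge:

```python
# Hypothetical orchestration sketch: one solver process per challenge, all
# launched at the start of the CTF. solve_challenge.sh is a placeholder for
# whatever wraps a single agent run - it is not a real script in this repo.
import subprocess
from concurrent.futures import ThreadPoolExecutor

challenges = [f"challenge-{i:02d}" for i in range(1, 37)]  # 36 challenges


def solve(challenge_id: str) -> int:
    with open(f"logs/{challenge_id}.log", "w") as log:
        proc = subprocess.run(
            ["./solve_challenge.sh", challenge_id], stdout=log, stderr=log
        )
    return proc.returncode


# One worker per challenge: every container and agent runs from minute one.
with ThreadPoolExecutor(max_workers=len(challenges)) as pool:
    results = dict(zip(challenges, pool.map(solve, challenges)))

print(results)
```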

Conclusions

Frontier models are exploding in cyber capabilities and are proficient across most CTF challenge types. They still struggle the most with Pwn challenges, largely due to inadequate tooling (verbose gdb output consuming context), but they are getting better at context management - extracting the information they need without burning through tokens.
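
As a concrete example of that kind of context management, here is a minimal sketch (my own illustration, not part of BoxPwnr) of running gdb in batch mode and handing the model only a trimmed, relevant slice of its output. The binary path, commands, and filter regex are illustrative.

```python
# Run gdb non-interactively and keep only lines that look useful to the agent
# (addresses, backtrace frames), capped at the last `keep` lines.
import re
import subprocess


def run_gdb_trimmed(binary: str, commands: list[str], keep: int = 40) -> str:
    args = ["gdb", "--batch", binary]
    for cmd in commands:
        args += ["-ex", cmd]
    out = subprocess.run(args, capture_output=True, text=True).stdout

    interesting = [
        line for line in out.splitlines() if re.search(r"0x[0-9a-f]+|#\d+", line)
    ]
    return "\n".join(interesting[-keep:])


print(run_gdb_trimmed("./chall", ["run < input.txt", "bt", "info registers"]))
```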

The gap between AI and human capabilities in offensive security is closing faster than expected.


All traces, reports, and interactive replays are available in the BoxPwnr-Attempts repository.