AI vs Human Teams: Lessons from NorthSec 2026 CTF

A challenge designer reflects on how AI agents reshaped NorthSec 2026 CTF, impacting competition balance, experience and the future of cybersecurity challenges.

Jun 9, 2026

Disclaimer: This blog is based on my own personal experience, observations and opinions and does not necessarily represent the views of the NorthSec team and organisation.

I’ve been a challenge designer for the NorthSec Capture The Flag (CTF) since 2020, making this year’s edition my 7th. I do it as an unpaid volunteer and have never looked back, because I enjoy making something that participants find fun and that forces them to learn new things on the fly. I have always found that helping the new generation of hackers and security professionals is a very worthy use of my time.

An audience watches the main stage presentation at the NorthSec 2026 conference. Source: Philippe Dugre

Also, I was a speaker this year alongside Ashley Manraj, Chief Technology Officer at Pvotal Tech. You can listen to the full talk and workshop here: “Shellcoding the Unshellable: Process Hooking & Advanced Shellcoding in Hardened Go Containers”.

This year, something fundamentally changed compared to previous years. As in the rest of the tech industry, Artificial Intelligence (AI), meaning Large Language Models (LLMs) and especially their agentic use, is having a tremendous impact that I, along with others on the team, frankly underestimated before the competition. While not everybody felt the impact the same way, I wanted to write this blog post to express my view as a challenge designer and explain how I feel it will affect CTFs moving forward.

This is not Fear, Uncertainty and Doubt (FUD)

I think it’s important to mention early that this blog is not about scaring industry professionals and CTF enjoyers. One of my observations is that AI is now very good at exploiting obvious vulnerabilities, as in the kind of challenges I design, where finding the vulnerability is easy but exploiting it is hard. It still struggles, however, with needle-in-a-haystack scenarios with a large attack surface, which are far more representative of real-life scenarios and vulnerabilities in the wild.

Furthermore, I do see AI as an opportunity for CTFs. It allows designers like me to create more challenges, and more fleshed-out ones. While I don’t believe we can realistically keep the AI-first and traditional teams in the same category, I do believe we have the opportunity to improve the AI category via adversarial challenges and tweaked rules, and to make really fun CTFs for this new crowd.

Initial Expectations

Some people would have me on record saying, the week before the event, that I was not too worried about LLMs. At the 2025 NorthSec edition, agents and vibe coding were already a thing and we didn’t have any issues at all. However, I wrote most of my challenges two to three months before the event, and during that time models got a *lot* smarter, especially when it comes to finding and exploiting vulnerabilities. Furthermore, agents are now able to autonomously find and solve challenges with close to no user interaction. To put it simply, if the challenge was to hack a web server, the AI-focused teams would get the flag and its points without ever seeing the website itself.

My expectation was that all top teams would have access to frontier models anyway and that the hard challenges would not be solved by AI. That way, teams could use AI to steamroll the easy and medium challenges, leaving more time to focus on the hard challenges, which would make the difference in the end. What I did not expect is that AI can now solve very complex challenges easily.

To put it in perspective, the competition lasts around 42 hours. Normally, my flagship hard challenge gets solved by one or two teams during the entire event, if it’s solved at all. This year, more than ten teams solved it using AI a few hours into the competition without really interacting with the challenge itself. Only one human solved the challenge, and did so 10 minutes before the end of the competition. In other words, AI is now so good that even NorthSec, which is known for very difficult challenges, cannot design traditional challenges that resist AI while remaining reasonable to solve within the 42-hour window. This renders traditional skills suddenly quasi-obsolete from a competitive standpoint.

What’s different this time?

One common reaction is to compare this to earlier tools that ended up being pretty revolutionary in their field. The angr tool suite, for instance, introduced easy-to-use symbolic execution for reverse engineering and made a lot of reverse engineering challenges (specifically what we refer to as “crackmes”) very easy. Another example in the same field is Ghidra, the first free decompiler to take over the industry, which made reverse engineering compiled code much easier than reading the assembly.

The biggest difference here is the scale: while the aforementioned tools shook the reverse engineering world, most challenge categories were not affected by them. Moreover, angr still doesn’t work on each and every crackme, and we’ve found reasonable ways to design crackmes with angr in mind. Ghidra is helpful, but it’s still pretty hard to reverse binaries (especially Go and Rust binaries) with it. With AI agents, the only category that truly stays untouched is physical challenges, and very few other challenges actually resist a frontier model with enough tokens.

Who does this affect?

As mentioned above, not everybody is affected by the use of AI agents in the competition. A lot of participants never really cared about the scoreboard in the first place and only focused on learning and having fun. Some teams also really enjoyed light use of LLMs as assistants that gave them ideas and guided them along their learning paths. Many AI-first teams had a lot of fun optimizing their AI workflows and agent swarms.

On the other hand, the teams most negatively affected were traditional competitive teams, who either had to accept being very far down the leaderboard or switch to the new AI workflows to survive. From what I’ve heard from such teams, they had far less fun solving challenges this way than doing it manually. That makes sense, as they’ve been competing for years because they loved the traditional way of doing things.

Finally, while not participants, the challenge designers were also affected very negatively, for the most part. Here, I can talk about my own experience, and I can say that it is partly our own fault. We decided to allow LLM usage this year as we were gathering data, and kindly asked participants to mark their AI solves as such. However, we did not penalize anyone for not disclosing their AI use and, furthermore, we had not adapted our internal tooling beforehand. The result is that we were not able to find out which teams solved our challenges manually and talk with them, as most of the valid flag submissions came from undisclosed autosolvers.

Personally, the most fun part of the CTF for me is walking around and congratulating the teams that get the first solve on my challenges, asking for feedback and whether they had fun doing it. I also like to check the progress of teams still working on them. This year, I couldn’t really do that: if I asked teams how they solved a challenge, I’d just get an answer like “I don’t know, my agent did it”, which would end the conversation, because a lot of them had little to no interaction with the challenge itself. Furthermore, the few teams that I expected would be able to solve my harder and more involved challenges all switched to an AI workflow to stay competitive and got those flags via AI too.

The Numbers

We surveyed the participants about their experience with AI during the event. We released a blog article about it in French; if you’re interested, you can find the article here: A look back at Nsec 2026: the community's pulse on the agentic CTF.

I believe the most important aspect of a CTF is that the participants have fun. The interesting statistics in my eyes are therefore the subjective questions about that.

84% of teams reported using AI agents to solve at least one flag, showing that AI has already become deeply embedded in the CTF workflow rather than being an edge-case tool.
52.4% of the answers indicated that autonomous agents degraded the CTF experience, while 17.2% of the participants indicated that agents improved it.
42.3% of the participants indicated that they think NorthSec should adapt boldly to the situation.
75.3% of respondents said they were more interested in learning hacking-related knowledge than in learning how to use AI.

My interpretation of this is that a lot of people are dissatisfied with the current state of affairs, with AI and human teams competing in the same category. At the same time, a still significant number of participants reported that they enjoyed using AI and saw it as a positive. I can’t predict the future, but I think there is a good probability that the balance shifts further towards AI, so simply dismissing AI-first teams is not a valid solution either.

Moving Forward

I don’t really see a good way forward without splitting traditional human-first and AI-first teams into distinct categories, each with its own leaderboard. A lot of people don’t really care whether or not they have to compete against AI, so splitting categories wouldn’t really be a negative for them. Competitive traditional teams would probably rather compete in the traditional category, so they can stay competitive without having to completely change their CTF experience.

As I mentioned above, when it comes to a full, all-out AI category, I really see an opportunity rather than just a problem to fix. A lot of companies value the capacity to use AI effectively in their workflow, so the leaderboard would have a lot of value for them when it comes to hiring. Furthermore, with a few tweaks to the scoring system, it might be possible to evaluate that efficiency even better, for example by giving more points to the first teams that solve a specific flag or by monitoring token count for solves (although I do not see a good way to monitor and enforce this on the technical side). Finally, it’s also an opportunity to add challenges that are adversarial to AI/LLMs and that wouldn’t really work in a traditional CTF: prompt injections to confuse the AI, adversarial design, needle-in-a-haystack challenges, etc.
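
To make the first-solve idea a bit more concrete, here is a minimal Python sketch of one way such a scoring tweak could work. It is purely illustrative and not how the NorthSec scoreboard works: the `Solve` record, the `bonus` and `decay` parameters, and the team and flag names are all hypothetical. Token counting is left out, since I don’t see a reliable way to measure it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solve:
    """Hypothetical record of one accepted flag submission."""
    team: str
    flag: str
    solved_at: float  # hours since the start of the competition


def solve_points(base: int, rank: int, bonus: float = 0.5, decay: float = 0.5) -> int:
    """Points for the rank-th team to solve a flag (rank 0 = first solve).

    The first solver earns base * (1 + bonus); the bonus is multiplied by
    `decay` for each later solver, so late solves approach the base value.
    Both parameters are assumptions chosen for illustration.
    """
    return int(base * (1 + bonus * decay ** rank))


def scoreboard(solves: list[Solve], flag_values: dict[str, int]) -> dict[str, int]:
    """Total points per team, rewarding earlier solves of each flag."""
    solvers_by_flag: dict[str, list[str]] = {}
    # Walk the solves in chronological order to establish solve rank per flag.
    for solve in sorted(solves, key=lambda s: s.solved_at):
        teams = solvers_by_flag.setdefault(solve.flag, [])
        if solve.team not in teams:  # ignore duplicate submissions
            teams.append(solve.team)

    totals: dict[str, int] = {}
    for flag, teams in solvers_by_flag.items():
        for rank, team in enumerate(teams):
            totals[team] = totals.get(team, 0) + solve_points(flag_values[flag], rank)
    return totals


if __name__ == "__main__":
    solves = [
        Solve("ai-swarm", "pwn1", 2.0),
        Solve("humans", "pwn1", 41.8),
        Solve("humans", "web1", 1.0),
    ]
    print(scoreboard(solves, {"pwn1": 100, "web1": 20}))
    # prints {'humans': 155, 'ai-swarm': 150}
```

With these made-up values, the first team on a 100-point flag gets 150 points and the second gets 125. How steep the bonus should be is a design choice that would need tuning against real solve data.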

Either way, the other organizers and I will have to discuss all of this at length so we can fine-tune the details before deciding anything. This is a major paradigm shift, but I am confident that we’ll be able to come up with something that ends up being better for everyone involved and, who knows, maybe even corner the market for AI-centric CTFs.

Recognition: A Pirate’s Honour

A close-up shot shows the 3D-printed black pirate trophy awarded to Philippe Dugre for Le Gentil Pirate 2026. Source: Philippe Dugre

Before I forget, I was honoured to receive the “Gentil Pirate, Challenge Designer of the Year” award in recognition of my contribution to the CTF. It meant a lot to be acknowledged in this way, and I truly appreciate the recognition of the work, effort and care that go into building these challenges, and especially of the value I bring to the community by doing it. It meant even more considering the rough time we challenge designers had with recognition this year because of the aforementioned visibility issues.