June 30, 2026

What an autonomous pentest agent found in 3,847 apps — and what your scanner didn't

Viktor Bulanek
Founder & CTO, Penetrify
MSc IT Security · 20+ years in security · 4x Ex-CTO

TL;DR

  • 91% of the SQL injection we found was live in apps that run SAST in their CI pipeline — static analysis in the pipeline didn't stop it reaching production.
  • 78% of critical findings were exploitable with no prior access — no login required.
  • 42% of applications had at least one broken access control flaw — the most prevalent category in the dataset.
  • Authenticated testing surfaced 3.4× more vulnerabilities than unauthenticated.
  • Dataset, full methodology, and limitations are below. Raw breakdowns are CC BY 4.0; take what's useful.

The question

Shift-left works: teams scan source code in CI, gate on SAST, and pull in dependency and secrets scanners. So we wanted to answer a narrower, practical question with data rather than opinion: when you actually try to exploit a running application, what do you find that the scanners didn't?

This isn't a product argument — it's a coverage question every AppSec team has. The numbers below come from running an autonomous testing agent that attempts exploitation (not just signature detection) against real applications, then validates each finding by proving it.

Methodology

Dataset. 47,291 findings across 3,847 distinct web applications and APIs, tested between October 2025 and March 2026. Findings are counted at the vulnerability-instance level — one application with five distinct SQL injection points contributes five findings.

"Exploitation-validated" finding. Each finding counted here was confirmed by the agent actually exploiting it — not flagged by signature or heuristic alone. Validation relies on a deterministic oracle: a unique, agent-controlled effect causally produced by its own payload. For injections, the agent triggers an out-of-band interaction (OAST), run on its own dedicated infrastructure, or reads a specific controlled marker (such as the output of SELECT @@version or a pre-planted canary), proving genuine query execution. For authorization flaws, validation is cross-checked: user A's session demonstrably retrieves an object belonging to user B, compared against a ground-truth reference — not merely an HTTP 200 on a manipulated ID. Exploitation is non-destructive: the agent reads only a minimal proof value sufficient to confirm execution, never the underlying dataset, and without modifying or exfiltrating data.

The SAST comparison. This is measured against the subset of apps known to run a SAST gate in their CI pipeline (2,231 of the 3,847, 58%) — a further-narrowed slice of an already self-selected sample. Important caveat: we did not run our own SAST scan over their source code — we only know the pipeline included a SAST step. So the claim is narrow — a SAST-gated pipeline did not prevent the issue from reaching the running app — and it does not establish that SAST failed to detect it. The finding may have been flagged and overridden, suppressed, or fallen outside the tool's language or rule coverage. "SAST" here is whatever single tool and ruleset each team ran.

Privacy. Data is anonymised and aggregated. No PII, no source code, and no identifying details about tested organisations.
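
To make the canary-based injection oracle described under "Exploitation-validated" finding more concrete, here is a minimal Python sketch. It is not Penetrify's implementation: the in-memory SQLite target, the table names, and the functions build_app_db, vulnerable_search, safe_search, and is_validated are all hypothetical, chosen only so the example runs on its own. The idea it illustrates is that a finding counts only when a unique value planted by the tester comes back as a direct result of the tester's payload.

```python
import secrets
import sqlite3


def build_app_db(canary: str) -> sqlite3.Connection:
    """Hypothetical target: an in-memory database holding a pre-planted canary."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE products (id INTEGER, name TEXT)")
    db.execute("CREATE TABLE canary (marker TEXT)")
    db.execute("INSERT INTO products VALUES (1, 'widget')")
    db.execute("INSERT INTO canary VALUES (?)", (canary,))
    return db


def vulnerable_search(db: sqlite3.Connection, term: str) -> list:
    # Deliberately unsafe: user input is concatenated into the SQL text.
    query = "SELECT name FROM products WHERE name = '" + term + "'"
    return [row[0] for row in db.execute(query)]


def safe_search(db: sqlite3.Connection, term: str) -> list:
    # Parameterised: the input is bound as a value and never parsed as SQL.
    query = "SELECT name FROM products WHERE name = ?"
    return [row[0] for row in db.execute(query, (term,))]


def is_validated(search, db: sqlite3.Connection, canary: str) -> bool:
    """Oracle: true only if the payload causes the canary to be returned."""
    payload = "x' UNION SELECT marker FROM canary --"
    return canary in search(db, payload)


if __name__ == "__main__":
    canary = secrets.token_hex(8)  # unique per run, so it cannot appear by chance
    db = build_app_db(canary)
    print("vulnerable endpoint validated:", is_validated(vulnerable_search, db, canary))
    print("safe endpoint validated:", is_validated(safe_search, db, canary))
```

Because the marker is random and read back only as a single value, the check proves query execution without touching the real rows, which is the non-destructive property the methodology describes.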

Where the vulnerabilities are

By share of all findings: broken access control 34.2%, injection (SQLi/XSS/SSTI) 21.7%, security misconfiguration 18.3%, broken authentication 12.4%, sensitive data exposure 7.8%, vulnerable components 3.4%, other 2.2%. Note the two different denominators: broken access control accounts for 34.2% of all findings but affects 42% of applications, making it the most prevalent category in this dataset.
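
The difference between the two denominators is easy to lose, so here is a small Python sketch of both calculations. The findings list, the app identifiers, and the function names are invented for illustration and are not taken from the dataset. It only shows how a per-finding share and a per-application prevalence can diverge when some apps contribute several instances.

```python
from collections import Counter

# Hypothetical findings: one (app_id, category) tuple per vulnerability instance.
FINDINGS = [
    ("app-1", "broken access control"),
    ("app-1", "broken access control"),
    ("app-1", "injection"),
    ("app-2", "injection"),
    ("app-3", "security misconfiguration"),
]
ALL_APPS = {"app-1", "app-2", "app-3", "app-4"}  # app-4 had no findings


def share_of_findings(findings, category):
    """Denominator: every finding (instance-level counting)."""
    counts = Counter(cat for _, cat in findings)
    return counts[category] / len(findings)


def share_of_apps(findings, all_apps, category):
    """Denominator: every tested application, whether or not it had findings."""
    affected = {app for app, cat in findings if cat == category}
    return len(affected) / len(all_apps)


if __name__ == "__main__":
    for cat in ("broken access control", "injection"):
        print(cat, share_of_findings(FINDINGS, cat), share_of_apps(FINDINGS, ALL_APPS, cat))
```

In this toy data both categories hold the same share of findings, yet injection touches twice as many applications, which is why the two figures should not be compared directly.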

Severity and exploitability

49% of findings were High or Critical (Critical 18%, High 31%, Medium 34%, Low/Info 17%). The number that stood out: 78% of critical findings were exploitable with no prior access — reachable by an unauthenticated attacker, with no foothold required.

What a SAST gate in CI didn't stop

Of the SQL injection we exploited, 91% was in applications that run SAST in their CI pipeline (see the SAST caveat in Methodology — we don't claim SAST failed to detect it, only that a SAST-gated pipeline didn't prevent it reaching production). This isn't a knock on SAST — it's a coverage boundary. Static analysis reasons about source code; injection that depends on runtime data flow, framework behaviour, or composed queries — or that's flagged and then overridden — can still be live and exploitable against the running app.
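
As one illustration of a composed query, consider a sort column chosen at request time. The Python sketch below is an assumed example, not a case from the dataset: the products table, the ALLOWED_SORT_COLUMNS allowlist, and both functions are hypothetical. SQL identifiers cannot be bound as parameters, so code like this often builds the statement as a string, and the final SQL text exists only once a request arrives. Whether a particular SAST rule flags the pattern depends on the tool and ruleset.

```python
import sqlite3

ALLOWED_SORT_COLUMNS = {"name", "price"}  # hypothetical allowlist


def make_db() -> sqlite3.Connection:
    """Hypothetical in-memory catalogue with two rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE products (name TEXT, price INTEGER)")
    db.executemany("INSERT INTO products VALUES (?, ?)", [("b", 1), ("a", 2)])
    return db


def list_products_unsafe(db: sqlite3.Connection, sort_by: str) -> list:
    # The request value is spliced into the statement, so it can carry SQL syntax.
    return db.execute(f"SELECT name FROM products ORDER BY {sort_by}").fetchall()


def list_products_safe(db: sqlite3.Connection, sort_by: str) -> list:
    # Only known column names are accepted before the statement is composed.
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")
    return db.execute(f"SELECT name FROM products ORDER BY {sort_by}").fetchall()


if __name__ == "__main__":
    db = make_db()
    print(list_products_unsafe(db, "name"))          # intended use
    print(list_products_unsafe(db, "name LIMIT 1"))  # caller-supplied clause is executed
    try:
        list_products_safe(db, "name LIMIT 1")
    except ValueError as exc:
        print("rejected:", exc)
```

The extra LIMIT clause is harmless here, but it shows that the caller controls the statement's syntax. A dynamic test sees that directly in the running app's response, regardless of how the source was analysed.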

Why authenticated testing matters

Authenticated testing surfaced 3.4× more vulnerabilities than unauthenticated testing across the dataset. Most broken object-level authorization (IDOR) and business-logic flaws can't be reached, let alone exploited, until you're logged in and can move between objects and roles. This is consistent with the 78% figure above: the critical findings skew to pre-auth bugs an anonymous attacker can reach, while authenticated testing mostly expands the total count, largely with authorization issues that are often rated High rather than Critical. Either way, testing only the unauthenticated surface misses most of the high-impact authorization issues.
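
The two-session cross-check described in Methodology can be sketched in a few lines of Python. Everything here is hypothetical: the ORDERS store, the three endpoint variants, and confirms_idor stand in for a real application and are not Penetrify's code. The sketch shows why an HTTP 200 alone is not treated as proof. The attacker's response has to match the object as its legitimate owner sees it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    status: int
    body: dict


# Hypothetical data store: order id -> record with an owner.
ORDERS = {
    101: {"owner": "alice", "item": "book"},
    202: {"owner": "bob", "item": "lamp"},
}


def get_order_vulnerable(session_user: str, order_id: int) -> Response:
    # No ownership check: any logged-in user can read any order.
    order = ORDERS.get(order_id)
    return Response(200, order) if order else Response(404, {})


def get_order_fixed(session_user: str, order_id: int) -> Response:
    # Ownership enforced: other users' orders look like they do not exist.
    order = ORDERS.get(order_id)
    if order is None or order["owner"] != session_user:
        return Response(404, {})
    return Response(200, order)


def get_order_soft_deny(session_user: str, order_id: int) -> Response:
    # Returns 200 with an error body, which a status-only check would misread.
    order = ORDERS.get(order_id)
    if order is None or order["owner"] != session_user:
        return Response(200, {"error": "forbidden"})
    return Response(200, order)


def confirms_idor(endpoint, attacker: str, victim: str, victim_order_id: int) -> bool:
    """True only if the attacker's session returns the victim's actual object."""
    ground_truth = endpoint(victim, victim_order_id)  # reference fetched as the owner
    attempt = endpoint(attacker, victim_order_id)     # same id, attacker's session
    return (
        ground_truth.status == 200
        and attempt.status == 200
        and attempt.body == ground_truth.body
    )


if __name__ == "__main__":
    for endpoint in (get_order_vulnerable, get_order_fixed, get_order_soft_deny):
        print(endpoint.__name__, confirms_idor(endpoint, "alice", "bob", 202))
```

Both sessions must be authenticated for this check to run at all, which is one reason unauthenticated scans rarely surface this class of flaw.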

What this means in practice

A few takeaways that hold regardless of which tools you use:

  • SAST is necessary, not sufficient. Pair static analysis with dynamic, exploit-driven testing — they cover different failure modes, and the injection data shows the gap is large.
  • Authorization is the dominant real-world risk. Broken access control leads, and 78% of critical issues need no login. Test authenticated, across multiple roles, or you miss most of it.
  • Cadence matters. New code — and the issue classes above — ships between point-in-time audits, which is the argument for testing continuously rather than once a quarter.

Limitations — what this data does not show

This is a single-vendor dataset, not a peer-reviewed study. Read it with these caveats:

  • Selection bias. These are applications whose teams chose to run an autonomous testing platform — skewed toward startups, SMBs, and SaaS (38% of the sample). It is not a representative sample of the web.
  • The agent has blind spots. Autonomous testing is strong on broad, exploit-driven coverage but weaker than a skilled human on novel business logic and deeply context-dependent attack chains. What we found is not everything that's there.
  • The SAST comparison is a narrowed subset. It covers only apps that run a SAST gate in CI, a further-narrowed slice of an already self-selected sample, so the selection biases compound. It also reflects only whatever single SAST tool and ruleset each team ran.
  • Validation isn't infallible. Exploitation-validation reduces but doesn't eliminate false positives and false negatives.
  • Point-in-time. Figures cover October 2025 to March 2026 and shift quarter to quarter.

Data availability

Aggregate data and methodology are published on our Web Application Security Report under CC BY 4.0 — quote, chart, or reproduce with attribution. Happy to share raw category and severity breakdowns or the chart pack on request.

For context, IBM's 2024 Cost of a Data Breach Report puts the average cost of a breach at $4.88M, and many breaches begin with the same web-application weaknesses covered here.

Frequently Asked Questions

What types of vulnerabilities does Penetrify detect?

Penetrify detects all OWASP Top 10 vulnerability categories including SQL injection, XSS, CSRF, IDOR, broken authentication, security misconfigurations, and sensitive data exposure. It also tests API security, session management, and common misconfigurations in Supabase, Firebase, and Bubble.

How long does an AI penetration test take?

A quick scan completes in 15–30 minutes. A standard scan runs 1–2 hours with broader coverage. A deep scan can run several hours for complex applications.

What does a Penetrify report include?

Every report includes an executive summary, an overall security score, severity-classified findings (Critical, High, Medium, Low), step-by-step reproduction instructions, and concrete remediation guidance written for developers, not compliance officers.
