Benchmark
Vulkro is measured against the same competitors security teams already use, on the same public test set the SAST industry uses, with the same scoring code. The results are reproducible end-to-end from one shell script: no marketing numbers, no hand-picked examples.
What it means in plain language
Three things matter when you're picking a security tool:
- Does it find the real bugs?
- Does it bury me in false alarms?
- Can I trust the numbers?
Here's where Vulkro lands on each:
1. It catches more real bugs
The benchmark uses 13 public codebases with 55 catalogued security bugs, each backed by a CVE, a GHSA advisory, or a public fix commit. We run each tool (Vulkro at --min-confidence high) and count how many of those 55 it actually flags:
| Tool | Real bugs caught (of 55) |
|---|---|
| vulkro | 42 |
| Bearer 2.0 | 25 |
| Semgrep CE | 12 |
Vulkro catches 3.5x as many bugs as Semgrep CE and 1.7x as many as Bearer. For a security team, this is what matters first - the bug you ship without knowing is the one that becomes an incident.
2. It doesn't bury you in noise
A scanner that flags everything finds everything - but nobody reads the report. The harder bar is "high signal-to-noise": when the tool fires, is it usually right?
| Tool | Real bugs caught | False alarms in same files |
|---|---|---|
| vulkro | 42 | 26 |
| Bearer 2.0 | 25 | 31 |
| Semgrep CE | 12 | 8 |
Vulkro finds 1.7x as many real bugs as Bearer with fewer false alarms (26 versus 31). Against Semgrep, it finds 3.5x as many bugs at the cost of about 3x the noise - and that noise is still tractable (26 in total, not 26,000).
3. The numbers are independently verifiable
You don't have to trust the table. The benchmark harness ships with every Vulkro install and runs end-to-end from one command:
vulkro bench --tools vulkro,semgrep,bearer --tier1 --min-confidence high
Every line in the table above is reproducible on your laptop in under 5 minutes against the same 13 repos. We don't curate which results get shown.
The bottom line for a security team
| Concern | What the numbers say |
|---|---|
| "Will it find the bugs that hurt us?" | Catches 76% of catalogued bugs on the test set (42 of 55). Semgrep catches 22%, Bearer 45%. |
| "Will my team spend their week triaging noise?" | At the recommended setting, Vulkro raises 26 false alarms on the test set: fewer than Bearer (31) and about 3x Semgrep's 8. Manageable, not a flood. |
| "Can I prove to my CISO this isn't marketing?" | Yes. The benchmark harness, the test repos, the ground truth and the scoring code all ship with Vulkro. Re-run any time. |
| "Will the vendor's numbers drift after we sign?" | The benchmark is fully reproducible against pinned commits - same binary, same bundle, same findings, every run. |
How does it measure up against competitors?
A fuller side-by-side at the same --min-confidence high setting:
| metric | vulkro | Semgrep CE | Bearer 2.0 |
|---|---|---|---|
| precision (when it fires, is it right?) | 0.62 | 0.60 | 0.45 |
| recall (does it catch the bugs?) | 0.76 | 0.22 | 0.45 |
| F1 (the standard combined score) | 0.68 | 0.32 | 0.45 |
- F1 0.68 is a 2.1x improvement over Semgrep CE and 1.5x over Bearer. F1 is the headline metric the SAST industry uses - one number that combines "did you find the bug?" (recall) and "when you fired, were you right?" (precision).
- Vulkro leads on every individual axis - precision, recall, F1 - not just the combined number.
If you want maximum recall and can tolerate more noise (e.g. a one-shot audit, a pre-launch scan), drop the confidence threshold:
| metric | vulkro default | vulkro high |
|---|---|---|
| precision | 0.21 | 0.62 |
| recall | 0.91 | 0.76 |
| F1 | 0.34 | 0.68 |
Default mode catches 50 of 55 catalogued bugs, but precision falls to 0.21, so roughly four in five findings are false alarms. That trade suits audits and pre-release scans. For ongoing CI, --min-confidence high is the recommended setting.
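To put the default-mode noise in absolute terms, the false-alarm count can be backed out of the published precision and true-positive figures. The sketch below is only an estimate: it assumes precision = TP / (TP + FP) as defined under Metrics, and because the table rounds precision to two decimals the result is approximate. The function name is illustrative and is not part of the Vulkro CLI.

```python
def implied_false_alarms(true_positives: int, precision: float) -> float:
    """Solve precision = TP / (TP + FP) for FP."""
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    return true_positives * (1.0 - precision) / precision


# Inputs come from the tables on this page; precision is rounded there,
# so treat the outputs as estimates rather than exact counts.
high = implied_false_alarms(42, 0.62)     # --min-confidence high
default = implied_false_alarms(50, 0.21)  # default threshold

print(f"high:    ~{high:.0f} false alarms")
print(f"default: ~{default:.0f} false alarms")
```

The high-confidence estimate lands on the 26 false alarms reported earlier, which is a useful sanity check on the formula. The default-mode estimate comes out somewhere in the high 180s, which is why the default threshold is better suited to a one-off audit than to CI.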
Methodology (the technical bit)
For engineers and security architects who want to verify the numbers, here's exactly how the bench is constructed.
Corpus
Thirteen public repositories, all pinned to specific commits: ten deliberately vulnerable applications plus three clean framework controls.
- Application bug examples - DSVW, DVWA, NodeGoat, Vulnerable-Flask-App, WebGoat, anxolerd-dvpwa, dvna, juice-shop, lets-be-bad-guys, java-sec-code
- Clean framework controls - express, fastapi, flask (these should produce zero findings; they're a "do you fire on a clean codebase?" test)
The exact list and pinned commits ship with the benchmark harness.
Ground truth
Each catalogued bug is a (file, line, class) tuple in a ground-truth
table. Total: 55 bugs. Every bug references a CVE, a GHSA advisory,
or a public fix commit so anyone can verify it's a real vulnerability.
[[bugs]]
file = "app/app.py"
line = 207
class = "sql-injection"
cite = "GHSA-xxxx-yyyy-zzzz"
Scoring rules
For each tool's output we check every finding against the ground truth:
- TP (true positive) - finding fired on a catalogued bug: same file, same class, line within +/-5.
- FP (false positive) - finding fired in a catalogued file but at no catalogued bug line.
- FN (false negative) - catalogued bug missed by the tool.
- OOS (out of scope) - finding fired in a file we didn't catalogue. Neither rewarded nor penalised; we don't claim to know whether those files are clean or just unannotated.
A catalogue entry can be claimed by at most one finding (greedy nearest) so over-firing on the same bug doesn't inflate the score.
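The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the shipped scoring code: it assumes a finding must match on class as well as file and line (per the class-matching caveat), that a wrong-class finding in a catalogued file counts as an FP, and that a second finding on an already-claimed bug is set aside as a duplicate rather than counted as an FP. The Loc type, the score function, and the sample paths and classes are hypothetical.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 5  # the +/-5 line window from the scoring rules


@dataclass(frozen=True)
class Loc:
    file: str
    line: int
    bug_class: str


def score(findings: list[Loc], truth: list[Loc]) -> dict[str, int]:
    catalogued_files = {b.file for b in truth}
    claimed: set[int] = set()  # indices into truth that a finding has claimed
    counts = {"TP": 0, "FP": 0, "FN": 0, "OOS": 0, "DUP": 0}

    for f in findings:
        if f.file not in catalogued_files:
            counts["OOS"] += 1  # uncatalogued file: neither rewarded nor penalised
            continue
        near = [
            i for i, b in enumerate(truth)
            if b.file == f.file
            and b.bug_class == f.bug_class
            and abs(b.line - f.line) <= LINE_TOLERANCE
        ]
        free = [i for i in near if i not in claimed]
        if free:
            # Greedy nearest: claim the closest bug that is still unclaimed.
            best = min(free, key=lambda i: abs(truth[i].line - f.line))
            claimed.add(best)
            counts["TP"] += 1
        elif near:
            counts["DUP"] += 1  # assumption: over-firing on a claimed bug is ignored
        else:
            counts["FP"] += 1

    counts["FN"] = len(truth) - len(claimed)
    return counts


truth = [Loc("app/app.py", 207, "sql-injection"), Loc("app/app.py", 300, "xss")]
findings = [
    Loc("app/app.py", 205, "sql-injection"),  # within 5 lines of 207: TP
    Loc("app/app.py", 209, "sql-injection"),  # same bug, already claimed: DUP
    Loc("app/app.py", 50, "xss"),             # catalogued file, no bug nearby: FP
    Loc("lib/util.py", 10, "xss"),            # uncatalogued file: OOS
]
result = score(findings, truth)
print(result)
```

In this toy run the xss bug at line 300 is never flagged, so it shows up as the single FN.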
Metrics
- precision = TP / (TP + FP) - when the tool fires, how often is it right?
- recall = TP / (TP + FN) - of the bugs that exist, how many did the tool find?
- F1 = 2 x (precision x recall) / (precision + recall) - the standard combined score (1.0 = perfect, 0.0 = nothing).
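As a worked check, the formulas can be applied to the counts reported earlier on this page (real bugs caught and false alarms at --min-confidence high, out of 55 catalogued bugs). The sketch below assumes the "false alarms in same files" column is the FP count and that FN is 55 minus TP; the prf helper is an illustrative name.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


TOTAL_BUGS = 55
# (TP, FP) per tool at --min-confidence high, taken from the tables on this page.
COUNTS = {"vulkro": (42, 26), "Bearer 2.0": (25, 31), "Semgrep CE": (12, 8)}

for tool, (tp, fp) in COUNTS.items():
    p, r, f1 = prf(tp, fp, TOTAL_BUGS - tp)
    print(f"{tool:<11} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Rounded to two decimals, the output reproduces the precision, recall and F1 rows of the side-by-side table.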
Tools compared
- vulkro - this project. Tested at --min-confidence high (production-recommended) and at the default, low.
- Semgrep CE - p/security-audit ruleset.
- Bearer 2.0 - --scanner sast.
- Snyk Code - opt-in; soft-skipped if not authenticated.
- Akto / Pynt - listed in the matrix for honesty but skipped with a stub note: both are DAST tools that need a running API, not a static codebase.
Reproduce it
# Headline numbers, vulkro-only:
vulkro bench --tools vulkro --tier1 --min-confidence high
# Head-to-head against Semgrep + Bearer (auto-detected on PATH):
vulkro bench --tier1 --min-confidence high
# Full default-tier rollup:
vulkro bench --tools vulkro --tier1
The harness writes a per-run results directory with raw findings, normalised JSON, and a markdown scorecard. A reference scorecard is refreshed every release.
Caveats - what these numbers don't tell you
- Tier 1 is deliberately-vulnerable code. Real-world performance varies. A separate noise-floor measurement on popular production SaaS codebases is tracked internally.
- Class matching matters. A finding under the wrong OWASP category doesn't count as a TP even if the line is right. This penalises every tool roughly equally.
- Java, C#, Ruby, PHP repos are excluded from the rollup. Vulkro's coverage of those languages is still partial; including them would unfairly tank recall.
- PostHog is excluded from the default corpus. Scans consistently ran past the 900-second timeout. It re-enters the rollup after a planned parallel-rule-pass scan optimisation lands.
Related
- Confidence model - how High vs Medium vs Low is computed from evidence weights.
- OWASP API Top 10 - per-class detector inventory.
- Reproducing in CI - running the benchmark as part of every PR.