Benchmark

Vulkro is measured against the same competitors security teams already use, on the same public test set the SAST industry uses, with the same scoring code applied to every tool. The results are reproducible end-to-end from one shell script: no marketing numbers, no hand-picked examples.

What it means in plain language

Three things matter when you're picking a security tool:

Does it find the real bugs?
Does it bury me in false alarms?
Can I trust the numbers?

Here's where vulkro lands on each:

1. It catches more real bugs

The benchmark uses 10 public codebases in the languages vulkro targets (JS / TS, Python, Go) with 55 catalogued security bugs (real vulnerabilities, real CVE references, real bug-fix commits). The corpus also contains 3 Java / PHP apps (DVWA, WebGoat, java-sec-code); those are outside vulkro's language scope, so they are excluded from the head-to-head. When we run each tool at its default setting and check how many of the 55 bugs it actually flags:

Tool	Real bugs caught (of 55)
vulkro	36
Bearer 2.0.2	26
Semgrep CE	13

At its default high-confidence setting vulkro catches 2.8x as many bugs as Semgrep CE and 1.4x as many as Bearer. Turn the confidence threshold down to max-recall mode and it catches 51 of 55 (93%) - see the trade-off below. For a security team, this is what matters first: the bug you ship without knowing is the one that becomes an incident.

2. It doesn't bury you in noise

A scanner that flags everything finds everything - but nobody reads the report. The harder bar is "high signal-to-noise": when the tool fires, is it usually right?

Tool	Real bugs caught	False alarms in same files
vulkro	36	11
Bearer 2.0.2	26	26
Semgrep CE	13	4

Vulkro finds more real bugs than Bearer (36 vs 26) with fewer false alarms (11 vs 26) - it wins on both at once. Against Semgrep it finds 2.8x as many bugs (36 vs 13) at essentially the same precision (0.77 vs 0.76), at the cost of seven more false alarms in absolute terms (11 vs 4). The noise stays tractable - 11 false alarms in total, not 11,000.

3. The numbers are independently verifiable

You don't have to trust the table. The benchmark harness ships with every Vulkro install and runs end-to-end from one command:

vulkro bench --tools vulkro,semgrep,bearer --tier1

Every line in the table above is reproducible on your laptop against the same pinned repos. We don't curate which results get shown.

The bottom line for a security team

Concern	What the numbers say
"Will it find the bugs that hurt us?"	Catches 65% of catalogued bugs at the default high-confidence setting (36 of 55), and 93% (51 of 55) in max-recall mode. Semgrep catches 24%, Bearer 47%.
"Will my team spend their week triaging noise?"	At the recommended setting vulkro raises fewer false alarms than Bearer (11 vs 26) and seven more than Semgrep (11 vs 4) while finding 2.8x as many bugs - manageable, not a flood.
"Can I prove to my CISO this isn't marketing?"	Yes. The benchmark harness, the test repos, the ground truth and the scoring code all ship with Vulkro. Re-run any time.
"Will the vendor's numbers drift after we sign?"	The benchmark is fully reproducible against pinned commits - same binary, same bundle, same findings, every run.

How does it measure up against competitors?

A fuller side-by-side at the same --min-confidence high setting:

metric	vulkro	Semgrep CE	Bearer 2.0.2
precision (when it fires, is it right?)	0.77	0.76	0.50
recall (does it catch the bugs?)	0.65	0.24	0.47
F1 (the standard combined score)	0.71	0.36	0.49

F1 0.71 is a 2.0x improvement over Semgrep CE and 1.4x over Bearer. F1 is the headline metric the SAST industry uses - one number that combines "did you find the bug?" (recall) and "when you fired, were you right?" (precision).
Vulkro leads on every individual axis - precision, recall, F1 - not just the combined number. It matches Semgrep's precision (0.77 vs 0.76) while catching 2.8x as many bugs.

If you want maximum recall and can tolerate more noise (e.g. a one-shot audit, a pre-launch scan), drop the confidence threshold:

metric	vulkro max-recall	vulkro default (high)
precision	0.36	0.77
recall	0.93	0.65
F1	0.52	0.71

Max-recall mode catches 51 of 55 catalogued bugs at the cost of more noise (89 false alarms instead of 11). Useful for audits and pre-release scans. For ongoing CI, the default --min-confidence high is the recommended setting.
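
One way to read this trade-off is as triage cost. The sketch below uses only the true-positive and false-alarm counts quoted above (36 / 11 and 51 / 89); the names are illustrative, and it covers scored findings only, since out-of-scope findings are not counted by the benchmark.

```python
# (true positives, false positives) per mode, taken from the tables above.
MODES = {
    "default (high)": (36, 11),
    "max-recall": (51, 89),
}


def false_alarms_per_bug(tp: int, fp: int) -> float:
    """Average number of false alarms triaged for each real bug found."""
    return fp / tp


for name, (tp, fp) in MODES.items():
    print(f"{name}: {tp + fp} scored findings, "
          f"{false_alarms_per_bug(tp, fp):.2f} false alarms per real bug")

# Marginal cost of switching from the default to max-recall mode.
default_tp, default_fp = MODES["default (high)"]
max_tp, max_fp = MODES["max-recall"]
extra_bugs = max_tp - default_tp          # 15 additional real bugs
extra_false_alarms = max_fp - default_fp  # 78 additional false alarms
print(f"max-recall adds {extra_bugs} bugs for {extra_false_alarms} extra "
      f"false alarms ({extra_false_alarms / extra_bugs:.1f} per additional bug)")
```

Each additional bug surfaced by max-recall mode costs about five extra false alarms on this corpus, which is why it suits a one-shot audit better than a per-commit CI gate.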

Methodology (the technical bit)

For engineers and security architects who want to verify the numbers, here's exactly how the bench is constructed.

Corpus

Thirteen public repositories, all pinned to specific commits: seven deliberately-vulnerable applications, three clean framework controls, and three applications outside vulkro's language scope.

Application bug examples - DSVW, NodeGoat, Vulnerable-Flask-App, anxolerd-dvpwa, dvna, juice-shop, lets-be-bad-guys (JS / TS, Python, Go)
Clean framework controls - express, fastapi, flask (these should produce zero findings; they're a "do you fire on a clean codebase?" test)
Out of language scope - DVWA (PHP), WebGoat and java-sec-code (Java). Vulkro targets JS / TS, Python and Go, so these three are excluded from the head-to-head scoring (all tools are scored on the same 10 in-scope repos). The 55-bug ground-truth total above is the in-scope count.

The exact list and pinned commits ship with the benchmark harness.
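
The clean-control idea can be sketched as a simple check: any finding on express, fastapi or flask counts against the tool. This is an illustration only; the function name, the input shape and the finding counts are hypothetical, not the harness's actual API or measured results.

```python
# Clean framework controls named in the corpus list above.
CLEAN_CONTROLS = ("express", "fastapi", "flask")


def noisy_controls(findings_by_repo: dict[str, int]) -> dict[str, int]:
    """Return the clean controls on which a tool reported one or more findings."""
    return {
        repo: findings_by_repo.get(repo, 0)
        for repo in CLEAN_CONTROLS
        if findings_by_repo.get(repo, 0) > 0
    }


# Hypothetical finding counts, for illustration only (not measured results).
example = {"express": 0, "fastapi": 2, "flask": 0, "juice-shop": 40}
print(noisy_controls(example))
```

An empty result means the tool stayed silent on all three controls, which is the expected outcome.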

Ground truth

Each catalogued bug is a (file, line, class) tuple in a ground-truth table. Total: 55 bugs. Every bug references a CVE, a GHSA advisory, or a public fix commit so anyone can verify it's a real vulnerability.

[[bugs]]
file = "app/app.py"
line = 207
class = "sql-injection"
cite = "GHSA-xxxx-yyyy-zzzz"

Scoring rules

For each tool's output we check every finding against the ground truth:

TP (true positive) - finding fired on a catalogued bug: same file, line within +/-5, same vulnerability class.
FP (false positive) - finding fired in a catalogued file but at no catalogued bug line.
FN (false negative) - catalogued bug missed by the tool.
OOS (out of scope) - finding fired in a file we didn't catalogue. Neither rewarded nor penalised; we don't claim to know whether those files are clean or just unannotated.

A catalogue entry can be claimed by at most one finding (greedy nearest) so over-firing on the same bug doesn't inflate the score.
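
The rules above can be sketched as follows. This is one possible reading, not the harness's scoring code: the `Entry` type and `score` function are hypothetical, matching requires the same class (as the caveats describe), and a repeat or wrong-class finding that lands near a catalogued line is left unscored rather than counted as an FP, which is an assumption the rules do not settle.

```python
from dataclasses import dataclass

LINE_TOLERANCE = 5  # a finding may be up to 5 lines from the catalogued line


@dataclass(frozen=True)
class Entry:
    """A (file, line, class) tuple; used for both catalogued bugs and findings."""
    file: str
    line: int
    cls: str


def score(findings: list[Entry], bugs: list[Entry]) -> dict[str, int]:
    catalogued_files = {b.file for b in bugs}

    # Every (finding, bug) pair that could match, nearest first.
    candidates = sorted(
        (abs(f.line - b.line), fi, bi)
        for fi, f in enumerate(findings)
        for bi, b in enumerate(bugs)
        if f.file == b.file and f.cls == b.cls
        and abs(f.line - b.line) <= LINE_TOLERANCE
    )

    claimed_bugs: set[int] = set()
    matched_findings: set[int] = set()
    for _, fi, bi in candidates:  # greedy: the closest pairs claim first
        if fi not in matched_findings and bi not in claimed_bugs:
            matched_findings.add(fi)
            claimed_bugs.add(bi)

    counts = {"TP": len(claimed_bugs), "FP": 0,
              "FN": len(bugs) - len(claimed_bugs), "OOS": 0, "unscored": 0}
    for fi, f in enumerate(findings):
        if fi in matched_findings:
            continue
        if f.file not in catalogued_files:
            counts["OOS"] += 1
        elif any(b.file == f.file and abs(f.line - b.line) <= LINE_TOLERANCE
                 for b in bugs):
            counts["unscored"] += 1  # assumption: near a bug line, so not an FP
        else:
            counts["FP"] += 1
    return counts


# Illustrative data: "xss" and lib/util.py are made up for the example.
bugs = [Entry("app/app.py", 207, "sql-injection"), Entry("app/app.py", 300, "xss")]
findings = [
    Entry("app/app.py", 205, "sql-injection"),  # claims the bug at line 207
    Entry("app/app.py", 210, "sql-injection"),  # repeat hit on the same bug
    Entry("app/app.py", 50, "xss"),             # catalogued file, no bug nearby
    Entry("lib/util.py", 10, "xss"),            # file not in the catalogue
]
print(score(findings, bugs))
```

Sorting candidate pairs by distance before claiming is what makes the match "greedy nearest": the second hit on line 207 cannot earn a second TP.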

Metrics

precision = TP / (TP + FP) - when the tool fires, how often is it right?
recall = TP / (TP + FN) - of the bugs that exist, how many did the tool find?
F1 = 2 x (precision x recall) / (precision + recall) - the standard combined score (1.0 = perfect, 0.0 = nothing).
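
As a worked check, the three formulas reproduce the side-by-side table from the published counts (true positives and false alarms per tool, 55 catalogued bugs). The helper names are illustrative, and returning 0.0 for an empty denominator is an assumed convention rather than documented harness behaviour.

```python
TOTAL_BUGS = 55

# (true positives, false positives) at --min-confidence high, from the tables above.
COUNTS = {
    "vulkro": (36, 11),
    "Semgrep CE": (13, 4),
    "Bearer 2.0.2": (26, 26),
}


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0  # assumed convention for no findings


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0


def scorecard(tp: int, fp: int, total: int = TOTAL_BUGS) -> tuple[float, float, float]:
    """Return (precision, recall, F1) rounded to two decimals."""
    fn = total - tp  # every catalogued bug that was not claimed is a miss
    p, r = precision(tp, fp), recall(tp, fn)
    return round(p, 2), round(r, 2), round(f1(p, r), 2)


for tool, (tp, fp) in COUNTS.items():
    print(tool, scorecard(tp, fp))
```

F1 is computed from the unrounded precision and recall and rounded last, so the printed values are not skewed by intermediate rounding.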

Tools compared

vulkro - this project. Tested at --min-confidence high (production-recommended) and at the low threshold used for max-recall mode.
Semgrep CE - p/security-audit ruleset.
Bearer 2.0.2 - --scanner sast.
Snyk Code - opt-in; soft-skipped if not authenticated.
Akto / Pynt - listed in the matrix for honesty but skipped with a stub note: both are DAST tools that need a running API, not a static codebase.

Reproduce it

# Headline numbers, vulkro-only:
vulkro bench --tools vulkro --tier1 --min-confidence high

# Head-to-head against Semgrep + Bearer (auto-detected on PATH):
vulkro bench --tier1 --min-confidence high

# Full default-tier rollup:
vulkro bench --tools vulkro --tier1

The harness writes a per-run results directory with raw findings, normalised JSON, and a markdown scorecard. A reference scorecard is refreshed every release.
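
For scripted runs, the commands above can be assembled and launched from Python. This is a minimal sketch that uses only the flags shown in this section (--tools, --tier1, --min-confidence); the helper names are hypothetical, and the meaning of the exit code is not documented here, so the sketch returns it without interpreting it.

```python
import shutil
import subprocess


def bench_command(tools=None, min_confidence=None):
    """Build a `vulkro bench` argument list from the flags shown above."""
    argv = ["vulkro", "bench"]
    if tools:
        argv += ["--tools", ",".join(tools)]
    argv.append("--tier1")
    if min_confidence:
        argv += ["--min-confidence", min_confidence]
    return argv


def run_bench(argv):
    """Run the benchmark if vulkro is on PATH; return its exit code, else None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


# Prints the vulkro-only headline command without executing it.
print(" ".join(bench_command(["vulkro"], "high")))
```

Passing the arguments as a list avoids shell quoting issues when the tool list comes from CI configuration.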

Caveats - what these numbers don't tell you

Tier 1 is deliberately-vulnerable code. Real-world performance varies. A separate noise-floor measurement on popular production SaaS codebases is tracked internally.
Class matching matters. A finding under the wrong OWASP category doesn't count as a TP even if the line is right. This penalises every tool roughly equally.
Java, C#, Ruby, PHP repos are excluded. The general scanner no longer covers those languages (its deep tier is JS/TS, Python, Go; Salesforce Apex is a separate product), so benchmark repos in them are out of scope.
PostHog is excluded from the default corpus. Its scans consistently ran past the 900-second timeout. It re-enters the rollup after a planned parallel-rule-pass scan optimisation lands.

Related

Confidence model - how High vs Medium vs Low is computed from evidence weights.
OWASP API Top 10 - per-class detector inventory.
Reproducing in CI - running the benchmark as part of every PR.
