Confidence model
Every finding Vulkro emits carries a confidence value (High, Medium,
or Low) plus a confidence_reason (a String) that explains why the
engine fired and what would make the finding a false positive. This is
the heart of how we keep a high-recall scanner usable.
Levels
| Level | Meaning | When you'd see it |
|---|---|---|
| High | Strong, often runtime-confirmed evidence. | Provider-format secrets, taint-confirmed injection paths, KEV-listed CVEs, vulkro probe confirmations. |
| Medium | The pattern matches, but a credible false-positive shape exists. | Heuristic IDOR (the path takes an :id and no auth check is visible, but the check might live in a wrapper). Generic secret patterns without a provider format. |
| Low | Heuristic. Often fires on test/example/migration code. | Pattern-match-only checks where context confirms little. |
vulkro scan defaults to --min-confidence high. To see the rest:
vulkro scan . --min-confidence medium
vulkro scan . --all
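The effect of the --min-confidence cut can be pictured as an ordered comparison over the three levels. The following is a minimal sketch, not Vulkro's implementation: the Finding class, the rule names, the third reason string, and the filter_findings helper are all hypothetical, and it assumes that --all behaves like a cut at Low.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    """Ordered so that a higher level compares greater than a lower one."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Finding:
    rule: str  # hypothetical rule identifier
    confidence: Confidence
    confidence_reason: str


def filter_findings(findings, min_confidence=Confidence.HIGH):
    """Keep findings at or above the cut (assumed to default to High)."""
    return [f for f in findings if f.confidence >= min_confidence]


findings = [
    Finding("secret.aws", Confidence.HIGH, "AKIA-prefix matches AWS access-key format"),
    Finding("idor.heuristic", Confidence.MEDIUM, "auth helper not found in import scope"),
    Finding("pattern.only", Confidence.LOW, "pattern match only"),
]

for f in filter_findings(findings, Confidence.MEDIUM):
    print(f.rule, f.confidence.name, f.confidence_reason)
```

Using an ordered enum keeps the cut a single comparison, so adding a fourth level later would not change the filter.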
confidence_reason
Every finding has a one-line explanation:
confidence_reason = "taint flowed from req.body to db.query without sanitiser"
confidence_reason = "auth helper not found in import scope"
confidence_reason = "runtime-confirmed via active probe"
confidence_reason = "AKIA-prefix matches AWS access-key format"
The reason is surfaced in JSON output, SARIF (properties.confidence_reason),
and the desktop console. It is designed to be understood without re-reading the source.
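For consumers of the SARIF output, the reason can be read from each result's property bag. This is a sketch under assumptions: the sample document below is hand-written for illustration (the rule ID and message text are hypothetical), and only the properties.confidence_reason key comes from the text above.

```python
import json

# Hand-written sample in the general SARIF 2.1.0 shape; not real Vulkro output.
SAMPLE_SARIF = """
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {"driver": {"name": "vulkro"}},
      "results": [
        {
          "ruleId": "hypothetical.sql-injection",
          "message": {"text": "Possible SQL injection"},
          "properties": {
            "confidence_reason": "taint flowed from req.body to db.query without sanitiser"
          }
        }
      ]
    }
  ]
}
"""


def confidence_reasons(sarif_text):
    """Return (ruleId, confidence_reason) pairs for every result in every run."""
    doc = json.loads(sarif_text)
    pairs = []
    for run in doc.get("runs", []):
        for result in run.get("results", []):
            props = result.get("properties", {})
            pairs.append((result.get("ruleId"), props.get("confidence_reason")))
    return pairs


for rule_id, reason in confidence_reasons(SAMPLE_SARIF):
    print(f"{rule_id}: {reason}")
```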
Calibration table
Vulkro holds a static calibration table that downgrades High to Medium for detector categories whose true-positive rate is under 30% on the 13-repo benchmark. The intent is that if more than 70% of a category's findings are false positives, callers shouldn't be told "High confidence" by default.
The table is updated when benchmark numbers change. This keeps the default-mode signal-to-noise ratio honest as new detectors land.
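The downgrade rule can be sketched as a lookup against per-category true-positive rates. Only the 30% cutoff and the High to Medium direction come from the text; the category names and rates below are invented for illustration, and the real table's shape is not documented here.

```python
DOWNGRADE_BELOW_TP_RATE = 0.30  # cutoff stated above

# Hypothetical per-category true-positive rates; not real benchmark numbers.
BENCHMARK_TP_RATE = {
    "hardcoded-secret": 0.82,
    "heuristic-idor": 0.18,
}


def calibrate(category, confidence):
    """Downgrade High to Medium when the category's TP rate is under the cutoff.

    Categories missing from the table are left unchanged (an assumption).
    """
    tp_rate = BENCHMARK_TP_RATE.get(category)
    if confidence == "High" and tp_rate is not None and tp_rate < DOWNGRADE_BELOW_TP_RATE:
        return "Medium"
    return confidence


print(calibrate("heuristic-idor", "High"))
print(calibrate("hardcoded-secret", "High"))
```

Keeping the table static means the same scan yields the same levels until the benchmark numbers are refreshed.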
Where the bar is
The published benchmark on 13 deliberately-vulnerable repos:
| | vulkro default | vulkro --min-confidence high | semgrep CE | bearer 2.0 |
|---|---|---|---|---|
| precision | 0.21 | 0.62 | 0.60 | 0.45 |
| recall | 0.91 | 0.76 | 0.22 | 0.45 |
| F1 | 0.34 | 0.68 | 0.32 | 0.45 |
--min-confidence high is now the production-recommended cut: it
leads Semgrep CE on precision (0.62 vs 0.60) and Bearer on
recall (0.76 vs 0.45) on the same corpus. F1 0.68 beats both major
competitors outright.
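The F1 row follows from the precision and recall rows through the standard harmonic-mean formula, F1 = 2PR / (P + R). A short check using the table's published values (the dictionary keys are just labels chosen for this sketch):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the benchmark table above.
TOOLS = {
    "vulkro default": (0.21, 0.91),
    "vulkro --min-confidence high": (0.62, 0.76),
    "semgrep CE": (0.60, 0.22),
    "bearer 2.0": (0.45, 0.45),
}

for name, (p, r) in TOOLS.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two numbers, which is why the default mode's 0.91 recall still yields a low F1 at 0.21 precision.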
The lift came from three changes shipped on top of the Phase 4
AST-confirmation engine. First, the cumulative-weight High threshold
was lowered from 1.5 to 1.1 to match what shipped detectors actually
emit: a Phase 4 pair weighs 0.7 + 0.5 = 1.2, which clears 1.1 but
fell short of 1.5. Second, four taint / template / autoescape emit
sites were retrofitted to attach Evidence so that their findings can
survive the OWASP-category calibration downgrade in the aggregator.
Third, a (file, line, message) dedup pass removes duplicate emissions
from the overlapping intra- and inter-procedural taint engines.
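Two of these changes are easy to illustrate in isolation. The sketch below assumes that evidence weights are summed and compared with "greater than or equal" against the threshold, and that findings are plain dictionaries; both are assumptions, as are the function names and the sample findings. Only the numbers 1.5, 1.1, 0.7, and 0.5 and the (file, line, message) key come from the text.

```python
HIGH_THRESHOLD = 1.1      # current value stated above
OLD_HIGH_THRESHOLD = 1.5  # previous value stated above


def reaches_high(weights, threshold=HIGH_THRESHOLD):
    """True when the summed evidence weights meet the High threshold."""
    return sum(weights) >= threshold


def dedup(findings):
    """Drop later findings that repeat an earlier (file, line, message) key."""
    seen = set()
    unique = []
    for finding in findings:
        key = (finding["file"], finding["line"], finding["message"])
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique


phase4_pair = [0.7, 0.5]
print(reaches_high(phase4_pair))                      # meets 1.1
print(reaches_high(phase4_pair, OLD_HIGH_THRESHOLD))  # falls short of 1.5

# Hypothetical findings: the same issue reported by two overlapping engines.
raw = [
    {"file": "app.js", "line": 42, "message": "tainted query", "engine": "intra"},
    {"file": "app.js", "line": 42, "message": "tainted query", "engine": "inter"},
    {"file": "app.js", "line": 77, "message": "tainted query", "engine": "inter"},
]
print(len(dedup(raw)))
```

Keeping the first occurrence makes the dedup pass order-dependent; which engine's copy wins in Vulkro is not stated in the text.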
See Benchmark for the full methodology and per-repo breakdown.