Confidence model

Every finding Vulkro emits carries a confidence value (High | Medium | Low) plus a confidence_reason: String that explains why the engine fired and what would make the finding a false positive. This is the heart of how we keep a high-recall scanner usable.

Levels

| Level | Meaning | When you'd see it |
| --- | --- | --- |
| High | Strong, often runtime-confirmed evidence. | Provider-format secrets, taint-confirmed injection paths, KEV-listed CVEs, vulkro probe confirmations. |
| Medium | Pattern matches but with credible false-positive shape. | Heuristic IDOR (path takes an :id and no auth check is visible, but might be wrapped). Generic secret patterns without provider format. |
| Low | Heuristic. Often fires on test/example/migration code. | Pattern-match-only checks where context confirms little. |

vulkro scan defaults to --min-confidence high. To see the rest:

vulkro scan . --min-confidence medium
vulkro scan . --all
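
The threshold can be read as a cut on an ordered scale. The following is a minimal sketch of that idea, not Vulkro's implementation: the `Finding` class, the `filter_findings` helper, the rule names, and the third reason string are hypothetical, and it assumes the ordering Low < Medium < High with `--all` behaving like the lowest threshold.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    # Assumed ordering: a higher value means stronger evidence.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Finding:
    rule: str                 # hypothetical rule identifier
    confidence: Confidence
    confidence_reason: str


def filter_findings(findings, min_confidence):
    """Keep only findings at or above the requested confidence level."""
    return [f for f in findings if f.confidence >= min_confidence]


findings = [
    Finding("aws-access-key", Confidence.HIGH,
            "AKIA-prefix matches AWS access-key format"),
    Finding("idor-heuristic", Confidence.MEDIUM,
            "auth helper not found in import scope"),
    Finding("generic-pattern", Confidence.LOW,
            "pattern match only"),  # illustrative reason, not real output
]

# Three views of the same result set, from strictest to loosest cut.
high_only = filter_findings(findings, Confidence.HIGH)
medium_up = filter_findings(findings, Confidence.MEDIUM)
everything = filter_findings(findings, Confidence.LOW)
```

Under these assumptions the strictest cut keeps one finding, the medium cut keeps two, and the loosest keeps all three.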

confidence_reason

Every finding has a one-line explanation:

confidence_reason = "taint flowed from req.body to db.query without sanitiser"
confidence_reason = "auth helper not found in import scope"
confidence_reason = "runtime-confirmed via active probe"
confidence_reason = "AKIA-prefix matches AWS access-key format"

The reason is surfaced in JSON, SARIF (properties.confidence_reason), and the desktop console. It is designed to be understood without re-reading the source.
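
For downstream tooling, the reason can be pulled out of a SARIF file with a few lines of Python. This is a sketch under assumptions: it assumes the `properties` bag sits on each SARIF `result` object (SARIF 2.1.0 allows property bags on several objects, and the text above does not say which one Vulkro uses), and the sample document and `example-rule` identifier are made up for illustration.

```python
import json

# Hand-written sample, not real Vulkro output.
sarif_text = """
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {"driver": {"name": "vulkro"}},
      "results": [
        {
          "ruleId": "example-rule",
          "message": {"text": "example finding"},
          "properties": {
            "confidence_reason": "runtime-confirmed via active probe"
          }
        }
      ]
    }
  ]
}
"""


def confidence_reasons(sarif):
    """Yield (ruleId, confidence_reason) for every result in every run.

    Missing keys yield None rather than raising, since other tools'
    SARIF output will not carry this property.
    """
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            props = result.get("properties", {})
            yield result.get("ruleId"), props.get("confidence_reason")


reasons = list(confidence_reasons(json.loads(sarif_text)))
```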

Calibration table

Vulkro holds a static calibration table that downgrades High -> Medium for detector categories whose true-positive rate is under 30% on the 13-repo benchmark. The intent is that if a category is more than 70% false-positive in the wild, callers shouldn't be told "High confidence" by default.

The table is updated when benchmark numbers change. This keeps the default-mode signal-to-noise ratio honest as new detectors land.
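
The downgrade rule can be sketched as a lookup built from benchmark rates. This is an illustration only: the category names and rates below are invented, the helper names are hypothetical, and the real table's format is not described here. The only parts taken from the text are the 30% floor and the High -> Medium downgrade.

```python
TP_RATE_FLOOR = 0.30  # categories below this true-positive rate are downgraded

# Invented benchmark numbers, for illustration only.
benchmark_tp_rate = {
    "category-a": 0.82,
    "category-b": 0.18,
}


def build_downgrade_set(tp_rates, floor=TP_RATE_FLOOR):
    """Return the categories whose true-positive rate is under the floor."""
    return {category for category, rate in tp_rates.items() if rate < floor}


def calibrate(category, confidence, downgraded):
    """Downgrade High to Medium for low-precision categories.

    Medium and Low findings pass through unchanged.
    """
    if confidence == "High" and category in downgraded:
        return "Medium"
    return confidence


downgraded = build_downgrade_set(benchmark_tp_rate)
```

Rebuilding the set whenever the benchmark numbers change is what keeps the table in step with new detectors.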

Where the bar is

The published benchmark on 13 deliberately-vulnerable repos:

| | vulkro default | vulkro --min-confidence high | semgrep CE | bearer 2.0 |
| --- | --- | --- | --- | --- |
| precision | 0.21 | 0.62 | 0.60 | 0.45 |
| recall | 0.91 | 0.76 | 0.22 | 0.45 |
| F1 | 0.34 | 0.68 | 0.32 | 0.45 |

--min-confidence high is now the production-recommended cut: it leads Semgrep CE on precision (0.62 vs 0.60) and Bearer on recall (0.76 vs 0.45) on the same corpus. F1 0.68 beats both major competitors outright.
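
F1 is the harmonic mean of precision and recall, so the F1 row can be recomputed from the other two. The short check below uses only the published figures from the table; the `f1` helper and variable names are ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as published in the benchmark table.
published = {
    "vulkro default": (0.21, 0.91),
    "vulkro --min-confidence high": (0.62, 0.76),
    "semgrep CE": (0.60, 0.22),
    "bearer 2.0": (0.45, 0.45),
}

# Round to two decimals to match the table's precision.
scores = {name: round(f1(p, r), 2) for name, (p, r) in published.items()}
```

The recomputed values agree with the published F1 row: 0.34, 0.68, 0.32, and 0.45.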

The lift came from three changes shipped on top of the Phase 4 AST-confirmation engine: the cumulative-weight High threshold was calibrated from 1.5 to 1.1 to match what shipped detectors actually emit (Phase 4 pairs at 0.7 + 0.5 = 1.2); four taint / template / autoescape emit sites were retrofitted to attach Evidence so the aggregator can survive the OWASP-category calibration downgrade; and a (file, line, message) dedup pass removes duplicate emissions from overlapping intra-/inter-procedural taint engines.
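
Two of these changes are easy to illustrate. The sketch below is not the engine's code: it assumes evidence weights are summed and compared with `>=`, that findings are plain dicts with `file`, `line`, and `message` keys, and that the first emission of a duplicate is the one kept. All function names and the sample findings are hypothetical.

```python
OLD_HIGH_THRESHOLD = 1.5
NEW_HIGH_THRESHOLD = 1.1


def reaches_high(evidence_weights, threshold):
    """True if the summed evidence weight meets the High threshold."""
    return sum(evidence_weights) >= threshold


# A Phase 4 pair as described in the text: 0.7 + 0.5 = 1.2.
phase4_pair = [0.7, 0.5]
high_before = reaches_high(phase4_pair, OLD_HIGH_THRESHOLD)  # 1.2 < 1.5
high_after = reaches_high(phase4_pair, NEW_HIGH_THRESHOLD)   # 1.2 >= 1.1


def dedup(findings):
    """Drop repeat emissions sharing the same (file, line, message) key.

    Order is preserved and the first emission wins.
    """
    seen = set()
    unique = []
    for finding in findings:
        key = (finding["file"], finding["line"], finding["message"])
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique


# Two engines reporting the same issue, plus one distinct finding.
emitted = [
    {"file": "app.py", "line": 10, "message": "tainted query", "engine": "intra"},
    {"file": "app.py", "line": 10, "message": "tainted query", "engine": "inter"},
    {"file": "app.py", "line": 42, "message": "tainted query", "engine": "inter"},
]
unique = dedup(emitted)
```

With the old threshold the 1.2 pair falls short of High; with the new one it qualifies. The dedup pass collapses the first two emissions into one and leaves the line-42 finding alone.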

See Benchmark for the full methodology and per-repo breakdown.