
De-identification validation

When a Salesforce dataset is masked for a sandbox, an analytics export, or a vendor handoff, "de-identified" is a claim that has to be proven, not assumed. Removing direct identifiers (name, email, phone) is not enough: an attacker can re-identify individuals by correlating quasi-identifiers (ZIP, birth date, gender, job title) against external data. Vulkro's de-identification validation measures a sample of the masked dataset against three standard privacy guarantees: k-anonymity, l-diversity, and t-closeness.

The checks run over a sampled dataset you provide, not over live records pulled at scan time. Sampling keeps the checks fast and keeps sensitive data out of the scan footprint while still giving a statistically meaningful read on masking quality.

k-anonymity

What it measures. Whether every combination of quasi-identifier values in the sample is shared by at least k records. If a quasi-identifier combination is unique (or rare) in the dataset, the record it points to is effectively singled out even with direct identifiers stripped.

Why it matters. k-anonymity is the floor for any de-identification claim. A masked export where some quasi-identifier combination matches exactly one person has not been de-identified in any meaningful sense: a single linkage against a public dataset re-identifies that person.

How to fix. Raise k by generalizing quasi-identifiers (bucket ages into ranges, truncate ZIP to three digits, coarsen job titles) or by suppressing the rare records. The finding reports the smallest group size found so you can see how far the dataset is from the target k.
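
The measurement and the generalization fix can be sketched in a few lines of Python. This is an illustrative sketch, not Vulkro's implementation: the field names (zip, age, gender), the sample records, and the helper names smallest_group_size and generalize are all assumptions made for the example.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest quasi-identifier group (the achieved k)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def generalize(record):
    """Coarsen quasi-identifiers: 3-digit ZIP prefix and 10-year age bands."""
    low = (record["age"] // 10) * 10
    return {**record, "zip": record["zip"][:3], "age": f"{low}-{low + 9}"}

# Hypothetical masked sample: direct identifiers are already stripped.
sample = [
    {"zip": "94107", "age": 34, "gender": "F"},
    {"zip": "94110", "age": 36, "gender": "F"},
    {"zip": "94103", "age": 31, "gender": "F"},
    {"zip": "10001", "age": 52, "gender": "M"},
    {"zip": "10003", "age": 58, "gender": "M"},
]
QI = ["zip", "age", "gender"]

k_before = smallest_group_size(sample, QI)                          # every row is unique
k_after = smallest_group_size([generalize(r) for r in sample], QI)  # rows now share groups
print(k_before, k_after)
```

In this sample every raw record is unique on the quasi-identifiers, so the achieved k is 1. After generalization the records fall into two groups of sizes 3 and 2, so the achieved k rises to 2.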

l-diversity (l-diversity-violation)

What it measures. Within each group of records that share a quasi-identifier combination, whether the sensitive attribute takes at least l well-represented distinct values. k-anonymity alone can hide the identity of a record while still leaking its sensitive value: if every record in a k-anonymous group has the same diagnosis, salary band, or case type, knowing someone is in the group reveals their sensitive value.

Why it matters. The l-diversity-violation finding catches the homogeneity attack that k-anonymity misses. A dataset can be perfectly k-anonymous and still disclose sensitive information for every individual in a uniform group.

How to fix. Increase diversity within groups by re-partitioning the quasi-identifiers so each group spans at least l distinct sensitive values, or suppress groups that cannot meet the threshold. The finding names the offending group and its sensitive-value distribution.
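
A minimal sketch of the check follows, assuming the simplest variant (distinct l-diversity, which only counts distinct sensitive values per group). Stricter readings of "well-represented", such as entropy-based l-diversity, weigh how evenly the values are spread and are not shown here. The field names, the sample data, and the function name l_diversity_violations are hypothetical.

```python
from collections import defaultdict

def l_diversity_violations(records, quasi_identifiers, sensitive, l):
    """Return groups whose sensitive attribute has fewer than l distinct values."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) < l}

# Hypothetical sample that is already generalized and 2-anonymous.
sample = [
    {"zip": "941", "age": "30-39", "diagnosis": "flu"},
    {"zip": "941", "age": "30-39", "diagnosis": "flu"},
    {"zip": "941", "age": "30-39", "diagnosis": "flu"},
    {"zip": "100", "age": "50-59", "diagnosis": "asthma"},
    {"zip": "100", "age": "50-59", "diagnosis": "flu"},
]

violations = l_diversity_violations(sample, ["zip", "age"], "diagnosis", l=2)
print(violations)
```

The first group is k-anonymous with k = 3 but holds a single diagnosis, so anyone known to be in that group is known to have it. That is the homogeneity attack described above.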

t-closeness (t-closeness-violation)

What it measures. Whether the distribution of the sensitive attribute within each quasi-identifier group is close (within a threshold t) to its distribution across the whole dataset. l-diversity ensures variety within a group but not that the variety is representative: a group can be l-diverse yet still skew heavily toward a rare, telling value.

Why it matters. The t-closeness-violation finding catches the skewness and similarity attacks that survive l-diversity. If a group's sensitive-value distribution differs sharply from the population, an attacker learns something specific about the individuals in that group even though the group is diverse.

How to fix. Smooth the within-group distribution toward the overall distribution by generalizing or re-grouping the quasi-identifiers, or suppress the records that drive the skew. The finding reports the measured distance against the t threshold so you can target the worst groups first.
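
The distance measurement can be sketched as follows. The source does not say which distance Vulkro uses, so this sketch assumes total variation distance, a common choice for a categorical sensitive attribute; t-closeness is often defined with Earth Mover's Distance instead, which matters for ordered or numeric attributes. The field names, sample data, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def t_closeness_distances(records, quasi_identifiers, sensitive):
    """Distance of each group's sensitive distribution from the overall one."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return {key: total_variation(distribution(vals), overall)
            for key, vals in groups.items()}

# Hypothetical sample: both groups are 2-diverse, but each is skewed.
sample = (
    [{"zip": "941", "salary_band": "low"}] * 3
    + [{"zip": "941", "salary_band": "high"}]
    + [{"zip": "100", "salary_band": "low"}]
    + [{"zip": "100", "salary_band": "high"}] * 3
)

T = 0.2  # assumed threshold for the example
distances = t_closeness_distances(sample, ["zip"], "salary_band")
violating = {key: d for key, d in distances.items() if d > T}
print(distances)
```

Overall the two salary bands are split evenly, but each group is split 75/25, giving a distance of 0.25 for both groups. Both pass a distinct l-diversity check with l = 2 and both fail t-closeness at t = 0.2.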

How the three relate

The three checks form a ladder, each closing a gap the previous one leaves open:

  • k-anonymity stops an individual from being singled out.
  • l-diversity stops a uniform group from leaking its sensitive value.
  • t-closeness stops a skewed-but-diverse group from leaking a telling value.

A dataset that clears all three at sensible thresholds has a defensible de-identification claim. Validation should be re-run whenever the masking rules, the source schema, or the set of quasi-identifiers changes.

Reading the output

Each finding carries the quasi-identifier set under test, the threshold configured (the target k, l, or t), the measured value, and the identifier of the offending group within the sample. The l-diversity-violation and t-closeness-violation findings additionally carry the sensitive attribute and the distribution that failed the check.
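
To make the shape of a finding concrete, the fields listed above can be modeled as a small record. This is a hypothetical model for illustration only: the class name, field names, and example values are assumptions and do not reflect Vulkro's actual output schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    """Hypothetical model of one de-identification finding."""
    check: str                     # e.g. "l-diversity-violation"
    quasi_identifiers: tuple       # quasi-identifier set under test
    threshold: float               # configured target k, l, or t
    measured: float                # value measured on the sample
    group_id: str                  # offending group within the sample
    # Only populated for l-diversity and t-closeness findings.
    sensitive_attribute: Optional[str] = None
    distribution: Optional[dict] = None

    def gap(self) -> float:
        """How far the measured value is from the configured threshold."""
        return abs(self.threshold - self.measured)

finding = Finding(
    check="l-diversity-violation",
    quasi_identifiers=("zip", "age"),
    threshold=3,
    measured=1,
    group_id="group-0042",  # made-up identifier
    sensitive_attribute="diagnosis",
    distribution={"flu": 5},
)
print(finding.gap())
```

Sorting findings by the gap between threshold and measured value is one way to triage the worst groups first.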

External references