About this site

Why does this site exist?

There is no standard format for state data breach reporting and it's difficult to compare information on major data breaches across states. I'm hoping this site will help researchers, journalists, and curious people quickly look up data breach information. Instead of routinely visiting a dozen state AG websites, you can just bookmark this one.

Where does the data come from?

Many states have laws requiring private entities, and sometimes state government entities, to report data breaches involving residents' personally identifying information (PII) to state authorities, usually attorneys general. Not all states have such laws, and the specifics (such as the threshold of affected residents above which a breach must be reported, what constitutes PII, and what information must be reported to authorities) vary considerably. Regardless, many state AGs (and other relevant authorities) maintain websites providing information about reported breaches, generally as required by law. The data on this site is gleaned from the following websites, as well as the HIPAA data breach database:

How is the data collected?

The data is collected using a technique commonly referred to as web "scraping." I've written code that automatically visits data breach websites with a web browser and parses their underlying HTML to extract relevant data about each breach. Because each site reports different data and formats that data differently, there is a separate script for each state. Some states, such as Indiana, Oklahoma, Vermont, and Wisconsin, have data breach websites but present data in a format that is difficult to parse automatically (actually, Vermont's site just seems to be broken). I have not written code to parse these sites yet, but hope to.
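To give a rough idea of what the parsing step can look like, here is a minimal Python sketch. It is not the code that runs this site: the table layout, the column names, and the names TableRowParser and parse_breach_table are all hypothetical, and it skips the step of fetching the page with a browser.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_breach_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical markup standing in for one state's breach listing page.
SAMPLE_HTML = """
<table>
  <tr><th>Organization</th><th>Date Reported</th></tr>
  <tr><td>Example Corp</td><td>01/15/2020</td></tr>
  <tr><td>Sample   Health LLC</td><td>02/03/2020</td></tr>
</table>
"""

records = parse_breach_table(SAMPLE_HTML)
```

In practice each state needs its own variation on this, since the column names and page structure differ from site to site.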

I hope to improve the richness of the data I pull out from each site, e.g. distinguishing when the name reported for a business or entity includes a "d.b.a." or parsing PDFs of data breach notifications for information such as the number of state residents whose data was subject to the breach. There is also likely some "dirty" data, either because there are bugs or oversights in my parsing scripts or because there are errors in the data as it appears on the states' sites. For instance, New Hampshire has some mis-typed dates.
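As an illustration of the "d.b.a." idea, here is one way such a split might be sketched in Python. This is not implemented on the site; the function name split_dba, the pattern, and the spellings it recognizes ("dba", "d.b.a.", "d/b/a", "doing business as") are assumptions, and real entries will certainly contain variants it misses.

```python
import re

# Assumed spellings of "doing business as"; real data may use others.
DBA_PATTERN = re.compile(
    r"\s*\b(?:d\.?b\.?a\.?|d/b/a|doing business as)\s+",
    re.IGNORECASE,
)


def split_dba(name):
    """Split 'Legal Name d.b.a. Trade Name' into (legal_name, trade_name).

    Returns (name, None) when no d.b.a. marker is found.
    """
    parts = DBA_PATTERN.split(name.strip(), maxsplit=1)
    legal = parts[0].strip(" ,")
    trade = parts[1].strip() if len(parts) == 2 and parts[1].strip() else None
    return legal, trade
```

For example, split_dba("Acme Holdings, Inc. d/b/a Acme Pharmacy") separates the legal name "Acme Holdings, Inc." from the trade name "Acme Pharmacy", while a name with no marker comes back unchanged with None as the trade name.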

What data is collected?

The data made available and the data I'm able to extract automatically differ by state. Below are the pieces of information I'm generally able to extract from each state, but that does not mean I'm able to extract them for each entry in the database. The data varies widely.

How often is the data updated?

In general, the script that updates the data will run once a week, on Monday nights, and the site will be refreshed shortly after. More frequent updates are possible.

Can I get this data in a CSV/XLS?

Yes, there's a link at the top of each table to download the results for your query as a CSV. XLS is not available for now, however.

API

Good news! There's a JSON API for the data on this site.

Available endpoints:

Querystring parameters:

Pagination headers:

Responses will include a Content-Range header with values in the following format:
entries 0-20/136, which indicates that you are viewing the first 20 of 136 total results for your query. At 20 results per page, page 2 starts at offset 20, and there are 7 pages in total (136 / 20, rounded up).
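If you're consuming the API from a script, the header can be unpacked along these lines. This is a sketch based only on the example value above: the function name parse_content_range is hypothetical, the page size of 20 is an assumption taken from that example, and the second number is treated as the offset where the next page starts.

```python
import math
import re

# Matches values such as "entries 0-20/136".
CONTENT_RANGE = re.compile(r"^entries (\d+)-(\d+)/(\d+)$")


def parse_content_range(header, page_size=20):
    """Unpack a Content-Range value into offsets and a page count.

    page_size is an assumption taken from the example in the text.
    """
    match = CONTENT_RANGE.match(header.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range value: {header!r}")
    start, end, total = map(int, match.groups())
    return {
        "start": start,
        "end": end,
        "total": total,
        # Offset of the next page, or None once the last result is reached.
        "next_offset": end if end < total else None,
        "total_pages": math.ceil(total / page_size),
    }


info = parse_content_range("entries 0-20/136")
```

Here info["next_offset"] is 20 and info["total_pages"] is 7, matching the description above.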