State data breach browser

Why does this site exist?

There is no standard format for state data breach reporting and it's difficult to compare information on major data breaches across states. I'm hoping this site will help researchers, journalists, and curious people quickly look up data breach information. Instead of routinely visiting a dozen state AG websites, you can just bookmark this one.

Where does the data come from?

Many states have laws requiring private entities, and sometimes state government entities, to report data breaches involving residents' personally identifiable information (PII) to state authorities, usually attorneys general. Not all states have such laws, and the specifics vary considerably: the threshold of affected residents above which a breach must be reported, what constitutes PII, and what information must be reported to authorities. Regardless, many state AGs (and other relevant authorities) maintain websites providing information about reported breaches, generally as required by law. The data on this site is gleaned from the websites listed below, as well as from the HIPAA data breach database:

California: https://oag.ca.gov/privacy/databreach/list
Delaware: https://attorneygeneral.delaware.gov/fraud/cpu/securitybreachnotification/database/
Hawaii: https://cca.hawaii.gov/ocp/notices/security-breach/
Iowa: https://www.iowaattorneygeneral.gov/for-consumers/security-breach-notifications
Massachusetts: https://www.mass.gov/lists/data-breach-reports
Maine: https://apps.web.maine.gov/online/aeviewer/ME/40/list.shtml
Maryland: https://www.marylandattorneygeneral.gov/Pages/IdentityTheft/breachnotices.aspx
Montana: https://dojmt.gov/consumer/databreach/
New Hampshire: https://www.doj.nh.gov/consumer/security-breaches/
New Jersey: https://www.cyber.nj.gov/threat-center/public-data-breaches/
North Dakota: https://attorneygeneral.nd.gov/consumer-resources/data-breach-notices
Oregon: https://justice.oregon.gov/consumer/DataBreach/
Texas: https://oag.my.site.com/datasecuritybreachreport/apex/DataSecurityReportsPage
Washington: https://www.atg.wa.gov/data-breach-notifications
Wisconsin: https://datcp.wi.gov/Pages/Programs_Services/DataBreaches.aspx
All other states: HIPAA database at https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf

How is the data collected?

The data is collected using a technique commonly referred to as web "scraping". I've written code that automatically visits data breach websites with a web browser and parses their underlying HTML to extract relevant data about each breach. Because each site reports different data and formats it differently, there is a separate script for each state. Some states, such as Indiana, Oklahoma, Vermont, and Wisconsin, have data breach websites but present data in a format that is difficult to parse automatically (actually, Vermont's site just seems to be broken). I have not written code to parse these sites yet, but hope to.
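To give a rough idea of what the parsing step looks like, here is a minimal sketch in Python. It is not the code that runs this site: it skips the web browser entirely and parses a made-up HTML table held in a string, using only the standard library. The names TableRowParser, parse_breach_table, and SAMPLE_HTML are invented for illustration, and real state pages are messier than this.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_breach_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample markup; no real state site is being quoted here.
SAMPLE_HTML = """
<table>
  <tr><th>Entity name</th><th>Reported date</th></tr>
  <tr><td>Example Corp</td><td>01/15/2020</td></tr>
  <tr><td>Sample LLC</td><td>02/03/2020</td></tr>
</table>
"""

records = parse_breach_table(SAMPLE_HTML)
```

Each state would need its own variation on this, since the column names and page layouts differ from site to site.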

I hope to improve the richness of the data I pull out for each site, e.g. distinguishing when the name reported for a business or entity includes a "d.b.a." or parsing PDFs of data breach notifications for information such as the number of state residents whose data was subject to the breach. There is also likely some "dirty" data, either because there are bugs or oversights in my parsing scripts or because there are errors in the data as it appears on the states' sites. For instance, New Hampshire has some mistyped dates.
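If you work with the data yourself, one simple way to catch mistyped dates is to try parsing each one and set aside the ones that fail. The sketch below assumes dates arrive as MM/DD/YYYY strings, which matches the API examples on this page but may not hold for every entry. The function name parse_reported_date is made up for this example.

```python
from datetime import datetime


def parse_reported_date(text):
    """Return a date for an MM/DD/YYYY string, or None if it does not parse."""
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except ValueError:
        # Impossible months or days, or stray characters, end up here.
        return None


good = parse_reported_date("01/15/2020")
bad = parse_reported_date("13/45/2020")
```

Returning None instead of raising makes it easy to collect the bad entries for a manual look rather than stopping at the first one.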

What data is collected?

The data made available and the data I'm able to extract automatically differ by state. Below are the pieces of information I'm generally able to extract from each state, but that does not mean I'm able to extract them for each entry in the database. The data varies widely.

AL: Entity name, State, Reported date, # affected, Type of breach, Data source
AK: Entity name, State, Reported date, # affected, Type of breach, Data source
AZ: Entity name, State, Reported date, # affected, Type of breach, Data source
AR: Entity name, State, Reported date, # affected, Type of breach, Data source
CA: Entity name, Breach dates, Reported date, # affected, Type of breach, Data source
CO: Entity name, State, Reported date, # affected, Type of breach, Data source
CT: Entity name, State, Reported date, # affected, Type of breach, Data source
DE: Entity name, Start date, End date, Breach dates, Reported date, # affected, Type of breach, Data source
DC: Entity name, State, Reported date, # affected, Type of breach, Data source
FL: Entity name, State, Reported date, # affected, Type of breach, Data source
GA: Entity name, State, Reported date, # affected, Type of breach, Data source
HI: Entity name, Reported date, # affected, Type of breach, Data source, Notification
ID: Entity name, State, Reported date, # affected, Type of breach, Data source
IL: Entity name, State, Reported date, # affected, Type of breach, Data source
IN: Entity name, State, Reported date, # affected, Type of breach, Data source
IA: Entity name, Reported date, # affected, Type of breach, Data source, Notification
KS: Entity name, State, Reported date, # affected, Type of breach, Data source
KY: Entity name, State, Reported date, # affected, Type of breach, Data source
LA: Entity name, State, Reported date, # affected, Type of breach, Data source
MA: Entity name, Reported date, # affected, Data accessed, Type of breach, Data source
ME: Entity name, Reported date, # affected, Type of breach, Data source, URL
MD: Entity name, Reported date, # affected, Data accessed, Type of breach, Data source
MI: Entity name, State, Reported date, # affected, Type of breach, Data source
MN: Entity name, State, Reported date, # affected, Type of breach, Data source
MS: Entity name, State, Reported date, # affected, Type of breach, Data source
MO: Entity name, State, Reported date, # affected, Type of breach, Data source
MT: Entity name, Notification, Start date, End date, Reported date, # affected, Type of breach, Data source
NE: Entity name, State, Reported date, # affected, Type of breach, Data source
NV: Entity name, State, Reported date, # affected, Type of breach, Data source
NH: Entity name, Reported date, # affected, Type of breach, Data source, URL
NJ: Entity name, Reported date, # affected, Type of breach, Data source, URL
NM: Entity name, State, Reported date, # affected, Type of breach, Data source
NY: Entity name, State, Reported date, # affected, Type of breach, Data source
NC: Entity name, State, Reported date, # affected, Type of breach, Data source
ND: Entity name, d.b.a., Notification, Start date, End date, Breach dates, Reported date, # affected, Type of breach, Data source
OH: Entity name, State, Reported date, # affected, Type of breach, Data source
OK: Entity name, State, Reported date, # affected, Type of breach, Data source
OR: Entity name, Start date, End date, Breach dates, Reported date, # affected, Type of breach, Data source
PA: Entity name, State, Reported date, # affected, Type of breach, Data source
RI: Entity name, State, Reported date, # affected, Type of breach, Data source
SC: Entity name, State, Reported date, # affected, Type of breach, Data source
SD: Entity name, State, Reported date, # affected, Type of breach, Data source
TN: Entity name, State, Reported date, # affected, Type of breach, Data source
TX: Entity name, Biz address, Biz city, Biz state, Biz ZIP, Publish date, # affected, Data accessed, Type of breach, Data source, Notices given
UT: Entity name, State, Reported date, # affected, Type of breach, Data source
VT: Entity name, State, Reported date, # affected, Type of breach, Data source
VA: Entity name, State, Reported date, # affected, Type of breach, Data source
WA: Entity name, Start date, Reported date, # affected, Data accessed, Type of breach, Data source, Notification
WV: Entity name, State, Reported date, # affected, Type of breach, Data source
WI: Entity name, Breach dates, Reported date, # affected, Data accessed, Type of breach, Data source
WY: Entity name, State, Reported date, # affected, Type of breach, Data source

How often is the data updated?

In general, the script that updates the data runs once a week, on Monday nights, and the site is refreshed shortly after. More frequent updates are possible.

Can I get this data in a CSV/XLS?

Yes, there's a link at the top of each table to download the results for your query as a CSV. XLS is not available for now, however.

API

Good news! There's a JSON API for the data on this site.

Available endpoints:

/api/: Data for all states, corresponding to the home page.
/api/states/:code: Data for a single state, where :code is a two-letter state code. E.g.: /api/states/TX

Querystring parameters:

sort: accepts the name of a column, e.g. /api/?sort=number_affected. Options: state, entity_name, dba, business_address, business_city, business_state, business_zip, start_date, end_date, breach_dates, reported_date, number_affected, data_accessed, notice_methods, published_date, breach_type, data_source, letter_url, url
desc: for use with sort. If present, results will be sorted in descending order. E.g.: /api/states/WA?sort=number_affected&desc
Filters: a column name (one of state, entity_name, dba, business_address, business_city, business_state, business_zip, start_date, end_date, breach_dates, reported_date, number_affected, data_accessed, notice_methods, published_date, breach_type, data_source, letter_url, url), an operator (one of eq, like, gt, gte, lt, lte), and a value to filter for. Filters for a column can be combined with [AND] or [OR]. E.g.: /api/?state=eq:WA&reported_date=gte:01/01/2020[AND]lte:12/31/2020 will return entries where the state is Washington, and the reported date is between January 1 and December 31, 2020 (inclusive).
exclude: columns to exclude from the returned results, separated by commas. Any of: state, entity_name, dba, business_address, business_city, business_state, business_zip, start_date, end_date, breach_dates, reported_date, number_affected, data_accessed, notice_methods, published_date, breach_type, data_source, letter_url, url. E.g.: /api/?exclude=business_zip,dba
limit: number of results to display per page, e.g.: /api/states/OR?limit=25&sort=number_affected&desc
offset: the result to start with. To be used with limit for pagination. This is zero-based, so the first result is at offset 0 and the twenty-first result is at offset 20. E.g.: /api/states/OR?limit=25&offset=25&sort=number_affected&desc
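
The parameters above can be combined into a request URL with a small helper. This is only a sketch in Python: BASE_URL is a placeholder rather than this site's real address, and build_url, condition, all_of, and any_of are names made up for the example. It leaves ":", "/", "[", "]", and "," unescaped so the output looks like the examples above. Whether the API also accepts percent-encoded values is an assumption I have not checked here.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org"  # placeholder, not the site's real address
OPERATORS = {"eq", "like", "gt", "gte", "lt", "lte"}


def condition(operator, value):
    """Format one filter condition, e.g. condition("eq", "WA") gives "eq:WA"."""
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return f"{operator}:{value}"


def all_of(*conditions):
    """Join conditions on a single column with [AND]."""
    return "[AND]".join(conditions)


def any_of(*conditions):
    """Join conditions on a single column with [OR]."""
    return "[OR]".join(conditions)


def build_url(state=None, filters=None, exclude=None,
              limit=None, offset=None, sort=None, desc=False):
    """Build an API URL; `filters` maps a column name to a condition string."""
    path = "/api/" if state is None else f"/api/states/{state.upper()}"
    params = dict(filters or {})
    if exclude:
        params["exclude"] = ",".join(exclude)
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    if sort:
        params["sort"] = sort
    query = urlencode(params, safe=":/[],")
    if sort and desc:
        query += "&desc"  # desc is a bare flag and only applies with sort
    return BASE_URL + path + ("?" + query if query else "")


washington_2020 = build_url(filters={
    "state": condition("eq", "WA"),
    "reported_date": all_of(condition("gte", "01/01/2020"),
                            condition("lte", "12/31/2020")),
})
oregon_page_2 = build_url(state="or", limit=25, offset=25,
                          sort="number_affected", desc=True)
```

Here washington_2020 ends in /api/?state=eq:WA&reported_date=gte:01/01/2020[AND]lte:12/31/2020 and oregon_page_2 ends in /api/states/OR?limit=25&offset=25&sort=number_affected&desc, matching the examples in the list above.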

Pagination headers:

Responses will include a Content-Range header with values in the following format:
entries 0-20/136, which indicates that you are viewing the first 20 out of a total of 136 results for your query. This means that page 2 will start at offset 20, and there will be 7 pages total.
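
For anyone paging through results in code, the header can be read with a short parser. The Python sketch below assumes the header always has the exact "entries first-last/total" shape shown above. The function names are made up for the example.

```python
import math
import re


def parse_content_range(header):
    """Split a header such as "entries 0-20/136" into (first, last, total)."""
    match = re.fullmatch(r"entries (\d+)-(\d+)/(\d+)", header.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range value: {header!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


def page_count(total, limit):
    """Number of pages needed to show `total` results, `limit` per page."""
    return math.ceil(total / limit)


def page_offsets(total, limit):
    """The offset value to request for each page, in order."""
    return list(range(0, total, limit))


first, last, total = parse_content_range("entries 0-20/136")
pages = page_count(total, limit=20)
offsets = page_offsets(total, limit=20)
```

With the example header, pages is 7 and offsets runs from 0 to 120 in steps of 20, which agrees with the page count described above.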