Guest blog by Rob Volkert
In 2018 there were reportedly 1,244 data breaches totaling over 446 million exposed records, primarily targeting the business and health care sectors. Cyber security systems may be growing more sophisticated, but so too are the attacks designed to collect personal data. For those of us who conduct open source intelligence (OSINT) investigations, however, there may be a silver lining to breach data that is not frequently discussed. To see it, it is worth taking a step back to understand what exactly breach data is and what we can do with it.
What is Breach Data?
“Breach Data” is data that is made publicly available by individuals or entities that perpetrate data breaches. While the act of the breach itself is illegal, obtaining and using the data after it has been leaked is both lawful and very useful in OSINT investigations. Almost every major US or multinational company has been, or will be, hacked. Most company breaches expose only usernames or email addresses and passwords, likely obtained by unsophisticated hacking crews scanning the Internet for unprotected servers left open to external connections. Other data breaches can include personally identifiable information (PII) such as real name, phone number, email address, credit card, national identity documents (driver’s license, passport, national ID), and other protected information. These more sensitive breaches are typically carried out by more sophisticated actors and often come from hacked government sites or protected data servers; however, the resulting data sets are generally just as accessible as those from routine business breaches. Hacked data sets are routinely uploaded and offered to the public (some free and others paid) on various paste or file storage sites, and to more limited audiences via dark web forums or marketplaces.
Breach data can be essential in OSINT investigations for discovering new leads and confirming existing data about a target. It also supports other services, including credit monitoring and compliance checks on how customer data is protected. It is also important for privacy reasons to understand the scope of personal data spillage. When thinking about the value of breach data, it is critical to think beyond the number of records (quantity) and give equal or greater weight to the uniqueness of the data (quality). Is the data duplicative (large compilations of the same company breaches), or does it contain more selective and sensitive information such as government records or private medical history?
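One rough way to put a number on the quantity-versus-quality question is to measure how much of a new data set is not already present in one you hold. The following is only a minimal sketch: it assumes each data set is a sequence of one-record-per-line strings, and the function names (normalize, uniqueness_ratio) are hypothetical, not part of any tool mentioned in this article.

```python
from typing import Iterable


def normalize(record: str) -> str:
    """Lowercase and strip whitespace so trivially different copies compare equal."""
    return record.strip().lower()


def uniqueness_ratio(new_records: Iterable[str], known_records: Iterable[str]) -> float:
    """Return the fraction of distinct records in new_records absent from known_records.

    Blank lines are ignored. A result near 0.0 suggests the new set is mostly a
    recompilation of data already held; a result near 1.0 suggests new material.
    """
    known = {normalize(r) for r in known_records if r.strip()}
    new = {normalize(r) for r in new_records if r.strip()}
    if not new:
        return 0.0
    return len(new - known) / len(new)
```

Both sets are held in memory here, which is fine for modest files; very large compilations would need a disk-backed or probabilistic approach instead.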
Where can I find information about new breach data sets and historical ones?
While numerous websites and services offer public search (free and paid) against breach data they have processed and host themselves, this article focuses specifically on how to find the original breach source files. So where should we look for information about new and historical breaches that can get us started?
DataBreaches.net – https://www.databreaches.net/
An excellent resource for identifying new data sets that are not going to be as widely publicized as the major company leaks like Experian or Yahoo. This site does not provide direct URL links to the breached data itself, and not all of these leaks are being disseminated by the hackers for public consumption. However, the site does provide an excellent picture of which company sectors are being targeted and the degree of PII that is being stolen (and thereby what might currently be most valuable to hackers or criminal organizations).

Reddit – https://www.reddit.com/r/pwned/
Twitter – https://twitter.com/haveibeenpwned
These are two of the most active feeds posting information about new breach leaks. The pwned subreddit contains a lot of links back to Raidforums.com (see below for more information on this site) but also has a good mix of media articles and paste sites where breaches are discussed. I like the Twitter feed for “Have I Been Pwned” (the account of the website of the same name, run by Troy Hunt) mainly because it addresses only new breaches, summarizes them, and often includes a media article for additional context. Note that haveibeenpwned.com (and the Twitter account) does not provide links to the original breach files themselves.
Nuclear Leaks – https://nuclearleaks.com/
This site has a great list of historical breaches, with the original date of each breach, the number of records compromised, and the hashing algorithm used (very helpful for knowing whether you are getting a file of plaintext passwords or a list of hashed ones, for example). This site does not allow downloads of the breach files themselves and instead claims its primary purpose is to raise awareness about database breaches. For privacy enthusiasts, it also includes a column indicating whether the breached company has actually acknowledged the breach (and presumably notified customers).
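When a listing does not state the hashing algorithm, the shape of a value can still hint at whether a field is hashed. The sketch below is a heuristic only and rests on assumptions: it relies on the standard hex digest lengths (32, 40, 64, and 128 characters) and the usual bcrypt string layout, and the name guess_format is hypothetical. Length alone cannot tell MD5 from other 128-bit hashes, so the labels say "-like".

```python
import re

# Hex digest lengths of common hash functions (heuristic labels only).
_HEX_LENGTHS = {
    32: "md5-like (128-bit)",
    40: "sha1-like (160-bit)",
    64: "sha256-like (256-bit)",
    128: "sha512-like (512-bit)",
}
_HEX_RE = re.compile(r"[0-9a-fA-F]+")
# bcrypt strings: "$2a$", "$2b$" or "$2y$", a two-digit cost, then 53 characters.
_BCRYPT_RE = re.compile(r"\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}")


def guess_format(value: str) -> str:
    """Guess whether a field value looks like a known hash format.

    Returns "unknown" for anything that does not match, which includes
    plaintext as well as salted or custom formats.
    """
    value = value.strip()
    if _BCRYPT_RE.fullmatch(value):
        return "bcrypt"
    if _HEX_RE.fullmatch(value):
        return _HEX_LENGTHS.get(len(value), "unknown")
    return "unknown"
```

Running this over a small sample of a column is usually enough to tell a hashed field from a plaintext one before deciding how to process the file.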

Where can I find the raw breach data after I read or heard about it?
Breach data files are typically stored and downloaded directly from cloud file hosting services (mega.nz), pastebins, and torrents (TorrentDownloads). While it is possible to run generic web crawler searches to find these links, several curated sites collect links or feature forums where the breach perpetrators themselves provide links to the downloads:
RaidForums – https://raidforums.com/
This site claims to have 257 databases with over 4 billion records uploaded to a private server, all available after unlocking a download link with credits that can be obtained through forum activity or purchased for a small fee. There are thousands more ‘unverified’ database leaks and downloads on the other forums on this site. However, the 257 are considered “official databases”, meaning they have been checked for legitimacy. While this site is not hosted on the dark web (and does not insist on anonymous payment), it is also not as trusted as a surface web site like Databases.today below. In my experience, surface web sites tend to verify, archive, and offer (ideally) clean files but carry only the more common database leaks. RaidForums will have a larger set of newer and unverified leaks; however, for security reasons I would only recommend downloading these files into a sandboxed environment inside a Virtual Machine.

Databases.today – https://databases.today/
This site claims it has “the biggest free-to-download collection of publicly available website databases for security researchers and journalists.” The site contains 1385 databases (totaling 73 gigabytes) of the most popular leaks over recent years (LinkedIn, Ashley Madison, Yahoo, DropBox, MySpace, etc.). However, it does not specify the exact date of the breach itself and instead shows when the file was “modified” on the site (likely the upload date). Note that this site is maintained by the owner of Snusbase (https://snusbase.com), which offers a paid search and API capability against what appears to be all the databases from Databases.today.

Dark Web Forums and Marketplaces
There are numerous dark web forums and marketplaces (DDoSecrets and the recently-seized Dream Market are two examples) that offer or sell breached data sets, although searching for these is more difficult than searching surface or deep web sites. Most, if not all, sellers on these forums require obfuscated payment (Bitcoin) and are often offering much more sensitive data (foreign government official records, foreign citizen data, etc.). We should be particularly aware that these data files might contain malicious content and may be sold directly by the actor who perpetrated the breach.
Further, I would like to emphasize here the distinction between obtaining public data from a surface or deep web compilation site and paying an actor on a dark web forum. Directly paying an actor for breach data on the dark web is illegal: it would be construed as encouraging them to commit a crime, which makes you complicit.
How do I safely extract and search the data?
Extracting data is not without risks, but two strategies are crucial to minimizing them: using good operational security when collecting the data, and employing a capability that can search the data, parse it, or both.
Security: when downloading any data, but particularly breached data since we often don’t know the exact origin of these files, several layers of protection should be employed. Use a Virtual Machine (VirtualBox, VMware), a privacy-conscious browser (Firefox, Tor), and a reputable VPN service, and download files to a sandboxed environment (this could be the VM desktop itself or software such as Sandboxie) or directly into cloud storage. It is possible to collect and parse the data entirely in a sandboxed environment and then run future secure searches against this clean data.
Searching: it is essential to understand and have access to tools for basic search (GNU Grep or Sift in Linux), basic parsing (MySQL Workbench), or both (AWS Elasticsearch) in order to search for text within the data. Since breach data sets are typically formatted differently from one another, some structured and many unstructured, it quickly becomes apparent that more advanced processing is needed. If the data is processed correctly, it is possible to create a comprehensive database with multiple data sets and advanced search or analytic options (discovering trends in the data or running repeatable tailored searches). This is why there are now so many sites with paid plans for running searches on combined breach data sets: the data has already been processed for the user and is presented in an easy-to-use GUI. There is obviously an operational security trade-off, however, since your search terms are exposed to a third party.
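For readers without Grep or Sift at hand, the basic search step can be approximated in a few lines. This is a minimal sketch under stated assumptions, not a replacement for those tools: it assumes the files are plain text, that a case-insensitive substring match is sufficient, and that the directory layout is whatever you extracted into your sandbox. The name search_files is hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_files(root: Path, term: str) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file path, line number, line) for every line containing term.

    Matching is a case-insensitive substring test. Files are read one line
    at a time, so large files do not need to fit in memory. Undecodable
    bytes are replaced rather than raising an error.
    """
    needle = term.lower()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if needle in line.lower():
                    yield path, line_number, line.rstrip("\n")
```

A call such as `for path, number, line in search_files(Path("extracted"), "example.com"): print(path, number, line)` walks every file under a hypothetical "extracted" directory. A plain scan like this is slow on large collections, which is exactly the point at which indexing the data in a database or search engine becomes worthwhile.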
What can I legally do with breached data? What can’t I do?
What can I do? There are many legal and ethical uses for breach data, including serving as a critical source for OSINT investigations. There are several recommendations for responsibly using this information:
- Enable your own investigative efforts or OSINT tradecraft. Use the data for pursuing new investigative leads such as additional social media accounts or email addresses, confirming existing data such as an association with a government entity, finding patterns in passwords or usernames, and other methodology or research interests. There is much less grey area here than in, for example, running a site that charges users for access to the data.
- Support only current clients and services. Notifying prospective individuals or companies, even if your intentions are genuine, is likely to backfire and is not a good habit to get into. Most people will not understand how you obtained the data, will likely be suspicious of what you are trying to do, and could misinterpret your actions as blackmail.
- Understand your company or unit’s policies for collecting, securing, storing, and using this data. Policies on data collection and retention will vary by company or government agency (state or federal) and are influenced by many factors that go beyond US law. If a company is based in the EU, for example, GDPR comes into play regarding PII possession. Reputation management is another strong concern since the general public is often uncomfortable with companies or government agencies collecting citizens’ personal information (even when it is publicly available).
- If you are storing this data (particularly in a cloud environment), it is in your interest to take all necessary measures to comply with applicable data and privacy laws*. It is also in your interest (if possible) to store the information in an encrypted environment, use multi-factor authentication for access, only collect if necessary (there is a difference between private personal medical information on US citizens as opposed to emails and passwords from a social media breach), and only retain as long as necessary.
* Fair Credit Reporting Act, Children’s Online Privacy Protection Act, Health Insurance Portability and Accountability Act, Computer Fraud and Abuse Act
What can’t I do? While there is no one US law pertaining specifically to breach data, there are several guidelines responsible investigators and OSINT practitioners should abide by:
- Do not illegally profit from the breach data (i.e., do not use it to commit another crime).
- Do not sell breach data to third parties (there is a difference between enabling your own investigative capabilities and selling data for profit).
- Do not encourage or pay hostile actors to acquire this data, and do not induce others to do so.
- Report illegal data or activity to law enforcement. There are certain reporting obligations when accessing or using breach data (this bullet mainly applies to US citizens; please check your country’s local laws if outside of that jurisdiction) and this includes any files that contain: child pornography, illegal interception of communications, stolen or unauthorized US government-issued ID, and any data containing U.S. classified information*.
* Child Pornography Possession, Electronic Communications Privacy Act, National Stolen Property Act, Stolen US Government ID – 18 U.S.C. § 1028, US Classified Information – 18 U.S.C. § 798
Being responsible when acquiring and using breached data is very important. Using these data sets is not as simple as “it’s publicly available” in the way that consenting social media or forum content exists in the public domain. If there is any doubt about the information source (whether you are enticing an actor to gather data, or something in the data itself), it is better to leave it alone. Chances are another site (like Have I Been Pwned) will assess the data and make it searchable anyway.
If you have any additional questions, thoughts or comments, please feel free to reach out to me.
Rob Volkert, diligencewatch@protonmail.com
