Many past social media breaches resulted from scraping. Most recently, a hacker scraped over 400 million records from Twitter. And it’s only a matter of time before another data breach occurs using the same technique.
In this blog, I’ll explain how hackers scraped those user records from Twitter and how to mitigate these attacks.
Data Scraping Approach
For those who don’t know, scraping publicly available web data is generally legal (and it’s done constantly). For example, Google scrapes sites when it caches data for its search engine.
Scraping hacks, however, target unprotected user information to build massive data sets. They require two components:
- A programmatic way to extract data
- Sufficient authorization
It’s just like scraping off old paint.
In the same way that you run a putty knife along a wall to remove paint, hackers will run their scripts against APIs to extract data.
So, to start, a hacker exercises the available data flows of an application. They’re looking for things like Broken Access Control, Misconfigurations, or Authentication Failures. If they find something that’s both sensitive and accessible with their credentials, then it’s time to scale up.
A scraping attack can be either targeted or brute force. Targeted attacks require a list of valid entries. For example, a targeted attacker might already know every workout registered on a fitness app. Brute force attacks, on the other hand, test many combinations to determine which are valid.
No matter the approach, the hacker programmatically extracts information from their victims. They’ll aggregate the extracted records into a list; these lists are the data files that ultimately go up for sale on hacking forums.
Explaining the Twitter Breach
Please keep in mind that this section simplifies what happened at Twitter. Only a few details of the breach are public, so my code uses the information we know.
Oh, and I called my app Birder since it’s a mini-Twitter.
Twitter’s Leaky Data Flows
Twitter’s woes began with the duplication check of their login process. Let’s dive into what that might’ve looked like.
Insecure Login Flow
In a typical login process, the server receives credentials and compares them against a database of users. The best practice is to log the user in if their ID and password match a record; otherwise, tell them the credentials are incorrect and to try again.
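As a point of comparison, here is a minimal sketch of that best practice. Everything in it is hypothetical (the `USERS` store, the `secure_login` name, the fixed demo salt); the point is only that a wrong password and an unknown account produce the same response.

```python
import hashlib
import hmac

_DEMO_SALT = b"demo-salt"  # assumption for illustration; real systems use a random salt per user


def _hash(password: str, salt: bytes) -> bytes:
    """Derive a password hash with PBKDF2 from the standard library."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Hypothetical in-memory user store keyed by email or phone number.
USERS = {
    "ada@example.com": {
        "account_id": 1001,
        "salt": _DEMO_SALT,
        "hash": _hash("correct horse", _DEMO_SALT),
    },
}


def secure_login(identifier: str, password: str) -> dict:
    """Return the same generic error for unknown accounts and wrong passwords."""
    generic_error = {"ok": False, "error": "Incorrect credentials. Please try again."}
    user = USERS.get(identifier)
    if user is None:
        # Hash anyway so both failure paths do a similar amount of work.
        _hash(password, _DEMO_SALT)
        return generic_error
    if hmac.compare_digest(_hash(password, user["salt"]), user["hash"]):
        return {"ok": True, "account_id": user["account_id"]}
    return generic_error
```

Because both failure paths return an identical body, a script probing this endpoint can’t tell a registered email from an unregistered one.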
Twitter’s login flow returned an account ID to the client if an email or phone number matched an existing user. Giving any potentially helpful feedback on a failed login is a massive security no-no because it helps hackers identify accounts, especially when the response is a valid account ID.
To mimic this, I created a login API with logic as follows:
- If the email or phone number doesn’t match an existing account, inform the client they provided incorrect credentials.
- If the email or phone number matches an account and the password is correct, log them in.
- Otherwise, provide them with the account ID associated with the email or phone number.
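The three rules above can be sketched roughly as follows. This is an illustration only: the `BIRDER_USERS` table, the `insecure_login` name, and the response shapes are assumptions, and passwords are stored in plaintext purely to keep the example short.

```python
# Hypothetical user table keyed by email or phone number.
# Plaintext passwords are a simplification for this sketch only.
BIRDER_USERS = {
    "ada@example.com": {"account_id": 1001, "password": "correct horse"},
    "+15550100": {"account_id": 1002, "password": "hunter2"},
}


def insecure_login(identifier: str, password: str) -> dict:
    """Mimic the leaky flow: a failed login on a real account returns its ID."""
    user = BIRDER_USERS.get(identifier)
    if user is None:
        # Rule 1: no matching account, so report incorrect credentials.
        return {"status": "error", "message": "Incorrect credentials"}
    if user["password"] == password:
        # Rule 2: identifier and password both match, so log the user in.
        return {"status": "ok", "message": "Logged in"}
    # Rule 3 (the flaw): the account exists, so its ID leaks to the caller.
    return {"status": "error", "account_id": user["account_id"]}
```

Calling `insecure_login("ada@example.com", "anything")` is enough to learn that the address is registered and which account ID it maps to.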
With Birder exposing the insecure login, I moved on to data enrichment.
Enriching User Data
Armed with a user ID list, hackers can search each account to enrich their data set.
The enrichment method used by the hacker hasn’t been publicly released, but there’s a fair chance they relied on one of Twitter’s public APIs for looking up users.
I built a simple API for Birder that receives an account ID and returns information on that user; I loosely based my API on Twitter’s developer documentation.
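A lookup like this can be very small. The sketch below is an assumption about its general shape, not the real Birder or Twitter API: the `PROFILES` table, the `lookup_user` name, and the field names are all made up for illustration.

```python
from typing import Optional

# Hypothetical profile table keyed by account ID (fake data).
PROFILES = {
    1001: {"screen_name": "Ada L.", "handle": "@ada", "phone": "+15550100"},
    1002: {"screen_name": "Grace T.", "handle": "@gturing", "phone": "+15550101"},
}


def lookup_user(account_id: int) -> Optional[dict]:
    """Return a copy of the profile for an account ID, or None if it doesn't exist."""
    profile = PROFILES.get(account_id)
    return dict(profile) if profile is not None else None
```

Note that nothing here checks who is asking; any caller holding a valid ID gets the full record, which is what makes the endpoint useful for enrichment.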
Once programmatic enrichment was available, it was time to scrape Birder.
Scraping User Accounts From Mini-Twitter
Author’s Note: I generated an entirely fake dataset for Birder. While some records are reminiscent of the real world, I assure you that any malicious attempts with this data will be fruitless.
Find Account IDs with Brute Force
For my brute force approach, I supplied my script with lists of common:
- First names
- Last names
- Email domains
- Email formats
With those parameters set, I attempted to log in to Birder with each unique combination. If the server returned an account ID, I saved it to a local file.
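To make the loop concrete, here is a rough sketch of the enumeration. It runs against an in-memory stand-in (`leaky_login`) rather than a network endpoint, and the sample lists, format strings, and output path are all hypothetical.

```python
import itertools
import json
import tempfile
from pathlib import Path

# Stand-in for a leaky login endpoint: returns an ID whenever the email exists.
REGISTERED = {"ada.lovelace@example.com": 1001, "gturing@example.org": 1002}


def leaky_login(email: str, password: str) -> dict:
    if email in REGISTERED:
        return {"account_id": REGISTERED[email]}
    return {"error": "Incorrect credentials"}


def enumerate_accounts(first_names, last_names, domains, formats):
    """Try every first/last/domain/format combination and keep the hits."""
    found, attempts = {}, 0
    for first, last, domain, fmt in itertools.product(
        first_names, last_names, domains, formats
    ):
        local_part = fmt.format(first=first, last=last, f=first[0])
        email = f"{local_part}@{domain}"
        attempts += 1
        response = leaky_login(email, "not-the-password")
        if "account_id" in response:
            found[email] = response["account_id"]
    return found, attempts


found, attempts = enumerate_accounts(
    first_names=["ada", "grace"],
    last_names=["lovelace", "turing"],
    domains=["example.com", "example.org"],
    formats=["{first}.{last}", "{f}{last}"],
)

# Save the discovered IDs to a local file, as described above.
out_file = Path(tempfile.gettempdir()) / "birder_account_ids.json"
out_file.write_text(json.dumps(sorted(found.values())))
```

With two entries per list, the loop makes 2 × 2 × 2 × 2 = 16 attempts; the attempt count is the product of the list sizes, which is why it grows so quickly.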
The script I wrote for this step is compute-heavy. It has a Big O time complexity of O(n⁴), which is what the industry lovingly refers to as “inefficient.”
With a more extensive data set, testing every combination would take an absurdly long time. Twitter’s security operations center would likely notice someone trying every possible email. And if security didn’t detect it, cloud billing would see trillions of additional queries charged to their AWS account.
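A quick back-of-the-envelope calculation shows the scale. The numbers below are assumptions chosen for illustration (1,000 entries in each of the four lists, and a sustained 1,000 login attempts per second), not figures from the breach.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def total_attempts(*list_sizes: int) -> int:
    """Number of combinations is the product of the list sizes."""
    return math.prod(list_sizes)


def years_to_finish(attempts: int, requests_per_second: float) -> float:
    """Time to try every combination at a constant request rate."""
    return attempts / requests_per_second / SECONDS_PER_YEAR


# Hypothetical sizes: 1,000 first names, last names, domains, and formats.
attempts = total_attempts(1000, 1000, 1000, 1000)
print(f"{attempts:,} attempts")
print(f"{years_to_finish(attempts, 1000):.1f} years at 1,000 requests/second")
```

Under those assumptions the script would need a trillion login attempts, which works out to more than three decades of nonstop requests.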
Seasoned hackers, however, may already have lists of known email addresses or phone numbers; with those, they could avoid the brute force approach and instead go for a targeted attack.
Generate User Profiles with Account IDs
Enriching user profiles is straightforward with the other Birder API.
My second script received the list of account IDs as a parameter. For each ID, Birder returned the data associated with that account. In the spirit of mimicking the Twitter exploit, some of the included fields are:
- Screen name
- Birder handle
- Phone number
My script saved all of these enriched accounts to a new file. That generated list would either be posted for sale (as seen in this breach) or used maliciously in some other manner.
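The enrichment step can be sketched in a few lines. As before, this runs against an in-memory stand-in with fake data; `lookup_user`, `enrich`, the field names, and the output path are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Optional

# Stand-in for the user lookup API (fake data).
PROFILES = {
    1001: {"screen_name": "Ada L.", "handle": "@ada", "phone": "+15550100"},
    1002: {"screen_name": "Grace T.", "handle": "@gturing", "phone": "+15550101"},
}


def lookup_user(account_id: int) -> Optional[dict]:
    return PROFILES.get(account_id)


def enrich(account_ids: Iterable[int], out_path: Path) -> int:
    """Look up each ID and write one JSON record per line; return the record count."""
    written = 0
    with out_path.open("w", encoding="utf-8") as out:
        for account_id in account_ids:
            profile = lookup_user(account_id)
            if profile is None:
                continue  # skip IDs the lookup doesn't recognize
            out.write(json.dumps({"account_id": account_id, **profile}) + "\n")
            written += 1
    return written


out_path = Path(tempfile.gettempdir()) / "birder_enriched.jsonl"
count = enrich([1001, 1002, 9999], out_path)
```

Unlike the brute force step, this one is linear: one lookup per account ID, so its cost scales with the number of IDs already collected.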
How to Prevent Scraping
Most companies hosting PII declare that scraping is against their terms of service. The thing about hackers is: they don’t care about terms of service.
So how do we prevent scraping?
The ideal solution is never to expose sensitive data through scrapable interfaces. But this is easier said than done.
Companies should implement secure-by-design processes and remain extra-diligent when code touches sensitive data (like login credentials). Seasoned developers and application security experts are incredible resources for identifying possible threats.
Another option is to implement rate limiting, which prevents one user/device/account from spamming an API. For example, Twitter’s user API rate limit is “900 requests per 15-minute window per each authenticated user”.
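For illustration, here is a minimal fixed-window rate limiter using that 900-requests-per-15-minutes figure as its default. It’s a sketch under simple assumptions (a single process, in-memory counters, a hypothetical `FixedWindowRateLimiter` class), not how Twitter implements its limits.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per key within each fixed time window."""

    def __init__(self, limit=900, window_seconds=15 * 60, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self._windows = {}  # key -> (window_start, request_count)

    def allow(self, key: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            self._windows[key] = (start, count)
            return False
        self._windows[key] = (start, count + 1)
        return True


limiter = FixedWindowRateLimiter()
if not limiter.allow("user-1001"):
    print("429 Too Many Requests")
```

The key passed to `allow` could be a user, device, or account identifier. Each key gets its own counter, so the limit applies per key rather than across all traffic.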
While rate limiting slows attacks, it doesn’t fix the issue. Advanced hackers use many devices to distribute the attack. Just as attackers use compromised machines for DDoS attacks, they could call upon their available devices to carry out a scraping effort.
There are other possible ways to slow or prevent scraping. I won’t dive into them here, but the takeaway is: it’s complicated.
Often, things that stop scraping also hurt usability. And, when you’re a social media platform, lower usability directly impacts your bottom line.
In a world of frequent code changes, data flows and APIs constantly morph. If you want to see me execute this hack, watch the video here.