Gutting a Phish

In the news lately there have been countless examples of phishing attacks becoming more sophisticated, but it’s important to remember that the entire “industry” is a bell curve: the most dedicated attackers are upping their game, but advancements in tooling and automation are also letting less sophisticated players get started more easily than ever. Put another way, spamming and phishing coexist happily as both massive multinational business organizations and smaller cottage-industry efforts.

One such enterprising but misguided individual made the mistake of sending a typically blatant phishing email to one of our Neohapsis mailing lists, and someone forwarded it along to me for a laugh.

Initial Phish Email

The phishing email, as it appeared in a mailbox

As silly and evident as this is, one thing I’m constantly astounded by is that the proportion of people who will click never quite drops to zero. Our work on social engineering assessments bears this out: with a large enough sample set, you’ll always hook at least one. In fact, a paper out of Microsoft Research suggests that, for scammers, this sort of painfully blatant opening is actually an intentional tool: it acts as a filter that only the most gullible will pass.

Given the weak effort put into the email, I was curious to see if the scam got any better if someone actually clicked through. To be honest, I was pleasantly surprised.

Phish Site

The phishing site: a combination of legitimate Apple code and images and a form added by the attacker

The site is dressed up as a reasonable approximation of an official Apple site. In fact, a look at the source shows that there are two things going on here: some HTML/CSS set dressing and template code that is copied directly from the legitimate Apple site, and the phishing form itself which is a reusable template form created by one of the phishers.

Naturally, I was curious where the data went once the form was submitted. I filled in some bogus data and submitted it (the phishing form helpfully pointed out any missing data; there is a certain audacity in being asked to check the format of the credit card number that’s about to be stolen). The data POST went back to another page on the same server, which then quickly forwarded me on to the legitimate iTunes site.

The form submission and forward to the legitimate iTunes site, as captured in Burp

This is another standard technique: if a “login” appears to work because the victim was already logged in, the victim will often simply proceed with what they were doing without questioning why the login was prompted in the first place. During social engineering exercises at Neohapsis, we have seen participants repeatedly log into a cloned attack site, with mounting frustration, as they wonder why the legitimate site isn’t showing them the bait they logged in for.

Back to this phishing site: my application security tester spider senses were tingling, so I felt that I had to see what our phisher was doing with the data being submitted. To find out, I replayed the submit request with various types of invalid data, strings that should cause errors depending on how the data was being parsed or stored. Not a single test string produced any errors or different behavior. This could be an indication that any parsing and processing is being done carefully and correctly, but the far more likely case is that they’re simply doing no processing and dumping it all straight out as plain text.
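
For anyone curious what that replay looks like outside of Burp, here is a minimal sketch in Python using the requests library. The target URL and the form field names are hypothetical stand-ins, not the phisher’s actual ones:

```python
import requests

# Hypothetical stand-ins: the real host and field names came from the phisher's form.
TARGET = "http://compromised-host.example/Snd/Snd.php"

# Strings chosen to provoke errors if the back end parses, escapes, or stores input.
test_values = ["'", '"', "<script>alert(1)</script>", "A" * 5000, "%00", "0 OR 1=1"]

for value in test_values:
    data = {
        "apple_id": "test@example.com",  # hypothetical field names
        "password": value,
        "card_number": value,
        "cvv": value,
    }
    resp = requests.post(TARGET, data=data, allow_redirects=False, timeout=10)
    # A change in status code or response length would hint at server-side processing.
    print(f"{value[:20]!r:25} -> {resp.status_code} ({len(resp.content)} bytes)")
```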

Interesting… if harvested data is simply being dumped to disk, where exactly is it going? Burp indicates that the data is being POSTed to a harvester script at Snd/Snd.php. I wonder what else is in that directory?

directory listing

Under the hood of the phishing site, the loot stash is clearly visible

That result.txt file looks mighty promising… and it is.

result.txt

The format of the result.txt file

These are the raw results dumped from victims by the harvester script (Snd.php). The top entry is dummy data that I submitted, and when I checked, the file was entirely filled with the various dummy submissions I had made before. It’s pretty clear from the results that I was the first person to actually click through and submit data to the phish site; that’s actually pretty fortunate, because if a victim ever does enter legitimate information, the attacker will have to sort it out from a few hundred bogus submissions. Any day that we can make life harder for the bad guys is a good day.

So, the data collection is dead simple, but I’d still like to know a bit more about the scam and the phishers if possible. There’s not a lot to go on, but the tag at the top of each entry seems unique. It’s the sort of thing we’re used to seeing when hackers deface a website and leave a tag to publicize the work:

------------+| $ o H a B  Dz and a m i r TN |+------------

Googling some variations turned up a Google cache of a forum post that’s definitely related to the phishing site above; it’s either the same guy, or someone else using the same tool.

AppleFullz Forum post

A post in a carder forum, offering to sell data in the same format as generated by the phishing site above

A criminal using the name AppleFullz is selling complete information dumps of login details and credit card numbers plus CVV numbers (called “fullz” in carder forums), captured in the exact format that the Apple phish used, and even provides a sample of his wares (insult to injury for the victim: not only was his information stolen, it’s also being given away as the credit card fraud equivalent of the taster trays at the grocery store). This carder is asking $10 for one person’s information, but is willing to give bulk discounts: $30 for 5 accounts. That is actually cheap compared to the prices normally seen on carder forums; Krebs recently reported that Target cards were selling for $20-$100 per card. I read this as an implicit acknowledgement by our seller that this data is much “dirtier” and that buyers are expected to mine it for legitimate records. The tools being used here are a combination of pre-existing scraps of PHP code widely used in other spam and scam campaigns (the section labeled “|INFO|VBV|”) and a separate section added specifically to target Apple IDs.

Of particular interest is that the carder provided a Bitcoin address. For criminals, Bitcoin has the advantage of anonymity but the disadvantage that transactions are public. This means that we can actually look up how much money has flowed into that particular Bitcoin address.

blockchain

Ill-gotten gains: the Bitcoin blockchain records transfers into the account used for selling stolen Apple IDs and credit card numbers.

From November 17th, when the forum posting went up, until December 4th, when I investigated this phishing attempt, the seller received Bitcoin transfers totaling 0.81815987 BTC, which is around $744.53 (based on the BTC value on 12/4). According to his price sheet, that translates to a sale of between 74 and 124 records: not bad for a month of terribly unsophisticated phishing.
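
For the curious, a lookup like this is easy to script. The sketch below uses blockchain.info’s public per-address JSON endpoint; the address is a placeholder and the field names reflect my understanding of that API, so treat them as assumptions rather than gospel:

```python
import json
import urllib.request

# Placeholder: substitute the Bitcoin address from the carder's forum post.
ADDRESS = "1ExamplePlaceholderAddressxxxxxxxx"

# blockchain.info's public per-address endpoint; amounts are reported in satoshis.
url = f"https://blockchain.info/rawaddr/{ADDRESS}"
with urllib.request.urlopen(url) as resp:
    info = json.load(resp)

received_btc = info["total_received"] / 1e8  # satoshis -> BTC
print(f"Total received: {received_btc:.8f} BTC across {info['n_tx']} transactions")
```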

Within a few hours of investigating the initial phishing site, it had been removed. The actual server where the phish site was hosted was a legitimate domain that had been compromised; perhaps the phisher noticed the volume of bogus traffic and decided that the jig was up for that particular phish, or the system administrator got tipped off by the unusual traffic and investigated. Either way the phish site is offline, so that’s another small victory.

Repurposing Data…or, “You’re Going To Do WHAT?”

By Patrick Madden

Boston.com published an AP story reporting that Google is implementing an opt-in beta test to include users’ email in search results. In other words, when a user is logged in to Gmail, performing a plain old Google search will turn up emails in addition to web pages. Privacy concerns have been flagged and noted, and the existence of matching emails will be presented using a collapsed control off to the side of the main results. But will this really do the job? Google searches are shoulder-surfed all the time, so even disclosing the existence of a matching email can put people in hot water, depending on the search terms.

Google’s beta is another instance in a disturbing trend of repurposing user data in ways that weren’t intended or anticipated when the users provided their data. As the AP article reminds us, Google had previously ventured into this territory with Google Buzz before running into legal challenges. Facebook has regularly and unashamedly repurposed ancient postings, many of them ephemeral “status” updates, into easily browsed user timelines, though one could argue that this at least makes it clearer that the old posts do still exist.

In my AppSec work I regularly find “Insufficient Authorization” in sites and products I assess. These findings are generally framed from the app owner’s perspective and answer the question, “Can my user perform activities or access data in ways I don’t want?” When I look at repurposed user data, though, I see an exactly analogous situation in the opposite direction: from the user’s perspective, the question becomes, “Can my service provider use my data in ways I don’t want?”

[The answer, of course, is in the terms of service wherein the providers claim rights to use and disseminate data provided to them, even if they don’t claim ownership of the actual data. Apparently, then, there’s no basis to complain about any of this, and Google, Facebook, and others will simply do what they want. That’s slightly pessimistic, but it’s not that far from what we’ve seen so far.]

When a finding of “Insufficient Authorization” appears to be the result of intended functionality, application owners are asked to review their business requirements versus the vulnerability, identify alternative means of meeting the business requirements, or reconsider the functionality altogether. On the other hand, ordinary end users are terrible at doing all of these, if they even care in the first place. Who can state their personal “business requirements” for social media and other free services, who has gone to the trouble of identifying the risks inherent in sharing personal data with third parties (and quantifying the risks? Pfft!), and who’s willing to reconsider the use of a service they’ve invested themselves in substantially, even though no payment was involved?

Repurposing of user data happens because a provider thinks they’ve found a way to make more money, and because one-sided terms of service make few concessions allowing users to control how their own data gets used. At the same time, that repurposing should never increase user risk exposure by default, at least not in my own personal utopia. Maybe we should all be looking for and flagging “Insufficient Authorization” findings from the user’s perspective. And if the Gmail feature goes live in production with no option to disable it, perhaps I can use separate email and search providers to segregate functionality and mitigate risk.

Pass the iOS Privacy Salt – Hashing Does NOT Guarantee Privacy.

By Kate Pearce, Neohapsis & Neolabs

There has been a lot of concern and online chatter about iPhone/mobile applications and the private data that some send to various parties. It started with the discovery that Path was sending your entire address book to its servers, and it has since been revealed that other applications do the same thing. The other offenders include Facebook, Twitter, Instagram, Foursquare, Foodspotting, Yelp, and Gowalla. This corresponds nicely with some research I have been doing into device ID leakage on mobile devices, where I have seen the same leakages, excuses, and techniques as those discussed around the address book leaks.

I have observed a few posts discussing the issues and proposing solutions. These range from requiring iOS to request permission for address book access (as it does for location) to advising developers to hash sensitive data before sending it and to compare hashes server-side.

The first idea is a very good one; I see few reasons a device’s address book is any less sensitive than its location. The second, however, is only partial advice: if taken as given in Martin May’s post or Matt Gemmell’s arguments, it will not solve the privacy problems on its own. This is because (1) anonymised data isn’t anonymous, and (2) no matter what hashing algorithm you use, if the input material is sufficiently constrained you can compute, or precompute, all possible values.

Martin May’s two characteristics of a hash [link]:

  • Identical inputs will yield the same hash
  • It is virtually impossible to deduce the original input from a hash if a strong hashing algorithm is used.

This is because, of these two characteristics, the privacy implications of the first are not fully discussed, and the second is incorrect as stated.

 Hashing will not solve the privacy concerns because:

  • Hashing Data does not Guarantee Privacy (When the same data is input)
  • Hashing Data does not Guarantee Secrecy (When the input values are constrained)

The reasons, which those posts do not discuss, center on the fact that real-world input is constrained, not infinite. Telephone numbers are an extreme case of this, as I will discuss later.

A quick primer on hashing

Hashing is a destructive, theoretically one-way process where some data is taken and put through an algorithm to produce some output that is a shadow of the input. Like a shadow, the same output is always produced by the same input, but not the other way around. (Same car, same shadow).

A very simple example of a hashing function is the modulus (or remainder). For instance, the output of 3 mod 2 is the remainder when 3 is divided by 2, which is 1. The percent sign is commonly used in programming languages to denote this operation, so similarly:

1 % 3 is 1,   2 % 3 is 2,   3 % 3 is 0,   4 % 3 is 1,   5 % 3 is 2,   etc.

If you take some input, you get the same output every time from the same hashing function. The reason the hashing process is one-way is that it intentionally discards some data about the original. This results in what are called collisions, and we can see some in our earlier example using mod 3: 1 and 4 give the same hash, as do 2 and 5. The example given causes collisions roughly one time in three, but modern strong hashing functions are a great deal more complex than modulo 3. Even the “very broken” MD5 has collisions occur only about one time in every 2^24, or roughly 1 in 17,000,000.

A key point is that, with a hashing algorithm, for any output there are theoretically an infinite number of inputs that could have produced it, and thus hashing is a one-way, irreversible process.

A second key point is that a given input produces the same output every time. So, by checking whether the hashes of two items are the same, you can be pretty sure they come from the same source material.
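
To make those two points concrete, here is a small Python sketch (the choice of MD5 and the sample phone numbers are mine, purely for illustration):

```python
import hashlib

def md5_hex(value: str) -> str:
    """Hex MD5 digest of a string (MD5 chosen only for brevity)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Identical inputs -> identical hashes, every time.
print(md5_hex("+1 (234) 567-8910"))
print(md5_hex("+1 (234) 567-8910"))

# A different representation of the same number -> a completely different hash.
print(md5_hex("0012345678910"))

# The toy "mod 3" hash from above, showing collisions: 1 and 4 collide, as do 2 and 5.
print([n % 3 for n in (1, 2, 3, 4, 5)])
```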

Cooking Some Phone Number Hash(es)

(All calculations are approximate, if I’m not out by two orders of magnitude then…)

Phone numbers conform to a rather well known format, or set of formats. A modern GPU can run about 20 million hashes per second (2×10^7), or about 1.7 trillion (1.7×10^12) per day. So, how does this fit with possible phone numbers?

A pretty standard phone number is made up of 1-3 digits for a country code, 3 for an area code, and 7 for the local number, with perhaps 4 more for an extension.

So, we have the following range of numbers:

0000000000000-0000 to 9999999999999-0000

Or, 10^13 possible numbers… about six days’ work to compute all possible values (and a LOT of storage space…)

If we now represent it in a few other forms that may occur to programmers…

+001 (234) 567-8910, 0012345678910, 001-234-5678910, 0012345678910(US), 001(234)5678910

We have maybe 10-20 times that, or a few months of calculation…

But real-world phone numbers don’t fill all possible values. For instance, take a US phone number. It is made up of the country code, 3 digits for the area code, and 7 for the local number, with perhaps 4 more for an extension. But:

  • The country code is known.
  • The area code space is only about 35% used, since only ~350 values are in use.
  • The 7-digit local numbers are not completely allocated (let’s guess 80%).
  • Most numbers do not use extensions (let’s say 5% use them).

Now we only have 350 * (10,000,000 * 0.8) * 1.05, or about 2.94 billion combinations (2.94×10^9). That is only a little over two minutes of work on a modern GPU. Even allowing for different representations of the numbers, you could store all of that in a few gigabytes of RAM for instant lookup, or recalculate it every time and take longer. This is what is called a time-space tradeoff: the space of the memory versus the time to recalculate.
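
For anyone who wants to check the arithmetic, here it is spelled out, using the same rough 35%, 80%, and 5% guesses as above:

```python
# Rough guesses carried over from the text above.
area_codes_in_use = 350            # ~35% of the 1,000 possible area codes
local_numbers = 10_000_000 * 0.8   # assume ~80% of 7-digit numbers are assigned
extension_factor = 1.05            # assume ~5% of numbers carry an extension
hashes_per_second = 2e7            # ~20 million hashes per second on a modern GPU

combinations = area_codes_in_use * local_numbers * extension_factor
print(f"{combinations:.3g} combinations")                  # ~2.94e9
print(f"{combinations / hashes_per_second:.0f} seconds")   # ~147 s, a little over two minutes
```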

Anyway, the two takeaways for our discussion here regarding privacy are:

1. Every unique output value probably corresponds to a unique input value, so this hashing “anonymisation” still has privacy concerns.
Since the number of possible phone numbers is far smaller than the collision space of even a broken hashing algorithm, collisions are very unlikely.

2. Phone numbers can be reverse-computed from the raw hashes alone.
Because the input values are so constrained, it is possible either to brute-force the values or to build a reasonably sized rainbow table on a modern system.
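
To illustrate that second takeaway, here is a minimal sketch of the precompute-and-look-up attack. The number range is deliberately tiny so the example runs instantly; a real attacker would simply iterate the full ~2.94 billion value space estimated above:

```python
import hashlib

def md5_hex(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Precompute a lookup table over a (deliberately tiny) slice of the number space.
table = {}
for n in range(10_000):
    number = f"+1 (234) 555-{n:04d}"
    table[md5_hex(number)] = number

# A "captured" hash from a leaked data set can now be reversed instantly.
captured = md5_hex("+1 (234) 555-4242")
print(table.get(captured))  # -> +1 (234) 555-4242
```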

Hashing Does NOT Guarantee Privacy

Anonymising data by removing specific user-identifying information but leaving in unique identifiers does not work to assuage privacy concerns, because often the clues are in the data itself, or in linkages between pieces of data. AOL learned this the hard way when they released “anonymised” search data.

Furthermore, the network effect can reveal a lot about you: how many people you connect to, and how many they connect to, can be a powerful identifier of you. Not to mention that it can predict things like your career area and salary bracket (since more connections tend to mean richer).

For a good discussion of some of the privacy issues related to hashes see Matt Gemmell’s post, Hashing for Privacy in social apps.

Mobile apps also often send the device hardware identifier (which cannot be changed or removed) to servers and advertising networks. I have also observed the hash of this (or of the WiFi MAC address) being sent. This hardly accomplishes anything: anyone who knows the device ID can hash it and look for that, and anyone who knows the hash can look for it, just as with the phone numbers. The hash is every bit as unique to my device, and just as impossible to change.

Hashing Does NOT Equal Secrecy

As discussed under “Cooking Some Phone Number Hash(es)”, it is possible to work back from a hash to the input, since we know some of the constraints operating on phone numbers. Furthermore, even if we are not sure exactly how you are hashing data, we can simply put test data in and look for known hashes of it. If I know what 123456789 hashes to and I see that value in the output, then I know how your app is hashing phone numbers.
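
A sketch of that probing technique: plant a phone number you control, capture the opaque value the app sends, and compare it against digests from the common algorithms. The observed value below is just a placeholder:

```python
import hashlib

# The value we planted in our own address book before running the app.
planted = "1234567890"

# Placeholder: substitute whatever opaque value you actually captured on the wire.
observed = "replace-with-the-captured-value"

for algo in ("md5", "sha1", "sha256"):
    digest = hashlib.new(algo, planted.encode("utf-8")).hexdigest()
    if digest == observed:
        print(f"The app appears to hash phone numbers with unsalted {algo}")
```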

The Full Solution to Privacy and Secrecy: Salt

Both of these issues can be greatly helped by increasing the complexity of the input into the hash function. This can both remove the tendency for anonymised data to carry identical identifiers across instances, and also reduce the chance of it becoming feasible to reverse-calculate all possible values. Unfortunately there is no perfect solution to this if user-matching functionality comes first.

The correct solution, as it should be used to store passwords, is entry-specific salting (for example with bcrypt), but that is not feasible for a matching algorithm: it only works for comparing freshly hashed input to stored hashes, not for comparing stored hashes to other stored hashes.

However, if you as a developer are determined to make a server side matching service for your users, then you need to apply a hybrid approach. This is not good practice for highly sensitive information, but it should retain the functionality needed for server side matching.

Your first privacy step is to make sure your hashes do not match those collected or used by anyone else; do this by adding some constant secret to the input, a process called salting.

e.g., adding 9835476579080945368095468905486 to the start of every number before you hash

This will make all of your hashes different from those used by any other developer, but they will still compare properly: the same input will give the same output.

However, there is still a problem: if your secret salt is leaked or disclosed, the reversing attacks outlined earlier become possible. To avoid this, increase the complexity of the input by hashing more complex data. So, rather than just hashing the phone number, hash the name, email, and phone number together. This does introduce the problem that hashes will disagree if any part of the input differs because of a misspelling, typo, etc.
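
Putting the salt and the more complex input together, here is a rough sketch of what such a hybrid matching token might look like. The salt value, field layout, and normalization step are illustrative assumptions on my part, and the caveats above still apply if the salt leaks:

```python
import hashlib

# Application-wide secret salt (illustrative); keep it server-side, out of logs and client code.
APP_SALT = "9835476579080945368095468905486"

def contact_token(name: str, email: str, phone: str) -> str:
    """Deterministic token for server-side matching: a salted hash over several fields."""
    normalized = "|".join(part.strip().lower() for part in (name, email, phone))
    return hashlib.sha256((APP_SALT + normalized).encode("utf-8")).hexdigest()

# Same input -> same token, so server-side matching still works...
print(contact_token("Ada Lovelace", "ada@example.com", "+1 (234) 567-8910"))
# ...but a typo or different formatting breaks the match, as noted above.
print(contact_token("Ada Lovelace", "ada@example.com", "(234) 567-8910"))
```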

The best ways to protect your users’ data from disclosure, and your reputation from damage due to a privacy breach:

  • Don’t collect or send sensitive user data or hashes in the first place – using the security principle of least privilege.
  • Ask for access in a very obvious and unambiguous way – informed consent.

Hit me/us up on Twitter (@neohapsis or @secvalve) if you have any comments or discussion. (Especially if I made an error!)

[Update] Added author byline and clarified some wording.