Gutting a Phish

In the news lately there have been countless examples of phishing attacks becoming more sophisticated, but it’s important to remember that the entire “industry” is a bell curve: the most dedicated attackers are upping their game, but advancements in tooling and automation are also letting many less sophisticated players get started even more easily. Put another way, spamming and phishing are coexisting happily as both massive multinational business organizations and smaller cottage-industry efforts.

One such enterprising but misguided individual made the mistake of sending a typically blatant phishing email to one of our Neohapsis mailing lists, and someone forwarded it along to me for a laugh.

Initial Phish Email

The phishing email, as it appeared in a mailbox

As silly and evident as this is, one thing I’m constantly astounded by is how the proportion of people who will click never quite drops to zero. Our work on social engineering assessments bears out this real world example: with a large enough sample set, you’ll always hook at least one. In fact, a paper out of Microsoft Research suggests that, for scammers, this sort of painfully blatant opening is actually an intentional tool: it acts as a filter that only the most gullible will pass.

Given the weak effort put into the email, I was curious to see if the scam got any better if someone actually clicked through. To be honest, I was pleasantly surprised.

Phish Site

The phishing site: a combination of legitimate Apple code and images and a form added by the attacker

The site is dressed up as a reasonable approximation of an official Apple site. In fact, a look at the source shows that there are two things going on here: some HTML/CSS set dressing and template code that is copied directly from the legitimate Apple site, and the phishing form itself which is a reusable template form created by one of the phishers.

Naturally, I was curious where data went once the form was submitted. I filled in some bogus data and submitted it (the phishing form helpfully pointed out any missing data; there is certainly an audacity in being asked to check the format of the credit card number that’s about to be stolen). The data POST went back to another page on the same server, then quickly forwarded me on to the legitimate iTunes site.

Submit and Forward Burp -For Blog

This is another standard technique: if a “login” appears to work because the victim was already logged in, the victim will often simply proceed with what they were doing without questioning why the login was prompted in the first place. During social engineering exercises at Neohapsis, we have seen participants repeatedly log into a cloned attack site, with mounting frustration, as they wonder why the legitimate site isn’t showing them the bait they logged in for.

Back to this phishing site: my application security tester spider senses were tingling, so I felt that I had to see what our phisher was doing with the data being submitted. To find out, I replayed the submit request with various types of invalid data, strings that should cause errors depending on how the data was being parsed or stored. Not a single test string produced any errors or different behavior. This could be an indication that any parsing and processing is being done carefully and correctly, but the far more likely case is that they’re simply doing no processing and dumping it all straight out as plain text.

Interesting… if harvested data is simply being dumped to disk, where exactly is it going? Burp indicates that the data is being POSTed to a harvester script at Snd/Snd.php. I wonder what else is in that directory?

directory listing

Under the hood of the phishing site, the loot stash is clearly visible

That result.txt file looks mighty promising… and it is.

result.txt

The format of the result.txt file

These are the raw results dumped from victims by the harvester script (Snd.php). The top entry is dummy data that I submitted, and when I checked it, the file was entirely filled with the various dummy submissions I had done before. It’s pretty clear from the results that I was the first person to actually click through and submit data to the phish site; that is actually pretty fortunate, because if a victim did enter legitimate information, the attacker would have to sort it out from a few hundred bogus submissions. Any day that we can make life harder for the bad guys is a good day.

So, the data collection is dead simple, but I’d still like to know a bit more about the scam and the phishers if possible. There’s not a lot to go on, but the tag at the top of each entry seems unique. It’s the sort of thing we’re used to seeing when hackers deface a website and leave a tag to publicize the work:

------------+| $ o H a B  Dz and a m i r TN |+------------

Googling some variations turned up a Google cache of a forum post that’s definitely related to the phishing site above; it’s either the same guy, or someone else using the same tool.

AppleFullz Forum post

A post in a carder forum, offering to sell data in the same format as generated by the phishing site above

A criminal using the name AppleFullz is selling complete information dumps of login details and credit card numbers plus CVV numbers (called “fulls” in carder forums), captured in the exact format that the Apple phish used, and even provides a sample of his wares. (Insult to injury for the victim: not only was his information stolen, but it’s being given away as the credit card fraud equivalent of the taster trays at the grocery store.) This carder is asking $10 for one person’s information, but is willing to give bulk discounts: $30 for 5 accounts. (This is actually a discount compared with the sorts of prices normally seen on carder forums; Krebs recently reported that Target cards were selling for $20-$100 per card. I read this as an implicit acknowledgement by our seller that this data is much “dirtier” and that the seller is expecting buyers to mine it for legitimate data.) The tools being used here are a combination of some pre-existing scraps of PHP code widely used in other spam and scam campaigns (the section labeled “|INFO|VBV|”), and a separate section added specifically to target Apple IDs.

Of particular interest is that the carder provided a Bitcoin address. For criminals, Bitcoin has the advantage of anonymity but the disadvantage that transactions are public. This means that we can actually look up how much money has flowed into that particular Bitcoin address.

blockchain

Ill-gotten gains: the Bitcoin blockchain records transfers into the account used for selling stolen Apple IDs and credit card numbers.

From November 17, when the forum posting went up, until December 4, when I investigated this phishing attempt, he received Bitcoin transfers totaling 0.81815987 BTC, which is around $744.53 (based on the BTC value on 12/4). According to his price sheet, that translates to a sale of between 74 and 124 records: not bad for less than three weeks of terribly unsophisticated phishing.
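The 74-to-124 range follows directly from the two prices on the forum post. A minimal sketch of that arithmetic, assuming every buyer paid either the single-record price or the bulk price (the variable and function names are mine, not anything from the phisher's tooling):

```python
import math

# Figures quoted in the post.
usd_received = 744.53      # value of 0.81815987 BTC on 12/4
single_price = 10.0        # $10 for one person's information
bulk_price = 30.0 / 5      # $30 for 5 accounts, i.e. $6 each

def records_sold(total_usd, price_per_record):
    """Whole records that a given revenue could represent at one price."""
    return math.floor(total_usd / price_per_record)

fewest = records_sold(usd_received, single_price)  # everyone paid full price
most = records_sold(usd_received, bulk_price)      # everyone bought in bulk

print(fewest, most)  # 74 124
```

The real number of sales sits somewhere between the two, depending on the mix of single and bulk purchases.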

Within a few hours of investigating the initial phishing site, it had been removed. The actual server where the phish site was hosted was a legitimate domain that had been compromised; perhaps the phisher noticed the volume of bogus traffic and decided that the jig was up for that particular phish, or the system administrator got tipped off by the unusual traffic and investigated. Either way the phish site is offline, so that’s another small victory.

Pass the iOS Privacy Salt – Hashing Does NOT Guarantee Privacy.

By Kate Pearce, Neohapsis & Neolabs

There has been a lot of concern and online chatter about iPhone/mobile applications and the private data that some send to various parties. Starting with the discovery of Path sending your entire address book to their servers, it has since also been revealed that other applications do the same thing. The other offenders include Facebook, Twitter, Instagram, Foursquare, Foodspotting, Yelp, and Gowalla. This corresponds nicely with some research I have been doing into device ID leakage on mobile devices, where I have seen the same leakages, excuses, and techniques applied and abused as those discussed around the address book leakages.

I have observed a few posts discussing the issues and proposing solutions. These solutions range from requiring iOS to request permission for address book access (as it does for location) to advising developers to hash sensitive data that they send through and compare hashes server side.

The first idea is a very good one; I see few reasons a device’s geolocation is less sensitive than its address book. The second, however, is only partial advice: if taken as it is given in Martin May’s post or Matt Gemmell’s arguments, it will not solve the privacy problems on its own. This is because 1. anonymised data isn’t anonymous, and 2. no matter what hashing algorithm you use, if the input material is sufficiently constrained you can compute, or precompute, all possible values.

Martin May’s two characteristics of a hash [link] :

  • Identical inputs will yield the same hash
  • It is virtually impossible to deduce the original input from a hash if a strong hashing algorithm is used.

Of these two characteristics of a hash, the privacy implications of the first are not fully discussed, and the second is incorrect as stated.

Hashing will not solve the privacy concerns because:

  • Hashing Data does not Guarantee Privacy (When the same data is input)
  • Hashing Data does not Guarantee Secrecy (When the input values are constrained)

The reason for this, which those posts do not discuss, is that real-world input is constrained, not infinite. Telephone numbers are an extreme case of this, as I will discuss later.

A quick primer on hashing

Hashing is a destructive, theoretically one-way process where some data is taken and put through an algorithm to produce some output that is a shadow of the input. Like a shadow, the same output is always produced by the same input, but not the other way around. (Same car, same shadow).

A very simple example of a hashing function is the modulus (or remainder). For instance the output from 3 mod 2 is the remainder when 3 is divided by 2, or 1. The percent sign is commonly used in programming languages to denote this operation, so similarly

1 % 3 is 1, 2 % 3 is 2, 3 % 3 is 0, 4 % 3 is 1, 5 % 3 is 2, etc.

If you take some input, you get the same output every time from the same hashing function. The reason the hashing process is one way is that it intentionally discards some data about the original. This results in what are called collisions, and we can see some in our earlier example using mod 3: 1 and 4 give the same hash, as do 2 and 5. The example given will cause collisions approximately one time in 3; however, modern strong hashing functions are a great deal more complex than modulo 3. Even the “very broken” MD5 has collisions occur only one time in every 2^24 or 1 in ~17 000 000.
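The mod 3 example is small enough to run. This sketch groups the inputs 1 through 6 by their toy hash, so the collisions are visible as buckets holding more than one input (`toy_hash` is just an illustrative name):

```python
from collections import defaultdict

def toy_hash(n, modulus=3):
    """The remainder as a deliberately weak hashing function."""
    return n % modulus

# Group inputs by the hash they produce.
buckets = defaultdict(list)
for n in range(1, 7):
    buckets[toy_hash(n)].append(n)

for digest, inputs in sorted(buckets.items()):
    print(digest, inputs)
# 0 [3, 6]
# 1 [1, 4]
# 2 [2, 5]
```

Knowing only the hash 1, there is no way to tell whether the input was 1, 4, or any other number that leaves the same remainder.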

A key point is that, for any output of a hashing algorithm, there are theoretically an infinite number of inputs that can produce it, and thus hashing is a one-way, irreversible process.

A second key point is that any input gives the same output every time. So, by checking if the hashes of two items are the same you can be pretty sure they are from the same source material.
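The same property holds for a real hashing function. A minimal sketch using SHA-256 from Python's standard library; the choice of algorithm and the sample numbers are assumptions for illustration, not what any particular app uses:

```python
import hashlib

def digest(value):
    """SHA-256 of a string, as a hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# The same contact uploaded from two different address books.
from_alice = digest("0012345678910")
from_bob = digest("0012345678910")
someone_else = digest("0012345678911")

print(from_alice == from_bob)      # True
print(from_alice == someone_else)  # False
```

This is exactly what makes hash matching useful for finding friends, and also what lets anyone holding the hashes link the same contact across users.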

Cooking Some Phone Number Hash(es)

(All calculations are approximate, if I’m not out by two orders of magnitude then…)

Phone numbers conform to a rather well known format, or set of formats. A modern GPU can run about 20 million hashes per second (2*10^7), or 1.7 trillion (1.7*10^12) per day. So, how does this fit with possible phone numbers?

A pretty standard phone number is made up of 1-3 digits for a country code, 3 for the local code, and 7 for the number itself, with perhaps 4 for the extension.

So, we have the following range of numbers:

0000000000000-0000 to 9999999999999-0000

Or, 10^13 possible numbers… About 6 days’ work to compute all possible values (and a LOT of storage space…)

If we now represent it in a few other forms that may occur to programmers…

+001 (234) 567-8910, 0012345678910, 001-234-5678910, 0012345678910(US), 001(234)5678910

We have maybe 10-20 times that, or a few months of calculations…

But real-world phone numbers don’t fill all possible values. For instance, take a US phone number. It is also made up of the country code, 3 digits for the local code, and 7 for the number, with perhaps 4 for the extension. But:

  • The country code is known
  • The area code is only about 35% used since only 350 values are in use
  • The 7 digit codes are not completely full (let’s guess 80%)
  • Most numbers do not use extensions (let’s say 5% use them)

Now, we only have 350 * (10 000 000 * .8) * 1.05 or 2.94 billion combinations (2.94*10^9). That is only a little over two minutes on a modern GPU. Even allowing for different representations of numbers, you could store that in a few gigabytes of RAM for instant lookup, or recalculate every time and take longer. This is what is called a time-space tradeoff: you spend either the space of the memory or the time to recalculate.
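The US estimate can be checked in a few lines. This sketch reuses the figures above (350 area codes, 80% of seven-digit numbers in use, 5% extra for extensions, 20 million hashes per second); all of them are the rough guesses from the text, not measured values:

```python
HASHES_PER_SECOND = 2 * 10**7  # GPU rate assumed in the text

def seconds_to_exhaust(keyspace, rate=HASHES_PER_SECOND):
    """Time to hash every value in a keyspace at a fixed rate."""
    return keyspace / rate

area_codes = 350
subscriber_numbers = 10_000_000 * 0.8  # 7-digit numbers, guessed 80% full
extension_factor = 1.05                # 5% of numbers carry an extension

us_keyspace = area_codes * subscriber_numbers * extension_factor

print(f"{us_keyspace:.3g}")                    # 2.94e+09
print(round(seconds_to_exhaust(us_keyspace)))  # 147
```

147 seconds is the "little over two minutes" quoted above; changing any of the guesses by a factor of two still leaves the whole space within minutes of work.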

Anyway, the two takeaways for our discussion here regarding privacy are:

1. Every unique output value probably corresponds to a unique input value, so this hashing anonymisation still has privacy concerns.
Since there are far fewer possible phone numbers than possible outputs of even a broken hashing algorithm, there is probably little chance of collisions.

2. Phone numbers can be reverse computed from raw hashes alone.
Because of the known constraints on input values, it is possible either to brute force the original values or to build a reasonably sized rainbow table on a modern system.
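To make the second takeaway concrete, here is a scaled-down sketch of the brute-force approach. It assumes the app hashes the bare digit string with SHA-256 and that the attacker already knows all but the last four digits; both are assumptions chosen so the search finishes instantly, and the full search differs only in the size of the loop:

```python
import hashlib

def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical leaked value: the attacker sees only this hash.
leaked = sha256_hex("0012345678910")

def recover_number(target_hash, prefix, missing_digits):
    """Try every completion of prefix until one hashes to target_hash."""
    for n in range(10 ** missing_digits):
        candidate = f"{prefix}{n:0{missing_digits}d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None

print(recover_number(leaked, "001234567", 4))  # 0012345678910
```

Storing each candidate's hash in a dictionary instead of recomputing it turns the same loop into a precomputed lookup table.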

Hashing Does NOT Guarantee Privacy

Anonymising data by removing specific user identifying information but leaving in unique identifiers does not work to assuage privacy concerns. This is because often clues are in the data, or in linkages between the data. AOL learned this the hard way when they released “anonymised” search data.

Furthermore, the network effect can reveal a lot about you: how many people you connect to, and how many they connect to, can be a powerful identifier of you. It can also predict a lot of things, like your career area and salary point (since more connections tends to mean richer).

For a good discussion of some of the privacy issues related to hashes see Matt Gemmell’s post, Hashing for Privacy in social apps.

Mobile apps also often send the device hardware identifier (which cannot be changed or removed) to servers and advertising networks. And I have also observed the hash of this (or the WiFi MAC address) sent through. This hardly helps accomplish anything, as anyone who knows the device ID can hash it and look for that, and anyone who knows the hash can look for it, just as with the phone numbers. This hash is equally unique to my device, and unable to be changed.

Hashing Does not equal Secrecy

As discussed under “Cooking Some Phone Number Hash(es)”, it is possible to work back from a hash to the input since we know some of the constraints operating upon phone numbers. Furthermore, even if we are not sure exactly how you are hashing data, we can simply put test data in and look for known hashes of it. If I know what 123456789 hashes to and I see it in the output, then I know how your app is hashing phone numbers.
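That test-data trick can be sketched as a small lookup. The candidate algorithms are real `hashlib` algorithms, but the list of formatting variants and the "observed" traffic are invented for the example:

```python
import hashlib

TEST_NUMBER = "123456789"

# Candidate schemes to try; the formatting variants are guesses.
ALGORITHMS = ["md5", "sha1", "sha256"]
FORMATS = {
    "raw": lambda n: n,
    "plus-prefixed": lambda n: "+" + n,
}

def fingerprint_table(test_value):
    """Map each candidate hash of test_value to the scheme that made it."""
    table = {}
    for algo in ALGORITHMS:
        for fmt_name, fmt in FORMATS.items():
            data = fmt(test_value).encode("utf-8")
            table[hashlib.new(algo, data).hexdigest()] = (algo, fmt_name)
    return table

def identify_scheme(observed_hashes, test_value=TEST_NUMBER):
    """Return the scheme whose hash of test_value appears in the traffic."""
    table = fingerprint_table(test_value)
    for h in observed_hashes:
        if h in table:
            return table[h]
    return None

# Simulated traffic from a hypothetical app that SHA-1 hashes raw numbers.
observed = [
    hashlib.sha1(b"5550100").hexdigest(),
    hashlib.sha1(b"123456789").hexdigest(),
]
print(identify_scheme(observed))  # ('sha1', 'raw')
```

Once the scheme is known, the brute-force and lookup-table attacks apply unchanged.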

The Full Solution to Privacy and Secrecy: Salt

Both of these issues can be greatly helped by increasing the complexity of the input into the hash function. This can both remove the tendency for anonymised data to carry identical identifiers across instances, and also reduce the chance of it becoming feasible to reverse-calculate all possible values. Unfortunately there is no perfect solution to this if user-matching functionality comes first.

The correct solution for storing passwords, entry-specific salting (for example with bcrypt), is not feasible for a matching algorithm: it only works for comparing hashed input to stored hashes, and it will not work for comparing stored hashes to stored hashes.
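A short sketch of why per-entry salting defeats matching. It uses PBKDF2 from Python's standard library as a stand-in for bcrypt (an assumption made to keep the example dependency-free); the behaviour being shown is the same for any scheme with a random salt per entry:

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor

def store(value):
    """Hash a value with a fresh random salt, password-storage style."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", value.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify(value, record):
    """Check a plaintext value against one stored (salt, digest) record."""
    salt, digest = record
    candidate = hashlib.pbkdf2_hmac("sha256", value.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

a = store("0012345678910")
b = store("0012345678910")

print(verify("0012345678910", a))  # True: input against stored hash works
print(a[1] == b[1])                # False: stored against stored does not
```

The server can confirm a number it is handed in the clear, but two users who uploaded the same contact end up with unrelated digests.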

However, if you as a developer are determined to make a server side matching service for your users, then you need to apply a hybrid approach. This is not good practice for highly sensitive information, but it should retain the functionality needed for server side matching.

Your first privacy step is to make sure your hashes do not match those collected or used by anyone else. Do this by adding some constant secret to them, a process called salting.

e.g., adding 9835476579080945368095468905486 to the start of every number before you hash

This will make all of your hashes different to those used by any other developer, but will still compare them properly. The same input will give the same output.
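As a sketch, prefixing the constant from the example above before hashing looks like this. SHA-256 and the function names are assumptions; the point is only that matching still works while the output no longer matches an unsalted hash of the same number:

```python
import hashlib

# The constant secret from the example above; keep the real one private.
APP_SECRET = "9835476579080945368095468905486"

def plain_hash(phone):
    return hashlib.sha256(phone.encode("utf-8")).hexdigest()

def salted_hash(phone, secret=APP_SECRET):
    """Prefix the app-wide secret to the number before hashing."""
    return hashlib.sha256((secret + phone).encode("utf-8")).hexdigest()

same = salted_hash("0012345678910") == salted_hash("0012345678910")
differs = salted_hash("0012345678910") != plain_hash("0012345678910")

print(same)     # True: identical numbers still match each other
print(differs)  # True: but not anyone else's unsalted hashes
```

A keyed construction such as HMAC (Python's `hmac` module) is the more conventional way to mix a secret into a hash; the prefix form is shown because it is the one described here.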

However, there is still a problem: if your secret salt is leaked or disclosed, the reversing attacks outlined earlier become possible. To avoid this, increase the complexity of the input by hashing more complex data. So, rather than just hashing the phone number, hash the name, email, and phone number together. This does introduce the problem of causing hashes to disagree if any part of the input differs through misspellings, typos, etc.
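A sketch of hashing several fields together, and of the mismatch problem it creates. The field separator, the lowercase-and-trim normalisation, and the placeholder secret are all my assumptions; normalisation softens the problem for case and whitespace but cannot help with a genuine typo:

```python
import hashlib

def record_hash(name, email, phone, secret="app-wide-secret"):
    """Hash name, email, and phone together after light normalisation."""
    normalized = "|".join(part.strip().lower() for part in (name, email, phone))
    return hashlib.sha256((secret + normalized).encode("utf-8")).hexdigest()

original = record_hash("Kate Pearce", "kate@example.com", "0012345678910")
recased = record_hash("kate pearce ", "Kate@Example.com", "0012345678910")
typo = record_hash("Kate Pearse", "kate@example.com", "0012345678910")

print(original == recased)  # True: case and spacing are normalised away
print(original == typo)     # False: one wrong letter breaks the match
```

The search space an attacker must cover is now names times emails times numbers, at the cost of missed matches whenever two address books disagree on any field.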

The best way to protect your users’ data from disclosure, and your reputation from damage due to a privacy breach:

  • Don’t collect or send sensitive user data or hashes in the first place – using the security principle of least privilege.
  • Ask for access in a very obvious and unambiguous way – informed consent.

Hit me/us up on twitter ( @neohapsis or @secvalve) if you have any comments or discussion. (Especially if I made an error!)

[Update] Added author byline and clarified some wording.

What Makes Up Facebook Data?

This is the first post in our Social Networking series.

My guess is that you would not simply give a person that knocked on your front door or approached you in the street most of the data Facebook collects in your profile. Facebook profile data consists of many things, including your birth date, email, physical address, current location, work history, education history and additional information you input for activities, interests and music (interestingly, much of this can be used for identity theft…). In addition to your profile data, any installed or authenticated Facebook applications have access to your wall posts and list of friends as well as any other data that is shared with “Everyone”.

As Facebook adds new features, the data included in your Facebook profile has probably crept to include other data such as uploaded pictures, application usage and history, and tags in posts or pictures. Facebook will always be looking for ways to collect more of your data, as YOU are their product. Your data, the data of your friends, and the data of everyone else on Facebook is where Facebook makes its profit and, as with most businesses, profits need to increase through expanding markets and giving access to their product.

The data collected by Facebook on you can also include cookie tracking by Facebook even when you are not explicitly on their website. Facebook heard much uproar from the user community when a security researcher in September 2011 [link] discovered Facebook was even tracking users that had gone as far as deactivating their accounts! Facebook could then track all web history even through web sites that are not related to Facebook activities in any way.

You do have the ability to limit data on Facebook and make sound decisions on what personal data you do decide to submit to Facebook (friends are another matter). Inherently, by using Facebook for the ‘free’ services, you are going to lose some control of the information you share with friends. There are a few important factors that you should think about in dealing with social media, and my next post will shine some light on who actually owns and regulates your data within Facebook; stay tuned, and feedback is always welcome.