Pass the iOS Privacy Salt – Hashing Does NOT Guarantee Privacy.

By Kate Pearce, Neohapsis & Neolabs

There has been a lot of concern and online chatter about iPhone/mobile applications and the private data that some send to various parties. Starting with the discovery of Path sending your entire address book to their servers, it has since also been revealed that other applications do the same thing. The other offenders include Facebook, Twitter, Instagram, Foursquare, Foodspotting, Yelp, and Gowalla. This corresponds nicely with some research I have been doing into device ID leakage on mobile devices, where I have seen the same leakages, excuses, and techniques applied and abused as those discussed around the address book leakages.

I have observed a few posts discussing these issues and proposing solutions. These solutions range from requiring iOS to request permission for address book access (as it does for location) to advising developers to hash sensitive data before sending it and compare the hashes server side.

The first idea is a very good one; I see few reasons a device's geolocation is less sensitive than its address book. The second, however, is only partial advice: if taken as given in Martin May's post, or in Matt Gemmell's arguments, it will not solve the privacy problems on its own. This is because 1. anonymised data isn't anonymous, and 2. no matter what hashing algorithm you use, if the input material is sufficiently constrained you can compute, or precompute, all possible values.

Martin May’s two characteristics of a hash [link] :

  • Identical inputs will yield the same hash
  • It is virtually impossible to deduce the original input from a hash if a strong hashing algorithm is used.

This is because, of these two characteristics, the privacy implications of the first are not fully discussed, and the second is incorrect as stated.

 Hashing will not solve the privacy concerns because:

  • Hashing Data does not Guarantee Privacy (When the same data is input)
  • Hashing Data does not Guarantee Secrecy (When the input values are constrained)

The reasons for this, which are not usually discussed, center on the fact that real-world input is constrained, not infinite. Telephone numbers are an extreme case of this, as I will discuss later.

A quick primer on hashing

Hashing is a destructive, theoretically one-way process where some data is taken and put through an algorithm to produce some output that is a shadow of the input. Like a shadow, the same output is always produced by the same input, but not the other way around. (Same car, same shadow).

A very simple example of a hashing function is the modulus (or remainder). For instance the output from 3 mod 2 is the remainder when 3 is divided by 2, or 1. The percent sign is commonly used in programming languages to denote this operation, so similarly

1 % 3 is 1, 2 % 3 is 2, 3 % 3 is 0, 4 % 3 is 1, 5 % 3 is 2, etc.

If you take some input, you get the same output every time from the same hashing function. The reason the hashing process is one way is because it intentionally discards some data about the original. This results in what are called collisions, and we can see some in our earlier example using mod 3: 1 and 4 give the same hash, as do 2 and 5. The example given will cause collisions approximately one time in three; however, modern strong hashing functions are a great deal more complex than modulo 3. Even the "very broken" MD5 has collisions occur only about one time in every 2^24, or 1 in ~17,000,000.
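
To make this concrete, here is a quick illustrative Python snippet (not from the original advice, purely a demonstration) contrasting the toy modulo-3 "hash" with a real one:

import hashlib

def toy_hash(n):
    return n % 3  # tiny output space, so collisions are everywhere

print(toy_hash(1), toy_hash(4))  # 1 1 -> collision
print(toy_hash(2), toy_hash(5))  # 2 2 -> collision

# A real hash still maps identical input to identical output...
print(hashlib.sha1(b"+1 (234) 567-8910").hexdigest())
print(hashlib.sha1(b"+1 (234) 567-8910").hexdigest())  # same digest both times
# ...which is exactly the property that lets hashed records be matched (and tracked).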

A key point is that, with a hashing algorithm for any output there are theoretically an infinite number of inputs that can give it and thus it is a one-way, irreversible, process.

A second key point is that any input gives the same output every time. So, by checking if the hashes of two items are the same you can be pretty sure they are from the same source material.

Cooking Some Phone Number Hash(es)

(All calculations are approximate, if I’m not out by two orders of magnitude then…)

Phone numbers conform to a rather well known format, or set of formats. A modern GPU can run about 20 million hashes per second (2*10^7), or roughly 1.7 trillion (1.7*10^12) per day. So, how does this fit with possible phone numbers?

A pretty standard phone number is made up of 1-3 digits for a country code, 3 for an area code, and 7 for the number itself, with perhaps 4 more for an extension.

So, we have the following range of numbers:

0000000000000-0000 to 9999999999999-0000

Or, 10^13 possible numbers… About six days' work to compute all possible values (and a LOT of storage space…)

If we now represent it in a few other forms that may occur to programmers…

+001 (234) 567-8910, 0012345678910, 001-234-5678910, 0012345678910(US), 001(234)5678910

We have maybe 10-20 times that, or a few months' calculations…

But, real world phone numbers don't fill all possible values. For instance, take a US phone number. It is made up of the country code, 3 digits for the area code, and 7 for the number, with perhaps 4 for the extension. But:

  • The country code is known (for the US it is 1)
  • The area code is only about 35% used since only 350 values are in use
  • The 7 digit codes are not completely full (let’s guess 80%)
  • Most numbers do not use extensions (let's say 5% use them)

Now, we only have 350 * (10,000,000 * 0.8) * 1.05, or 2.94 billion combinations (2.94*10^9). That is only a little over two minutes of hashing on a modern GPU. Even allowing for different representations of numbers, you could store that in a few gigabytes of RAM for instant lookup, or recalculate it every time and take longer. This is what is called a time-space tradeoff: spend the memory, or spend the time to recalculate.
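
To illustrate how little work this is, here is a rough sketch of the precomputation (the area codes, the sample size, and the single assumed number format are placeholders; a real table would cover all ~350 area codes and every formatting variant):

import hashlib

def candidate_numbers(area_codes, locals_to_try):
    """Yield US-style numbers in one assumed canonical format, e.g. '12125550100'."""
    for area in area_codes:
        for local in locals_to_try:
            yield "1%03d%07d" % (area, local)

# Build a reverse-lookup table (hash -> number): the "space" side of the tradeoff.
area_codes = [212, 213, 312, 415]      # a handful of the ~350 codes in use
table = {}
for number in candidate_numbers(area_codes, range(10000)):   # tiny slice of the 10^7 local numbers
    table[hashlib.sha1(number.encode()).hexdigest()] = number

# Any "anonymised" phone-number hash collected from an app can now be reversed by lookup.
leaked_hash = hashlib.sha1(b"12120005309").hexdigest()
print(table.get(leaked_hash))          # -> 12120005309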

Anyway, the two takeaways for our discussion here regarding privacy are:

1. Every unique output value probably corresponds to a unique input value, so this hashing "anonymisation" still has privacy concerns.
Since the number of possible phone numbers is far smaller than the collision bound of even a broken hashing algorithm, collisions are very unlikely in practice.

2. Phone numbers can be reverse-computed from raw hashes alone.
Because of the known constraints on input values, it is possible either to brute-force values or to build a reasonably sized rainbow table on a modern system.

Hashing Does NOT Guarantee Privacy

Anonymising data by removing specific user identifying information but leaving in unique identifiers does not work to assuage privacy concerns. This is because often clues are in the data, or in linkages between the data. AOL learned this the hard way when they released “anonymised” search data.

Furthermore, the network effect can reveal a lot about you: how many people you connect to, and how many they connect to, can be a powerful identifier of you. It can also predict a lot of things, like your career area and salary bracket (since more connections tends to mean richer).

For a good discussion of some of the privacy issues related to hashes see Matt Gemmell’s post, Hashing for Privacy in social apps.

Mobile apps also often send the device hardware identifier (which cannot be changed or removed) to servers and advertising networks. And I have also observed the hash of this (or the WiFi MAC address) sent through. This hardly helps accomplish anything, as anyone who knows the device ID can hash it and look for that, and anyone who knows the hash can look for it, just as with the phone numbers. This hash is equally unique to my device, and unable to be changed.

Hashing Does not equal Secrecy

As discussed under "Cooking Some Phone Number Hash(es)", it is possible to work back from a hash to the input, since we know some of the constraints operating upon phone numbers. Furthermore, even if we are not sure exactly how you are hashing data, we can simply put test data in and look for known hashes of it. If I know what 123456789 hashes to and I see it in the output, then I know how your app is hashing phone numbers.
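
That probing step is trivial to script; a purely illustrative example:

import hashlib

# Pretend this value was observed leaving the app over the wire.
observed = {hashlib.sha1(b"123456789").hexdigest()}

probe = b"123456789"   # a number we planted in our own address book
for algo in ("md5", "sha1", "sha256"):
    if hashlib.new(algo, probe).hexdigest() in observed:
        print("The app appears to hash phone numbers with", algo)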

The Full Solution to Privacy and Secrecy: Salt

Both of these issues can be greatly helped by increasing the complexity of the input into the hash function. This can both remove the tendency for anonymised data to carry identical identifiers across instances, and also reduce the chance of it becoming feasible to reverse-calculate all possible values. Unfortunately there is no perfect solution to this if user-matching functionality comes first.

The correct solution, as it should be used to store passwords, is per-entry salting (for example with bcrypt). That is not feasible for a matching algorithm, however: it only works for comparing hashed input against stored hashes, not for comparing stored hashes against stored hashes.

However, if you as a developer are determined to make a server side matching service for your users, then you need to apply a hybrid approach. This is not good practice for highly sensitive information, but it should retain the functionality needed for server side matching.

Your first privacy step is to make sure your hashes do not match those collected or used by anyone else. Do this by adding some constant secret to the input before you hash, a process called salting.

e.g., adding 9835476579080945368095468905486 to the start of every number before you hash

This will make all of your hashes different from those used by any other developer, but they will still compare properly: the same input will give the same output.

However, there is still a problem: if your secret salt is leaked or disclosed, the reversing attacks outlined earlier become possible. To avoid this, increase the complexity of the input by hashing more complex data. So, rather than just hashing the phone number, hash the name, email, and phone number together. This does introduce the problem of hashes disagreeing if any part of the input differs through a misspelling, typo, etc.
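
A minimal sketch of what that hybrid approach could look like (the salt value is the example above; the field layout and normalisation are illustrative assumptions, not a prescription):

import hashlib

APP_SALT = "9835476579080945368095468905486"   # constant, application-specific secret

def matching_token(name, email, phone):
    """Hash several fields together with an app-wide salt so tokens only
    match other tokens produced by this same application."""
    canonical = "|".join([APP_SALT, name.strip().lower(), email.strip().lower(), phone])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical contact details still give identical tokens, so server-side matching
# keeps working, but the tokens no longer line up with anyone else's phone-number
# hashes, and reversing them requires the salt plus the extra fields, not just
# the ~10^9 plausible phone numbers.
print(matching_token("Ada Lovelace", "ada@example.com", "12125550100"))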

The best way to protect your user’s data from disclosure, and your reputation from damage due to a privacy breach:

  • Don’t collect or send sensitive user data or hashes in the first place – using the security principle of least privilege.
  • Ask for access in a very obvious and unambiguous way – informed consent.

Hit me/us up on twitter ( @neohapsis or @secvalve) if you have any comments or discussion. (Especially if I made an error!)

[Update] Added author byline and clarified some wording.

“Researchers steal iPhone passwords in 6 minutes”…true…but not the whole story

By Patrick Toomey

Direct link to keychaindumper (for those that want to skip the article and get straight to the code)

So, a few weeks ago a wave of articles hit the usual sites about research that came out of the Fraunhofer Institute (yes, the MP3 folks) regarding some issues found in Apple's Keychain service.  The vast majority of the articles, while factually accurate, didn't quite present the full details of what the researchers found.  What the researchers actually found was more nuanced than what was reported.  But, before we get to what they actually found, let's bring everyone up to speed on Apple's keychain service.

Apple’s keychain service is a library/API provided by Apple that developers can use to store sensitive information on an iOS device “securely” (a similar service is provided in Mac OS X).  The idea is that instead of storing sensitive information in plaintext configuration files, developers can leverage the keychain service to have the operating system store sensitive information securely on their behalf.  We’ll get into what is meant by “securely” in a minute, but at a high level the keychain encrypts (using a unique per-device key that cannot be exported off of the device) data stored in the keychain database and attempts to restrict which applications can access the stored data.  Each application on an iOS device has a unique “application-identifier” that is cryptographically signed into the application before being submitted to the Apple App store.  The keychain service restricts which data an application can access based on this identifier.  By default, applications can only access data associated with their own application-identifier.  Apple realized this was a bit restrictive, so they also created another mechanism that can be used to share data between applications by using “keychain-access-groups”.  As an example, a developer could release two distinct applications (each with their own application-identifier) and assign each of them a shared access group.  When writing/reading data to the keychain a developer can specify which access group to use.  By default, when no access group is specified, the application will use the unique application-identifier as the access group (thus limiting access to the application itself).  Ok, so that should be all we need to know about the Keychain.  If you want to dig a little deeper Apple has a good doc here.

Ok, so we know the keychain is basically a protected storage facility that the iOS kernel delegates read/write privileges to based on the cryptographic signature of each application.  These cryptographic signatures are known as "entitlements" in iOS parlance.  Essentially, an application must have the correct entitlement to access a given item in the keychain.  So, the most obvious way to go about attacking the keychain is to figure out a way to sign fake entitlements into an application (ok, patching the kernel would be another way to go, but that is a topic for another day).  As an example, if we can sign our application with the "apple" access group then we would be able to access any keychain item stored using this access group.  Hmmm…well, it just so happens that we can do exactly that with the "ldid" tool that is available in the Cydia repositories once you Jailbreak your iOS device.  When a user Jailbreaks their phone, the portion of the kernel responsible for validating cryptographic signatures is patched so that any signature will validate. So, ldid basically allows you to sign an application using a bogus signature. But, because it is technically signed, a Jailbroken device will honor the signature as if it were from Apple itself.

Based on the above description, so long as we can determine all of the access groups that were used to store items in a user's keychain, we should be able to dump all of them, sign our own application to be a member of all of them using ldid, and then be allowed access to every single keychain item in a user's keychain.  So, how do we go about getting a list of all the access group entitlements we will need?  Well, the keychain is nothing more than a SQLite database stored in:

/var/Keychains/keychain-2.db

And, it turns out, the access group is stored with each item that is stored in the keychain database.  We can get a complete list of these groups with the following query:

SELECT DISTINCT agrp FROM genp
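
Scripted up, that step might look something like this (a sketch run against a copy of the database; it is not part of keychain_dumper itself):

import sqlite3

# Query a copy of the keychain database pulled from a jailbroken device.
conn = sqlite3.connect("keychain-2.db")
access_groups = [row[0] for row in conn.execute("SELECT DISTINCT agrp FROM genp")]
for group in access_groups:
    # Each group becomes one keychain-access-groups entry in the entitlements XML.
    print(group)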

Once we have a list of all the access groups we just need to create an XML file that contains all of these groups and then sign our own application with ldid.  So, I created a tool that does exactly that called keychain_dumper.  You can first get a properly formatted XML document with all the entitlements you will need by doing the following:

./keychain_dumper -e > /var/tmp/entitlements.xml

You can then sign all of these entitlements into keychain_dumper itself (please note the lack of a space between the flag and the path argument):

ldid -S/var/tmp/entitlements.xml keychain_dumper

After that, you can dump all of the entries within the keychain:

./keychain_dumper

If all of the above worked you will see numerous entries that look similar to the following:

Service: Dropbox
Account: remote
Entitlement Group: R96HGCUQ8V.*
Label: Generic
Field: data
Keychain Data: SenSiTive_PassWorD_Here

Ok, so what does any of this have to do with what was being reported on a few weeks ago?  We basically just showed that you can in fact dump all of the keychain items using a jailbroken iOS device.  Here is where the discussion is more nuanced than what was reported.  The steps we took above will only dump the entire keychain on devices that have no PIN set or are currently unlocked.  If you set a PIN on your device, lock the device, and rerun the above steps, you will find that some keychain data items are returned, while others are not.  You will find a number of entries now look like this:

Service: Dropbox
Account: remote
Entitlement Group: R96HGCUQ8V.*
Label: Generic
Field: data
Keychain Data: <Not Accessible>

This fundamental point was either glossed over or simply ignored in every single article I happened to come across (I'm sure at least one person will find the article that does mention this point :-)).  This is an important point, as it completely reframes the discussion.  The way it was reported, it looks like the point was to show how insecure iOS is.  In reality the point should have been to show how security is all about trading off various factors (security, convenience, etc).  This point was not lost on Apple, and the keychain allows developers to choose the appropriate level of security for their application.  Stealing a small section from the keychain document from Apple, they allow six levels of access for a given keychain item:

CFTypeRef kSecAttrAccessibleWhenUnlocked;
CFTypeRef kSecAttrAccessibleAfterFirstUnlock;
CFTypeRef kSecAttrAccessibleAlways;
CFTypeRef kSecAttrAccessibleWhenUnlockedThisDeviceOnly;
CFTypeRef kSecAttrAccessibleAfterFirstUnlockThisDeviceOnly;
CFTypeRef kSecAttrAccessibleAlwaysThisDeviceOnly;

The names are pretty self descriptive, but the main thing to focus in on is the “WhenUnlocked” accessibility constants.  If a developer chooses the “WhenUnlocked” constant then the keychain item is encrypted using a cryptographic key that is created using the user’s PIN as well as the per-device key mentioned above.  In other words, if a device is locked, the cryptographic key material does not exist on the phone to decrypt the related keychain item.  Thus, when the device is locked, keychain_dumper, despite having the correct entitlements, does not have the ability to access keychain items stored using the “WhenUnlocked” constant.  We won’t talk about the “ThisDeviceOnly” constant, but it is basically the most strict security constant available for a keychain item, as it prevents the items from being backed up through iTunes (see the Apple docs for more detail).

If a developer does not specify an accessibility constant, a keychain item will use “kSecAttrAccessibleWhenUnlocked”, which makes the item available only when the device is unlocked.  In other words, applications that store items in the keychain using the default security settings would not have been leaked using the approach used by Fraunhofer and/or keychain_dumper (I assume we are both just using the Keychain API as it is documented).  That said, quite a few items appear to be set with “kSecAttrAccessibleAlways”.  Such items include wireless access point passwords, MS Exchange passwords, etc.  So, what was Apple thinking; why does Apple let developers choose among all these options?  Well, let’s use some pretty typical use cases to think about it.  A user boots their phone and they expect their device to connect to their wireless access point without intervention.  I guess that requires that iOS be able to retrieve their access point’s password regardless of whether the device is locked or not.  How about MS Exchange?  Let’s say I lost my iPhone on the subway this morning.  Once I get to work I let the system administrator know and they proceed to initiate a remote wipe of my Exchange data.  Oh, right, my device would have to be able to login to the Exchange server, even when locked, for that to work.  So, Apple is left in the position of having to balance the privacy of user’s data with a number of use cases where less privacy is potentially worthwhile.  We can probably go through each keychain item and debate whether Apple chose the right accessibility constant for each service, but I think the main point still stands.

Wow…that turned out to be way longer than I thought it would be.  Anyway, if you want to grab the code for keychain_dumper to reproduce the above steps yourself you can grab the code on github.  I’ve included the source as well as a binary just in case you don’t want/have the developer tools on your machine.  Hopefully this tool will be useful for security professionals that are trying to evaluate whether an application has chosen the appropriate accessibility parameters during blackbox assessments. Oh, and if you want to read the original paper by Fraunhofer you can find that here.

Java Cisco Group Password Decrypter

By Patrick Toomey

For whatever reason I have found myself needing to "decrypt" Cisco VPN client group passwords throughout the years.  I say "decrypt", as the value is technically encrypted using 3DES, but the mechanism used to decrypt the value is largely obfuscative (the cryptographic key is included in the ciphertext).  As such, the encryption used is largely incidental, and any simplistic substitution cipher would have protected the group password equally well (think newspaper cryptogram).

I can't pinpoint all the root reasons, but it seems as though every couple of months I need a group password decrypted.  Sometimes I am simply moving a Cisco VPN profile from Windows to Linux.  Other times I've been on a client application assessment/penetration test where I've needed to gain access to the plaintext group password.  Regardless, I inevitably find myself googling for "cisco group password decrypter".  A few different results are returned.  The top result is always a link here.  This site has a simple web app that will decrypt the group password for you, though it occurs server-side.  Being paranoid by trade, I am always apprehensive about sending information to a third party if that information is considered sensitive (whether by me, our IT department, or a client).  I have no reason to think the referenced site is malicious, but it would not be in my best interest professionally not to be paranoid (read: security conscious), particularly with client information.  The referenced site has a link to the original source code, and one can feel free to download, audit, compile, and use the provided tool to perform all of the decryption client-side.  That said, the linked file depends on libgcrypt.  That is fine if I am sitting in Linux, but not as great if I am on some new Windows box (ok, maybe I'm one of the few who doesn't keep libgcrypt at the ready on my fresh Windows installs).  I've googled around to see if anything exists that is more portable.  I found a few things, including a link to a Java applet, but the developer seems to have lost the source code…  So, laziness won, and I decided it would be easier to spend 30 minutes writing my own cross-platform Java version than spend any more time on Google.

The code for the decrypter can be found on our github, here.  I am not a huge fan of Java GUI development, and thus leveraged the incredible GUI toolkit built in to NetBeans.  The referenced source code should compile cleanly in NetBeans if you want the GUI.  If you simply want to decrypt group passwords with no dependencies you can run a command line version by compiling the “GroupPasswordDecrypter” class file.  This file has zero dependencies on third-party libraries and should be sufficient for anyone that doesn’t feel compelled to use a GUI (me included).

As a quick example, I borrowed a sample encrypted group password from another server-side implementation.  The encrypted group password we would like to decrypt is

9196FE0075E359E6A2486905A1EFAE9A11D652B2C588EF3FBA15574237302B74C194EC7D0DD16645CB534D94CE85FEC4

Feeding this into the GUI, or into the command line version if you prefer, recovers the plaintext group password.

Hope it comes in handy!

Even if You Don’t Invent Your Own Crypto….It’s Still Hard

By: Patrick Toomey

So, yet another crypto related post.  Often times crypto related attacks fall on deaf ears, as for most people crypto is like magic (include me in that group every now and again), and it becomes difficult for the average developer to differentiate the risks associated with the attacks circulated within the security community.  There is a vast chasm between the crypto community and just about everyone else.  The cryptographers will talk about differential cryptanalysis, meet in the middle attacks, perfect forward secrecy, etc, etc.  Often, the cryptographers’ goals are different than your average developer.  Cryptographers are trying to achieve perfection.  For them, any flaw in a cryptographic protocol is something to be discussed, researched, and fixed.  As is a famous saying in the crypto community, “Attacks only get better”.  So, even the smallest flaw is worth their time and effort, as it may lead to a practical crack that will warrant more immediate attention.

Unlike cryptographers, for the vast majority of developers, the goal is “good enough” security.  Often times the result is not good enough, but the goal is admirable, as it doesn’t make sense to over invest in security.  As a result, there is a chasm between “good enough” for cryptographers and “good enough” for developers.  In this chasm are the real world attacks that are too boring for your average cryptographer to discuss, yet too subtle as to be obvious to your average developer.  Many of these attacks have nothing to do with the cryptography itself.  Instead, many real world crypto attacks have more to do with how developers piece together proven crypto building blocks insecurely.  It was on a recent assessment that I came upon the following story that perfectly illustrates how cryptography is difficult to get right, even when you think you have made an informed decision.

On a recent assessment a colleague of mine came across a typical SQL injection attack.  In fact, it was just the kind of SQL injection you hope for as an attacker, as it returned data to the user in a nice tabular output format.  After a bit of mucking around we managed to find where the system was storing credit card information for its users.  A few minutes later we had output that looked like the following:

[Image: Base64 Credit Card Numbers]

Hmmm, base64 encoded credit card numbers.  I wonder what they look like when we base64 decode them.  A simple base64 decode resulted in the following hex values:

[Image: Base64 Decoded Credit Card Numbers]

If you start trying to translate the above hex values to their ASCII equivalent you will quickly find that the values do not represent plaintext ASCII credit card numbers.  Our first guess was that they were likely encrypted.  So, step two is trying to figure out how they were encrypted.  We took a look at the base64 decoded values and observed a simple fact: the vast majority of them were either 15 or 16 bytes long.  Well, a 16 byte long ciphertext could be any 128 bit symmetric algorithm in simple ECB mode.  But, the 15 byte entries were evidence against that, as they would get padded out to 16 bytes due to the requirement that symmetric algorithms encrypt a full block (16 bytes in the case of AES).  So, my first thought was the use of a stream cipher.  In their most basic form, a stream cipher simply takes a secret value as input (the key), and begins to generate a pseudorandom output stream of bytes.  Encryption simply involves XORing the stream of random bytes with your plaintext.  Unlike a block cipher, there is no need to pad out your plaintext, as you only use as many pseudorandom bytes as is necessary to XOR with your plaintext.  Decryption is also simple, as XORing is a trivially invertible operation.  The user simply feeds the algorithm the same key, generates the same pseudorandom byte stream, and XORs that with the ciphertext.
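
To make that symmetry concrete, here is a toy Python stream cipher (the keystream construction is a stand-in for demonstration; it is not the algorithm the application actually used):

import hashlib
from itertools import count

def keystream(key):
    """Toy keystream: SHA-256(key || counter), yielded one byte at a time."""
    for block in count():
        for b in hashlib.sha256(key + block.to_bytes(8, "big")).digest():
            yield b

def xor_with_stream(data, key):
    return bytes(d ^ k for d, k in zip(data, keystream(key)))

pt = b"4012888888881881"
ct = xor_with_stream(pt, b"secret key")      # encrypt: plaintext XOR keystream
print(ct.hex())
print(xor_with_stream(ct, b"secret key"))    # decrypt: the same XOR undoes it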

To see why this works, let's take a look at a single plaintext/ciphertext pair. Let us say we wish to encrypt the credit card number "4012888888881881". The ASCII values in hex result in a plaintext array of the following values:

[Image: Plaintext Credit Card Number ASCII in Hex]

Also, let us assume that we are using a stream cipher with a strong key whose first 16 output bytes are:

[Image: Pseudorandom Byte Stream in Hex]

When we XOR the plaintext with the pseudorandom stream we obtain the ciphertext.

[Image: Resulting Ciphertext]

To decrypt we simply take the ciphertext and perform the same XOR operation.  Let’s quickly take our ciphertext and see how we decrypt it.

[Image: Decrypted Ciphertext]

When we XOR the ciphertext with the stream we obtain the plaintext we started with.  Through the whole encryption/decryption process we have effectively XORed the plaintext with the stream twice.  XORing anything with the same value twice is effectively a NOP on the original value.  Pictorially:

[Image: Full Encryption Decryption Cycle]

However, our examples so far have assumed knowledge of the pseudorandom stream.  Without knowledge of the key it is difficult to reproduce the pseudorandom stream, and thus very difficult to perform the steps we just outlined to obtain the plaintext for a given ciphertext.  Moreover, we don’t even know what algorithm is in use here.  Did the developers use RC4?  Did they try to make their own pseudorandom generator by repetitively hashing a secret key?  We didn’t really know, but  regardless of the algorithm, there is inherent risk with leveraging a stream cipher in its most basic form.  If the stream cipher uses a shared key, the pseudorandom stream will be identical every time it is initialized.  To test this conjecture we used several test user accounts  to register several credit cards.  Sure enough, identical credit card numbers resulted in identical ciphertext.  So, what can we do with this?  We have no idea what algorithm was used, and we have no idea of what the value of the secret key is.  However, it turns out that we don’t need any of that information.  To see why let’s first take a look at the hex of two encrypted credit cards we base64 decoded above.

[Image: First Encrypted Credit Card]

The above shows the 16 ciphertext bytes for one of the credit card numbers in the database.

[Image: Second Encrypted Credit Card]

The above shows the 16 ciphertext bytes for a second credit card number in the database.

Let’s assume that the ciphertext from the first encrypted credit card was ours, where we know the plaintext ASCII hex values for the card.

[Image: Plaintext Bytes for the First Credit Card]

Let's also assume that the second credit card represents a credit card of unknown origin (i.e. all we can observe is the ciphertext; we have no idea what the digits of the card are).  Similarly, without knowledge of the key, we have no idea how to generate the pseudorandom byte stream used to encrypt the cards.  But, what happens if we simply XOR these two ciphertexts together?

[Image: XOR of the Two Encrypted Credit Cards]

Well, that doesn’t really look like it got us anywhere, as we still don’t have anything that looks like an ASCII credit card number.  However, let’s take a look at what actually happened when we did the above operation.  We are confident that both credit card numbers were encrypted with the same pseudorandom byte stream (remember that encryption of the same credit card under different users resulted in the same ciphertext).  Furthermore, we know that the first credit card number is “4012888888881881”, as that is our own credit card number.  The key insight is that we have effectively XORed four different byte sequences together.  We have XORed the plaintext bytes of our credit card, the plaintext bytes of an unknown credit card, and the same pseudorandom sequence twice.

[Image: Encrypted Credit Card XOR Expanded]

In the above picture the unknown plaintext credit card number is represented with a series of “YY”s.  Similarly, the unknown pseudorandom stream is represented with a series of “XX”s.  However, despite the fact that we don’t know the value of the pseudorandom stream, we do know they are the same.  As a result, whenever you XOR anything with the same value twice you effectively create a NOP. So, the above three XOR operations can be simplified to the following:

[Image: Simplified XOR of Encrypted Credit Cards]

So, what we actually did by XORing the two ciphertexts is create a new value that is the XOR of the two plaintexts.  Wow, that is kind of cool.  We just got rid of the pseudorandom byte stream that is supposed to provide all the security without ever knowing the key.  But, what do we do now?  We still don’t have plaintext credit card numbers yet.  Ah, but we do!  You can think of what we now have as the unknown credit card number encrypted with our known credit card number as the pseudorandom stream.  In other words, we can simply XOR our result with our known credit card value again and we will obtain the resulting plaintext for the unknown credit card number.

[Image: Decrypted Credit Card]

Once you translate the resulting hex into the equivalent ASCII string you obtain "5362016689976147", a valid MasterCard number (obviously we have used test credit card numbers throughout this blog entry).  We can simply apply the above steps to recover the full unencrypted credit card numbers for the entire database.  We have effectively used our own encrypted credit card number as a key to decrypt all of the other unknown credit cards in the database.  This is a pretty impressive feat for not knowing the algorithm and/or the secret key used to encrypt all of the credit card information.
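
Here is the whole recovery step as a short Python sketch; the keystream bytes are made up, and the two card values are the test numbers used throughout this post:

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Made-up 16-byte keystream standing in for the application's real (unknown) one.
stream = bytes.fromhex("3a17c2995be04d71886f0c25e3b19ad4")

known_pt   = b"4012888888881881"             # our own card: plaintext known to us
unknown_pt = b"5362016689976147"             # victim card: pretend we cannot see this
known_ct   = xor_bytes(known_pt, stream)     # what we observe in the database
unknown_ct = xor_bytes(unknown_pt, stream)   # what we observe in the database

# XOR of the two ciphertexts: the keystream cancels, leaving pt1 XOR pt2...
combined = xor_bytes(known_ct, unknown_ct)
# ...so XORing with our known plaintext reveals the victim's card, no key required.
print(xor_bytes(combined, known_pt))         # b'5362016689976147'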

As it turns out, we eventually received source code for the functionality under investigation and discovered that the developers were using RC4.  RC4 has had its fair share of issues, and is generally not a recommended algorithm, but none of what we just discussed had to do with any weakness in RC4.  The above attack would have worked on any stream cipher used in the same way.  So, as is typical, the attack had more to do with how cryptography is used than with the cryptographic algorithm itself.  We can't say it enough…crypto is hard.  It just is.  There are just too many things that developers need to get right for crypto to be done correctly.  Algorithm selection, IVs, padding, modes of encryption, secret key generation, key expiration, key revocation…the list goes on and on.  We in the security community have been pounding the "don't invent your own cryptography" mantra for a long time.  However, that has led to a swarm of developers simply leveraging proven cryptography insecurely.  So, we need to take this one step further.  On average, it is best for developers to defer to proven solutions that abstract the notion of cryptography even further.  For example, GPG, Keyczar, et al have abstractions that simply let you encrypt/decrypt/sign/verify things based off of a secret that you provide.  They handle proper padding, encryption mode selection, integrity protection, etc.  That is not to say that there aren't use cases for needing to delve into the weeds sometimes, but for the average use case a trusted library is your friend.



Hulu…client-side “encryption”…seriously?

By: Patrick Toomey

I remember being pretty excited by the prospect of a service like Hulu.   The idea that major networks were actually coming together to stream mainstream video content was impressive.  It was such a departure from the locked down, share nothing, mentality of old.   I thought to myself, “Wow, does Hollywood finally get it?”. Apparently my optimism was exactly that…optimistic.

Sometime in the last week or so it was reported that Hulu, a video streaming service run by NBC and FOX, started "encrypting" Ajax responses to block unauthorized software clients (Boxee et al.) from sidestepping the hulu.com website to view content.  However, encryption is purposefully in quotes, as what Hulu actually implemented is a client-side obfuscation mechanism.  It is well known that such protection mechanisms are flawed by design and bound to be circumvented quickly.

The protective measure that is implemented rests on the obfuscation of Ajax responses made against hulu.com.  Instead of returning plaintext HTML content, Ajax requests return obfuscated URL encoded strings.  These URL encoded strings are reverted to plaintext on the client-side using JavaScript.  For example, a request to:

http://www.hulu.com/channels/Home-and-Garden?kind=videos&sort=popularity

returns a URL encoded string that begins:

dobfu__%F2%9E%84%88%EE%99%81%9F%BD%89%D0%DC …

The entire string is approximately 141KB long.  Other than the “dobfu__” prefix, the remainder of the string is URL encoded.  This obfuscated string is transformed into plaintext by a JavaScript function called “_dobfu()”.  This function, after a bit of reformatting, is reproduced below:

function _dobfu(text) {
  return text.slice(0,7)!='dobfu__'?text:
    $A(unescape(text.substring(7)).tol()).map(function(i) {
      i=0xfeedface^i;
      return String.fromCharCode(i&0xFF,i>>>8&0xFF,i>>>16&0xFF,i>>>24&0xFF);
    }
  ).join('').replace(/\+$/,'');
}

All of the above code is pretty easy to follow, save for the references to the $A() and tol() functions.  The $A() function is a Prototype global function that creates a full array object from any other object that can pass for an array (supports indexing, etc).  This is done so that the new object inherits the full functionality of an array (the map method is needed in this case).  The second piece of ambiguous logic, the tol() method, is defined in another JavaScript file and is reproduced below:

String.prototype.tol=function(){
  var s=this;
  return $R(0,Math.ceil(s.length/4)-1).map(
    function(i){
      return s.charCodeAt(i*4)+(s.charCodeAt(i*4+1)<<8)+(s.charCodeAt(i*4+2)<<16)+(s.charCodeAt(i*4+3)<<24);
    }
  );
};

Essentially this method takes a string of bytes and creates an array of 32-bit integers from each 4-byte chunk.  For example, if the string processed in the method was “\x01\x23\x45\x67\x89\xab\xcd\xef” the method would return the array [0x67452301, 0xefcdab89].  The ordering of the individual bytes is a result of the “tol()” method parsing the data as little-endian.

So, with those two functions defined we can quickly describe how Hulu de-obfuscates responses.  The obfuscated string is broken up into 4-byte integers.  Since the length of the obfuscated string is always evenly divisible by four we are guaranteed that a string of length x will turn into an array of 4-byte integers of length x/4.  Then, for each 4-byte integer, the value is XORed with the constant “0xfeedface”.  Once XORed, the individual bytes from the integer are split apart and converted back to their equivalent ASCII value.  Finally, all trailing NULL bytes are removed from the de-obfuscated string.
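
For what it's worth, the same de-obfuscation takes only a few lines of Python (a rough, illustrative re-implementation of the JavaScript above, not code from Hulu):

import struct
from urllib.parse import unquote_to_bytes

def dobfu(text):
    """Rough Python equivalent of Hulu's _dobfu() JavaScript."""
    if not text.startswith("dobfu__"):
        return text
    raw = unquote_to_bytes(text[7:])                  # URL-decode to raw bytes
    out = bytearray()
    for i in range(0, len(raw), 4):                   # length is always a multiple of 4
        word = struct.unpack("<I", raw[i:i + 4])[0]   # little-endian 32-bit chunk (tol())
        out += struct.pack("<I", word ^ 0xFEEDFACE)   # undo the constant XOR
    return out.rstrip(b"\x00").decode("utf-8", errors="replace")  # drop trailing padding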

It is a bit difficult to imagine what Hulu thought they might accomplish with the above scheme.  It effectively does nothing to prevent third-party tools from performing the same obfuscation/de-obfuscation.  Any scheme that attempts to implement client-side “decryption”, particularly in JavaScript, is bound for failure.  The client possesses the obfuscated message, the key to de-obfuscate the message, and the Javascript that executes the algorithm.   Using these components, it is a trivial exercise to transform any obfuscated response back into plaintext.  Hulu likely thwarted unauthorized software for the better part of an afternoon and no more.  Client-side security mechanisms simply don’t work.  Even complex systems implemented in native code, such as popular DRM schemes, that may go unbroken for a period of time, will eventually be circumvented.  However, to implement a similar preventative measure in JavaScript lowers the difficulty of circumvention dramatically.

Beyond the technical discussion there is also a more broad question to be asked.  What was the net gain for Hulu?  They failed to accomplish their implicit goal: to block unauthorized software.  Hulu simply received another  bit of bad press for treating their customers like thieves.  Hulu, and other such services, need to realize that the ubiquitous availability of their content will ultimately grow their fan base.  There is ever increasing competition for a viewer’s eyes and ears.  Podcasts, YouTube, gaming, etc are all competing.  Third-party products, such as Boxee, only serve to increase the ubiquity of their content, which shouldn’t be viewed as a bad thing.  Thwarting their own customers only sours the experience and reinforces the presumption that a good chunk of the entertainment industry just doesn’t get it.  Besides being bad security, this latest debacle is just bad business.

Crypto Pet Peeves: Hashing…Encoding…It’s All The Same, Right?

Patrick Toomey

© 2008 Neohapsis

We all know cryptography is hard. Time and time again we in the security community give advice that goes something like, “Unless you have an unbelievably good reason for developing your own cryptography, don’t!”. Even if you think you have an unbelievably good reason I would still take pause and make sure there is no other alternative. Nearly every aspect of cryptography is painstakingly difficult: developing new crypto primitives is hard, correctly implementing them is nearly just as hard, and even using existing crypto APIs can be fraught with subtlety. As discussed in a prior post, Seed Racing, even fairly simple random number generation is prone to developer error. Whenever I audit source I keep my eyes open for unfamiliar crypto code. So was the case on a recent engagement; I found myself reviewing an application in a language that I was less familiar with: Progress ABL.

Progress ABL is similar to a number of other 4GL languages, simplifying development given the proper problem set. Most notably, Progress ABL allows for rapid development of typical business CRUD applications, as the language has a number of features that make database interactions fairly transparent. For those of you interested to learn more, the language reference manual can be found on Progress’ website.

As I began my review of the application I found myself starting where I usually do: staring at the login page. The application was a fairly standard web app that required authentication via login credentials before accessing the sensitive components of the application. Being relatively unfamiliar with ABL, I was curious how they would handle session management. Sure enough, just as with many other web apps, the application set a secure cookie that uniquely identifies my session upon login. However, I noticed that the session ID was relatively short (sixteen lower/upper case letters and four digits). I decided to pull down a few thousand of the tokens to see if I noticed any anomalies. The first thing I noticed was that the four digit number on the end was obviously not random, as values tended to repeat, cluster together, etc. So, the security of the session ID must lie in the sixteen characters that precede the four digits. However, even the sixteen characters did not look so random. Certain letters appeared to occur more than others. Certain characters seemed to follow other characters more than others. But, this was totally unscientific; strange patterns can be found in any small sample of data. So, I decided to do a bit more scientific investigation into what was going on.

Just to confirm my suspicions I coded up a quick python script to pull down a few thousand tokens and count the frequency of each character in the token. Several minutes later I had a nice graph in excel.
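
The script isn't reproduced here, but it amounted to something like the following (the URL and cookie name are placeholders):

from collections import Counter
import requests

counts = Counter()
for _ in range(2000):
    resp = requests.get("https://app.example.com/login")   # placeholder URL
    token = resp.cookies.get("SESSIONID", "")               # placeholder cookie name
    counts.update(token[:16])                                # ignore the trailing 4 digits
for char, n in sorted(counts.items()):
    print(char, n)   # paste into a spreadsheet to plot the histogram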

[Image: Histogram of Encode Character Frequency]

Ouch! That sure doesn’t look very random. So, I opened up Burp Proxy and used their Sequencer to pull down a few thousand more session cookies. The Burp Sequencer has support for running a number of tests, including a set of FIPS-compliant statistical tests for randomness. To obtain a statistically significant result Burp analyzes a sample size of 20,000 tokens. Since I saw that the four digit token at the end of the session ID provided little to no entropy, I discarded them from the analysis. It seemed obvious that the sixteen character sequence was generated using some sort of cryptographic hash, and the four digit number was generated in some other way. I was more interested in the entropy provided by the hash. So, after twenty minutes of downloading tokens, I let Burp crunch the numbers. About 25 seconds later Burp returned an entropy value of 0 bits. Burp returned a graph that looked like the one below, showing the entropy of the data at various significance levels.

[Image: Encode Entropy Estimation]

Hmmm, maybe Burp is broken. I was pretty sure I had successfully used the Burp Sequencer before. Maybe it was user error, a bug in the current version, who knows. I decided that a control was needed, just to ensure that the tool was working the way I thought it should. So, I wrote a bit more python to simply print the hex-encoded value of a SHA1 hash on the numbers 1-20,000. I loaded this data into Burp and analyzed the data set. Burp estimated the entropy at 153 bits. Just to compare with the prior results, here is the distribution graph and the Burp entropy results for the SHA1 output:

[Image: Histogram of SHA1 Character Frequency]

[Image: SHA1 Entropy Estimation]

I repeated the same test against a set of JSESSIONID tokens and found a similarly acceptable result. Ok, so the Burp Sequencer seems to be working.

So, I next went hunting for the session token generation code in the application. After a little grepping I found the function for generating new session tokens. Ultimately the function took a number of values and ran them through a function called "ENCODE". Hmmm, ENCODE, that didn't sound familiar. Some more grepping through the source did not reveal any function definitions, so I assumed the function must be part of the standard library for ABL. Sure enough, on page 480 of the language reference manual there was a description of the ENCODE function.

“Encodes a source character string and returns the encoded character string result”

The documentation then goes on to state:

“The ENCODE function performs a one-way encoding operation that you cannot reverse.  It is useful for storing scrambled copies of passwords in a database. It is impossible to determine the original password by examining the database. However, a procedure can prompt a user for a password, encode it, and compare the result with the stored, encoded password to determine if the user supplied the correct password.”

That is the least committal description of a hash function I've ever had the pleasure of reading. It turns out the application, as well as a third party library the application depends upon, uses this function for generating session tokens, storing passwords, and generating encryption keys. For the sake of reproducibility I wanted to be sure my data was not the result of some strange artifact in their environment. I installed the ABL runtime locally and coded up a simple ABL script to call ENCODE on the numbers 1-20000. I reran the Burp Sequencer and got the exact same result, 0 bits.

At this point I was fairly sure that ENCODE was flawed from a hashing perspective. A good quality secure hash function, regardless of how correlated the inputs are (as the numbers 1-20000 obviously are), should produce output that is indistinguishable from truly random values (see Cryptographic Hash Functions and Random Oracle Model for more information). ENCODE clearly does not meet this definition of a secure hash function. But, 0 bits, that seems almost inconceivably flawed.  So, giving them the benefit of the doubt, I wondered if the result was dependent on the input. In other words, I conjectured that ENCODE might perform some unsophisticated "scrambling" operation on the input, and thus input with low entropy would have low entropy on the output. Conversely, input with high entropy might retain its entropy on output. This still wouldn't excuse the final result, but I was curious nonetheless. My final test was to take my SHA1 results and feed each of them through the ENCODE function. Since the output of the SHA1 function contains high entropy I conjectured that ENCODE, despite its obvious flaws, might retain this entropy. The results are shown below:

[Image: Histogram of SHA1 then Encode Character Frequency]

[Image: SHA1 then Encode Entropy Estimation]

ENCODE manages to transform an input with approximately 160 bits of entropy into an output that, statistically speaking, contains 0 bits of entropy. In fact, the frequency distribution of the character output is nearly identical to the first graph in this post.

This brings me back to my opening statement, "Unless you have an unbelievably good reason for developing your own cryptography, don't!". I can't figure out why this ENCODE function exists. Surely the ABL library has support for a proper hash function like SHA1, right? Yes, in fact it does. The best explanation I could come up with is that it is a legacy API call. If that is the case then the call should be deprecated and/or documented as suitable only in cases where security is of no importance. The current API does the exact opposite, encouraging developers to use the function for storing passwords. Cryptography is hard, even for those of us that understand the subtlety involved. Anything that blurs the line between safe and unsafe behavior only makes the burden on developers even greater.

It is unclear, based on this analysis, how much effort it would require to find collisions in ABL's ENCODE function. But, even this simple statistical analysis should be enough for anyone to steer clear of its use for anything security related. If you are an ABL developer I would recommend that you try replacing ENCODE with something else. As a trivial example, you could try: HEX-ENCODE(SHA1-DIGEST(input)). Obviously you need to test and refactor any code that this breaks. But, you can at least be assured that SHA1 is relatively secure from a hashing perspective. That said, you might want to start looking at SHA-256 or SHA-512, given the recent chinks in the armor of SHA1.

Unfortunately, it does not appear that ABL has support for these more contemporary SHA functions in their current release.

Ok….slowly stepping down off my soapbox now.   Bad crypto just happens to be one of my pet peeves.

Footnote:

Just before posting this blog entry I decided to email Progress to see if they were aware of the behavior of the ENCODE function.  After a few back-and-forth emails I eventually got a reply that described the ENCODE function as using a CRC-16 to generate its output (it is not the direct output, but CRC-16 is the basic primitive used to derive the output).  Unfortunately, CRCs were never meant to provide any security guarantees.  CRCs do an excellent job of detecting accidental bit errors in a noisy transmission medium.  However, they provide no guarantees if a malicious user tries to find a collision.  In fact, maliciously generating inputs that produce identical CRC outputs is fairly trivial.  As an example, the linearity of the CRC-32 algorithm was noted as problematic in an analysis of WEP.  Thus, despite the API doc recommendation, I would highly recommend that you not use ENCODE as a means of securely storing your users' passwords.
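
To see just how easy collisions are, here is a quick illustration; it uses Python's built-in CRC-16/CCITT (binascii.crc_hqx) purely as a stand-in, since the exact CRC-16 variant inside ENCODE isn't documented:

import binascii, itertools, string

def crc16(data):
    return binascii.crc_hqx(data, 0)   # CRC-16/CCITT, standing in for ENCODE's internal CRC-16

target = crc16(b"hunter2")             # pretend this is a stored, "encoded" password
# With only 65,536 possible outputs, a brute-force collision falls out almost immediately.
for chars in itertools.product(string.ascii_lowercase, repeat=4):
    candidate = "".join(chars).encode()
    if candidate != b"hunter2" and crc16(candidate) == target:
        print(candidate, "collides with 'hunter2' at CRC", hex(target))
        break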