Collecting Cookies with PhantomJS

TL;DR: Automate WebKit with PhantomJS to get specific Web site data.

This is the first post in a series about gathering Web site reconnaissance with PhantomJS.

My first major engagement with Neohapsis involved compiling a Web site survey for a global legal services firm. The client was preparing for a compliance assessment against Article 29 of the EU Data Protection Directive, which details disclosure requirements for user privacy and usage of cookies. The scope of the engagement was to work from their list of IP addresses and domain names to validate their active and inactive Web sites and redirects, count how many first-party and third-party cookies each site placed, identify any login forms, and check for links to each site's privacy policy and cookie policy.

The list was extensive and the team had a hard deadline. We had a number of tools at our disposal to scrape Web sites, but as we had a specific set of attributes to look for, we determined that our best bet was to use a modern browser engine to capture fully rendered pages and try to automate the analysis. My colleague, Ben Toews, contributed a script towards this effort that used PhantomJS to visit each URL listed in a text file and capture the cookies into another file. PhantomJS is a distribution of WebKit that is intended to run in a “headless” fashion, meaning that it renders Web pages and scripts like Apple Safari or Google Chrome, but without an interactive user interface. Instead, it runs on the command line and exposes a JavaScript API for scripting it. I extended this script to produce a list of active and inactive URLs by checking the status passed to the page.open callback, and to capture the cookies from every active URL as stored in the page.cookies property.
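As a rough illustration of the status check described above, here is a minimal sketch in plain JavaScript rather than the CoffeeScript of the actual script. The function name surveyUrl and the shape of the result object are hypothetical; only page.open, its status string, page.url, and page.cookies come from the PhantomJS API. The page object is passed in as a parameter so the function can be exercised without PhantomJS.

```javascript
// Sketch only: surveyUrl and the result fields are illustrative names,
// not taken from the original script.
// `page` is expected to behave like a PhantomJS page object, such as the
// value returned by require('webpage').create().
function surveyUrl(page, url, done) {
  page.open(url, function (status) {
    // PhantomJS passes 'success' or 'fail' as the load status.
    var active = (status === 'success');
    done({
      url: url,                             // URL as requested
      active: active,                       // true when the page loaded
      pageURL: active ? page.url : null,    // URL after any redirects
      cookies: active ? page.cookies : []   // cookies visible to the page
    });
  });
}
```

Inside PhantomJS this could be called with a real page, for example surveyUrl(require('webpage').create(), someUrl, callback), where someUrl and callback are placeholders for one line of the input list and a function that records the result.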

Remember how I said that PhantomJS would render a Web page like Safari or Chrome? This was very important to the project, as I needed to capture the Web site attributes in the same way a typical user would encounter the site. We needed to account for redirects from either the Web server or from JavaScript, and any first-party or third-party cookies set along the way. As it turns out, PhantomJS provides a way to capture URL changes with the page.onUrlChanged callback function, which I used to log the redirects and the final destination URL. The page.cookies property includes all first-party and third-party cookies without any additional work, as PhantomJS already makes all of the needed requests and script executions. Check out my version of the script in chs2-basic.coffee.
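The two ideas in this paragraph can be sketched as follows, in plain JavaScript and not as an excerpt from chs2-basic.coffee. The helper names trackRedirects, hostOf, and splitCookiesByParty are hypothetical. The URL-change callback is spelled page.onUrlChanged in the PhantomJS API, and PhantomJS cookie objects carry a domain field. The first-party test here is an assumption: a simple suffix match between the cookie domain and the final page host, which ignores the public suffix list and would need refining for a real assessment.

```javascript
// Sketch only: all three function names are illustrative.

// Record every URL the page navigates to, in order. PhantomJS calls
// page.onUrlChanged with the new URL, so the last entry is the final
// destination.
function trackRedirects(page) {
  var seen = [];
  page.onUrlChanged = function (targetUrl) {
    seen.push(targetUrl);
  };
  return seen;
}

// Extract the lowercase host name from an absolute http(s) URL,
// skipping any user:password@ prefix and any :port suffix.
function hostOf(url) {
  var match = /^https?:\/\/(?:[^\/?#@]*@)?([^\/:?#]+)/i.exec(url);
  return match ? match[1].toLowerCase() : '';
}

// Assumption: a cookie counts as first-party when its domain equals the
// page host or is a parent domain of it; everything else is third-party.
function splitCookiesByParty(pageUrl, cookies) {
  var host = hostOf(pageUrl);
  var result = { firstParty: [], thirdParty: [] };
  cookies.forEach(function (cookie) {
    // Cookie domains may carry a leading dot, e.g. ".example.com".
    var domain = String(cookie.domain || '').replace(/^\./, '').toLowerCase();
    var isFirst = domain !== '' &&
      (host === domain || host.slice(-(domain.length + 1)) === '.' + domain);
    (isFirst ? result.firstParty : result.thirdParty).push(cookie);
  });
  return result;
}
```

Comparing against the final page URL rather than the requested URL matters here, because a redirect to a different host changes which cookies count as first-party.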

This is the command invocation. It takes two arguments: a text file with one URL per line and a file name prefix for the output files.


phantomjs chs2-basic.coffee [in.txt] [prefix]
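For the input file, a minimal sketch of turning its contents into a list of URLs might look like the following. It is plain JavaScript, and parseUrlList is a hypothetical helper name rather than something taken from chs2-basic.coffee. Skipping blank lines and trimming whitespace are assumptions about how forgiving the reader should be.

```javascript
// Sketch only: parseUrlList is an illustrative helper name.
// Turn the contents of the input file into an array of URLs,
// tolerating Windows line endings, stray whitespace, and blank lines.
function parseUrlList(text) {
  return text.split(/\r?\n/)
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });
}
```

Inside PhantomJS, the text could come from require('fs').read(require('system').args[1]), since system.args[0] holds the script name and the two arguments follow it.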

This snippet serializes the page's cookies as a JSON string and appends it, one line per page, to an output file named after the prefix argument with a .jsoncookies extension.

if status is 'success'
  # output JSON of cookies from page, one JSON string per line
  # format: url:(requested URL from input) pageURL:(resolved Location from the PhantomJS "Address Bar") cookie: object containing cookies set on the page
  fs.write system.args[2] + ".jsoncookies", JSON.stringify({url:url,pageURL:page.url,cookie:page.cookies})+"\n", 'a'
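Because each line of the output file is a complete JSON string, the file is easy to post-process. The sketch below, in plain JavaScript, shows one way to count cookies per URL from that text. The function name summarizeCookieLog and the cookieCount field are hypothetical; the url, pageURL, and cookie keys are the ones written by the snippet above.

```javascript
// Sketch only: summarizeCookieLog and cookieCount are illustrative names.
// `text` is the full contents of a .jsoncookies file, one JSON object per line.
function summarizeCookieLog(text) {
  return text.split('\n')
    .filter(function (line) { return line.trim() !== ''; })
    .map(function (line) {
      var record = JSON.parse(line);
      return {
        url: record.url,                   // requested URL from the input list
        pageURL: record.pageURL,           // URL after redirects
        cookieCount: record.cookie.length  // number of cookies captured
      };
    });
}
```

A malformed line would make JSON.parse throw, which is probably the right behavior for a survey where silently skipped records would skew the counts.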

In a follow-up post, I’ll discuss how to capture page headers and detect some common platform stacks.
