Generate Application Cache Manifest with PhantomJS

What is PhantomJS?

To quote the source…

PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

Why use PhantomJS?

So why do you care? Well, having a headless WebKit allows you to do all sorts of things. You can automate anything within a browser environment, and that's very cool!

So for example, I work for BBC News on the responsive code base and some of the very talented guys here have recently released an open-source project called Wraith which is a front-end regression testing tool that lets you grab screen shot images of web pages at different dimensions (it's great for making sure your recent CSS commits or changes hasn't broken your site, it doesn't replace acceptance tests, unit tests or manual testing but it helps improve confidence).

In case you're wondering, the definition of 'wraith' is…

a ghost or ghostlike image of someone

…see what we did there, Phantom… ghost… wraith… yeah?

But there are other spin-offs such as CasperJS and Poltergeist both of which are named on a variation of the phantom theme but focus specifically on user/acceptance testing a user interface by allowing developers to interact with a web page programmatically.

How can we use it?

We'll be using PhantomJS to help us interrogate a web page of the user's choice. It will check the network requests, store them and then generate a Application Cache manifest file based on the content of the web page.

So I wrote a small script that allows you to do this which I called Squirrel because you're storing away small pieces of information… like… nuts? Yeah I think you see where I'm coming from.

This is very much a work in progress but the principles are there and it also demonstrates some different parts of PhantomJS that you might be interested in.

I won't go into the details of what the Application Cache does (or should do), I'll leave that to another.

Basic introduction

Here's a super quick introduction on how to use PhantomJS…

var page = require('webpage').create(),
    url = 'http://www.phantomjs.org/';

page.open(url, function (status) {
    // Page is loaded!
    phantom.exit();
});

…here we load the PhantomJS webpage module, and call the create() method which instantiates a new web page.

We then call the open() method on this web page object and pass in the URL of the web page we want to open.

We then execute the exit() method on Phantom itself which tells the script that we're finished doing what we're doing.

That's it. Super simple run down of how to use PhantomJS.

Let's break down Squirrel

OK, so let's review our Squirrel repo (specifically the appcache.js file).

The process of our script is as follows…

  1. Set-up variables and load specific modules
  2. Set-up event listeners for resources requested, received and any errors detected.
  3. In the onResourceReceived we check the contentType of the resource and if it falls into any of the categories we're interested in (image, style sheet, javascript) then we store it.
  4. Check the URL provided by the user is in the required format.
  5. If the URL is our example bbc.co.uk/news then set an additional cookie (required to load the Responsive version of BBC News site)
  6. Open the URL specified by the user (the clean-up version).
  7. Get a list of all anchors/links found in the web page.
  8. Make sure we have a unique list of each resource type.
  9. Call a function populateManifest() that will begin the process of populating our manifest file.
  10. Exit our PhantomJS script.

Set-up

var _          = require('lodash'),
    system     = require('system'),
    fs         = require('fs'),
    page       = require('webpage').create(),
    args       = system.args,
    path       = './appcache.manifest',
    manifest   = fs.read(path),
    css        = [],
    jpgs       = [],
    pngs       = [],
    gifs       = [],
    images     = [],
    javascript = [],
    links, url;

…OK so this looks a bit ugly but this long list of variables is basically us loading the Lo-Dash utility library, the Node file system, the PhantomJS web page module and then grabbing any arguments passed via the command line and reading the content of our manifest file.

You might be wondering about the long list of empty Arrays, well we have to set each variable to an empty Array because Arrays are a reference type object which means if we did var css = jpgs = pngs = gifs = images = javascript = []; then although we'd end up with shorter code (a one liner) it means that we're not getting a unique Array for each variable but sharing a single Array instance (which isn't what we want).

Functions

We have a whole group of functions that carry out different things for us, so we'll just quickly scan over them…

function getLinks(){
    var results = page.evaluate(function(){
        return Array.prototype.slice.call(document.getElementsByTagName('a')).map(function (item) {
            return item.href;
        });
    });

    return results;
}
function populateManifest(){
    writeVersion();

    writeListContentFor('Images', images);
    writeListContentFor('Internal HTML documents', links);
    writeListContentFor('Style Sheets', css);
    writeListContentFor('JavaScript', javascript);

    writeManifest();
}
function writeVersion(){
    manifest = manifest.replace(/# Timestamp: \d+/i, '# Timestamp: ' + (new Date()).getTime());
}
function writeListContentFor (str, type) {
    manifest = manifest.replace(new RegExp('(# ' + str + ')\\n[\\s\\S]+?\\n\\n', 'igm'), function (match, cg) {
        return cg + '\n' + type.join('\n') + '\n\n';
    });
}
function writeManifest(){
    fs.write(path, manifest, 'w');
}
function urlProvided(){
    return args.length > 1 && /(?:www\.)?[a-z-z1-9]+\./i.test(args[1]);
}
function cleanUrl (providedUrl) {
    // If no http or https found at the start of the url...
    if (/^(?!https?:\/\/)[\w\d]/i.test(providedUrl)) {
        return 'http://' + providedUrl + '/';
    }
}
function bbcNews(){
    if (/bbc.co.uk\/news/i.test(url)) {
        return true;
    }
}

Checking resources that are loaded

The following is our event handler that checks the content type of the resource being loaded and stores it in the relevant Array.

page.onResourceReceived = function (request) {
    // Some requests can have `contentType = null` which causes errors if we don't check for truthy value
    if (request.contentType) {
        if (request.contentType.indexOf('text/css') !== -1) {
            css.push(request.url);
        }

        // application/javascript & application/x-javascript
        if (request.contentType.indexOf('javascript') !== -1) {
            javascript.push(request.url);
        }

        if (request.contentType.indexOf('image/png') !== -1) {
            if (request.url.indexOf('data:image') === -1) {
                pngs.push(request.url);
            }
        }

        if (request.contentType.indexOf('image/jpeg') !== -1) {
            if (request.url.indexOf('data:image') === -1) {
                jpgs.push(request.url);
            }
        }

        if (request.contentType.indexOf('image/gif') !== -1) {
            if (request.url.indexOf('data:image') === -1) {
                gifs.push(request.url);
            }
        }
    }
};

To be honest we don't need to separate the content types out into jpg, gif and png. We could just have an images Array that holds them all - it would be cleaner and more efficient code, something like...

page.onResourceReceived = function (request) {
    if (request.contentType) {
        if (request.contentType.indexOf('text/css') !== -1) {
            css.push(request.url);
        }

        if (request.contentType.indexOf('javascript') !== -1) {
            javascript.push(request.url);
        }

        if (/image\/(?:png|jpeg|gif)/i.test(request.contentType) && request.url.indexOf('data:image') === -1) {
            images.push(request.url);
        }
    }
};

Checking for errors

This should be self explanatory, but basically it checks for any errors that happen while the page is loading and displays them for the user…

page.onError = function (msg, trace) {
    console.log(msg);

    trace.forEach(function (item) {
        console.log('  ', item.file, ':', item.line);
    });
}

Ensure URL is provided

Following conditional makes sure that a URL was provided by the user from the command line and if not it displays appropriate feedback to the user to let them know what they should be doing…

if (urlProvided()) {
    url = cleanUrl(args[1]);
} else {
    console.log('Sorry a valid URL should have been provided');
    console.log('Example usage: phantomjs appcache.js bbc.co.uk/news');
    phantom.exit();
}

Setting custom headers

The following code isn't needed for our Squirrel repo to do its job, I just stuck it in to demonstrate how you could use an interesting feature of PhantomJS which is to set cookies.

Here we check if the URL being requested is BBC News and if so I want the user to load up the responsive code base version of the site, and to do that we need to set a cookie for the page to know to load that version of the page…

if (bbcNews()) {
    // We want to serve up the responsive code base...
    phantom.addCookie({
        'name'  : 'ckps_d',
        'value' : 'm',
        'domain': '.bbc.co.uk'
    });
}

The main star of the show

Here is the fundamental piece of the puzzle, the call to open the web page specified by the user and which once the page has loaded then parses the page for the information we're interested in…

page.open(url, function (status) {
    links = getLinks();

    links      = _.unique(links);
    images     = _.unique(pngs.concat(jpgs).concat(gifs));
    css        = _.unique(css);
    javascript = _.unique(javascript);

    console.log('links: ', links.length);
    console.log('images: ', images.length);
    console.log('css: ', css.length);
    console.log('javascript: ', javascript.length);

    populateManifest();

    phantom.exit();
});

…PhantomJS passes us a status string as an argument to the callback function and which tells us if the page loaded successfully or not.

After we've called our main populateManifest() method we then tell PhantomJS to stop (phantom.exit()).

Example usage

So how do we use this script? Well, once you've cloned down the repo and followed the installation instructions you simply jump inside our cloned directory and run something like:

phantomjs appcache.js bbc.co.uk/news

Other PhantomJS features

Here follows are a quick run down of some other features available to you when using PhantomJS, just check the PhantomJS API for full details.

Custom Headers

For BBC News I wrote a similar script, but because I was interrogating a local development URL which uses test data I had to set some additional headers to get the page to load live data…

page.customHeaders = { 
    'X-Candy-Override': 'https://api.live.bbc.co.uk',
    'X-Compiled-Js'   : 'true'
};

Remote Debugging

You can remote debug your code by executing your code with a debug flag phantomjs --remote-debugger-port=9000 appcache.js [url], you can then open up Chrome or Safari and access the debugger via http://127.0.0.1:9000/

View Port Size

You can change the dimensions of the page being loaded using the viewportSize property (this is used in the Wraith open-source project released by BBC News)…

page.viewportSize = { width: 320, height: 480 };

Conclusion

Hopefully this introduction to PhantomJS was useful. The Squirrel repo is still a work in progress so feel free to open up an issue (or better yet a pull request) on GitHub if you find any problems.

I'll likely follow up this article at some point in the future with one about using CasperJS so keep your eyes peeled for that.

Any other feedback then catch me on twitter.


Links