Web archiving is an important skeptic tool

You may have noticed a new badge appeared recently on the right side of this blog that says Keep Libel Laws out of Science. It has to do with an ongoing legal case in England where a major chiropractic association is suing science writer Simon Singh for libel over an article in which he referred to certain chiropractic procedures as “bogus.” I encourage you to click the badge and sign the petition.

Today there was a fascinating development in this case that relates directly to skeptical software tools. Because the case hinges on whether or not chiros promote procedures they know to be “bogus”, skeptics have been scouring chiropractic websites in the UK looking for evidence of this. In response, another chiropractic association has advised its members to take down their websites entirely! This is stunning.

Internet Archive Wayback Machine

As skeptics, one of the key things we do is hold woo-woos’ feet to the fire when they make ridiculous claims. Perhaps the most public place of all to make a claim is on a website, because it is instantaneously visible to everyone on earth who chooses to look. Tracking claims made on websites is thus an important skeptical technique.

But this move by the chiropractors reminds us that the web is a mutable thing. Any content anywhere on the web can be changed at any time. Paranormalists and pseudoscientists can edit their web sites constantly to present a moving target or to remove evidence of their missteps. In order to do our jobs as skeptics, we need to be constantly aware of this and use tools to compensate. Fortunately such archival tools exist. One is the well-known Internet Wayback Machine, but several others (including commercial products) exist.

After the jump, I’ll talk more about some of the uses of these tools and show you how to use them as a skeptic.

The psychic two-step

A standard technique used by psychics is to claim to have predicted something, when in fact they are doing so retroactively. Credulous media will often let them get away with this with little or no checking. It is up to skeptics to check. Catching a psychic doing this is always a big win for skepticism.

This has been done successfully several times over the years. Last year the skepTick caught Nikki, Psychic to the Stars in a retroactive prediction of Heath Ledger’s death. Robert Todd Carroll of Skeptic’s Dictionary also pointed out that Ellie Crystal’s so-called predictions of the 9/11 attacks apparently weren’t posted on the web until 2005.

As I mentioned in my presentation at TAM6, I think we skeptics need to be more systematic about how we approach what we do. The skepTick only happened to notice the Heath Ledger prediction because he had copied her web site for a presentation he was giving the same week Ledger died. How many more retroactive predictions like this might we catch if we were actively looking for them?

Reactive archive tools

There are a number of tools that can be used reactively to deal with this. Some are online services that monitor web pages for you and send you an email when a change is detected. Many of these tools are completely free.

As I mentioned, the most famous tool that can be used for this is The Internet Wayback Machine. This is a massive archive of the web that allows you to see multiple old versions of virtually any popular web page. Robert Lancaster used this tool to research Sylvia Browne’s predictions for the U.S. presidential elections, even though the older predictions had been removed from her website.
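If you want to jump straight to an old capture rather than click through the site, here is a minimal sketch that builds a Wayback Machine address from a date and a page URL. The function name wayback_url and the example.com page are placeholders of mine. It relies on the archive's usual address layout, where a timestamp comes before the original URL and an asterisk in place of the timestamp lists every capture held.

```shell
# wayback_url TIMESTAMP URL
# Print the Wayback Machine address for URL near TIMESTAMP.
# TIMESTAMP is a run of digits such as YYYYMMDD (up to YYYYMMDDhhmmss),
# or '*' to ask for the list of all captures of that URL.
wayback_url() {
  printf 'https://web.archive.org/web/%s/%s\n' "$1" "$2"
}

# Example calls (www.example.com is a placeholder, not a real target):
wayback_url 20080101 http://www.example.com/predictions.html
wayback_url '*' http://www.example.com/predictions.html
```

Paste the printed address into a browser. If the archive has no capture at exactly that moment, it normally sends you to the nearest one it holds.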

There are other tools that can be used similarly to find old copies of web pages, such as the Google cache. Several of them are reviewed here. Of course the advantage of public archives like this is they can be used to check pages you weren’t monitoring in advance. The disadvantage of Google Cache is that it can only be used to catch changes in the very recent past (a few days at most).

There are several other practical issues with using these reactive tools. First of all, free web-based tools have a tendency to disappear suddenly when the operator decides the service is no longer a priority or is taking too much effort. (This is not true of the Wayback Machine or Google, of course, but of the other monitoring tools I linked to above.) The quality of service can vary as well. You get what you pay for.

Another problem is some of them are only oriented toward archiving copies of pages or noticing when changes occur. They often will do nothing to alert you as to the exact nature of the change, leaving you the job of laboriously comparing versions to figure out if there was a change at all, or what the change was.

Further, if you don’t choose your sites to monitor in advance, you are at the mercy of whether the tool found that site “interesting” enough to archive. The Wayback Machine has an algorithm for choosing what sites to copy that tends to ignore little-known sites. It also will not show any results for a site until 6 months have passed, so it is useless for time-frames shorter than that. So these tools are not going to be very useful for tracking a little-known psychic or chiropractor with a brand new website.

Achilles heel: Robots.txt & “noarchive”

Perhaps the biggest problem with public archiving tools like the Google Cache and the Wayback Machine is they can be manipulated by the webmaster of the target site.

There’s a well known system known as the robots.txt protocol that has existed for many years. This is a way for webmasters to specify that certain web pages should not be crawled or indexed by search engines. Originally it was intended to keep these automated crawlers from getting “lost” on deep websites, but it can be used to block any content at all, including entire sites if desired.

More recently, Google and others have introduced a way to request that a particular page not be archived even though it is available for indexing and searching. This is used by news sites in particular to protect their advertising revenue. But it can also be used by woos to prevent you from using the Wayback Machine or Google cache to monitor their site.

Any woo who has already been caught once and/or has a clever webmaster running their site is going to take advantage of these features, locking skeptics out from using these tools.
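To see whether a target has already locked the archives out, you can save a copy of its robots.txt file and of the page you care about, then look for the telltale lines. What follows is a rough sketch, not a full parser. The function names are hypothetical, and the first check does not work out which crawler a Disallow line applies to.

```shell
# blocks_all_crawlers FILE
# Succeeds if the saved robots.txt FILE contains a blanket "Disallow: /" line.
# Rough check: it ignores which User-agent group the line belongs to.
blocks_all_crawlers() {
  grep -Eiq '^[[:space:]]*Disallow:[[:space:]]*/[[:space:]]*$' "$1"
}

# asks_noarchive FILE
# Succeeds if the saved HTML FILE has a meta tag, addressed to robots or
# googlebot, that mentions "noarchive" (either attribute order, same line).
asks_noarchive() {
  grep -Eiq '<meta[^>]*(robots|googlebot)[^>]*noarchive|<meta[^>]*noarchive[^>]*(robots|googlebot)' "$1"
}

# Typical use, after saving the two files by hand:
#   blocks_all_crawlers robots.txt && echo "robots.txt shuts crawlers out"
#   asks_noarchive page.html && echo "page asks not to be archived"
```

A hit from either check suggests the public archives may have nothing for that site, which is one more reason to keep your own copies.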

Proactive archive tools

The best way to play this game is to be proactive about it. Pick your skeptical targets ahead of time, and regularly archive copies of their website that you can control. Then you can catch them in their foolishness without having to rely on the vagaries of the Wayback Machine or whether or not they have a robots.txt defined.

One possible option you might consider here is using WebCite. This is a service intended for use by scholars and academics who need to cite particular versions of web-based resources in papers and other research. You can give it the URL of a resource and it will archive a copy that can be accessed at any future time via a WebCite url. Thus it gives you the same result as the Internet archive, but a greater level of control. It might be particularly well suited to use in tracking psychic prediction pages since these are usually localized within a site.

Sometimes, however, you really do need to have an entire site archived, because you are not sure what part of it might become significant due to later changes. Online tools such as Wayback or WebCite are not great for this, because they usually will not archive everything on a site. What we need is a tool you can run on your own computer.

Command line tools

Particularly if you are a command-line nerd like myself, you may want to use one of the free command-line tools that does this. (If you are averse to using command line tools, you may want to skip down to the section below on commercial tools).

There are two well-known command line tools useful here. One is called cURL (Wikipedia) and another is called WGET (Wikipedia). Either can easily retrieve individual pages by URL, and because they are free and open-source they come pre-installed on many computers. (In particular, if you have a recent-vintage OSX Macintosh, you have a copy of CURL already installed. Any Linux computer is likely to have CURL, WGET or both pre-installed.) In essence, each of these tools lets you type a command line to fetch a particular URL, and the content of the page at that URL is downloaded to a file on your disk.
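For a single page, such as a psychic's predictions page, a small wrapper around CURL is enough to keep dated copies. This is a sketch under my own assumptions: the function names, the snapshots directory and the file naming scheme are all made up for illustration, and the example.com address is a placeholder.

```shell
# snapshot_name URL STAMP
# Turn a URL and a timestamp into a safe file name.
snapshot_name() {
  url=$1 stamp=$2
  # Drop the scheme (http://), then replace anything unusual with "_"
  safe=$(printf '%s' "${url#*://}" | tr -c 'A-Za-z0-9._-' '_')
  printf '%s__%s.html\n' "$safe" "$stamp"
}

# snapshot_page URL [DIR]
# Fetch URL with curl and save it under DIR (default: snapshots),
# stamped with the current UTC date and time. Prints the saved file name.
snapshot_page() {
  url=$1 dir=${2:-snapshots}
  mkdir -p "$dir"
  out="$dir/$(snapshot_name "$url" "$(date -u +%Y%m%dT%H%M%SZ)")"
  # -s quiet, -S still show errors, -L follow redirects, -o write to file
  curl -sS -L -o "$out" "$url" && printf '%s\n' "$out"
}

# Typical use (placeholder address):
#   snapshot_page http://www.example.com/predictions.html
```

If you only have WGET, replacing the curl line with wget -O "$out" "$url" should do the same job. Run it on a schedule and you build up a dated series of copies of the page.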

I’m going to highlight WGET here, because it has a nice recursive option that will come in handy. This tells it to chase all the hyperlinks in HTML content and recursively download an entire website. This is perfect for our use as skeptics. A few months ago when there was a problem with the Stop Sylvia website, I used this tool myself to pull an archive copy of everything on that site just in case. I later used my local archive to easily and quickly build a list of all the URLs of articles on the site for purposes of sub-linking to them.

The basic technique is to point WGET at the home page of the site in question, include the options to recurse the site and ignore ROBOTS.TXT, and let it run. The result will be a set of files and subdirectories on your disk that mirror the structure of the site.

Here’s a typical WGET command-line you might use to pull down an entire site. As written below, it will give you a browsable local mirror of What’s the Harm; change the domain in both places to point it at your woo-woo target. (And please take it easy on my server when experimenting!)


wget -r -nc -p -k -l 0 -erobots=off -D whatstheharm.net http://whatstheharm.net

The above should be on one line as a command, though it probably wraps on your screen. The switches used here are as follows:

  • -r is recursive, the key to downloading an entire site.
  • -nc is “no clobber”; it will not overwrite existing files. Useful for resuming a long run.
  • -p is “page requisites”; it ensures all graphics and other files needed to display each page are downloaded too.
  • -k is an option to rewrite all the links in the HTML so they point to your copy.
  • -l 0 tells it to follow all the links as far as they go. This can be dangerous on huge sites.
  • -D whatstheharm.net tells it to stay within the original site and not follow outside links.
  • -erobots=off of course tells it to ignore the ROBOTS.TXT.

There are many other useful switches; type wget --help to see them. There are also verbose versions of the command-line switches if you prefer. This post at Linux Journal gives you an approximately equivalent command line using verbose switches.
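For reference, here is a sketch of the same mirror command spelled out with WGET's verbose switches and wrapped in a small function. The function name mirror_site, the archives directory, the dated folder name and the DRY_RUN variable are my own assumptions. I have also added --wait and --random-wait, which were not in the command above, so the run goes easier on the target's server.

```shell
# mirror_site DOMAIN [ARCHIVE_DIR]
# Mirror http://DOMAIN into ARCHIVE_DIR/DOMAIN-YYYYMMDD (default ARCHIVE_DIR:
# archives) using wget's long options. Set DRY_RUN=1 to print the command
# instead of running it.
mirror_site() {
  domain=$1
  dest="${2:-archives}/$domain-$(date -u +%Y%m%d)"
  # Long forms of: -r -nc -p -k -l 0 -erobots=off -D DOMAIN, plus a polite
  # pause between requests and a dated destination directory (-P).
  set -- wget --recursive --no-clobber --page-requisites --convert-links \
    --level=0 --execute robots=off --domains="$domain" \
    --wait=2 --random-wait \
    --directory-prefix="$dest" "http://$domain"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Typical use:
#   DRY_RUN=1 mirror_site whatstheharm.net   # just show the command
#   mirror_site whatstheharm.net             # actually download
```

Because each run lands in its own dated folder, you end up with a series of mirrors you can compare later.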

WGET is a free & open source software (FOSS) project, which means you can use it freely and even modify it for your own use if you are a programmer. Unfortunately the fact that it’s FOSS and also a command line tool means it can be difficult to find an easy-to-install package for it that is simple to use.

There is one for Windows systems here and for Mac OSX systems here. The installation instructions for that OSX version are incomplete, so you will need to use commands such as sudo, mkdir and chmod to complete installation. Here is a typical sequence once you have extracted the zip file and are sitting at a command line in that directory:


chmod +x wget
sudo mv wget /usr/local/bin
sudo mkdir -p /usr/local/etc
sudo mv wgetrc /usr/local/etc

The above worked on my MacBook Pro with 10.5.7 installed; your mileage may vary.

The biggest problem with WGET is it doesn’t give you any assistance with finding the key changes on a web site once you have mirrored it to your local disk. You are on your own to use whatever differencing tools you may have on your computer. Programmers and other IT people usually have one handy. For regular humans, there are a number of free tools available to perform this task. Pick one out and learn it well.
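As a starting point, the diff program that ships with OSX and Linux can compare two mirrors directly. Here is a minimal sketch, assuming you have an older and a newer copy of the site in two separate directories. The function names are hypothetical.

```shell
# changed_files OLD_DIR NEW_DIR
# List files that differ between two mirrors, plus files found in only one.
changed_files() {
  # diff exits with status 1 when it finds differences; that is not an error
  diff -rq "$1" "$2" || [ $? -eq 1 ]
}

# show_changes OLD_DIR NEW_DIR RELATIVE_PATH
# Print a line-by-line (unified) diff of one file across the two mirrors.
show_changes() {
  diff -u "$1/$3" "$2/$3" || [ $? -eq 1 ]
}

# Typical use (directory and file names are placeholders):
#   changed_files mirror-old mirror-new
#   show_changes mirror-old mirror-new predictions.html
```

In the second command's output, lines starting with a minus sign were removed from the page and lines starting with a plus sign were added. Bear in mind that the -k switch rewrites links in the downloaded files, so some reported changes may come from the mirroring itself rather than from the webmaster.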

Commercial solution: Versionista

Enter Versionista, a recently launched pay service. During the U.S. presidential campaigns in 2008, it got some publicity because the John McCain presidential campaign was using this service to track changes in Barack Obama’s website, which is certainly an interesting application. Now that the election is over, public interest organizations are using it to track changes in policy as reflected on the White House website.

This service will track up to 2500 URLs for you, emailing you when any change occurs. It can even show you a side-by-side view of a changed web page, with changes between the versions highlighted. It seems to be a much more professional version of the free services I mentioned above.

You can use the service for free to track just 5 URLs (with 4 versions tracked per URL). This might be enough for certain key skeptical uses, such as tracking the “predictions” page on a given psychic’s site.

Beyond that it does get a bit pricey:

  • 30 urls: $16.00 per month
  • 100 urls: $29.00 per month
  • 200 urls: $59.00 per month
  • 400 urls: $99.00 per month
  • 1200 urls: $249.00 per month
  • 2500 urls: $499.00 per month

There are also some sub-features that are enabled as you move upward in their pricing tiers. I suspect they’ll simplify that price list over time, since it seems overly complex to me.

Looking forward, another similar tool that might become useful is Zoetrope at the University of Washington. Skeptics should keep an eye on this project for usable spinoffs.

Conclusion

Clearly any skeptic targeting someone with a web presence should consider availing themselves of web archiving tools like these. The application to psychic predictions is huge, and the use in tracking pseudoscientific claims such as those of chiropractors is also clear. The commercial tools are easy to use albeit expensive, but there are plenty of free options available as well.

In the comments, let’s discuss any tools like these you have used personally and the pros and cons that you’ve seen. I’m also interested in hearing from any skeptics who are already actively using this technique to catch the bad guys in the act. (And yes, the folks tracking the chiropractors mentioned in the opening paragraph are using this technique already; their archive of chiropractic websites is already available.)

About Tim Farley
Focused on online misinformation, Tim Farley is a software engineer, computer security expert and scientific skeptic who created the site What's The Harm. He is a Past Fellow of the James Randi Educational Foundation.

5 Responses to Web archiving is an important skeptic tool

  1. yjmbo says:

    If you collect the entire website with wget, you could do a lot worse than to use software versioning tools to examine the site for differences; these are designed to find and highlight differences for you.

    One of the easiest ways to do this is to install and use subversion (http://subversion.tigris.org/). The process would be something like :-

    Create an empty subversion repository.
    Collect a copy of the website.
    Import that website into your repository with “import in-place”.

    Now the fun stuff begins …

    Grab a new copy of the website into the same directory as the first time.
    run “svn status” — this tells you which files have changed.
    For each interesting file, run “svn diff” — this will tell you exactly which lines have changed.
    When you have finished examining things, “svn checkin” will make the new version the baseline for the next comparison. Don’t worry though, all the previous states are preserved.

    If you like a GUI, there are lots of these available; have a look at the third-party client list http://subversion.tigris.org/links.html#clients

    I like to have a dedicated website for a set of files tracked by subversion, for that I use Trac (http://trac.edgewall.org/) — this is a combination of wiki, job tracking system, and a full front-end to subversion so you can see the differences between files.

    Enjoy!

  2. yjmbo says:

    Actually, grabbing your own copy of a site is useful for analysis, but possibly not sufficient to “prove” a claim.

    After all, it is possible for you to have modified your own local copies, or simply to have made a false claim – and people used to such tactics will accuse you of employing them. How do you defend against that?

    The third-party sites such as archive.org have a defence “en masse” — they are unlikely to have modified the data they collect, due to their wide range and scope, and the unnotability (in their eyes) of any individual site. Therefore a charge of malfeasance against those services can be more easily dismissed.

    Also note that the manual recursive WGET you describe deliberately ignores robots.txt. In a litigious society, a web site owner could publish terms & conditions of use of their website, and require that robots.txt is honoured. While this may not be an enforceable contract, they can tie up the resources of any individual researcher if they choose. So make sure you collect their website from an offshore location!

  3. Tim Farley says:

    Obviously I’ve been out of the day-to-day world of programming far too long. Of course! Version control systems are an excellent way to solve this issue, and many of them are available for free just like WGET is.

    Yes the “chain of custody” problem is one that should be considered, especially in cases like Simon Singh’s where court action is involved. If the site is not archived by the Wayback Machine, the WebCite service I mention in the article might be a good option here too.

    As for running afoul of license agreements or otherwise angering the webmaster, I would suggest using the delay options on WGET to force it to pull the site very slowly, perhaps over the course of a day. Done correctly, it is highly unlikely anyone would notice that you had truly been violating the robots.txt.

  4. yjmbo says:

    Chain of custody can be managed. All you have to do is to find a service that is reliable and indifferent to you or your content — a “Trusted Third Party”

    As an example, you could prove your own local copy is untampered with, by immediately submitting timestamped checksums to a third party site.

    http://en.wikipedia.org/wiki/Trusted_timestamping

    http://www.itconsult.co.uk/stamper.htm is a very old service offering this facility, http://www.copyclaim.com/ is a new one I’ve just spotted.

  5. Pingback: Please Don’t Start Another Blog or Podcast! « Skeptical Software Tools
