May 8, 2012
Since the beginning of this blog, we’ve been talking about ways to re-use and mash up data that already exists online. This is the core of what the programmable web is about, and there are many potential data sources to use. Figuring out ways to use them that advance skepticism and critical thinking is the key.
Journalists were among those who noticed the utility of re-using existing data this way. At the same time that these fantastic web APIs and tools became available, governments and other public institutions moved to open up many of their massive public-domain databases for use by the public. When these datasets contain information that might bear on policy issues and decisions, they are potential gold mines for journalists.
This has kicked off a trend called data-driven journalism. Simply put, it is journalists using data mining and other data analysis techniques in order to find the basis for stories. I think skeptics could learn from the techniques of data-driven journalism, and use them for our purposes too. Indeed, I’ve done some very small experiments in that direction in my metrics articles.
Beware: it’s not the easiest thing in the world to get right. There are definitely many ways you can be tripped up if you aren’t careful. But I think if you are careful there are some interesting techniques here that will be helpful to skeptics.
So let’s explore what it would mean to do data-driven skepticism.
What is data-driven journalism?
Data-driven journalism is a complex topic, but fortunately quite a bit has been written about it. Just in the last week, attention has been paid to the release of a new handbook for data journalists as well as Google supporting data journalism through grants and prizes. A number of news organizations such as the New York Times and The Guardian have embraced data journalism in the creation of new stories.
An extremely simplistic overview of the topic might read:
- Gather data from:
- Government databases
- Freedom of information requests
- Publicly available online data
- Crowdsourced data gathering efforts
- Analyze the data using:
- Free and open source software tools
- Custom purpose-built software
- Programmers working with journalists
- Journalists who know how to program
- Visualize and present the data
- Allow your readers to interact with the data
- Tell a news story around the data
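The workflow in the list above can be sketched in a few lines of Python. Everything here is invented for illustration (the CSV columns and values are hypothetical, standing in for a government open-data export), but the gather → analyze → present shape is the same:

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export from a government open-data portal;
# the column names and rows are invented for illustration.
raw_csv = """region,report_type,year
North,complaint,2011
North,complaint,2012
South,inquiry,2011
South,complaint,2012
"""

# Step 1: gather -- parse the raw data into records.
records = list(csv.DictReader(io.StringIO(raw_csv)))

# Step 2: analyze -- count reports per region.
per_region = Counter(row["region"] for row in records)

# Step 3: present -- print a simple summary a story could build on.
for region, count in sorted(per_region.items()):
    print(f"{region}: {count} reports")
```

In a real project the analysis step would be far more involved (and far more error-prone, as the next section discusses), but even a tally like this can surface the outline of a story.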
There are many forms this can take and many types of stories that can be told. It is certainly a very interesting frontier to explore at a time when journalism is facing many challenges.
Beware: This is Science, and Science is Hard
But all is not sweetness and light here. Just as in other forms of journalism, it is very possible to get the story wrong. In fact, since data-driven journalism often involves statistical analysis of complex and incomplete data from sometimes poorly understood sources, it is possible to get this wildly, terribly wrong.
Ben Goldacre pointed this out when The Guardian published an erroneous analysis of the rate of prescription of anti-depressants, and then repeated the error in more stories months later. Among other things, he objected that there were alternate explanations for the statistics which had already been documented in the scientific press. By focusing on the numbers instead of doing their homework, the journalists came up with a conclusion that simply was not supported by the data they were using.
Skeptics should take note of this, and take care not to fall into the same traps.
What is data-driven skepticism?
So what form would data-driven skepticism take? Unfortunately there aren’t a ton of giant government databases of bigfoot sightings or psychic reading reports to mine for trends, the way there are for crime statistics or government budgets. (There are exceptions: for instance, the UK Ministry of Defence and US Air Force projects that tracked UFO reports might yield some interesting numbers.) So where are we going to get the data?
There are many sources of data in the online realm that can be used instead. I wrote about some of them in my article on “skeptic metrics” at the JREF blog last October. They include:
- Google Insights for Search for popularity of search terms
- Google Ad Planner for demographic & rough traffic levels for believer web sites
- Sites such as YouTube and Wikipedia make page view counts publicly available
- Sites such as WOT, YouTube and Yelp include votes of approval or disapproval
These statistics can be gathered manually, by automated “scraping” techniques, or even via documented APIs, and then compiled for statistical analysis.
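The “scraping” step is often just pattern-matching against a saved page. Here is a minimal sketch using only the standard library; the HTML markup and class names are invented (a real site’s markup would differ, and a robust scraper would use a proper HTML parser rather than a regular expression):

```python
import re
from statistics import mean

# Hypothetical saved HTML from a review page; the markup and the
# "stars" class name are invented for illustration.
saved_page = """
<div class="review"><span class="stars">4</span></div>
<div class="review"><span class="stars">2</span></div>
<div class="review"><span class="stars">5</span></div>
"""

# A minimal scraping step: pull the numeric ratings out of the markup.
ratings = [int(m) for m in re.findall(r'class="stars">(\d)</span>', saved_page)]

print(f"{len(ratings)} ratings, average {mean(ratings):.2f}")
```

Always check a site’s terms of service before scraping it, and prefer a documented API when one exists.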
Skeptics could also crowd-source the gathering of numbers that don’t already exist online. For example, estimates of the number of attendees at psychic shows could be made by skeptics local to each show, then pooled to gauge the success of a psychic on tour.
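Pooling those crowd-sourced estimates is straightforward. A minimal sketch, with entirely invented venues and numbers standing in for submissions from local skeptics:

```python
from statistics import mean, median

# Hypothetical attendance estimates submitted by local skeptics for
# each stop on a (fictional) psychic's tour.
tour_estimates = {
    "Leeds":   [420, 450, 400],
    "Bristol": [310, 295],
    "Glasgow": [500, 520, 480, 510],
}

# Pool each venue's estimates; the median damps the effect of any
# single wild over- or under-estimate.
for venue, estimates in tour_estimates.items():
    print(f"{venue}: median {median(estimates):.0f}, mean {mean(estimates):.1f}")
```

Reporting the median alongside the mean is a cheap sanity check: if the two diverge sharply for a venue, one of the estimates probably deserves a second look.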
A real-world example might help make it clearer.
An Example, Courtesy of Simon Singh
You may recall that Sally Morgan, a prominent UK celebrity psychic, got much attention last September. Accusations were made that she used an earpiece while on stage to receive instructions which improved her readings. Morgan denied the allegations, saying the earpiece was only used to communicate technical matters and stage directions, as is often done in television.
As this story developed, writer and skeptic Simon Singh attended several of her shows, and noted that she had stopped wearing her earpiece. That made him wonder: might there be a way to see if the earpiece made a difference? If only we could measure how accurate her readings were with and without it.
Well, doing that directly would be difficult, but perhaps online data sources offer a proxy measurement. Simon found that the ticketing site (Ticketmaster) allowed attendees to review a show after attending. The votes are right there on the site and can be gathered and averaged. When he did so and charted them, this was the result:
Clearly the people rating Sally Morgan on Ticketmaster were markedly less satisfied with her show late in the year than they had been before the controversy.
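The gather-and-average step behind a chart like this is simple to sketch. The ratings and dates below are invented for illustration (they are not Simon Singh’s actual data), and the controversy date is approximate:

```python
from datetime import date
from statistics import mean

# Hypothetical (rating, date) pairs like those gathered from a
# ticketing site; all values invented for illustration.
reviews = [
    (5, date(2011, 6, 10)), (4, date(2011, 7, 2)), (5, date(2011, 8, 20)),
    (2, date(2011, 10, 5)), (3, date(2011, 11, 14)), (2, date(2011, 12, 1)),
]

controversy = date(2011, 9, 1)  # approximate date the story broke

# Split the ratings into before/after groups and compare the averages.
before = [r for r, d in reviews if d < controversy]
after = [r for r, d in reviews if d >= controversy]

print(f"before: {mean(before):.2f} ({len(before)} reviews)")
print(f"after:  {mean(after):.2f} ({len(after)} reviews)")
```

With samples this small, the difference between the two averages could easily be noise, which is exactly why the caveats below matter.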
Now Simon is well aware of the problems with these types of analyses (such as those pointed out by Ben Goldacre). He points out there could be several other explanations such as nervousness caused by the controversy, malicious negative voting on the ticket site and more. But it certainly could also mean that the earpiece was important to the quality of Sally Morgan’s readings.
You can read Simon’s analysis in its entirety, including his caveats about the conclusion, here. It is an interesting result gathered from freely available data, and is a fine example of data-driven skepticism.
I think Simon’s analysis is just the beginning. Data-literate skeptics (who are appropriately wary of possible statistical pitfalls) should consider emulating data-driven journalists. A little programming, combined with some data mining or FOIA requests, could yield interesting skeptical insights.
Be sure to check out the Data-Driven Journalism Handbook for tips on how to proceed.