A list of data-scraping resources was among the tools most requested by our free training attendees.
With billions of dollars pumping through the pharmaceutical industry each year and doctors making unknown profits, one question for investigative journalists has remained: How do we follow the money trail?
Last year ProPublica reporters Dan Nguyen, Charles Ornstein and Tracy Weber decided to apply their collective understanding of both the health care industry and data-scraping by creating a tool for tracking the millions of dollars transferred from drug companies to doctors across the United States.
The project is called “Dollars for Docs,” and its database currently includes seven drug companies whose 2009 sales totaled $109 billion and made up approximately 36% of the market share, according to the site.
The initial idea flowed from their experience with existing public datasets and databases, some more cumbersome than others.
“I was writing a data-scraping tutorial and needed a timely example of a source of data that was public, but difficult to use,” Nguyen says. “Pfizer had just published their data on a website that was difficult to analyze beyond searching for individual names, so it was a great example.”
Nguyen, who had used data-scraping tools for simpler projects, found his interest piqued further when he saw The New York Times tackle the issue of drug companies and their confusing data formats. According to Nguyen, the pharmaceutical company Eli Lilly said it didn’t make its data easier to download for fear that others would access the information, manipulate it and then republish the manipulated data.
“That seemed like such a weak excuse,” Nguyen says, “and since the company had been forced to disclose their doctor payments, I felt that it was a worthwhile project to bring more transparency to this issue.”
“Dollars for Docs” covers all 50 states, the District of Columbia and Puerto Rico. In addition, ProPublica tracks doctors from the 30 most populous states who remain in the payment database despite having been sanctioned or flagged by the U.S. Food and Drug Administration.
So, how can other journalists apply data-scraping to their own beat?
Nguyen encourages beginners who are starting from scratch to try Web-scraping, as its productive results provide the motivation needed to push through the inevitable obstacles.
“I’ve felt that Web-scraping is one of the best ways for journalists and researchers to learn programming,” Nguyen says. “It’s relatively easy, and it’s immediately useful once you get your program working.”
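A first scraper can be surprisingly small. As a minimal sketch of the kind of exercise Nguyen describes (the HTML snippet and the doctor/payment fields here are invented for illustration, not taken from any actual disclosure page), Python’s standard library alone can pull rows out of an HTML table:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped into table rows."""
    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = []      # cells of the row being read
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

# A stand-in for a fetched disclosure page (invented example data);
# in practice the HTML would come from an HTTP request.
page = """
<table>
  <tr><td>Dr. A. Example</td><td>Speaking</td><td>$2,000</td></tr>
  <tr><td>Dr. B. Sample</td><td>Consulting</td><td>$1,500</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```

Once a parser like this works against a saved page, swapping in live pages and writing the rows to a spreadsheet or database is a small step, which is what makes the exercise so immediately useful.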
Open-source software has both benefits and drawbacks; Nguyen believes results will vary, and that much of the benefit hinges on the strength of the software’s supporting community.
“You don’t have the paid technical support right out of the box,” Nguyen explains. “But if there’s an active development group, such as the one behind Ruby or Google Refine, you may get fast and useful help in the discussion groups.”
In addition to providing useful tools to check out, Nguyen and the rest of the ProPublica team have published “A Guide to the Guides,” which offers helpful tips and instructions for journalists of all skill levels.
Nguyen recommends that journalists think through what they expect to gain from the data-scraping tools they’ll be using. “As in any kind of software development,” Nguyen advises, “it’s always worth it to invest time upfront to really study the kind of data you’ll be collecting, and develop a good process to manage the data and keep track of editing it.”
When “Dollars for Docs” was still in the development stages, the team didn’t know how all of the companies would report their data. Some companies chose to release payment amounts in categories like “consulting” or “speaking,” while others provided a total sum for each doctor.
Nguyen adds, “If you haven’t thought of some way to organize the data that’s flexible enough to handle the different situations, it’s a major pain to backtrack.”
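The reporting-format problem the team hit, itemized categories from some companies versus a single lump sum from others, can be handled by normalizing every record into one flat row per payment. This is a hedged sketch of that idea, not the team’s actual schema; the record shapes and field names are invented:

```python
# Hypothetical records in the two shapes described in the article:
# itemized by category, or a single total per doctor.
raw = [
    {"doctor": "Dr. A. Example", "company": "X",
     "payments": {"consulting": 1500, "speaking": 2000}},
    {"doctor": "Dr. B. Sample", "company": "Y", "total": 900},
]

def normalize(record):
    """Flatten either shape into rows of (doctor, company, category, amount)."""
    if "payments" in record:
        for category, amount in record["payments"].items():
            yield {"doctor": record["doctor"], "company": record["company"],
                   "category": category, "amount": amount}
    else:
        # Lump-sum disclosures get a placeholder category so they
        # still fit the same table as itemized ones.
        yield {"doctor": record["doctor"], "company": record["company"],
               "category": "unspecified", "amount": record["total"]}

rows = [row for rec in raw for row in normalize(rec)]
```

Deciding on a flexible target shape like this before scraping begins is exactly the kind of upfront investment that avoids the painful backtracking Nguyen warns about.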
He credits team members Ornstein and Weber, whose experience working with databases helped the team manage the variety of information they received.
“It’s not always possible for everyone in a reporting project to be experts in all the journalistic aspects,” Nguyen says. “But the more holistic everyone is, the easier it is to coordinate the traditional reporting, data-gathering and analysis.”
While today’s technological advancements have made processes like data-scraping more accessible to “the average non-techie,” Nguyen cautions against complacency, saying there needs to be “the same kind of astute curiosity and attention to detail as traditional reporting.”
Nguyen points out that journalists are now able to publish data, while opening it up for others to examine and expand upon with their own resources.
“Before the Internet was a viable publishing platform,” he says, “our printed story would’ve just stated our summary numbers in the lead paragraphs, along with some general graphs. In this case, the data itself is interesting and valuable to the public, researchers and other reporters.”
As for the future of data-scraping, Nguyen predicts that, with the overall increase in computer literacy and the continued lowering of barriers across the board, journalists will be able to access even more information digitally.
“The reporters,” he says, “who put in some time to train themselves past an average level of digital competency will always have an edge.”