Dan Nguyen of the nonprofit investigative site ProPublica.org worked with reporters Charles Ornstein and Tracy Weber to produce the “Dollars for Docs” package looking at doctors and the money they receive from pharmaceutical companies for speaking about their products.
For the series, Dan compiled the data that became the meat for the collaborative work between ProPublica, NPR, the Chicago Tribune, The Boston Globe, Consumer Reports and PBS’s Nightly Business Report. (Check out the Nieman Journalism Lab’s article about the project.)
Today’s Tip: Use Google Refine to clean up messy data, Dan says.
Google Refine, known as Freebase Gridworks until acquired by Google, works similar to an Excel spreadsheet, except it has built-in functions that can match entries that appear to have similar values, he says. “This can vary from comparing values by ignoring capitalization and punctuation (to a computer program, ‘MARIA’ is wholly different than ‘Maria’) to matching up values by their similarities in pronunciation, i.e. ‘Jenny’ and ‘Jenni,’” Dan says.
As a news-applications developer, he says he needed the program because of inconsistent formats and the lack of unique identifiers. The data came from seven drug companies that have begun posting doctors’ names and compensation on their websites, some as the result of legal settlements with the federal government. “ProPublica took these disclosures, totaling $257.8 million to about 17,700 providers, and assembled them into a single, comprehensive database that allows patients to search for their physician,” writes senior editor Tom Detzel in a piece about the data behind the stories.
“There is no unique identifier that allows us to say that so-and-so doctor got $x amount of money from Company A and $y amount from Company B,” Dan says. “You can’t just do a query that asks “match all entries where the name is ‘John R. Smith’ because in one report, that doctor will be named ‘John Smith’ and in the other, ‘John Ronald Smith.’”
Using the program helped Dan whittle down more than 30,000 entries to more than 380 unique names with total payments worth $100,000 or more, he says. But even with the sorting, reporters had to call every doctor that made the top-earners list to confirm identities.
“I’ve also used this method in looking at campaign-donation records, to find people who have donated several times but used variations in their name,” Dan says. “Gridworks/Google Refine is a great all-purpose tool for any reporter combing through spreadsheets.”
ProPublica has a free widget that you can put on your website to search by name for docs in your state who’ve received money. Here’s how to get the embed code for the widget.