Dan Nguyen of the nonprofit investigative site ProPublica.org worked with reporters Charles Ornstein and Tracy Weber to produce the “Dollars for Docs” package looking at doctors and the money they receive from pharmaceutical companies for speaking about their products.
For the series, Dan compiled the data that became the meat for the collaborative work between ProPublica, NPR, the Chicago Tribune, The Boston Globe, Consumer Reports and PBS’s Nightly Business Report. (Check out the Nieman Journalism Lab’s article about the project.)
Today’s Tip: Use Google Refine to clean up messy data, Dan says.
Google Refine, known as Freebase Gridworks until acquired by Google, works similar to an Excel spreadsheet, except it has built-in functions that can match entries that appear to have similar values, he says. “This can vary from comparing values by ignoring capitalization and punctuation (to a computer program, ‘MARIA’ is wholly different than ‘Maria’) to matching up values by their similarities in pronunciation, i.e. ‘Jenny’ and ‘Jenni,’” Dan says.
As a news-applications developer, he says he needed the program because of inconsistent formats and the lack of unique identifiers. The data came from seven drug companies that have begun posting doctors’ names and compensation on their websites, some as the result of legal settlements with the federal government. “ProPublica took these disclosures, totaling $257.8 million to about 17,700 providers, and assembled them into a single, comprehensive database that allows patients to search for their physician,” writes senior editor Tom Detzel in a piece about the data behind the stories.
“There is no unique identifier that allows us to say that so-and-so doctor got $x amount of money from Company A and $y amount from Company B,” Dan says. “You can’t just do a query that asks “match all entries where the name is ‘John R. Smith’ because in one report, that doctor will be named ‘John Smith’ and in the other, ‘John Ronald Smith.’”