ProPublica used Google Refine to clean up data

October 21, 2010

Share this article:

In this Chicago Tribune photo, Dr. Merle Diamond (right) was one of the Illinois doctors getting the most money from drug companies, receiving $148,300 this year and last, mostly from speaking on behalf of GlaxoSmithKline. “To me, it’s always been about helping physicians understand the treatments that are available for migraine,” she tells the Trib.

Dan Nguyen of the nonprofit investigative site ProPublica.org worked with reporters Charles Ornstein and Tracy Weber to produce the “Dollars for Docs” package looking at doctors and the money they receive from pharmaceutical companies for speaking about their products.

For the series, Dan compiled the data that became the meat for the collaborative work between ProPublica, NPR, the Chicago Tribune, and The Boston Globe. (Check out the Nieman Journalism Lab’s article about the project.)

Today’s Tip: Use Google Refine to clean up messy data, Dan says.

Google Refine, known as Freebase Gridworks until acquired by Google, works similar to an Excel spreadsheet, except it has built-in functions that can match entries that appear to have similar values, he says. “This can vary from comparing values by ignoring capitalization and punctuation (to a computer program, ‘MARIA’ is wholly different than ‘Maria’) to matching up values by their similarities in pronunciation, i.e.  ‘Jenny’ and ‘Jenni,’” Dan says.

As a news-applications developer, he says he needed the program because of inconsistent formats and the lack of unique identifiers. The data came from seven drug companies that have begun posting doctors’ names and compensation on their websites, some as the result of legal settlements with the federal government. “ProPublica took these disclosures, totaling $257.8 million to about 17,700 providers, and assembled them into a single, comprehensive database that allows patients to search for their physician,” writes senior editor Tom Detzel in a piece about the data behind the stories.

“There is no unique identifier that allows us to say that so-and-so doctor got $x amount of money from Company A and $y amount from Company B,” Dan says. “You can’t just do a query that asks “match all entries where the name is ‘John R. Smith’ because in one report, that doctor will be named ‘John Smith’ and in the other, ‘John Ronald Smith.’”

Using the program helped Dan whittle down more than 30,000 entries to more than 380 unique names with total payments worth $100,000 or more, he says. But even with the sorting, reporters had to call every doctor that made the top-earners list to confirm identities.

“I’ve also used this method in looking at campaign-donation records, to find people who have donated several times but used variations in their name,” Dan says. “Gridworks/Google Refine is a great all-purpose tool for any reporter combing through spreadsheets.”

ProPublica has a free widget that you can put on your website to search by name for docs in your state who’ve received money. Here’s how to get the embed code for the widget.

Author

More Like This...

Two Minute Tips

Sign up now.
Get one Tuesday.

Every Tuesday we send out a quick-read email with tips for business journalism.

Subscribers also get access to the Tip archive.

Search

Get Two Minute Tips For Business Journalism Delivered To Your Email Every Tuesday

Two Minute Tips

Every Tuesday we send out a quick-read email with tips for business journalism. Sign up now and get one Tuesday.

Our free eBook has gotten a makeover!
It now has expanded topics and downloadable chapters.