Login | Help

banner ad
6

ProPublica uses Google Refine to clean up messy data for ‘Dollars for Docs’

Chicago Tribune Dollars for Docs series with ProPublica.org
In this Chicago Tribune photo, Dr. Merle Diamond (right) was one of the Illinois doctors getting the most money from drug companies, receiving $148,300 this year and last, mostly from speaking on behalf of GlaxoSmithKline. “To me, it’s always been about helping physicians understand the treatments that are available for migraine,” she tells the Trib.

Dan Nguyen of the nonprofit investigative site ProPublica.org worked with reporters Charles Ornstein and Tracy Weber to produce the “Dollars for Docs” package looking at doctors and the money they receive from pharmaceutical companies for speaking about their products.

For the series, Dan compiled the data that became the meat for the collaborative work between ProPublica, NPR, the Chicago Tribune, The Boston Globe, Consumer Reports and PBS’s Nightly Business Report. (Check out the Nieman Journalism Lab’s article about the project.)

Today’s Tip: Use Google Refine to clean up messy data, Dan says.

Google Refine, known as Freebase Gridworks until acquired by Google, works similar to an Excel spreadsheet, except it has built-in functions that can match entries that appear to have similar values, he says. “This can vary from comparing values by ignoring capitalization and punctuation (to a computer program, ‘MARIA’ is wholly different than ‘Maria’) to matching up values by their similarities in pronunciation, i.e.  ‘Jenny’ and ‘Jenni,’” Dan says.

As a news-applications developer, he says he needed the program because of inconsistent formats and the lack of unique identifiers. The data came from seven drug companies that have begun posting doctors’ names and compensation on their websites, some as the result of legal settlements with the federal government. “ProPublica took these disclosures, totaling $257.8 million to about 17,700 providers, and assembled them into a single, comprehensive database that allows patients to search for their physician,” writes senior editor Tom Detzel in a piece about the data behind the stories.

Dan Nguyen, news applications developer, ProPublica.org

Dan Nguyen

“There is no unique identifier that allows us to say that so-and-so doctor got $x amount of money from Company A and $y amount from Company B,” Dan says. “You can’t just do a query that asks “match all entries where the name is ‘John R. Smith’ because in one report, that doctor will be named ‘John Smith’ and in the other, ‘John Ronald Smith.’”

Using the program helped Dan whittle down more than 30,000 entries to more than 380 unique names with total payments worth $100,000 or more, he says. But even with the sorting, reporters had to call every doctor that made the top-earners list to confirm identities.

“I’ve also used this method in looking at campaign-donation records, to find people who have donated several times but used variations in their name,” Dan says. “Gridworks/Google Refine is a great all-purpose tool for any reporter combing through spreadsheets.”

ProPublica has a free widget that you can put on your website to search by name for docs in your state who’ve received money. Here’s how to get the embed code for the widget.

And Dan’s colleagues, Charles and Tracy, are hosting a conference call at 3 p.m. EDT on Oct. 21 on how you can use the data for local reporting. You can sign up here.

About the Author

Rosland Gammon is a former business journalist turned college instructor. Her newsroom experience includes reporting for The Philadelphia Inquirer, and reporting and editing at Bloomberg News. Gammon currently teaches communications at Alverno College in Milwaukee. Follow her daily posts. | E-mail: Rosland Gammon

Leave a Comment

1) Register to join the community & comment or 2) Quick comment
Username: Username:
Email: Email:
Password:
Verify Password:
or 3) Login if you already have an account
Comment:

Switch to our mobile site