Aug 6, 2015

Takeaways from the CAJ conference, Part 3

This is what five years' worth of data looks like.
David Weisz of the Toronto Star, who is a co-instructor at the University of King's College data school, led a session at the June CAJ conference titled "Dirty Data and Common Mistakes."

From the annual King's College Data School website, this is what data journalism is about:

Data journalism encompasses a lot of things these days, from the data analysis skills that have traditionally been known as computer-assisted reporting, or CAR, to computer programming to developing news applications...CAR skills can range from using a spreadsheet to re-order and make sense of a list of large salaries, to using a database program to crunch through a large inspection database to designing maps that compare poverty and crime in your community.

One example of a large dataset that might be analyzed in a spreadsheet program (typically Excel or a program like it) is this story about how eleven nurses at the Nova Scotia Health Authority are earning twice their salary, while "dozens more are earning tens of thousands of dollars more in overtime." It raises important questions about why nurses are working so hard, and whether it's safe for them to do so.

I am trying to work through a huge dataset right now (pictured above!) so I decided to go over my notes from David Weisz's session to remind myself of good data practices. A few pointers:

-All data sets are dirty. There will always be technical errors.
-Make sure you have a strong index. That is, make sure each individual record has its own unique identifier. This helps you spot duplicates and connect records within the data set.
-Check to see that each individual identity number comes up only once.
-Relying on scraped data is a great way to be wrong.
-Don't be afraid to check things that seem wrong against other sources, such as communications staff, FOI staff and annual reports.
-If you feel you may be wrong about the conclusions you are drawing from your data, embrace your fear. You may well be wrong.
-Outliers: do they catastrophically affect the story? Has someone screwed up somewhere? If you can work around it, do. Give ranges instead of exact figures, if necessary to remain accurate.
-Know the weaknesses of your data and be prepared to defend your findings in spite of them.
-Get and give as much context for the data as possible.
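
For anyone working in code rather than a spreadsheet, two of the pointers above (each identity number should come up only once, and outliers deserve a second look) can be sketched in a few lines of Python. This is only a rough illustration, not something shown in the session. The field names `employee_id` and `overtime`, the sample rows and the "ten times the median" cutoff are all made up for the example.

```python
from collections import Counter
from statistics import median


def duplicate_ids(records, id_field="employee_id"):
    """Return the IDs that appear more than once, with how often they appear."""
    counts = Counter(row[id_field] for row in records)
    return {key: n for key, n in counts.items() if n > 1}


def flag_outliers(records, value_field="overtime", factor=10):
    """Return rows whose value is more than `factor` times the median.

    The cutoff is an arbitrary assumption; it only marks rows worth
    checking with another source, it does not prove they are wrong.
    """
    mid = median(row[value_field] for row in records)
    return [row for row in records if mid > 0 and row[value_field] > factor * mid]


# Made-up rows for illustration only; not taken from any real dataset.
sample = [
    {"employee_id": "A1", "overtime": 4000},
    {"employee_id": "A2", "overtime": 5200},
    {"employee_id": "A2", "overtime": 5200},   # same ID entered twice
    {"employee_id": "A3", "overtime": 4800},
    {"employee_id": "A4", "overtime": 480000},  # possibly two extra zeros
]

print(duplicate_ids(sample))
print(flag_outliers(sample))
```

Anything either function turns up is a question to take to another source, not an answer in itself.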
