Getting and Cleaning Data

“Getting and Cleaning Data” is the third course of the “Data Science Specialization” from the Johns Hopkins Bloomberg School of Public Health on Coursera, and it was the first course in this series where a connection to some of the Data Scientist’s real tasks can be found. I imagine that data collected in the field, in various types of experiments and from different sensors, may be dirty, incomplete or plain wrong. Bringing the data into a usable form without altering their “payload” of information is what this course is about.

The course runs over 4 weeks, as usual in this series. Each week closes with a mandatory quiz, which counts towards the final marks. At the end of the third week there is a project, which involves a GitHub repository and some data to clean. Following the project rubric, you write an R script that needs to be stored in the GitHub repository mentioned above, together with a README.md and a CodeBook.md Markdown file, and you have to submit the clean data produced by your R script. The project accounts for 40% of the final marks and the evaluation is based on peer reviews. If you do not evaluate at least 4 peers, you lose 20% of your marks. There are some bonus points that you can earn if you complete an optional assignment on “Swirl”.
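To give an idea of what such a cleaning script can look like, here is a minimal sketch. The file name (raw_measurements.csv) and column names (subject, activity) are hypothetical placeholders rather than the actual assignment data, and the dplyr/tidyr approach shown is just one reasonable way to produce a tidy data set.

```r
# Minimal sketch of a cleaning script; file and column names are hypothetical.
library(dplyr)
library(tidyr)

# Read the raw data; "raw_measurements.csv" stands in for whatever the assignment provides.
raw <- read.csv("raw_measurements.csv", stringsAsFactors = FALSE)

# Tidy it: one variable per column, one observation per row,
# then summarise each measurement by subject and activity.
tidy <- raw %>%
  pivot_longer(cols = -c(subject, activity),
               names_to = "measurement", values_to = "value") %>%
  group_by(subject, activity, measurement) %>%
  summarise(mean_value = mean(value, na.rm = TRUE), .groups = "drop")

# Write the tidy data set, ready to be submitted alongside README.md and CodeBook.md.
write.table(tidy, "tidy_data.txt", row.names = FALSE)
```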

During the lessons, a few brilliant links to data resources around the world were provided.

Overall, another nice experience from Coursera: more practice with R and RStudio, and a few more insights into this immense field.