Data Science Specialization it is!

I just went through the experience of completing the Johns Hopkins University Data Science Specialization on Coursera.

The last course of this specialization was the Capstone project, which basically consists of learning about a new subject, Natural Language Processing (NLP for short), and producing a Shiny application hosted on Shinyapps.io that predicts the next word a user is going to type, based on their previous input. The prediction is performed by a model built from a set of texts provided by the course instructors.
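To give an idea of what such a prediction boils down to, here is a minimal sketch in R (an illustration only, not the model I actually built), assuming a toy trigram frequency table where `prefix` holds the two preceding words:

```r
# Illustration only: a toy trigram table; the real model is built from the
# course corpora and is far larger.
trigrams <- data.frame(
  prefix = c("thanks for", "thanks for", "see you"),
  word   = c("the", "your", "soon"),
  n      = c(52, 17, 9),
  stringsAsFactors = FALSE
)

# Return the k most frequent continuations of the last two words typed
predict_next <- function(input, ngrams, k = 3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  key   <- paste(tail(words, 2), collapse = " ")
  hits  <- ngrams[ngrams$prefix == key, ]
  head(hits[order(-hits$n), "word"], k)
}

predict_next("Thanks for", trigrams)   # "the" "your"
```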

The course itself lasts 7 weeks and unfolds in logical steps, tackling one at a time the various aspects of dealing with a task of this type. You have to understand the data, and therefore you are required to produce a milestone report in which an exploratory data analysis is performed and a hint of a first approach to the final model can be given (not everyone does that).
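As a taste of what that first step looks like, here is a minimal sketch of sampling one of the corpus files for the exploratory analysis (the file name is one of the provided corpus files; the 5% sampling rate is an arbitrary choice for illustration):

```r
# Sketch: read one corpus file and keep a random ~5% sample of its lines,
# since the full files are too large to analyse comfortably.
set.seed(123)
con   <- file("en_US.twitter.txt", "r")
lines <- readLines(con, skipNul = TRUE)
close(con)

sample_lines <- lines[rbinom(length(lines), 1, prob = 0.05) == 1]
length(sample_lines)          # roughly 5% of the original lines
summary(nchar(sample_lines))  # quick look at line lengths
```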

The milestone report and the final submission, which consists of a slide deck and the Shiny application itself, are peer reviewed. There are also two graded quizzes, which are helpful for understanding whether the model you are building is going in the right direction.

This is a really challenging task, for the following reasons:

  • There are no videos walking you step by step to the solution, so this is really something you have to put together on your own
  • The data set (the text corpora) is huge, so performance matters
  • The choice of R libraries to use is left entirely to you
  • You may have beautiful plans for your application, but you have to concentrate on speed (efficiency) and usability
  • You will be pressed for time, no matter what

My application (the one I submitted) can be visited here. It received a rating of 10/11 and I am rather proud of it, especially considering that, pressed for time, I had to cut a few features I would have liked to include. I saw some of those features in other peers' submissions during the review, and they really looked good.

Most of the source code is embedded in the application and can be viewed through the tab I put at the top of the user interface. I only left out the SQLite DB schema, which is easy to create with one of the free tools available; I used DB Browser for SQLite. I intend to write another article explaining how to use SQLite databases in R and a fast, efficient way to create them.
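As a small preview of that article, and only as a sketch (the database file, table, and column names below are made up for illustration, not my actual schema), querying an SQLite database from R with DBI/RSQLite looks roughly like this:

```r
# Sketch only: illustrative file, table, and column names.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "ngrams.db")

# Look up the most frequent continuations for a given two-word prefix
res <- dbGetQuery(
  con,
  "SELECT word, n FROM trigrams WHERE prefix = ? ORDER BY n DESC LIMIT 3",
  params = list("thanks for")
)

dbDisconnect(con)
res
```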

The last thing I want to say is a big THANKS to all the reviewers, THANKS to the instructors, and finally THANKS to my family for supporting me, especially while I worked on the final submission during the winter holiday period.