In the words of its creators, SQLite is a self-contained, high-reliability, embedded, full-featured, public-domain, SQL database engine. And apparently is the most used in the world.
Libraries exist for interfacing R with SLQLite, the minimum requirement being DBI (A Common Database Interface) and RSQLite (SQLite interface for R).
The keyword here is “embedded”. You do not need any other external library or program to access your data. The library is the database engine and that becomes part of your program.
Database Creation
I will not enter in the discussion of why in some cases we would prefer a database instead of a R save files, this is beyond the scope of this article. I will also not discuss the theory of relational databases, there is plenty of literature about that. What I would like to show instead is a specific use case, using R. In particular, I would like to focus on the simple tasks of Selecting data, Inserting or updating data. To create my database, I use a tool, the excellent and free DB Browser for SQLite, which not only allows you to create databases, tables and indexes but also allows to save the database “schema” in a text file, like for instance the following, which is taken from my Data Science Capstone project.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
BEGIN TRANSACTION; CREATE TABLE `UNIGRAM` ( `W1` TEXT NOT NULL UNIQUE, `count` INTEGER, PRIMARY KEY(`W1`) ); CREATE TABLE `TRIGRAM` ( `W1` TEXT NOT NULL, `W2` TEXT NOT NULL, `W3` TEXT NOT NULL, `count` INTEGER, PRIMARY KEY(`W1`,`W2`,`W3`) ); CREATE TABLE `QUADGRAM` ( `W1` TEXT NOT NULL, `W2` TEXT NOT NULL, `W3` TEXT NOT NULL, `W4` TEXT NOT NULL, `count` INTEGER, PRIMARY KEY(`W1`,`W2`,`W3`,`W4`) ); CREATE TABLE `BIGRAM` ( `W1` TEXT NOT NULL, `W2` TEXT NOT NULL, `count` INTEGER, PRIMARY KEY(`W1`,`W2`) ); COMMIT; |
Another way of creating a RSQLite table is to write a R dataframe to a RSQLite table. An operation that looks like this :
1 2 3 4 5 6 7 8 9 10 11 12 |
library(DBI) library(RSQLite) # Connecting to a SQLite database (notice, it does not need to exist at this point) con <- dbConnect(RSQLite::SQLite(), "iris.sqlite") # write a Data Frame into a SQLIte (this shall create the physical file) dbWriteTable(con, "iris", iris) # A query to get some data back dbGetQuery(con, "SELECT * FROM iris WHERE [Petal.Width] > 2.3") dbDisconnect(con) |
This is taken directly from the DBI package initial manual page.
Disadvantages:
- No primary key defined
- No indexes defined
In practice this approach works, but for any big size / real life application database, some indexes to optimize query speed are needed and some primary key should be defined, and I prefer to do this with an external tool. Here it is a screenshot of the tool, after the creation of the table with the dbWriteTable command. Notice the “Create Table” statement by the side of the table name.
Querying a database and inserting rows from R
The SQL command SELECT behaves as expected. As in other languages, you can build up text strings that you can then send to the database via the dbGetQuery command. Attention must be put on the quotes (single and double) especially when querying text fields. Examples:
1 2 3 4 5 6 |
petalWidth<-2.3 # Querying the database res <- dbGetQuery(con, paste("SELECT * FROM iris WHERE [Petal.Width] =", petalWidth)) # Exploring the results res |
Now, res is the return value from the database query and it is a list of lists (easily seen using typeof(res)). To know how many records you have received back from your query, the list can be interrogated in many ways. Here below they are resumed:
1 2 3 4 5 6 7 8 |
# This shall return "list" typeof(res) # This shall return 5. These are the fields of the response, not the number of records length(res) # This shall return you the number of records and the name of the first field. lenghts(res)[1] # This is an alternative way as.numeric(nrow(res)) |
The insertion and update of a record into a SQLite database have been improved. Beyond the INSERT and the UPDATE commands, the REPLACE command (which did not exist in my database times), works as follows: If the record being updated exists, then is simply replaced, otherwise it is inserted. This is very useful when updating fields, you can use a single SQL command and either use update counts or simply the new values, so in the simplest cases you can use it as an INSERT command. Here a few examples of how you can build a query. Notice that the queries can be tested for correctness with the DB Browser for SQLite tool before implementation in your script.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
# Setting up variables sepalLength<-6 sepalWidth<-4 petalWidth<-3 petalLength<-5 species<-"testing" # Example insertion query built from the variables query<-paste("INSERT into iris ('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', Species) values(", sepalLength, ", ", sepalWidth, ", ",petalLength, ", ", petalWidth, ", '", species, "');", sep = "") # Execute the insertion query dbGetQuery(con, query) # Selection query res # Verify the results res # Use now the REPLACE command instead # Changing petal width petalWidth<-4 query<-paste("REPLACE into iris ('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', Species) values(", sepalLength, ", ", sepalWidth, ", ",petalLength, ", ", petalWidth, ", '", species, "')", sep = "") # Execute REPLACE the query dbGetQuery(con, query) # Selection query res res # NOTICE! We did not define a primary key, there the REPLACE behaves exactly as a INSERT and we get a new record. # Finally, a delete query statement query = "DELETE FROM iris WHERE [Species] ='testing'" # And as usual the execution dbGetQuery(con, query) |
And that is all. As usual, very basic stuff, but I would like to say that the methodology is here the key, to know when to work from within R or RStudio and when to craft the characteristics of the database from outside with and external tool. If you want a bit more complex example of a shiny application entirely based on this technique (and not more than what I just explained), you can visit this page, which is the final submission for my Data Science Capstone project. The source code is available on the third tab of the application.