Dplyr Cheat Sheet



Combining dplyr‘s groupby and mutate functions allows for the creation of new variables that involve per-group metrics in their calculation. Grouping a data frame by a selection of columns followed by a call to mutate allows for the creation of new columns based on per-group summary functions. Given a students data frame, to add a new column containing the difference between a. Work with strings with stringr:: CHEAT SHEET Detect Matches strdetect(string, pattern) Detect the presence of a pattern match in a string. Strdetect(fruit, 'a') strwhich(string, pattern) Find the indexes of strings that contain a pattern match. Strwhich(fruit, 'a') strcount(string, pattern) Count the number of matches in a string.

In a previous post, I described how I was captivated by the virtual landscape imagined by the RStudio education team while looking for resources on the RStudio website. In this post, I’ll take a look atCheatsheets another amazing resource hiding in plain sight.

Apparently, some time ago when I wasn’t paying much attention, cheat sheets evolved from the home made study notes of students with highly refined visual cognitive skills, but a relatively poor grasp of algebra or history or whatever to an essential software learning tool. I don’t know how this happened in general, but master cheat sheet artist Garrett Grolemund has passed along some of the lore of the cheat sheet at RStudio. Garrett writes:

One day I put two and two together and realized that our Winston Chang, who I had known for a couple of years, was the same “W Chang” that made the LaTex cheatsheet that I’d used throughout grad school. It inspired me to do something similarly useful, so I tried my hand at making a cheatsheet for Winston and Joe’s Shiny package. The Shiny cheatsheet ended up being the first of many. A funny thing about the first cheatsheet is that I was working next to Hadley at a co-working space when I made it. In the time it took me to put together the cheatsheet, he wrote the entire first version of the tidyr package from scratch.

It is now hard to imagine getting by without cheat sheets. It seems as if they are becoming expected adjunct to the documentation. But, as Garret explains in the README for the cheat sheets GitHub repository, they are not documentation!

RStudio cheat sheets are not meant to be text or documentation! They are scannable visual aids that use layout and visual mnemonics to help people zoom to the functions they need. … Cheat sheets fall squarely on the human-facing side of software design.

Cheat sheets live in the space where human factors engineering gets a boost from artistic design. If R packages were airplanes then pilots would want cheat sheets to help them master the controls.

The RStudio site contains sixteen RStudio produced cheat sheets and nearly forty contributed efforts, some of which are displayed in the graphic above. The Data Transformation cheat sheet is a classic example of a straightforward mnemonic tool.It is likely that even someone who just beginning to work with dplyr will immediately grok that it organizes functions that manipulate tidy data. The cognitive load then is to remember how functions are grouped by task. The cheat sheet offers a canonical set of classes: “manipulate cases”, “manipulate variables” etc. to facilitate the process. Users that work with dplyr on a regular basis will probably just need to glance at the cheat sheet after a relatively short time.

The Shiny cheat sheet is little more ambitious. It works on multiple levels and goes beyond categories to also suggest process and workflow.

The Apply functions cheat sheet takes on an even more difficult task. For most of us, internally visualizing multi-level data structures is difficult enough, imaging how data elements flow under transformations is a serious cognitive load. I for one, really appreciate the help.

Cheat sheets are immensely popular. And even in this ebook age where nearly everything you can look at is online, and conference attending digital natives travel light, the cheat sheets as artifacts retain considerable appeal. Not only are they useful tools and geek art (Take a look at cartography) for decorating a workplace, my guess is that they are perceived as runes of power enabling the cognoscenti to grasp essential knowledge and project it in the world.

When in-person conferences resume again, I fully expect the heavy paper copies to disappear soon after we put them out at the RStudio booth.

4 min read2020/04/16

Motivation

I use R to access data held in Microsoft SQL Server databases on a daily basis. As a result of running into problems, I’ve realized I don’t have an understanding of the roles different components, notably dplyr, dbplyr, odbc and dbi each play in the process. This contributes to the fact that my efforts to resolve or mitigate issues are often an inefficient combination of Google searches and trial and error. Additionally, as I find or develop workarounds, I am unwilling to promote them with others because I don’t fully understand the cause of the issue and, as a result, I am not confident I have addressed the problem at the appropriate level.

Dplyr Cheat Sheet

Specifically, this is most motivated by the, “Invalid descriptor index,” error documented here.

I am writing what I learn to solidly my thinking and to help others who may experience the same challenge.

Scope

Given that the source of my motivation is encountering problems when working with data in Microsoft SQL Server databases, and that I prefer to use packages in the tidyverse, this investigation will be focused on how these packages work together to collect data from databases managed by Microsoft SQL server.

I won’t be writing about other options like RODBC, RJDBC or database-specific packages.

I will also not include comments about database management systems (DBMS) other than Microsoft SQL Server.

While it’s possible to generalize many of the concepts I write about here to other DBMS systems, I will not explicitly call them out. There are plenty of resources that do that. I aim to be very focused on how these components interact in a tidyverse and Microsoft SQL Server environment in the hopes it will help paint a simpler, clearer picture for others working in that same configuration.

Additionally, I almost never write data to a DBMS, and I suspect this is the case for many people working as analysts in Enterprise environments. In light of that I will be focused on how these components work together to extract data from SQL, and not how they write data to it.

Role of dplyr and dbplyr packages

dbplyr is the database back-end for dplyr - it does not need to be loaded explicitly, it is loaded by dplyr when working with data in a database.

dbplyr translates dplyr syntax into Microsoft SQL Server specific SQL code so dplyr can be used to retrieve data from a database system without the need to write SQL code.

dbplyr relies on the DBI and odbc packages as an intermediaries for connections with a SQL Server database.

String R Cheat Sheet

Dplyr can also pass on explicitly written (not translated from dplyr) SQL code to DBI.

dbplyr generates, or captures, the SQL code that is then passed into the front-end of the database stack provided by DBI, odbc and the ODBC driver

Role of the DBI package

DBI segments the connectivity to the SQL database into a, “front-end,” and a, “back-end.”

DBI implements a standardized front-end to dbplyr, and the odbc package acts as a driver for DBI to interface with SQL Server.

An example of front-end functionality provided by DBI…

  • connect/disconnect to the database
  • create and execute statements in the DBMS
  • extract results/output from statements
  • error/exception handling
  • information (meta-data) from database objects
  • transaction management (optional)

I think of the DBI package as the front end for interactive user at the R console, a script or package, into the other components, odbc and an ODBC driver, that make it possible to extract data from SQL server.

Tidyr

Role of the odbc package

The odbc package provides the DBI back-end to any odbc driver connection, including those for Microsoft SQL Server.

This enables a connection to any database with ODBC drivers available.

I think of the odbc package as the “back-end” of DBI and the “front-end” into the ODBC driver.

Role of ODBC drivers

Open Database Connectivity (ODBC) drivers are the last leg of the link between dplyr and SQL Server. They are what enable the odbc package to interface with SQL server.

I think of the SQL Server ODBC driver as the “front-end” into SQL server, and again, back to the user, script or package

Dplyr Cheat Sheet Rstudio

Summary

  • Dbplyr translates dplyr syntax into SQL code, or captures explicitly written SQL, and hands it off to DBI
  • DBI uses odbc to interface with the SQL Server ODBC driver
  • ODBC driver communicates with SQL server and retrieves results

User or package code -> DBI -> odbc -> SQL Server ODBC driver -> SQL server

Sources and notes

I haven’t written anything new here, just focused it on the configuration I use day to day in the hopes it helps someone else.

Most was gleaned from the following and I’d recommend reviewing them for a broader perspective, and deeper insights into specific areas:

  • vignette('DBI', package = 'DBI')
  • vignette('dbplyr', 'package = 'dbplyr)