It’s the little things – A couple of R hints to speed up interactive work

DISCLAIMER: I’m an R newbie. Would love to hear other approaches to this as well…

 

I’m about halfway through the Data Science specialization offered by Johns Hopkins University on Coursera. Right now I’m just about wrapping up the Exploratory Data Analysis course, which focuses on basic charting and plotting in R. The final project involves loading the NEI dataset, which has about 6.5M rows and takes up about 650MB, and then creating several separate plots from it.

R makes it *very* easy to load datasets. The read* functions are great, but with a large-ish dataset the load itself can take a little while. I timed it and, on my system, it was a full 15 seconds. That may not sound like much, but 15 seconds here and there add up…
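If you want to time the load on your own machine, something along these lines should do it. This is only a sketch: `demo_file` and `demo_data` are made-up names, and the small temporary file stands in for the real summarySCC_PM25.rds so the snippet runs on its own.

```r
# Stand-in data so the sketch is self-contained; swap in the real .rds file to time it.
demo_file <- tempfile(fileext = ".rds")
saveRDS(data.frame(x = seq_len(1e5), y = rnorm(1e5)), demo_file)

# system.time() reports user, system and elapsed time in seconds.
timing <- system.time(demo_data <- readRDS(demo_file))
print(timing[["elapsed"]])
```

The elapsed value is the wall-clock time, which is the one you actually feel while waiting.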

What I found works really well for me is to just check if the large object exists before trying to load it 🙂

Something like:

if(!exists("NEI")) {
  NEI <- readRDS("summarySCC_PM25.rds")
}

It sounds really simple, but as you slowly build up your R script and re-run it in the same session (that’s how I work anyway…), the load only happens the first time, and that saves a lot of time.
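To convince myself the check really skips the load, here is a small sketch of the same pattern on a toy file. `NEI_demo`, `demo_file` and `load_count` are hypothetical names used only for this illustration, and the loop just stands in for re-running the script several times in one session.

```r
# Toy .rds file standing in for the real dataset.
demo_file <- tempfile(fileext = ".rds")
saveRDS(data.frame(x = 1:10), demo_file)

load_count <- 0  # hypothetical counter, only here to show how often the load runs

for (run in 1:3) {  # each iteration stands in for re-running the script
  if (!exists("NEI_demo")) {
    NEI_demo <- readRDS(demo_file)
    load_count <- load_count + 1
  }
}
print(load_count)

# To force a fresh load (for example, if the file changed on disk),
# remove the object before the next run.
rm(NEI_demo)
```

One thing to keep in mind: exists() only checks that something with that name is around, not that it is the data you expect, so an rm() is the easy way out when in doubt.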

The second tip that worked for me was to use a subset of the data as I tinkered with the graphs themselves. Here’s a code sample:

# draw 100,000 random row indices (without replacement)
NEI_sample <- sample(nrow(NEI), size = 100000, replace = FALSE)
# select either sample or full set for calculations
mymatrix <- NEI[NEI_sample, ]
# mymatrix <- NEI

This will NOT give you ‘valid’ data – it may be out of order, will change from run to run, and so on… – but will allow you to proceed with manipulating the data frame, preparing the plot (base or ggplot), and so on. Until such a time as the ggplot grammar is second nature to me, it takes me a long time to get a chart up the way I want it…

When you want to load the ‘final’ dataset, just comment out the sample and uncomment the full dataset.
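A possible variation, if the commenting and uncommenting gets tedious: put the choice behind a flag, fix the random seed so the sample stays the same between runs, and sort the indices so the rows keep their original order. This is just a sketch under my own assumptions. `use_sample`, `sample_size`, `plot_data` and the toy `NEI_demo` data frame are made up for the example and are not part of the course code.

```r
# Toy data frame standing in for NEI so the sketch runs on its own.
NEI_demo <- data.frame(id = 1:1000, value = rnorm(1000))

use_sample  <- TRUE  # hypothetical flag: set to FALSE for the final plots
sample_size <- 100   # 100000 in the example on the real data

if (use_sample) {
  set.seed(42)  # fixes the random draw so re-runs pick the same rows
  rows <- sort(sample(nrow(NEI_demo), size = min(sample_size, nrow(NEI_demo))))
  plot_data <- NEI_demo[rows, ]
} else {
  plot_data <- NEI_demo
}
```

The sample is still not ‘valid’ data for the final plots, but at least the chart does not jump around each time the script is re-run.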

Using these two simple approaches – only loading the main dataset when needed and using a subset while tinkering with the plots – I was able to work with the large dataset without much waiting around. As an R newbie, I really appreciated that.

Hope this helps others.
