R is often described as having a particularly steep learning curve, and to some degree that is true. However, one of the steepest parts of this learning curve, at least for me, was the simple task of pulling in existing data.
In the last post, I told you to use RStudio, and that advice still holds. Bringing in files and importing data may be the place where RStudio has the largest impact on lessening the R learning curve.
In fact, pulling data into vanilla R (vanilla here simply means software without any add-ons or extras; in this case, the standalone build of R) is something that I don’t actually know how to do off the top of my head. I know that there’s a get something something command, and I could probably figure it out easily by searching for it (this is also a skill you’ll pick up quickly using R), but for a number of reasons I don’t have to.
First and foremost, you shouldn’t feel bad or embarrassed if you forget a command like this. I could go to any search engine and type ‘importing data to R’ and I’m going to get…about 36 million hits. That knowledge is out there. If I need it, I can find it.
If I’m using RStudio, though, I have other options. Now, there are specific cases where you might want to write code to pull in a file so that you don’t have to stop and do it by hand, but for the most part you’ll likely just be pulling in one file at a time.
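For those cases where writing code is the better route, here is a minimal sketch using base R's `read.csv()`. The temporary file, its contents, and the name `dataFromCode` are all placeholders I made up so the example runs on its own; in practice you would pass the path to your own file.

```r
# Write a tiny CSV to a temporary file so this example is self-contained.
# (Hypothetical data; replace csv_path with the path to your own file.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("V1,V2,V3", "1,2,3", "4,5,6", "7,8,9"), csv_path)

# read.csv() is base R's reader for comma-separated files.
# header = TRUE tells it that the first row holds variable names.
dataFromCode <- read.csv(csv_path, header = TRUE)

# Print the resulting data frame to check that it came in as expected.
dataFromCode
```

If the file lives somewhere you would rather browse to, `read.csv(file.choose())` opens a file dialog instead of requiring a typed path.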
Let’s start with a file from whatever spreadsheet program you choose to work with. I’ve entered some quick numbers into a file to demonstrate:
You can see that I’ve used the first row to toss in some variable names, though they’re not super descriptive. If these variables had any real meaning, I could have put descriptive names in the first row and R would read them. It is also perfectly fine if you don’t have variable names in your file; R will generate default names for you.
Once you have this file, you’ll want to save it as a CSV file in a place where you’ll remember it. You can name the file whatever you want, as you’ll have an opportunity in our next step to change that name for the purposes of R.
In RStudio, on the Environment window, there is a button marked ‘Import Dataset.’ Click it, and you’ll get a file dialog window that you can use to find your file. Once you have that selected you’ll be taken to another window that you can use to specify characteristics of your file. This window looks like this:
There are a number of important components here. First off, ‘test’ is what I named my CSV file. If I wanted to call this dataset something different in R, this is where I would change it. I’m going to go with ‘dataExample1’ for my data, but feel free to choose whatever name you’d like for your own.
The section in the upper right is showing what the values in my file look like at the moment. You can see that the values are in fact comma separated (that’s what the C and S stand for), and you can see that ‘Comma’ is correctly selected in the ‘Separator’ box. If you have other delimiters you can change this value now.
In the lower right you can get a feel for how this file is going to look once it is in R. You may notice two rows of variable names here (don’t be too confused by the fact that I used, in my file, names that are redundant with R’s default names). The bold row at the top contains the variable names that R has generated for this file. The row below contains the names from the CSV, because R thinks they are values at the moment!
We can easily fix that by choosing ‘yes’ for heading instead of the default ‘no.’ This should remove that second row of names and move your own variable names into the bold area. If you didn’t come up with your own variable names, you can leave the heading selection alone.
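To see the same effect in code, here is a small sketch using the `header` argument of base R's `read.csv()`. I'm assuming the heading option in the import window behaves like this argument; the file and the names `a`, `b`, and `c` are invented for the example.

```r
# A hypothetical CSV whose first row holds variable names.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6"), csv_path)

# header = FALSE: R generates its own names (V1, V2, V3) and treats
# the first row of the file as ordinary data.
no_heading <- read.csv(csv_path, header = FALSE)
names(no_heading)
nrow(no_heading)

# header = TRUE: the first row becomes the variable names,
# leaving only the numeric rows as data.
with_heading <- read.csv(csv_path, header = TRUE)
names(with_heading)
nrow(with_heading)
```

Notice that the version read without a heading has one extra row, and because that row contains text, its columns are no longer read as numbers.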
You may at some point need to change the values for Decimal, Quote, and/or na.strings, but it is unlikely you’ll need to do this early on. For now, don’t worry about them unless you know something different about the values in your file (such as commas used as decimals).
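If you ever do need those options, this is a rough sketch of how they look as arguments to base R's `read.csv()`. The file contents, the column names, and the use of the word "missing" as a missing-value marker are assumptions made up for the example.

```r
# A hypothetical file that uses semicolons as separators, commas as
# decimal marks, and the word "missing" where a value is absent.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("height;weight", "1,5;70", "1,8;missing", "1,6;65"), csv_path)

european <- read.csv(
  csv_path,
  header = TRUE,
  sep = ";",               # Separator: semicolon instead of comma
  dec = ",",               # Decimal: comma instead of period
  quote = "\"",            # Quote: double quotes around text values
  na.strings = "missing"   # na.strings: text that should become NA
)

european$height          # read as decimal numbers
is.na(european$weight)   # TRUE where the file said "missing"
```

Base R also provides `read.csv2()`, whose defaults already assume semicolon separators and comma decimals.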
For reference, I’m left with something like this:
Now, click ‘Import,’ and your data should be pulled into R and named with whatever name you assigned in this window. If things have worked, you should see this name show up in your Environment window. This file is now present in the global Environment, and we can write code that references it just like we wrote calculations that referenced x and y last time.
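As a quick check, here is a sketch of how you might confirm the dataset is there and reference it in a calculation. So that the example runs on its own, I build a small stand-in data frame with made-up values; if you imported your own file, skip that first line and use your own dataset and column names.

```r
# Stand-in for the imported dataset (hypothetical values and names).
dataExample1 <- data.frame(V1 = c(1, 4, 7), V2 = c(2, 5, 8), V3 = c(3, 6, 9))

# Confirm that an object with this name exists.
exists("dataExample1")

# Reference individual columns with $ and use them in a calculation,
# much like working with single values such as x and y.
column_sum <- dataExample1$V1 + dataExample1$V2
column_sum
```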
You might be thinking, ‘that was easy.’
Well, yeah. It kind of is.
You’re certainly going to run into more complex problems as you start to pull in larger and more complex datasets, but if you understand the core of what we just did above, you understand most of it. The main trick is getting your data into a clean and orderly CSV file.
We will focus next time on some of the things you can do with R now that you have a dataset in the Environment, but it’s probably worth giving you a taste of a few of them this time, just to give a feeling of what’s coming.
Again, I named my dataset ‘dataExample1,’ so change that value to whatever you named your file and then try running the following code:
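As a hedged sketch of the kind of first-look commands that fit here (these are common base R functions chosen for illustration, not necessarily the exact ones intended), with a small made-up data frame standing in for the imported file:

```r
# Stand-in for the imported dataset (hypothetical values and names).
# Skip this line if you already imported your own data.
dataExample1 <- data.frame(V1 = c(1, 4, 7), V2 = c(2, 5, 8), V3 = c(3, 6, 9))

head(dataExample1)      # the first few rows
dim(dataExample1)       # number of rows and number of columns
names(dataExample1)     # the variable names
str(dataExample1)       # the type of each column
summary(dataExample1)   # min, max, mean, and quartiles for each column

# The mean of a single column, referenced with $.
first_column_mean <- mean(dataExample1$V1)
first_column_mean
```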
Those should be pretty straightforward, but in the spirit of giving you something complex, feel free to try this, as well:
It should work as long as your CSV file contains only numbers and has no missing values. If it’s not working for you, see if you can’t fix it.
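For a feel of what something more involved can look like, here is an illustrative sketch of my own (an assumption, not necessarily the snippet referred to above) that loops over every column and computes its mean. It also shows why all-numeric data without missing values matters: `mean()` returns `NA` for a column containing missing values, and it cannot average text. The data frame is a made-up stand-in.

```r
# Stand-in for the imported dataset (hypothetical values and names).
dataExample1 <- data.frame(V1 = c(1, 4, 7), V2 = c(2, 5, 8), V3 = c(3, 6, 9))

# Create an empty numeric vector with one slot per column,
# labelled with the column names.
column_means <- numeric(ncol(dataExample1))
names(column_means) <- names(dataExample1)

# Loop over the columns by position and store each column's mean.
for (i in seq_len(ncol(dataExample1))) {
  column_means[i] <- mean(dataExample1[[i]])
}

column_means
```

The built-in `colMeans(dataExample1)` produces the same result in a single line, which makes it a handy way to check the loop.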
I also tend to use more code than necessary, so feel free to suggest improvements that would do the same thing!