Adventures in R: Experiences with my Dissertation

Well, I have now been using R “regularly” for 6 months. I set out from the beginning to learn R and, since this was not my first attempt at it, I needed to make sure that I had external reasons for needing to learn it. To do that, I told myself that I wouldn’t touch any statistical software that I had used previously (SPSS, SAS, Minitab). This put me in the position of having to learn R to be able to run any analysis that I planned to use for my dissertation’s multiple studies.

Now, there is a logical question that I should answer here, and that question is “Why did I bother to learn R in the first place?”

There are a few reasons that I think are worth mentioning. Some of them are what originally made me attempt to learn R, and some are benefits I discovered while getting through the process of actually using it.

  • It’s free. At the time I started putting together the analysis for my dissertation, I didn’t know where I was heading post-PhD. That means I had no way of knowing what statistical software I would have access to at my future institution. Sure, you can reasonably expect SPSS, maybe SAS, but it could very well be that I end up in a Stata department, and I have exactly zero experience with Stata. R, on the other hand, I can use on any computer at any time I want. A secondary benefit of R being free is that I could install it on my personal laptop (which, as a graduate student, is my work computer). One of the annoying issues I ran into when I learned SAS for my thesis is that I always had to use a University-owned computer, since I could not get a personal copy for a reasonable price (that I know of). I didn’t want to have to bounce from computer to computer, and being able to work from my own laptop solves that issue. I also have a little thought in the back of my head that maybe I can put together enough code to improve our ability to effectively monitor athletes. The fact that R is free means I could share my code with another sport scientist or coach, who could then use it to analyze their own data.


  • It is code-based. Sure, there is the small benefit of having “geek-cred” amongst my peers, being the only one in my department with some level of mastery of R. However, the big benefit of this software is that I have repeatable analysis. I loved that with SAS, I could write whatever long strings of code I needed to run my analyses, then run them again when necessary. With R, it is the same story. I now have extensive analysis code that requires little more than highlighting the entire script, then clicking “run”. This is useful for repeatability’s sake, but it is also really important for those occasional times that you discover mistakes. In my case, I discovered a mistake I had made in coding one of my subjects. He was tagged as being part of one group, when in fact he was part of another. With the way most people use SPSS, this could mean a whole bunch of extra time going back through all of the old analyses to redo everything (yes, I am aware that SPSS allows syntax pasting, but I don’t know of many people who actually use it that way on a regular basis, except in specific circumstances). In my case, once I discovered the error, I fixed it in my main datafile, then re-ran the analysis by doing, guess what, highlighting my code and hitting run. Not too shabby.


  • The fact that R is code-based also means that you end up with a neat little code library for later studies and analysis. I have posted about some of the functions I wrote to simplify my life a little bit, specifically for speeding up my analysis of reliability and data screening. Having this code already written means that I can sometimes do a straight copy and paste job across projects for certain aspects. Sometimes it’s a simple CTRL-C/CTRL-V, sometimes there is a little bit of tweaking that has to be done. Either way, it has really sped up certain aspects of analysis.


  • Yet another benefit of having a code-based statistics package is the fact that you can iterate certain repetitive analyses. Now, I’m not suggesting it is okay to iterate a run of 500 t-tests, but for something like reliability, sometimes you have a set of multiple variables that need assessment. In one project, I had to assess the test-retest reliability of a whole bunch of subgroups in my analysis for each of multiple variables. Lots and lots of tests. Had I run through each test individually, it would have taken me a TON of time. Instead, I could use a simple apply function to run through the subgroups, outputting the results of my tests into a spreadsheet. The entire analysis for all groups takes no more than a few seconds. Not bad at all.


  • Being code-based means that my analysis is shareable. I can give somebody a dataset and analysis code and they can perform exactly what I did with the same ease I can. This is a really great way to help out other people, and to get feedback/troubleshooting on my code as well.


  • It isn’t a “click, drag, select” system. I feel that in SPSS and Minitab, because it is so easy to get your analysis to run (all you need to do is click menus, drag over variables, and check boxes), it is easy to trust the “magic box”. Maybe it is just me, but having to write out the code for my analysis makes me seriously consider the analysis that I run and the options that I select. I’m not saying that I didn’t carefully consider the analysis I did before, but having to code it makes me consider the analysis in even more depth than I did previously.


  • I am learning to be very careful with versioning and my naming conventions for variables (in other words, R is teaching me good habits). Prior to having all of my data collected, I started putting together my analysis code in R. In order to maintain correct versioning of files, I had to make sure to give subsequent copies of files names like “forcedata v1 12-13-15.csv”, “forcedata v2 12-15-14.csv” etc. This made it much easier to stay on top of new versions of datasets, and also allowed easy referencing in my analysis. Secondly, in order to have code that actually makes sense, I have had to make sure to use very clear, understandable variable names. In something like Excel, it is really easy to end up with overly complicated variable names. With R, you have to be careful about naming conventions if you want to have a snowball’s chance of figuring out what is going on when you look at your code again a month from now.
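
To make the “simple apply function” idea from the list above a little more concrete, here is a minimal sketch of what that kind of loop might look like. Everything in it is made up for illustration: the force_data data frame, the group names, the variable names and the retest_r helper are all hypothetical, and the data are simulated. A plain Pearson correlation between two trials stands in for whatever reliability statistic you actually prefer (an ICC, for example), so treat this as a pattern and not as the analysis from my dissertation.

```r
# Simulated example data (hypothetical): two trials per variable, plus a group label
set.seed(1)
n <- 40
force_data <- data.frame(
  group = rep(c("sprinters", "throwers"), each = n / 2),
  jump_height_t1 = rnorm(n, mean = 40, sd = 5),
  peak_force_t1 = rnorm(n, mean = 2500, sd = 300),
  stringsAsFactors = FALSE
)
# Trial 2 is trial 1 plus some measurement noise
force_data$jump_height_t2 <- force_data$jump_height_t1 + rnorm(n, sd = 1)
force_data$peak_force_t2 <- force_data$peak_force_t1 + rnorm(n, sd = 60)

variables <- c("jump_height", "peak_force")

# Test-retest correlation for one variable within one subset of the data
# (Pearson r is an assumption here; swap in your own reliability statistic)
retest_r <- function(df, variable) {
  cor(df[[paste0(variable, "_t1")]], df[[paste0(variable, "_t2")]])
}

# One row per group/variable combination
results <- expand.grid(
  group = unique(force_data$group),
  variable = variables,
  stringsAsFactors = FALSE
)

# mapply walks through the combinations and runs the same test on each one
results$r <- mapply(
  function(g, v) retest_r(force_data[force_data$group == g, ], v),
  results$group, results$variable,
  USE.NAMES = FALSE
)

# Write everything to a file that opens in any spreadsheet program
out_file <- file.path(tempdir(), "reliability_results.csv")
write.csv(results, out_file, row.names = FALSE)
```

Adding another subgroup or variable only means more rows in the data or one more name in the variables vector; the loop itself stays the same.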

Now, I’ll admit that this process has not been all fluffy clouds, unicorns and rainbows. There are some drawbacks, and the biggest one is definitely the following:

  • R has a significant learning curve. I had a little bit of programming experience (and by experience, I mean I have done a bit of tooling around with BASIC, Visual Basic, LabVIEW and SAS), which left me in a position where I wasn’t completely green to the idea of writing code, but I still needed a lot of time to get used to both coding and the conventions of R. I spent quite a few days in the beginning doing what I will call “fighting with R”, where I couldn’t make things happen simply because I didn’t have my head wrapped around the “style” of R’s language and because I didn’t have the mindset of a programmer. There are some great video tutorials on YouTube, and some pretty awesome MOOCs that helped me out here. However, the longer I play around with it and use it on a regular basis, the easier it gets (both in understanding what is going on and in coding).

You can probably tell by the long list of pros, compared to the one con, that I have really come to enjoy using R. I have been recommending it left and right to people who might be interested in it. I think I have one convert, who has installed both R and RStudio, and is running some code I wrote for him. With luck, there will be a few more of them out there, and we can continue spreading the usefulness of this statistical software in the sport science world.