The Data Scientist’s Toolkit: R Programming Language
By Kat Campise, Data Scientist, Ph.D.
Just about everyone in the programming and computer science industry has heard of Python: it’s the programming language most frequently listed in job descriptions (aside from Java or JavaScript). Within the data science community, however, R sits alongside Python as one of the go-to languages for all things data science related. Initially designed as a statistical calculation and graphics tool by Ross Ihaka and Robert Gentleman, who wrote about R in 1996, R now has over 10,000 packages covering a wide swath of computational capabilities. The list below is meant only as a brief overview; for a more specific list, please see the CRAN Packages site:
- Both basic and advanced statistics
- Data cleaning
- Data mining and scraping
- Machine Learning
- Graphics/plots
Given the ever-growing R community and the growing demand for R expertise across industries, the TIOBE programming community index currently ranks R as the 11th most popular programming language, and its popularity continues to climb. But, with all of the other available languages, why use R?
Why Use R?
Let’s get this fact out of the way: choosing a programming language is equal parts personal preference and what an employer requires. If you are an expert in R but Corporation X uses Python, you’ll need to make the switch unless you can convince them otherwise or they are flexible about programming languages for data science. In fact, knowing how to perform the same (or similar) functions in both languages is ideal. But if you’re just getting started in data science and you are new to programming, then R is a great gateway for learning how to merge the worlds of statistics and programming:
- R has a robust community that is constantly developing new packages and maintains several user groups where newbies, intermediates, and experts can exchange ideas and support new R users. General “How to” blogs, Stack Exchange, and Stack Overflow answers go out of date quickly because of frequent package updates and the release of brand-new packages, so R users are often better served by directing their questions to the R community itself. One exception is R-bloggers, which is an excellent resource for all levels of R users. Keep in mind that R’s popularity is still growing and that it is used primarily within the academic, healthcare, and government sectors (though many Google jobs call for R expertise), so the better-known general programming resources aren’t as reliable for deepening your understanding of how R can be used within data science.
- As a statistical package, R is completely free: it costs you nothing to download and get started. This is in contrast to software such as SPSS, SAS, and Stata, which can cost hundreds if not thousands of dollars for a license. While some employers still use those programs, and becoming familiar with them is recommended, most data science courses, whether massive open online courses (MOOCs) or the increasing number of degree programs, use either R or Python for statistical analysis.
- You don’t have to perform every analysis using the R language alone: you can run Python packages from R (for example, through the reticulate package) and write R functions in C++ (for example, through the Rcpp package). Additionally, thanks to its quickly increasing adoption rate, R can now be run on AWS. As such, R is flexible as both a programming language and a statistical package.
Where to Learn R
In this age of open source and self-directed learning, there is a dizzying array of resources for learning R. It really comes down to which learning method suits you best.
- Coursera, one of the largest MOOC providers, has a number of R programming and R-focused data science courses available. You can audit most courses for free, although auditing may not include access to the quizzes and peer review process that lead to earning a certificate. If you’re strapped for cash and would like a verified certificate to beef up your resume, you can apply for “Financial Aid” and take the entire course for free.
- edX, another MOOC provider, also offers self-study courses such as Programming R for Data Science, Statistics and R, and Introduction to R for Data Science, among many others. edX also offers course auditing, where you’ll have access to the video lectures and quizzes but won’t earn a certificate unless you pay for the course.
- DataCamp has a free Introduction to R learning module that takes you step by step through basic R functions. You can practice either online or via the DataCamp app (so you can learn on the go). As with Coursera and edX, only a limited number of modules can be accessed without cost.
- In addition to Nanodegree programs such as Data Scientist Foundation, Udacity offers a free Data Analysis with R course that focuses on exploratory data analysis (EDA).
- Pluralsight is yet another resource for kick-starting your R journey via its Try R module, which takes you through expressions, logical values, data frames, and how to apply what you’ve learned to real-world data.
To get started with R, all you need to do is visit the R Project website and download the latest version of R. For those who prefer an integrated development environment (IDE), RStudio offers an all-in-one code editor, debugger, and visualization package. If you prefer using a Jupyter Notebook, R is now a supported language there, and the setup can be downloaded either from Jupyter’s website or via Anaconda.
Since R is so flexible, both as a statistical package and a programming language, its usage continues to climb as a reliable tool for a wide variety of statistical computations. For those who have little or no programming experience, becoming familiar with one programming language first will lay the groundwork for transferring that knowledge to other languages. Knowing multiple programming languages is ideal. But for aspiring data scientists (who hopefully have some advanced statistical coursework or practical experience in statistics beyond simple descriptive statistics), R is the perfect introduction to applying analyses using a freely available and widely supported programming language.
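Once R is installed, a first session might look something like the sketch below. It is only an illustration, not a prescribed workflow: it uses nothing beyond base R and the built-in `mtcars` dataset, and the variable names (`cars`, `cars_clean`, `fit`, `slope`) are made up for this example. It touches the capabilities listed at the top of this article: basic statistics, data cleaning, modeling, and plotting.

```r
# A first R session using only base R and the built-in mtcars dataset.
data(mtcars)

# Basic statistics: mean and standard deviation of fuel economy (mpg).
mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

# Data cleaning: keep three columns, simulate a missing value,
# then drop any rows that are incomplete.
cars <- mtcars[, c("mpg", "wt", "cyl")]
cars$mpg[1] <- NA
cars_clean <- na.omit(cars)

# Modeling: a simple linear regression of fuel economy on weight.
fit <- lm(mpg ~ wt, data = cars_clean)
slope <- unname(coef(fit)["wt"])

# Graphics: scatter plot with the fitted regression line drawn on top.
plot(cars_clean$wt, cars_clean$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)

# Print a quick numeric summary of the cleaned data.
print(summary(cars_clean))
```

In RStudio, the plot appears in the Plots pane; when the script is run non-interactively with `Rscript`, R typically writes the plot to a PDF file in the working directory instead.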