R or Python

When and how to use R?

R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. It’s great for exploratory work, and it's handy for almost any type of data analysis because of the huge number of packages and readily usable tests that often provide you with the necessary tools to get up and running quickly. R can even be part of a big data solution.

When getting started with R, a good first step is to install the amazing RStudio IDE. Once this is done, we recommend that you have a look at the following popular packages:

When and how to use Python?

You can use Python when your data analysis tasks need to be integrated with web apps or if statistics code needs to be incorporated into a production database. Being a fully fledged programming language, it’s a great tool to implement algorithms for production use.

While the immaturity of Python’s data analysis packages was an issue in the past, this has improved significantly over the years. Make sure to install NumPy/SciPy (scientific computing) and pandas (data manipulation) to make Python usable for data analysis. Also have a look at matplotlib to make graphics, and scikit-learn for machine learning.
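As a rough illustration of what these packages offer, here is a minimal sketch that assumes NumPy and pandas are already installed. The data and the column names `group` and `value` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented purely for illustration.
data = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# pandas: split the rows by group and compute the mean of each group.
group_means = data.groupby("group")["value"].mean()

# NumPy: standardize the raw values (population standard deviation).
values = data["value"].to_numpy()
z_scores = (values - values.mean()) / values.std()

print(group_means)
print(np.round(z_scores, 2))
```

The same pattern (load data into a DataFrame, aggregate with pandas, compute on arrays with NumPy) carries over to larger, real data sets.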

Unlike R, Python has no clear “winning” IDE. We recommend that you have a look at Spyder, IPython Notebook and Rodeo to see which one best fits your needs.

R and Python: The Data Science Numbers

If you look at recent polls that focus on programming languages used for data analysis, R is often the clear winner. If you focus specifically on the Python and R data analysis communities, a similar pattern appears.

Despite the above figures, there are signals that more people are switching from R to Python. Furthermore, there is a growing group of individuals using a combination of both languages when appropriate. This is exactly in line with what we recommend to our students as well.

If you’re planning to start a career in data science, you are well served by both languages. Job trends indicate an increasing demand for both skills, and wages are well above average.

R: Pros and Cons

Pro: A picture says more than a thousand words

Visualized data can often be understood more efficiently and effectively than the raw numbers alone. R and visualization are a perfect match. Some must-see visualization packages are ggplot2, ggvis, googleVis and rCharts.

Pro: R ecosystem

R has a rich ecosystem of cutting-edge packages and an active community. Packages are available at CRAN, Bioconductor and GitHub. You can search through all R packages at Rdocumentation.

Pro: R is the lingua franca of data science

R is developed by statisticians for statisticians. They can communicate ideas and concepts through R code and packages, and you don’t necessarily need a computer science background to get started. Furthermore, it is increasingly adopted outside of academia.

Pro/Con: R is slow

R was developed to make the life of statisticians easier, not the life of your computer. Although R can be experienced as slow due to poorly written code, several projects aim to improve R’s performance: pqR, Renjin, FastR, Riposte and many more.

Con: R has a steep learning curve

R’s learning curve is non-trivial, especially if you are used to a GUI for your statistical analysis. Even finding packages can be time-consuming if you’re not familiar with the ecosystem.

Python: Pros and Cons

Pro: IPython Notebook

The IPython Notebook makes it easier to work with Python and data. You can easily share notebooks with colleagues, without them having to install anything. This drastically reduces the overhead of organizing code, output and notes files, and allows you to spend more time doing real work.

Pro: A general-purpose language

Python is a general-purpose language that is easy and intuitive. This gives it a relatively flat learning curve, and it increases the speed at which you can write a program. In short, you need less time to code and you have more time to play around with it!

Furthermore, Python’s built-in testing framework has a low barrier to entry and encourages good test coverage. This helps make your code reusable and dependable.
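To give a feel for how little setup such tests need, here is a minimal sketch. It assumes the built-in framework meant here is the standard library’s unittest module. The `mean` function and the `MeanTest` class are hypothetical names chosen for this example.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


# Load and run the tests explicitly so the script does not exit afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project you would usually keep the tests in their own file and run them from the command line with `python -m unittest`.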

Pro: A multipurpose language

Python brings people with different backgrounds together. Because it is a common, easy-to-understand language that programmers already know and statisticians can easily learn, you can build a single tool that integrates with every part of your workflow.

Pro/Con: Visualizations

Visualizations are an important criterion when choosing data analysis software. Although Python has some nice visualization libraries, such as Seaborn, Bokeh and Pygal, there are perhaps too many options to choose from. Moreover, compared to R, creating visualizations is usually more convoluted, and the results are not always as pleasing to the eye.

Con: Python is a challenger

Python is a challenger to R. It does not offer an alternative to the hundreds of essential R packages. Although it’s catching up, it’s still unclear whether this will make people give up R.

And the winner is...

Up to you! As a data scientist, it’s your job to pick the language that best fits your needs. Some questions that can help you: