Stata Meets R

Welcome to a session of Stata meets R. This session shows you how the basic Stata commands work in R.

In [1]:
#R is a free statistics software 
#Everything in this box is R code: Copy and paste it and try it on your own!
#Everything behind a # is a comment
#Use the right arrow key to continue!

Before we can start, download and install R first:

Download R

And download and install RStudio Desktop, which is also free:

Download RStudio

After the installation is finished, open RStudio. We will work with RStudio because it has many advantages. RStudio runs R in the background, which is why we need both.

RStudio has four windows with different functions.

[Screenshot: the four RStudio windows]

First steps

Before we start with the first calculations, please check whether the installation has worked. Open a new R script via File > New File > R Script. The R code is saved in the script. It is like a do-file in Stata.

[Screenshot: a new R script in RStudio]

Check whether your first R script is working. For example, we can use R as a calculator by typing an equation in the R script and running the code. R will solve mathematical equations for us.

[Screenshot: a calculation in the R script]

Run the code by pressing Ctrl + Enter. If everything works fine, you will see the following result in the output window:

[Screenshot: the result in the output window]

Compared to Stata, R has a few special features. One of them is the assignment operator "<-". It is used to store data frames, single values and other objects under a name in R.

In [2]:
# The assignment operator assigns the value on its right to the name on its left.
a <- 5
b <- 7

# You can see what's behind the saved term by calling the object again.
a
b
a + b
5
7
12
In [3]:
#The assignment operator can also store text, complete datasets, variables and other objects.

b <- "Hello World"
b
'Hello World'

R is open-source software, which is why many tools or packages are written by users for users. You need to install a package once and then load it with library() in each session before you can use it.

In [4]:
#install.packages("dplyr") => installs the package dplyr
#library(dplyr) => loads the package dplyr
#These packages are already installed on my machine, which is why the lines are commented out with a #. Remove the # to install and load them.
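
To avoid reinstalling a package every time a script runs, one option is to check first whether it is already installed. The following is a minimal sketch using base R only; the helper name is_installed is hypothetical and not part of R or dplyr.

```r
# Hypothetical helper (not part of R): TRUE if a package is already installed
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("stats")   # the stats package ships with R
is_installed("dplyr")   # TRUE only if dplyr has been installed before

# Install only when missing, then load (remove the # to run):
# if (!is_installed("dplyr")) install.packages("dplyr")
# library(dplyr)
```

install.packages() is needed once per machine, while library() has to be called in every new R session.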

On the following pages, we will repeat some core basics from Stata and I'll show you how to run the same calculations in R. We use the mtcars dataset, which ships with R. If you work with a dataset for the first time, you probably want to know a few things about it before running any calculations.

describe

The describe command shows you stored information about the data in Stata. In R, we can display the structure of a dataset with the str() function. We have to specify the name of the dataset inside the parentheses.

In [5]:
#Stata: describe
#In R
str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The mtcars data contain 32 observations with information about cars, including the fuel economy (mpg, miles per gallon), the horsepower (hp) and the transmission type (am) of a vehicle.
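
Besides str(), a few base R functions answer the same questions as describe one at a time. The sketch below only uses functions that ship with R; the object name var_types is chosen here for illustration.

```r
# Number of observations and variables (shown at the top of Stata's describe)
dim(mtcars)
nrow(mtcars)
ncol(mtcars)

# Variable names and the storage type of each variable
names(mtcars)
var_types <- sapply(mtcars, class)
var_types
```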

list

The list command can be used in Stata to display single observations of the data, for example, as a table. In R, we can use the head() function.

In [6]:
#STATA: list in 1/5, table
In [9]:
#R: head() returns the first six rows; %>% slice(1:5) (from dplyr) keeps only the first 5 cases, like the Stata command
head(mtcars)%>% slice(1:5)
A data.frame: 5 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

Commands like list or head are very useful, especially to check whether the data management steps have worked.
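
The same listing can be done without dplyr. This is a small base R sketch; the object names first_five, last_three and some_vars are only illustrative.

```r
# First five observations (Stata: list in 1/5)
first_five <- head(mtcars, 5)
first_five

# Last three observations (Stata: list in -3/l)
last_three <- tail(mtcars, 3)
last_three

# Only selected variables for the first five observations (Stata: list mpg hp in 1/5)
some_vars <- mtcars[1:5, c("mpg", "hp")]
some_vars
```

The second argument of head() and tail() sets the number of rows, so no extra slicing step is needed.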

summarize

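In Stata, the summarize command reports descriptive statistics such as the number of observations, the mean, the standard deviation, the minimum and the maximum. With the detail option, additional measures such as percentiles (including the median), skewness and kurtosis are displayed. The summary() function in R works similarly and reports the minimum, the quartiles, the median, the mean and the maximum.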

In [10]:
#STATA: summarize var, detail
In [11]:
#R
summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
In [12]:
#Specific variables can be selected with $variable_name.
summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 
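
summary() does not report the standard deviation or other percentiles that Stata's detail option shows. A minimal sketch of how to get them with base R functions follows; note that R's default quantile method may give slightly different percentiles than Stata.

```r
mpg <- mtcars$mpg

mean(mpg)     # arithmetic mean
sd(mpg)       # standard deviation
var(mpg)      # variance
median(mpg)   # 50th percentile

# Selected percentiles, similar to the ones listed by summarize, detail
quantile(mpg, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
```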

tab

The tab command gives you a simple frequency table in Stata. If you give Stata two variable names, it returns a cross table. In R, we can use the table() function to produce a similar one.

In [13]:
#STATA: tab var1 var2
In [14]:
#R
table(mtcars$am,mtcars$vs)
   
     0  1
  0 12  7
  1  6  7
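
The plain table() output has no variable labels or totals. As a sketch, base R can add both, and prop.table() gives row proportions (Stata's row option shows percentages instead). The object name tab is chosen here for illustration.

```r
# Cross table with variable names as labels (rows: am, columns: vs)
tab <- table(am = mtcars$am, vs = mtcars$vs)
tab

# Add row and column totals
addmargins(tab)

# Row proportions: each row sums to 1 (Stata: tab am vs, row)
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```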

Data preparation

Data preparation follows its own logic in R, and it would take a separate session to cover it properly. So, let's focus first on how it works in Stata; we will look at more of the differences between R and Stata in another session.

Generate new variables

In Stata, a new variable can be created with the generate command and then filled in with the replace command. The generate command creates a variable with missing values (.) or a constant value (e.g. 1). These are placeholders and can be replaced with the replace command and an if qualifier, depending on a specified condition. For example:

In [15]:
#Stata:
# gen female=.
#The variable female contains first only missing values (.)
# replace female=1 if male==0
#If male equals (==) 0, the generated variable becomes 1.
# replace female=0 if male==1
#If male equals (==) 1, the generated variable becomes 0.
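
The generate/replace pattern can be mirrored in base R with logical indexing. The following is a sketch under the assumption that we use the am variable of mtcars instead of the male/female variables from the Stata example; the names cars, manual and automatic are only illustrative.

```r
cars <- mtcars                     # work on a copy so mtcars stays unchanged

# Stata: gen manual = .
cars$manual <- NA

# Stata: replace manual = 1 if am == 1
cars$manual[cars$am == 1] <- 1

# Stata: replace manual = 0 if am == 0
cars$manual[cars$am == 0] <- 0

# The same idea in one step: ifelse(condition, value if TRUE, value if FALSE)
cars$automatic <- ifelse(cars$am == 0, 1, 0)

table(cars$manual, cars$automatic)
```

In R, NA plays the role of Stata's missing value (.).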

Recode variables

A second way to generate new variables in Stata is recode. The recode command takes an existing variable and recodes its values. Importantly, the option gen(variable) creates a new variable, because we do not want to overwrite the original variable or its values.

In [16]:
#Stata:
#Let's generate a new variable which indicates males in our data:
# recode female (0=1) (1=0), gen(male)
#Thus, male is 1 where female is 0 and vice versa, and the result is saved as the new variable male
In [17]:
#In R, a new variable is added to the data frame with $ and <-
#recode() from the dplyr package recodes values similarly to Stata
#Install and load dplyr first!
#library(dplyr) 
mtcars$new_var <- recode(mtcars$am, `0` = 1, `1` = 0)
mtcars%>% slice(1:5)
A data.frame: 5 × 12
mpg cyl disp hp drat wt qsec vs am gear carb new_var
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1

if conditions

In Stata, we often use if conditions for data preparation and analyses. R works similarly, but instead of the if condition we can use the filter() function from the dplyr package. With filter(), we keep only the observations that meet a condition, just like with the if condition. We call the filter function, provide R with the name of the data frame we want to filter, and then use comparison operators and functions to select specific observations. Do you have any idea which observations remain in the filter below?

In [18]:
#Stata: mean var if var==0
In [19]:
#in R: 
filter(mtcars, am == 0)%>% slice(1:5)
A data.frame: 5 × 12
mpg cyl disp hp drat wt qsec vs am gear carb new_var
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 1
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1

The condition am == 0 tells R to keep only the observations which meet this criterion. In other words, the filtered data include only cars with an automatic transmission (am == 0).
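
The Stata line above computes a mean for a subgroup, while filter() only returns the matching rows. As a sketch, the subgroup mean itself can be computed in base R like this; the object names are only illustrative.

```r
# Stata: mean mpg if am == 0
mean_automatic <- mean(mtcars$mpg[mtcars$am == 0])
mean_automatic

# subset() is a base R alternative to dplyr's filter()
automatic <- subset(mtcars, am == 0)
nrow(automatic)
mean(automatic$mpg)

# Conditions can be combined with & (and) and | (or)
automatic_4cyl <- subset(mtcars, am == 0 & cyl == 4)
nrow(automatic_4cyl)
```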

Load a dataset

In Stata, you can load a dataset with the use command. You have to tell Stata where to find the data by providing the file path; the clear option removes the data currently in memory. In R, we can use the read_csv() function from the readr package to load a CSV file; the rest follows the same logic. With "<-" we save the loaded data under a name of our choice.

In [27]:
#library(readr) don't forget to install and load readr before running the command! 
dataframe <- read_csv("C:/Users/Edgar Treischl/Desktop/titanic_R.csv")
head(dataframe)
A tibble: 6 × 4
class adult male survived
<chr> <chr> <chr> <chr>
first adult female yes
crew adult male yes
first adult male yes
crew adult male yes
third adult female yes
crew adult male no
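
If you do not have a CSV file at hand, the following sketch lets you try the import step with base R only. It writes mtcars to a temporary file and reads it back; the names csv_path and cars_from_csv are hypothetical. For Stata .dta files, the read_dta() function from the haven package is one option.

```r
# Write mtcars to a temporary CSV file (path is created by R, not a real project file)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Load the CSV file and store it under a name of our choice
cars_from_csv <- read.csv(csv_path)

# Check that the import worked
str(cars_from_csv)
```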

Thanks to RStudio we don't have to remember these steps. Use RStudio to import data directly. It supports Stata, SPSS and other data formats. Let's have a look!

The button to import new data is located in the data/packages window (top right). Click on it.

[Screenshot: the import button in RStudio]

This opens the import window and you can select the new dataset.

[Screenshot: the import window]

Take another look at the code preview in the right corner of the import window. After selecting the file, RStudio gives us the code and the corresponding package which is needed to load the data. Now, all you have to do is copy and paste the code to load the new dataset.

I hope this short presentation has given you an impression of how the basic commands work in R. Fortunately, R has a big community and you can find many tips and tricks on the internet. Give it a try and google how to run a linear regression in R.

In [24]:
# Running a multiple regression
#Let's see whether horsepower and the number of cylinders can predict mpg (miles per gallon)
fit <- lm(mpg ~ hp + cyl, data=mtcars) 
summary(fit) # show results
Call:
lm(formula = mpg ~ hp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4948 -2.4901 -0.1828  1.9777  7.2934 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.90833    2.19080  16.847  < 2e-16 ***
hp          -0.01912    0.01500  -1.275  0.21253    
cyl         -2.26469    0.57589  -3.933  0.00048 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.173 on 29 degrees of freedom
Multiple R-squared:  0.7407,	Adjusted R-squared:  0.7228 
F-statistic: 41.42 on 2 and 29 DF,  p-value: 3.162e-09
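
Unlike Stata, R stores the regression as an object that can be queried further. The following sketch refits the model from above and pulls out single pieces; the data frame new_car is a made-up example car, not part of mtcars.

```r
# Stata: regress mpg hp cyl
fit <- lm(mpg ~ hp + cyl, data = mtcars)

# Coefficients only
coefs <- coef(fit)
coefs

# 95% confidence intervals for the coefficients
confint(fit)

# R-squared from the summary object
r2 <- summary(fit)$r.squared
r2

# Predicted mpg for a hypothetical car with 110 hp and 6 cylinders
new_car <- data.frame(hp = 110, cyl = 6)
predict(fit, newdata = new_car)
```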

As you can see, you can even run a regression without much knowledge of R. So, just give it a try and see whether you can run the commands you have learned on your own.