Stata Meets R¶

Welcome to a session of Stata meets R. This session shows you how basic Stata commands work in R.¶

In [1]:

#R is free statistical software.
#Everything in this box is R code: copy and paste it and try it on your own!
#Everything after a # is a comment.
#Use the right arrow key to continue!

Before we can start, download and install R first:¶

Download R

And download and install RStudio Desktop, which is also free:¶

Download RStudio

After the installation is finished, open RStudio. We will work with RStudio because it has many advantages. RStudio runs R in the background, which is why we need both.¶

RStudio has four windows with different functions.¶

First steps¶

Before we start with the first calculations, please check whether the installation worked. Open a new R script under File > New File > R Script. Your R code is saved in the script, which works like a do-file in Stata.

[Screenshot: RStudio]

Check whether your first R script is working. For example, we can use R as a calculator by typing an expression in the R script and running the code. R will evaluate it for us.¶

[Screenshot: RStudio]

Run the code by pressing Ctrl + Enter. If everything works, you will see the following result in the output window:¶

[Screenshot: RStudio]
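
As a minimal sketch of the calculator use described above, the lines below can be typed into the script and run one at a time. The particular numbers are only illustrative.

```r
# R evaluates arithmetic expressions and prints the result
2 + 3 * 4        # multiplication binds tighter than addition
(2 + 3) * 4      # parentheses change the order of evaluation
10 / 4           # division returns a decimal number
sqrt(16)         # built-in functions are called with parentheses
```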

Compared to Stata, R has a few special features. One of them is the assignment operator "<-". It is used to store data frames, single values and other objects in R.¶

In [2]:

# The assignment operator stores the value on its right under the name on its left.
a <- 5
b <- 7

# You can see what is stored in an object by typing its name.
a
b
a + b

In [3]:

#With the assignment operator you can store numbers, text, variables or even complete datasets as objects.
#Assigning to b again overwrites its previous value (7).

b <- "Hello World"
b

'Hello World'
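
A short sketch of how the same operator stores more than one value at a time. The object names v and doubled are made up for illustration.

```r
# c() combines several values into one vector
v <- c(5, 7, 12)
length(v)          # number of elements in the vector
doubled <- v * 2   # arithmetic is applied to every element
doubled
class(v)           # storage type of the numbers
class("text")      # storage type of a piece of text
```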

R is open-source software, so many tools, called packages, are written by users for users. You need to install a package once and then load it with library() in each session before you can use it.¶

In [4]:

#install.packages("dplyr") => installs the package dplyr
#library(dplyr) => loads the package dplyr
#These packages are already installed on my machine, which is why I put a # in front of the commands. Please install and load them yourself.
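
Before installing anything, it can help to check which packages are already available. The sketch below uses requireNamespace() from base R. The vector name pkgs is only illustrative, and the install line is left commented out so nothing is downloaded by accident.

```r
# packages used later in this session
pkgs <- c("dplyr", "readr")

# TRUE for every package that is already installed, FALSE otherwise
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)
installed

# uncomment to install whatever is missing
# install.packages(pkgs[!installed])
```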

On the following pages, we will repeat some core basics from Stata and I'll show you how to run the same calculations in R. We use the mtcars dataset, which ships with R. When you work with a dataset for the first time, you probably want to know a few things about it before running any calculations.¶

describe¶

The describe command shows you stored information about the data in Stata. In R, we can display the structure of a dataset with the str() function. We have to pass the name of the dataset inside the parentheses.

In [5]:

#Stata: describe
#In R
str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The mtcars data contain 32 observations with information about cars, including fuel economy in miles per gallon (mpg), horsepower (hp) and the type of transmission (am).
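
A few other base R functions answer questions that describe also answers in Stata. This is a small sketch using the built-in mtcars data:

```r
dim(mtcars)         # number of observations and number of variables
nrow(mtcars)        # observations only
ncol(mtcars)        # variables only
names(mtcars)       # variable names
class(mtcars$mpg)   # storage type of a single variable
```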

list¶

In Stata, the list command displays individual observations of the data, for example as a table. In R, we can use the head() function.

In [6]:

#STATA: list in 1/5, table

In [9]:

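#R: head() returns the first six rows by default; piping (%>%) the result into
#slice(1:5) from dplyr keeps only the first 5 cases, like the Stata command.
#This requires library(dplyr). In base R, head(mtcars, 5) returns the same rows.
head(mtcars) %>% slice(1:5)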

A data.frame: 5 × 11
mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.440	17.02	0	0	3	2

Commands like list or head are very useful, especially to check whether the data management steps have worked.
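
Stata's in qualifier can also be mimicked with base R indexing, without any extra package. A sketch (the choice of rows and variables is arbitrary, and first5 is an illustrative name):

```r
# data[rows, columns]: leaving one side empty selects everything
mtcars[1:5, ]                     # like: list in 1/5
mtcars[10:15, c("mpg", "cyl")]    # like: list mpg cyl in 10/15
first5 <- head(mtcars, 5)         # head() takes the number of rows as second argument
first5
tail(mtcars, 3)                   # the last three observations
```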

summarize¶

In Stata, the summarize command reports descriptive statistics such as the mean and the standard deviation of a variable. The detail option adds further measures, including the median and other percentiles. The summary() function in R works similarly.

In [10]:

#STATA: summarize var, detail

In [11]:

#R
summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000

In [12]:

#Specific variables can be selected with $variable_name.
summary(mtcars$mpg)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90
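
summary() does not report every statistic that summarize with the detail option shows. As a sketch, some of the remaining ones can be requested individually with base R functions (the percentiles chosen here are just an example):

```r
sd(mtcars$mpg)       # standard deviation
var(mtcars$mpg)      # variance
median(mtcars$mpg)   # median

# selected percentiles of mpg
quantile(mtcars$mpg, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
```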

tab¶

The tab command gives you a simple frequency table in Stata. If you give Stata two variable names, it returns a cross table. In R, we can use the table() function to produce a similar one.

In [13]:

#STATA: tab var1 var2

In [14]:

#R
table(mtcars$am,mtcars$vs)
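
The bare table can be hard to read because it does not say which variable is in the rows. A sketch with labels and row proportions; the object name tab and the dnn labels are illustrative choices:

```r
# dnn names the dimensions, so rows and columns are labelled
tab <- table(mtcars$am, mtcars$vs, dnn = c("am", "vs"))
tab

# row proportions, similar to: tab am vs, row
prop.table(tab, margin = 1)

# add row and column totals
addmargins(tab)
```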

Data preparation¶

Data preparation follows its own logic in R, and it would take a session of its own to cover it properly. So let's first focus on how it works in Stata; we will look more closely at the differences between R and Stata in another session.

Generate new variables¶

In Stata, a new variable can be created with the generate and replace commands. The generate command creates a variable filled with missing values (.) or a constant (e.g. 1). These are placeholders that the replace command then overwrites, depending on a condition specified with the if qualifier. For example:

In [15]:

#Stata:
# gen female=.
#At first, the variable female contains only missing values (.)
# replace female=1 if male==0
#If male equals (==) 0, female is set to 1.
# replace female=0 if male==1
#If male equals (==) 1, female is set to 0.
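
The same generate-and-replace logic can be sketched in base R. mtcars has no male variable, so this example builds a hypothetical dummy called automatic from am, working on a copy named cars so the original data stay untouched:

```r
cars <- mtcars                        # work on a copy
cars$automatic <- NA                  # like: gen automatic = .
cars$automatic[cars$am == 0] <- 1     # like: replace automatic = 1 if am == 0
cars$automatic[cars$am == 1] <- 0     # like: replace automatic = 0 if am == 1

# ifelse() does the same in one step
cars$automatic2 <- ifelse(cars$am == 0, 1, 0)

# check that both versions agree
table(cars$automatic, cars$automatic2)
```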

Recode variables¶

A second way to generate new variables in Stata is recode. The recode command takes an existing variable and recodes its values. Importantly, the option gen(variable) creates a new variable, because we certainly do not want to overwrite the original variable or its values.

In [16]:

#Stata:
#Let's generate a new variable which indicates males in our data:
# recode female (0=1) (1=0), gen(male)
#Thus, male is 1 where female is 0 and vice versa, saved as the new variable male

In [17]:

#In R, a new variable is added to the data frame with $ and the assignment operator <-
#recode() from the dplyr package recodes values similarly to Stata
#Install and load dplyr first!
library(dplyr)
mtcars$new_var <- recode(mtcars$am, `0` = 1, `1` = 0)
mtcars %>% slice(1:5)

A data.frame: 5 × 12
mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	new_var
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
21.0	6	160	110	3.90	2.620	16.46	0	1	4	4	0
21.0	6	160	110	3.90	2.875	17.02	0	1	4	4	0
22.8	4	108	93	3.85	2.320	18.61	1	1	4	1	0
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.440	17.02	0	0	3	2	1

if conditions¶

In Stata, we often use if conditions for data preparation and analyses. R works similarly, but instead of the if condition we can use the filter() function from the dplyr package. With filter(), we keep only the observations that meet a condition, just like the if condition. We call the filter function, give it the name of the data frame we want to filter, and then use logical comparisons and functions to select specific observations. Do you have any idea which observations remain after the filter below?

In [18]:

#Stata: mean var if var==0

In [19]:

#in R: 
filter(mtcars, am == 0)%>% slice(1:5)

A data.frame: 5 × 12
mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	new_var
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2	1
18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1	1
14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4	1
24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2	1

The condition am == 0 tells R to keep only the observations that meet this criterion. In other words, the filtered data include only cars with an automatic transmission (am == 0).
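
The Stata line above computes a mean for a subset, while the filter() call only returns the matching rows. A base R sketch that completes the calculation, assuming we want the mean of mpg (the choice of mpg and the object names are ours, not from the Stata example):

```r
# logical indexing: keep only the mpg values where am equals 0
mpg_automatic <- mtcars$mpg[mtcars$am == 0]
mean(mpg_automatic)                 # like: mean mpg if am == 0

# subset() returns the filtered data frame, similar to filter()
automatic_cars <- subset(mtcars, am == 0)
nrow(automatic_cars)
```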

Load a dataset¶

In Stata, you load a dataset with the use command. You have to tell Stata where to find the data by providing the file path, and the clear option removes any data currently in memory. In R, we can use the read_csv() function from the readr package to load a CSV file; the rest follows the same logic. With "<-" we save the loaded data under a name of our choice.

In [27]:

#Don't forget to install and load readr before running the command!
library(readr)
dataframe <- read_csv("C:/Users/Edgar Treischl/Desktop/titanic_R.csv")
head(dataframe)

A tibble: 6 × 4
class	adult	male	survived
<chr>	<chr>	<chr>	<chr>
first	adult	female	yes
crew	adult	male	yes
first	adult	male	yes
crew	adult	male	yes
third	adult	female	yes
crew	adult	male	no
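
Because the titanic file above lives on the author's machine, here is a self-contained sketch of the same round trip that needs no external file. It writes mtcars to a temporary CSV and reads it back with base R's read.csv() instead of readr; the object names path and loaded are illustrative.

```r
path <- tempfile(fileext = ".csv")         # temporary file location
write.csv(mtcars, path)                    # row names are written as the first column
loaded <- read.csv(path, row.names = 1)    # use that first column as row names again
head(loaded)
unlink(path)                               # delete the temporary file
```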

Thanks to RStudio, we don't have to remember these steps. Use RStudio to import data directly. It supports Stata, SPSS and other data formats. Let's have a look!¶

The button to import new data is located in the data/packages window (top right). Click on it.¶

This opens the import window and you can select the new dataset.¶

Take another look at the code preview in the right corner of the import window. After selecting the file, RStudio gives us the code and the corresponding package which is needed to load the data. Now, all you have to do is copy and paste the code to load the new dataset.¶

I hope this short presentation has given you an impression of how the basic commands work in R. Fortunately, R has a big community and you can find many tips and tricks on the internet. Give it a try and search online for how to run a linear regression in R.¶

In [24]:

# Running a multiple regression
#Let's see whether horsepower and the number of cylinders can predict mpg (miles per gallon)
fit <- lm(mpg ~ hp + cyl, data=mtcars) 
summary(fit) # show results

Call:
lm(formula = mpg ~ hp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4948 -2.4901 -0.1828  1.9777  7.2934 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.90833    2.19080  16.847  < 2e-16 ***
hp          -0.01912    0.01500  -1.275  0.21253    
cyl         -2.26469    0.57589  -3.933  0.00048 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.173 on 29 degrees of freedom
Multiple R-squared:  0.7407,	Adjusted R-squared:  0.7228 
F-statistic: 41.42 on 2 and 29 DF,  p-value: 3.162e-09
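
Once the model is stored in fit, individual pieces can be pulled out of it. A small sketch; the new car with 150 hp and 6 cylinders is a made-up example value, and new_car is an illustrative name:

```r
fit <- lm(mpg ~ hp + cyl, data = mtcars)
coef(fit)        # the estimated coefficients only
confint(fit)     # 95% confidence intervals for the coefficients

# predicted mpg for a hypothetical car
new_car <- data.frame(hp = 150, cyl = 6)
predict(fit, newdata = new_car)
```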

As you can see, you can even run a regression without much knowledge of R. So, just give it a try and see whether you can run the commands you have learned on your own.¶