# R is free software for statistical computing
# Everything in this box is R code: copy and paste it and try it on your own!
# Everything after a # is a comment
Before we start with the first calculations, please check whether the installation has worked. Open a new R script via File > New File > R Script. The R code is saved in the script, which works like a do-file in Stata.
# The assignment operator "<-" assigns the value on its right to the name on its left.
a <- 5
b <- 7
# You can see what is stored in an object by calling its name.
a
b
a + b
# With the assignment operator you can store numeric values, text, variables, or even complete datasets as objects.
# Assigning to an existing name overwrites its old value.
b <- "Hello World"
b
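To make the idea of stored objects more concrete, here is a minimal sketch in base R. The names x, doubled, and greeting are hypothetical and chosen only for illustration.

```r
# c() combines several values into one vector
x <- c(1, 2, 3)

# Arithmetic is applied to every element of the vector
doubled <- x * 2
doubled

# class() reports what kind of object is stored under a name
class(x)

greeting <- "Hello World"
class(greeting)
```

Calling class() is a quick way to check whether an object holds numbers or text before you calculate with it.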
# install.packages("dplyr") installs the package dplyr (needed only once)
# library(dplyr) loads the package dplyr (needed in every new R session)
# Both commands are commented out with a # because the packages are already installed on my machine. Please remove the # to install and load them on yours.
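If you are unsure whether a package is already installed, you can check before installing. The following is a small sketch, not part of the original material: missing_packages is a hypothetical helper, and the install line stays commented out because it needs an internet connection.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "readr"))
to_install

# Install only what is missing (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means that both packages are available and only need to be loaded with library().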
In Stata, the describe command shows stored information about the data. In R, we can display the structure of a dataset with the str() function; the name of the dataset goes inside the parentheses.
#Stata: describe
#In R
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
$ am  : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The mtcars data contain 32 observations with information about cars, including fuel consumption (mpg, miles per gallon), horsepower (hp), and the type of transmission (am).
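Besides str(), a few base R functions return single pieces of the same information. This is a short sketch using the built-in mtcars data; the object names n_obs and n_vars are hypothetical.

```r
# Number of observations (rows) and variables (columns)
n_obs <- nrow(mtcars)
n_vars <- ncol(mtcars)
n_obs
n_vars

# Names of all variables in the dataset
names(mtcars)

# Storage type of a single variable
class(mtcars$mpg)
```

These functions are handy when you only need one number, for example to check the sample size after a data preparation step.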
In Stata, the list command displays single observations of the data, for example as a table. In R, we can use the head() function.
#Stata: list in 1/5, table
#R: head() returns the first rows of the data; %>% slice(1:5) (from dplyr) then keeps only the first 5 cases, like the Stata command
head(mtcars) %>% slice(1:5)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Commands like list or head are very useful, especially to check whether the data management steps have worked.
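The same kind of check also works without dplyr. The sketch below uses only base R; first_five, last_three, and some_cols are hypothetical names.

```r
# head() takes the number of rows as a second argument
first_five <- head(mtcars, 5)
first_five

# tail() shows the last rows instead
last_three <- tail(mtcars, 3)
last_three

# Square brackets select rows and columns: data[rows, columns]
some_cols <- mtcars[1:5, c("mpg", "hp")]
some_cols
```

In base R the row names (here the car models) are kept, which can make it easier to spot individual observations.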
In Stata, the summarize command calculates summary statistics of a distribution, such as the mean. With the detail option, additional measures such as the median and other percentiles are displayed. The summary() function in R works similarly.
#STATA: summarize var, detail
#R
summary(mtcars)
statistic | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Min. | 10.40 | 4.000 | 71.1 | 52.0 | 2.760 | 1.513 | 14.50 | 0.0000 | 0.0000 | 3.000 | 1.000 |
1st Qu. | 15.43 | 4.000 | 120.8 | 96.5 | 3.080 | 2.581 | 16.89 | 0.0000 | 0.0000 | 3.000 | 2.000 |
Median | 19.20 | 6.000 | 196.3 | 123.0 | 3.695 | 3.325 | 17.71 | 0.0000 | 0.0000 | 4.000 | 2.000 |
Mean | 20.09 | 6.188 | 230.7 | 146.7 | 3.597 | 3.217 | 17.85 | 0.4375 | 0.4062 | 3.688 | 2.812 |
3rd Qu. | 22.80 | 8.000 | 326.0 | 180.0 | 3.920 | 3.610 | 18.90 | 1.0000 | 1.0000 | 4.000 | 4.000 |
Max. | 33.90 | 8.000 | 472.0 | 335.0 | 4.930 | 5.424 | 22.90 | 1.0000 | 1.0000 | 5.000 | 8.000 |
#Specific variables can be selected with $variable_name.
summary(mtcars$mpg)
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
10.40 | 15.43 | 19.20 | 20.09 | 22.80 | 33.90 |
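Each statistic reported by summary() can also be computed on its own. The following sketch uses base R functions on mtcars$mpg; the object names are hypothetical.

```r
# Central tendency
mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)

# Spread: the standard deviation is not part of summary()
mpg_sd <- sd(mtcars$mpg)

# First and third quartile
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.75))

mpg_mean
mpg_median
mpg_sd
mpg_quartiles
```

This is useful when a single value is needed later on, for instance to center a variable around its mean.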
In Stata, the tab command gives you a simple frequency table. If you give Stata two variable names, it returns a cross table. In R, we can use the table() function to produce a similar one.
#STATA: tab var1 var2
#R
table(mtcars$am, mtcars$vs)
am (rows) / vs (columns) | 0 | 1 |
---|---|---|
0 | 12 | 7 |
1 | 6 | 7 |
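Stata's tab can also report totals and percentages. A rough base R counterpart is sketched below; tab and row_share are hypothetical names, and naming the arguments (am =, vs =) only adds labels to the output.

```r
# Cross table with labelled dimensions
tab <- table(am = mtcars$am, vs = mtcars$vs)
tab

# Add row and column totals
addmargins(tab)

# Row proportions: each row sums to 1
row_share <- prop.table(tab, margin = 1)
row_share
```

Setting margin = 2 would give column proportions instead.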
Data preparation follows its own logic in R, and it would take a session of its own to cover it properly. So let's first focus on how it works in Stata; we will look at the differences between R and Stata in another session.
In Stata, a new variable can be created with the generate and replace commands. The generate command creates a variable filled with missing values (.) or a constant (e.g. 1). These values are placeholders and can be overwritten with replace combined with an if qualifier, depending on a specified condition. For example:
#Stata:
# gen female=.
#At first, the variable female contains only missing values (.)
# replace female=1 if male==0
#If male equals (==) 0, the generated variable becomes 1.
# replace female=0 if male==1
#If male equals (==) 1, the generated variable becomes 0.
A second way to generate new variables in Stata is recode. The recode command takes an existing variable and recodes its values. Importantly, the option gen(variable) creates a new variable, because we certainly do not want to overwrite the original variable or its values.
#Stata:
#Let's generate a new variable that indicates males in our data:
# recode female 0=1 1=0, gen(male)
#Thus, males are now coded 1 and females 0, saved in the new variable male
#In R, a new variable is added to the dataset with <-
#recode() from the dplyr package recodes values similarly to Stata
#Install dplyr if necessary, then load it:
library(dplyr)
mtcars$new_var <- recode(mtcars$am, `0` = 1, `1` = 0)
mtcars %>% slice(1:5)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | new_var |
---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | 0 |
21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 0 |
22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 0 |
21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 1 |
18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 | 1 |
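The generate/replace logic from Stata can also be mirrored step by step in base R. The following is only a sketch under stated assumptions: cars is a hypothetical copy of mtcars, and automatic and automatic2 are hypothetical variable names.

```r
# Work on a copy so that the original data stay untouched
cars <- mtcars

# Like "gen automatic = .": start with missing values only
cars$automatic <- NA

# Like "replace automatic = 1 if am == 0"
cars$automatic[cars$am == 0] <- 1

# Like "replace automatic = 0 if am == 1"
cars$automatic[cars$am == 1] <- 0

# ifelse() does the same in a single step
cars$automatic2 <- ifelse(cars$am == 0, 1, 0)

# Check the result against the original variable
table(cars$am, cars$automatic)
```

Cross-tabulating the old and the new variable, as in the last line, is a simple way to verify that a recoding step has worked.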
In Stata, we often use if conditions for data preparation and analyses. The R universe works similarly, but instead of the if condition we can use the filter() function from the dplyr package. With filter(), we keep only the observations that fulfill a condition, just like the if condition does. We call the filter function, provide R with the name of the data frame we want to filter, and then use mathematical operators and functions to select specific observations. Do you have any idea which observations remain after the filter below?
#Stata: mean var if var==0
#in R:
filter(mtcars, am == 0) %>% slice(1:5)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | new_var |
---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 1 |
18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 | 1 |
18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | 1 |
14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 | 1 |
24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 | 1 |
The condition am == 0 tells R to keep only the observations that meet this criterion. In other words, the filtered data include only cars with an automatic transmission (am == 0).
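To come closer to the Stata line "mean var if var==0", the filtered data can be passed on to a calculation. The sketch below uses base R instead of dplyr; automatic_cars and the two mean objects are hypothetical names.

```r
# subset() keeps the rows that fulfill the condition
automatic_cars <- subset(mtcars, am == 0)
nrow(automatic_cars)

# Mean of mpg for cars with an automatic transmission
mean_subset <- mean(automatic_cars$mpg)

# The same result with a logical condition in square brackets
mean_index <- mean(mtcars$mpg[mtcars$am == 0])

mean_subset
mean_index
```

Both approaches return the same value; which one to use is mostly a matter of taste.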
In Stata, you can load a dataset with the use command. You have to tell Stata where to find the data by providing the file path, and the clear option removes the data currently held in memory. In R, we need the package readr and its read_csv() function to load a CSV file; the rest follows the same logic. With "<-" we save the loaded data under a name of our choice.
#Don't forget to install and load readr before running the command!
library(readr)
dataframe <- read_csv("C:/Users/Edgar Treischl/Desktop/titanic_R.csv")
head(dataframe)
class | adult | male | survived |
---|---|---|---|
<chr> | <chr> | <chr> | <chr> |
first | adult | female | yes |
crew | adult | male | yes |
first | adult | male | yes |
crew | adult | male | yes |
third | adult | female | yes |
crew | adult | male | no |
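If you do not have the Titanic file at hand, you can try out the loading step with a temporary file. This is a sketch under several assumptions: it uses the base R functions write.csv() and read.csv() instead of readr, and the three rows in toy are made up for illustration and are not taken from the Titanic data.

```r
# Made-up example rows with the same column names as the Titanic file
toy <- data.frame(
  class = c("first", "crew", "third"),
  adult = c("adult", "adult", "child"),
  male = c("female", "male", "male"),
  survived = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Write the data to a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Load the file again and save it under a name of our choice
loaded <- read.csv(path, stringsAsFactors = FALSE)
head(loaded)
```

With your own data, replace path with the location of your file, just as in the read_csv() call above.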
# Running a multiple regression
#Let's see whether horsepower and the number of cylinders can predict mpg (miles per gallon)
fit <- lm(mpg ~ hp + cyl, data=mtcars)
summary(fit) # show results
Call:
lm(formula = mpg ~ hp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4948 -2.4901 -0.1828  1.9777  7.2934

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.90833    2.19080  16.847  < 2e-16 ***
hp          -0.01912    0.01500  -1.275  0.21253
cyl         -2.26469    0.57589  -3.933  0.00048 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.173 on 29 degrees of freedom
Multiple R-squared: 0.7407, Adjusted R-squared: 0.7228
F-statistic: 41.42 on 2 and 29 DF, p-value: 3.162e-09
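The fitted model is stored as an object, so single results can be extracted from it. The following sketch refits the same model and uses base R functions; new_car is a hypothetical car with 110 horsepower and 6 cylinders, chosen only for illustration.

```r
# Same model as above
fit <- lm(mpg ~ hp + cyl, data = mtcars)

# Coefficients as a named vector
b <- coef(fit)
b

# 95% confidence intervals for the coefficients
confint(fit)

# R-squared from the model summary
r2 <- summary(fit)$r.squared
r2

# Predicted mpg for a hypothetical car
new_car <- data.frame(hp = 110, cyl = 6)
pred <- predict(fit, newdata = new_car)
pred
```

The prediction is simply the intercept plus each coefficient multiplied by the corresponding value of the hypothetical car.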