Introduction to Open Data Science is about discovering the patterns hidden behind numbers in matrices and arrays. The course introduces open data science using tools such as R, RStudio, R Markdown, and GitHub. It includes coding in R, analysing and modelling data, and maintaining a course diary.
Week 2, Anton Jyrkiäinen
The data used is from a study investigating the relationship between teaching, students' learning approaches and their exam results.
Link to data here and link to metadata here.
First we import the data which we wrangled and saved locally in the first part of the exercise, assigning it to the variable lrn14 using read.table:
lrn14 <- read.table("C:/Users/Anton/Documents/IODS-project/learning2014.txt")
We also need to import a few dependencies for later use:
library(GGally)
## Loading required package: ggplot2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
By using str() we can have a look at the structure of the data:
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The output gives us meta information about the data in general and about each column: the number of rows and columns, variable names, data types and the first values. There are seven variables: gender, age, global attitude toward statistics, exam points and three composite variables ("deep", "surf" and "stra"), each computed as the mean of answers to a set of related questions. In the same order, these stand for "intention to maximize understanding, with a true commitment to learning", "memorizing without understanding, with a serious lack of personal engagement in the learning process" and "the ways students organize their studying".
The questions were measured on a scale from 1 to 5 (1 = disagree, 5 = agree). The data covers answers from 183 participants, of whom 67% were female, with a mean age of 25.6 years. Observations with 0 exam points were excluded in the wrangling part of the exercise.
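The three composite variables are simply means over the related question items. A minimal sketch of that computation on made-up data (the item names D03, D06, D07 are hypothetical, not the actual question codes):

```r
# Toy illustration: a composite score is the row mean of related Likert items.
# Item names below are invented for the sketch, not the real question codes.
answers <- data.frame(
  D03 = c(4, 3, 5),
  D06 = c(3, 3, 4),
  D07 = c(5, 2, 4)
)
deep <- rowMeans(answers)  # one composite score per respondent
round(deep, 2)
```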
Using dim() we get the dimensions of the dataset:
dim(lrn14)
## [1] 166 7
The ggpairs() function provides a matrix of plots consisting of density plots, bar plots, histograms, scatter plots, correlation coefficients and box plots. Together these cover all pairwise combinations of the variables.
Using the dataset with ggpairs() we generate a graphical overview:
p <- ggpairs(lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
And with summary() we get a summary of the variables:
summary(lrn14)
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The density plots alone give us a rough idea of how each variable is distributed. All the variables seem close to normally distributed except for "Age" and "deep". Age ranges from 17 to 55, so it makes sense for the density to be concentrated at the young end, since the participants are students. The "deep" variable shows an interesting bias towards agreeing with the questions: from the summary we see that the mean answer to the deep questions is 3.68.
A regression model is fitted with lm(), using "Points" as the dependent variable and "Attitude", "stra" and "deep" as explanatory variables. A summary of the model is then generated with summary():
lm1 <- lm(formula = Points ~ Attitude + stra + deep, data = lrn14)
lm1
##
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = lrn14)
##
## Coefficients:
## (Intercept) Attitude stra deep
## 11.3915 0.3525 0.9621 -0.7492
The intercept is the value of y ("Points") when all the explanatory variables ("Attitude", "stra", "deep") are 0. The other values are the slopes of the regression model, i.e. the change in "Points" per unit change in each explanatory variable.
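As a concrete check, a fitted value is the intercept plus each coefficient times the corresponding explanatory value. A sketch using the printed coefficients and the first observation's values from the str() output above:

```r
# Reconstruct one fitted value by hand from the printed coefficients.
b <- c(intercept = 11.3915, Attitude = 0.3525, stra = 0.9621, deep = -0.7492)
x <- c(Attitude = 37, stra = 3.38, deep = 3.58)  # first observation
pred <- unname(b["intercept"] + sum(b[-1] * x))
round(pred, 2)  # about 25, close to the observed 25 points
```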
summary(lm1)
##
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.39145 3.40775 3.343 0.00103 **
## Attitude 0.35254 0.05683 6.203 4.44e-09 ***
## stra 0.96208 0.53668 1.793 0.07489 .
## deep -0.74920 0.75066 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
The residuals are the differences between the observed values of the dependent variable and the predicted values; each data point has one. The residuals should be normally distributed, and for a model with an intercept their sum is always 0. Residuals can reveal patterns in the data that the model leaves unexplained.
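A quick sketch on toy data (not the course data) of these properties:

```r
# Residual = observed y minus fitted y; for an OLS fit with an intercept
# the residuals sum to zero up to floating-point error.
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)
all.equal(as.numeric(resid(fit)), as.numeric(y - fitted(fit)))  # TRUE
sum(resid(fit))  # effectively 0
```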
The coefficient table shows the estimate, standard error, t-value and p-value. The standard error is an estimate of the standard deviation of the coefficient estimate. The t-value is the estimate divided by its standard error. The p-value is the probability of observing a t-value at least as extreme as the one presented if the null hypothesis were true: a low p-value indicates strong evidence against the null hypothesis, a high p-value weak or no evidence against it.
The output also gives "Signif. codes" which make the results easier to interpret at a glance. Here "Attitude" received a p-value well below 0.001, meaning strong evidence against the null hypothesis: it is unlikely that "Attitude" is unrelated to "Points". "stra" had a p-value between 0.05 and 0.1, which is only weak evidence against the null hypothesis. The variable "deep" does not have a statistically significant relationship with the target variable.
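The t-value and p-value in the table can be reproduced from the estimate and standard error alone; a sketch using the "Attitude" row above (162 residual degrees of freedom):

```r
# t = estimate / standard error; two-sided p-value from the t distribution.
estimate  <- 0.35254
std_error <- 0.05683
t_value   <- estimate / std_error
p_value   <- 2 * pt(abs(t_value), df = 162, lower.tail = FALSE)
round(t_value, 3)  # matches the 6.203 in the summary table
```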
Next we fit the model again without variable “deep”:
lm1 <- lm(formula = Points ~ Attitude + stra, data = lrn14)
lm1
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = lrn14)
##
## Coefficients:
## (Intercept) Attitude stra
## 8.9729 0.3466 0.9137
summary(lm1)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Without the variable "deep" we get similar results, with slightly higher p-values for "Attitude" and "stra" but still within the same significance codes as before. The R-squared and F-statistic also stay roughly the same.
Residual standard error is the standard deviation of the residuals.
R-squared tells us how close the data are to the fitted regression line: the higher the R-squared, the better the model fits the data. If R-squared were 100%, all fitted values would lie exactly on the regression line. The difference between multiple and adjusted R-squared is that the adjusted version accounts for R-squared automatically increasing whenever more explanatory variables are added to the model. The adjusted R-squared of 19.51% indicates that the data points are quite scattered around the fitted line, so predictions from this model would be somewhat inaccurate.
The F-statistic is the variation between sample means divided by the variance within the samples. Its p-value indicates that at least some of the explanatory variables are statistically significant; if this p-value indicated insignificance, we could not reject the null hypothesis that the model explains nothing.
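The two statistics are directly linked: with k explanatory variables and n observations, F = (R^2/k) / ((1 - R^2)/(n - k - 1)). A sketch plugging in the numbers from the summary above:

```r
# F-statistic recovered from R-squared: F = (R^2/k) / ((1 - R^2)/(n - k - 1)).
r2 <- 0.2048  # multiple R-squared of the two-predictor model
k  <- 2       # explanatory variables (Attitude, stra)
n  <- 166     # observations
f  <- (r2 / k) / ((1 - r2) / (n - k - 1))
round(f, 2)   # close to the 20.99 reported above
```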
Using plot() we generate three diagnostic plots: "Residuals vs Fitted", "Normal Q-Q" and "Residuals vs Leverage". We pass the model as a parameter together with an index selecting the right plot.
Residuals vs fitted values:
plot(lm1, which = c(1))
The variability of the points is roughly equal across the fitted values, which is good: there are no signs here that would indicate a problem with the model.
Normal Q-Q:
plot(lm1, which = c(2))
If the observations are normally distributed, the plot should result in an approximately straight line. The line the points form above is fairly straight, so there should be no problem here. There are some extreme values, but they are no more extreme than would be expected, and similar extremities appear at both the low and high ends, which fits a normal distribution.
Residuals vs leverage:
plot(lm1, which = c(5))
This plot shows whether there are extreme values that are influential. Even though there are some extreme values, they are not influential, since they more or less follow the trend of the other values. Excluding an influential point would radically change the results, which does not seem to be the case here.
Week 3, Anton Jyrkiäinen
Dependencies used:
library(dplyr)
library(ggplot2)
library(tidyr)
First we import the data we wrangled in the first part of the exercise. Here the ready-wrangled version is read from the DataCamp servers into a variable alc using read.table:
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header=TRUE)
Variables in the dataset:
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset consists of data collected via school reports and questionnaires from two Portuguese schools, regarding student performance in two subjects: mathematics and Portuguese.
Link to the data here and link to metadata here.
Absences, study time and grades go together well when analysing school performance. I expect absences to be positively correlated, and study time and grades negatively correlated, with high alcohol use. Gender will probably make no difference and will not be statistically significant. Therefore the variables used are absences, studytime, G3 and sex.
Barplot of each variable we’re interested in:
ggplot(data = alc, aes(x = absences)) + geom_bar()
ggplot(data = alc, aes(x = G3)) + geom_bar()
ggplot(data = alc, aes(x = sex)) + geom_bar()
ggplot(data = alc, aes(x = studytime)) + geom_bar()
In the barplots we can observe the distributions of the chosen variables. Few or no absences is the most common case. Most students get average or slightly above-average grades, though quite a large number of students received a grade of 0. There are slightly more female than male students, and the majority seem to study less than 2 hours or 2-5 hours weekly.
Boxplots:
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot()
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot()
g2 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex))
g2 + geom_boxplot()
From the boxplots we can see that final grades were better in the group without heavy alcohol use. There also seem to be more absences in the heavy-use group, and study time appears higher in the group without heavy alcohol use.
Grade means by group:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [?]
## sex high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 157 9.69
## 2 F TRUE 41 10.4
## 3 M FALSE 113 11.6
## 4 M TRUE 71 9.97
Among the female students the grade mean is 0.71 points higher when the student is categorized as a high user of alcohol. Among the male students there is a large grade drop of 1.63 points for students in the high-use group.
Absence mean by group:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 4 x 4
## # Groups: sex [?]
## sex high_use count mean_absences
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 157 4.97
## 2 F TRUE 41 8.90
## 3 M FALSE 113 3.19
## 4 M TRUE 71 7.41
We can observe that high alcohol use goes together with more absences: female high users average 3.93 more absences, and for male students the mean more than doubles.
Study time mean by group:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## # A tibble: 4 x 4
## # Groups: sex [?]
## sex high_use count mean_studytime
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 157 2.34
## 2 F TRUE 41 2
## 3 M FALSE 113 1.88
## 4 M TRUE 71 1.62
Average study times also appear lower among high users, for both sexes.
Summary of logistic regression with our variables and high/low alcohol consumption set as the target variable:
m <- glm(high_use ~ sex + absences + G3 + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + absences + G3 + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6894 -0.8391 -0.5951 1.1034 2.1852
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.64915 0.45507 -1.426 0.153725
## sexM 0.84250 0.25698 3.278 0.001044 **
## absences 0.07076 0.01821 3.887 0.000102 ***
## G3 -0.02575 0.02635 -0.977 0.328574
## studytime -0.41864 0.16119 -2.597 0.009398 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 416.74 on 377 degrees of freedom
## AIC: 426.74
##
## Number of Fisher Scoring iterations: 4
From the summary of the model we can see that our original hypothesis does not hold: gender is in fact statistically significant and the final grade is not. The summary reveals that students are more likely to be in the high-use group if they are male, and that high use is associated with absences and study time. The deviance residuals seem fine, since they are roughly centered on zero and symmetrical. The null deviance and residual deviance can be used to compare models and to calculate a pseudo R-squared and a p-value for the model; the AIC can also be used for comparing models.
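A sketch of both uses, plugging in the deviances printed above (462.21 on 381 df, 416.74 on 377 df):

```r
# Likelihood-ratio test of the whole model and McFadden's pseudo R-squared,
# both computed from the null and residual deviances in the summary above.
null_dev  <- 462.21
resid_dev <- 416.74
lr_p <- pchisq(null_dev - resid_dev, df = 381 - 377, lower.tail = FALSE)
pseudo_r2 <- 1 - resid_dev / null_dev
round(pseudo_r2, 3)  # roughly 0.098
```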
Odds ratios with their confidence intervals:
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.5224889 0.2125860 1.2719337
## sexM 2.3221692 1.4096724 3.8686858
## absences 1.0733272 1.0379854 1.1144115
## G3 0.9745836 0.9257160 1.0267950
## studytime 0.6579430 0.4751282 0.8953331
The OR column is the exponent of the coefficient, which can be interpreted as the odds ratio associated with a unit change in the corresponding explanatory variable. The row sexM can therefore be interpreted as male students being ~2.3 times as likely to be heavy users of alcohol as female students.
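The same unit-change reading extends multiplicatively: each extra unit multiplies the odds by the OR once. A small sketch with the absences OR above:

```r
# Each extra absence multiplies the odds of high use by ~1.073, so five
# extra absences multiply the odds by 1.073^5.
or_absences <- 1.0733272
or_absences^5              # odds multiplier for +5 absences
exp(5 * log(or_absences))  # identical, via the coefficient (log-odds) scale
```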
We will be dropping G3 from the model:
m <- glm(high_use ~ sex + absences + studytime, data = alc, family = "binomial")
Let’s explore the predictive power of the model:
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
First we used the predict() function to get the probability of high use for each student. Then we added the predicted probabilities to alc and created a column prediction holding a boolean predictor of high use: TRUE if the probability was higher than 0.5, otherwise FALSE.
Let’s look at a confusion matrix:
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 262 8
## TRUE 85 27
We can see that the prediction was correct in 289 cases (262 + 27) and wrong in 93 cases (85 + 8).
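The same counts give the training accuracy directly; a sketch recomputing it from the table above:

```r
# Accuracy = correct predictions / all predictions, from the confusion matrix.
conf <- matrix(c(262, 85, 8, 27), nrow = 2,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(conf)) / sum(conf)  # (262 + 27) / 382
round(accuracy, 3)
```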
Tabulate the target variable versus the prediction:
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68586387 0.02094241 0.70680628
## TRUE 0.22251309 0.07068063 0.29319372
## Sum 0.90837696 0.09162304 1.00000000
Week 4, Anton Jyrkiäinen
R has some datasets already loaded in. Let’s use the Boston dataset from the MASS package:
library(MASS)
data("Boston")
Details about the dataset can be found here.
Structure and dimensions of the data:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The dataset has 506 rows and 14 columns. It concerns housing values in the suburbs of Boston, with information such as the average number of rooms per dwelling and the accessibility of radial highways.
The summary() output above shows the first and third quartiles, the minimum, maximum, median and mean of each variable.
With the corrplot package we can visualize the correlations between the variables in the data:
library(dplyr)
library(corrplot)
First we calculate the correlation matrix, round it and then visualize it with corrplot():
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
The strong correlations are easy to distinguish from the matrix. For example, "indus" (proportion of non-retail business acres per town) is negatively correlated with "dis" (weighted mean of distances to five Boston employment centres), and "rad" (index of accessibility to radial highways) is positively correlated with "tax" (full-value property-tax rate per $10,000). One notable anomaly is that "chas" (Charles River dummy variable) has very little correlation with any other variable in the dataset.
For linear discriminant analysis we need to scale the data: subtracting each column mean from the corresponding column and dividing the difference by the standard deviation. This is easily managed with the scale() function.
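A quick sketch confirming that scale() does exactly this columnwise standardization:

```r
# (x - mean(x)) / sd(x) reproduces what scale() does to each column.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.vector(scale(x)))  # TRUE
```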
boston_scaled <- scale(Boston)
Let’s look at the summaries of the scaled variables:
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
At first glance the most noticeable difference is that all the variables now have zero mean, and all of them range from negative to positive values. Standardization makes comparing the variables with each other much easier, since the ranges of the raw values could vary widely; the values are now on a similar scale across the board, so a change of a given size means proportionally the same thing in each variable.
We also want the data to be a data frame:
boston_scaled <- as.data.frame(boston_scaled)
Now we create a categorical variable of the crime rate in the Boston dataset using the quantiles as break points:
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
And divide the dataset into train and test sets so that 80% of the data belongs to the train set:
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
Save the correct classes from the test data and then drop the crime variable from it:
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
This is done because, when using a statistical method to predict something, it is important to have held-out data against which to test how well the predictions fit.
Fit the linear discriminant analysis on the train set and print it. We use the categorical crime rate as the target variable and all the other variables as predictors:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2450495 0.2599010 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 1.0276790 -0.8945876 -0.11163110 -0.8833220 0.41451190
## med_low -0.1063266 -0.2885723 -0.03371693 -0.5796700 -0.09846022
## med_high -0.3876019 0.1885343 0.21512144 0.4200668 0.02251974
## high -0.4872402 1.0171096 -0.11793298 1.0538584 -0.40191062
## age dis rad tax ptratio
## low -0.8991701 0.8864357 -0.6947544 -0.7181995 -0.49085598
## med_low -0.3854024 0.3712842 -0.5491659 -0.4696072 -0.07137524
## med_high 0.4033777 -0.3811264 -0.4349821 -0.3223199 -0.28695774
## high 0.8173493 -0.8632636 1.6382099 1.5141140 0.78087177
## black lstat medv
## low 0.37860892 -0.73848593 0.507241499
## med_low 0.31658776 -0.16793370 0.008184585
## med_high 0.08084392 0.04686293 0.126908825
## high -0.78523963 0.91990436 -0.739838854
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.092803696 0.78937057 -1.05389479
## indus 0.058380656 -0.18169750 0.28732932
## chas -0.033303999 -0.04495618 0.02604864
## nox 0.356626606 -0.75427270 -1.33323649
## rm 0.016175361 -0.05174669 -0.05437785
## age 0.233487691 -0.33951777 -0.12725798
## dis -0.078487201 -0.32243666 0.26158232
## rad 3.387656397 0.93142368 -0.08138651
## tax 0.001682572 -0.04055726 0.59962809
## ptratio 0.131273170 -0.03802694 -0.30567656
## black -0.092894573 0.02917679 0.07167844
## lstat 0.181566812 -0.25803873 0.34678385
## medv 0.089905710 -0.51358986 -0.15872972
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9502 0.0381 0.0117
Then we define a function for the LDA biplot arrows, convert the target classes to numeric and plot the LDA results:
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
In the exercise it was asked to save and drop the crime variable but it was already done earlier. So either I don’t understand something or there’s a mistake in the assignment.
We continue by predicting classes with the test data and cross-tabulate the results:
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 14 1 0
## med_low 6 15 6 0
## med_high 1 1 17 2
## high 0 0 0 25
The predictions are mostly correct, with the high class predicted perfectly; most of the errors come from confusing the low and med_low classes with their neighbours.
Next we reload the Boston dataset, standardize it and assign it to a variable:
data("Boston")
bostonScaled <- scale(Boston)
Calculate the distances between observations using the Euclidean distance:
dist_eu <- dist(bostonScaled)
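dist() defaults to the Euclidean distance, the square root of the summed squared differences between two observations; a tiny sketch:

```r
# Euclidean distance between two points, by hand and via dist().
a <- c(0, 0)
b <- c(3, 4)
sqrt(sum((a - b)^2))           # 5, the classic 3-4-5 triangle
as.numeric(dist(rbind(a, b)))  # the same value
```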
We run the k-means algorithm on the data set:
km <- kmeans(bostonScaled, centers = 3)
Plot the dataset with clusters:
pairs(bostonScaled, col = km$cluster)
Let’s investigate what is the optimal number of clusters with the method provided in the DataCamp exercises:
library(ggplot2)
# Boston dataset is available
set.seed(123)
# determine the number of clusters
k_max <- 5
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bostonScaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The plot visualises how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. The sweet spot is where the WCSS drops sharply. As we can see from the plot, the drop is most drastic when going from one cluster to two, so we should use two clusters.
Let’s run the algorithm again with two clusters:
km <- kmeans(bostonScaled, centers = 2)
pairs(bostonScaled, col = km$cluster)
Week 5, Anton Jyrkiäinen
The dataset comes from the United Nations Development Programme’s Human Development Reports. Let’s start by loading it:
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep = ",", header=TRUE)
More information about the data can be found here, and metadata of the data we have used here.
Summary of the dataset:
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The summary shows the minimum, maximum, median, mean and the 1st and 3rd quartiles of each variable. We can also see that some variables span huge ranges, for example adolescent birth rate (Ado.Birth).
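The differing scales are easy to quantify directly, for example by computing the range of each column:

```r
# Range (max - min) of each variable; GNI spans roughly five orders of
# magnitude more than the ratio-valued variables
sapply(human, function(x) max(x) - min(x))
```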
Plot matrix with ggpairs():
library(GGally)
library(ggplot2)
ggpairs(human)
From the distributions we can see that the maternal mortality and adolescent birth rate are both very skewed. Only “Parli.F” and “Edu.Exp” seem to be close to normally distributed.
Correlation matrix:
library(dplyr)
library(corrplot)
cor_matrix <- cor(human) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Here we can easily see that life expectancy at birth is negatively correlated with maternal mortality ratio and adolescent birth rate. The same goes for expected years of schooling, which is also negatively correlated with maternal mortality ratio and adolescent birth rate. Strong positive correlations appear too: expected years of schooling is positively correlated with life expectancy, and maternal mortality ratio is positively correlated with adolescent birth rate.
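The strongest pairs can also be listed programmatically rather than read off the plot. A sketch (the 0.7 cutoff is an arbitrary choice; each pair appears twice because the matrix is symmetric):

```r
# Find variable pairs with |correlation| above 0.7, excluding the diagonal
high_cor <- which(abs(cor_matrix) > 0.7 & abs(cor_matrix) < 1, arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           cor  = cor_matrix[high_cor])
```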
Next we perform principal component analysis:
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## Edu2.FM -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu.Exp 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## Life.Exp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
Create summary and print it out:
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
Create and print the rounded percentages of variance captured by each PC:
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
Create object pc_lab to be used as axis labels:
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
And lastly generate biplot:
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The biplot is a way of visualizing the connections between two representations of the same data. Because the variables have very different ranges and the data is not standardized, there’s not much to learn from these outputs.
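PCA maximizes variance, so without standardization the component directions are dominated by whichever variable has the largest variance on its raw scale. This is easy to verify from the data itself:

```r
# Variance of each raw variable; GNI dwarfs the rest, which is why it
# dominates PC1 in the unstandardized PCA above
round(sapply(human, var), 1)
```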
Now we perform the same analysis after standardizing the data:
human_std <- scale(human)
pca_human <- prcomp(human_std)
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
Create summary and print it out:
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
Create and print the rounded percentages of variance captured by each PC:
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
As we can see, the variance is now distributed much more evenly across the principal components.
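A quick way to see how many components are needed for a given share of the variance is the cumulative sum of these percentages:

```r
# Cumulative variance captured, and the first PC at which it reaches 80%
cumsum(pca_pr)
which(cumsum(pca_pr) >= 80)[1]
```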
Create object pc_lab to be used as axis labels:
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
And lastly generate biplot:
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Dependencies:
library(dplyr)
library(tidyr)
library(GGally)
library(ggplot2)
library(lme4)
Load BPRS data
BPRS <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
Factor treatment and subject:
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
Convert to long form and extract the week number:
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
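As an aside, `gather()` is superseded in recent tidyr versions; a sketch of the same wide-to-long conversion with the newer `pivot_longer()` (the object name `BPRSL2` is just for illustration):

```r
# Equivalent long-form conversion with tidyr's pivot_longer()
BPRSL2 <- BPRS %>%
  pivot_longer(cols = starts_with("week"),
               names_to = "weeks", values_to = "bprs") %>%
  mutate(week = as.integer(substr(weeks, 5, 5)))
```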
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Standardize bprs:
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
ungroup()
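A quick sanity check on the standardization: within each week the standardized variable should have mean approximately 0 and standard deviation 1.

```r
# Per-week mean and sd of the standardized bprs (expect ~0 and 1)
BPRSL %>%
  group_by(week) %>%
  summarise(mean = mean(stdbprs), sd = sd(stdbprs))
```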
Glimpse of the data:
glimpse(BPRSL)
## Observations: 360
## Variables: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ stdbprs <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.69836...
Plot with the standardised bprs:
ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(name = "standardized bprs")
Number of weeks, baseline (week 0) included
n <- BPRSL$week %>% unique() %>% length()
Summary of the data with mean and standard error of bprs by treatment and week:
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise( mean = mean(bprs), se = sd(bprs)/sqrt(n) ) %>%
ungroup()
Glimpse of the data
glimpse(BPRSS)
## Observations: 18
## Variables: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ week <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8
## $ mean <dbl> 47.00, 46.80, 43.55, 40.90, 36.60, 32.70, 29.70, 29....
## $ se <dbl> 4.534468, 5.173708, 4.003617, 3.744626, 3.259534, 2....
Mean profiles plot:
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
geom_line() +
scale_linetype_manual(values = c(1,2)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(bprs) +/- se(bprs)")
Create a summary data by treatment and subject with mean as the summary variable.
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
Glimpse the data
glimpse(BPRSL8S)
## Observations: 40
## Variables: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.1...
Draw a boxplot of the mean versus treatment
ggplot(BPRSL8S, aes(x = treatment, y = mean)) + geom_boxplot()
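If the boxplot reveals a clear outlier, a common next step is to filter it out and redraw. A sketch, assuming the outlier sits above a mean BPRS of 70 (the cutoff is chosen by eye from the plot):

```r
# Drop the apparent outlier before re-plotting the boxplot
BPRSL8S1 <- BPRSL8S %>% filter(mean < 70)
ggplot(BPRSL8S1, aes(x = treatment, y = mean)) + geom_boxplot()
```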
Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
mutate(baseline = BPRS$week0)
Perform a two-sample t-test
t.test(mean ~ treatment, data = BPRSL8S2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by treatment
## t = -0.11896, df = 38, p-value = 0.9059
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.094183 6.306683
## sample estimates:
## mean in group 1 mean in group 2
## 36.16875 36.56250
Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + treatment, data = BPRSL8S2)
Compute the analysis of variance table for the fitted model
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 1868.07 1868.07 30.1437 3.077e-06 ***
## treatment 1 3.45 3.45 0.0557 0.8148
## Residuals 37 2292.97 61.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Load the RATS data:
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
Convert data to long form:
RATSL <- RATS %>%
gather(key = WD, value = Weight, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
Regression model of RATSL data:
RATS_reg <- lm(Weight ~ Time + Group, data = RATSL)
Summary:
summary(RATS_reg)
##
## Call:
## lm(formula = Weight ~ Time + Group, data = RATSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.543 -20.884 -14.208 9.803 190.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.2063 11.3052 10.721 <2e-16 ***
## Time 0.5857 0.2002 2.926 0.0039 **
## Group 139.2169 4.6965 29.643 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.66 on 173 degrees of freedom
## Multiple R-squared: 0.8368, Adjusted R-squared: 0.8349
## F-statistic: 443.6 on 2 and 173 DF, p-value: < 2.2e-16
Random intercept model:
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)
Summary of the model:
summary(RATS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1344.8 1360.7 -667.4 1334.8 171
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5073 -0.5519 -0.0399 0.5580 3.1303
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 2556.98 50.567
## Residual 66.44 8.151
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 121.20629 29.57835 4.098
## Time 0.58568 0.03158 18.544
## Group 139.21694 15.26439 9.120
##
## Correlation of Fixed Effects:
## (Intr) Time
## Time -0.036
## Group -0.903 0.000
Random intercept model and random slope model:
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
Anova test on the models:
anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## RATS_ref 5 1344.8 1360.7 -667.41 1334.8
## RATS_ref1 7 1200.8 1223.0 -593.41 1186.8 147.99 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
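The likelihood ratio test strongly favours the random intercept and slope model. A natural further step (a sketch, not run above) is to also allow the growth rate to differ between groups by adding a Time × Group interaction, and to compare it against the random slope model in the same way:

```r
# Random intercept and slope model with a Time x Group interaction,
# compared against the model without the interaction
RATS_ref2 <- lmer(Weight ~ Time * Group + (Time | ID), data = RATSL, REML = FALSE)
anova(RATS_ref2, RATS_ref1)
```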