About the project

Introduction to Open Data Science is about discovering the patterns hidden behind the numbers in matrices and arrays. The course introduces you to open data science using tools such as R, RStudio, R Markdown and GitHub. It includes coding in R, analyzing and modelling data, and maintaining a course diary.

Regression and model validation

Week 2, Anton Jyrkiäinen

Part 1

The data used comes from a study that investigates the relationship between teaching, learning approaches and students' exam results.

Link to data here and link to metadata here.

First we import the data that we wrangled and saved locally in the first part of the exercise. The data is read into a variable lrn14 using read.table().

lrn14 <- read.table("C:/Users/Anton/Documents/IODS-project/learning2014.txt")

We also need to import a few dependencies for later use:

library(GGally)
## Loading required package: ggplot2
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

By using str() we can have a look at the structure of the data:

str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

The output gives us meta information about the data in general and about each column: the number of rows and columns, the variable names, their data types and the first values. There are seven variables: gender, age, global attitude toward statistics, exam points and three composite variables, “deep” (deep approach), “surf” (surface approach) and “stra” (strategic approach), each calculated as the mean of the answers to several related questions. In the same order, these stand for “intention to maximize understanding, with a true commitment to learning”, “memorizing without understanding, with a serious lack of personal engagement in the learning process” and “the ways students organize their studying”.

The questions were measured on a scale from 1 to 5 (1 = disagree, 5 = agree). The data covers answers from 183 participants, of whom 67% were female, with a mean age of 25.6 years. Observations with 0 exam points were excluded in the wrangling part of the exercise.
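
As a rough sketch of that wrangling step (the question column names D03 and D06 are hypothetical placeholders here, not the real survey items), the composite means and the exclusion of zero-point observations could look like:

```r
# Toy data standing in for the raw survey answers; D03 and D06 are
# hypothetical "deep question" columns, Points are exam points.
raw <- data.frame(D03 = c(4, 5, 3), D06 = c(2, 5, 3), Points = c(21, 0, 25))

# A composite variable such as "deep" is the row mean of its related questions
raw$deep <- rowMeans(raw[, c("D03", "D06")])

# Exclude observations with 0 exam points
lrn <- raw[raw$Points > 0, ]
```

In the actual exercise the filtering was done with dplyr's filter(); base R subsetting is used here only to keep the sketch self-contained.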

Using dim() we get an output which gives us the dimensions of the dataset.

dim(lrn14)
## [1] 166   7

Part 2

The ggpairs() function provides a matrix of plots consisting of density plots, bar plots, histograms, scatter plots, correlation coefficients and box plots. These plots represent all possible pairwise combinations of the variables.

Using the dataset with ggpairs() we generate a graphical overview:

p <- ggpairs(lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

And with summary() we get a summary of the variables:

summary(lrn14)
##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The density plots give us a rough idea of how each variable is distributed. All the variables seem close to normally distributed except for “Age” and “deep”. Age ranges from 17 to 55, so it makes sense for the density to be concentrated on the young side, since the participants of the survey are students. The “deep” variable shows an interesting skew towards agreeing with the questions: from the summary we see that the mean answer to the deep questions is 3.68.

Part 3

A regression model is fitted with lm(). In the formula we use “Points” as the dependent variable and “Attitude”, “stra” and “deep” as explanatory variables. A summary of the regression model is then generated with summary():

lm1 <- lm(formula = Points ~ Attitude + stra + deep, data = lrn14)
lm1
## 
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = lrn14)
## 
## Coefficients:
## (Intercept)     Attitude         stra         deep  
##     11.3915       0.3525       0.9621      -0.7492

The intercept is the value of y (“Points”) when all the explanatory variables (“Attitude”, “stra”, “deep”) are 0. The other values are the slopes of the regression model.

summary(lm1)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.39145    3.40775   3.343  0.00103 ** 
## Attitude     0.35254    0.05683   6.203 4.44e-09 ***
## stra         0.96208    0.53668   1.793  0.07489 .  
## deep        -0.74920    0.75066  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

The residuals are the differences between the observed values of the dependent variable and the predicted values; each data point has a residual. The residuals should be normally distributed, and for a model with an intercept their sum is always 0. Residuals can reveal patterns in the data that the model leaves unexplained.
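
Both properties are easy to check on any least-squares fit with an intercept; a small sketch using R's built-in mtcars data rather than the course data:

```r
# Fit an OLS model with an intercept on R's built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual = observed value minus fitted value, one per data point
res <- mtcars$mpg - fitted(fit)

# For a least-squares fit with an intercept, the residuals sum to
# (numerically) zero
sum(res)
```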

The coefficients table shows the estimate, standard error, t-value and p-value. The standard error is an estimate of the standard deviation of the coefficient estimate. The t-value is the estimate divided by its standard error. The p-value is the probability of observing a t-value at least as extreme as the one obtained, assuming the null hypothesis is true. In other words, the p-value indicates whether there is evidence against the null hypothesis: a low p-value indicates strong evidence against it, and a high p-value indicates weak or no evidence against it.
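
These relationships can be reproduced by hand; a sketch on built-in data (mtcars, not the course data):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # built-in data, illustration only
cf <- summary(fit)$coefficients

# t value = estimate / standard error
t_wt <- cf["wt", "Estimate"] / cf["wt", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_wt <- 2 * pt(-abs(t_wt), df = fit$df.residual)
```

Both values match the “t value” and “Pr(>|t|)” columns of the summary table.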

The output also gives “signif. codes”, which make it easier to interpret the results at a glance. In this case “Attitude” received a p-value under 0.01, which means there is strong evidence against the null hypothesis: it is unlikely that “Attitude” is unrelated to “Points”. “stra” had a p-value between 0.05 and 0.1, which is only weak evidence against the null hypothesis. The variable “deep” does not have a statistically significant relationship with the target variable.

Next we fit the model again without variable “deep”:

lm1 <- lm(formula = Points ~ Attitude + stra, data = lrn14)
lm1
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = lrn14)
## 
## Coefficients:
## (Intercept)     Attitude         stra  
##      8.9729       0.3466       0.9137
summary(lm1)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Without the variable “deep” we get similar results, with slightly higher p-values for “Attitude” and “stra” but still within the same significance codes as before. The R-squared and F-statistic also stay roughly the same.

Part 4

Residual standard error is the standard deviation of the residuals.

R-squared tells us how close the data is to the fitted regression line: the higher the R-squared, the better the model fits the data. If the R-squared were 100%, the fitted values would all lie on the regression line. The difference between multiple and adjusted R-squared is that the adjusted version takes into account the fact that R-squared automatically increases when more explanatory variables are added to the model. The adjusted R-squared is 19.51%, which indicates that the data points are quite scattered around the fitted regression line, so predictions from this model would be somewhat inaccurate.
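
Written out, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of explanatory variables. A quick check on built-in data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in data, illustration only
s <- summary(fit)

n <- nrow(mtcars)  # number of observations
p <- 2             # number of explanatory variables

# Adjusted R-squared penalizes each added explanatory variable
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
```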

The F-statistic is the variation between sample means divided by the variance within the samples. The p-value of the F-statistic indicates that at least some of the results are statistically significant. If this p-value indicated insignificance, we could not reject the null hypothesis.

Part 5

Using plot() we generate three diagnostic plots: “Residuals vs Fitted values”, “Normal Q-Q plot” and “Residuals vs Leverage”. We give the model as a parameter along with a vector of indices to choose the right plots.

Residuals vs fitted values:

plot(lm1, which = c(1))

We can see that the variability of the points is roughly equal across the fitted values, which is good. There are no signs here that would indicate a problem with the model.

Normal Q-Q:

plot(lm1, which = c(2))

If the observations are normally distributed, the plot should result in an approximately straight line. In the plot above the points form a fairly straight line, so there should be no problem here. There are some extreme values, but they are no more extreme than would be expected, and similar extremities appear at both the low and high ends, which fits a normal distribution.

Residuals vs leverage:

plot(lm1, which = c(5))

This plot shows us whether there are extreme values that are influential. Even though there are some extreme values, they are not influential, since they more or less follow the trend of the other values. Excluding an influential point would radically change the results, which does not seem to be the case here.


Logistic regression

Week 3, Anton Jyrkiäinen

Dependencies used:

library(dplyr)
library(ggplot2)
library(tidyr)

First we import the data corresponding to the one wrangled in the first part of the exercise (here read directly from a URL). The data is read into a variable alc using read.table().

alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header=TRUE)

Variables in the dataset:

names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset consists of data collected from school reports and questionnaires in two Portuguese schools, regarding student performance in two subjects: mathematics and Portuguese.

Link to the data here and link to metadata here.

Absences, study time and grades go together quite well when analyzing school performance. I expect that absences are positively correlated, and study time and grades negatively correlated, with high alcohol use. Gender will probably make no difference and will not be statistically significant. Therefore the variables to be used are absences, studytime, G3 and sex.

Barplot of each variable we’re interested in:

ggplot(data = alc, aes(x = absences)) + geom_bar()

ggplot(data = alc, aes(x = G3)) + geom_bar()

ggplot(data = alc, aes(x = sex)) + geom_bar()

ggplot(data = alc, aes(x = studytime)) + geom_bar()

In the barplots we can observe the distributions of the chosen variables. It is most common to have few or no absences. Most students get average or slightly above-average grades, though there is also a fairly large number of students receiving a grade of 0. There are slightly more female than male students. The majority of students seem to study less than 2 hours or 2-5 hours weekly.

Boxplots:

g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot()

g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot()

g2 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex))
g2 + geom_boxplot()

From the boxplots we can see that the final grades were better in the group without heavy alcohol use. There also seem to be more absences in the heavy-use group. As for study time, the average appears higher in the group without heavy alcohol use.

Grade means by group:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups:   sex [?]
##   sex   high_use count mean_grade
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      157       9.69
## 2 F     TRUE        41      10.4 
## 3 M     FALSE      113      11.6 
## 4 M     TRUE        71       9.97

Among the female students there is a grade mean increase of 0.71 points when the student is considered a high user of alcohol. Among the male students there is a large grade mean drop of 1.63 points if the student is categorized in the high-use group.

Absence mean by group:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 4 x 4
## # Groups:   sex [?]
##   sex   high_use count mean_absences
##   <fct> <lgl>    <int>         <dbl>
## 1 F     FALSE      157          4.97
## 2 F     TRUE        41          8.90
## 3 M     FALSE      113          3.19
## 4 M     TRUE        71          7.41

We can observe that high alcohol use is associated with more absences. Female high users have on average 3.93 more absences, and for male students the average number of absences more than doubles.

Study time mean by group:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## # A tibble: 4 x 4
## # Groups:   sex [?]
##   sex   high_use count mean_studytime
##   <fct> <lgl>    <int>          <dbl>
## 1 F     FALSE      157           2.34
## 2 F     TRUE        41           2   
## 3 M     FALSE      113           1.88
## 4 M     TRUE        71           1.62

Average study times also seem lower with high alcohol use, with lower mean study times in both gender groups.

Summary of logistic regression with our variables and high/low alcohol consumption set as the target variable:

m <- glm(high_use ~ sex + absences + G3 + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + absences + G3 + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6894  -0.8391  -0.5951   1.1034   2.1852  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.64915    0.45507  -1.426 0.153725    
## sexM         0.84250    0.25698   3.278 0.001044 ** 
## absences     0.07076    0.01821   3.887 0.000102 ***
## G3          -0.02575    0.02635  -0.977 0.328574    
## studytime   -0.41864    0.16119  -2.597 0.009398 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 416.74  on 377  degrees of freedom
## AIC: 426.74
## 
## Number of Fisher Scoring iterations: 4

From the summary of the model we can see that our original hypothesis does not hold: gender does in fact seem to be statistically significant, and the final grade is not. The summary shows that a student is more likely to be in the high-use group if male, and that high use is associated with the absences and studytime variables. The deviance residuals seem fine, since they are roughly centered on zero and symmetrical. The null deviance and residual deviance can be used to compare models and to calculate a pseudo R-squared and p-value. The AIC can also be used for comparison with other models.

Odds ratios with their confidence intervals:

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.5224889 0.2125860 1.2719337
## sexM        2.3221692 1.4096724 3.8686858
## absences    1.0733272 1.0379854 1.1144115
## G3          0.9745836 0.9257160 1.0267950
## studytime   0.6579430 0.4751282 0.8953331

The OR column is the exponent of the coefficient, which can be interpreted as the odds ratio associated with a unit change in the corresponding explanatory variable. The row sexM can therefore be interpreted as male students being ~2.3 times as likely to be heavy users of alcohol as female students.
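
The same exponentiation can be sketched on built-in data (mtcars, predicting transmission type am from weight wt; an illustration only, not the alcohol data):

```r
# Logistic regression on built-in mtcars data: transmission type (am)
# predicted by weight (wt)
m2 <- glm(am ~ wt, data = mtcars, family = "binomial")

# Exponentiating a coefficient gives the odds ratio: the factor by which
# the odds of am = 1 change for a one-unit increase in wt
or_wt <- exp(coef(m2)["wt"])
```

Here the odds ratio is below 1, i.e. heavier cars are less likely to have a manual transmission.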

We will be dropping G3 from the model:

m <- glm(high_use ~ sex + absences + studytime, data = alc, family = "binomial")

Let’s explore the predictive power of the model:

probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)

First we used the predict() function to get the probability of high use. We then added the predicted probabilities to alc and created a column prediction, which holds a boolean predictor of high use: TRUE if the probability is greater than 0.5, otherwise FALSE.

Let’s look at a confusion matrix:

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   262    8
##    TRUE     85   27

We can see that in 289 cases the prediction was correct and in 93 cases it was not.
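
From the counts in the table above, the overall accuracy and error rate follow directly:

```r
# Confusion matrix entered from the table above (rows = actual, cols = predicted)
tab <- matrix(c(262, 85, 8, 27), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

accuracy <- sum(diag(tab)) / sum(tab)  # 289 correct out of 382
error_rate <- 1 - accuracy
```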

Tabulate the target variable versus the prediction:

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68586387 0.02094241 0.70680628
##    TRUE  0.22251309 0.07068063 0.29319372
##    Sum   0.90837696 0.09162304 1.00000000

Clustering and classification

Week 4, Anton Jyrkiäinen

R has some datasets already loaded in. Let’s use the Boston dataset from the MASS package:

library(MASS)
data("Boston")

Details about the dataset can be found here.

Structure and dimensions of the data:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset has 506 rows and 14 columns. It is about housing values in suburbs of Boston, and contains information such as the average number of rooms per dwelling and the accessibility of radial highways.

Using summary() we can see the first and third quartiles, minimum, maximum, median and mean of each variable:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

With the corrplot package we can observe and visualize the correlations between the variables in the data:

library(dplyr)
library(corrplot)

First we calculate the correlation matrix, round it, and then visualize it with corrplot():

cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

The strong correlations are easily distinguished from the matrix. For example, “indus” (proportion of non-retail business acres per town) is negatively correlated with “dis” (weighted mean of distances to five Boston employment centres), and “rad” (index of accessibility to radial highways) is positively correlated with “tax” (full-value property-tax rate per $10,000). One notable anomaly is that “chas” (Charles River dummy variable) has very little correlation with any other variable in the dataset.

For linear discriminant analysis we need to scale the data. This means subtracting the column mean from each column and dividing the difference by the standard deviation. This is easily managed with the scale() function.

boston_scaled <- scale(Boston)
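
What scale() does can be verified by hand on a single small vector:

```r
x <- c(2, 4, 6, 8)

# scale() subtracts the column mean and divides by the column standard deviation
scaled <- as.vector(scale(x))
manual <- (x - mean(x)) / sd(x)
```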

Let’s look at the summaries of the scaled variables:

summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

At first glance the most noticeable difference is that all the variables now have zero mean, and all range from negative to positive values. The standardization makes comparing variables with each other much easier, since the ranges in the raw data could vary widely; the values are now on a similar scale, so a change within one variable is proportionally comparable to a change within another.

We also want the data to be a data frame:

boston_scaled <- as.data.frame(boston_scaled)

Now we create a categorical variable of the crime rate in the Boston dataset using the quantiles as break points:

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

And divide the dataset into train and test sets so that 80% of the data belongs to the train set:

n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Save the correct classes from the test data and drop the crime variable from the test set:

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

This is done because, when using a statistical method for prediction, it is important to have held-out data for testing how well the predictions fit.

Fit the linear discriminant analysis on the train set and print it. We use the categorical crime rate as the target variable and all the other variables as predictors:

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2425743 0.2450495 0.2599010 0.2524752 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       1.0276790 -0.8945876 -0.11163110 -0.8833220  0.41451190
## med_low  -0.1063266 -0.2885723 -0.03371693 -0.5796700 -0.09846022
## med_high -0.3876019  0.1885343  0.21512144  0.4200668  0.02251974
## high     -0.4872402  1.0171096 -0.11793298  1.0538584 -0.40191062
##                 age        dis        rad        tax     ptratio
## low      -0.8991701  0.8864357 -0.6947544 -0.7181995 -0.49085598
## med_low  -0.3854024  0.3712842 -0.5491659 -0.4696072 -0.07137524
## med_high  0.4033777 -0.3811264 -0.4349821 -0.3223199 -0.28695774
## high      0.8173493 -0.8632636  1.6382099  1.5141140  0.78087177
##                black       lstat         medv
## low       0.37860892 -0.73848593  0.507241499
## med_low   0.31658776 -0.16793370  0.008184585
## med_high  0.08084392  0.04686293  0.126908825
## high     -0.78523963  0.91990436 -0.739838854
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.092803696  0.78937057 -1.05389479
## indus    0.058380656 -0.18169750  0.28732932
## chas    -0.033303999 -0.04495618  0.02604864
## nox      0.356626606 -0.75427270 -1.33323649
## rm       0.016175361 -0.05174669 -0.05437785
## age      0.233487691 -0.33951777 -0.12725798
## dis     -0.078487201 -0.32243666  0.26158232
## rad      3.387656397  0.93142368 -0.08138651
## tax      0.001682572 -0.04055726  0.59962809
## ptratio  0.131273170 -0.03802694 -0.30567656
## black   -0.092894573  0.02917679  0.07167844
## lstat    0.181566812 -0.25803873  0.34678385
## medv     0.089905710 -0.51358986 -0.15872972
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9502 0.0381 0.0117

Then we define a function for the LDA biplot arrows, convert the target classes to numeric and plot the LDA results:

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

In the exercise we were asked to save and drop the crime variable, but that was already done earlier. So either I don’t understand something or there’s a mistake in the assignment.

We continue by predicting the classes with the test data and cross-tabulating the results:

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      14        1    0
##   med_low    6      15        6    0
##   med_high   1       1       17    2
##   high       0       0        0   25

The predictions are pretty close to being correct. Most of the misclassifications occur between the neighbouring low and med_low classes, while high is predicted perfectly.
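To condense the table into one number, a small sketch (reusing the `correct_classes` and `lda.pred` objects from above) computes the share of correct classifications:

```r
# Overall classification accuracy on the test set:
# proportion of observations where the predicted class matches the true one
accuracy <- mean(correct_classes == lda.pred$class)
accuracy
```

The exact value varies with the random train/test split, but it summarizes the whole cross-tabulation at a glance.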

Next we reload the Boston dataset, standardize it and assign it to a variable:

data("Boston")
bostonScaled <- scale(Boston)

Calculate the distances between observations using the Euclidean distance matrix:

dist_eu <- dist(bostonScaled)
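As a quick sanity check (a small sketch, not required by the assignment), the distribution of these pairwise distances can be summarized:

```r
# Five-number summary plus mean of the pairwise Euclidean distances
summary(dist_eu)
```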

We run the k-means algorithm on the data set:

km <- kmeans(bostonScaled, centers = 3)

Plot the dataset with clusters:

pairs(bostonScaled, col = km$cluster)

Let’s investigate what is the optimal number of clusters with the method provided in the DataCamp exercises:

library(ggplot2);
# Boston dataset is available
set.seed(123)

# determine the number of clusters
k_max <- 5

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bostonScaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The plot visualises how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. The sweet spot is where the WCSS drops radically. As we can see from the plot, the drop is most drastic when going from one cluster to two, so we should use two clusters.
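A quick numeric check of the same elbow (a sketch reusing the `twcss` vector computed above) looks at how much each additional cluster reduces the WCSS:

```r
# Reduction in total WCSS when moving from k to k+1 clusters
wcss_drop <- -diff(twcss)
wcss_drop

# The elbow is where the biggest drop ends
which.max(wcss_drop) + 1
```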

Let’s run the algorithm again with two clusters:

km <- kmeans(bostonScaled, centers = 2)
pairs(bostonScaled, col = km$cluster)


Dimensionality reduction techniques

Week 5, Anton Jyrkiäinen

The data set is from the United Nations Development Programme’s Human Development Report. Let’s start by loading the dataset.

human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep = ",", header=TRUE)

More information about the data can be found here and the metadata of the data we have used here.

Summary of the dataset:

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The summary shows the minimum, maximum, median, mean and the 1st and 3rd quartiles of each variable. Some of the variables have huge ranges between their minimum and maximum values, such as the adolescent birth rate (Ado.Birth).

Plot matrix with ggpairs():

library(GGally)
library(ggplot2)
ggpairs(human)

From the distributions we can see that maternal mortality and adolescent birth rate are both heavily skewed. Only “Parli.F” and “Edu.Exp” seem to be close to normally distributed.

Correlation matrix:

library(dplyr);
library(corrplot)
cor_matrix<-cor(human) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Here we can easily see that life expectancy at birth is negatively correlated with maternal mortality ratio and adolescent birth rate. The same goes for expected years of schooling, which is also negatively correlated with maternal mortality ratio and adolescent birth rate. Strong positive correlations appear too: expected years of schooling is positively correlated with life expectancy, and maternal mortality ratio is positively correlated with adolescent birth rate.
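These pairs can also be picked out programmatically. The following sketch (reusing the `cor_matrix` object from above, with a threshold of 0.7 chosen purely for illustration) lists the variable pairs with a strong correlation:

```r
# Variable pairs whose absolute correlation exceeds 0.7
# (upper.tri avoids duplicates and the diagonal)
strong <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, "row"]],
           var2 = colnames(cor_matrix)[strong[, "col"]],
           cor  = cor_matrix[strong])
```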

Next we perform principal component analysis:

pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                     PC5           PC6           PC7           PC8
## Edu2.FM   -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp   0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03

Create a summary and print it out:

s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

Create and print the rounded percentages of variance captured by each PC:

pca_pr <- round(100*s$importance[2,], digits = 1) 
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0

Create object pc_lab to be used as axis labels:

pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

And lastly generate biplot:

biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The biplot is a way of visualizing the connections between two representations of the same data. Because the variables have very different ranges and the data is not standardized, there is not much to learn from these outputs.
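In the unscaled PCA above, GNI essentially is the first principal component (its PC1 loading is ≈ −1). A quick sketch of the raw standard deviations shows why: GNI’s spread dwarfs every other variable, so PCA without scaling simply picks it up.

```r
# Standard deviations of the raw (unscaled) variables;
# PCA without scaling is driven by whichever variable spreads widest
sort(apply(human, 2, sd), decreasing = TRUE)
```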

Now we will perform the same analysis after the data is standardized:

human_std <- scale(human)
pca_human <- prcomp(human_std)
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476

Create a summary and print it out:

s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000

Create and print the rounded percentages of variance captured by each PC:

pca_pr <- round(100*s$importance[2,], digits = 1) 
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

As we can see, the variance is now distributed much more evenly across the principal components.
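A quick way to visualize this distribution (a sketch reusing the `pca_pr` percentages computed above):

```r
# Scree-style bar plot of the variance captured by each principal component
barplot(pca_pr, xlab = "Principal component", ylab = "Variance captured (%)")
```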

Create object pc_lab to be used as axis labels:

pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

And lastly generate biplot:

biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])


Analysis of longitudinal data

Dependencies:

library(dplyr)
library(tidyr)
library(GGally)
library(ggplot2)
library(lme4)

Load BPRS data

BPRS <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep  =" ", header = T)

Factor treatment and subject:

BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

Convert to long form and extract the week number:

BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Standardize bprs:

BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
  ungroup()

Glimpse of the data:

glimpse(BPRSL)
## Observations: 360
## Variables: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ stdbprs   <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.69836...

Plot with the standardised bprs:

ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name = "standardized bprs")

Number of weeks, baseline (week 0) included:

n <- BPRSL$week %>% unique() %>% length()

Summary of the data with mean and standard error of bprs by treatment and week:

BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise( mean = mean(bprs), se = sd(bprs)/sqrt(n) ) %>%
  ungroup()

Glimpse of the data

glimpse(BPRSS)
## Observations: 18
## Variables: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ week      <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8
## $ mean      <dbl> 47.00, 46.80, 43.55, 40.90, 36.60, 32.70, 29.70, 29....
## $ se        <dbl> 4.534468, 5.173708, 4.003617, 3.744626, 3.259534, 2....

Mean profiles plot:

ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(bprs) +/- se(bprs)")

Create summary data by treatment and subject, with the mean as the summary variable:

BPRSL8S <- BPRSL %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise( mean=mean(bprs) ) %>%
  ungroup()

Glimpse the data

glimpse(BPRSL8S)
## Observations: 40
## Variables: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ mean      <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.1...

Draw a boxplot of the mean versus treatment

ggplot(BPRSL8S, aes(x = treatment, y = mean)) + geom_boxplot()

Add the baseline from the original data as a new variable to the summary data

BPRSL8S2 <- BPRSL8S %>%
  mutate(baseline = BPRS$week0)

Perform a two-sample t-test

t.test(mean ~ treatment, data = BPRSL8S2, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by treatment
## t = -0.11896, df = 38, p-value = 0.9059
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.094183  6.306683
## sample estimates:
## mean in group 1 mean in group 2 
##        36.16875        36.56250
The p-value (≈ 0.91) gives no evidence of a difference in mean bprs between the two treatments.

Fit the linear model with the mean as the response

fit <- lm(mean ~ baseline + treatment, data = BPRSL8S2)

Compute the analysis of variance table for the fitted model

anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## baseline   1 1868.07 1868.07 30.1437 3.077e-06 ***
## treatment  1    3.45    3.45  0.0557    0.8148    
## Residuals 37 2292.97   61.97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Baseline bprs is a highly significant predictor, but after adjusting for it the treatment effect remains non-significant (p ≈ 0.81).

Load the RATS data:

RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')

Convert data to long form:

RATSL <- RATS %>%
  gather(key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD,3,4))) 

Regression model of RATSL data:

RATS_reg <- lm(Weight ~ Time + Group, data = RATSL)

Summary:

summary(RATS_reg)
## 
## Call:
## lm(formula = Weight ~ Time + Group, data = RATSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.543 -20.884 -14.208   9.803 190.876 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 121.2063    11.3052  10.721   <2e-16 ***
## Time          0.5857     0.2002   2.926   0.0039 ** 
## Group       139.2169     4.6965  29.643   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.66 on 173 degrees of freedom
## Multiple R-squared:  0.8368, Adjusted R-squared:  0.8349 
## F-statistic: 443.6 on 2 and 173 DF,  p-value: < 2.2e-16

Random intercept model:

RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)

Summary of the model:

summary(RATS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
##    Data: RATSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   1344.8   1360.7   -667.4   1334.8      171 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5073 -0.5519 -0.0399  0.5580  3.1303 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 2556.98  50.567  
##  Residual               66.44   8.151  
## Number of obs: 176, groups:  ID, 16
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 121.20629   29.57835   4.098
## Time          0.58568    0.03158  18.544
## Group       139.21694   15.26439   9.120
## 
## Correlation of Fixed Effects:
##       (Intr) Time  
## Time  -0.036       
## Group -0.903  0.000

Random intercept and random slope model:

RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)

ANOVA test on the models:

anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## RATS_ref   5 1344.8 1360.7 -667.41   1334.8                             
## RATS_ref1  7 1200.8 1223.0 -593.41   1186.8 147.99      2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very small p-value of the chi-squared test indicates that the random intercept and slope model fits the data significantly better than the random intercept model alone.