#Part 1 :

Import the package and dataset.

#install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/MSES_1.1.tar.gz",repos=NULL,type="source")
#install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/MSES_1.1.zip",repos=NULL)
library(SMSdata)
data(plasma) 

I choose plasma dataset for this works.

#Part 2 :

Understanding the dataset

dim(plasma)
## [1] 10  4
n <- dim(plasma)[1]
p <- dim(plasma)[2]

The dataset have 10 observation and 4 variables.

We can look in more detail at the variables

summary(plasma)
##      group        8am             11am            3pm       
##  Group 1:7   Min.   : 89.0   Min.   : 83.0   Min.   : 83.0  
##  Group 2:3   1st Qu.:106.0   1st Qu.:119.5   1st Qu.:101.5  
##              Median :116.0   Median :131.5   Median :108.0  
##              Mean   :118.5   Mean   :127.9   Mean   :112.6  
##              3rd Qu.:134.0   3rd Qu.:138.5   3rd Qu.:124.0  
##              Max.   :151.0   Max.   :173.0   Max.   :147.0

Description of the dataset :

The evolution of citrate concentration in the plasma is observed at 3 different times of day, 8 am, 11 am, and 3 pm, for two groups of patients. Each group follows a different diet.

So we have 1 Categorical Variable (group) and 3 numeric variables (8am, 11am and 3pm)

For the next part of the analyse we must change the names of the variable because having a number as the first character can be problematic

colnames(plasma)[2:4]<-c("v8am","v11am","v3pm")
attach(plasma)

#Part 3 : Univariate analysis

Before starting the multivariate analysis, it is always important to take the time to perform a small univariate analysis in order to get to know the data set.

Group :

plot(group)

The observations are separate in two groups. 7 in the 1st and 3 in the second group. Each group follows a different diet.

8am :

hist(v8am,breaks =5,main ="Histogram of citrate concentration in the plasma at 8 am",freq = F,xlab ="8 am")

quantile(v8am)
##   0%  25%  50%  75% 100% 
##   89  106  116  134  151
summary(v8am)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    89.0   106.0   116.0   118.5   134.0   151.0
var(v8am)
## [1] 432.9444
cv <- sd(v8am) / mean(v8am) * 100
cv
## [1] 17.55892

The minimum is 89, the median is 116 and the maximum is 151.

Mean : 118.5

variance : 432.94

Coefficient of Variation : 17.56

11am :

hist(v11am,breaks =5,main ="Histogram of citrate concentration in the plasma at 11 am",freq = F,xlab ="11 am")

quantile(v11am)
##    0%   25%   50%   75%  100% 
##  83.0 119.5 131.5 138.5 173.0
summary(v11am)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    83.0   119.5   131.5   127.9   138.5   173.0
var(v11am)
## [1] 662.3222
cv_b <- sd(v11am) / mean(v11am) * 100
cv_b
## [1] 20.12167

The minimum is 83, the median is 127.9 and the maximum is 173.

Mean : 127.9

variance : 662.32

Coefficient of Variation : 20.12

We can see a increase values to the means, and the variance has increased too. The citrate concentration are less homogeneous at 11am than 8am.

3pm :

hist(v3pm,breaks =5,main ="Histogram of citrate concentration in the plasma at 3 pm",freq = F,xlab ="3 pm")

quantile(v3pm)
##    0%   25%   50%   75%  100% 
##  83.0 101.5 108.0 124.0 147.0
summary(v3pm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    83.0   101.5   108.0   112.6   124.0   147.0
var(v3pm)
## [1] 325.1556
cv_c <- sd(v3pm) / mean(v3pm) * 100
cv_c
## [1] 16.01427

The minimum is 83, the median is 108 and the maximum is 147.

Mean : 112.6

variance : 325.16

Coefficient of Variation : 16.01

We can see a decrease values to the means, and the variance has decreased too. The citrate concentration are more homogeneous at 3pm than 11am. We return to the 8am citrate concentration level.

#Part 4 : Multivariate analysis

In first we can comparate the distribution of the 3 different time schedules.

boxplot(plasma[,2:4])

We can conclude the same thing as in the previous section, the citrate concentration in the plasma increase and then decreases. We can see 2 extrem values at 11 am.

Parallel Coordinate Plots : We make this graph to find (or not) a kind of dependency between the different covariates

library(MASS)
## Warning: package 'MASS' was built under R version 3.6.3
colorVector <- rep("black", dim(plasma)[1])
colorVector[group == "Group 1"] <- "red"
colorVector[group == "Group 2"] <- "green"


parcoord(plasma[,2:4],col=colorVector)

The greens observation represente patients of the group 2.

Group 2 has a lower citrate concentration in the plasma than the rest of the population.

At 3pm all group 2 patients have all the lowest concentration.

And in the majority of cases, the people witj a importante citrate concentration at 8am were always the same at other times.

cor(plasma[,2:4])
##            v8am     v11am      v3pm
## v8am  1.0000000 0.7833958 0.7054034
## v11am 0.7833958 1.0000000 0.7979220
## v3pm  0.7054034 0.7979220 1.0000000

All of corelation are positive.

Evolution between the times :

8am -> 11 am

plot(v8am,v11am,col=group)
abline(1,1)

Red : Group 2

black : Group 1

Why is the line x=y?

If the point is above: increase your rate between the two times

If the point is below: decrease the rate between the two times

We have 8 points above and 2 point below, for the majority of patients the citrate concentration in the plasma increase between the 2 times.

We cannot necessarily notice a group effect because the points of the different colours are above and below the line.

11am -> 3 pm

plot(v11am,v3pm,col=group)
abline(1,1)

We can see that here the concentration has dropped between the two time. (8/10 observations)

But still no group effect. The low values of the second group are still low but we do not see a more pronounced decrease than group 1. The values of group 2 are lower but for each time.