One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbbell. Based on those data, we will build a predictive model for the variable “classe” (A, B, C, D or E).
More information can be found here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Validation data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
# Data files are downloaded once, then read from the local CSV copies:
# download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
#               destfile = "training.csv")
input_data <- read.csv(file = "training.csv")
# download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
#               destfile = "testing.csv")
validation <- read.csv(file = "testing.csv")
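Note: under R ≥ 4.0, read.csv() no longer converts strings to factors by default, so the outcome “classe” should then be converted explicitly before training with caret:
input_data$classe <- as.factor(input_data$classe)  # caret classification requires a factor outcome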
Selection of the variables that are present in both the training and the validation data sets:
library(caret)
validation <- validation[,-nearZeroVar(validation)]  # drop near-zero-variance columns
myvars <- names(validation)
myvars <- gsub(pattern = "problem_id", replacement = "classe", x = myvars) # keep the target variable "classe"
input_data <- input_data[myvars]
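As a quick sanity check (an illustrative addition), the two data sets should now share the same columns, apart from the target:
setdiff(names(input_data), names(validation))  # expected: "classe"
setdiff(names(validation), names(input_data))  # expected: "problem_id"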
Exclusion of the “metadata” variables that should obviously not be integrated in the predictive model:
myvars <- names(input_data) %in% c("X","user_name","raw_timestamp_part_1","cvtd_timestamp","num_window")
input_data <- input_data[!myvars]
On top of the device measurement variables, the “classe” variable and “raw_timestamp_part_2” are kept. The time stamp might potentially be a predictor.
names(input_data)
## [1] "raw_timestamp_part_2" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [7] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [10] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [13] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [16] "pitch_arm" "yaw_arm" "total_accel_arm"
## [19] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [22] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [25] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [28] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [31] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [34] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [37] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [40] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [43] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [46] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [49] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [52] "magnet_forearm_y" "magnet_forearm_z" "classe"
After some data exploration, some strong outliers (identified as measurement device failures) were found and removed from the total set. Below is an example of an outlier for the variable gyros_forearm_z:
qplot(raw_timestamp_part_2, gyros_forearm_z, data = input_data, col = classe)
For the purpose of automatic filtering, the function row_with_outlier_vect has been created to remove those very strong outliers. The rule defining a point as an “outlier” is based on quantiles: a point is an outlier if it lies below P05 − 2.7 × (P95 − P05) or above P95 + 2.7 × (P95 − P05). The quantiles and the threshold (= 2.7) have been tuned to give the best results for the different variable distributions (most are non-Gaussian).
row_with_outlier_vect <- function(x, threshold = 2.7){
  # x: numeric vector
  # returns a vector containing the row indices of the outliers
  qnt <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H <- threshold * (qnt[2] - qnt[1])
  # flag points far below P05 or far above P95
  c(which(x < (qnt[1] - H)), which(x > (qnt[2] + H)))
}
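A quick illustration on a toy vector (the values are made up for this example):
set.seed(1)                    # reproducible toy data
x_demo <- c(rnorm(100), 500)   # 100 well-behaved points plus one extreme value
row_with_outlier_vect(x_demo)  # returns 101: only the extreme point is flagged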
The filter removes 40 of the 19,622 rows:
nrow_before_filtering <- dim(input_data)[1]
row_with_outlier <- apply(X = input_data[1:53],  # all 53 predictor columns ("classe" excluded)
                          MARGIN = 2,            # apply over columns
                          FUN = row_with_outlier_vect,
                          threshold = 2.7)
row_with_outlier <- unlist(row_with_outlier)     # flatten the per-column results into a single index vector
input_data <- input_data[-row_with_outlier,]
nrow_after_filtering <- dim(input_data)[1]
print(nrow_before_filtering - nrow_after_filtering )
## [1] 40
Random subsampling for cross-validation:
inTrain <- createDataPartition(y = input_data$classe,
p = 0.75,
list = FALSE)
training <- input_data[inTrain,]
testing <- input_data[-inTrain,]
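For reproducibility, the partitioning above (and the model training below) would normally be preceded by a fixed seed, e.g.:
set.seed(12345)  # arbitrary seed; makes the partition and the resampling reproducible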
The data form a collection of time series. K-fold cross-validation could also be a good choice, but we could then miss some specific classes in the training set (this has not been tested).
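As a side note, caret's createFolds() stratifies the folds by the outcome, which mitigates the risk of losing a class; a minimal sketch, not used in the rest of this analysis:
folds <- createFolds(y = input_data$classe, k = 5, list = TRUE)
sapply(folds, function(idx) table(input_data$classe[idx]))  # each fold contains all five classes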
Below, the most common prediction models are tested. A special train control is set up for the models that use resampling: k-fold cross-validation. With folds of this size, no class should be left out.
# Random Forest
train_control <- trainControl(method = "cv", number = 5,  # 5-fold cross-validation
                              savePredictions = TRUE, allowParallel = TRUE)
rf <- train(classe ~ ., data = training, method = "rf", trControl = train_control)
# Linear Discriminant Analysis
lda <- train(classe ~ ., data = training, method = "lda", trControl = train_control)
# Boosting (trees)
gbm <- train(classe ~ ., method = "gbm", data = training, verbose = FALSE)  # caret's default bootstrap resampling
# Naive Bayes
nb <- train(classe ~ ., data = training, method = "nb")                     # caret's default bootstrap resampling
Let’s use each model on the testing set and check its accuracy.
pred_rf <- predict(rf, testing)
accu_rf <- confusionMatrix(testing$classe, pred_rf)[[3]][1]    # [[3]][1] extracts the overall accuracy
pred_lda <- predict(lda, testing)
accu_lda <- confusionMatrix(testing$classe, pred_lda)[[3]][1]
pred_gbm <- predict(gbm, testing)
accu_gbm <- confusionMatrix(testing$classe, pred_gbm)[[3]][1]
pred_nb <- predict(nb, testing)
accu_nb <- confusionMatrix(testing$classe, pred_nb)[[3]][1]
model_accuracy <- c(accu_rf, accu_lda, accu_gbm, accu_nb)
models <- c("Random Forest", "Linear Discriminant Analysis","boosting-trees","Naive Bayes")
library(ggplot2)
qplot(models, model_accuracy,aes(models),fill=models)+
geom_bar(stat = "identity")+theme(legend.position="none")+
ggtitle("models accuracy")
The random forest model is the most accurate here.
Confusion matrix for the random forest model:
confusionMatrix(testing$classe,pred_rf)[[2]]
## Reference
## Prediction A B C D E
## A 1394 0 0 0 0
## B 2 946 1 0 0
## C 0 0 855 0 0
## D 0 0 1 803 0
## E 0 0 0 0 893
out_of_sample_error <- (1 - confusionMatrix(testing$classe, pred_rf)[[3]][1]) * 100  # in percent
names(out_of_sample_error) <- "Out_of_sample_error"
print(out_of_sample_error)
## Out_of_sample_error
## 0.08171604
The out-of-sample error is rather small: about 0.08%.
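The same figure can be cross-checked directly from the predictions:
mean(pred_rf != testing$classe) * 100  # share of misclassified rows, in percent; matches the 0.08171604 above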
Application of the rf model to the “validation” test set:
predict(rf,validation)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E