R - Supervised Learning
knn
k-nearest neighbors: classify each new observation by the majority label of its k nearest neighbors in the training data.
library(class)
knn(training data, test data, labels), extra args: k = 2 sets the number of neighbors; prob = TRUE attaches the proportion of votes for the winning class to each prediction (see the sketch below).
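A minimal sketch of the call above, assuming a data frame df of 100 rows with numeric predictors x1, x2 and a factor label y (the names and the 80/20 split are illustrative):

library(class)
train <- df[1:80, c("x1", "x2")]     # training predictors
test  <- df[81:100, c("x1", "x2")]   # test predictors
train_labels <- df$y[1:80]           # labels for the training rows
pred <- knn(train = train, test = test, cl = train_labels, k = 2, prob = TRUE)
attr(pred, "prob")                   # vote proportion of the winning class for each prediction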
Primitive Bayes (by counting)
nrow(df) counts the rows of a data frame
subset(df, condition), e.g. col > 0, filters rows; nrow() of this subset counts the matches
then divide the counts to get the probabilities for Bayes' rule (see the sketch below)
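A counting sketch with hypothetical names: a data frame email with 0/1 columns spam and contains_link, used to get P(spam | link) via Bayes' rule:

p_spam <- nrow(subset(email, spam == 1)) / nrow(email)                  # P(spam)
p_link <- nrow(subset(email, contains_link == 1)) / nrow(email)         # P(link)
p_link_given_spam <- nrow(subset(email, spam == 1 & contains_link == 1)) /
                     nrow(subset(email, spam == 1))                     # P(link | spam)
p_spam_given_link <- p_link_given_spam * p_spam / p_link                # Bayes' rule: P(spam | link)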
Naive Bayes
library(naivebayes)
model <- naive_bayes(y ~ x1 + x2 + …, data = df): y as explained by x1, x2, etc.
predict(model, test_df); the arg type = "prob" shows the posterior probabilities (rather than the a priori overall class probabilities).
Laplace correction: add 1 to every count so that a single zero-count category does not drive the whole posterior probability to 0.
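A sketch assuming the naivebayes package (the usual home of naive_bayes()), a training data frame df with a factor outcome y, and a test_df with the same predictor columns:

library(naivebayes)
model <- naive_bayes(y ~ x1 + x2, data = df, laplace = 1)   # laplace = 1 applies the Laplace correction
predict(model, test_df)                                     # predicted class for each test row
predict(model, test_df, type = "prob")                      # posterior probability of each class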
Logistic Regression
model <- glm(y ~ x1 + x2 …, data = df, family = "binomial")
predict(model, test_df, type = "response")
ROC curve measures how well the model separates the classes; AUC (area under the curve) ranges from 0 to 1, where 1 is best and 0.5 is no better than random guessing.
library(pROC)
ROC <- roc(actual outcomes, predicted probabilities)
plot(ROC), auc(ROC)
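A sketch assuming a 0/1 outcome y present in both df and test_df; the predicted probabilities feed directly into pROC:

model <- glm(y ~ x1 + x2, data = df, family = "binomial")
prob <- predict(model, test_df, type = "response")   # predicted probability of y = 1

library(pROC)
ROC <- roc(test_df$y, prob)   # actual values first, predicted probabilities second
plot(ROC)                     # draw the ROC curve
auc(ROC)                      # area under the curve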
Decision Tree
library(rpart)
model <- rpart(y ~ x1 + x2 + …, data = df, method = "class", control = rpart.control(cp = 0))
predict(model, test_df, type = "class")
Split the available data into two sets, training and testing; if a model built on the training set performs poorly on the testing set, the model is overfitted.
Pruning strategy: stop the tree from growing too large, e.g. by depth (max 5 levels) and branch size (min 10 observations).
Pre-pruning, applied while building the model, via rpart.control(maxdepth = …, minsplit = …)
Post-pruning, applied after the model is built: plotcp(model) -> determine the pruning point, then new_model <- prune(model, cp = …) (see the sketch below)
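A sketch of both pruning styles, assuming a data frame df and test_df with a factor outcome y; the cp value passed to prune() is illustrative and would normally be read off the plotcp() output:

library(rpart)
# Pre-pruning: restrict growth while the tree is built
model <- rpart(y ~ x1 + x2, data = df, method = "class",
               control = rpart.control(cp = 0, maxdepth = 5, minsplit = 10))
predict(model, test_df, type = "class")

# Post-pruning: grow the tree first, then cut it back at a chosen complexity parameter
plotcp(model)                          # pick the cp with the lowest cross-validated error
new_model <- prune(model, cp = 0.01)   # 0.01 is an illustrative value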
Ensemble method: combine many random decision trees (a single tree only works well with a few variables) into one stronger model.
library(randomForest)
model <- randomForest(y ~ x1 + x2, data = df, ntree = 500, mtry = sqrt(p)), where ntree is the number of trees and mtry is the number of predictors sampled at each split (p = total number of predictors).
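A sketch assuming the randomForest package and a factor outcome y; for classification, mtry already defaults to the square root of the number of predictors, so it is left at its default here:

library(randomForest)
set.seed(42)                                                # forests are random; fix the seed for repeatability
model <- randomForest(y ~ x1 + x2, data = df, ntree = 500)  # mtry defaults to sqrt(number of predictors)
predict(model, test_df)                                     # majority vote across the 500 trees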