R - Supervised Learning
knn
k-nearest neighbors: classify each new observation by the majority label of its k nearest neighbors in the training data.
library(class)
knn(training data, test data, labels), extra args: k = 2 sets the number of neighbors; prob = TRUE attaches the proportion of votes for the winning class to each prediction (see the sketch below).
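A minimal sketch of the call above, assuming a data frame df of 100 rows with numeric predictors x1, x2 and a factor label y (the names and the 80/20 split are illustrative):

library(class)
train <- df[1:80, c("x1", "x2")]     # training predictors
test  <- df[81:100, c("x1", "x2")]   # test predictors
train_labels <- df$y[1:80]           # labels for the training rows
pred <- knn(train = train, test = test, cl = train_labels, k = 2, prob = TRUE)
attr(pred, "prob")                   # vote proportion of the winning class for each prediction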
Primitive Bayes (by counting)
nrow(df) counts the rows of a data frame
subset(df, condition), e.g. col > 0, filters rows; nrow() of this subset counts the matches
then divide the counts to get the probabilities for Bayes' rule (see the sketch below)
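A counting sketch with hypothetical names: a data frame email with 0/1 columns spam and contains_link, used to get P(spam | link) via Bayes' rule:

p_spam <- nrow(subset(email, spam == 1)) / nrow(email)                  # P(spam)
p_link <- nrow(subset(email, contains_link == 1)) / nrow(email)         # P(link)
p_link_given_spam <- nrow(subset(email, spam == 1 & contains_link == 1)) /
                     nrow(subset(email, spam == 1))                     # P(link | spam)
p_spam_given_link <- p_link_given_spam * p_spam / p_link                # Bayes' rule: P(spam | link)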
Naive Bayes
library(naivebayes)
model <- naive_bayes(y ~ x1 + x2 + …, data = df): y as explained by x1, x2, etc.
predict(model, test_df); the arg type = "prob" shows the posterior probabilities (rather than the a priori overall class probabilities).
Laplace correction: add 1 to every count so that a single zero-count category does not drive the whole posterior probability to 0.
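A sketch assuming the naivebayes package (the usual home of naive_bayes()), a training data frame df with a factor outcome y, and a test_df with the same predictor columns:

library(naivebayes)
model <- naive_bayes(y ~ x1 + x2, data = df, laplace = 1)   # laplace = 1 applies the Laplace correction
predict(model, test_df)                                     # predicted class for each test row
predict(model, test_df, type = "prob")                      # posterior probability of each class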
Logistic Regression
model <- glm(y ~ x1 + x2 …, data = df, family = "binomial")
predict(model, test_df, type = "response")
ROC curve measures how well the model separates the classes; AUC (area under the curve) ranges from 0 to 1, where 1 is best and 0.5 is no better than random guessing.
library(pROC)
ROC <- roc(actual outcomes, predicted probabilities)
plot(ROC), auc(ROC)
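A sketch assuming a 0/1 outcome y present in both df and test_df; the predicted probabilities feed directly into pROC:

model <- glm(y ~ x1 + x2, data = df, family = "binomial")
prob <- predict(model, test_df, type = "response")   # predicted probability of y = 1

library(pROC)
ROC <- roc(test_df$y, prob)   # actual values first, predicted probabilities second
plot(ROC)                     # draw the ROC curve
auc(ROC)                      # area under the curve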
Decision Tree
library(rpart)
model <- rpart(y ~ x1 + x2 + …, data = df, method = "class", control = rpart.control(cp = 0))
predict(model, test_df, type = "class")
Split the available data into two sets, training and testing; if a model built on the training set performs poorly on the testing set, the model is overfitted.
Pruning strategy: stop the tree from growing too large, e.g. by depth (max 5 levels) and branch size (min 10 observations).
Pre-pruning, applied while building the model, via rpart.control(maxdepth = …, minsplit = …)
Post-pruning, applied after the model is built: plotcp(model) -> determine the pruning point, then new_model <- prune(model, cp = …) (see the sketch below)
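A sketch of both pruning styles, assuming a data frame df and test_df with a factor outcome y; the cp value passed to prune() is illustrative and would normally be read off the plotcp() output:

library(rpart)
# Pre-pruning: restrict growth while the tree is built
model <- rpart(y ~ x1 + x2, data = df, method = "class",
               control = rpart.control(cp = 0, maxdepth = 5, minsplit = 10))
predict(model, test_df, type = "class")

# Post-pruning: grow the tree first, then cut it back at a chosen complexity parameter
plotcp(model)                          # pick the cp with the lowest cross-validated error
new_model <- prune(model, cp = 0.01)   # 0.01 is an illustrative value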
Ensemble method: combine many random decision trees (a single tree only works well with a few variables) into one stronger model.
library(randomForest)
model <- randomForest(y ~ x1 + x2, data = df, ntree = 500, mtry = sqrt(p)), where ntree is the number of trees and mtry is the number of predictors sampled at each split (p = total number of predictors).
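A sketch assuming the randomForest package and a factor outcome y; for classification, mtry already defaults to the square root of the number of predictors, so it is left at its default here:

library(randomForest)
set.seed(42)                                                # forests are random; fix the seed for repeatability
model <- randomForest(y ~ x1 + x2, data = df, ntree = 500)  # mtry defaults to sqrt(number of predictors)
predict(model, test_df)                                     # majority vote across the 500 trees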