R - Unsupervised Learning
K-means
kmeans(x, centers = ..., nstart = ...) — x: a two-dimensional array; centers: how many groups; nstart: how many times to run k-means (the best run is kept)
$cluster on the kmeans() result (e.g. km$cluster) contains the number of the cluster each observation belongs to
plot(x, col = km$cluster, main = "k-means clusters", xlab = "", ylab = "")
main = title; xlab, ylab = the axis labels
Total within SS (Sum of Squares), stored in $tot.withinss — the lower the better for a given number of clusters
par(mfrow = c(2, 3)) sets up a 2×3 plotting grid, displaying 6 plots in one figure
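A minimal runnable sketch of the calls above, on synthetic two-cluster data (the data, k = 2, and nstart = 20 are illustrative assumptions, not from the notes):

```r
set.seed(1)
# Toy data: 25 points around (0, 0) and 25 points around (4, 4)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 20)  # 20 random starts, best run kept
km$cluster                                  # cluster number per observation
km$tot.withinss                             # total within-cluster sum of squares

plot(x, col = km$cluster, main = "k-means clusters", xlab = "", ylab = "")
```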
Hierarchical clustering
hclust(dist(x), method = "complete" / "average" / "single" / "centroid")
plot() on the hclust result (e.g. plot(hc)) draws the dendrogram
cutree(hc, h = 6) or cutree(hc, k = 2) — cut by height (maximum distance between observations within a cluster) or by number of clusters
Complete and average linkage produce more balanced trees. Single and centroid produce less balanced trees, which can be used to detect outliers!
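A short sketch of the hierarchical-clustering calls above; the toy data, the cut heights, and k = 3 are illustrative assumptions:

```r
set.seed(2)
x <- matrix(rnorm(40), ncol = 2)             # 20 observations, 2 features

hc <- hclust(dist(x), method = "complete")   # complete linkage
plot(hc)                                     # dendrogram

cutree(hc, k = 3)   # cut into exactly 3 clusters
cutree(hc, h = 2)   # or cut the tree at height 2
```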
Scaling
apply(x, 1, sd) or apply(x, 2, sd) — find the sd for each row (1) or each column (2)
scale(x) standardizes each column to mean = 0 and sd = 1 (it centers and rescales; it does not make the data normally distributed)
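A tiny check of the two calls above, on an assumed 3×2 toy matrix:

```r
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)  # toy matrix, columns on different scales

apply(x, 2, sd)   # sd per column (use 1 for rows)

xs <- scale(x)    # each column centered to mean 0, rescaled to sd 1
colMeans(xs)      # effectively 0 for every column
apply(xs, 2, sd)  # 1 for every column
```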
Dimensionality Reduction
Principal Component Analysis PCA
prcomp(x, scale. = FALSE, center = TRUE) — x: the data; scale.: whether to rescale variables to unit variance; center: whether to center variables around 0
Example output: 92.46% of the data's variability is retained with 1 component
97.769% of the data's variability is retained with 2 components
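A runnable sketch of prcomp() on assumed toy data with two deliberately correlated columns (the notes' percentages come from the author's own dataset, so the numbers here will differ):

```r
set.seed(3)
x <- matrix(rnorm(100), ncol = 4)         # 25 observations, 4 features
x[, 2] <- x[, 1] + rnorm(25, sd = 0.1)    # make columns 1 and 2 correlated

pr <- prcomp(x, scale. = TRUE, center = TRUE)
summary(pr)   # proportion and cumulative proportion of variance explained
```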
Visualizing
Biplot: vectors pointing in the same direction indicate correlated variables
biplot(pr)
Scree plot: plot either the marginal or the cumulative variance explained
variance explained pve = pvar / sum(pvar), where pvar = pr$sdev^2
summary(pr) also shows the cumulative variance explained
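A scree-plot sketch following the pve formula above, using the built-in USArrests dataset as a stand-in for the notes' data:

```r
pr <- prcomp(USArrests, scale. = TRUE)   # built-in dataset, scaled to unit variance
pve <- pr$sdev^2 / sum(pr$sdev^2)        # proportion of variance explained per PC

par(mfrow = c(1, 2))                     # marginal and cumulative, side by side
plot(pve, type = "b",
     xlab = "Principal component", ylab = "PVE")
plot(cumsum(pve), type = "b",
     xlab = "Principal component", ylab = "Cumulative PVE")
```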
Combining PCA with a clustering technique can improve the model!
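One common way to combine the two (an assumption, not a recipe from the notes) is to run k-means on the first few principal component scores instead of the raw features:

```r
pr <- prcomp(USArrests, scale. = TRUE)               # PCA on built-in data
# Cluster on PC1-PC2 scores rather than the original 4 variables;
# centers = 4 and nstart = 20 are illustrative choices
km <- kmeans(pr$x[, 1:2], centers = 4, nstart = 20)
table(km$cluster)                                    # cluster sizes
```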