R - Unsupervised Learning
K-means
kmeans(x, centers = ..., nstart = ...) — x: a two-dimensional array; centers: how many groups; nstart: how many times to run k-means (the best run is kept)
$cluster on the kmeans() result (e.g. km$cluster) contains the number of the cluster each observation belongs to
plot(x, col = km$cluster, main = "k-means clusters", xlab = "", ylab = "")
main = title; xlab, ylab = the axis labels
Total within SS (Sum of Squares), stored in $tot.withinss — the lower the better for a given number of clusters
par(mfrow = c(2, 3)) sets up a 2×3 plotting grid, displaying 6 plots in one figure
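A minimal runnable sketch of the calls above, on synthetic two-cluster data (the data, k = 2, and nstart = 20 are illustrative assumptions, not from the notes):

```r
set.seed(1)
# Toy data: 25 points around (0, 0) and 25 points around (4, 4)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 20)  # 20 random starts, best run kept
km$cluster                                  # cluster number per observation
km$tot.withinss                             # total within-cluster sum of squares

plot(x, col = km$cluster, main = "k-means clusters", xlab = "", ylab = "")
```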
Hierarchical clustering
hclust(dist(x), method = "complete" / "average" / "single" / "centroid")
plot() on the hclust result (e.g. plot(hc)) draws the dendrogram
cutree(hc, h = 6) or cutree(hc, k = 2) — cut by height (maximum distance between observations within a cluster) or by number of clusters
Complete and average linkage produce more balanced trees. Single and centroid produce less balanced trees, which can be used to detect outliers!
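A short sketch of the hierarchical-clustering calls above; the toy data, the cut heights, and k = 3 are illustrative assumptions:

```r
set.seed(2)
x <- matrix(rnorm(40), ncol = 2)             # 20 observations, 2 features

hc <- hclust(dist(x), method = "complete")   # complete linkage
plot(hc)                                     # dendrogram

cutree(hc, k = 3)   # cut into exactly 3 clusters
cutree(hc, h = 2)   # or cut the tree at height 2
```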
Scaling
apply(x, 1, sd) or apply(x, 2, sd) — find the sd for each row (1) or each column (2)
scale(x) standardizes each column to mean = 0 and sd = 1 (it centers and rescales; it does not make the data normally distributed)
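A tiny check of the two calls above, on an assumed 3×2 toy matrix:

```r
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)  # toy matrix, columns on different scales

apply(x, 2, sd)   # sd per column (use 1 for rows)

xs <- scale(x)    # each column centered to mean 0, rescaled to sd 1
colMeans(xs)      # effectively 0 for every column
apply(xs, 2, sd)  # 1 for every column
```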
Dimensionality Reduction
Principal Component Analysis PCA
prcomp(x, scale. = FALSE, center = TRUE) — x: the data; scale.: whether to rescale variables to unit variance; center: whether to center variables around 0
Example output: 92.46% of the data's variability is retained with 1 component
97.769% of the data's variability is retained with 2 components
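A runnable sketch of prcomp() on assumed toy data with two deliberately correlated columns (the notes' percentages come from the author's own dataset, so the numbers here will differ):

```r
set.seed(3)
x <- matrix(rnorm(100), ncol = 4)         # 25 observations, 4 features
x[, 2] <- x[, 1] + rnorm(25, sd = 0.1)    # make columns 1 and 2 correlated

pr <- prcomp(x, scale. = TRUE, center = TRUE)
summary(pr)   # proportion and cumulative proportion of variance explained
```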
Visualizing
Biplot: vectors pointing in the same direction indicate correlated variables
biplot(pr)
Scree plot: plot either the marginal or the cumulative variance explained
variance explained pve = pvar / sum(pvar), where pvar = pr$sdev^2
summary(pr) also shows the cumulative variance explained
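A scree-plot sketch following the pve formula above, using the built-in USArrests dataset as a stand-in for the notes' data:

```r
pr <- prcomp(USArrests, scale. = TRUE)   # built-in dataset, scaled to unit variance
pve <- pr$sdev^2 / sum(pr$sdev^2)        # proportion of variance explained per PC

par(mfrow = c(1, 2))                     # marginal and cumulative, side by side
plot(pve, type = "b",
     xlab = "Principal component", ylab = "PVE")
plot(cumsum(pve), type = "b",
     xlab = "Principal component", ylab = "Cumulative PVE")
```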
Combining PCA with a clustering technique can improve the model!
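One common way to combine the two (an assumption, not a recipe from the notes) is to run k-means on the first few principal component scores instead of the raw features:

```r
pr <- prcomp(USArrests, scale. = TRUE)               # PCA on built-in data
# Cluster on PC1-PC2 scores rather than the original 4 variables;
# centers = 4 and nstart = 20 are illustrative choices
km <- kmeans(pr$x[, 1:2], centers = 4, nstart = 20)
table(km$cluster)                                    # cluster sizes
```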