R - Clustering
dist(df, method = "euclidean") computes the pairwise distances between all observations (rows).
scale(df) standardizes each numeric column to mean = 0 and sd = 1, i.e. z-scores.
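A minimal sketch with a made-up numeric data frame (df, height and weight are placeholders): scale the columns first, then compute the distance matrix.

df <- data.frame(height = c(170, 182, 165, 190),   # made-up numeric data
                 weight = c(68, 90, 55, 82))
df_scaled <- scale(df)                        # each column: mean 0, sd 1
dist(df_scaled, method = "euclidean")         # distance between every pair of rows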
Categorical variables
Logical – the Jaccard index of two logical vectors is the number of positions where both are TRUE divided by the number where at least one is TRUE; FALSE/FALSE pairs are ignored.
dist(df, method = "binary") uses 1 - Jaccard index as the distance.
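A quick sketch with two made-up logical variables, showing the binary (Jaccard) distance between rows:

likes_wine   <- c(TRUE, TRUE, FALSE, TRUE)
likes_cheese <- c(TRUE, FALSE, FALSE, TRUE)
df_logical <- data.frame(likes_wine, likes_cheese)
dist(df_logical, method = "binary")    # = 1 - Jaccard index for each pair of rows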
Categorical variables with more than 2 levels
library(dummies)
dummy.data.frame(df) converts each categorical column into one 0/1 (dummy) column per level.
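A sketch with a made-up categorical data frame; note that the dummies package may need to be installed from the CRAN archive on newer R versions.

library(dummies)
df_cat <- data.frame(color = c("red", "blue", "red"),
                     size  = c("S", "M", "L"))
df_dummy <- dummy.data.frame(df_cat)   # one 0/1 column per category level
dist(df_dummy, method = "binary")      # then use the binary distance as above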
Hierarchical clustering
hclust(dist, method = "complete") performs hierarchical clustering on a distance matrix using complete linkage.
cutree(clust, k = 3) assigns each observation to one of 3 clusters.
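Putting the two steps together on the made-up df from the sketch above (not the course's actual data):

d  <- dist(scale(df), method = "euclidean")   # df from the sketch above
hc <- hclust(d, method = "complete")          # complete-linkage hierarchical clustering
clusters_k3 <- cutree(hc, k = 3)              # each observation gets a label 1..3
table(clusters_k3)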
ggplot() can then be used to visualize the cluster assignments.
Dendrogram
plot(clust) plots the clustering tree (dendrogram).
cutree(clust, h = 15) cuts the tree at height 15: all observations in the same cluster are joined below that height.
par(mfrow = c(1, 3)) displays 3 plots in one row.
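A plotting sketch continuing from the hc object above; the useful cut height depends on your own tree (15 here just mirrors the note).

par(mfrow = c(1, 2))                 # two plots in one row
plot(hc)                             # the clustering tree
abline(h = 15, col = "red")          # where cutree(clust, h = 15) would cut
clusters_h15 <- cutree(hc, h = 15)   # members of one cluster all merge below height 15
barplot(table(clusters_h15), main = "cluster sizes")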
dendextend
as.dendrogram(cluster) converts an hclust object to a dendrogram object.
color_branches(dend, h = 20) creates a dendrogram with branches colored by the clusters formed at height 20.
plot(dend)
segmented <- mutate(customers_spend, cluster = clust_customers) adds the cluster assignments back to the original data (dplyr).
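A sketch of the dendextend workflow; customers_spend here is a small made-up stand-in for the data set named in the note, and the cut height is arbitrary.

library(dendextend)
library(dplyr)

customers_spend <- data.frame(milk  = c(10, 2, 8, 1),   # made-up stand-in data
                              bread = c(5, 9, 4, 8))
hc_cust <- hclust(dist(scale(customers_spend)), method = "complete")

dend <- as.dendrogram(hc_cust)                # hclust -> dendrogram object
dend_colored <- color_branches(dend, h = 2)   # color branches cut at height 2 (pick a height from your own tree)
plot(dend_colored)

clust_customers <- cutree(hc_cust, h = 2)
segmented <- mutate(customers_spend, cluster = clust_customers)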
K-means
kmeans(df, centers = 2) creates a k-means model object with 2 clusters.
k$cluster gives the cluster assigned to each observation.
k$tot.withinss is the total within-cluster sum of squares, used for the 'elbow' plot.
map_dbl(1:10, func) from purrr is a good replacement for a for loop when fitting models across a range of k.
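A sketch of the elbow plot built with map_dbl; df_big is made-up data with enough rows to support up to 10 clusters.

library(purrr)
library(ggplot2)

set.seed(42)
df_big <- data.frame(x = rnorm(100), y = rnorm(100))   # made-up data

tot_withinss <- map_dbl(1:10, function(k) {
  kmeans(df_big, centers = k, nstart = 20)$tot.withinss
})

elbow_df <- data.frame(k = 1:10, tot_withinss = tot_withinss)
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)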
Silhouette
library(cluster)
pam(df, k = 3) fits a PAM model and computes the silhouette width S(i) of each observation.
plot(silhouette(si)) gives a quick silhouette plot for the model.
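A silhouette sketch with the cluster package, reusing the made-up df_big from the elbow sketch above.

library(cluster)

set.seed(42)
df_big <- data.frame(x = rnorm(100), y = rnorm(100))   # same made-up data as above
pam_k3 <- pam(df_big, k = 3)          # PAM model; silhouette width S(i) per observation
pam_k3$silinfo$avg.width              # average silhouette width for this k
plot(silhouette(pam_k3))              # quick silhouette plot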