R - Clustering
dist(df, method = "euclidean") computes the pairwise distances between all observations (rows).
scale(df) standardizes each numeric column to mean = 0 and sd = 1, i.e. z-scores.
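A minimal sketch with a made-up numeric data frame (df, height and weight are placeholders): scale the columns first, then compute the distance matrix.

df <- data.frame(height = c(170, 182, 165, 190),   # made-up numeric data
                 weight = c(68, 90, 55, 82))
df_scaled <- scale(df)                        # each column: mean 0, sd 1
dist(df_scaled, method = "euclidean")         # distance between every pair of rows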
Categorical variables
Logical – the Jaccard index of two logical vectors is the number of positions where both are TRUE divided by the number where at least one is TRUE; FALSE/FALSE pairs are ignored.
dist(df, method = "binary") uses 1 - Jaccard index as the distance.
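A quick sketch with two made-up logical variables, showing the binary (Jaccard) distance between rows:

likes_wine   <- c(TRUE, TRUE, FALSE, TRUE)
likes_cheese <- c(TRUE, FALSE, FALSE, TRUE)
df_logical <- data.frame(likes_wine, likes_cheese)
dist(df_logical, method = "binary")    # = 1 - Jaccard index for each pair of rows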
Categorical variables with more than 2 levels
library(dummies)
dummy.data.frame(df) converts each categorical column into one 0/1 (dummy) column per level.
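A sketch with a made-up categorical data frame; note that the dummies package may need to be installed from the CRAN archive on newer R versions.

library(dummies)
df_cat <- data.frame(color = c("red", "blue", "red"),
                     size  = c("S", "M", "L"))
df_dummy <- dummy.data.frame(df_cat)   # one 0/1 column per category level
dist(df_dummy, method = "binary")      # then use the binary distance as above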
Hierarchical clustering
hclust(dist, method = "complete") performs hierarchical clustering on a distance matrix using complete linkage.
cutree(clust, k = 3) assigns each observation to one of 3 clusters.
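Putting the two steps together on the made-up df from the sketch above (not the course's actual data):

d  <- dist(scale(df), method = "euclidean")   # df from the sketch above
hc <- hclust(d, method = "complete")          # complete-linkage hierarchical clustering
clusters_k3 <- cutree(hc, k = 3)              # each observation gets a label 1..3
table(clusters_k3)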
ggplot() can then be used to visualize the cluster assignments.
Dendrogram
plot(clust) plots the clustering tree (dendrogram).
cutree(clust, h = 15) cuts the tree at height 15: all observations in the same cluster are joined below that height.
par(mfrow = c(1, 3)) displays 3 plots in one row.
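A plotting sketch continuing from the hc object above; the useful cut height depends on your own tree (15 here just mirrors the note).

par(mfrow = c(1, 2))                 # two plots in one row
plot(hc)                             # the clustering tree
abline(h = 15, col = "red")          # where cutree(clust, h = 15) would cut
clusters_h15 <- cutree(hc, h = 15)   # members of one cluster all merge below height 15
barplot(table(clusters_h15), main = "cluster sizes")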
dendextend
as.dendrogram(cluster) converts an hclust object to a dendrogram object.
color_branches(dend, h = 20) creates a dendrogram with branches colored by the clusters formed at height 20.
plot(dend)
segmented <- mutate(customers_spend, cluster = clust_customers) adds the cluster assignments back to the original data (dplyr).
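A sketch of the dendextend workflow; customers_spend here is a small made-up stand-in for the data set named in the note, and the cut height is arbitrary.

library(dendextend)
library(dplyr)

customers_spend <- data.frame(milk  = c(10, 2, 8, 1),   # made-up stand-in data
                              bread = c(5, 9, 4, 8))
hc_cust <- hclust(dist(scale(customers_spend)), method = "complete")

dend <- as.dendrogram(hc_cust)                # hclust -> dendrogram object
dend_colored <- color_branches(dend, h = 2)   # color branches cut at height 2 (pick a height from your own tree)
plot(dend_colored)

clust_customers <- cutree(hc_cust, h = 2)
segmented <- mutate(customers_spend, cluster = clust_customers)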
K-means
kmeans(df, centers = 2) creates a k-means model object with 2 clusters.
k$cluster gives the cluster assigned to each observation.
k$tot.withinss is the total within-cluster sum of squares, used for the 'elbow' plot.
map_dbl(1:10, func) from purrr is a good replacement for a for loop when fitting models across a range of k.
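A sketch of the elbow plot built with map_dbl; df_big is made-up data with enough rows to support up to 10 clusters.

library(purrr)
library(ggplot2)

set.seed(42)
df_big <- data.frame(x = rnorm(100), y = rnorm(100))   # made-up data

tot_withinss <- map_dbl(1:10, function(k) {
  kmeans(df_big, centers = k, nstart = 20)$tot.withinss
})

elbow_df <- data.frame(k = 1:10, tot_withinss = tot_withinss)
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)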
Silhouette
library(cluster)
pam(df, k = 3) fits a PAM model and computes the silhouette width S(i) of each observation.
plot(silhouette(si)) gives a quick silhouette plot for the model.
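A silhouette sketch with the cluster package, reusing the made-up df_big from the elbow sketch above.

library(cluster)

set.seed(42)
df_big <- data.frame(x = rnorm(100), y = rnorm(100))   # same made-up data as above
pam_k3 <- pam(df_big, k = 3)          # PAM model; silhouette width S(i) per observation
pam_k3$silinfo$avg.width              # average silhouette width for this k
plot(silhouette(pam_k3))              # quick silhouette plot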