R - Exploratory Stats

October 09, 2020

Sampling

sample_n(df, 150) – randomly select 150 observations (rows) from df

df %>%
  group_by(x) %>%
  sample_n(5)  # randomly select 5 obs from each x group in df
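A minimal sketch of both calls. Assumptions not in the notes: the built-in mtcars data stands in for df, cyl stands in for the grouping column x, and set.seed() is only there to make the draw repeatable.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed so the random draw is repeatable

# 10 random rows from the whole data frame
picked <- sample_n(mtcars, 10)

# 5 random rows from each cyl group (mtcars has 3 cyl groups)
per_group <- mtcars %>%
  group_by(cyl) %>%
  sample_n(5) %>%
  ungroup()

count(per_group, cyl)  # one row per group with its sampled row count
```

In dplyr 1.0.0 and later, slice_sample(n = 5) is the suggested replacement for sample_n(5).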

Single variable

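geom_bar(position = "fill") – 100% stacked bar (each bar shows proportions instead of counts)

geom_dotplot(dotsize = ...) – stacked dots; dotsize scales the dot diameter relative to the bin width

geom_density(bw = ...) – bw is the smoothing bandwidth, not a bin width. Each density curve is scaled to an area of 1, so groups of different sizes become directly comparable. A histogram of counts is better for seeing which group has more observations.

geom_histogram(bins = ...) or geom_histogram(binwidth = ...) – set the number of bins or the width of each bin

geom_boxplot() – add coord_flip() for a horizontal box plot

+ xlim(c(100, 500)) – set the x axis limits to 100 through 500; observations outside the limits are dropped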
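A rough sketch of these geoms side by side. The mtcars columns (cyl, gear, hp, mpg) and the specific bw, binwidth, dotsize and limit values are illustrative assumptions, not values from the notes.

```r
library(ggplot2)

# 100% stacked bar: share of each gear count within each cyl group
p_bar <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "fill")

# stacked dots; dotsize is relative to binwidth
p_dot <- ggplot(mtcars, aes(x = mpg)) +
  geom_dotplot(binwidth = 1, dotsize = 0.8)

# density: every curve has area 1, whatever the group size
p_density <- ggplot(mtcars, aes(x = hp, fill = factor(cyl))) +
  geom_density(bw = 20, alpha = 0.5)

# histogram of counts, restricted to hp between 100 and 300
# (rows outside the limits are dropped with a warning when drawn)
p_hist <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 25) +
  xlim(c(100, 300))

# horizontal box plot
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = hp)) +
  geom_boxplot() +
  coord_flip()
```

Print any of the objects (e.g. p_density) to draw it.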

Multi variable

facet_grid(a ~ b) – one panel per combination of a and b, with a in rows and b in columns
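For instance, assuming mtcars with am as the row variable and cyl as the column variable (both picked only for illustration):

```r
library(ggplot2)

# rows: am (0 = automatic, 1 = manual); columns: cyl (4, 6, 8)
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)
```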

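summarize()

mean
median
sd
var
n() – simple row count
IQR – interquartile range: the width (Q3 - Q1) of the middle 50% of the data
range – returns the min and max of the values (two numbers); use max(x) - min(x) or diff(range(x)) for a single spread value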
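A small sketch putting these functions into one summarize() call. The data (mtcars), the grouping by cyl and the output column names are assumptions made for the example.

```r
library(dplyr)

stats <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_obs      = n(),                  # row count per group
    mean_mpg   = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg     = sd(mpg),
    var_mpg    = var(mpg),             # equals sd_mpg squared
    iqr_mpg    = IQR(mpg),             # Q3 - Q1
    spread_mpg = max(mpg) - min(mpg)   # single number, unlike range()
  )

stats
```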

Review

%>% arrange(desc(x)) – sort descending
%>% arrange(x) – sort ascending (the default; dplyr has no asc())
ggplot facet_wrap(~ country, scales = "free_y") – separate y scale for each panel
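A quick sketch of both review items, assuming mtcars in place of the original data and cyl in place of country. Note that arrange() sorts ascending unless a column is wrapped in desc().

```r
library(dplyr)
library(ggplot2)

ascending  <- mtcars %>% arrange(mpg)        # lowest mpg first (default order)
descending <- mtcars %>% arrange(desc(mpg))  # highest mpg first

# one panel per cyl value, each with its own y axis range
p_wrap <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, scales = "free_y")
```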

Regression

model <- lm(y ~ x, data = df) – y as explained by x

summary(model) – shows coefficients etc., but the values are hard to extract programmatically, so use:
tidy(model) from broom – returns the coefficients as a data frame
bind_rows() from dplyr – stacks the rows of several data frames
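A minimal sketch of the lm, tidy, bind_rows sequence. The two models on mtcars and the added model label column are assumptions chosen only to have something to combine.

```r
library(dplyr)
library(broom)

model <- lm(mpg ~ wt, data = mtcars)   # mpg as explained by wt
coefs <- tidy(model)                   # one row per term: estimate, std.error, statistic, p.value

model_hp <- lm(mpg ~ hp, data = mtcars)

# stack the tidied output of both models, labelling each (label column is hypothetical)
both <- bind_rows(
  tidy(model)    %>% mutate(model = "mpg ~ wt"),
  tidy(model_hp) %>% mutate(model = "mpg ~ hp")
)

both
```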

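nest(df, -x) -> nested table with 2 columns: the group variable x, and a list column holding one data frame per group. The minus sign means x is not included in the nested data frames. (tidyr >= 1.0.0 spells this nest(df, data = -x).)
unnest(df, x) -> reverse of nest; x is the list column to expand (unnest(df, cols = x) in tidyr >= 1.0.0)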

map from purrr (pay attention to the dots)
map(numbers, ~ 1 + .) – add 1 to each element of numbers (the dot is the current element); returns a list
good ex: mutate(model = map(data, ~ lm(percent_yes ~ year, data = .))), then
mutate(tidied = map(model, ~ tidy(.)))
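The nest, map, tidy, unnest chain can be sketched end to end as below. Assumptions: mtcars replaces the original data, mpg ~ wt replaces percent_yes ~ year, cyl is the grouping column, and the tidyr >= 1.0.0 argument names (data =, cols =) are used.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

plus_one <- map(c(1, 2, 3), ~ 1 + .)   # list holding 2, 3, 4

by_cyl <- mtcars %>%
  nest(data = -cyl) %>%                                  # one row per cyl, rest in list column "data"
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .))) %>% # fit one model per group
  mutate(tidied = map(model, ~ tidy(.))) %>%              # coefficient table per model
  unnest(cols = tidied)                                   # expand to one row per group and term

# slope of wt for each cyl group
slopes <- by_cyl %>%
  filter(term == "wt") %>%
  select(cyl, estimate, p.value)
```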

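gather(df, key, value, ...) – reshape wide to long: key is the name of the new column that holds the old column names (the category), value is the name of the new column that holds their values, and the remaining arguments select the columns to gather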
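A tiny sketch of gather() on a made-up wide table (the data frame and the column names country, y2019, y2020, year, amount are all invented for illustration). pivot_longer() is shown alongside because tidyr now recommends it over gather().

```r
library(tidyr)

# hypothetical wide data: one column per year
wide <- data.frame(
  country = c("A", "B"),
  y2019   = c(1, 2),
  y2020   = c(3, 4)
)

# old column names go into "year", their values into "amount"
long <- gather(wide, key = "year", value = "amount", y2019, y2020)

# same reshape with the newer function
long2 <- pivot_longer(wide, c(y2019, y2020), names_to = "year", values_to = "amount")
```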

example <- c("apple", "banana", "apple", "orange")
recode(example, apple = "plum", banana = "grape") – works like a VLOOKUP / mapping function: matched values are replaced, the rest are left unchanged
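Run as is, the example looks like this. The lookup vector in the second half is an added illustration of the VLOOKUP idea, not something from the notes.

```r
library(dplyr)

example <- c("apple", "banana", "apple", "orange")

recoded <- recode(example, apple = "plum", banana = "grape")
recoded  # "plum" "grape" "plum" "orange" ("orange" has no match, so it stays)

# same mapping kept in a named vector, spliced in with !!!
lookup   <- c(apple = "plum", banana = "grape")
recoded2 <- recode(example, !!!lookup)
```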
