R - Exploratory Stats
Sampling
sample_n(df, 150) – Randomly select 150 obs (rows) from the dataset
df %>%
group_by(x) %>%
sample_n(5) – Select 5 random obs from each x group in df
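A minimal sketch of both sampling patterns. It assumes the built-in mtcars data set as a stand-in for df and cyl as the grouping column; neither comes from the notes above.

```r
library(dplyr)

set.seed(42)  # make the random draws reproducible

# Draw 10 random rows from the whole data set
simple_sample <- sample_n(mtcars, 10)

# Draw 3 random rows from each cyl group (mtcars has 3 cyl groups)
grouped_sample <- mtcars %>%
  group_by(cyl) %>%
  sample_n(3) %>%
  ungroup()

# Check how many rows came from each group
count(grouped_sample, cyl)
```

In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(n = ...), which works the same way on grouped data.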
Single variable
geom_bar() – position = "fill" shows a 100% stacked bar
geom_dotplot(dotsize) – stacked dots
geom_density(bw) – bw is the smoothing bandwidth (not a bin width). Each density curve is scaled to an area of 1, so groups of different sizes become comparable; a histogram is the better choice for seeing which group has more observations.
geom_histogram(bins) – number of bins; binwidth sets the width of each bin instead
geom_boxplot(), coord_flip() to flip to horizontal box plot
+ xlim(c(100, 500)) set limits to x axis from 100 to 500
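The single-variable geoms above might be used as follows. This is a sketch that assumes the built-in mtcars data set; the column choices (hp, cyl, gear) and the bw and binwidth values are illustrative only.

```r
library(ggplot2)

# 100% stacked bar: share of each gear count within each cylinder class
p_bar <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "fill")

# Density curves: each curve has area 1, so groups of different sizes
# can be overlaid; xlim() restricts the x axis
p_density <- ggplot(mtcars, aes(x = hp, fill = factor(cyl))) +
  geom_density(alpha = 0.4, bw = 20) +
  xlim(c(0, 400))

# Histogram: shows raw counts, so differences in group size stay visible
p_hist <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 25)

# Box plot per group, flipped so the boxes run horizontally
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = hp)) +
  geom_boxplot() +
  coord_flip()
```

Printing any of the objects (for example p_bar) draws the plot.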
Multi variable
facet_grid(a ~ b) – one panel per combination, a in rows and b in columns
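A small sketch of a facet grid, assuming mtcars with cyl and am playing the roles of a and b:

```r
library(ggplot2)

# Rows are split by cyl, columns by am; all panels share the same axes
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(cyl ~ am)
```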
summarize()
mean
median
sd
var
n() – simple count of rows
IQR – interquartile range (Q3 - Q1), the width of the middle 50% of the data
range – returns the minimum and maximum of the data; diff(range(x)) gives the total spread
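The summary functions above can be combined in one summarize() call. This sketch assumes mtcars, with mpg as the variable and cyl as the grouping column. Because range() returns two values, min() and max() are used here to keep one row per group.

```r
library(dplyr)

stats_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg   = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg     = sd(mpg),
    var_mpg    = var(mpg),   # variance is the square of sd
    n_cars     = n(),        # number of rows in the group
    iqr_mpg    = IQR(mpg),   # Q3 - Q1
    min_mpg    = min(mpg),   # the two ends of range(mpg)
    max_mpg    = max(mpg)
  )

stats_by_cyl
```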
Review
%>% arrange(desc(x))
%>% arrange(x) – ascending is the default; dplyr has no asc()
ggplot facet_wrap(~ country, scales = "free_y") – separate y scales
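A sketch of sorting and free-scale faceting. It assumes mtcars, with cyl standing in for the country column mentioned above.

```r
library(dplyr)
library(ggplot2)

# Descending sort needs desc(); ascending is what arrange() does by default
heaviest_first <- mtcars %>% arrange(desc(wt))
lightest_first <- mtcars %>% arrange(wt)

# One panel per cyl value, each with its own y-axis range
p_wrap <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, scales = "free_y")
```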
Regression
model <- lm(y ~ x, data = df) – y as explained by x
summary(model) – shows coefficients, etc., but the info is hard to extract from it ->
tidy(model) from broom, to extract info
bind_rows() from dplyr, to combine rows
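A minimal sketch of fitting a model, tidying it, and stacking results from two fits. It assumes mtcars, with mpg as y and wt as x; the split on am is only for illustration.

```r
library(dplyr)
library(broom)

# Fit a simple linear regression and turn the coefficient table into a data frame
model <- lm(mpg ~ wt, data = mtcars)
coefs <- tidy(model)  # columns: term, estimate, std.error, statistic, p.value

# Fit the same model on two subsets and stack the tidied results;
# .id records which input each row came from
fit_auto   <- tidy(lm(mpg ~ wt, data = filter(mtcars, am == 0)))
fit_manual <- tidy(lm(mpg ~ wt, data = filter(mtcars, am == 1)))
both <- bind_rows(automatic = fit_auto, manual = fit_manual, .id = "transmission")

both
```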
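nest(df, -x) -> nested table with 2 columns: the group variable x, and a list column holding a data frame of the remaining columns for each group. The minus sign means x is not included in the nested data frames. (tidyr 1.0.0 and later write this as nest(df, data = -x).)
unnest(df, x) -> reverse of nest; x is the list column to expand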
map from purrr (pay attention to the dots)
map(numbers, ~ 1 + .) – add 1 to each element of numbers; the result is a list
good ex: mutate(model = map(data, ~ lm(percent_yes ~ year, data = .))), then
mutate(tidied = map(model, ~ tidy(.)))
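The nest, map, tidy, unnest pipeline can be sketched end to end. This assumes mtcars in place of the voting data (mpg ~ wt instead of percent_yes ~ year, grouped by cyl) and tidyr 1.0.0 or later for the nest(data = -cyl) syntax.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# In a ~ formula, the dot stands for the current element
plus_one <- map(c(1, 2, 3), ~ 1 + .)

slopes <- mtcars %>%
  nest(data = -cyl) %>%                           # one row per cyl, rest in a list column
  mutate(
    model  = map(data, ~ lm(mpg ~ wt, data = .)), # one model per group
    tidied = map(model, ~ tidy(.))                # one coefficient table per model
  ) %>%
  unnest(tidied) %>%                              # expand the coefficient tables into rows
  filter(term == "wt") %>%                        # keep only the slope rows
  select(cyl, estimate, p.value)

slopes
```

Filtering on term after unnest() is what makes the per-group slopes directly comparable in one table.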
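gather(df, key, value, ...) – reshape wide to long: key names the new column that holds the old column names, value names the new column that holds their values, and ... selects the columns to gather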
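A small sketch of gather() on a made-up data frame; the wide table and its column names are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one column per measurement
wide <- data.frame(id = 1:2, height = c(150, 160), weight = c(50, 60))

# Stack height and weight into a name column and a value column
long <- gather(wide, key = "measure", value = "value", height, weight)

long
```

gather() still works but is superseded in tidyr 1.0.0 and later by pivot_longer().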
example <- c("apple", "banana", "apple", "orange")
recode(example, apple = "plum", banana = "grape"), vlookup, mapping function!
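The recode() call above, written out as a runnable snippet. Values without a match are left unchanged.

```r
library(dplyr)

example <- c("apple", "banana", "apple", "orange")

# Map old values to new ones; "orange" has no mapping and is kept as is
recoded <- recode(example, apple = "plum", banana = "grape")

recoded
# "plum" "grape" "plum" "orange"
```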



