Posts

Showing posts from October, 2020

R – Stats, factor count, proportion

Change to factor: adult$RACEHPR2 <- factor(adult$RACEHPR2, labels = c("Latino", "Asian", "African American", "White"))
Show all levels: levels(v)
Remove unused levels: droplevels(v)
Counts of factors: x <- table(adult$RBMI, adult$SRAGE_P), or count(dataframe, variable)
Proportions:
prop.table(x): overall proportions, all cells sum to 1
prop.table(x, 1): conditional proportions, each row sums to 1
prop.table(x, 2): conditional proportions, each column sums to 1
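A minimal sketch of the factor, count and proportion steps above, using base R only. The data frame and its columns (race, bmi) are made-up stand-ins, not the survey data from the notes.

```r
# Hypothetical toy data standing in for the survey data frame in the notes
df <- data.frame(
  race = c(1, 2, 2, 3, 4, 4, 4, 1),
  bmi  = c("Normal", "Over", "Normal", "Over", "Normal", "Over", "Over", "Normal")
)

# Convert the numeric codes to a labelled factor
df$race <- factor(df$race, labels = c("Latino", "Asian", "African American", "White"))

# Two-way counts: rows = bmi, columns = race
x <- table(df$bmi, df$race)

overall <- prop.table(x)      # all cells together sum to 1
by_row  <- prop.table(x, 1)   # each row sums to 1
by_col  <- prop.table(x, 2)   # each column sums to 1

print(rowSums(by_row))
print(colSums(by_col))
```

The second argument of prop.table() is the margin: 1 conditions on rows, 2 on columns.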

Consulting Interview

Numerical skills – fairly simple tips on quick estimation and rounding.
A top-down approach is preferred.
Hypothesis – Always start with a hypothesis, then reason from the information to either prove or disprove it. Don't present information and a string of rationales before the conclusion; that is less clear to listeners, especially C-level executives.
Synthesis – "Let's do A, because of B, C and D."

R – Time, lubridate

EXTRACT
month(date, label = TRUE, abbr = TRUE): label = TRUE shows the name of the month
year(), day()
yday(): 1 to 366
wday()
ROUND
floor_date(), round_date(), ceiling_date()
INTERVAL
(start time) %--% (end time)
int_start(), int_end(), int_length()
%within%, int_overlaps()
as.period() – convert interval to period
as.duration() – convert interval to duration
TIMEZONE
force_tz(time, tzone = ""): keep the clock time, change the time zone
with_tz(time, tzone = ""): same instant, shown in another time zone
IMPORT
parse_date_time(time, orders = "ymd")
fast_strptime(time, format = "%Y-%m-%d"), same format codes as strptime()
fastPOSIXct(), from the fasttime package
stamp(string): builds a formatter from an example time string and reports the format it picked (%Y, %d etc.)
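A short sketch that exercises a few of the lubridate helpers listed above. It assumes the lubridate package is installed; the dates are arbitrary examples.

```r
library(lubridate)

d <- ymd("2020-10-15")                 # parse a date

month(d)                               # month as a number
month(d, label = TRUE, abbr = TRUE)    # month as an abbreviated name
yday(d)                                # day of the year
floor_date(d, unit = "month")          # first day of that month

# Interval between two dates, and a membership check
span <- ymd("2020-10-01") %--% ymd("2020-10-31")
inside <- d %within% span

as.period(span)                        # interval as a period
as.duration(span)                      # interval as a duration

# Same instant shown in another zone vs. same clock time moved to another zone
noon_utc <- ymd_hms("2020-10-15 12:00:00", tz = "UTC")
with_tz(noon_utc, tzone = "America/New_York")
force_tz(noon_utc, tzone = "America/New_York")
```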

Jonas - Javascript

Variables – var abc = ''; data types
Section 3 – How JS works
Scoping: when code refers to a variable that is not available in the current scope, the engine looks in the outer scope (and then the next outer scope, and so on).
Section 4 – DOM Manipulation
classList: .add and .remove classes; .toggle turns the "active" class on/off
Section 5 – Objects and Functions
Almost everything is an object (primitives are the exception); check with console.log() and look for __proto__.
When we access a property or method of an object, the lookup starts on the object itself, then moves to its prototype, and so on up the chain.
call, apply and bind change the this value; bind can also pre-set some arguments of a function.

Bryan Peterson – Learning to See Creatively

Expanding your vision
Try different, new angles, e.g. lying down or climbing to shoot.
Close-up shooting tells a better story and evokes stronger feelings.
Elements of Design
Line, Shape, Form, Pattern, Color
Composition
Zoom in, fill the frame
Rule of Thirds
Frame within frame: another object acts as a frame.
Picture within picture: part of the current image can be an image on its own.
Just take both horizontal and vertical pics, or … not, just break the rules.
The Magic of Light
Front lighting – no notes
Sidelighting – most dramatic, contrast between highlights and shadows
Backlighting – silhouette, shadow
Digital Photography
Using photo retouching software is fine.
Career Considerations – did not read

SQL Join

cross join – all combinations of rows from both tables
union – stack, unique rows only
union all – stack, include duplicates
intersect – rows present in both results
except – rows in the 1st result that are not in the 2nd
semi join – select rows from the 1st table that are ALSO present in the 2nd table: where (field) in (table)
anti join – select rows from the 1st table that are NOT present in the 2nd table: where (field) not in (table)
A subquery can be used instead of group by!

R - Supervised Learning

knn
k nearest neighbors, library(class)
knn(training data, test data, labels), extra args: k = 2; prob = TRUE returns the proportion of votes behind each guess.
Primitive Bayes
nrow(df) counts rows; subset(df, conditions, e.g. col > 0), take nrow() of this subset, then do the divisions.
Naive Bayes
model <- naive_bayes(y ~ x1 + x2 + … , data = df): y as explained by x1, x2 etc.
predict(model, test_df), plus type = "prob" to show posterior probabilities (instead of the overall prior probability)
Laplace correction: add 1 to each count so that a single 0% probability does not zero out the whole estimate.
Logistic Regression
model <- glm(y ~ x1 + x2 …, data = df, family = "binomial")
predict(model, test_df, type = "response")
ROC curve measures how well the model performs; AUC (area under the curve) ranges from 0 to 1, 1 is best.
library(pROC)
ROC <- roc(actual var, predicted var)
plot(ROC), auc(ROC)
Decision Tree
library(rpart)
model <- rpart(y ~ x1 + x2 + …, data = df, method = "class", control = rpart.control(cp = 0))
predict(m...
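A rough sketch of the logistic regression step from the notes above, in base R. The data are simulated (x1, x2, y and the train/test split are all made up for illustration), and the 0.5 cut-off is an assumption rather than anything stated in the notes.

```r
set.seed(7)

# Hypothetical simulated data: y depends on x1 and x2 through a logistic link
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, size = 1, prob = plogis(1.5 * df$x1 - 1 * df$x2))

train <- df[1:150, ]
test  <- df[151:200, ]

# Fit the model: y as explained by x1 and x2
model <- glm(y ~ x1 + x2, data = train, family = "binomial")

# type = "response" returns predicted probabilities instead of log-odds
prob <- predict(model, test, type = "response")

# Turn probabilities into class guesses with an assumed 0.5 cut-off
pred     <- ifelse(prob > 0.5, 1, 0)
accuracy <- mean(pred == test$y)
print(accuracy)
```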

R - Unsupervised Learning

K-means
kmeans(x, centers = …, nstart = …): x = 2-dimensional array, centers = how many groups, nstart = how many times to run k-means
$cluster contains the number of the cluster each observation belongs to
plot(x, col = km$cluster, main = "k-means clusters", xlab = "", ylab = ""), where km is the kmeans() result; main = title, xlab, ylab = the axis labels
Total within SS (sum of squares): the lower the better
par(mfrow = c(2, 3)): set up a 2×3 plotting grid, displaying 6 plots in 1 grid
Hierarchical clustering
hclust(dist(x), method = "complete"/"average"/"single"/"centroid")
plot(hclust)
cutree(hclust, h = 6 or k = 2): cut by height (max distance between observations) or by number of clusters
Complete and average produce more balanced trees. Single and centroid produce less balanced trees; this can be used to detect outliers!
Scaling
apply(x, 1 or 2, sd): find sd for each row (1) or column (2)
scale(x): standardize each column to mean = 0 and sd = 1
Dimensionality Reduction
Principal Component Analysis (PCA)
prcomp(x data, scale = FALSE normal distribution?, center = TR...
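A small sketch of the k-means and hierarchical clustering calls from the notes above, using base R. The two-blob data set is made up so that both methods have an obvious answer.

```r
set.seed(1)

# Hypothetical data: two well-separated blobs in 2D, 20 points each
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# K-means: 2 groups, 20 random restarts, best run is kept
km <- kmeans(x, centers = 2, nstart = 20)
table(km$cluster)     # cluster sizes
km$tot.withinss       # total within-cluster sum of squares (lower is better)

# Hierarchical clustering on the pairwise distances, cut into 2 groups
hc     <- hclust(dist(x), method = "complete")
groups <- cutree(hc, k = 2)

# Cross-tabulate the two assignments to see whether they agree
agreement <- table(km$cluster, groups)
print(agreement)
```

Cluster numbers are arbitrary labels, so agreement shows up as one non-zero cell per row rather than as matching numbers.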

R - Clustering

dist(df, method = "euclidean") calculates the distances between all pairs of observations.
scale(df) standardizes each column to mean = 0, sd = 1 (~ z score).
Categorical variables
Logical – the Jaccard index is the number of positions where both variables are TRUE divided by the number where at least one is TRUE; positions where both are FALSE are not counted. dist(df, method = "binary") returns the matching distance, 1 minus the Jaccard index.
More than 2 categorical values
library(dummies)
dummy.data.frame(df) converts categorical columns to logical/binary columns
Hierarchical clustering
hclust(dist, method = "complete") builds the cluster tree from the pairwise distances
cutree(clust, k = 3): assign obs into 3 groups
ggplot()
Dendrogram
plot(clust): the clustering tree!
cutree(clust, h = 15): all obs in the same group are joined at heights below 15
par(mfrow = c(1,3)): display 3 graphs in 1 row
dendextend
as.dendrogram(cluster): convert the cluster object to a dendrogram object
color_branches(dend, h = 20): creates a colored dendrogram object
plot(dend) s...
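To check the Jaccard idea by hand, here is a minimal base R sketch with two made-up logical vectors (a and b are illustrative, not from the notes). It compares a manual count with what dist(..., method = "binary") returns.

```r
# Two hypothetical logical observations
a <- c(TRUE, TRUE, TRUE,  FALSE, FALSE)
b <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

both   <- sum(a & b)   # positions where both are TRUE
either <- sum(a | b)   # positions where at least one is TRUE

jaccard_index <- both / either

# dist() works on rows of a numeric matrix, so stack the two vectors as 0/1 rows
m <- rbind(a = as.numeric(a), b = as.numeric(b))
d <- dist(m, method = "binary")

print(jaccard_index)
print(d)
```

The value in d is the distance, so it should equal 1 minus jaccard_index.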

R - Correlation & Regression

Change scale of x,y coordinates: + coord_trans(x = "log10", y = "log10"), or + scale_y_log10() + scale_x_log10()
cor(x, y) calculates the correlation coefficient; the use argument controls how NA values are handled:
ncbirths %>% summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
Add a best fit (least squares) line to a ggplot: geom_smooth(method = "lm", se = FALSE)
Detail of the best fit line: lm_obj <- lm(y ~ x, data = df)
Useful functions for the lm object/model (mod): coef(mod), fitted.values(mod), residuals(mod), summary(mod), df.residual(mod)
Making predictions from a model and newdata (which should have variables with the SAME names as in the model): predict(mod, newdata)
broom package: augment(mod) parses model results/parameters into a dataframe
Leverage concept: points that are close to the center of the plot have low leverage, while those far from the center have high leverage. Leverage is the .hat column after augment()
Influence = Leverage and residual, how each individual observ...
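A minimal base R sketch of the correlation, lm() and predict() steps from the notes above. The data are simulated (the true line y = 3 + 2x and the single NA are assumptions made for the example).

```r
set.seed(42)

# Hypothetical data: a noisy straight line with one missing y value
df   <- data.frame(x = 1:20)
df$y <- 3 + 2 * df$x + rnorm(20, sd = 0.5)
df$y[5] <- NA

# Correlation that skips the incomplete pair instead of returning NA
r <- cor(df$x, df$y, use = "pairwise.complete.obs")

# Least squares fit; lm() drops rows with NA by default
mod <- lm(y ~ x, data = df)
coef(mod)          # intercept and slope
df.residual(mod)   # complete observations minus number of coefficients

# newdata must use the same variable name as the model (x)
newdata <- data.frame(x = c(21, 22))
pred    <- predict(mod, newdata)
print(pred)
```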

R - httr Web interaction

Interact with web
raw <- GET(url_string) – retrieve from the server. A status starting with 2 or 3 is fine, 4 is your problem, 5 is their problem.
POST(url_string, data) – send data to the server
content(raw, as = "text" / "parsed" (default)), from httr, reads the body of a GET result
http_error(raw) returns TRUE if there's an error, which makes error handling easier
GET(url, user_agent("my@email.address this is a test"), query = params)
user_agent provides extra info for the webmaster, in case anything goes wrong.
query adds params to the url more easily (rather than string concat or paste); param example: list(x = "asd", y = "qwe") -> url?x=asd&y=qwe
paste() with sep = "/" glues strings together
http_type()
JSON
fromJSON(content(raw, as = "text"))
rlist package: list.select(json, var1, var2) collects variables out of a json, list.stack() stacks the result from list.select into a dataframe
bind_rows, from dplyr, turns a list into a df.
XML
xml2 package
read_xml()
xml_structure()
xml_find_all(xml result, XPA...

R - Functions apply, purrr

map(df/list, function): apply a function to each element and return a list. Iterates over the columns of a dataframe, or the elements of a list.
map_dbl (also _lgl, _int, _chr) returns a vector of the specified type.
map is considered more consistent than sapply and lapply.
Other ways to specify the function: map(df, function(x) sum(is.na(x))), or map(df, ~ sum(is.na(.)))
Other shortcut:
Dealing with failures
map(df, safely(function here)) returns a list that holds, for each element, a result (as normal) and an error (NULL or the error report)
Others to try: possibly(), quietly()
Multi-dimension iteration
map2(list(5,10,20), list(1,2,3), rnorm): iterate over 2 args
pmap(list(n = list(5,10,20), mean = list(1,2,3), sd = list(0.1,0.2,0.3)), rnorm): iterate over many args
invoke_map(list(func1,func2,func3), n = 5): iterate over functions
Each has a family of functions, the _int, _chr etc. variants, that return a vector of that data type.
Maps for functions with side effects
Such as print, ggplot, save file
walk() – same usage ...
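A short sketch of the map variants described above. It assumes the purrr package is installed; the data frame and the input vectors are made up for illustration.

```r
library(purrr)

# Hypothetical data frame with some missing values
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# Count NAs per column: list result vs. typed vector result
na_list <- map(df, ~ sum(is.na(.)))
na_int  <- map_int(df, ~ sum(is.na(.)))

# safely() wraps a function so failures are captured instead of stopping the loop
safe_log <- safely(log)
res <- map(list(10, "a"), safe_log)
# res[[1]]$result holds log(10); res[[2]]$error holds the captured error

# Iterate over two inputs in parallel
sums <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)

# Iterate over three named inputs in parallel
total <- pmap_dbl(list(a = 1:3, b = 4:6, c = 7:9), function(a, b, c) a + b + c)

print(na_int)
print(sums)
print(total)
```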

R - Data first glimpse

Quick first look
head(), tail(), class(), dim(), names(), str()
glimpse(), from dplyr, like str
summary()
Quick plot
hist(x), plot(x, y), boxplot(x, horizontal = TRUE)
Tidy
gather(df, key, value, …): key = name of new cat column, value = name of value column, … = columns to gather or to not gather
spread(df, key, value): key = column whose values become the new column names, value = column holding the values
separate(df, col, into): col = name of 1 column to separate, into = c(names of new columns), sep = '-'
unite(df, col, …): col = name of new united column, … = columns to unite, sep = '-'
Clean
lubridate: ymd, ymd_hms etc.
text: tolower(), toupper()
NA values: complete.cases(df) – find rows without NA, na.omit(df) – only select rows without NA
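A minimal base R sketch of the first-look and cleaning steps above, on a made-up data frame (the columns id, name and score are illustrative only).

```r
# Hypothetical messy data
df <- data.frame(
  id    = 1:5,
  name  = c("Ann", "BOB", "cy", "Dee", "eve"),
  score = c(10, NA, 7, 8, NA)
)

# Quick first look
str(df)
summary(df)

# Clean text: make the names lower case
df$name <- tolower(df$name)

# NA handling: a logical flag per row, and a filtered copy
keep  <- complete.cases(df)   # TRUE for rows without any NA
clean <- na.omit(df)          # only the rows without NA

print(keep)
print(clean)
```

complete.cases() is handy when you want to inspect or count the incomplete rows first; na.omit() goes straight to the filtered data frame.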