## Posts by kjytay


#### Lesser known dplyr functions

Feed: R-bloggers. Author: kjytay. The dplyr package is an essential tool for manipulating data in R. The “Introduction to dplyr” vignette gives a good overview of the common dplyr functions (list taken from the vignette itself):

- filter() to select cases based on their values.
- arrange() to reorder the cases.
- select() and rename() to select variables based on their names.
- mutate() and transmute() to add new variables that are functions of existing variables.
- summarise() to condense multiple values to a single value.
- sample_n() and sample_frac() to take random samples.

The “Two-table verbs” vignette gives a good introduction to using dplyr functions for joining two tables together. The “Window functions” vignette talks about, well, ... Read More

#### Mixing up R markdown shortcut keys in RStudio, or how to unfold all chunks

Feed: R-bloggers. Author: kjytay. When using R markdown in RStudio, I like to insert a new chunk using the shortcut Cmd+Option+I. Unfortunately, I often press a different key instead of “I” and end up folding all the chunks. It often takes me a while (on Google) to figure out what I did and how to undo it. With this note to remind me, no longer! The shortcut I accidentally used was Cmd+Option+O, which folds up all chunks. To unfold all chunks, use Cmd+Shift+Option+O. The full list of RStudio keyboard shortcuts can be found here. ... Read More

#### Visualizing the relationship between multiple variables

Feed: R-bloggers. Author: kjytay. Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical. For all the code in this post in one file, click here. The GGally::ggpairs() function does a really good job of visualizing the pairwise relationship for a group of variables. Let’s demonstrate this on a small segment of the vehicles dataset from the fueleconomy package: library(fueleconomy); data(vehicles); df ... Let’s see how GGally::ggpairs() visualizes relationships ... Read More

#### Changing the variable inside an R formula

Feed: R-bloggers. Author: kjytay. I recently encountered a situation where I wanted to run several linear models, but where the response variables would depend on previous steps in the data analysis pipeline. Let me illustrate using the mtcars dataset, loaded with data(mtcars) and previewed with head(mtcars):

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |

Hornet 4 Drive ... Read More
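The excerpt above stops before showing how the formula is changed, so the following is only a minimal sketch of one common base R approach, not necessarily the one used in the post. It assumes the response names arrive as character strings; the vector `responses` and the choice of predictors `wt` and `cyl` are hypothetical and picked purely for illustration.

```r
# Hypothetical response variables, e.g. chosen by an earlier pipeline step
responses <- c("mpg", "hp", "qsec")

# Fit one linear model per response, always with wt and cyl as predictors
fits <- lapply(responses, function(resp) {
  # reformulate() builds the formula `resp ~ wt + cyl` from character strings
  fmla <- reformulate(termlabels = c("wt", "cyl"), response = resp)
  lm(fmla, data = mtcars)
})
names(fits) <- responses

# Check which formula each model was fitted with
lapply(fits, formula)
```

Building the formula with reformulate() avoids pasting strings together by hand, although as.formula(paste(resp, "~ wt + cyl")) would be an equally reasonable choice.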

#### Be careful of NA/NaN/Inf values when using base R’s plotting functions!

Feed: R-bloggers. Author: kjytay. I was recently working on a supervised learning problem (i.e. building a model using some features to predict some response variable) with a fairly large dataset. I used base R’s plot and hist functions for exploratory data analysis and all looked well. However, when I started building my models, I began to run into errors. For example, when trying to fit the lasso using the glmnet package, I encountered an error whose message I thought was rather cryptic. However, after some debugging, I realized the error was exactly what it said it was: there were ... Read More

#### Looking at flood insurance claims with choroplethr

Feed: R-bloggers. Author: kjytay. I recently learned how to use the choroplethr package through a short tutorial by the package author Ari Lamstein (YouTube link here). To cement what I learned, I thought I would use this package to visualize flood insurance claims. I am using the FIMA NFIP redacted claims dataset from FEMA, which contains more than 2 million claims transactions going all the way back to 1970. (The dataset can be accessed from this URL.) From what I can tell, this dataset is updated somewhat regularly. For this post, the version of the dataset used was published ... Read More

#### Sampling paths from a Gaussian process

Feed: R-bloggers. Author: kjytay. Gaussian processes are a widely employed statistical tool because of their flexibility and computational tractability. (For instance, one recent area where Gaussian processes are used is in machine learning for hyperparameter optimization.) A stochastic process $\{X_t\}_{t \in T}$ is a Gaussian process if (and only if) any finite subcollection of random variables $(X_{t_1}, \dots, X_{t_n})$ has a multivariate Gaussian distribution. Here, $T$ is the index set for the Gaussian process; most often we have $T = \mathbb{R}$ (to index time) or $T = \mathbb{R}^d$ (to index space). The stochastic nature of Gaussian processes also allows them to be thought of as a distribution over functions. One draw from a ... Read More
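Because any finite subcollection is multivariate Gaussian, a path can be approximated by sampling the process on a finite grid. The sketch below is one standard way to do this in base R and is not taken from the post itself: the zero mean, the squared-exponential kernel, the length-scale, the grid, and the names `sq_exp_kernel`, `t_grid` and `paths` are all assumptions made for illustration.

```r
set.seed(1)

# Assumed covariance function: squared-exponential kernel with length-scale l
sq_exp_kernel <- function(s, t, l = 1) exp(-(s - t)^2 / (2 * l^2))

# Finite grid of index points at which the paths are evaluated
t_grid <- seq(0, 5, length.out = 100)

# Covariance matrix on the grid, plus a small jitter for numerical stability
K <- outer(t_grid, t_grid, sq_exp_kernel) + 1e-6 * diag(length(t_grid))

# chol() returns the upper-triangular factor U with t(U) %*% U equal to K
U <- chol(K)

# If Z has i.i.d. N(0, 1) entries, each column of t(U) %*% Z is N(0, K)
n_paths <- 3
Z <- matrix(rnorm(length(t_grid) * n_paths), nrow = length(t_grid))
paths <- t(U) %*% Z

# Each column of `paths` is one sampled path evaluated on t_grid
matplot(t_grid, paths, type = "l", lty = 1, xlab = "t", ylab = "X_t")
```

A different kernel or a non-zero mean function only changes how `K` is built and what is added to `paths`; the Cholesky step stays the same.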

#### Probability of winning a best-of-7 series (part 2)

Feed: R-bloggers. Author: kjytay. In this previous post, I explored the probability that a team wins a best-of-n series, given that its win probability for any one game is some constant $p$. As one commenter pointed out, most sports models consider the home team to have an advantage, and this home advantage should affect the probability of winning a series. In this post, I will explore this question, limiting myself (most of the time) to the case of $n = 7$. Let $p_H$ denote the team's probability of winning a home game and $p_A$ its probability for winning away. In general we will have $p_H > p_A$, although that is not always the case. (For example, in the NBA ... Read More

#### Probability of winning a best-of-7 series

Feed: R-bloggers. Author: kjytay. The NBA playoffs are in full swing! A total of 16 teams are competing in a playoff-format competition, with the winner of each best-of-7 series moving on to the next round. In each matchup, two teams play 7 basketball games against each other, and the team that wins more games progresses. Of course, we often don’t have to play all 7 games: we can determine the overall winner once either team reaches 4 wins. During one of the games, a commentator made a remark along the lines of “you have no idea how hard it is ... Read More
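As a rough illustration of the calculation this post is about, the sketch below assumes that games are independent and that the team wins each game with the same probability `p`. Under that assumption, winning the series is equivalent to winning at least 4 of 7 games if all 7 were played, which is a binomial tail probability. The function name `series_win_prob` is hypothetical.

```r
# Probability of winning a best-of-n series (n odd), assuming independent
# games with a constant single-game win probability p
series_win_prob <- function(p, n = 7) {
  stopifnot(n %% 2 == 1, p >= 0, p <= 1)
  wins_needed <- (n + 1) / 2
  # P(at least wins_needed wins out of n) = P(Binomial(n, p) > wins_needed - 1)
  pbinom(wins_needed - 1, size = n, prob = p, lower.tail = FALSE)
}

# Evenly matched teams: the series is a coin flip as well
series_win_prob(0.5)

# A modest single-game edge is amplified over a series
series_win_prob(c(0.55, 0.6, 0.7))
```

Playing out all n games, even after the series is decided, does not change who wins it, which is why the simpler "at least 4 of 7" calculation gives the same answer as tracking when the series actually ends.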

#### Two interesting facts about high-dimensional random projections

Feed: R-bloggers. Author: kjytay. John Cook recently wrote an interesting blog post on random vectors and random projections. In the post, he states two surprising facts of high-dimensional geometry and gives some intuition for the second fact. In this post, I will provide R code to demonstrate both of them. Fact 1: Two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal. Cook does not discuss this fact as it is “well known”. Let me demonstrate it empirically. Below, the first function generates a unit vector of a given dimension uniformly at random. The second function takes in ... Read More
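The post's own functions are cut off in the excerpt above, so what follows is only a sketch of how Fact 1 could be checked empirically. It relies on the standard fact that normalizing a vector of i.i.d. standard normal draws gives a point uniformly distributed on the unit sphere. The function names `random_unit_vector` and `cosine_sample`, and the dimensions tried, are assumptions.

```r
set.seed(42)

# Uniform random unit vector in n dimensions: normalize i.i.d. N(0, 1) draws
random_unit_vector <- function(n) {
  x <- rnorm(n)
  x / sqrt(sum(x^2))
}

# Cosines of the angles between n_pairs independent pairs of unit vectors
# (for unit vectors, the dot product is the cosine of the angle)
cosine_sample <- function(n, n_pairs = 500) {
  replicate(n_pairs, sum(random_unit_vector(n) * random_unit_vector(n)))
}

# As the dimension grows, the cosines concentrate around 0 (near-orthogonality)
for (n in c(3, 100, 10000)) {
  cosines <- cosine_sample(n)
  cat(sprintf("n = %5d: mean |cos| = %.4f\n", n, mean(abs(cosines))))
}
```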
