## Posts by kjytay

#### Non-negative least squares

Feed: R-bloggers. Author: kjytay. Imagine that one has a data matrix $X \in \mathbb{R}^{n \times p}$ consisting of $n$ observations, each with $p$ features, as well as a response vector $y \in \mathbb{R}^n$. We want to build a model for $y$ using the feature columns in $X$. In ordinary least squares (OLS), one seeks a vector of coefficients $\hat{\beta} \in \mathbb{R}^p$ such that $\hat{\beta} = \operatorname{argmin}_{\beta} \|y - X\beta\|_2^2$. In non-negative least squares (NNLS), we seek a vector of coefficients $\hat{\beta}$ that minimizes $\|y - X\beta\|_2^2$ subject to the additional requirement that each element of $\hat{\beta}$ is non-negative. There are a number of ways to perform NNLS in R. The first two methods come from Reference 1, while I came up with the ...
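To make the NNLS problem in this excerpt concrete, here is a minimal base-R sketch that minimizes the residual sum of squares under a lower bound of zero using `optim()` with `method = "L-BFGS-B"`. This is an illustration only and not necessarily one of the methods the full post describes; the simulated data and the names `beta_true`, `rss`, `rss_grad` and `beta_nnls` are assumptions made for the example.

```r
# Minimal NNLS sketch in base R (illustrative; simulated data, hypothetical names)
set.seed(1)
n <- 100
p <- 3
X <- matrix(rnorm(n * p), nrow = n)
beta_true <- c(2, 0, 1)  # hypothetical non-negative coefficients
y <- as.numeric(X %*% beta_true + rnorm(n, sd = 0.1))

# Residual sum of squares ||y - X b||^2 and its gradient -2 X'(y - X b)
rss <- function(b) sum((y - X %*% b)^2)
rss_grad <- function(b) as.numeric(-2 * crossprod(X, y - X %*% b))

# Box-constrained minimization: every coefficient is bounded below by 0
fit <- optim(par = rep(0, p), fn = rss, gr = rss_grad,
             method = "L-BFGS-B", lower = 0)
beta_nnls <- fit$par
print(round(beta_nnls, 3))
```

A general-purpose optimizer is used here only because it needs no extra packages; a dedicated NNLS solver would typically be the better choice for real problems.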

#### An unofficial vignette for the gamsel package

I’ve been working on a project/package that closely mirrors that of GAMSEL (generalized additive model selection), a method for fitting sparse generalized additive models (GAMs). In preparing my package, I realized that (i) the gamsel package which implements GAMSEL doesn’t have a vignette, and (ii) I could modify the vignette for my package minimally to create one for gamsel. So here it is! For a markdown version of the vignette, go here. Unfortunately LaTeX doesn’t play well on Github… The Rmarkdown file is available here: you can download it and knit it on your own machine. Introduction ...

#### The hidden diagnostic plots for the lm object

When plotting an lm object in R, one typically sees a 2 by 2 panel of diagnostic plots, much like the one below: set.seed(1) x This link has an excellent explanation of each of these 4 plots, and I highly recommend giving it a read. Most R users are familiar with these 4 plots. But did you know that the plot() function for lm objects can actually give you 6 plots? It says so right in the documentation: We can specify which of the 6 plots we want when calling this function using the which option ...
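As a small sketch of the `which` option mentioned in this excerpt, the code below fits a linear model and requests all six diagnostic plots in one 2-by-3 panel. The simulated data and the names `x`, `y`, `fit` and `old_par` are assumptions for illustration and are not the post's own code.

```r
# Sketch: request all six lm diagnostic plots (simulated, hypothetical data)
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

old_par <- par(mfrow = c(2, 3))  # 2-by-3 panel so all six plots fit
plot(fit, which = 1:6)           # the default is which = c(1, 2, 3, 5)
par(old_par)                     # restore the previous layout
```

Passing a single number, for example `which = 4`, draws only that plot (here, Cook's distance).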

#### Use mfcol to have plots drawn by column

To plot multiple figures on a single canvas in base R, we can change the graphical parameter mfrow. For instance, the code below tells R that subsequent figures will be drawn in a 2-by-3 array: par(mfrow = c(2, 3)) If we then run this next block of code, we will get the image below: set.seed(10) n Notice how the plots are filled in by row? That is, the first plot goes in the top-left corner, the next plot goes to its right, and so on. What if we want the plots to be filled in by ...
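Following the post's title, a minimal sketch of the column-wise counterpart is to set `mfcol` instead of `mfrow`. The data and plot titles below are made up for illustration.

```r
# Sketch: fill a 2-by-3 grid by column using mfcol (hypothetical data)
set.seed(10)
old_par <- par(mfcol = c(2, 3))
for (i in 1:6) {
  # Plot 1 lands top-left, plot 2 directly below it, plot 3 top-middle, ...
  plot(rnorm(20), rnorm(20), main = paste("Plot", i))
}
par(old_par)  # restore the previous layout
```

The grid dimensions are the same as with `mfrow`; only the order in which the cells are filled changes.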

#### Lesser known dplyr functions

The dplyr package is an essential tool for manipulating data in R. The “Introduction to dplyr” vignette gives a good overview of the common dplyr functions (list taken from the vignette itself):

- filter() to select cases based on their values.
- arrange() to reorder the cases.
- select() and rename() to select variables based on their names.
- mutate() and transmute() to add new variables that are functions of existing variables.
- summarise() to condense multiple values to a single value.
- sample_n() and sample_frac() to take random samples.

The “Two-table verbs” vignette gives a good introduction to using dplyr functions for joining two tables together. The “Window functions” vignette talks about, well, ...

#### Mixing up R markdown shortcut keys in RStudio, or how to unfold all chunks

When using R markdown in RStudio, I like to insert a new chunk using the shortcut Cmd+Option+I. Unfortunately I often press a key instead of “I” and end up folding all the chunks, getting something like this: It often takes me a while (on Google) to figure out what I did and how to undo it. With this note to remind me, no longer!! The shortcut I accidentally used was Cmd+Option+O, which folds up all chunks. To unfold all chunks, use Cmd+Shift+Option+O. The full list of RStudio keyboard shortcuts can be found here.

#### Visualizing the relationship between multiple variables

Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical. For all the code in this post in one file, click here. The GGally::ggpairs() function does a really good job of visualizing the pairwise relationship for a group of variables. Let’s demonstrate this on a small segment of the vehicles dataset from the fueleconomy package: library(fueleconomy) data(vehicles) df Let’s see how GGally::ggpairs() visualizes relationships ...

#### Changing the variable inside an R formula

I recently encountered a situation where I wanted to run several linear models, but where the response variables would depend on previous steps in the data analysis pipeline. Let me illustrate using the mtcars dataset: data(mtcars) head(mtcars) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 #> Hornet 4 Drive ...
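One way to handle a response variable that is only known at run time is to build the formula from a character string with base R's `reformulate()`. This is a sketch under assumptions and may differ from the approach the full post takes; the choice of `"mpg"` as the response and of `wt` and `hp` as predictors is hypothetical.

```r
# Sketch: build a model formula from a response name stored in a variable
data(mtcars)
response <- "mpg"  # hypothetical: would come from an earlier pipeline step

# reformulate() assembles response ~ wt + hp from character strings
fmla <- reformulate(c("wt", "hp"), response = response)
print(fmla)

fit <- lm(fmla, data = mtcars)
print(coef(fit))
```

An equivalent construction is `as.formula(paste(response, "~ wt + hp"))`; `reformulate()` simply avoids pasting the formula text by hand.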

#### Be careful of NA/NaN/Inf values when using base R’s plotting functions!

I was recently working on a supervised learning problem (i.e. building a model using some features to predict some response variable) with a fairly large dataset. I used base R’s plot and hist functions for exploratory data analysis and all looked well. However, when I started building my models, I began to run into errors. For example, when trying to fit the lasso using the glmnet package, I encountered this error: I thought this error message was rather cryptic. However, after some debugging, I realized the error was exactly what it said it was: there were ...
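The pitfall in this excerpt can be reproduced with a small sketch: base R's `plot()` and `hist()` draw only the finite values without complaint, so an explicit `is.finite()` check is a useful guard before modeling. The vector `x` below is made-up data, and the names `n_bad` and `bad_idx` are assumptions for the example.

```r
# Sketch: non-finite values pass through base plotting unnoticed (made-up data)
x <- c(1, 2, NA, 4, NaN, Inf, 7)
y <- seq_along(x)

plot(x, y)  # only the finite points are drawn; no warning or error
hist(x)     # non-finite values are dropped before binning

# Explicit check: is.finite() is FALSE for NA, NaN, Inf and -Inf
n_bad <- sum(!is.finite(x))
bad_idx <- which(!is.finite(x))
print(n_bad)
print(bad_idx)
```

Note that `anyNA()` alone would miss the `Inf` entry, which is why `is.finite()` is used here.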

#### Looking at flood insurance claims with choroplethr

I recently learned how to use the choroplethr package through a short tutorial by the package author Ari Lamstein (YouTube link here). To cement what I learned, I thought I would use this package to visualize flood insurance claims. I am using the FIMA NFIP redacted claims dataset from FEMA, and it contains more than 2 million claims transactions going all the way back to 1970. (The dataset can be accessed from this URL.) From what I can tell, this dataset is updated somewhat regularly. For this post, the version of the dataset used was published ...