In last week's post, I emphasized the importance of practicing R and the Tidyverse with small, simple problems, drilling them until you are competent.
In that post, I gave you a few very small scripts to practice (which I suggest that you memorize).
This week, I want to give you another small example. We're going to clean up the `iris` dataset.
More specifically, we're going to:
- Coerce the `iris` dataset from an old-school data frame into a tibble.
- Rename the variables, such that the characters are lower case, and such that "snake case" underscores are applied in place of periods.
Like last week, this is a very simple example. However (as I've mentioned in the past), this is the sort of small task that you'll need to be able to execute fluidly if you want to work on larger projects.
If you want to do large, complex analyses, it really pays to first master techniques on a small scale using much simpler datasets.
Ok, let’s dive in.
First, let’s take a look at the complete block of code.
```r
library(tidyverse)
library(stringr)

#------------------
# CONVERT TO TIBBLE
#------------------
# - the iris dataframe is an old-school dataframe
#   ... this means that by default, it prints out
#   large numbers of records.
# - By converting to a tibble, functions like head()
#   will print out a small number of records by default

df.iris <- as_tibble(iris)

#-----------------
# RENAME VARIABLES
#-----------------
colnames(df.iris) <- df.iris %>%
  colnames() %>%
  str_to_lower() %>%
  str_replace_all("\\.", "_")

# INSPECT
df.iris %>% head()
```
What have we done here? We’ve combined several discrete functions of the Tidyverse together in order to perform a small amount of data wrangling.
Specifically, we've turned the `iris` data frame into a tibble and converted its variable names to lower-case snake case.
This example is quite simple, but useful. This is the sort of small task that you’ll need to be able to do in the context of a large analysis.
To make this a little clearer, let’s break this down into its component parts.
In the section where we renamed the variables, we used only three core functions:
- `colnames()`
- `str_to_lower()`
- `str_replace_all()`
Each of these individual pieces is pretty straightforward.
We are using `colnames()` to retrieve the column names of the data frame.
Then, we pipe the output into the `str_to_lower()` function, which converts all of the characters to lower case.
Next, we use `str_replace_all()` to replace the periods with underscores.
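To see those two stringr functions in isolation, here's a quick sketch applied step by step to a toy vector (the vector `nms` is just an illustration, not part of the original script):

```r
library(stringr)

# A toy vector of names, similar to the iris column names
nms <- c("Sepal.Length", "Petal.Width")

# Step 1: convert all characters to lower case
lower_nms <- str_to_lower(nms)
# "sepal.length" "petal.width"

# Step 2: replace the periods with underscores
# (note the "\\." — the period must be escaped, because
# str_replace_all() treats its pattern as a regular expression,
# where an unescaped "." matches any character)
snake_nms <- str_replace_all(lower_nms, "\\.", "_")
# "sepal_length" "petal_width"
```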
Finally, using the assignment operator (at the upper left-hand side of the code), we assign the resulting transformed column names back to the tibble by using `colnames()` on the left-hand side of the assignment.
I will point out that we have used these functions in a "waterfall" pattern; we have combined them by using the pipe operator, `%>%`.
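If the "waterfall" idea isn't clear, here's a minimal sketch showing that the piped chain is just an easier-to-read version of nested function calls (again using a toy vector for illustration):

```r
library(magrittr)  # provides the %>% pipe operator
library(stringr)

nms <- c("Sepal.Length", "Petal.Width")

# "Waterfall" style: each function's output flows into the next
piped <- nms %>%
  str_to_lower() %>%
  str_replace_all("\\.", "_")

# Equivalent nested style, which must be read inside-out
nested <- str_replace_all(str_to_lower(nms), "\\.", "_")

identical(piped, nested)  # TRUE
```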
The functions that we just used are all critical for doing data science in R. With that in mind, this script is a good test of your skill: can you write code like this fluently, from memory?
That should be your goal.
To get there, you need to know how the individual functions work, which means studying them. But to be able to put them into practice, you need to drill them. So after you understand how they work, drill each individual function until you can write it from memory. Next, drill small scripts (like the one in this blog post). You ultimately want to be able to "put the pieces together" quickly and seamlessly in order to solve problems and get things done.
I’ve said it before: if you want a great data science job, you need to be one of the best. If you want to be one of the best, you need to master the toolkit. And to master the toolkit, you need to drill.
To rapidly master data science, you need to practice.
You need to know what to practice, and you need to know how to practice.
Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.
Sign up now for our email list, and you’ll receive regular tutorials and lessons. You’ll learn:
- What data science tools you should learn (and what not to learn)
- How to practice those tools
- How to put those tools together to execute analyses and machine learning projects
- … and more
If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.
SIGN UP NOW