As a quick follow-up to last week’s mapping exercise (where we mapped the largest European cities), I want to map the largest cities in Asia.
When we did this last week, we used a variety of tools from the Tidyverse to scrape and wrangle the data, and we ultimately mapped the data using base
In this blog post, we’re going to scrape and wrangle the data in a very similar way, but we will visualize with a combination of ggmap and ggplot2.
Let’s jump in.
First, we’ll load the packages that we’re going to use.
#==============
# LOAD PACKAGES
#==============

library(rvest)
library(tidyverse)
library(ggmap)
library(stringr)
Next, we’ll scrape the data using the rvest package.
Explaining exactly how
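Essentially, we are using html_nodes() to select the tables on the page and html_table() to convert the one we want into a data frame.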
The brilliant thing about
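#===========================
# SCRAPE DATA FROM WIKIPEDIA
#===========================

html.population <- read_html("https://en.wikipedia.org/wiki/List_of_Asian_cities_by_population_within_city_limits")

df.asia_cities <- html.population %>%
  html_nodes("table") %>%
  .[] %>%
  html_table(fill = TRUE)

# inspect
df.asia_cities %>% head()
df.asia_cities %>% names()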
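To see the scraping pattern in isolation, here is a minimal sketch that runs the same steps on a small inline HTML table instead of the live Wikipedia page. The table contents and the names `html.example` and `df.example` are hypothetical, and the index `1` is only correct for this toy page.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: one small HTML table
html.example <- read_html(
  "<html><body>
     <table>
       <tr><th>City</th><th>Country</th></tr>
       <tr><td>Tokyo</td><td>Japan</td></tr>
       <tr><td>Seoul</td><td>South Korea</td></tr>
     </table>
   </body></html>"
)

# Select every <table> node, keep the first one, and convert it to a data frame
df.example <- html.example %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

# inspect
df.example %>% names()
df.example %>% head()
```

On a real page there are usually several tables, so you would inspect the result of html_nodes("table") to find the position of the one you want.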
After executing the web scraping code and inspecting the resulting data, we are going to begin some data wrangling.
First, we’re just going to remove some variables.
When we scraped the data, there were several columns that we don’t care about, like
There are several ways to remove these, but a quick way is to use the select() function from dplyr.
Syntactically, this is really easy. In fact, one of the reasons that I strongly recommend using tools from the Tidyverse (like
#============================
# REMOVE EXTRANEOUS VARIABLES
#============================

#df.asia_cities % names()
df.asia_cities %>% head()
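As a rough illustration of dropping columns, here is a small sketch that uses dplyr's select() with a minus sign on a toy data frame. The column names `rank` and `image` are hypothetical stand-ins for unwanted columns, not the actual names in the scraped table.

```r
library(dplyr)

# Toy data: "rank" and "image" are hypothetical columns we don't care about
df.toy <- tibble(
  rank    = 1:2,
  city    = c("Tokyo", "Seoul"),
  image   = c(NA, NA),
  country = c("Japan", "South Korea")
)

# A minus sign in front of a column name tells select() to drop that column
df.toy <- df.toy %>% select(-rank, -image)

# inspect
df.toy %>% names()
```

Only `city` and `country` remain after this step.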
Now we have only three variables in the data, but we need to clean the variable names up a little.
Ideally, you want simple variable names. You also typically want your variable names to start with lower case letters (they are easier to type that way).
Here, we will manually provide new column names. We are using the colnames() function.
#===============
# RENAME COLUMNS
#===============

colnames(df.asia_cities) <- c("city", "country", "population")

# inspect
df.asia_cities %>% colnames()
df.asia_cities %>% head()

#-------------------------------------------------------------------
# REMOVE EXTRA ROW AT TOP
# - when we scraped the data, part of the column name
#   for the "Population, City proper" column
#   was parsed as a row of data, instead of part of the column name
# - here, we're just removing that extraneous row
#-------------------------------------------------------------------

df.asia_cities <- df.asia_cities[-1, ]

# inspect
df.asia_cities %>% head()
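Here is a small sketch of both steps on toy data, assuming a scraped table whose first row is a leftover piece of the header. The data frame `df.toy` and its values are made up for illustration.

```r
# Toy data: the first row is header text that was parsed as data
df.toy <- data.frame(
  City       = c("", "Tokyo", "Seoul"),
  Population = c("City proper", "1,000,000", "2,000,000"),
  stringsAsFactors = FALSE
)

# Assign simple, lowercase column names
colnames(df.toy) <- c("city", "population")

# Drop the extraneous first row with a negative row index
df.toy <- df.toy[-1, ]

# inspect
head(df.toy)
```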
Now that the names are cleaned up, we will do a little cleaning on the data values.
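On the Wikipedia table that originally contained the data, there were some footnotes associated with the population numbers. The footnotes were marked by bracketed numbers (e.g., [1]).
We need to remove those footnote markers. To do this, we will use str_replace_all() from stringr inside of dplyr's mutate().
We will use a regular expression inside of str_replace_all() to match each bracketed marker and replace it with an empty string. After that, parse_number() converts the cleaned text into a number.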
#==========================================================================
# REMOVE "notes" FROM POPULATION NUMBERS
# - two cities had extra characters for footnotes
#   ... we will remove these using stringr::str_replace_all() and dplyr::mutate()
#==========================================================================

df.asia_cities <- df.asia_cities %>%
  mutate(population = str_replace_all(population, "\\[.*\\]", "") %>% parse_number())

# inspect
df.asia_cities %>% head()
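To make the regular expression concrete, here is a minimal sketch on a few made-up population strings. The values in `population.raw` are hypothetical; only the pattern matches the one used above.

```r
library(stringr)
library(readr)

# Made-up values: two of them carry a bracketed footnote marker
population.raw <- c("13,513,734[10]", "24,150,000", "9,838,892[note 2]")

# "\\[.*\\]" matches a literal "[", any characters, and a literal "]".
# In an R string each backslash has to be doubled.
population.text <- str_replace_all(population.raw, "\\[.*\\]", "")

# parse_number() drops the thousands separators and returns a numeric vector
population.clean <- parse_number(population.text)

population.clean
```

Note that `.*` is greedy, so it removes everything from the first "[" to the last "]" in a string. That is fine when each value has a single marker at the end, as in this sketch.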
Now that we’ve removed the footnote characters, we need to create a new variable.
We will create a variable that has both the city and country in the following format: “Tokyo, Japan”.
You might be wondering why we are creating this new variable; the data already has a ‘city’ column as well as a ‘country’ column. Why do we want to duplicate this by creating a combined city/country column?
You’ll see why in a moment, but essentially, we will need this new column when we “geocode” our data to get the latitude and longitude coordinates of every city (we will need the lat/long when we plot the data).
The problem is that if we geocode based on only the city name (without the country), the geocoding process can encounter some errors due to ambiguity. (For example, would the city name “Naples” refer to Naples, Florida or Naples, Italy?)
To make sure that we don’t have any problems, we will create a variable that contains both city and country information.
#==============================================================
# CREATE VARIABLE: "city_full_name"
# - we need to have a combined name of the form 'City, Country'
# - we need this because when we use the geocode() function to
#   get long/lat data, there is some ambiguity in the city names
#==============================================================

df.asia_cities <- df.asia_cities %>%
  mutate(city_full_name = str_c(city, country, sep = ', '))

# inspect
df.asia_cities %>% head()
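Here is the same idea as a tiny standalone sketch. The toy data frame `df.toy` is hypothetical and reuses the "Naples" example from above.

```r
library(dplyr)
library(stringr)

# Toy data with an ambiguous city name
df.toy <- tibble(
  city    = c("Tokyo", "Naples"),
  country = c("Japan", "Italy")
)

# str_c() pastes the two columns together row by row, separated by ", "
df.toy <- df.toy %>%
  mutate(city_full_name = str_c(city, country, sep = ", "))

# inspect
df.toy %>% head()
```

The combined value "Naples, Italy" no longer leaves room for confusion with Naples, Florida.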
Before we move on, we’ll quickly reorder the variables.
This is a quick and simple use of select().
#==================
# REORDER VARIABLES
#==================

df.asia_cities <- df.asia_cities %>%
  select(city, country, city_full_name, population)

# inspect
df.asia_cities %>% head()

#========================================
# COERCE TO TIBBLE
# - this just makes the data print better
#========================================

df.asia_cities <- df.asia_cities %>% as_tibble()
Now we’re going to obtain the longitude and latitude data by using the geocode() function from ggmap.
After obtaining the geo-data, we’ll join it to the original data.
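#========================================================
# GEOCODE
# - here, we're just getting longitude and latitude data
#   using ggmap::geocode()
#========================================================

data.geo <- geocode(df.asia_cities$city_full_name)

# join the coordinates to the city data
df.asia_cities <- cbind(df.asia_cities, data.geo)

# inspect
df.asia_cities %>% head()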
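Because geocoding needs a network connection, here is an offline sketch of just the join step. The data frame `data.geo.toy` is a hypothetical stand-in for geocode() output, with approximate, hand-typed coordinates, and bind_cols() is one possible way to do the join, assuming the rows are in the same order in both tables.

```r
library(dplyr)

# Toy city data
df.toy <- tibble(city_full_name = c("Tokyo, Japan", "Seoul, South Korea"))

# Hypothetical stand-in for the geocode() result (approximate coordinates)
data.geo.toy <- tibble(
  lon = c(139.69, 126.98),
  lat = c(35.69, 37.57)
)

# Column-bind the coordinates onto the city data, matching rows by position
df.toy <- bind_cols(df.toy, data.geo.toy)

# inspect
df.toy %>% head()
```

A positional join like this only works if the geocoded rows come back in the same order as the input, so it is worth inspecting the result before plotting.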
To map the data points that we’ve just gathered, we will need a map on which to plot them.
In recent posts, we have been using the
Here though, we’re going to do something different. We will use the
Here, we will retrieve a map from Stamen Maps. Stamen is a design firm in San Francisco that has created a set of maps that we can query by using
#=============
# GET ASIA MAP
#=============

map.asia % ggmap()
What’s great about this (and one of the reasons that I like the Tidyverse) is that the tools are largely interchangeable. It is extremely easy to use a map from
Ok. Now that we have all of the components (the cleaned dataset and the background watercolor-map of Asia) we can plot the data.
First, we will do a quick “first iteration” version just to check everything. This version is unformatted; we just want to plot the data to make sure that the points are aligned properly, and that our data is properly “cleaned.”
#============================================================================
# PLOT CITIES ON MAP
# - we are using the watercolor map of asia as the background (using ggmap())
# - we are using geom_point() to plot the city data as points
#   on top of the map
#============================================================================

# FIRST ITERATION
ggmap(map.asia) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", shape = 1)
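The two-layer point pattern can be tried without a background map. Below is a minimal sketch using plain ggplot2 on toy data. The data frame `df.toy` is hypothetical, with approximate coordinates and made-up population values.

```r
library(ggplot2)

# Toy data: approximate coordinates, made-up populations
df.toy <- data.frame(
  lon        = c(139.69, 126.98, 100.50),
  lat        = c(35.69, 37.57, 13.75),
  population = c(13500000, 9800000, 8300000)
)

# Layer 1: translucent filled points sized by population
# Layer 2: hollow circles (shape = 1) that outline the same points
p <- ggplot(df.toy, aes(x = lon, y = lat, size = population)) +
  geom_point(color = "red", alpha = .3) +
  geom_point(color = "red", shape = 1)

print(p)
```

Drawing the same points twice gives each bubble a faint fill and a crisp outline, which keeps overlapping cities readable.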
To be clear, the “first iteration” map that I’m showing you looks pretty good; there’s nothing wrong with the data, etc. However, when I initially ran this code, I found a few things amiss, and had to go back and make some adjustments to the previous data wrangling code. Keep that in mind. As you progress through a project, you may find things that are wrong with your data, and you need to iteratively go back and adjust your code until you get everything just right.
Now that we have a “first iteration” of our map that looks good, we’re going to polish everything.
This is where we will modify the theme elements of the plot, add a title and subtitle, and remove extraneous non-data elements (like the axis ticks).
#==================================================
# FINALIZED MAP
# - here I've added titles, modified theme elements
#   like the text, etc
#==================================================

ggmap(map.asia) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .1) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", shape = 1) +
  labs(x = NULL, y = NULL) +
  labs(size = 'Population (millions)') +
  labs(title = "Largest Cities in Asia",
       subtitle = "source: https://en.wikipedia.org/wiki/List_of_Asian_cities_by_population_within_city_limits") +
  scale_size_continuous(range = c(.6, 18),
                        labels = scales::comma_format(),
                        breaks = c(1500000, 10000000, 20000000)) +
  theme(text = element_text(color = "#4A4A4A", family = "Gill Sans")) +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(plot.title = element_text(size = 32)) +
  theme(plot.subtitle = element_text(size = 10)) +
  theme(legend.key = element_rect(fill = "white"))
And here is the finalized map:
Having said that, I want to stress that if you don’t already know those “most important functions” backwards and forwards, you should focus your time on memorizing those first. Your first goal is to memorize essential syntax so that you know it “backwards and forwards.” After you’ve mastered the basic syntax, small projects like this will help you “put the pieces together.”
Once you’ve mastered the basic toolkit, small projects like this are excellent practice.
To rapidly master data science, you need to master the essential tools.
You need to know what tools are important, which tools are not important, and how to practice.
Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.