The larger and more complex the business, the more metrics and dimensions it has, and one day you realize it is impossible to track them all by eye. Reducing the number of metrics and/or dimensions can prevent us from tracking all aspects of the business, while analyzing aggregated data (for example, without dimensions) can substantially smooth out or hide the anomalies. In such a situation, anomalies can either be missed entirely or detected with a significant time lag. Yet we want to react immediately: to learn about the event as soon as possible, identify its causes, and understand what to do about it. For this, we can use an Anomaly Detection system to identify abnormal values, collect the corresponding events centrally, and monitor a much larger number of metrics and dimensions than human capabilities allow.
In this article, by business metrics, we mean numerical indicators we regularly measure and use to track and assess the performance of a specific business process. There is a huge variety of business metrics in the industry, from conventional to unique ones; the latter are specifically developed for and used in one company, or even just by one of its teams. I want to note that usually a business metric has dimensions, which imply the possibility of drilling down into the structure of the metric. For instance, the number of sessions on a website can have dimensions such as the browser type, channel, country, or advertising campaign the sessions came from. The presence of a large number of dimensions per metric, on the one hand, enables a comprehensive, detailed analysis and, on the other, makes that analysis more complex.
Anomalies are abnormal values of business indicators. We cannot claim anomalies are something bad or good for business. Rather, we should see them as a signal that there have been some events that significantly influenced a business process and our goal is to determine the causes and potential consequences of such events and react immediately. Of course, from the business point of view, it is better to find such events than ignore them.
It is worth noting that such an Anomaly Detection system will also signal significant changes that were expected by the user, but not by the system — that is, events you initiated in order to influence the business and whose causes you are aware of. An example is running an irregular promo through an email campaign and expecting traffic to grow on a landing page from the same channel. Getting such a signal is also useful as confirmation that the promo works.
To date, a number of analytical tools have built-in systems for detecting anomalies. For example, Google Analytics has such a system.
However, in case you:
then, perhaps, you want to build something similar to the system I use.
Therefore, we will study four approaches for identifying anomalies in business metrics using the R language. I will also assume that we deal with unlabeled data, i.e. we do not know which historical values were anomalies.
For a practical example, I have extracted web data that looks like the following table:
In addition, I have added the aggregated values for each date and the metric “number of goals per session”. Then, we can visualize the metrics on the time axis separately with the following code:
library(tidyverse)
library(reshape2)
library(lubridate)
library(purrrlyr)
library(ggrepel)
# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)
# install.packages("IsolationForest", repos = "http://R-Forge.R-project.org")
library(IsolationForest)

# loading raw data
df <- read_csv('data.csv')

# summarizing metrics by dates
df_tot <- df %>%
  group_by(date) %>%
  summarise(sessions = sum(sessions),
            goals = sum(goals)) %>%
  ungroup() %>%
  mutate(channel = 'total')

# binding all together
df_all <- rbind(df, df_tot) %>%
  mutate(goals_per_session = ifelse(goals > 0, round(goals / sessions, 2), 0))

# visualizing metrics
ggplot(df_all, aes(x = date)) +
  theme_minimal() +
  facet_wrap(~ channel) +
  geom_line(aes(y = sessions), color = 'blue') +
  geom_line(aes(y = goals), color = 'red')

ggplot(df_all, aes(x = date, y = goals_per_session)) +
  theme_minimal() +
  facet_wrap(~ channel) +
  geom_line(color = 'darkgreen')
As you can see, we have data from August to November. At the end of September there was a significant spike in three of the five channels, and in the total as well. There are smaller spikes in Social and Direct.
Below are the different patterns in the “goals_per_session” metric:
Ok, let’s see what we can do with this example:
#1 – Model approach
The idea is the following: we create a model based on historical data, and the observations on which the model makes the largest errors are considered anomalies.
In practice, we measure business metrics on a regular basis, usually daily. This means they have a time series nature. Therefore, we can use a time series model and, if the predicted value differs significantly from the actual value, detect an anomaly. This approach is good for metrics with obvious seasonal fluctuations. In our example, these are the numbers of sessions and goals for the main channels. For the "goals_per_session" metric, this approach may not be as effective.
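Before turning to a ready-made package, the residual idea can be sketched with base R alone. The following is a hypothetical illustration on simulated data (not the article's dataset): fit a seasonal decomposition and flag the points whose remainders are extreme.

```r
set.seed(42)

# simulated daily sessions with weekly seasonality and two injected spikes
days <- 1:84
sessions <- 100 + 20 * sin(2 * pi * days / 7) + rnorm(84, sd = 5)
sessions[c(30, 60)] <- sessions[c(30, 60)] + 60  # injected anomalies

# decompose the series into seasonal, trend, and remainder components
ts_ses <- ts(sessions, frequency = 7)
fit <- stl(ts_ses, s.window = 'periodic')
resid <- fit$time.series[, 'remainder']

# flag observations whose remainder deviates by more than
# 3 robust standard deviations from the median remainder
threshold <- 3 * mad(resid)
anomalies <- which(abs(resid - median(resid)) > threshold)
print(anomalies)
```

The same principle — "model, then inspect residuals" — underlies the package used below, which additionally applies a statistical test to the residuals.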
There are a lot of packages for time series modeling in R but, considering our goal of finding anomalies, I recommend using one of the ready-made solutions, for instance, the AnomalyDetection package.
Let’s start with the simple example of analyzing Direct traffic:
##### time series modeling #####
# simple example
df_ts <- df_all %>%
  # the package works with POSIX date format
  mutate(date = as.POSIXct(date, origin = "1970-01-01", tz = "UTC"))

df_ts_ses <- df_ts %>%
  dcast(., date ~ channel, value.var = 'sessions')
df_ts_ses[is.na(df_ts_ses)] <- 0

# example with Direct channel
AnomalyDetectionTs(df_ts_ses[, c(1, 3)],
                   max_anoms = 0.05,
                   direction = 'both',
                   e_value = TRUE,
                   plot = TRUE) # 5% of anomalies

AnomalyDetectionTs(df_ts_ses[, c(1, 3)],
                   max_anoms = 0.1,
                   direction = 'both',
                   e_value = TRUE,
                   plot = TRUE) # 10% of anomalies
As you can see, we can change the number of anomalies by shifting a threshold of the percent of anomalies we are allowed to detect.
In order to scale this approach to all metrics, we can map the function to all of the dimensions of all of the metrics with the following code:
# scaled example
df_ts <- df_all %>%
  # removing some metrics
  select(-goals_per_session) %>%
  # the package works with POSIX date format
  mutate(date = as.POSIXct(date, origin = "1970-01-01", tz = "UTC")) %>%
  # melting data frame
  melt(., id.vars = c('date', 'channel'), variable.name = 'metric') %>%
  mutate(metric_dimension = paste(metric, channel, sep = '_')) %>%
  # casting
  dcast(., date ~ metric_dimension, value.var = 'value')
df_ts[is.na(df_ts)] <- 0

# anomaly detection algorithm as a function
df_ts_anom <- map2(df_ts %>% select(1),
                   df_ts %>% select(c(2:ncol(.))),
                   function(df1, df2) {
                     AnomalyDetectionTs(data.frame(df1, df2),
                                        max_anoms = 0.1,
                                        direction = 'both',
                                        e_value = TRUE)
                   })

# extracting results
df_ts_anom <- lapply(df_ts_anom, function(x) {
  data.frame(x[["anoms"]])
})

# adding metric names and binding all metrics into data frame
names(df_ts_anom) <- colnames(df_ts[, -1])
df_ts_anom <- do.call('rbind', df_ts_anom)
#2 – Statistical approach
From the statistical point of view, anomalies are extreme values or outliers. There are a number of ways, and corresponding functions in R, to identify such values. And, of course, the choice of criteria should be based on the properties of the sample.
I apply the statistical approach to indicators such as "goals_per_session" in our example. This indicator has a close-to-normal distribution for the more popular channels and, accordingly, one can use, for example, the interquartile range to determine the outliers.
ggplot(df_all, aes(x = goals_per_session)) +
  theme_minimal() +
  facet_wrap(~ channel) +
  geom_histogram(binwidth = 0.01)
Often in practice, I am less concerned with whether a value is a formal outlier than with whether it is too high or too low compared to other days. For such a case, you can simply use the lower and upper percentiles, for example, 5% (the 0-5% and 95-100% ranges). Both approaches can be implemented, and the results visualized, with the following code:
df_stat_anom <- df_all %>%
  # removing some metrics
  select(-sessions, -goals) %>%
  group_by(channel) %>%
  mutate(is_low_percentile = ifelse(goals_per_session <= quantile(goals_per_session, probs = 0.05), TRUE, FALSE),
         is_high_percentile = ifelse(goals_per_session >= quantile(goals_per_session, probs = 0.95), TRUE, FALSE),
         is_outlier = case_when(
           goals_per_session < quantile(goals_per_session, probs = 0.25) - 1.5 * IQR(goals_per_session) |
             goals_per_session > quantile(goals_per_session, probs = 0.75) + 1.5 * IQR(goals_per_session) ~ TRUE,
           TRUE ~ FALSE)
  ) %>%
  ungroup()

ggplot(df_stat_anom, aes(x = date, y = goals_per_session)) +
  theme_minimal() +
  facet_wrap(~ channel) +
  geom_line(color = 'darkblue') +
  geom_point(data = df_stat_anom[df_stat_anom$is_outlier == TRUE, ],
             color = 'red', size = 5, alpha = 0.5) +
  geom_point(data = df_stat_anom[df_stat_anom$is_low_percentile == TRUE, ],
             color = 'blue', size = 2) +
  geom_point(data = df_stat_anom[df_stat_anom$is_high_percentile == TRUE, ],
             color = 'darkred', size = 2)
I have marked the 5% high and low percentiles with small red and blue dots and the outliers with bigger light red dots. You can shift a threshold of the percentile and interquartile distance as well (by changing the coefficient that equals 1.5 now).
#3 – The metric approach determines how far one observation is from the others in the space. Obviously, typical observations lie close to each other, and anomalous ones lie farther away. For business metrics analysis, the Mahalanobis distance works well in this approach. The feature that distinguishes the Mahalanobis distance from the Euclidean one is that it takes into account the correlation between variables. From the anomaly detection point of view, this has the following effect: if, for example, the sessions from all channels grew twofold synchronously, this approach can yield the same anomaly estimate as the case in which the number of sessions increased 1.5 times in only one channel.
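The effect of accounting for correlation can be illustrated with a small simulated example (hypothetical data, not the article's dataset): two channels whose sessions usually move together. A synchronous twofold jump stays "in line" with the common direction of variation, while a jump in only one channel breaks the correlation and ends up farther away in the Mahalanobis sense, despite a smaller Euclidean shift.

```r
set.seed(1)

# two strongly correlated "channels": channel_b is roughly half of channel_a
base <- rnorm(100, mean = 1000, sd = 100)
x <- data.frame(channel_a = base + rnorm(100, sd = 20),
                channel_b = base * 0.5 + rnorm(100, sd = 10))

center <- colMeans(x)
covm <- cov(x)

# point 1: both channels grew twofold (follows the common direction)
p_sync <- c(center[1] * 2, center[2] * 2)
# point 2: only channel_a grew 1.5 times (breaks the correlation)
p_single <- c(center[1] * 1.5, center[2])

d_sync <- mahalanobis(p_sync, center, covm)     # squared Mahalanobis distance
d_single <- mahalanobis(p_single, center, covm)

# the single-channel change is farther from the bulk of the data,
# even though its Euclidean shift from the center is smaller
print(c(sync = d_sync, single = d_single))
```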
In this case, observation is, for example, a day that is described by different metrics and their dimensions. Accordingly, we can look at these statistics from different angles. For example, whether the day was anomalous in terms of the structure of a certain metric, for instance, if the structure of traffic by the channels was typical. On the other hand, one can see whether the day was abnormal in terms of certain metrics, for example, the number of sessions and goals.
In addition, we need to choose a threshold distance that will be used as the anomaly criterion. For this, we can use an exact value or combine it with the statistical approach, but this time applied to the distances. In the following example, I have analyzed the structure (dimensions) of sessions by dates and marked values that are in the 95-100% percentile range:
##### 3 Metric approach #####
# Mahalanobis distance function
maha_func <- function(x) {
  x <- x %>% select(-1)
  x <- x[, which(round(colMeans(x), 4) != 0 &
                   apply(x, MARGIN = 2, FUN = sd) != 0)]
  round(mahalanobis(x, colMeans(x), cov(x)), 2)
}

df_ses_maha <- df_all %>%
  # removing some metrics
  select(-goals, -goals_per_session) %>%
  # casting
  dcast(., date ~ channel, value.var = 'sessions') %>%
  # removing total values
  select(-total)
df_ses_maha[is.na(df_ses_maha)] <- 0

# adding Mahalanobis distances
df_ses_maha$m_dist <- maha_func(df_ses_maha)

df_ses_maha <- df_ses_maha %>%
  mutate(is_anomaly = ifelse(ecdf(m_dist)(m_dist) >= 0.95, TRUE, FALSE))

# visualization
df_maha_plot <- df_ses_maha %>%
  select(-m_dist, -is_anomaly) %>%
  melt(., id.vars = 'date')

df_maha_plot <- full_join(df_maha_plot, df_maha_plot, by = 'date') %>%
  left_join(., df_ses_maha %>% select(date, m_dist, is_anomaly), by = 'date')

# color palette
cols <- c("#4ab04a", "#eec73a", "#ffd73e", "#f05336", "#ce472e")
mv <- max(df_maha_plot$m_dist)

ggplot(df_maha_plot, aes(x = value.x, y = value.y, color = m_dist)) +
  theme_minimal() +
  facet_grid(variable.x ~ variable.y) +
  scale_color_gradientn(colors = cols,
                        limits = c(min(df_maha_plot$m_dist), max(df_maha_plot$m_dist)),
                        breaks = c(0, mv),
                        labels = c("0", mv),
                        guide = guide_colourbar(ticks = T, nbin = 50, barheight = .5, label = T, barwidth = 10)) +
  geom_point(aes(color = m_dist), size = 2, alpha = 0.4) +
  geom_text_repel(data = subset(df_maha_plot, is_anomaly == TRUE),
                  aes(label = as.character(date)),
                  fontface = 'bold', size = 2.5, alpha = 0.6,
                  nudge_x = 200, direction = 'y', hjust = 1, segment.size = 0.2,
                  max.iter = 10) +
  theme(legend.position = 'bottom',
        legend.direction = 'horizontal',
        panel.grid.major = element_blank())
As you can see on the chart, which is a set of intersections of all dimensions, the dates that were detected as anomalies are located farther in the space at most intersections.
#4 – Machine learning methods are also good ways of detecting anomalies but may require more attention to parameter tuning. In this article, I want to mention the Isolation Forest algorithm, a variation of Random Forest. Its idea is the following: the algorithm builds random trees until each object ends up in a separate leaf; if there are outliers in the data, they will be isolated in the early stages (at a low depth of the tree). Then, for each observation, we calculate the mean depth of the leaves it falls into and, based on this value, decide whether or not it is an anomaly.
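The isolation idea itself can be demonstrated with a toy one-dimensional sketch (illustration only, not the IsolationForest package used below): repeatedly split the sample at random cut points and count how many splits it takes to isolate a value. An outlier tends to be isolated in far fewer splits than a typical observation.

```r
set.seed(7)
x <- c(rnorm(100, mean = 50, sd = 5), 95)  # one obvious outlier at 95

# number of random splits needed to isolate a target value
isolation_depth <- function(x, target, max_depth = 50) {
  lo <- min(x); hi <- max(x)
  depth <- 0
  repeat {
    depth <- depth + 1
    cut <- runif(1, lo, hi)
    if (target < cut) { x <- x[x < cut]; hi <- cut } else { x <- x[x >= cut]; lo <- cut }
    if (length(x) <= 1 || depth >= max_depth) return(depth)
  }
}

# average depth over many random "trees"
avg_depth <- function(x, target, n = 200) mean(replicate(n, isolation_depth(x, target)))

d_outlier <- avg_depth(x, 95)
d_typical <- avg_depth(x, 50)

# the outlier is isolated at a much lower average depth
print(c(outlier = d_outlier, typical = d_typical))
```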
Again, as in the Metrics approach, we can estimate observations (dates in our example) in different ways and need to choose a threshold of an anomaly. I have used the same method as for the Metrics approach:
##### 4 Isolation Forest #####
df_ses_if <- df_all %>%
  # removing some metrics
  select(-goals, -goals_per_session) %>%
  # casting
  dcast(., date ~ channel, value.var = 'sessions') %>%
  # removing total values
  select(-total)
df_ses_if[is.na(df_ses_if)] <- 0

# creating trees
if_trees <- IsolationTrees(df_ses_if[, -1])

# evaluating anomaly score
if_anom_score <- AnomalyScore(df_ses_if[, -1], if_trees)

# adding anomaly score
df_ses_if$anom_score <- round(if_anom_score$outF, 4)

df_ses_if <- df_ses_if %>%
  mutate(is_anomaly = ifelse(ecdf(anom_score)(anom_score) >= 0.95, TRUE, FALSE))

# visualization
df_if_plot <- df_ses_if %>%
  select(-anom_score, -is_anomaly) %>%
  melt(., id.vars = 'date')

df_if_plot <- full_join(df_if_plot, df_if_plot, by = 'date') %>%
  left_join(., df_ses_if %>% select(date, anom_score, is_anomaly), by = 'date')

# color palette
cols <- c("#4ab04a", "#eec73a", "#ffd73e", "#f05336", "#ce472e")
mv <- max(df_if_plot$anom_score)

ggplot(df_if_plot, aes(x = value.x, y = value.y, color = anom_score)) +
  theme_minimal() +
  facet_grid(variable.x ~ variable.y) +
  scale_color_gradientn(colors = cols,
                        limits = c(min(df_if_plot$anom_score), max(df_if_plot$anom_score)),
                        breaks = c(0, mv),
                        labels = c("0", mv),
                        guide = guide_colourbar(ticks = T, nbin = 50, barheight = .5, label = T, barwidth = 10)) +
  geom_point(aes(color = anom_score), size = 2, alpha = 0.4) +
  geom_text_repel(data = subset(df_if_plot, is_anomaly == TRUE),
                  aes(label = as.character(date)),
                  fontface = 'bold', size = 2.5, alpha = 0.6,
                  nudge_x = 200, direction = 'y', hjust = 1, segment.size = 0.2,
                  max.iter = 10) +
  theme(legend.position = 'bottom',
        legend.direction = 'horizontal',
        panel.grid.major = element_blank())
The results differ slightly from those of the Metric approach but are largely the same.
In conclusion, I want to add a few thoughts that can be useful when you create your own system for detecting anomalies:
The detection of anomalies in business metrics helps the business “be alert” and thus respond in a timely manner to unexpected events. And the automatic Anomaly Detection system, in turn, allows you to significantly expand the range of the metrics and their dimensions and track many aspects of the business. Of course, there is a huge variety of approaches, methods, and algorithms for detecting anomalies, and thus this article is intended to familiarize you with some of them, but I hope this will help you take the first steps to detecting anomalies for your business.
This is the last post in the series of articles about using Multi-Channel Attribution in marketing. In the previous two articles (part 1 and part 2), we reviewed a simple and powerful approach based on Markov chains that allows you to effectively attribute marketing channels.
In this article, we will review another fascinating approach that marries heuristic and probabilistic methods. Again, the core idea is straightforward and effective.
For an e-commerce site, we can come up with one or more conditions (events/actions) that serve as evidence of passing each step of a Sales Funnel.
For some extra information about Sales Funnel, you can take a look at my (rather ugly) approach of Sales Funnel visualization with R.
Companies, naturally, lose some share of visitors on each following step of a Sales Funnel as it gets narrower. That’s why it looks like a string of bottlenecks. We can calculate a probability of transition from the previous step to the next one based on recorded history of transitions. On the other hand, customer journeys are sequences of sessions (visits) and these sessions are attributed to different marketing channels.
Therefore, we can link marketing channels with the probability of a customer passing through each step of the Sales Funnel. And here is the core idea of the concept: the probability of moving through each "bottleneck" represents the value of the marketing channel that leads a customer through it. The higher the probability of passing a "neck", the lower the value of the channel that provided the transition; and vice versa, the lower the probability, the higher the value of the marketing channel in question.
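The core calculation can be sketched with hypothetical numbers (the step names and counts below are made up for illustration): suppose 1000 visitors reach step 1, 200 of them pass to step 2, and 100 pass to step 3. The importance of each transition is 1 minus its probability, then the importances are weighted so that they sum to 1.

```r
step_counts <- c(site_visit = 1000, product_page = 200, purchase = 100)

# transition probability of each step relative to the previous one
# (the first step is a trivial transition with probability 1)
p_step <- step_counts / c(step_counts[1], head(step_counts, -1))

# importance: the rarer the transition, the more valuable the channel behind it
importance <- 1 - p_step
importance_weighted <- importance / sum(importance)

print(round(importance_weighted, 3))
```

Here passing to the product page (probability 0.2) gets a higher weighted importance than passing to purchase (probability 0.5), which is exactly the "narrower neck, higher value" logic described above.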
Let’s study the concept with the following example. First off, we’ll define the Sales Funnel and a set of conditions which will register as customer passing through each step of the Funnel.
Second, we need to extract the data that includes sessions where corresponding events occurred. We’ll simulate this data with the following code:
library(tidyverse)
library(purrrlyr)
library(reshape2)

##### simulating the "real" data #####
set.seed(454)
df_raw <- data.frame(customer_id = paste0('id', sample(c(1:5000), replace = TRUE)),
                     date = as.POSIXct(rbeta(10000, 0.7, 10) * 10000000, origin = '2017-01-01', tz = "UTC"),
                     channel = paste0('channel_', sample(c(0:7), 10000, replace = TRUE,
                                                         prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08))),
                     site_visit = 1) %>%
  mutate(two_pages_visit = sample(c(0, 1), 10000, replace = TRUE, prob = c(0.8, 0.2)),
         product_page_visit = ifelse(two_pages_visit == 1,
                                     sample(c(0, 1),
                                            length(two_pages_visit[which(two_pages_visit == 1)]),
                                            replace = TRUE, prob = c(0.75, 0.25)),
                                     0),
         add_to_cart = ifelse(product_page_visit == 1,
                              sample(c(0, 1),
                                     length(product_page_visit[which(product_page_visit == 1)]),
                                     replace = TRUE, prob = c(0.1, 0.9)),
                              0),
         purchase = ifelse(add_to_cart == 1,
                           sample(c(0, 1),
                                  length(add_to_cart[which(add_to_cart == 1)]),
                                  replace = TRUE, prob = c(0.02, 0.98)),
                           0)) %>%
  dmap_at(c('customer_id', 'channel'), as.character) %>%
  arrange(date) %>%
  mutate(session_id = row_number()) %>%
  arrange(customer_id, session_id)

df_raw <- melt(df_raw, id.vars = c('customer_id', 'date', 'channel', 'session_id'),
               value.name = 'trigger', variable.name = 'event') %>%
  filter(trigger == 1) %>%
  select(-trigger) %>%
  arrange(customer_id, date)
And the data sample looks like:
Next up, the data needs to be preprocessed. For example, it would be useful to replace the NA/direct channel with the previous one, to separate first-time purchasers from current customers, or even to create different Sales Funnels based on new and current customers, segments, locations, and so on. I will omit this step, but you can find some ideas on preprocessing in my previous blogpost.
The important thing about this approach is that we only have to attribute the initial marketing channel, one that led the customer through their first step. For instance, a customer initially reviews a product page (step 2, interest) and is brought by channel_1. That means any future product page visits from other channels won’t be attributed until the customer makes a purchase and starts a new Sales Funnel journey.
Therefore, we will filter records for each customer and save the first unique event of each step of the Sales Funnel using the following code:
### removing not first events ###
df_customers <- df_raw %>%
  group_by(customer_id, event) %>%
  filter(date == min(date)) %>%
  ungroup()
Note that, in this way, we assume that all customers were first-time buyers; therefore, every subsequent purchase event will be removed by the above code.
Now, we can use the obtained data frame to compute the Sales Funnel's transition probabilities, the importance of each Sales Funnel step, and their weighted importances. According to the method, the higher the probability, the lower the value of the channel. Therefore, we will calculate the importance of each step as 1 minus its transition probability. After that, we need to weight the importances because their sum will be greater than 1. We will do these calculations with the following code:
### Sales Funnel probabilities ###
sf_probs <- df_customers %>%
  group_by(event) %>%
  summarise(customers_on_step = n()) %>%
  ungroup() %>%
  mutate(sf_probs = round(customers_on_step / customers_on_step[event == 'site_visit'], 3),
         sf_probs_step = round(customers_on_step / lag(customers_on_step), 3),
         sf_probs_step = ifelse(is.na(sf_probs_step) == TRUE, 1, sf_probs_step),
         sf_importance = 1 - sf_probs_step,
         sf_importance_weighted = sf_importance / sum(sf_importance))
A hint: it can be a good idea to compute the Sales Funnel probabilities over a limited prior period, for example, 1-3 months. The reason is that the customers' flow or the "necks'" capacities could vary due to changes on a company's site or in marketing campaigns and so on. Therefore, you can analyze the dynamics of the Sales Funnel's transition probabilities in order to find the appropriate time period.
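One way to implement this hint is to filter the events to a recent window before computing the probabilities. A minimal self-contained sketch (with simulated dates, not the article's `df_customers`) restricting the data to the last 90 days relative to the most recent observation:

```r
library(dplyr)

# simulated event dates spread over roughly 180 days
set.seed(1)
df_demo <- data.frame(customer_id = paste0('id', 1:500),
                      date = as.POSIXct('2017-01-01', tz = 'UTC') + runif(500, 0, 180 * 86400))

# keep only the last 90 days; POSIXct arithmetic is in seconds, hence 90 * 86400
df_recent <- df_demo %>% filter(date >= max(date) - 90 * 86400)

# share of events retained in the window
nrow(df_recent) / nrow(df_demo)
```

Computing `sf_probs` on such a filtered data frame for several sliding windows lets you compare how the transition probabilities drift over time.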
I can’t publish a blogpost without visualization. This time I suggest another approach for the Sales Funnel visualization that represents all customer journeys through the Sales Funnel with the following code:
### Sales Funnel visualization ###
df_customers_plot <- df_customers %>%
  group_by(event) %>%
  arrange(channel) %>%
  mutate(pl = row_number()) %>%
  ungroup() %>%
  mutate(pl_new = case_when(
    event == 'two_pages_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'two_pages_visit'])) / 2),
    event == 'product_page_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'product_page_visit'])) / 2),
    event == 'add_to_cart' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'add_to_cart'])) / 2),
    event == 'purchase' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'purchase'])) / 2),
    TRUE ~ 0),
    pl = pl + pl_new)

df_customers_plot$event <- factor(df_customers_plot$event,
                                  levels = c('purchase',
                                             'add_to_cart',
                                             'product_page_visit',
                                             'two_pages_visit',
                                             'site_visit'))

# color palette
cols <- c('#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f',
          '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac')

ggplot(df_customers_plot, aes(x = event, y = pl)) +
  theme_minimal() +
  scale_colour_manual(values = cols) +
  coord_flip() +
  geom_line(aes(group = customer_id, color = as.factor(channel)), size = 0.05) +
  geom_text(data = sf_probs,
            aes(x = event, y = 1, label = paste0(sf_probs * 100, '%')),
            size = 4, fontface = 'bold') +
  guides(color = guide_legend(override.aes = list(size = 2))) +
  theme(legend.position = 'bottom',
        legend.direction = "horizontal",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
        axis.title.y = element_text(size = 16, face = "bold"),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain")) +
  ggtitle("Sales Funnel visualization - all customers journeys")
Ok, seems we now have everything to make final calculations. In the following code, we will remove all users that didn’t make a purchase. Then, we’ll link weighted importances of the Sales Funnel steps with sessions by event and, at last, summarize them.
### computing attribution ###
df_attrib <- df_customers %>%
  # removing customers without purchase
  group_by(customer_id) %>%
  filter(any(as.character(event) == 'purchase')) %>%
  ungroup() %>%
  # joining step's importances
  left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
  group_by(channel) %>%
  summarise(tot_attribution = sum(sf_importance_weighted)) %>%
  ungroup()
As a result, we've obtained the number of conversions distributed by marketing channels:
In the same way you can distribute the revenue by channels.
At the end of the article, I want to share OWOX company’s blog where you can read more about the approach: Funnel Based Attribution Model.
In addition, OWOX provides an automated system for Marketing Multi-Channel Attribution based on BigQuery. Therefore, if you are not familiar with R or don't have a suitable data warehouse, I recommend testing their service.
This is the second post about the Marketing Multi-Channel Attribution Model with Markov chains (here is the first one). Even though the concept of first-order Markov chains is pretty simple, you can face other issues and challenges when implementing the approach in practice. In this article, we will review some of them. I have organized the article so that you can use it as a framework for creating your own.
The main steps that we will review are the following:
As usual, we start by simulating a data sample for experiments that includes customer ids, the date stamp of contact with a marketing channel, the marketing channel, and a conversion mark (0/1).
library(tidyverse)
library(reshape2)
library(ChannelAttribution)
library(markovchain)
library(visNetwork)
library(expm)
library(stringr)
library(purrrlyr) # for dmap_at()

##### simulating the "real" data #####
set.seed(454)
df_raw <- data.frame(customer_id = paste0('id', sample(c(1:20000), replace = TRUE)),
                     date = as.Date(rbeta(80000, 0.7, 10) * 100, origin = "2016-01-01"),
                     channel = paste0('channel_', sample(c(0:7), 80000, replace = TRUE,
                                                         prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08)))) %>%
  group_by(customer_id) %>%
  mutate(conversion = sample(c(0, 1), n(), prob = c(0.975, 0.025), replace = TRUE)) %>%
  ungroup() %>%
  dmap_at(c(1, 3), as.character) %>%
  arrange(customer_id, date)

df_raw <- df_raw %>%
  mutate(channel = ifelse(channel == 'channel_2', NA, channel))
In addition, I’ve replaced channel_2 with NA values. The initial data sample looks like:
##### splitting paths #####
df_paths <- df_raw %>%
  group_by(customer_id) %>%
  mutate(path_no = ifelse(is.na(lag(cumsum(conversion))), 0, lag(cumsum(conversion))) + 1) %>%
  ungroup()
df_paths_1 <- df_paths %>%
  filter(path_no == 1) %>%
  select(-path_no)
##### replace some channels #####
df_path_1_clean <- df_paths_1 %>%
  # removing NAs
  filter(!is.na(channel)) %>%
  # adding order of channels in the path
  group_by(customer_id) %>%
  mutate(ord = c(1:n()),
         is_non_direct = ifelse(channel == 'channel_6', 0, 1),
         is_non_direct_cum = cumsum(is_non_direct)) %>%
  # removing Direct (channel_6) when it is the first in the path
  filter(is_non_direct_cum != 0) %>%
  # replacing Direct (channel_6) with the previous touch point
  mutate(channel = ifelse(channel == 'channel_6',
                          channel[which(channel != 'channel_6')][is_non_direct_cum],
                          channel)) %>%
  ungroup() %>%
  select(-ord, -is_non_direct, -is_non_direct_cum)
##### one- and multi-channel paths #####
df_path_1_clean <- df_path_1_clean %>%
  group_by(customer_id) %>%
  mutate(uniq_channel_tag = ifelse(length(unique(channel)) == 1, TRUE, FALSE)) %>%
  ungroup()

df_path_1_clean_uniq <- df_path_1_clean %>%
  filter(uniq_channel_tag == TRUE) %>%
  select(-uniq_channel_tag)

df_path_1_clean_multi <- df_path_1_clean %>%
  filter(uniq_channel_tag == FALSE) %>%
  select(-uniq_channel_tag)

### experiment ###
# attribution model for all paths
df_all_paths <- df_path_1_clean %>%
  group_by(customer_id) %>%
  summarise(path = paste(channel, collapse = ' > '),
            conversion = sum(conversion)) %>%
  ungroup() %>%
  filter(conversion == 1)

mod_attrib <- markov_model(df_all_paths,
                           var_path = 'path',
                           var_conv = 'conversion',
                           out_more = TRUE)
mod_attrib$removal_effects
mod_attrib$result
d_all <- data.frame(mod_attrib$result)

# attribution model for splitted multi and unique channel paths
df_multi_paths <- df_path_1_clean_multi %>%
  group_by(customer_id) %>%
  summarise(path = paste(channel, collapse = ' > '),
            conversion = sum(conversion)) %>%
  ungroup() %>%
  filter(conversion == 1)

mod_attrib_alt <- markov_model(df_multi_paths,
                               var_path = 'path',
                               var_conv = 'conversion',
                               out_more = TRUE)
mod_attrib_alt$removal_effects
mod_attrib_alt$result

# adding unique paths
df_uniq_paths <- df_path_1_clean_uniq %>%
  filter(conversion == 1) %>%
  group_by(channel) %>%
  summarise(conversions = sum(conversion)) %>%
  ungroup()

d_multi <- data.frame(mod_attrib_alt$result)

d_split <- full_join(d_multi, df_uniq_paths, by = c('channel_name' = 'channel')) %>%
  mutate(result = total_conversions + conversions)

sum(d_all$total_conversions)
sum(d_split$result)
##### Higher order of Markov chains and consequent duplicated channels in the path #####

# computing transition matrix - 'manual' way
df_multi_paths_m <- df_multi_paths %>%
  mutate(path = paste0('(start) > ', path, ' > (conversion)'))
m <- max(str_count(df_multi_paths_m$path, '>')) + 1 # maximum path length

df_multi_paths_cols <- colsplit(string = df_multi_paths_m$path, pattern = ' > ', names = c(1:m))
colnames(df_multi_paths_cols) <- paste0('ord_', c(1:m))
df_multi_paths_cols[df_multi_paths_cols == ''] <- NA

df_res <- vector('list', ncol(df_multi_paths_cols) - 1)

for (i in c(1:(ncol(df_multi_paths_cols) - 1))) {
  df_cache <- df_multi_paths_cols %>%
    select(num_range("ord_", c(i, i + 1))) %>%
    na.omit() %>%
    group_by_(.dots = c(paste0("ord_", c(i, i + 1)))) %>%
    summarise(n = n()) %>%
    ungroup()
  colnames(df_cache)[c(1, 2)] <- c('channel_from', 'channel_to')
  df_res[[i]] <- df_cache
}

df_res <- do.call('rbind', df_res)

df_res_tot <- df_res %>%
  group_by(channel_from, channel_to) %>%
  summarise(n = sum(n)) %>%
  ungroup() %>%
  group_by(channel_from) %>%
  mutate(tot_n = sum(n),
         perc = n / tot_n) %>%
  ungroup()

df_dummy <- data.frame(channel_from = c('(start)', '(conversion)', '(null)'),
                       channel_to = c('(start)', '(conversion)', '(null)'),
                       n = c(0, 0, 0),
                       tot_n = c(0, 0, 0),
                       perc = c(0, 1, 1))

df_res_tot <- rbind(df_res_tot, df_dummy)

# comparing transition matrices
trans_matrix_prob_m <- dcast(df_res_tot, channel_from ~ channel_to,
                             value.var = 'perc', fun.aggregate = sum)
trans_matrix_prob <- data.frame(mod_attrib_alt$transition_matrix)
trans_matrix_prob <- dcast(trans_matrix_prob, channel_from ~ channel_to,
                           value.var = 'transition_probability')

# computing attribution - 'manual' way
channels_list <- df_path_1_clean_multi %>%
  filter(conversion == 1) %>%
  distinct(channel)
channels_list <- c(channels_list$channel)

df_res_ini <- df_res_tot %>% select(channel_from, channel_to)
df_attrib <- vector('list', length(channels_list))

for (i in c(1:length(channels_list))) {
  channel <- channels_list[i]

  df_res1 <- df_res %>%
    mutate(channel_from = ifelse(channel_from == channel, NA, channel_from),
           channel_to = ifelse(channel_to == channel, '(null)', channel_to)) %>%
    na.omit()

  df_res_tot1 <- df_res1 %>%
    group_by(channel_from, channel_to) %>%
    summarise(n = sum(n)) %>%
    ungroup() %>%
    group_by(channel_from) %>%
    mutate(tot_n = sum(n),
           perc = n / tot_n) %>%
    ungroup()

  df_res_tot1 <- rbind(df_res_tot1, df_dummy) # adding (start), (conversion) and (null) states
  df_res_tot1 <- left_join(df_res_ini, df_res_tot1, by = c('channel_from', 'channel_to'))
  df_res_tot1[is.na(df_res_tot1)] <- 0

  df_trans1 <- dcast(df_res_tot1, channel_from ~ channel_to,
                     value.var = 'perc', fun.aggregate = sum)

  trans_matrix_1 <- df_trans1
  rownames(trans_matrix_1) <- trans_matrix_1$channel_from
  trans_matrix_1 <- as.matrix(trans_matrix_1[, -1])

  inist_n1 <- dcast(df_res_tot1, channel_from ~ channel_to,
                    value.var = 'n', fun.aggregate = sum)
  rownames(inist_n1) <- inist_n1$channel_from
  inist_n1 <- as.matrix(inist_n1[, -1])
  inist_n1[is.na(inist_n1)] <- 0
  inist_n1 <- inist_n1['(start)', ]

  res_num1 <- inist_n1 %*% (trans_matrix_1 %^% 100000)

  df_cache <- data.frame(channel_name = channel,
                         conversions = as.numeric(res_num1[1, 1]))

  df_attrib[[i]] <- df_cache
}

df_attrib <- do.call('rbind', df_attrib)

# computing removal effect and results
tot_conv <- sum(df_multi_paths_m$conversion)

df_attrib <- df_attrib %>%
  mutate(tot_conversions = sum(df_multi_paths_m$conversion),
         impact = (tot_conversions - conversions) / tot_conversions,
         tot_impact = sum(impact),
         weighted_impact = impact / tot_impact,
         attrib_model_conversions = round(tot_conversions * weighted_impact)) %>%
  select(channel_name, attrib_model_conversions)
##### Generic Probabilistic Model #####
df_all_paths_compl <- df_path_1_clean %>%
  group_by(customer_id) %>%
  summarise(path = paste(channel, collapse = ' > '),
            conversion = sum(conversion)) %>%
  ungroup() %>%
  mutate(null_conversion = ifelse(conversion == 1, 0, 1))

mod_attrib_complete <- markov_model(
  df_all_paths_compl,
  var_path = 'path',
  var_conv = 'conversion',
  var_null = 'null_conversion',
  out_more = TRUE
)

trans_matrix_prob <- mod_attrib_complete$transition_matrix %>%
  dmap_at(c(1, 2), as.character)

##### viz #####
edges <- data.frame(
  from = trans_matrix_prob$channel_from,
  to = trans_matrix_prob$channel_to,
  label = round(trans_matrix_prob$transition_probability, 2),
  font.size = trans_matrix_prob$transition_probability * 100,
  width = trans_matrix_prob$transition_probability * 15,
  shadow = TRUE,
  arrows = "to",
  color = list(color = "#95cbee", highlight = "red")
)

nodes <- data_frame(id = c(c(trans_matrix_prob$channel_from),
                           c(trans_matrix_prob$channel_to))) %>%
  distinct(id) %>%
  arrange(id) %>%
  mutate(label = id,
         color = ifelse(label %in% c('(start)', '(conversion)'), '#4ab04a',
                        ifelse(label == '(null)', '#ce472e', '#ffd73e')),
         shadow = TRUE,
         shape = "box")

visNetwork(nodes, edges,
           height = "2000px", width = "100%",
           main = "Generic Probabilistic model's Transition Matrix") %>%
  visIgraphLayout(randomSeed = 123) %>%
  visNodes(size = 5) %>%
  visOptions(highlightNearest = TRUE)
##### modeling states and conversions #####
# transition matrix preprocessing
trans_matrix_complete <- mod_attrib_complete$transition_matrix
trans_matrix_complete <- rbind(trans_matrix_complete,
                               df_dummy %>%
                                 mutate(transition_probability = perc) %>%
                                 select(channel_from, channel_to, transition_probability))
trans_matrix_complete$channel_to <- factor(trans_matrix_complete$channel_to,
                                           levels = c(levels(trans_matrix_complete$channel_from)))
trans_matrix_complete <- dcast(trans_matrix_complete, channel_from ~ channel_to,
                               value.var = 'transition_probability')
trans_matrix_complete[is.na(trans_matrix_complete)] <- 0
rownames(trans_matrix_complete) <- trans_matrix_complete$channel_from
trans_matrix_complete <- as.matrix(trans_matrix_complete[, -1])

# creating empty matrix for modeling
model_mtrx <- matrix(data = 0, nrow = nrow(trans_matrix_complete), ncol = 1,
                     dimnames = list(c(rownames(trans_matrix_complete)), '(start)'))
# adding modeling number of visits
model_mtrx['channel_5', ] <- 1000

c(model_mtrx) %*% (trans_matrix_complete %^% 5)      # after 5 steps
c(model_mtrx) %*% (trans_matrix_complete %^% 100000) # after 100000 steps
##### Customer journey duration #####
# computing time lapses from the first contact to conversion/last contact
df_multi_paths_tl <- df_path_1_clean_multi %>%
  group_by(customer_id) %>%
  summarise(path = paste(channel, collapse = ' > '),
            first_touch_date = min(date),
            last_touch_date = max(date),
            tot_time_lapse = round(as.numeric(last_touch_date - first_touch_date)),
            conversion = sum(conversion)) %>%
  ungroup()

# distribution plot
ggplot(df_multi_paths_tl %>% filter(conversion == 1), aes(x = tot_time_lapse)) +
  theme_minimal() +
  geom_histogram(fill = '#4e79a7', binwidth = 1)

# cumulative distribution plot
ggplot(df_multi_paths_tl %>% filter(conversion == 1), aes(x = tot_time_lapse)) +
  theme_minimal() +
  stat_ecdf(geom = 'step', color = '#4e79a7', size = 2, alpha = 0.7) +
  geom_hline(yintercept = 0.95, color = '#e15759', size = 1.5) +
  geom_vline(xintercept = 23, color = '#e15759', size = 1.5, linetype = 2)
### for generic probabilistic model ###
df_multi_paths_tl_1 <- melt(df_multi_paths_tl[c(1:50), ] %>%
                              select(customer_id, first_touch_date, last_touch_date, conversion),
                            id.vars = c('customer_id', 'conversion'),
                            value.name = 'touch_date') %>%
  arrange(customer_id)
rep_date <- as.Date('2016-01-10', format = '%Y-%m-%d')

ggplot(df_multi_paths_tl_1, aes(x = as.factor(customer_id), y = touch_date,
                                color = factor(conversion), group = customer_id)) +
  theme_minimal() +
  coord_flip() +
  geom_point(size = 2) +
  geom_line(size = 0.5, color = 'darkgrey') +
  geom_hline(yintercept = as.numeric(rep_date), color = '#e15759', size = 2) +
  geom_rect(xmin = -Inf, xmax = Inf, ymin = as.numeric(rep_date), ymax = Inf,
            alpha = 0.01, color = 'white', fill = 'white') +
  theme(legend.position = 'bottom',
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank()) +
  guides(colour = guide_legend(override.aes = list(size = 5)))
df_multi_paths_tl_2 <- df_path_1_clean_multi %>%
  group_by(customer_id) %>%
  mutate(prev_touch_date = lag(date)) %>%
  ungroup() %>%
  filter(conversion == 1) %>%
  mutate(prev_time_lapse = round(as.numeric(date - prev_touch_date)))

# distribution
ggplot(df_multi_paths_tl_2, aes(x = prev_time_lapse)) +
  theme_minimal() +
  geom_histogram(fill = '#4e79a7', binwidth = 1)

# cumulative distribution
ggplot(df_multi_paths_tl_2, aes(x = prev_time_lapse)) +
  theme_minimal() +
  stat_ecdf(geom = 'step', color = '#4e79a7', size = 2, alpha = 0.7) +
  geom_hline(yintercept = 0.95, color = '#e15759', size = 1.5) +
  geom_vline(xintercept = 12, color = '#e15759', size = 1.5, linetype = 2)
# extracting data for generic model
df_multi_paths_tl_3 <- df_path_1_clean_multi %>%
  group_by(customer_id) %>%
  mutate(prev_time_lapse = round(as.numeric(date - lag(date)))) %>%
  summarise(path = paste(channel, collapse = ' > '),
            tot_time_lapse = round(as.numeric(max(date) - min(date))),
            prev_touch_tl = prev_time_lapse[which(max(date) == date)],
            conversion = sum(conversion)) %>%
  ungroup() %>%
  mutate(is_fruitless = ifelse(conversion == 0 & tot_time_lapse > 20 & prev_touch_tl > 10,
                               TRUE, FALSE)) %>%
  filter(conversion == 1 | is_fruitless == TRUE)
Recently I’ve worked with the word2vec and doc2vec algorithms, which I found interesting from many perspectives. Even though I used them for another purpose, the main thing they were developed for is text analysis. As I’ve noticed, my 2014 article Twitter sentiment analysis is still one of the most popular posts on this blog. Therefore, I decided to update it with a modern approach.
The problem with the previous method is that it simply counts positive and negative words and draws a conclusion from their difference. As a result, with a simple vocabulary-based approach, a phrase like “not bad” gets a negative estimate.
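As a toy illustration of this weakness (using a tiny hypothetical word list, not the actual lexicons from the 2014 post), a pure word-counting approach scores “not bad” as negative because “bad” sits in the negative list while the negation “not” carries no weight:

```r
# toy positive/negative word lists (hypothetical, for illustration only)
pos_words <- c('good', 'great', 'nice')
neg_words <- c('bad', 'awful', 'poor')

# naive vocabulary-based score: positive hits minus negative hits
naive_score <- function(text) {
  words <- unlist(strsplit(tolower(text), '\\s+'))
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

naive_score('not bad') # -1: wrongly negative, because the negation is ignored
```

This is exactly the failure mode that a context-aware model avoids.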
But doc2vec is a deep learning algorithm that derives context from phrases. It’s currently one of the best approaches to sentiment classification of movie reviews. You can use the following method to analyze feedback, reviews, comments, and so on, and you can expect better results than with tweet analysis, because tweets usually contain lots of misspellings.
We’ll use tweets for this example because it’s pretty easy to get them via the Twitter API. We only need to create an app on https://dev.twitter.com (the “My apps” menu) and find the API Key, API Secret, Access Token, and Access Token Secret on the “Keys and Access Tokens” tab.
First, I’d like to give credit to Dmitry Selivanov, the author of the great text2vec R package that we’ll use for sentiment analysis.
You can download a set of 1.6 million classified tweets here and use them to train a model. Before we start the analysis, I want to draw your attention to how the tweets were classified. There are two grades of sentiment: 0 (negative) and 4 (positive), which means none of them are neutral. I suggest using a probability of positiveness instead of a class. In this case, we’ll get a range of values from 0 (completely negative) to 1 (completely positive) and can assume that values from 0.35 to 0.65 are somewhere in the middle and therefore neutral.
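Once each tweet gets a probability of positiveness, the 0.35 / 0.65 thresholds above can be applied with a simple cut. A minimal sketch (the helper name `sentiment_class` is my own, not from the original code):

```r
# mapping probability of positiveness to sentiment classes
# using the 0.35 / 0.65 thresholds described in the text
sentiment_class <- function(p) {
  cut(p, breaks = c(-Inf, 0.35, 0.65, Inf),
      labels = c('negative', 'neutral', 'positive'))
}

as.character(sentiment_class(c(0.1, 0.5, 0.9))) # "negative" "neutral" "positive"
```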
The following is the R code for training the model using a Document-Term Matrix (DTM), which is the result of vocabulary-based vectorization. In addition, we will use the TF-IDF method for text preprocessing. Note that model training can take up to an hour, depending on your computer’s configuration:
# loading packages
library(twitteR)
library(ROAuth)
library(tidyverse)
library(text2vec)
library(caret)
library(glmnet)
library(ggrepel)

### loading and preprocessing a training set of tweets
# function for converting some symbols
conv_fun <- function(x) iconv(x, "latin1", "ASCII", "")

##### loading classified tweets ######
# source: http://help.sentiment140.com/for-students/
# 0 - the polarity of the tweet (0 = negative, 4 = positive)
# 1 - the id of the tweet
# 2 - the date of the tweet
# 3 - the query. If there is no query, then this value is NO_QUERY.
# 4 - the user that tweeted
# 5 - the text of the tweet
tweets_classified <- read_csv('training.1600000.processed.noemoticon.csv',
                              col_names = c('sentiment', 'id', 'date', 'query', 'user', 'text')) %>%
  # converting some symbols
  dmap_at('text', conv_fun) %>%
  # replacing class values
  mutate(sentiment = ifelse(sentiment == 0, 0, 1))

# there are some tweets with NA ids that we replace with dummies
tweets_classified_na <- tweets_classified %>%
  filter(is.na(id) == TRUE) %>%
  mutate(id = c(1:n()))
tweets_classified <- tweets_classified %>%
  filter(!is.na(id)) %>%
  rbind(., tweets_classified_na)

# data splitting on train and test
set.seed(2340)
trainIndex <- createDataPartition(tweets_classified$sentiment, p = 0.8,
                                  list = FALSE, times = 1)
tweets_train <- tweets_classified[trainIndex, ]
tweets_test <- tweets_classified[-trainIndex, ]

##### Vectorization #####
# define preprocessing function and tokenization function
prep_fun <- tolower
tok_fun <- word_tokenizer

it_train <- itoken(tweets_train$text,
                   preprocessor = prep_fun,
                   tokenizer = tok_fun,
                   ids = tweets_train$id,
                   progressbar = TRUE)
it_test <- itoken(tweets_test$text,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = tweets_test$id,
                  progressbar = TRUE)

# creating vocabulary and document-term matrix
vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)
dtm_test <- create_dtm(it_test, vectorizer)

# define tf-idf model
tfidf <- TfIdf$new()
# fit the model to the train data and transform it with the fitted model
dtm_train_tfidf <- fit_transform(dtm_train, tfidf)
dtm_test_tfidf <- fit_transform(dtm_test, tfidf)

# train the model
t1 <- Sys.time()
glmnet_classifier <- cv.glmnet(x = dtm_train_tfidf, y = tweets_train[['sentiment']],
                               family = 'binomial',
                               # L1 penalty
                               alpha = 1,
                               # interested in the area under ROC curve
                               type.measure = "auc",
                               # 5-fold cross-validation
                               nfolds = 5,
                               # high value is less accurate, but has faster training
                               thresh = 1e-3,
                               # again lower number of iterations for faster training
                               maxit = 1e3)
print(difftime(Sys.time(), t1, units = 'mins'))

plot(glmnet_classifier)
print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))

preds <- predict(glmnet_classifier, dtm_test_tfidf, type = 'response')[, 1]
glmnet:::auc(as.numeric(tweets_test$sentiment), preds)

# save the model for future use
saveRDS(glmnet_classifier, 'glmnet_classifier.RDS')
#######################################################
As you can see, the AUC values on both the train and test datasets are pretty high (0.876 and 0.875, respectively). Note that we saved the model, so you don’t need to retrain it every time you want to assess some tweets. The next time you do sentiment analysis, you can start with the script below.
Ok, once the model is trained and validated, we can use it. For this, we start by fetching tweets via the Twitter API and preprocessing them in the same way as the classified tweets. For instance, the company I work for has just released an ambitious product for Mac users, and it’s interesting to analyze how tweets about SetApp are rated.
### fetching tweets ###
download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
setup_twitter_oauth('your_api_key',             # api key
                    'your_api_secret',          # api secret
                    'your_access_token',        # access token
                    'your_access_token_secret'  # access token secret
)

df_tweets <- twListToDF(searchTwitter('setapp OR #setapp', n = 1000, lang = 'en')) %>%
  # converting some symbols
  dmap_at('text', conv_fun)

# preprocessing and tokenization
it_tweets <- itoken(df_tweets$text,
                    preprocessor = prep_fun,
                    tokenizer = tok_fun,
                    ids = df_tweets$id,
                    progressbar = TRUE)

# creating vocabulary and document-term matrix
dtm_tweets <- create_dtm(it_tweets, vectorizer)

# transforming data with tf-idf
dtm_tweets_tfidf <- fit_transform(dtm_tweets, tfidf)

# loading classification model
glmnet_classifier <- readRDS('glmnet_classifier.RDS')

# predict probabilities of positiveness
preds_tweets <- predict(glmnet_classifier, dtm_tweets_tfidf, type = 'response')[, 1]

# adding rates to initial dataset
df_tweets$sentiment <- preds_tweets
And finally, we can visualize the result with the following code:
# color palette
cols <- c("#ce472e", "#f05336", "#ffd73e", "#eec73a", "#4ab04a")

set.seed(932)
samp_ind <- sample(c(1:nrow(df_tweets)), nrow(df_tweets) * 0.1) # 10% for labeling

# plotting
ggplot(df_tweets, aes(x = created, y = sentiment, color = sentiment)) +
  theme_minimal() +
  scale_color_gradientn(colors = cols, limits = c(0, 1),
                        breaks = seq(0, 1, by = 1/4),
                        labels = c("0", round(1/4*1, 1), round(1/4*2, 1), round(1/4*3, 1), round(1/4*4, 1)),
                        guide = guide_colourbar(ticks = T, nbin = 50,
                                                barheight = .5, label = T, barwidth = 10)) +
  geom_point(aes(color = sentiment), alpha = 0.8) +
  geom_hline(yintercept = 0.65, color = "#4ab04a", size = 1.5, alpha = 0.6, linetype = "longdash") +
  geom_hline(yintercept = 0.35, color = "#f05336", size = 1.5, alpha = 0.6, linetype = "longdash") +
  geom_smooth(size = 1.2, alpha = 0.2) +
  geom_label_repel(data = df_tweets[samp_ind, ],
                   aes(label = round(sentiment, 2)),
                   fontface = 'bold',
                   size = 2.5,
                   max.iter = 100) +
  theme(legend.position = 'bottom',
        legend.direction = "horizontal",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        axis.text.y = element_text(size = 8, face = "bold", color = 'black'),
        axis.text.x = element_text(size = 8, face = "bold", color = 'black')) +
  ggtitle("Tweets Sentiment rate (probability of positiveness)")
The green line is the boundary of positive tweets, and the red one is the boundary of negative tweets. In addition, tweets are colored red (negative), yellow (neutral), and green (positive). As you can see, most of the tweets sit near the green boundary, which means they tend to be positive.
When conducting Cohort Analysis, one of the most important measures is the Customer Retention Rate. I will share a few ideas for visualizing this parameter in this post.
Last year, I shared several charts for Customer Retention Rate visualization in this post. However, it is always helpful to analyze and visualize both relative values (the Customer Retention Rate) and absolute values (the number of customers in a cohort). For this, I have created charts that combine both.
And the first of them is the Cycle Plot in an advanced view:
Note: you can find two approaches for plotting this chart with the R language at the end of the post (via scaling and via multiplotting).
Let’s go over this chart. It shows both the trend and the year-of-lifetime effect. Cohorts are arranged by the periods of their lifetimes (annual in our case) sequentially: the first year is red, the second is green, and the third is blue. Retention Rate values (points) with smooth curves sit at the top of the chart, and the numbers of customers (bars) at the bottom. Light bars show the absolute number of customers a cohort had at the beginning of the given year of its lifetime, and dark bars show the number of customers who were still alive afterwards. For instance, Cohort_01 had 1402 customers at the beginning of its lifetime and 965 customers after the first year, so its first-year retention rate is 69% (965 / 1402).
Therefore, we can easily spot acquisition trends, seasonality, and problematic lifetime periods. In our example, we can see a flat trend in customer acquisition with seasonal peaks in November (Cohort_11, Cohort_23, and Cohort_35). Although the number of customers is higher in these cohorts, their Retention Rate is lower. That means we attracted a lot of one-time buyers who were only interested in discounts and didn’t become loyal customers. We can also see that the Retention Rate tends to decrease within the first year. Therefore, combined with the flat trend in customer acquisition, we would face a big problem in the business: the overall Retention Rate tells us that our main losses occur in the first year of the customers’ lifetime, where we lose approximately 40% of clients. In addition, the third year looks problematic.
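The retention arithmetic described above (e.g. 965 / 1402 ≈ 69% for Cohort_01) can be sketched in a few lines; the second cohort's numbers here are hypothetical, added only to make the example a vector operation:

```r
# hypothetical cohort counts illustrating the retention rate arithmetic:
# customers still alive divided by customers the cohort started the period with
customers_start <- c(Cohort_01 = 1402, Cohort_02 = 1350)
customers_alive <- c(Cohort_01 = 965, Cohort_02 = 890)

ret_rate <- round(customers_alive / customers_start, 2)
ret_rate # Cohort_01: 965 / 1402 = 0.69
```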
The second chart is the Bubble Chart:
We have all of the cohorts on the x-axis and have transformed the bars with the numbers of customers into bubbles. Light bubbles represent the number of customers at the beginning of the lifetime period, and dark bubbles represent the customers who are still alive afterwards. The size of each bubble depends on the number of customers in the cohort, and its position depends on the Retention Rate value. The exact numbers are shown inside the bubbles, e.g. 965/1402 with a 69% first-year Retention Rate for Cohort_01. This chart is helpful for analyzing each cohort and comparing its progress to the others.
The last chart I call Hanging Bubbles:
It is pretty much the same as the Bubble Chart, but it shows the original numbers of customers and is scaled exactly from 1 to 0 (from 100% to 0%) on the Retention Rate axis. Therefore, it is easy to see which cohorts are closer to “death” and which are not.
If you are interested in reproducing these charts, here is the R code:
# loading libraries
library(dplyr)
library(reshape2)
library(ggplot2)
library(scales)
library(gridExtra)

# creating data sample
set.seed(10)
cohorts <- data.frame(cohort = paste('cohort', formatC(c(1:36), width = 2, format = 'd', flag = '0'), sep = '_'),
                      Y_00 = sample(c(1300:1500), 36, replace = TRUE),
                      Y_01 = c(sample(c(800:1000), 36, replace = TRUE)),
                      Y_02 = c(sample(c(600:800), 24, replace = TRUE), rep(NA, 12)),
                      Y_03 = c(sample(c(400:500), 12, replace = TRUE), rep(NA, 24)))

# simulating seasonality (Black Friday)
cohorts[c(11, 23, 35), 2] <- as.integer(cohorts[c(11, 23, 35), 2] * 1.25)
cohorts[c(11, 23, 35), 3] <- as.integer(cohorts[c(11, 23, 35), 3] * 1.10)
cohorts[c(11, 23, 35), 4] <- as.integer(cohorts[c(11, 23, 35), 4] * 1.07)

# calculating retention rate and preparing data for plotting
df_plot <- melt(cohorts, id.vars = 'cohort', value.name = 'number', variable.name = 'year_of_LT')

df_plot <- df_plot %>%
  group_by(cohort) %>%
  arrange(year_of_LT) %>%
  mutate(number_prev_year = lag(number),
         number_Y_00 = number[which(year_of_LT == 'Y_00')]) %>%
  ungroup() %>%
  mutate(ret_rate_prev_year = number / number_prev_year,
         ret_rate = number / number_Y_00,
         year_cohort = paste(year_of_LT, cohort, sep = '-'))

##### The first way for plotting cycle plot - via scaling
# calculating the coefficient for scaling 2nd axis
k <- max(df_plot$number_prev_year[df_plot$year_of_LT == 'Y_01'] * 1.15) /
  min(df_plot$ret_rate[df_plot$year_of_LT == 'Y_01'])

# retention rate cycle plot
ggplot(na.omit(df_plot), aes(x = year_cohort, y = ret_rate, group = year_of_LT, color = year_of_LT)) +
  theme_bw() +
  geom_point(size = 4) +
  geom_text(aes(label = percent(round(ret_rate, 2))),
            size = 4, hjust = 0.4, vjust = -0.6, face = "plain") +
  # smooth method can be changed (e.g. for "lm")
  geom_smooth(size = 2.5, method = 'loess', color = 'darkred', aes(fill = year_of_LT)) +
  geom_bar(aes(y = number_prev_year / k, fill = year_of_LT), alpha = 0.2, stat = 'identity') +
  geom_bar(aes(y = number / k, fill = year_of_LT), alpha = 0.6, stat = 'identity') +
  geom_text(aes(y = 0, label = cohort), color = 'white', angle = 90, size = 4, hjust = -0.05, vjust = 0.4) +
  geom_text(aes(y = number_prev_year / k, label = number_prev_year), angle = 90, size = 4, hjust = -0.1, vjust = 0.4) +
  geom_text(aes(y = number / k, label = number), angle = 90, size = 4, hjust = -0.1, vjust = 0.4) +
  theme(legend.position = 'none',
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.title.x = element_text(size = 18, face = "bold"),
        axis.title.y = element_text(size = 18, face = "bold"),
        axis.text = element_text(size = 16),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(x = 'Year of Lifetime by Cohorts', y = 'Number of Customers / Retention Rate') +
  ggtitle("Customer Retention Rate - Cycle plot")

##### The second way for plotting cycle plot - via multiplotting
# plot #1 - Retention rate
p1 <- ggplot(na.omit(df_plot), aes(x = year_cohort, y = ret_rate, group = year_of_LT, color = year_of_LT)) +
  theme_bw() +
  geom_point(size = 4) +
  geom_text(aes(label = percent(round(ret_rate, 2))),
            size = 4, hjust = 0.4, vjust = -0.6, face = "plain") +
  geom_smooth(size = 2.5, method = 'loess', color = 'darkred', aes(fill = year_of_LT)) +
  theme(legend.position = 'none',
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 18, face = "bold"),
        axis.text = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(y = 'Retention Rate') +
  ggtitle("Customer Retention Rate - Cycle plot")

# plot #2 - number of customers
p2 <- ggplot(na.omit(df_plot), aes(x = year_cohort, group = year_of_LT, color = year_of_LT)) +
  theme_bw() +
  geom_bar(aes(y = number_prev_year, fill = year_of_LT), alpha = 0.2, stat = 'identity') +
  geom_bar(aes(y = number, fill = year_of_LT), alpha = 0.6, stat = 'identity') +
  geom_text(aes(y = number_prev_year, label = number_prev_year), angle = 90, size = 4, hjust = -0.1, vjust = 0.4) +
  geom_text(aes(y = number, label = number), angle = 90, size = 4, hjust = -0.1, vjust = 0.4) +
  geom_text(aes(y = 0, label = cohort), color = 'white', angle = 90, size = 4, hjust = -0.05, vjust = 0.4) +
  theme(legend.position = 'none',
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.title.x = element_text(size = 18, face = "bold"),
        axis.title.y = element_text(size = 18, face = "bold"),
        axis.text = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(limits = c(0, max(df_plot$number_Y_00 * 1.1))) +
  labs(x = 'Year of Lifetime by Cohorts', y = 'Number of Customers')

# multiplot
grid.arrange(p1, p2, ncol = 1)

# retention rate bubble chart
ggplot(na.omit(df_plot), aes(x = cohort, y = ret_rate, group = cohort, color = year_of_LT)) +
  theme_bw() +
  scale_size(range = c(15, 40)) +
  geom_line(size = 2, alpha = 0.3) +
  geom_point(aes(size = number_prev_year), alpha = 0.3) +
  geom_point(aes(size = number), alpha = 0.8) +
  geom_smooth(linetype = 2, size = 2, method = 'loess',
              aes(group = year_of_LT, fill = year_of_LT), alpha = 0.2) +
  geom_text(aes(label = paste0(number, '/', number_prev_year, '\n', percent(round(ret_rate, 2)))),
            color = 'white', size = 3, hjust = 0.5, vjust = 0.5, face = "plain") +
  theme(legend.position = 'none',
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.title.x = element_text(size = 18, face = "bold"),
        axis.title.y = element_text(size = 18, face = "bold"),
        axis.text = element_text(size = 16),
        axis.text.x = element_text(size = 10, angle = 90, hjust = .5, vjust = .5, face = "plain"),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(x = 'Cohorts', y = 'Retention Rate by Year of Lifetime') +
  ggtitle("Customer Retention Rate - Bubble chart")

# retention rate falling drops chart
ggplot(df_plot, aes(x = cohort, y = ret_rate, group = cohort, color = year_of_LT)) +
  theme_bw() +
  scale_size(range = c(15, 40)) +
  scale_y_continuous(limits = c(0, 1)) +
  geom_line(size = 2, alpha = 0.3) +
  geom_point(aes(size = number), alpha = 0.8) +
  geom_text(aes(label = paste0(number, '\n', percent(round(ret_rate, 2)))),
            color = 'white', size = 3, hjust = 0.5, vjust = 0.5, face = "plain") +
  theme(legend.position = 'none',
        plot.title = element_text(size = 20, face = "bold", vjust = 2),
        axis.title.x = element_text(size = 18, face = "bold"),
        axis.title.y = element_text(size = 18, face = "bold"),
        axis.text = element_text(size = 16),
        axis.text.x = element_text(size = 10, angle = 90, hjust = .5, vjust = .5, face = "plain"),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(x = 'Cohorts', y = 'Retention Rate by Year of Lifetime') +
  ggtitle("Customer Retention Rate - Falling Drops chart")
There are several posts connected with LifeCycle Grids on this blog. If you are not familiar with the concept, I highly recommend starting with Jim Novo’s book, his blog or, at least, the first post about LifeCycle Grids on my blog. In this post, we will study how to use the LifeCycle Grids concept for measuring the health of a business.
Obviously, a business depends on customers’ purchasing behavior. Purchase frequency leads to higher customer lifetime value to date, and purchase recency leads to higher potential lifetime value; both lead to higher total customer lifetime value. Therefore, the more frequently and recently customers purchase, the healthier the business.
Previously, we’ve touched on the topic of comparing cohorts and their progress through the prism of the customer lifecycle. But what if we want to measure the health of the business at a high level?
Once we have discovered how our clients are distributed through the LifeCycle Grids, we can use a pretty simple and effective technique called Delta Analysis and create Delta LifeCycle Grids (delta meaning difference). The main idea is the following: we create two consecutive LifeCycle Grids as of two reporting dates and calculate the difference between the corresponding cells. The delta values (differences) tell us where we have positive and negative changes, and whether they are good for us or not. Let’s study this idea with a practical example.
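The arithmetic behind the idea is tiny: with two grids of customer counts (rows for frequency segments, columns for recency segments) as of two reporting dates, the delta grid is just an element-wise subtraction. A minimal sketch with made-up counts and a simplified 2x2 grid:

```r
# two toy LifeCycle Grids: rows = frequency segments, columns = recency segments
# (counts are made up for illustration)
lcg_prev <- matrix(c(20, 35,
                     50, 80), nrow = 2, byrow = TRUE,
                   dimnames = list(c('>1 purch', '1 purch'),
                                   c('0-30 days', '>30 days')))
lcg_curr <- matrix(c(26, 31,
                     47, 85), nrow = 2, byrow = TRUE,
                   dimnames = dimnames(lcg_prev))

# delta grid: positive = the cell gained customers, negative = it lost them
delta <- lcg_curr - lcg_prev
delta
#          0-30 days >30 days
# >1 purch         6       -4
# 1 purch         -3        5
```

The same subtraction is what the `lag()` call does cell by cell in the full code below, only in long (data frame) format instead of a matrix.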
As usual, we will use the R language, a powerful tool that allows us to create an example with more than two consecutive LifeCycle Grids and visualize the result effectively.
Ok, let’s assume that we have a dataset that looks like:
  clientId  orderdate
1      762 2012-05-10
2      461 2012-05-16
3      641 2012-07-07
4     1040 2013-02-22
5      128 2013-01-15
6      339 2013-03-13
Actually, this dataset is enough for a LifeCycle Grids analysis. We will calculate the number of orders for each customer (frequency) and the time lapse from the last purchase (recency). You can reproduce this dataset with the following code:
click to expand R code
# loading libraries
library(dplyr)
library(reshape2)
library(ggplot2)
library(lubridate)

set.seed(10)

# creating orders data sample
orders <- data.frame(clientId = sample(c(1:1500), 5000, replace = TRUE),
                     orderdate = sample((1:500), 5000, replace = TRUE))
orders$orderdate <- as.Date(orders$orderdate, origin = "2012-01-01")
We will do Delta Analysis based on historical data for the last six months on a monthly basis and find out whether there is a positive or negative trend in our business’s health. In this case, we need to specify seven reporting dates because we will analyze six delta values. Let’s assume they run from 2012-11-01 until 2013-05-01. Further, we need to create LifeCycle Grids for each date and calculate the differences between them consecutively. We will use the following code:
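A quick sanity check of the "seven dates give six deltas" point: `seq()` with a monthly step produces the reporting dates, and each consecutive pair yields one delta.

```r
# seven monthly reporting dates from 2012-11-01 to 2013-05-01
rep.dates <- seq(as.Date('2012-11-01'), as.Date('2013-05-01'), by = 'month')

length(rep.dates)        # 7 reporting dates
length(diff(rep.dates))  # 6 consecutive intervals, i.e. 6 delta values
```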
click to expand R code
# specifying reporting dates for monthly analysis
rep.dates <- seq(as.Date('2012-11-01', format = '%Y-%m-%d'),
                 as.Date('2013-05-01', format = '%Y-%m-%d'), "month")

# creating empty data frames
lcg.cache <- data.frame()
lcg <- data.frame()

# creating LCG for each reporting date
for (i in c(1:length(rep.dates))) {
  customers <- orders %>%
    filter(orderdate < rep.dates[i]) %>%
    group_by(clientId) %>%
    summarise(frequency = n(),
              recency = as.numeric(rep.dates[i] - max(orderdate))) %>%
    # adding segments
    mutate(segm.freq = ifelse(between(frequency, 1, 1), '1',
                       ifelse(between(frequency, 2, 2), '2',
                       ifelse(between(frequency, 3, 3), '3',
                       ifelse(between(frequency, 4, 5), '4-5', '>5'))))) %>%
    mutate(segm.rec = ifelse(between(recency, 0, 30), '0-30 days',
                      ifelse(between(recency, 31, 90), '31-90 days',
                      ifelse(between(recency, 91, 180), '91-180 days', '>180 days')))) %>%
    ungroup()

  # defining order of boundaries
  customers$segm.freq <- factor(customers$segm.freq, levels = c('>5', '4-5', '3', '2', '1'))
  customers$segm.rec <- factor(customers$segm.rec,
                               levels = c('>180 days', '91-180 days', '31-90 days', '0-30 days'))

  # creating LCG as of reporting date
  lcg.cache <- customers %>%
    group_by(segm.freq, segm.rec) %>%
    summarise(quantity = n()) %>%
    ungroup() %>%
    mutate(repdate = format(rep.dates[i], format = '%Y-%m'))

  # binding all LCGs
  lcg <- rbind(lcg, lcg.cache)
}

# calculating Delta LCG
delta.lcg <- lcg %>%
  group_by(segm.freq, segm.rec) %>%
  arrange(repdate) %>%
  mutate(prev = lag(quantity),
         delta = quantity - prev) %>%
  # removing base reporting period
  na.omit() %>%
  ungroup()
And the final step before analyzing is visualization. There are two examples that I want to share with you:
The first one shows deltas/differences only. The second one includes both differences and total numbers. You can reproduce these plots with the following code:
click to expand R code
# plotting results
ggplot(delta.lcg, aes(x = repdate, y = delta, fill = repdate)) +
  theme_bw() +
  theme(panel.grid = element_blank(),
        axis.text.x = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain"),
        legend.title = element_blank()) +
  geom_bar(stat = 'identity', alpha = 0.6) +
  geom_text(aes(y = 0, label = delta), color = 'darkred', vjust = 0, size = 5, fontface = "bold") +
  facet_grid(segm.freq ~ segm.rec) +
  xlab("Reporting Date") +
  ylab("Delta Value") +
  ggtitle("Delta LifeCycle Grids")

ggplot(delta.lcg, aes(x = repdate, y = delta, fill = repdate)) +
  theme_bw() +
  theme(panel.grid = element_blank(),
        axis.text.x = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain"),
        legend.title = element_blank()) +
  geom_bar(stat = 'identity', alpha = 0.6) +
  geom_bar(aes(y = quantity, color = repdate), stat = 'identity', alpha = 0) +
  geom_text(aes(y = 0, label = delta), color = 'darkred', vjust = 0, size = 5, fontface = "bold") +
  geom_text(aes(y = quantity, label = quantity), vjust = 0, size = 4) +
  facet_grid(segm.freq ~ segm.rec) +
  xlab("Reporting Date") +
  ylab("Delta Value") +
  ggtitle("Delta LifeCycle Grids")
You can see both positive and negative numbers in the Delta LifeCycle Grids, and the dynamics of the deltas, depending on whether the number of customers in a cell grew or shrank. The healthier the business, the more positive deltas you should see on the right part of the chart (the ‘0-30 days’ and ‘31-90 days’ columns) and the more negative deltas on the left part (the ‘91-180 days’ and ‘>180 days’ columns). In other words, the number of customers should grow in ‘alive’ cells and shrink in ‘dead’ cells.
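To make that reading rule concrete, here is a tiny sketch (with made-up column totals, not taken from the example data) that checks the "healthy" pattern: growth in the recent, ‘alive’ columns and decline in the stale, ‘dead’ ones.

```r
# made-up total deltas per recency column of the grid
delta <- c('0-30 days' = 12, '31-90 days' = 5, '91-180 days' = -4, '>180 days' = -7)

alive <- c('0-30 days', '31-90 days')   # recent, 'alive' columns
dead  <- setdiff(names(delta), alive)   # stale, 'dead' columns

# the healthy pattern: alive cells gain customers, dead cells lose them
healthy <- all(delta[alive] > 0) && all(delta[dead] < 0)
healthy
# TRUE
```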
Let’s take a look at our example. It would be helpful to start with an analysis of the last month (2013-05) and then look at the trends.
We can create a separate matrix with numbers and totals with the following code:
click to expand R code
# last month delta matrix
delta.lcg.m <- delta.lcg %>% filter(repdate == '2013-05')
delta.lcg.m <- dcast(delta.lcg.m, segm.freq ~ segm.rec, value.var = 'delta')
row.names(delta.lcg.m) <- delta.lcg.m$segm.freq
delta.lcg.m <- delta.lcg.m[, -1]
delta.lcg.m$total <- rowSums(delta.lcg.m)
delta.lcg.m['total', ] <- colSums(delta.lcg.m)
Our matrix looks like the following:
Let’s start with the column and row totals. You can see positive values on the left side of the matrix (4 and 11) and negative ones on the right (-2 and 0). This means our customers have moved from ‘alive’ cells to ‘dead’ ones, and it would be expensive to reactivate them. Regarding the rows, we have negative values at the bottom and positive ones at the top. This means customers have moved from one-time buyers to more frequent ones, which is good, but we can see a negative delta (-8) in the ‘0-30 days : 1 purchase’ cell, where brand-new customers are placed. This loss might be amplified by a lack of new customers to replace those moving out of the cell. A short summary: although we have positive changes in purchase frequency, valuable customers tend to leave us. Additionally, we need to check the number of new clients.
If we take a look at the trends, we will find almost the same situation as in the last month. And this situation is not good: we tend to lose our good and best customers (note the growth in the ‘>180 days’ cells for 3 to >5 purchases).
It seems our retention policy doesn’t work effectively for customers who have become frequent buyers. We also need to pay attention to the number of brand-new customers.
Let’s draw a conclusion. The Delta LifeCycle Grids approach allows us to represent changes between consecutive LifeCycle Grids and identify customer flows and trends. Keep in mind that there can be quite a lot of reasons for the changes: customers may alter their behavior because of changes in the product/service, seasonality, promo activities, etc. Take these factors into account.
Sales (purchasing or conversion) Funnel is a consumer-focused marketing model which illustrates the theoretical customer journey towards the purchase of a product or service. Classically, the Sales Funnel includes at least four steps:
There can be more steps, depending on the business model and the level of detail you want to (or can) represent. Obviously, quite a lot of creative work is needed to establish the model and define the steps of the customer journey in practice. Once you have managed this, it is always interesting to see the result in visual form. Usually, the Sales Funnel looks like a tornado that narrows from top to bottom. Therefore, the Sales Funnel allows you to find the bottleneck, i.e. the step where you lose customers.
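Finding the bottleneck is a one-liner once you have the session totals per step: compute each step-to-step pass-through rate and look for the smallest one. A minimal sketch, using illustrative counts consistent with the example dataset later in this post:

```r
# hypothetical session counts per funnel step, top to bottom
funnel <- c(awareness = 230000, interest = 180000, desire = 130000, action = 120000)

# share of sessions that proceeded from each step to the next
rates <- round(funnel[-1] / funnel[-length(funnel)], 3)
rates
# interest   desire   action
#    0.783    0.722    0.923

# the bottleneck is the transition with the lowest pass-through rate
names(which.min(rates))  # "desire": the interest -> desire transition loses the most
```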
We will study a very simple example of creating a Sales Funnel by focusing on visualization in this post.
If we speak about e-commerce, we can use the content of the site for defining the steps. Let’s assume that our main page and the landing pages of ad campaigns are responsible for the Awareness step. The pages with product or service descriptions correspond to the customers’ Interest. The shopping cart page confirms the Desire. Finally, the conversion page (the “thank you” page) is the Action. In other words, our theoretical customer journey looks like the following:
Note: you can distribute all pages to the Sales Funnel steps with the content grouping option in Google Analytics. Further, you can find a lot of examples of how to extract data from Google Analytics directly into R via the API. And if you use the Google Analytics content grouping option, you can extract these groups.
Let’s say we have a dataset with the grouped pages and number of sessions (visits). Further, we need to define which pages belong to each step of the funnel:
| content        | step      | number |
|----------------|-----------|--------|
| main           | awareness | 150000 |
| ad landing     | awareness |  80000 |
| product 1      | interest  |  80000 |
| product 2      | interest  |  40000 |
| product 3      | interest  |  35000 |
| product 4      | interest  |  25000 |
| shopping cart  | desire    | 130000 |
| thank you page | action    | 120000 |
Furthermore, we can make some improvements to the bottom of the Sales Funnel. If you use LifeCycle Grids you can identify the customer’s lifecycle phase. Therefore, we can represent a breakdown of customers who made purchases into their status (e.g. new customer, engaged and loyal). If you are not familiar with the LifeCycle Grids approach, please start here.
Here is our data set with the number of customers and their statuses:
| content | step    | number |
|---------|---------|--------|
| new     | new     |  25000 |
| engaged | engaged |  40000 |
| loyal   | loyal   |  55000 |
Using the same column names as in the content table makes it easier to combine the data.
The logic behind the R code is simple: we need to combine the tables, define the order of the steps, calculate dummy values for centering the bars, calculate the share of sessions that proceeded to the next step, and plot the result:
Let’s go through the plot: the stacked bars at the top of the chart represent the total number of sessions (visits) at each step of the Sales Funnel. 78.3% of sessions went from the Awareness step to the Interest one, 72.2% from Interest to Desire, and 92.3% from Desire to Action. Even though it is not necessary, we used different logic at the bottom of the Funnel: if we can identify the customer’s type, we can split the total number of Actions (120K in our case) into customer types (e.g. new, engaged and loyal customers). Further, we calculated the share of each customer type in the Action step (20.8%, 33.3% and 45.8%). If we aren’t new to the market and have a number of loyal customers, we get the hourglass view of the Sales Funnel.
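The percentages at the bottom are easy to verify by hand. A quick sketch with the customer counts from the example:

```r
# total number of Actions and its breakdown by customer type (from the example data)
action <- 120000
customers <- c(new = 25000, engaged = 40000, loyal = 55000)

# share of each customer type in the Action step
shares <- round(customers / action, 3)
shares
#   new engaged   loyal
# 0.208   0.333   0.458
```

The three counts sum to the full 120K of Actions, so the shares cover the whole bottom of the funnel.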
click to expand R code
library(dplyr)
library(ggplot2)
library(reshape2)

# creating data samples
# content
df.content <- data.frame(content = c('main', 'ad landing', 'product 1', 'product 2',
                                     'product 3', 'product 4', 'shopping cart', 'thank you page'),
                         step = c('awareness', 'awareness', 'interest', 'interest',
                                  'interest', 'interest', 'desire', 'action'),
                         number = c(150000, 80000, 80000, 40000, 35000, 25000, 130000, 120000))
# customers
df.customers <- data.frame(content = c('new', 'engaged', 'loyal'),
                           step = c('new', 'engaged', 'loyal'),
                           number = c(25000, 40000, 55000))

# combining two data sets
df.all <- rbind(df.content, df.customers)

# calculating dummies, max and min values of X for plotting
df.all <- df.all %>%
  group_by(step) %>%
  mutate(totnum = sum(number)) %>%
  ungroup() %>%
  mutate(dum = (max(totnum) - totnum) / 2,
         maxx = totnum + dum,
         minx = dum)

# data frame for plotting funnel lines
df.lines <- df.all %>%
  select(step, maxx, minx) %>%
  group_by(step) %>%
  unique() %>%
  ungroup()

# data frame with dummies
df.dum <- df.all %>%
  select(step, dum) %>%
  unique() %>%
  mutate(content = 'dummy',
         number = dum) %>%
  select(content, step, number)

# data frame with rates
conv <- df.all$totnum[df.all$step == 'action']
df.rates <- df.all %>%
  select(step, totnum) %>%
  group_by(step) %>%
  unique() %>%
  ungroup() %>%
  mutate(prevnum = lag(totnum),
         rate = ifelse(step == 'new' | step == 'engaged' | step == 'loyal',
                       round(totnum / conv, 3),
                       round(totnum / prevnum, 3))) %>%
  select(step, rate)
df.rates <- na.omit(df.rates)

# creating final data frame
df.all <- df.all %>% select(content, step, number)
df.all <- rbind(df.all, df.dum)
df.all <- df.all %>%
  group_by(step) %>%
  arrange(desc(content)) %>%
  ungroup()

# calculating position of labels
df.all <- df.all %>%
  group_by(step) %>%
  mutate(pos = cumsum(number) - 0.5 * number) %>%
  ungroup()

# defining order of steps
df.all$step <- factor(df.all$step,
                      levels = c('loyal', 'engaged', 'new', 'action', 'desire', 'interest', 'awareness'))
list <- c(unique(as.character(df.all$content)))
df.all$content <- factor(df.all$content, levels = c('dummy', c(list)))

# creating custom palette with 'white' color for dummies
cols <- c("#ffffff", "#fec44f", "#fc9272", "#a1d99b", "#fee0d2", "#2ca25f", "#8856a7",
          "#43a2ca", "#fdbb84", "#e34a33", "#a6bddb", "#dd1c77", "#ffeda0", "#756bb1")

# plotting chart
ggplot() +
  theme_minimal() +
  coord_flip() +
  scale_fill_manual(values = cols) +
  geom_bar(data = df.all, aes(x = step, y = number, fill = content), stat = "identity", width = 1) +
  geom_text(data = df.all[df.all$content != 'dummy', ],
            aes(x = step, y = pos, label = paste0(content, '-', number / 1000, 'K')),
            size = 4, color = 'white', fontface = "bold") +
  geom_ribbon(data = df.lines, aes(x = step, ymax = max(maxx), ymin = maxx, group = 1), fill = 'white') +
  geom_line(data = df.lines, aes(x = step, y = maxx, group = 1), color = 'darkred', size = 4) +
  geom_ribbon(data = df.lines, aes(x = step, ymax = minx, ymin = min(minx), group = 1), fill = 'white') +
  geom_line(data = df.lines, aes(x = step, y = minx, group = 1), color = 'darkred', size = 4) +
  geom_text(data = df.rates, aes(x = step, y = (df.lines$minx[-1]), label = paste0(rate * 100, '%')),
            hjust = 1.2, color = 'darkblue', fontface = "bold") +
  theme(legend.position = 'none',
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.title.x = element_blank())
Previously, I shared a data visualization approach for the descriptive analysis of cohorts’ progress with the “layer cake” chart (part I and part II). In this post, I want to share another interesting visualization that can also be used for descriptive analysis but is even more helpful for analyzing a large number of cohorts. For instance, if you form and analyze weekly cohorts, you will have 52 cohorts within a year.
The Heatmap chart would be helpful for primary analysis and we will study how to create it with the R programming language. But firstly, I would like to give credit to John Egan who shared the idea of using the Cohort Activity Heatmap and to Ben Moore whose great post helped me to reproduce such a beautiful color palette.
The following is my interpretation of using the Heatmap for Cohort Analysis.
Let’s assume we form weekly cohorts and have 100 of them as of the reporting date. We’ve tracked the number of customers who made a purchase and the total gross margin per weekly cohort per time lapse (a week in our case). We can easily calculate two extra values based on these data: the per-customer gross margin and the CLV to date.
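Both derived values take a couple of lines of dplyr. A minimal sketch on one made-up cohort: the per-customer gross margin is the weekly gross margin divided by that week’s active customers, and CLV to date is the cumulative gross margin divided by the cohort’s initial size.

```r
library(dplyr)

# toy weekly cohort: purchasing customers and total gross margin per lifetime week
coh <- data.frame(week.lt = 1:4,
                  num     = c(100, 80, 60, 50),
                  gr.marg = c(700, 480, 420, 300))

coh <- coh %>%
  mutate(av  = gr.marg / num,               # per-customer gross margin per week
         clv = cumsum(gr.marg) / num[1])    # CLV to date per initial cohort member

coh$clv
# 7.0 11.8 16.0 19.0
```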
In addition, I’ve simulated some plausible purchase patterns, specifically:
Based on these data we can plot at least four types of charts using Heatmap:
Furthermore, the charts can be represented based on calendar dates as well as on the serial number of the week of lifetime (e.g. 1st week, 2nd week, etc. from the first purchase date). Therefore, we can see the influence of seasonality or other occurrences on all existing cohorts as of a calendar date, and the progress of each cohort compared to the others based on the serial number of the week of lifetime.
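The alignment between the two views is simple arithmetic: a cohort acquired in calendar week k sees calendar week k + n as lifetime week n + 1. A two-line illustration:

```r
# a cohort acquired in calendar week 5; its first active week is lifetime week 1
cohort_start <- 5
cal_week <- 5:9

week_lt <- cal_week - cohort_start + 1
week_lt
# 1 2 3 4 5
```

The calendar view therefore shifts each cohort's row by its start week, which is why calendar-wide events appear as vertical stripes and lifetime-wide patterns appear as diagonals.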
And our eight charts are the following:
We have placed dates (calendar or week of lifetime) on the x-axis and cohorts on the y-axis. The color of the heatmap represents the value (number of customers, gross margin, per-customer gross margin or CLV to date).
Based on this type of visualization we can easily identify general purchasing behaviors, for instance:
You can produce this example via the following R code:
click to expand R code
# loading libraries
library(dplyr)
library(ggplot2)
library(reshape2)

# simulating dataset
cohorts <- data.frame()
set.seed(10)

for (i in c(1:100)) {
  coh <- data.frame(cohort = i,
                    date = c(i:100),
                    week.lt = c(1:(100 - i + 1)),
                    num = replicate(1, sample(c(1:40), 100 - i + 1, rep = TRUE)),
                    av = replicate(1, sample(c(5:10), 100 - i + 1, rep = TRUE)))
  coh$num[coh$week.lt == 1] <- sample(c(90:100), 1, rep = TRUE)
  ifelse(max(coh$date) > 1, coh$num[coh$week.lt == 2] <- sample(c(75:90), 1, rep = TRUE), NA)
  ifelse(max(coh$date) > 2, coh$num[coh$week.lt == 3] <- sample(c(60:75), 1, rep = TRUE), NA)
  ifelse(max(coh$date) > 3, coh$num[coh$week.lt == 4] <- sample(c(40:60), 1, rep = TRUE), NA)
  ifelse(max(coh$date) > 34, {coh$num[coh$date == 35] <- sample(c(60:85), 1, rep = TRUE)
                              coh$av[coh$date == 35] <- 4}, NA)
  ifelse(max(coh$date) > 47, {coh$num[coh$date == 48] <- sample(c(60:85), 1, rep = TRUE)
                              coh$av[coh$date == 48] <- 4}, NA)
  ifelse(max(coh$date) > 86, {coh$num[coh$date == 87] <- sample(c(60:85), 1, rep = TRUE)
                              coh$av[coh$date == 87] <- 4}, NA)
  ifelse(max(coh$date) > 99, {coh$num[coh$date == 100] <- sample(c(60:85), 1, rep = TRUE)
                              coh$av[coh$date == 100] <- 4}, NA)
  coh$gr.marg <- coh$av * coh$num
  cohorts <- rbind(cohorts, coh)
}

cohorts$cohort <- formatC(cohorts$cohort, width = 3, format = 'd', flag = '0')
cohorts$cohort <- paste('coh:week:', cohorts$cohort, sep = '')
cohorts$date <- formatC(cohorts$date, width = 3, format = 'd', flag = '0')
cohorts$date <- paste('cal_week:', cohorts$date, sep = '')
cohorts$week.lt <- formatC(cohorts$week.lt, width = 3, format = 'd', flag = '0')
cohorts$week.lt <- paste('week:', cohorts$week.lt, sep = '')

# calculating CLV to date
cohorts <- cohorts %>%
  group_by(cohort) %>%
  mutate(clv = cumsum(gr.marg) / num[week.lt == 'week:001']) %>%
  ungroup()

# color palette
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e",
          "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")

# the eight heatmaps share the same structure, so we wrap it in a helper function
plot_heatmap <- function(df, x, fill, t, title) {
  ggplot(df, aes_string(y = 'cohort', x = x, fill = fill)) +
    theme_minimal() +
    geom_tile(colour = "white", width = .9, height = .9) +
    scale_fill_gradientn(colours = cols, limits = c(0, t),
                         breaks = seq(0, t, by = t / 4),
                         labels = c("0", round(t / 4 * 1, 1), round(t / 4 * 2, 1),
                                    round(t / 4 * 3, 1), round(t / 4 * 4, 1)),
                         guide = guide_colourbar(ticks = T, nbin = 50, barheight = .5,
                                                 label = T, barwidth = 10)) +
    theme(legend.position = 'bottom',
          legend.direction = "horizontal",
          plot.title = element_text(size = 20, face = "bold", vjust = 2),
          axis.text.x = element_text(size = 8, angle = 90, hjust = .5, vjust = .5, face = "plain")) +
    ggtitle(title)
}

# heatmaps based on the number of active customers
t <- max(cohorts$num)
plot_heatmap(cohorts, 'date', 'num', t,
             "Cohort Activity Heatmap (number of customers who purchased - calendar view)")
plot_heatmap(cohorts, 'week.lt', 'num', t,
             "Cohort Activity Heatmap (number of customers who purchased - lifetime view)")

# heatmaps based on gross margin
t <- max(cohorts$gr.marg)
plot_heatmap(cohorts, 'date', 'gr.marg', t, "Heatmap based on Gross margin (calendar view)")
plot_heatmap(cohorts, 'week.lt', 'gr.marg', t, "Heatmap based on Gross margin (lifetime view)")

# heatmaps based on per-customer gross margin
t <- max(cohorts$av)
plot_heatmap(cohorts, 'date', 'av', t, "Heatmap based on per customer gross margin (calendar view)")
plot_heatmap(cohorts, 'week.lt', 'av', t, "Heatmap based on per customer gross margin (lifetime view)")

# heatmaps based on CLV to date
t <- max(cohorts$clv)
plot_heatmap(cohorts, 'date', 'clv', t,
             "Heatmap based on CLV to date of customers who ever purchased (calendar view)")
plot_heatmap(cohorts, 'week.lt', 'clv', t,
             "Heatmap based on CLV to date of customers who ever purchased (lifetime view)")
This is the third post about LifeCycle Grids. You can find the first post, about the sense of LifeCycle Grids and the A-Z process for creating and visualizing them with the R language, here. And here is the second post, about adding monetary metrics (customer lifetime value, CLV, and customer acquisition cost, CAC) to the LifeCycle Grids.
Even after we added CLV and CAC to the LifeCycle Grids and obtained a representative view, there is always room for improvement. The main problem is that we combined customers with different characteristics: source of acquisition, actual lifetime, and other features that we could use for better analysis. Therefore, in order to make our analysis more detailed, in this post we will study how to combine Cohort Analysis and LifeCycle Grids.
The main principle of Cohort Analysis is to combine customers through some common characteristics (e.g. registration date, first purchase date, medium/source/campaign of attraction and so on). Cohort Analysis allows us to split customers into homogeneous groups. Therefore, we can obtain benefits from combining homogeneous cohorts through acquisition characteristics with the homogeneous groups (grids) through lifecycle phase.
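Assigning a first-purchase-date cohort to each client is the smallest building block of this approach: take the minimum order date per client and format it to the cohort granularity (a month here). A minimal sketch with a hypothetical purchase log:

```r
library(dplyr)

# hypothetical purchase log
orders <- data.frame(clientId  = c(1, 1, 2, 3, 3),
                     orderdate = as.Date(c('2012-06-10', '2012-09-01', '2012-07-03',
                                           '2012-06-25', '2012-08-14')))

# cohort = month of each client's first purchase
fp.cohorts <- orders %>%
  group_by(clientId) %>%
  summarise(cohort = format(min(orderdate), '%Y-%m'))

fp.cohorts$cohort
# "2012-06" "2012-07" "2012-06"
```

The same `format(min(orderdate), '%Y-%m')` expression appears inside the full pipeline below; any other shared acquisition characteristic (medium, source, campaign) can replace it to form different cohorts.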
In addition, we will assume that we have not only actual CLV (CLV to date) but also predicted CLV (potential value). This can be helpful in some cases, for example when different advertisement campaigns are targeted to various customer segments that have different behavior and potential values as a result.
We will study how to combine customers into both first purchase date cohorts and campaign cohorts and distribute them into LifeCycle Grids, what interesting perspectives this gives us for analysis, and how to visualize the results.
Ok, let’s start by creating a data sample with the following code:
click to expand R code
# loading libraries
library(dplyr)
library(reshape2)
library(ggplot2)
library(googleVis)

set.seed(10)

# creating orders data sample
data <- data.frame(orderId = sample(c(1:5000), 25000, replace = TRUE),
                   product = sample(c('NULL', 'a', 'b', 'c'), 25000, replace = TRUE,
                                    prob = c(0.15, 0.65, 0.3, 0.15)))
order <- data.frame(orderId = c(1:5000),
                    clientId = sample(c(1:1500), 5000, replace = TRUE))
date <- data.frame(orderId = c(1:5000),
                   orderdate = sample((1:500), 5000, replace = TRUE))
orders <- merge(data, order, by = 'orderId')
orders <- merge(orders, date, by = 'orderId')
orders <- orders[orders$product != 'NULL', ]
orders$orderdate <- as.Date(orders$orderdate, origin = "2012-01-01")
rm(data, date, order)

# creating data frames with CAC, Gross margin, Campaigns and Potential CLV
gr.margin <- data.frame(product = c('a', 'b', 'c'), grossmarg = c(1, 2, 3))
campaign <- data.frame(clientId = c(1:1500),
                       campaign = paste('campaign', sample(c(1:7), 1500, replace = TRUE), sep = ' '))
cac <- data.frame(campaign = unique(campaign$campaign), cac = sample(c(10:15), 7, replace = TRUE))
campaign <- merge(campaign, cac, by = 'campaign')
potential <- data.frame(clientId = c(1:1500), clv.p = sample(c(0:50), 1500, replace = TRUE))
rm(cac)

# reporting date
today <- as.Date('2013-05-16', format = '%Y-%m-%d')
As a result, we’ve obtained the following data frames:
We will start by calculating necessary indexes (CLV, frequency, recency, potential value, CAC and average time lapses between purchases), adding campaigns, and defining cohort features based on the first purchase date for each customer with the following code:
click to expand R code
# calculating CLV, frequency, recency, average time lapses between purchases and defining cohorts
orders <- merge(orders, gr.margin, by = 'product')

customers <- orders %>%
  # combining products and summarising gross margin
  group_by(orderId, clientId, orderdate) %>%
  summarise(grossmarg = sum(grossmarg)) %>%
  ungroup() %>%
  # calculating frequency, recency, average time lapses between purchases and defining cohorts
  group_by(clientId) %>%
  mutate(frequency = n(),
         recency = as.numeric(today - max(orderdate)),
         av.gap = round(as.numeric(max(orderdate) - min(orderdate)) / frequency, 0),
         cohort = format(min(orderdate), format = '%Y-%m')) %>%
  ungroup() %>%
  # calculating CLV to date
  group_by(clientId, cohort, frequency, recency, av.gap) %>%
  summarise(clv = sum(grossmarg)) %>%
  arrange(clientId) %>%
  ungroup()

# calculating potential CLV and CAC
customers <- merge(customers, campaign, by = 'clientId')
customers <- merge(customers, potential, by = 'clientId')
# leading the potential value to a more or less realistic value
customers$clv.p <- round(customers$clv.p / sqrt(customers$recency) * customers$frequency, 2)
rm(potential, gr.margin, today)
Therefore, we’ve obtained the customers data frame that looks like:
  clientId  cohort frequency recency av.gap clv   campaign cac clv.p
1        1 2012-06         5      23     60  32 campaign 2  14 25.02
2        2 2012-01         2     426     36  20 campaign 4  10  4.65
3        3 2012-09         4      64     48  24 campaign 4  10 17.50
4        4 2012-03         2     286     66  25 campaign 2  14  0.24
5        5 2012-01         6      89     66  54 campaign 1  11 11.45
6        6 2012-04         5      85     64  27 campaign 3  12  3.25
Furthermore, we need to define segments based on frequency and recency values. We will do this with the following code:
click to expand R code
# adding segments
customers <- customers %>%
  mutate(segm.freq = ifelse(between(frequency, 1, 1), '1',
                     ifelse(between(frequency, 2, 2), '2',
                     ifelse(between(frequency, 3, 3), '3',
                     ifelse(between(frequency, 4, 4), '4',
                     ifelse(between(frequency, 5, 5), '5', '>5')))))) %>%
  mutate(segm.rec = ifelse(between(recency, 0, 30), '0-30 days',
                    ifelse(between(recency, 31, 60), '31-60 days',
                    ifelse(between(recency, 61, 90), '61-90 days',
                    ifelse(between(recency, 91, 120), '91-120 days',
                    ifelse(between(recency, 121, 180), '121-180 days', '>180 days'))))))

# defining order of boundaries
customers$segm.freq <- factor(customers$segm.freq, levels = c('>5', '5', '4', '3', '2', '1'))
customers$segm.rec <- factor(customers$segm.rec,
                             levels = c('>180 days', '121-180 days', '91-120 days',
                                        '61-90 days', '31-60 days', '0-30 days'))
Ok, now it is time to combine Cohort Analysis and LifeCycle Grids into a mixed segmentation model.
We will start with a fairly common approach to combining cohorts, specifically first purchase date cohorts, where the first purchase date is used for grouping customers into cohorts. Let’s take a look at this mixed segmentation from three perspectives, which I believe can be interesting:
Let’s work with these perspectives. We will start by combining LifeCycle Grids and first purchase date cohorts using the following code:
click to expand R code
lcg.coh <- customers %>%
  group_by(cohort, segm.rec, segm.freq) %>%
  # calculating cumulative values
  summarise(quantity = n(),
            cac = sum(cac),
            clv = sum(clv),
            clv.p = sum(clv.p),
            av.gap = sum(av.gap)) %>%
  ungroup() %>%
  # calculating average values
  mutate(av.cac = round(cac / quantity, 2),
         av.clv = round(clv / quantity, 2),
         av.clv.p = round(clv.p / quantity, 2),
         av.clv.tot = av.clv + av.clv.p,
         av.gap = round(av.gap / quantity, 2),
         diff = av.clv - av.cac)
1. Structure of averages and comparison of cohorts
We will start with two trivial charts:
click to expand R code
ggplot(lcg.coh, aes(x = cohort, fill = cohort)) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  geom_bar(aes(y = diff), stat = 'identity', alpha = 0.5) +
  geom_text(aes(y = diff, label = round(diff, 0)), size = 4) +
  facet_grid(segm.freq ~ segm.rec) +
  theme(axis.text.x = element_text(angle = 90, hjust = .5, vjust = .5, face = "plain")) +
  ggtitle("Cohorts in LifeCycle Grids - difference between av.CLV to date and av.CAC")

ggplot(lcg.coh, aes(x = cohort, fill = cohort)) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  geom_bar(aes(y = av.clv.tot), stat = 'identity', alpha = 0.2) +
  geom_text(aes(y = av.clv.tot + 10, label = round(av.clv.tot, 0), color = cohort), size = 4) +
  geom_bar(aes(y = av.clv), stat = 'identity', alpha = 0.7) +
  geom_errorbar(aes(y = av.cac, ymax = av.cac, ymin = av.cac), color = 'red', size = 1.2) +
  geom_text(aes(y = av.cac, label = round(av.cac, 0)), size = 4, color = 'darkred', vjust = .5) +
  facet_grid(segm.freq ~ segm.rec) +
  theme(axis.text.x = element_text(angle = 90, hjust = .5, vjust = .5, face = "plain")) +
  ggtitle("Cohorts in LifeCycle Grids - total av.CLV and av.CAC")
Let’s look at the cell [>5 purchases : 91-120 days]. The 2012-01 cohort has the highest actual customer net value and the highest total CLV. It is the oldest cohort, so it has had more chances to become more valuable. Compared to the 2012-02 cohort, the difference in lifetime is only one month, but its values are significantly better. Therefore, we can distribute our limited advertising budget more accurately than by knowing only the grid’s total average.
2. Analyzing customer flows
Let’s study how we can visualize customers’ flows from cell to cell with a Sankey diagram. Assume we want to see the progress of the 2012-09 cohort as of three dates: 2012-10-01, 2013-01-01, and 2013-04-01. We will use the following code:
# customers flows analysis (FPD cohorts)
library(googleVis) # for gvisSankey

# defining cohort and reporting dates
coh <- '2012-09'
report.dates <- c('2012-10-01', '2013-01-01', '2013-04-01')
report.dates <- as.Date(report.dates, format='%Y-%m-%d')

# defining segments for each cohort's customer for the reporting dates
df.sankey <- data.frame()
for (i in 1:length(report.dates)) {
 orders.cache <- orders %>%
  filter(orderdate < report.dates[i])
 customers.cache <- orders.cache %>%
  select(-product, -grossmarg) %>%
  unique() %>%
  group_by(clientId) %>%
  mutate(frequency=n(),
   recency=as.numeric(report.dates[i] - max(orderdate)),
   cohort=format(min(orderdate), format='%Y-%m')) %>%
  ungroup() %>%
  select(clientId, frequency, recency, cohort) %>%
  unique() %>%
  filter(cohort==coh) %>%
  mutate(segm.freq=ifelse(between(frequency, 1, 1), '1 purch',
   ifelse(between(frequency, 2, 2), '2 purch',
   ifelse(between(frequency, 3, 3), '3 purch',
   ifelse(between(frequency, 4, 4), '4 purch',
   ifelse(between(frequency, 5, 5), '5 purch', '>5 purch')))))) %>%
  mutate(segm.rec=ifelse(between(recency, 0, 30), '0-30 days',
   ifelse(between(recency, 31, 60), '31-60 days',
   ifelse(between(recency, 61, 90), '61-90 days',
   ifelse(between(recency, 91, 120), '91-120 days',
   ifelse(between(recency, 121, 180), '121-180 days', '>180 days')))))) %>%
  mutate(cohort.segm=paste(cohort, segm.rec, segm.freq, sep=' : '),
   report.date=report.dates[i]) %>%
  select(clientId, cohort.segm, report.date)
 df.sankey <- rbind(df.sankey, customers.cache)
}

# processing data into the Sankey diagram format
df.sankey <- dcast(df.sankey, clientId ~ report.date, value.var='cohort.segm', fun.aggregate = NULL)
write.csv(df.sankey, 'customers_path.csv', row.names=FALSE)
df.sankey <- df.sankey %>% select(-clientId)

df.sankey.plot <- data.frame()
for (i in 2:ncol(df.sankey)) {
 df.sankey.cache <- df.sankey %>%
  group_by(df.sankey[ , i-1], df.sankey[ , i]) %>%
  summarise(n=n()) %>%
  ungroup()
 colnames(df.sankey.cache)[1:2] <- c('from', 'to')
 df.sankey.cache$from <- paste(df.sankey.cache$from, ' (', report.dates[i-1], ')', sep='')
 df.sankey.cache$to <- paste(df.sankey.cache$to, ' (', report.dates[i], ')', sep='')
 df.sankey.plot <- rbind(df.sankey.plot, df.sankey.cache)
}

# plotting
plot(gvisSankey(df.sankey.plot, from='from', to='to', weight='n',
 options=list(height=900, width=1800, sankey="{link:{color:{fill:'lightblue'}}}")))
Note: if you plot this chart on your computer, it is interactive and you can highlight any paths and checkpoints.
Therefore, we can easily identify dominant paths, find the proportion of the best or worst clients, target activities at customers who are at a specific checkpoint of their path and analyze the effect of these activities, and compare the progress of different cohorts. Lastly, we saved the path of each client in the customers_path.csv file, which you can use for future work.
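One simple way to surface the dominant paths from the saved file is to concatenate each client's cells into a single string and tabulate. A minimal self-contained sketch (the toy data frame and its column names d1/d2 are hypothetical stand-ins for the date columns of customers_path.csv):

```r
# toy stand-in for the date columns of customers_path.csv (hypothetical data)
paths <- data.frame(d1 = c('1 purch', '1 purch', '2 purch'),
                    d2 = c('2 purch', '2 purch', '3 purch'),
                    stringsAsFactors = FALSE)

# concatenating the cells visited by each client into one path string
path.str <- apply(paths, 1, paste, collapse = ' -> ')

# tabulating path frequencies to find the dominant ones
sort(table(path.str), decreasing = TRUE)
```

With real data you would read the CSV first and drop the clientId column before concatenating.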
3. Analyzing purchasing pace
We will start by plotting the average time lapses between purchases. Actually, we already calculated this index when we created the lcg.coh data frame:
ggplot(lcg.coh, aes(x=cohort, fill=cohort)) +
 theme_bw() +
 theme(panel.grid = element_blank()) +
 geom_bar(aes(y=av.gap), stat='identity', alpha=0.6) +
 geom_text(aes(y=av.gap, label=round(av.gap,0)), size=4) +
 facet_grid(segm.freq ~ segm.rec) +
 theme(axis.text.x=element_text(angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Cohorts in LifeCycle Grids - average time lapses between purchases")
This is not surprising: the time lapses increase with the age of the cohort. Of course, we have to take into account the boundaries on the recency axis: the larger the range of a boundary, the higher the probability of finding older cohorts with a higher pace than younger ones. However, our main goal is to work with the values we’ve calculated.
Here are some points I want you to pay attention to:
The second example of cohort analysis is to combine customers by the advertising campaign that attracted them. This can obviously be helpful because we usually attract different customers with different campaigns. Therefore, we can expect that clients attracted by the same campaign share some similarities in behavior and are sensitive to the same offers/communication channels. Furthermore, we can easily compare the progress of campaigns in terms of monetary value (CLV and CAC).
We will use the same charts as for the first purchase date cohorts. Because we added the campaign name to the data sample earlier, we only need to adapt our code by replacing “cohort” with “campaign”:
# campaign cohorts
lcg.camp <- customers %>%
 group_by(campaign, segm.rec, segm.freq) %>%
 # calculating cumulative values
 summarise(quantity=n(),
  cac=sum(cac),
  clv=sum(clv),
  clv.p=sum(clv.p),
  av.gap=sum(av.gap)) %>%
 ungroup() %>%
 # calculating average values
 mutate(av.cac=round(cac/quantity, 2),
  av.clv=round(clv/quantity, 2),
  av.clv.p=round(clv.p/quantity, 2),
  av.clv.tot=av.clv+av.clv.p,
  av.gap=round(av.gap/quantity, 2),
  diff=av.clv-av.cac)

ggplot(lcg.camp, aes(x=campaign, fill=campaign)) +
 theme_bw() +
 theme(panel.grid = element_blank()) +
 geom_bar(aes(y=diff), stat='identity', alpha=0.5) +
 geom_text(aes(y=diff, label=round(diff,0)), size=4) +
 facet_grid(segm.freq ~ segm.rec) +
 theme(axis.text.x=element_text(angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Campaigns in LifeCycle Grids - difference between av.CLV to date and av.CAC")

ggplot(lcg.camp, aes(x=campaign, fill=campaign)) +
 theme_bw() +
 theme(panel.grid = element_blank()) +
 geom_bar(aes(y=av.clv.tot), stat='identity', alpha=0.2) +
 geom_text(aes(y=av.clv.tot+10, label=round(av.clv.tot,0), color=campaign), size=4) +
 geom_bar(aes(y=av.clv), stat='identity', alpha=0.7) +
 geom_errorbar(aes(y=av.cac, ymax=av.cac, ymin=av.cac), color='red', size=1.2) +
 geom_text(aes(y=av.cac, label=round(av.cac,0)), size=4, color='darkred', vjust=.5) +
 facet_grid(segm.freq ~ segm.rec) +
 theme(axis.text.x=element_text(angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Campaigns in LifeCycle Grids - total av.CLV and av.CAC")

ggplot(lcg.camp, aes(x=campaign, fill=campaign)) +
 theme_bw() +
 theme(panel.grid = element_blank()) +
 geom_bar(aes(y=av.gap), stat='identity', alpha=0.6) +
 geom_text(aes(y=av.gap, label=round(av.gap,0)), size=4) +
 facet_grid(segm.freq ~ segm.rec) +
 theme(axis.text.x=element_text(angle=90, hjust=.5, vjust=.5, face="plain")) +
 ggtitle("Campaigns in LifeCycle Grids - average time lapses between purchases")
And we’ve obtained these charts:
I believe everything is clear with these charts. I don’t think the Sankey diagram would be helpful enough for campaign cohorts: if we have general campaigns that run for a long period, we can obtain chaotic paths. Instead, I suggest studying a more accurate and visual approach that works for campaigns as well as for first purchase date cohorts.
Each customer has a path of migration from one cell to another that is based on purchasing behavior and affects CLV. All customers start in the same initial cell [1 purchase : 0-30 days], but within at most 30 days (in our case) they begin their journey through the grids. My idea is to analyze the path patterns of each cohort and identify the cohorts that attracted customers with preferable (or undesirable) paths in order to make relevant offers. We will use lifecycle phase sequential analysis for this. Note: you can find an example of shopping cart sequential analysis in my previous posts, starting here, so you can obtain other benefits of the method.
All we need for this is to reproduce the path through the grids for each customer. We will do this with the following code:
# lifecycle phase sequential analysis
library(TraMineR)

min.date <- min(orders$orderdate)
max.date <- max(orders$orderdate)
l <- c(seq(0, as.numeric(max.date - min.date), 10), as.numeric(max.date - min.date))

df <- data.frame()
for (i in l) {
 cur.date <- min.date + i
 print(cur.date)
 orders.cache <- orders %>%
  filter(orderdate <= cur.date)
 customers.cache <- orders.cache %>%
  select(-product, -grossmarg) %>%
  unique() %>%
  group_by(clientId) %>%
  mutate(frequency=n(),
   recency=as.numeric(cur.date - max(orderdate))) %>%
  ungroup() %>%
  select(clientId, frequency, recency) %>%
  unique() %>%
  mutate(segm=
   ifelse(between(frequency, 1, 2) & between(recency, 0, 60), 'new customer',
   ifelse(between(frequency, 1, 2) & between(recency, 61, 180), 'under risk new customer',
   ifelse(between(frequency, 1, 2) & recency > 180, '1x buyer',
   ifelse(between(frequency, 3, 4) & between(recency, 0, 60), 'engaged customer',
   ifelse(between(frequency, 3, 4) & between(recency, 61, 180), 'under risk engaged customer',
   ifelse(between(frequency, 3, 4) & recency > 180, 'former engaged customer',
   ifelse(frequency > 4 & between(recency, 0, 60), 'best customer',
   ifelse(frequency > 4 & between(recency, 61, 180), 'under risk best customer',
   ifelse(frequency > 4 & recency > 180, 'former best customer', NA)))))))))) %>%
  mutate(report.date=i) %>%
  select(clientId, segm, report.date)
 df <- rbind(df, customers.cache)
}
We’ve checked the position of each customer in the LifeCycle Grids as of past dates. Here are two things I want you to pay attention to. First, if you have a lot of customers, it would take a long time to reproduce the grids for each past day. Therefore, we’ve used 10-day gaps in the loop, which doesn’t seem detrimental in terms of accuracy, as our minimal recency time lapse is 60 days.
Secondly, it would be quite tough to work with 36 grid cells. Therefore, we used 9 segments, which you can adapt to your needs: new customer, under risk new customer, 1x buyer, engaged customer, under risk engaged customer, former engaged customer, best customer, under risk best customer, and former best customer.
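The nested ifelse above can also be wrapped in a small standalone helper, which makes the 9-segment boundaries easier to tweak. A sketch for a single customer (same frequency/recency boundaries as in the loop; the function name is mine):

```r
lifecycle.segment <- function(frequency, recency) {
  # frequency band (same boundaries as the nested ifelse above):
  # 1-2 purchases = new, 3-4 = engaged, more than 4 = best
  base <- if (frequency <= 2) 'new customer' else
          if (frequency <= 4) 'engaged customer' else 'best customer'
  if (recency <= 60) return(base)                       # active
  if (recency <= 180) return(paste('under risk', base)) # fading
  # inactive for more than 180 days
  if (base == 'new customer') '1x buyer' else paste('former', base)
}

lifecycle.segment(1, 30)   # 'new customer'
lifecycle.segment(6, 200)  # 'former best customer'
```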
We will do sequential analysis with the following code:
# converting data to the sequence format
df <- dcast(df, clientId ~ report.date, value.var='segm', fun.aggregate = NULL)
df.seq <- seqdef(df, 2:ncol(df), left='DEL', right='DEL', xtstep=10)

# creating df with first purch.date and campaign cohort features
feat <- df %>% select(clientId)
feat <- merge(feat, campaign[, 1:2], by='clientId')
feat <- merge(feat, customers[, 1:2], by='clientId')

# plotting the 10 most frequent sequences based on campaign
seqfplot(df.seq, border=NA, group=feat$campaign)
seqfplot(df.seq, border=NA, group=feat$campaign, cex.legend=0.9)

# plotting the 10 most frequent sequences based on first purch.date cohort
coh.list <- sort(unique(feat$cohort))
# defining cohorts for plotting
feat.coh.list <- feat[feat$cohort %in% coh.list[1:6], ]
df.coh <- df %>% filter(clientId %in% c(feat.coh.list$clientId))
df.seq.coh <- seqdef(df.coh, 2:ncol(df.coh), left='DEL', right='DEL', xtstep=10)
seqfplot(df.seq.coh, border=NA, group=feat.coh.list$cohort, cex.legend=0.9)
What I really like about this approach is that we’ve easily brought all customers to point zero of their lifetime with us. What I mean is that we replaced the first day of each customer’s lifetime (an exact calendar date) with day 0 for all customers. This way, we switched from calendar dates to sequence dates, so all our sequences start from day 0. The white spaces mean that the next cell is unknown at the moment.
We can compare cohorts via the share of customers with different paths and current lifecycle phases (the last color stripe). We can see, for instance, that the 2012-01 cohort has brought us some customers who are the best now (yellow stripe), but the 2012-03 cohort has not.
This way, we can identify different patterns in the paths. For example, we can see the history of migrations of the current best customers. Have they become the best ones by avoiding the “under risk” or “former” segments? Was there anything that could have affected them, and how could we use this with other clients?
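Pulling out those migration histories is straightforward once the sequences are aligned. A small sketch on toy data (one row per client, one column per checkpoint; the t0/t1/t2 column names and values are hypothetical, mirroring the structure of the df object above):

```r
# toy aligned sequences: one row per client, one column per checkpoint
seqs <- data.frame(t0 = c('new customer', 'new customer', 'new customer'),
                   t1 = c('engaged customer', 'under risk new customer', 'engaged customer'),
                   t2 = c('best customer', '1x buyer', 'best customer'),
                   stringsAsFactors = FALSE)

# last known state per client (ignoring trailing NAs for shorter histories)
last.state <- apply(seqs, 1, function(x) tail(x[!is.na(x)], 1))

# migration histories of the clients who are the best customers now
seqs[last.state == 'best customer', ]
```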
Conclusions. We’ve studied how cohort analysis can help us combine customers into groups based on common characteristics and obtain a clearer view of the differences between customers who are in the same cell of the LifeCycle Grids. Also, we’ve touched upon sequential analysis, which helped us find some patterns in the customers’ journeys through the grids. Lastly, we’ve found that customers who are in the same lifecycle phase can have significantly different purchasing behaviors. Therefore, this can be a topic for future work: how to create cohorts based on purchasing behaviors/patterns.
Thank you for reading this! Feel free to share your thoughts about it.
We studied a very powerful approach to customer segmentation in the previous post, which is based on the customer’s lifecycle. We used two metrics: frequency and recency. It is also possible, and very helpful, to add monetary value to our segmentation. If you have customer acquisition cost (CAC) and customer lifetime value (CLV) data, you can easily add them to the calculations.
We will create the same data sample as in the previous post, but with two added data frames:
# loading libraries
library(dplyr)
library(reshape2)
library(ggplot2)

# creating data sample
set.seed(10)
data <- data.frame(orderId=sample(c(1:1000), 5000, replace=TRUE),
 product=sample(c('NULL','a','b','c'), 5000, replace=TRUE, prob=c(0.15, 0.65, 0.3, 0.15)))
order <- data.frame(orderId=c(1:1000),
 clientId=sample(c(1:300), 1000, replace=TRUE))
gender <- data.frame(clientId=c(1:300),
 gender=sample(c('male', 'female'), 300, replace=TRUE, prob=c(0.40, 0.60)))
date <- data.frame(orderId=c(1:1000),
 orderdate=sample((1:100), 1000, replace=TRUE))
orders <- merge(data, order, by='orderId')
orders <- merge(orders, gender, by='clientId')
orders <- merge(orders, date, by='orderId')
orders <- orders[orders$product!='NULL', ]
orders$orderdate <- as.Date(orders$orderdate, origin="2012-01-01")

# creating data frames with CAC and gross margin
cac <- data.frame(clientId=unique(orders$clientId),
 cac=sample(c(10:15), 289, replace=TRUE))
gr.margin <- data.frame(product=c('a', 'b', 'c'), grossmarg=c(1, 2, 3))
rm(data, date, order, gender)
Next, we will calculate CLV to date (the actual amount we have earned) using gross margin values and the product orders. We will use the following code:
# reporting date
today <- as.Date('2012-04-11', format='%Y-%m-%d')

# calculating customer lifetime value
orders <- merge(orders, gr.margin, by='product')
clv <- orders %>%
 group_by(clientId) %>%
 summarise(clv=sum(grossmarg)) %>%
 ungroup()

# processing data
orders <- dcast(orders, orderId + clientId + gender + orderdate ~ product, value.var='product', fun.aggregate=length)
orders <- orders %>%
 group_by(clientId) %>%
 mutate(frequency=n(),
  recency=as.numeric(today - orderdate)) %>%
 filter(orderdate==max(orderdate)) %>%
 filter(orderId==max(orderId)) %>%
 ungroup()

orders.segm <- orders %>%
 mutate(segm.freq=ifelse(between(frequency, 1, 1), '1',
  ifelse(between(frequency, 2, 2), '2',
  ifelse(between(frequency, 3, 3), '3',
  ifelse(between(frequency, 4, 4), '4',
  ifelse(between(frequency, 5, 5), '5', '>5')))))) %>%
 mutate(segm.rec=ifelse(between(recency, 0, 6), '0-6 days',
  ifelse(between(recency, 7, 13), '7-13 days',
  ifelse(between(recency, 14, 19), '14-19 days',
  ifelse(between(recency, 20, 45), '20-45 days',
  ifelse(between(recency, 46, 80), '46-80 days', '>80 days')))))) %>%
 # creating last cart feature
 mutate(cart=paste(ifelse(a!=0, 'a', ''),
  ifelse(b!=0, 'b', ''),
  ifelse(c!=0, 'c', ''), sep='')) %>%
 arrange(clientId)

# defining order of boundaries
orders.segm$segm.freq <- factor(orders.segm$segm.freq, levels=c('>5', '5', '4', '3', '2', '1'))
orders.segm$segm.rec <- factor(orders.segm$segm.rec, levels=c('>80 days', '46-80 days', '20-45 days', '14-19 days', '7-13 days', '0-6 days'))
Note: if you prefer to use potential/expected/predicted CLV or total CLV (sum of CLV to date and potential CLV) you can adapt this code or find the example in the next post.
In addition, we need to merge orders.segm with the CAC and CLV data and combine these data with the segments. We will calculate total CAC and CLV to date, as well as their averages, with the following code:
orders.segm <- merge(orders.segm, cac, by='clientId')
orders.segm <- merge(orders.segm, clv, by='clientId')

lcg.clv <- orders.segm %>%
 group_by(segm.rec, segm.freq) %>%
 # calculating cumulative CAC and CLV
 summarise(quantity=n(),
  cac=sum(cac),
  clv=sum(clv)) %>%
 ungroup() %>%
 # calculating CAC and CLV per client
 mutate(cac1=round(cac/quantity, 2),
  clv1=round(clv/quantity, 2))

lcg.clv <- melt(lcg.clv, id.vars=c('segm.rec', 'segm.freq', 'quantity'))
Ok, let’s plot two charts: the first one representing the totals and the second one representing the averages:
ggplot(lcg.clv[lcg.clv$variable %in% c('clv', 'cac'), ], aes(x=variable, y=value, fill=variable)) +
 theme_bw() +
 theme(panel.grid = element_blank()) +
 geom_bar(stat='identity', alpha=0.6, aes(width=quantity/max(quantity))) +
 geom_text(aes(y=value, label=value), size=4) +
 facet_grid(segm.freq ~ segm.rec) +
 ggtitle("LifeCycle Grids - CLV vs CAC (total)")

ggplot(lcg.clv[lcg.clv$variable %in% c('clv1', 'cac1'), ], aes(x=variable, y=value, fill=variable)) +
 theme_bw() +
 theme(panel.grid = element_blank()) +
 geom_bar(stat='identity', alpha=0.6, aes(width=quantity/max(quantity))) +
 geom_text(aes(y=value, label=value), size=4) +
 facet_grid(segm.freq ~ segm.rec) +
 ggtitle("LifeCycle Grids - CLV vs CAC (average)")
You can see in the grids that the width of the bars depends on the number of customers. I think these visualizations are very helpful. You can see the difference between CLV to date and CAC and make decisions about paid campaigns or initiatives like:
Therefore, we have obtained a very interesting visualization. We can analyze and make decisions based on the three customer lifecycle metrics: recency, frequency, and monetary value.
Thank you for reading this!
I want to share a very powerful approach to customer segmentation in this post. It is based on the customer’s lifecycle, specifically on the frequency and recency of purchases. The idea of using these metrics comes from RFM analysis. Recency and frequency are very important behavioral metrics: we are interested in frequent and recent purchases, because frequency affects a client’s lifetime value and recency affects retention. Therefore, these metrics can help us understand the current phase of the client’s lifecycle. When we know each client’s phase, we can split the customer base into groups (segments) in order to:
For this, we will use a matrix called LifeCycle Grids. We will study how to process initial (transactional) data into the matrix, how to visualize it, and how to do some in-depth analysis. We will do all these steps with the R programming language.
Let’s create a data sample with the following code:
# loading libraries
library(dplyr)
library(reshape2)
library(ggplot2)

# creating data sample
set.seed(10)
data <- data.frame(orderId=sample(c(1:1000), 5000, replace=TRUE),
 product=sample(c('NULL','a','b','c'), 5000, replace=TRUE, prob=c(0.15, 0.65, 0.3, 0.15)))
order <- data.frame(orderId=c(1:1000),
 clientId=sample(c(1:300), 1000, replace=TRUE))
gender <- data.frame(clientId=c(1:300),
 gender=sample(c('male', 'female'), 300, replace=TRUE, prob=c(0.40, 0.60)))
date <- data.frame(orderId=c(1:1000),
 orderdate=sample((1:100), 1000, replace=TRUE))
orders <- merge(data, order, by='orderId')
orders <- merge(orders, gender, by='clientId')
orders <- merge(orders, date, by='orderId')
orders <- orders[orders$product!='NULL', ]
orders$orderdate <- as.Date(orders$orderdate, origin="2012-01-01")
rm(data, date, order, gender)
The head of our data sample looks like:
  orderId clientId product gender  orderdate
1       1      254       a female 2012-04-03
2       1      254       b female 2012-04-03
3       1      254       c female 2012-04-03
4       1      254       b female 2012-04-03
5       2      151       a female 2012-01-31
6       2      151       b female 2012-01-31
You can see that the customer’s gender is in the table. We will use it later as an example of some in-depth analysis. I recommend using any additional features you have for seeking insights: client source, channel, campaign, geo data, and so on.
A few words about LifeCycle Grids. It is a matrix with two dimensions: purchase frequency (how many purchases the customer has made) and recency (how many days have passed since the last purchase).
The first step is to think about grids suitable for your business. It is impossible to work with an infinite number of segments, so we need to define boundaries for frequency and recency that split customers into homogeneous groups (segments). Analyzing the distribution of frequency and recency in our data set, combined with knowledge of the business, can help us find suitable boundaries.
Therefore, we need to calculate two values for each customer: the purchase frequency and the recency of the last purchase.
Then, plot the distribution with the following code:
# reporting date
today <- as.Date('2012-04-11', format='%Y-%m-%d')

# processing data
orders <- dcast(orders, orderId + clientId + gender + orderdate ~ product, value.var='product', fun.aggregate=length)
orders <- orders %>%
 group_by(clientId) %>%
 mutate(frequency=n(),
  recency=as.numeric(today - orderdate)) %>%
 filter(orderdate==max(orderdate)) %>%
 filter(orderId==max(orderId)) %>%
 ungroup()

# exploratory analysis
ggplot(orders, aes(x=frequency)) +
 theme_bw() +
 scale_x_continuous(breaks=c(1:10)) +
 geom_bar(alpha=0.6, binwidth=1) +
 ggtitle("Distribution by frequency")

ggplot(orders, aes(x=recency)) +
 theme_bw() +
 geom_bar(alpha=0.6, binwidth=1) +
 ggtitle("Distribution by recency")
Early behavior is the most important, so finer detail is good there. Usually, there is a significant difference between customers who bought 1 time and those who bought 3 times, but is there any difference between customers who bought 50 times and others who bought 53 times? That is why it makes sense to set narrow boundaries for lower values and wider gaps for higher ones. We will use the following boundaries: 1, 2, 3, 4, 5, and more than 5 purchases for frequency, and 0-6, 7-13, 14-19, 20-45, 46-80, and more than 80 days for recency.
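If eyeballing the histograms feels arbitrary, empirical quantiles can suggest boundary candidates directly, since they split customers into roughly equal-sized groups. A small sketch with simulated recency values (the rexp parameters are illustrative, not taken from our sample):

```r
set.seed(1)
recency <- rexp(300, rate = 1/30)  # hypothetical recency values, mean around 30 days

# 20/40/60/80% quantiles as boundary candidates; round them to
# convenient whole-day boundaries afterwards
round(quantile(recency, probs = c(0.2, 0.4, 0.6, 0.8)))
```

In practice you would still adjust the resulting cut points by hand so they match meaningful business intervals.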
Next, we need to add segments to each client based on the boundaries. We will also create a new variable, ‘cart’, which includes the products from the last cart, for doing in-depth analysis.
orders.segm <- orders %>%
 mutate(segm.freq=ifelse(between(frequency, 1, 1), '1',
  ifelse(between(frequency, 2, 2), '2',
  ifelse(between(frequency, 3, 3), '3',
  ifelse(between(frequency, 4, 4), '4',
  ifelse(between(frequency, 5, 5), '5', '>5')))))) %>%
 mutate(segm.rec=ifelse(between(recency, 0, 6), '0-6 days',
  ifelse(between(recency, 7, 13), '7-13 days',
  ifelse(between(recency, 14, 19), '14-19 days',
  ifelse(between(recency, 20, 45), '20-45 days',
  ifelse(between(recency, 46, 80), '46-80 days', '>80 days')))))) %>%
 # creating last cart feature
 mutate(cart=paste(ifelse(a!=0, 'a', ''),
  ifelse(b!=0, 'b', ''),
  ifelse(c!=0, 'c', ''), sep='')) %>%
 arrange(clientId)

# defining order of boundaries
orders.segm$segm.freq <- factor(orders.segm$segm.freq, levels=c('>5', '5', '4', '3', '2', '1'))
orders.segm$segm.rec <- factor(orders.segm$segm.rec, levels=c('>80 days', '46-80 days', '20-45 days', '14-19 days', '7-13 days', '0-6 days'))
We now have everything we need to create the LifeCycle Grids. We will combine clients into segments with the following code:
lcg <- orders.segm %>%
 group_by(segm.rec, segm.freq) %>%
 summarise(quantity=n()) %>%
 mutate(client='client') %>%
 ungroup()
The classic matrix can be created with the following code:
lcg.matrix <- dcast(lcg, segm.freq ~ segm.rec, value.var='quantity', fun.aggregate=sum)
However, I think a better visualization can be obtained with the following code:
ggplot(lcg, aes(x=client, y=quantity, fill=quantity)) +
 theme_bw() +
 theme(panel.grid = element_blank()) +
 geom_bar(stat='identity', alpha=0.6) +
 geom_text(aes(y=max(quantity)/2, label=quantity), size=4) +
 facet_grid(segm.freq ~ segm.rec) +
 ggtitle("LifeCycle Grids")
I’ve added colored borders for a better understanding of how to work with this matrix. We have four quadrants:
Does it make sense to make the same offer to all of these customers? Certainly, it doesn’t! It makes sense to create different approaches not only for each quadrant, but for border cells as well.
What I really like about this model of segmentation is that it is stable and alive at the same time. It is alive in terms of customer flow: every day, with or without purchases, customers flow from one cell to another. And it is stable in terms of working with segments: it allows you to work with customers who are in the same lifecycle phase. That means you can create suitable campaigns/offers/emails for each cell, or several close cells, and use them constantly.
Ok, it’s time to study how we can do some in-depth analysis. R allows us to create subsegments and visualize them effectively. It can be helpful to break down each cell by some feature. For instance, we can distribute customers by gender. As another example, where our products have different lifecycles, it can be helpful to analyze which product(s) were in the last cart, or we can combine these features. Let’s do this with the following code:
lcg.sub <- orders.segm %>%
 group_by(gender, cart, segm.rec, segm.freq) %>%
 summarise(quantity=n()) %>%
 mutate(client='client') %>%
 ungroup()

ggplot(lcg.sub, aes(x=client, y=quantity, fill=gender)) +
 theme_bw() +
 scale_fill_brewer(palette='Set1') +
 theme(panel.grid = element_blank()) +
 geom_bar(stat='identity', position='fill', alpha=0.6) +
 facet_grid(segm.freq ~ segm.rec) +
 ggtitle("LifeCycle Grids by gender (proportion)")
ggplot(lcg.sub, aes(x=gender, y=quantity, fill=cart)) +
 theme_bw() +
 scale_fill_brewer(palette='Set1') +
 theme(panel.grid = element_blank()) +
 geom_bar(stat='identity', position='fill', alpha=0.6) +
 facet_grid(segm.freq ~ segm.rec) +
 ggtitle("LifeCycle Grids by gender and last cart (proportion)")
Therefore, there is a lot of room for creativity. If you want to know much more about LifeCycle Grids and strategies for working with the quadrants, I highly recommend reading Jim Novo’s works, e.g. this blogpost.
Thank you for reading this!
This is the third part of the in-depth analysis of shopping cart sequences. We processed the initial data into the required format, did the exploratory analysis, and started the in-depth analysis in the first post. Then we used cluster analysis to create customer segments in the second post. As I mentioned in the first post, a sequence can be represented as either states or events. So far we have dealt with sequences of states, which helped us find some patterns in customer behavior, including time lapses between purchases, and create the dummy variable ‘nopurch’ for customers who have left us with high probability.
Here, we will focus on analyzing sequences of events, which can be helpful as well. We will cover how to find patterns of events: for instance, events that occur systematically together and in the same order, relationships with customers’ characteristics (typical differences in event sequences between men and women), and association rules between event subsequences.
First of all, we need to create the event sequence object. We can do this easily by converting the state sequence object that we already have with the following code:
df.evseq <- seqecreate(df.seq, tevent='state') # converting state object to event sequence
head(df.evseq)
## [1] (a)-46-(a;b)-24
## [2] (a;c)-26-(a;b)-24
## [3] (a;c)-20-(a;b;c)-27-(a;b)-20-(a;c)-8
## [4] (a;c)-38-(a)-13-(a;b)-26
## [5] (a;b)-4-(c)-10-(a;c)-10-(nopurch)-39
## [6] (a;b)-33-(a;b;c)-27-(a;b)-12
As you can see, the df.evseq object includes time lapses and this will allow us to use these data for some custom analysis later.
We start by searching for frequent event subsequences. A subsequence is formed by a subset of the events that respects the order of the events in the sequence. For instance, (a;b) -> (a) is a subsequence of (a;b) -> (a;b;c) -> (a), since the order of the events is respected. A subsequence is called “frequent” if it occurs in more than a given minimum number of sequences. This required minimum number of sequences to which the subsequence must belong is called the minimum support, and it should be set by us.
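To make the definition concrete, here is a toy order-respecting containment check. This is a simplification (it treats each cart as a single symbol and ignores time stamps), not TraMineR's internal algorithm:

```r
# checks whether 'sub' is a subsequence of 'seq': every event of 'sub'
# must appear in 'seq' in the same relative order
is.subsequence <- function(sub, seq) {
  j <- 1
  for (ev in seq) {
    # advance through 'sub' whenever the next needed event appears
    if (j <= length(sub) && identical(ev, sub[[j]])) j <- j + 1
  }
  j > length(sub)
}

is.subsequence(c('(a;b)', '(a)'), c('(a;b)', '(a;b;c)', '(a)'))  # TRUE: order respected
is.subsequence(c('(a)', '(a;b)'), c('(a;b)', '(a;b;c)', '(a)'))  # FALSE: order violated
```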
The minimum support can be defined as a percentage with the pMinSupport argument or as a count with the minSupport argument. Since our data set is small, we will create a subsequence list object with a minimum support of 1% and plot the first 10 subsequences with the following code:
df.subseq <- seqefsub(df.evseq, pMinSupport=0.01) # searching for frequent event subsequences
plot(df.subseq[1:10], col="cyan", ylab="Frequency", xlab="Subsequences", cex=1.5) # plotting
In addition to the minimum support, TraMineR also allows us to control the search for frequent subsequences with time constraints for custom analysis. For instance, we can specify:
For example, if we want to find the subsequences that are enclosed within a 30-day window with no more than 10 days between two transitions, we would use the following code:
time.constraint <- seqeconstraint(maxGap=10, windowSize=30) # creating variable with conditions
df.subseq.time.constr <- seqefsub(df.evseq, pMinSupport=0.01, constraint=time.constraint) # searching for frequent event subsequences
plot(df.subseq.time.constr[1:10], col="cyan", ylab="Frequency", xlab="Subsequences", cex=1.5) # plotting
Furthermore, we can identify the frequent subsequences that are most strongly related to a given factor, i.e. find discriminant event subsequences. The discriminant power is evaluated with the p-value of a Chi-square test of independence, and the subsequences are ordered by decreasing discriminant power. As a reminder, in the first post we created the factor variable (df.feat$sex), which contains the gender of each client. We will search for the subsequences related to the client’s gender with the following code:
discrseq <- seqecmpgroup(df.subseq, group=df.feat$sex) # searching for frequent subsequences related to gender
head(discrseq)
plot(discrseq[1:10], cex=1.5) # plotting 10 frequent subsequences
plot(discrseq[1:10], ptype="resid", cex=1.5) # plotting 10 residuals
##             Subsequence    Support    p.value statistic index Freq.female
## 1             (a)-(a;c) 0.07612457 0.05445187  3.698792    21  0.05000000
## 2         (a;c)-(a;b;c) 0.12456747 0.06200868  3.482828    11  0.15555556
## 3   (a;c)-(a;b;c)-(a;b) 0.02768166 0.06257011  3.467916    37  0.04444444
## 4           (a;c)-(a;c) 0.02768166 0.06626360  3.373233    38  0.01111111
## 5                   (a) 0.36678201 0.10055558  2.696710     4  0.32777778
## 6   (a)-(a;c)-(nopurch) 0.01038062 0.10127354  2.685372    78  0.00000000
##    Freq.male Resid.female Resid.male
## 1 0.11926606   -1.2703487   1.632473
## 2 0.07339450    1.1779548  -1.513741
## 3 0.00000000    1.3517187  -1.737038
## 4 0.05504587   -1.3362173   1.717118
## 5 0.43119266   -0.8640601   1.110368
## 6 0.02752294   -1.3669353   1.756592
##
## Computed on 289 event sequences
##   Constraint Value
##  countMethod  COBJ
In the resulting plots, the color of each bar is defined by the associated Pearson residual of the Chi-square test. For residuals below -2 (dark red), the subsequence is significantly less frequent than expected under independence, whereas for residuals greater than 2 (dark blue), it is significantly more frequent. We plotted two charts: the first one displays frequencies, the second one residuals. There are several sequences that we can say are related to men, and we need to pay attention to the (a) -> (a;c) -> (nopurch) one, because it increases our customer churn rate.
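The ±2 reading rule can be captured in a tiny helper (a sketch of the interpretation above, not a TraMineR function; the function name is mine):

```r
residual.signal <- function(r) {
  if (r > 2) 'significantly more frequent' else        # dark blue
  if (r < -2) 'significantly less frequent' else       # dark red
  'not significant'
}

residual.signal(2.5)   # 'significantly more frequent'
residual.signal(-1.3)  # 'not significant'
```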
And finally, we will search for sequential association rules. Association rule learning is a popular and well-researched method for discovering relations between variables (subsequences in our case). We will search for rules with the following code:
rules <- TraMineR:::seqerules(df.subseq) # searching for rules
head(rules)
##              Rules Support      Conf      Lift Standardlift  JMeasure
## 1 (a;b;c) => (a;b)      71 0.4057143 0.6043888    0.2700793 0.2129659
## 2 (a;b) => (a;b;c)      62 0.3195876 0.5277761    0.2069131 0.2404954
## 3   (a) => (a;b;c)      38 0.3584906 0.5920216    0.3571436 0.1789503
## 4     (a) => (a;b)      37 0.3490566 0.5199864    0.3319886 0.3122991
## 5   (a;b) => (a;c)      36 0.1855670 0.4965636    0.3125759 0.1212142
## 6 (a;c) => (a;b;c)      36 0.3333333 0.5504762    0.3319335 0.2176328
##   ImplicStat   p.value p.valueB1 p.valueB2
## 1  1.3809650 0.9163551         1         1
## 2  1.1973199 0.8844091         1         1
## 3  0.6474871 0.7413416         1         1
## 4  2.0592757 0.9802661         1         1
## 5  1.4967428 0.9327699         1         1
## 6  0.9802203 0.8365113         1         1
Pay attention here: association rules learning uses a minimum support value as its main parameter (1% in our case). You should take this into account when calculating the df.subseq variable and defining pMinSupport.
As a result, we obtain rules in the if/then format along with several important parameters. For instance, the first one, (a;b;c) => (a;b), means: if a customer buys the (a;b;c) cart, s/he is likely to also buy the (a;b) cart. The support parameter means that there are 71 subsequences which contain both the (a;b;c) and (a;b) subsequences. The confidence of 0.4057 means that 40.57% of the time a customer buys (a;b;c), (a;b) is bought as well.
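The relationship between support and confidence can be checked with simple arithmetic. The count of 175 sequences containing (a;b;c) below is my assumption, implied by the printed support and confidence but not itself part of the output:

```r
# confidence = support(X and then Y) / support(X)
n.lhs  <- 175  # sequences containing (a;b;c) -- an assumed count
n.both <- 71   # sequences containing (a;b;c) followed by (a;b), per the Support column
conf <- n.both / n.lhs
round(conf, 4)  # matches the Conf column for the first rule (~0.4057)
```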
Thank you for reading this!
This is the second part of the in-depth sequence analysis. In the previous post, we processed data into the required format, plotted a Sankey diagram, and did some distribution, frequency, time lapse and entropy analysis with visualization. For dessert, clustering! Clustering is an exploratory data analysis method aimed at automatically finding homogeneous groups, or clusters, in the data. It reduces the large number of distinct sequences to a few types of trajectories.
Let’s assume that we want to identify four segments of customers based on their behavior (purchase sequences). We will use the Ward hierarchical clustering method for clustering our customers with the following code:
# CLUSTERING
library(cluster)
df.om <- seqdist(df.seq, method='OM', indel=1, sm='TRATE', with.missing=TRUE) # computing the optimal matching distances
clusterward <- agnes(df.om, diss=TRUE, method="ward") # building a Ward hierarchical clustering
df.cl4 <- cutree(clusterward, k=4) # cutting the tree to create 4 clusters
cl4.lab <- factor(df.cl4, labels=paste("Cluster", 1:4)) # creating a label with the cluster number for each customer
Once we have identified clusters, we can plot three types of graphics we are familiar with from the previous post. These graphics can help us to identify the typical patterns that characterize the clusters. We will start with a distribution analysis for each cluster which shows the state distribution at each time point (the columns of the sequence object), continue with a frequency plot, and finish with a mean time spent in each state plot:
# distribution chart
seqdplot(df.seq, group=cl4.lab, border=NA)
# frequency chart
seqfplot(df.seq, group=cl4.lab, pbarw=T, border=NA)
# mean time plot
seqmtplot(df.seq, group=cl4.lab, border=NA)
A more advanced clustering approach is also possible. The command below finds and plots the representative set that, with a neighborhood radius of 10% (the default tsim value), covers at least 35% (the trep parameter) of the sequences in each of the four cl4.lab groups:
seqrplot(df.seq, group=cl4.lab, dist.matrix=df.om, trep=0.35, border=NA)
In the resulting plot, the selected representative sequences are plotted bottom-up according to their representativeness score, with a bar width proportional to the number of sequences assigned to them. At the top of the plot, two parallel series of symbols, each standing for a representative sequence, are displayed horizontally on a scale ranging from 0 to the maximal theoretical distance Dmax. The location of the symbol associated with a representative sequence indicates, on axis A, the discrepancy within the subset of sequences and, on axis B, the mean distance to the representative sequence.
We learn from the plots that nine, three, one and three representatives, respectively, are necessary for each of the four groups to achieve the 35% coverage and that the actual coverage is 36.5%, 36.4%, 38.3% and 43.6%, respectively.
So, what is the main point of the preceding analysis? We can use it for:
and so on.
Be sociable, share this post!
Although the Sankey diagram from the previous post provided us with a very descriptive tool, we can consider it a rather exploratory analysis. As I mentioned, sequence mining can give us the opportunity to recommend this or that product based on previous purchases, but we should find the right moment and patterns in purchasing behavior. Therefore, the Sankey diagram is not enough, as it doesn’t show the duration between purchases. The other challenge is to understand whether the customer has left us or just hasn’t made his/her next purchase yet. Therefore, in this post you will find techniques which can help you find patterns in customer behavior and churn based on purchase sequences. And you will find several interesting visualizations.
I will use an amazing R package – TraMineR. It allows us to extract all (or even more) data that we need. I highly recommend that you read this package manual because I won’t cover all features it has.
After we load the necessary libraries with the following code,
library(dplyr)
library(TraMineR)
library(reshape2)
library(googleVis)
we will simulate a sample of the data set. Suppose we sell 3 products (or product categories), A, B and C, and the client can purchase any combinations of products. Also, we know the date of purchase and the customer’s gender. Let’s do this with the following code:
# creating an example of shopping carts
set.seed(10)
data <- data.frame(orderId=sample(c(1:1000), 5000, replace=TRUE),
                   product=sample(c('NULL','a','b','c'), 5000, replace=TRUE,
                                  prob=c(0.15, 0.65, 0.3, 0.15)))
order <- data.frame(orderId=c(1:1000),
                    clientId=sample(c(1:300), 1000, replace=TRUE))
sex <- data.frame(clientId=c(1:300),
                  sex=sample(c('male', 'female'), 300, replace=TRUE, prob=c(0.40, 0.60)))
date <- data.frame(orderId=c(1:1000),
                   orderdate=sample((1:90), 1000, replace=TRUE))
orders <- merge(data, order, by='orderId')
orders <- merge(orders, sex, by='clientId')
orders <- merge(orders, date, by='orderId')
orders <- orders[orders$product!='NULL', ]
orders$orderdate <- as.Date(orders$orderdate, origin="2012-01-01")
rm(data, date, order, sex)
Let’s take a look at the data frame we obtained. It looks similar to reality (head(orders) function):
##   orderId clientId product    sex  orderdate
## 1       1      254       a female 2012-03-25
## 2       1      254       b female 2012-03-25
## 3       1      254       c female 2012-03-25
## 4       1      254       b female 2012-03-25
## 5       2      151       a female 2012-01-28
## 6       2      151       b female 2012-01-28
Next, we will combine the products of each order into a cart. It is possible that the customer made two or more purchases on the same date. For instance, the client purchased product A on 2012-01-01 at 10:00 and products B and C on 2012-01-01 at 10:02. To me, this is the same shopping cart/order (A, B, C) which was split for some reason, but probably these two carts were created during the same session/visit. It is really easy to combine products with the following code:
# combining products to the cart
df <- orders %>%
  arrange(product) %>%
  select(-orderId) %>%
  unique() %>%
  group_by(clientId, sex, orderdate) %>%
  summarise(cart=paste(product, collapse=";")) %>%
  ungroup()
Finally, we have a df data frame which looks like (head(df) function):
##   clientId    sex  orderdate cart
## 1        1   male 2012-01-22    a
## 2        1   male 2012-02-14    a
## 3        1   male 2012-03-08  a;b
## 4        1   male 2012-03-14  a;b
## 5        2 female 2012-02-11  a;c
## 6        2 female 2012-03-08  a;b
After this, we are ready to process carts/orders into the required format. But there are some important points I want you to pay attention to. First, we need to define when we consider a client lost (churned). There are several possible approaches, for example:
a) the client hasn’t purchased for the last X days/months/years,

b) the client hasn’t purchased for X days/months/years since his/her last purchase,

c) the client hasn’t purchased for an individually defined period since his/her last purchase.
I will share a combination of the b) and c) approaches. For instance, we assume that our usual client should purchase once per month (30 days), and we will use this parameter for clients who purchased only once. Also, we will take into account the customer’s purchasing habits: for clients who made more than one purchase, we will calculate the average time lapse between the customer’s purchases and define the critical period as that average time lapse multiplied by 1.5.
This approach allows us to identify broken sequences; it can be helpful for finding patterns of customer churn and prevents us from counting cart/order states with an improbable duration.
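The threshold logic described above can be sketched as a small helper function (the name churn.threshold is mine, not part of the loop that follows):

```r
# critical period: 30 days for one-time buyers, otherwise
# 1.5 * the average gap between the first and last purchase
churn.threshold <- function(order.dates) {
  if (length(order.dates) == 1) return(30)  # one-time buyers: fixed 30 days
  avg.gap <- as.numeric(max(order.dates) - min(order.dates)) /
    (length(order.dates) - 1)
  round(avg.gap * 1.5)
}

# client #1 from the data below: 52 days / 3 periods * 1.5 = 26 days
churn.threshold(as.Date(c('2012-01-22', '2012-02-14', '2012-03-08', '2012-03-14')))
```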
We will use a loop to extract each client from the data set, calculate the average time lapse between purchases (with a 1.5 coefficient) or use 30 days for one-time buyers, and add both the ‘nopurch’ dummies and the end date for each cart (state) with the following code:
max.date <- max(df$orderdate) + 1
ids <- unique(df$clientId)
df.new <- data.frame()

for (i in 1:length(ids)) {
  df.cache <- df %>% filter(clientId==ids[i])
  ifelse(nrow(df.cache)==1,
         av.dur <- 30,
         av.dur <- round(((max(df.cache$orderdate) - min(df.cache$orderdate))/(nrow(df.cache)-1))*1.5, 0))
  df.cache <- rbind(df.cache,
                    data.frame(clientId=df.cache$clientId[nrow(df.cache)],
                               sex=df.cache$sex[nrow(df.cache)],
                               orderdate=max(df.cache$orderdate)+av.dur,
                               cart='nopurch'))
  ifelse(max(df.cache$orderdate) > max.date,
         df.cache$orderdate[which.max(df.cache$orderdate)] <- max.date,
         NA)
  df.cache$to <- c(df.cache$orderdate[2:nrow(df.cache)]-1, max.date)
  # order # for Sankey diagram
  df.cache <- df.cache %>% mutate(ord = paste('ord', c(1:nrow(df.cache)), sep=''))
  df.new <- rbind(df.new, df.cache)
}

# filtering dummies
df.new <- df.new %>% filter(cart!='nopurch' | to != orderdate)
rm(orders, df, df.cache, i, ids, max.date, av.dur)
Let’s take a look at the first 4 clients (head(df.new, n=16) function):
##    clientId    sex  orderdate    cart         to  ord
## 1         1   male 2012-01-22       a 2012-02-13 ord1
## 2         1   male 2012-02-14       a 2012-03-07 ord2
## 3         1   male 2012-03-08     a;b 2012-03-13 ord3
## 4         1   male 2012-03-14     a;b 2012-03-31 ord4
## 5         2 female 2012-02-11     a;c 2012-03-07 ord1
## 6         2 female 2012-03-08     a;b 2012-03-10 ord2
## 7         2 female 2012-03-11     a;b 2012-03-31 ord3
## 8         3 female 2012-01-17     a;c 2012-02-05 ord1
## 9         3 female 2012-02-06   a;b;c 2012-03-03 ord2
## 10        3 female 2012-03-04     a;b 2012-03-23 ord3
## 11        3 female 2012-03-24     a;c 2012-03-31 ord4
## 12        4 female 2012-01-05     a;c 2012-01-31 ord1
## 13        4 female 2012-02-01     a;c 2012-02-11 ord2
## 14        4 female 2012-02-12       a 2012-02-24 ord3
## 15        4 female 2012-02-25     a;b 2012-03-21 ord4
## 16        4 female 2012-03-22 nopurch 2012-04-01 ord5
The calculation for client #1 is the following:
2012-03-14 – 2012-01-22 = 52 days / 3 periods = 17 days * 1.5 = 26 days. So, the average duration is 26 days and he is still our client, because the duration from 2012-03-14 to our reporting date (2012-04-01) is 18 days.
You can see the ‘nopurch’ cart in client #4’s sequence because:
2012-02-25 – 2012-01-05 = 51 days / 3 periods = 17 days * 1.5 = 26 days. So, the average duration is 26 days and she is not our client, because the duration from 2012-02-25 to our reporting date (2012-04-01) is 36 days (longer than 26 days).
Let’s create a Sankey diagram with the data we have:
##### Sankey diagram #######
df.sankey <- df.new %>% select(clientId, cart, ord)
df.sankey <- dcast(df.sankey, clientId ~ ord, value.var='cart', fun.aggregate = NULL)
df.sankey[is.na(df.sankey)] <- 'unknown'

# choosing a length of sequence
df.sankey <- df.sankey %>% select(ord1, ord2, ord3, ord4)

# replacing NAs after 'nopurch' with 'nopurch'
df.sankey[df.sankey[, 2]=='nopurch', 3] <- 'nopurch'
df.sankey[df.sankey[, 3]=='nopurch', 4] <- 'nopurch'

df.sankey.plot <- data.frame()

for (i in 2:ncol(df.sankey)) {
  df.sankey.cache <- df.sankey %>%
    group_by(df.sankey[ , i-1], df.sankey[ , i]) %>%
    summarise(n=n()) %>%
    ungroup()
  colnames(df.sankey.cache)[1:2] <- c('from', 'to')
  # adding tags to carts
  df.sankey.cache$from <- paste(df.sankey.cache$from, '(', i-1, ')', sep='')
  df.sankey.cache$to <- paste(df.sankey.cache$to, '(', i, ')', sep='')
  df.sankey.plot <- rbind(df.sankey.plot, df.sankey.cache)
}

plot(gvisSankey(df.sankey.plot, from='from', to='to', weight='n',
                options=list(height=900, width=1800,
                             sankey="{link:{color:{fill:'lightblue'}}}")))

rm(df.sankey, df.sankey.cache, df.sankey.plot, i)
Now we can see both broken sequences (‘nopurch’ variable) and ‘unknown’ states. This means that we defined customers as ‘alive’ but they didn’t make their next purchases as of the reporting date:
Ok, we can start an in-depth analysis. Because TraMineR doesn’t work with the date format, we will convert dates to numbers. Also, we will change unclear numbers (e.g. 14636, 14684, etc.) to the much clearer 1, 2, 3 and so on with the following code:
df.new <- df.new %>%
  # choosing a length of sequence
  filter(ord %in% c('ord1', 'ord2', 'ord3', 'ord4')) %>%
  select(-ord)

# converting dates to numbers
min.date <- as.Date(min(df.new$orderdate), format="%Y-%m-%d")
df.new$orderdate <- as.numeric(df.new$orderdate - min.date + 1)
df.new$to <- as.numeric(df.new$to - min.date + 1)
From this point on, we will start to work on our main goal. First of all, we need to create a variable in the TraMineR format. The data frame we created is in SPELL format. Since TraMineR’s default format is STS, we will create a new STS variable (df.form) with the following code:
df.form <- seqformat(df.new, id='clientId', begin='orderdate', end='to',
                     status='cart', from='SPELL', to='STS', process=FALSE)
Furthermore, we will create the TraMineR object and see the summary with the following code:
df.seq <- seqdef(df.form, left='DEL', right='unknown', xtstep=10, void='unknown') # xtstep - step between ticks (days)
summary(df.seq)
Note: I used the left='DEL' parameter in order to remove NAs. The reason for the occurrence of NAs is, for example, that our min date in the data set was 2012-01-01, which was converted to the y1 value. If the customer’s first purchase was on 2012-01-02, or y2, the algorithm generates an NA for y1. In this case, the left='DEL' parameter moves the whole sequence one step back (from y2 to y1). Therefore, all of our sequences start from the y1 day. This way, we switched from calendar dates to sequence dates. The other parameters, right='unknown' and void='unknown', mean that we replaced NAs and void elements at the end of the sequences with ‘unknown’. This is helpful for customers who are ‘alive’ but didn’t make their next purchase as of the reporting date.
Also, we will use the client’s gender as a feature in the analysis. Therefore, we will create a feature with the following code:
df.feat <- unique(df.new[ , c('clientId', 'sex')])
We will start with a distribution analysis which shows the state distribution at each time point (the columns of the sequence object) and plot two charts:
# distribution analysis
seqdplot(df.seq, border=NA, withlegend='right')
seqdplot(df.seq, border=NA, group=df.feat$sex) # distribution based on gender
You can find some differences between the female and male cart/order distributions; for example, take a look at the (A;B) carts. Also, you can see an abrupt increase of ‘nopurch’ carts on the 31st day. This isn’t surprising, because we used 30 days as the critical time lapse for one-time buyers.
Furthermore, we can obtain the numeric data with the function:
seqstatd(df.seq)
In order to exclude the ‘unknown’ state from subsequent charts, we will reprocess our sequence object with the following code:
df.seq <- seqdef(df.form, left='DEL', right='DEL', xtstep=10)
We will analyse the most frequent sequences with the following charts and stats:
# the 10 most frequent sequences
seqfplot(df.seq, border=NA, withlegend='right')
# the 10 most frequent sequences based on gender
seqfplot(df.seq, group=df.feat$sex, border=NA)
# returning the frequency stats
seqtab(df.seq) # frequency table
seqtab(df.seq[, 1:30]) # frequency table for the 1st month
Each sequence is plotted as a horizontal bar split in as many colorized cells as there are states in the sequence. The sequences are ordered by decreasing frequency from bottom up and the bar widths are set proportionally to the sequence frequency. You can find, for instance, that male buyers didn’t purchase after the (A;B) carts while female buyers have long sequences of the same carts. Furthermore, we can see the most frequent sequences of shopping carts and exact day when states changed.
We will calculate the mean time spent on each state (with each cart/order) with the following code:
# mean time spent in each state
seqmtplot(df.seq, title='Mean time', withlegend='right')
seqmtplot(df.seq, group=df.feat$sex, title='Mean time')
statd <- seqistatd(df.seq) # returns, for each sequence, the time spent in the different states
apply(statd, 2, mean) # the mean time spent in each state
You can see that the average time in, for instance, the (A;B) state is longer for female buyers, while for (A) and (A;C) it is longer for male ones.
We will analyze entropy with the following code:
# calculating entropy
df.ient <- seqient(df.seq)
hist(df.ient, col='cyan', main=NULL, xlab='Entropy') # plotting a histogram of the within entropy of the sequences
# entropy distribution based on gender
df.ent <- cbind(df.seq, df.ient)
boxplot(Entropy ~ df.feat$sex, data=df.ent, xlab='Gender', ylab='Sequences entropy', col='cyan')
The chart shows that the entropy is slightly higher for the male clients when compared to the female ones.
I will cover clustering of clients based on their sequences in the next post. Don’t miss it if this interests you!
We studied how we can visualize the structure of a shopping cart in the previous post. Although you can find a great deal of material on how to analyze combinations of products in the shopping cart (e.g. via association rules), there is a lack of sources on how to analyze the sequences of shopping carts. This post is an attempt to fill that gap.
The sequence analysis of shopping carts can bring you useful knowledge of patterns in customer behavior. You can discover dependencies between product sets. For example, a client bought products A and B in the first cart but only product A in both the second and third carts; probably, he wasn’t satisfied with product B (its price, quality, etc.). Or you might discover that after an “A, B, C” cart, clients purchased product D very often. This gives you the opportunity to recommend this product to clients who didn’t purchase D after an “A, B, C” cart.
As I’m a big fan of visualization I will recommend an interesting chart for this analysis: Sankey diagram. So, let’s start!
After we load the necessary libraries with the following code,
# loading libraries
library(googleVis)
library(dplyr)
library(reshape2)
we will simulate an example of the data set. Suppose we sell 3 products (or product categories), A, B and C, and each product can be sold with a different probability. Also, a client can purchase any combinations of products. Let’s do this with the following code:
# creating an example of orders
set.seed(15)
df <- data.frame(orderId=c(1:1000),
                 clientId=sample(c(1:300), 1000, replace=TRUE),
                 prod1=sample(c('NULL','a'), 1000, replace=TRUE, prob=c(0.15, 0.5)),
                 prod2=sample(c('NULL','b'), 1000, replace=TRUE, prob=c(0.15, 0.3)),
                 prod3=sample(c('NULL','c'), 1000, replace=TRUE, prob=c(0.15, 0.2)))

# combining products
df$cart <- paste(df$prod1, df$prod2, df$prod3, sep=';')
df$cart <- gsub('NULL;|;NULL', '', df$cart)
df <- df[df$cart!='NULL', ]
df <- df %>%
  select(orderId, clientId, cart) %>%
  arrange(clientId, orderId, cart)
We generated 1000 orders from 300 clients and our data set looks like this:
head(df)
##   orderId clientId  cart
## 1     451        1 a;b;c
## 2     217        2   a;b
## 3     261        2   a;b
## 4     577        2   a;b
## 5     902        2     c
## 6     199        3 a;b;c
After this, we need to arrange orders from each client with the following code. Note: we assume that the order/cart serial numbers were assigned based on the purchase date. In other cases, you can use purchase date for identifying the sequence.
orders <- df %>%
  group_by(clientId) %>%
  mutate(n.ord = paste('ord', c(1:n()), sep='')) %>%
  ungroup()
The head of the data frame we obtain is:
head(orders)
##   orderId clientId  cart n.ord
## 1     451        1 a;b;c  ord1
## 2     217        2   a;b  ord1
## 3     261        2   a;b  ord2
## 4     577        2   a;b  ord3
## 5     902        2     c  ord4
## 6     199        3 a;b;c  ord1
The next step is to create a matrix with sequences with the following code:
orders <- dcast(orders, clientId ~ n.ord, value.var='cart', fun.aggregate = NULL)
The head of the data frame we obtain is:
##   clientId  ord1 ord10 ord11 ord2  ord3 ord4 ord5 ord6 ord7 ord8 ord9
## 1        1 a;b;c  <NA>  <NA> <NA>  <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2        2   a;b  <NA>  <NA>  a;b   a;b    c <NA> <NA> <NA> <NA> <NA>
## 3        3 a;b;c  <NA>  <NA>  a;b     a <NA> <NA> <NA> <NA> <NA> <NA>
## 4        4   a;c  <NA>  <NA>    a   a;c  b;c  a;b <NA> <NA> <NA> <NA>
## 5        5 a;b;c  <NA>  <NA>  a;c a;b;c    a <NA> <NA> <NA> <NA> <NA>
## 6        6     a  <NA>  <NA>  b;c     b <NA> <NA> <NA> <NA> <NA> <NA>
Therefore, we just need to choose a number of carts/orders in the sequence we want to analyze. I will choose 5 carts with the following code:
orders <- orders %>% select(ord1, ord2, ord3, ord4, ord5)
Also, if you have a lot of product combinations instead of 7 as in my example, you can limit them with the filter() function (e.g. filter(ord1==’a;b;c’)) for clarity.
And finally we will create a data set for plotting with the following code:
orders.plot <- data.frame()

for (i in 2:ncol(orders)) {
  ord.cache <- orders %>%
    group_by(orders[ , i-1], orders[ , i]) %>%
    summarise(n=n()) %>%
    ungroup()
  colnames(ord.cache)[1:2] <- c('from', 'to')
  # adding tags to carts
  ord.cache$from <- paste(ord.cache$from, '(', i-1, ')', sep='')
  ord.cache$to <- paste(ord.cache$to, '(', i, ')', sep='')
  orders.plot <- rbind(orders.plot, ord.cache)
}
Note: I added tags to combinations with their number in the sequence because it is impossible to create a Sankey diagram from A product to A product for example. So, I transformed the sequence A –> A to A(1) –> A(2).
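The tagging trick can be shown in isolation (states and tagged are illustrative names, not part of the pipeline above):

```r
# making repeated states unique by appending their position in the sequence
states <- c('a', 'a', 'b')
tagged <- paste(states, '(', seq_along(states), ')', sep='')
tagged  # "a(1)" "a(2)" "b(3)"
```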
Finally, we will get a great type of visualization with the following code:
plot(gvisSankey(orders.plot, from='from', to='to', weight='n', options=list(height=900, width=1800, sankey="{link:{color:{fill:'lightblue'}}}")))
The bandwidths correspond to the weight of each sequence. You can highlight any cart/order and any path through the sequence as well. The size of the plot can be changed via the height and width parameters. Note: the NAs in our chart mean that the sequence has ended. Feel free to share your ideas and comments!
This post was updated on 12/05/2015.
In this post, we will review a very interesting type of visualization, the Multilayer Pie Chart, and use it for one of the marketing analytics tasks: shopping cart analysis. We will go from the initial data processing to the visualization of the shopping cart analysis. I will share R code written so that you don’t have to write code for every layer of the chart. You can also find an example of how to create a Multilayer Pie Chart here.
Ok, let’s suppose we have a list of the first orders/carts that were bought by our clients. Each order consists of one or several products (or product categories). Our task is to visualize the relationships between products and see the share of orders that include each product or combination of products. The Multilayer Pie Chart can help us to draw each product and its intersections with others.
After we load the necessary libraries with the following code:
# loading libraries
library(dplyr)
library(reshape2)
library(plotrix)
we will simulate an example of the data set. Suppose we sell 4 products (or product categories): a, b, c and d and each product can be sold with a different probability. Also, a client can purchase any combinations of products, e.g. “a” or “a,b,a,d” and so on. Let’s do this with the following code:
# creating an example of orders
set.seed(15)
df <- data.frame(orderId=sample(c(1:1000), 5000, replace=TRUE),
                 product=sample(c('NULL','a','b','c','d'), 5000, replace=TRUE,
                                prob=c(0.15, 0.65, 0.3, 0.15, 0.1)))
df <- df[df$product!='NULL', ]
After this, we will process the initial data to create a data frame for the analysis with the following code:
# processing initial data
# we need to be sure that product names are unique
df$product <- paste0("#", df$product, "#")
prod.matrix <- df %>%
  # removing duplicated products from each order
  group_by(orderId, product) %>%
  arrange(product) %>%
  unique() %>%
  # combining products to cart and calculating number of products
  group_by(orderId) %>%
  summarise(cart=paste(product, collapse=";"), prod.num=n()) %>%
  # calculating number of carts
  group_by(cart, prod.num) %>%
  summarise(num=n()) %>%
  ungroup()
Let’s take a look at the resulting data frame with the head(prod.matrix) function:
              cart prod.num num
1              #a#        1 123
2          #a#;#b#        2 241
3      #a#;#b#;#c#        3 168
4  #a#;#b#;#c#;#d#        4  71
5      #a#;#b#;#d#        3 125
6          #a#;#c#        2 105
From this point, we start working on our Multilayer Pie Chart. My idea is to place orders that include one product into the core of the chart. Therefore, we’ve calculated the total number of products in each combination (the ‘prod.num’ value) and will split the data frame into two: the first one (one.prod) will include carts with one product and the second one (sev.prod) carts with more than one product.
# calculating the total number of orders/carts
tot <- sum(prod.matrix$num)

# splitting orders into sets with 1 product and more than 1 product
one.prod <- prod.matrix %>% filter(prod.num == 1)
sev.prod <- prod.matrix %>% filter(prod.num > 1) %>% arrange(desc(prod.num))
Therefore, the data is ready for plotting. We will define parameters for the chart with the following code:
# defining parameters for the pie chart
iniR <- 0.2 # initial radius
cols <- c("#ffffff", "#fec44f", "#fc9272", "#a1d99b", "#fee0d2", "#2ca25f",
          "#8856a7", "#43a2ca", "#fdbb84", "#e34a33", "#a6bddb", "#dd1c77",
          "#ffeda0", "#756bb1")
prod <- df %>% select(product) %>% arrange(product) %>% unique()
prod <- c('NO', c(prod$product))
colors <- as.list(setNames(cols[c(1:(length(prod)))], prod))
Note: we’ve defined the color palette with fourteen colors including white color for spaces. This means if you have more than thirteen unique products in the data set, you need to add extra colors. Finally, we will plot the Multilayer Pie Chart with the following code:
# 0 circle: blank
pie(1, radius=iniR, init.angle=90, col=c('white'), border=NA, labels='')

# drawing circles from last to 2nd
for (i in length(prod):2) {
  p <- grep(prod[i], sev.prod$cart)
  col <- rep('NO', times=nrow(sev.prod))
  col[p] <- prod[i]
  floating.pie(0, 0, c(sev.prod$num, tot-sum(sev.prod$num)), radius=(1+i)*iniR,
               startpos=pi/2, col=as.character(colors[c(col, 'NO')]),
               border="#44aaff")
}

# 1 circle: orders with 1 product
floating.pie(0, 0, c(tot-sum(one.prod$num), one.prod$num), radius=2*iniR,
             startpos=pi/2, col=as.character(colors[c('NO', one.prod$cart)]),
             border="#44aaff")

# legend
legend(-1.5, 2*iniR, gsub("_", " ", names(colors)[-1]),
       col=as.character(colors[-1]), pch=19, bty='n', ncol=1)
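As noted above, the hand-picked palette covers only thirteen products. If you have more, a palette can be generated programmatically instead; here is a sketch using base R's rainbow() with a hypothetical product count:

```r
# generating a palette for an arbitrary number of products
n.prod <- 20                        # hypothetical number of unique products
cols <- c('#ffffff', rainbow(n.prod))  # keep white as the first 'space' color
length(cols)  # one color per product plus white
```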
In case you want to add some statistics to the plot, e.g. the total number of each combination or the share of each combination in the total amount, we just need to create this table and add it to the plot with the following code:
# creating a table with the stats
stat.tab <- prod.matrix %>%
  select(-prod.num) %>%
  mutate(share=num/tot) %>%
  arrange(desc(num))

library(scales)
stat.tab$share <- percent(stat.tab$share) # converting values to percents

# adding a table with the stats
addtable2plot(-2.5, -1.5, stat.tab, bty="n", display.rownames=FALSE,
              hlines=FALSE, vlines=FALSE, title="The stats")
Therefore, we’ve studied how the Multilayer Pie Chart can help us to draw each product and its intersections with others.
When we spend more money on attracting new customers than they bring us with their first purchase, we appeal to customer lifetime value (CLV): we expect customers to keep spending with us for years, and that means we expect to earn a profit eventually. In this case, retention is a vital parameter. Most of our customers are fickle and some of them make only one purchase. So, the retention ratio should be controlled and managed as well as possible.
Cohort analysis gives us food for thought. In this case, we will use the data from the previous post. Just to recall, we have the following number of customers who purchased in a particular month of their lifetime:
For testing, you can create this data frame using the code:
cohort.clients <- data.frame(cohort=c('Cohort01','Cohort02',
    'Cohort03','Cohort04','Cohort05','Cohort06','Cohort07',
    'Cohort08','Cohort09','Cohort10','Cohort11','Cohort12'),
  M01=c(11000,0,0,0,0,0,0,0,0,0,0,0),
  M02=c(1900,10000,0,0,0,0,0,0,0,0,0,0),
  M03=c(1400,2000,11500,0,0,0,0,0,0,0,0,0),
  M04=c(1100,1300,2400,13200,0,0,0,0,0,0,0,0),
  M05=c(1000,1100,1400,2400,11100,0,0,0,0,0,0,0),
  M06=c(900,900,1200,1600,1900,10300,0,0,0,0,0,0),
  M07=c(850,900,1100,1300,1300,1900,13000,0,0,0,0,0),
  M08=c(850,850,1000,1200,1100,1300,1900,11500,0,0,0,0),
  M09=c(800,800,950,1100,1100,1250,1000,1200,11000,0,0,0),
  M10=c(800,780,900,1050,1050,1200,900,1200,1900,13200,0,0),
  M11=c(750,750,900,1000,1000,1180,800,1100,1150,2000,11300,0),
  M12=c(740,700,870,1000,900,1100,700,1050,1025,1300,1800,20000))
Firstly, we need to process data to the following view:
That is because we want to compare cohorts’ behavior for the same months of their lifetime. While months M01, M02, …, M12 mean calendar months (January, February, …, December) in the first table, in the second table they are the sequence numbers of lifetime months.
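The shifting idea can be illustrated on a single row. Cohort03 starts in calendar month 3, so its leading zeros are dropped and its values move left to lifetime months M01, M02, … (the first five calendar months of Cohort03 from the data above):

```r
# calendar view of Cohort03's first five months: zeros before the cohort starts
row.cal <- c(0, 0, 11500, 2400, 1400)
# lifetime view: non-zero values shifted left, zeros padded at the end
shifted <- c(row.cal[row.cal != 0], rep(0, sum(row.cal == 0)))
shifted  # 11500 2400 1400 0 0
```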
Suppose the customer data set is in the cohort.clients data frame. The R code for processing the data can be the following:
# connect libraries
library(dplyr)
library(ggplot2)
library(reshape2)

cohort.clients.r <- cohort.clients # create a new data frame
totcols <- ncol(cohort.clients.r) # count the number of columns in the data set

for (i in 1:nrow(cohort.clients.r)) { # loop for shifting each row
  df <- cohort.clients.r[i,] # select a row from the data frame
  df <- df[ , !df[]==0] # remove columns with zeros
  partcols <- ncol(df) # count the number of columns in the row (w/o zeros)
  # fill the columns after the values with zeros
  if (partcols < totcols) df[, c((partcols+1):totcols)] <- 0
  cohort.clients.r[i,] <- df # replace the initial row with the new one
}
Furthermore, we should calculate the retention ratio. I use the formula:
Retention ratio = # clients in particular month / # clients in 1st month of lifetime
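Applied to Cohort01 from the data above, the formula gives, for example:

```r
# Cohort01: 11000 clients purchased in their 1st lifetime month, 1400 in their 3rd
m03.retention <- 1400 / 11000
round(m03.retention, 3)  # ~12.7% of Cohort01 still purchasing in month 3
```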
Here are two alternative codes in R you can use:
# calculate retention (1)
x <- cohort.clients.r[,c(2:13)]
y <- cohort.clients.r[,2]
reten.r <- apply(x, 2, function(x) x/y)
reten.r <- data.frame(cohort=(cohort.clients.r$cohort), reten.r)
or:
# calculate retention (2)
c <- ncol(cohort.clients.r)
reten.r <- cohort.clients.r
for (i in 2:c) {
  reten.r[, (c+i-1)] <- reten.r[, i] / reten.r[, 2]
}
reten.r <- reten.r[,-c(2:c)]
colnames(reten.r) <- colnames(cohort.clients.r)
Here is the result of calculation (reten.r data frame):
And finally I propose to create 3 useful charts for visualizing retention ratio.
1. Cohort retention ratio dynamics:
Note: I’ve removed the first month (M01) from the charts because it is always equal to 1.0 (100%). The red line on the plot is the average ratio. It is easy to identify which cohorts are above and below it. So, the first thought I have is to compare them and find the reasons for such a difference. For example, look at Cohort07 and Cohort06:
2. Chart for analyzing how many customers stick around for the second month:
Our retention ratio decreased from 1.0 (100%) in the first month to 0.1-0.21 (10-21%) in the second month; this is the biggest drop in our example. That is why it is important to see how our dynamic (and its trend) changes from one cohort to another for the second month only. Also, this chart shows month-to-month dynamics, because the second month for Cohort01 is February, for Cohort02 it is March, etc. We see a negative trend (red line) and we should find insights. Also, you can choose any other month you want (follow the notes in the code).
3. And for the dessert – here is my favorite one – Cycle plot:
This plot is a mix of the first and the second charts. It presents the sequence of the second chart for each month and gives us an interesting view. The first (red) curve is the retention of the 2nd month (M02) for cohorts 01 to 11 (the same as the 2nd chart), the second (yellow) curve is the retention of the 3rd month (M03) for cohorts 01 to 10, etc. Here we can see the total trend from month to month as well as a comparison of cohorts within each month. Furthermore, I’ve added two blue lines for Cohort07 and Cohort06 to show the difference between them (you can choose any other cohorts; follow the notes in the code). So, we can see the cycles of each cohort in each month.
The code for these charts is:
#charts
reten.r <- reten.r[, -2] #remove M01 data because it is always 100%
#dynamics analysis chart
cohort.chart1 <- melt(reten.r, id.vars = 'cohort')
colnames(cohort.chart1) <- c('cohort', 'month', 'retention')
cohort.chart1 <- filter(cohort.chart1, retention != 0)
p <- ggplot(cohort.chart1, aes(x=month, y=retention, group=cohort, colour=cohort))
p + geom_line(size=2, alpha=1/2) +
geom_point(size=3, alpha=1) +
geom_smooth(aes(group=1), method = 'loess', size=2, colour='red', se=FALSE) +
labs(title="Cohorts Retention ratio dynamics")

#second month analysis chart
cohort.chart2 <- filter(cohort.chart1, month=='M02') #choose any month instead of M02
p <- ggplot(cohort.chart2, aes(x=cohort, y=retention, colour=cohort))
p + geom_point(size=3) +
geom_line(aes(group=1), size=2, alpha=1/2) +
geom_smooth(aes(group=1), size=2, colour='red', method = 'lm', se=FALSE) +
labs(title="Cohorts Retention ratio for 2nd month")
#cycle plot
cohort.chart3 <- cohort.chart1
cohort.chart3 <- mutate(cohort.chart3, month_cohort = paste(month, cohort))
p <- ggplot(cohort.chart3, aes(x=month_cohort, y=retention, group=month, colour=month))
#choose any cohorts instead of Cohort07 and Cohort06
m1 <- filter(cohort.chart3, cohort=='Cohort07')
m2 <- filter(cohort.chart3, cohort=='Cohort06')
p + geom_point(size=3) +
geom_line(aes(group=month), size=2, alpha=1/2) +
labs(title="Cohorts Retention ratio cycle plot") +
geom_line(data=m1, aes(group=1), colour='blue', size=2, alpha=1/5) +
geom_line(data=m2, aes(group=1), colour='blue', size=2, alpha=1/5) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
I hope you find insights in your data! If you have any questions, opinions, etc., feel free to contact me.
I want to consider an approach to forecasting that I really like and frequently use. It allows including the effect of promo campaigns (or other activities, and other variables as well) in the prediction of the total amount. I will use a fictitious example and data in this post, but it works really well with my real data. So, you can adapt this algorithm to your requirements and test it. Also, it seems simple for non-math people because of its complete automation.
Suppose we sell some service and our business depends on the number of subscribers we attract. Naturally, we measure the number of customers and want to predict their quantity. If we know the customer lifetime value (CLV), it allows us to predict total revenue based on the quantity of customers and the CLV. So, this use case looks justified.
The first way we can use to solve this problem is multiple regression. We can find a great number of relevant indicators that influence the number of subscribers: the service price, seasonality, promotional activities, even the S&P or Dow Jones index, etc. After we have found all the parameters affecting the number of customers, we can fit a formula and predict the number of customers.
This approach has disadvantages:
On the other hand, stock market analysts use time-series forecasting. They are resigned to the fact that stock prices are influenced by a great number of indicators. Thus, they look for dependence inside the price curve itself. This approach is not fully suitable for us either. In case we regularly attracted extra customers via promos (we can see some peaks on the curve), time-series algorithms can identify the peaks as seasonality and draw the future curve with the same peaks. But what can we do if we are not planning promos in those periods, or if we are going to run extra promos or change their intensity?
And a final statement before we start working on our prediction algorithm: I’m sure it is important for marketers to see how their promos or activities affect the number of customers / revenue (or subscribers in our case).
So, our task is to create a model which, on the one hand, doesn’t depend on a great number of predictors (like time-series forecasting) and, on the other hand, includes the promo effect on the total number of subscribers (like regression).
My answer is an extended ARIMA model. ARIMA stands for AutoRegressive Integrated Moving Average. “Extended” means we can include some other information in time-series forecasting based on the ARIMA model. In our case, the other information is the result of the promos we ran and the promos we are going to run in the future. If we repeated promo campaigns every year in the same period and got approximately the same number of new customers, an ARIMA model (not extended) would be enough: it should recognize the peaks as seasonality. We won’t review that example.
Let’s start. Suppose our data is:
We have (from the left to the right):
We need only two variables to make the prediction (‘growth’ and ‘extended’). The other variables are there just for your information. Also, we have the two last months without the number of subscribers (we are going to predict these values), but we should have the promo effect which we are planning to get in the future. Furthermore, the heatmaps of the growth and extended variables look alike. Thus, we can conclude that they are connected.
In the example we will predict the values from the 37th to the 42nd to see the accuracy of the prediction on factual data.
The code in R can be the following:
#load libraries
library(forecast)
library(TSA)
#load data set
df <- read.csv(file='data.csv')
#define periods (convenient for the future; just change the values for the period you want to predict or include in the factual data)
s.date <- c(2010, 1) #start date - factual
e.date <- c(2012, 12) #end date - factual
f.s.date <- c(2013, 1) #start date - prediction
f.e.date <- c(2013, 12) #end date - prediction
#transform values to time series and define past and future periods
growth <- ts(df$Growth, start=s.date, end=e.date, frequency=12)
ext <- ts(df$Extended, start=s.date, end=f.e.date, frequency=12)
past <- window(ext, s.date, e.date)
future <- window(ext, f.s.date, f.e.date)
#ARIMA model
fit <- auto.arima(growth, xreg=past, stepwise=FALSE, approximation=FALSE) #determine model
forecast <- forecast(fit, xreg=future) #make prediction
plot(forecast) #plot chart
summary(forecast) #print predicted values
We should get chart and values:
As you remember, we have factual data for Jan.2013–Apr.2013 which we can compare: 384 vs 451, 1224 vs 1271, 709 vs 796 and 699 vs 753. Although the values are not very close, we can see that the February promo had an effect and we saw a peak. After we add Jan.2013–Mar.2013 to the factual periods, our prediction for April becomes 718, which is closer to 699 than 753. That means once we have factual data, we should recalculate and refine the prediction.
Thus, we have predicted the number of subscribers including the promo campaigns’ effect. If we are not satisfied with this number, we can add some activity and measure the new prediction. Suppose we add a new activity for attracting 523 new customers in April 2013 (this means Extended will be 500 instead of 23). In this case our prediction will be:
We got a new peak of 1295 in April instead of 753 (in the previous prediction). Thus, we have a tool for targeting the number of subscribers; the only thing we need is to actually attract the subscribers we used for the prediction ;).
Note: to make predictions for more periods, just add values of the extended variable to the initial data and change the prediction period in the R code.
In case the described approach works poorly, I can recommend this great book by the ‘forecast’ package creator, prof. Rob J Hyndman, to dig deeper into forecasting.
Let’s continue to exploit the great idea of the ‘layer-cake’ graph.
If you liked the approach I shared in the previous topic, perhaps you have one or two questions we should answer additionally. Recall the “Total revenue by Cohort” chart:
As total revenue depends on the number of customers we attracted and on the amount of money each of them spent with us, it makes sense to dig deeper.
The number of active customers can be visualized with the algorithm we used for total revenue. After we have processed a large amount of data, it should have the following structure. Cohort01, Cohort02, etc. are the cohorts’ names based on customer sign-up date or first purchase date, and M1, M2, etc. are the periods of a cohort’s lifetime (first month, second month, etc.):
For example, Cohort01 signed up in January (M1) and included 11,000 clients who made purchases during the first month (M1). Cohort05 signed up in May (M5) and there were 1,100 active clients in September (M9).
Ok. Suppose you’ve processed the data and got the cohort.clients data frame as a result, and it looks like the table above. You can reproduce this data frame with the following code:
cohort.clients <- data.frame(cohort=c('Cohort01', 'Cohort02', 'Cohort03', 'Cohort04', 'Cohort05', 'Cohort06', 'Cohort07', 'Cohort08', 'Cohort09', 'Cohort10', 'Cohort11', 'Cohort12'),
M1=c(11000,0,0,0,0,0,0,0,0,0,0,0),
M2=c(1900,10000,0,0,0,0,0,0,0,0,0,0),
M3=c(1400,2000,11500,0,0,0,0,0,0,0,0,0),
M4=c(1100,1300,2400,13200,0,0,0,0,0,0,0,0),
M5=c(1000,1100,1400,2400,11100,0,0,0,0,0,0,0),
M6=c(900,900,1200,1600,1900,10300,0,0,0,0,0,0),
M7=c(850,900,1100,1300,1300,1900,13000,0,0,0,0,0),
M8=c(850,850,1000,1200,1100,1300,1900,11500,0,0,0,0),
M9=c(800,800,950,1100,1100,1250,1000,1200,11000,0,0,0),
M10=c(800,780,900,1050,1050,1200,900,1200,1900,13200,0,0),
M11=c(750,750,900,1000,1000,1180,800,1100,1150,2000,11300,0),
M12=c(740,700,870,1000,900,1100,700,1050,1025,1300,1800,20000))
Let’s create the “layercake” chart with the following R code:
#connect necessary libraries
library(ggplot2)
library(reshape2)
#we need to melt data
cohort.chart.cl <- melt(cohort.clients, id.vars = 'cohort')
colnames(cohort.chart.cl) <- c('cohort', 'month', 'clients')
#define palette
reds <- colorRampPalette(c('pink', 'dark red'))
#plot data
p <- ggplot(cohort.chart.cl, aes(x=month, y=clients, group=cohort))
p + geom_area(aes(fill = cohort)) +
scale_fill_manual(values = reds(nrow(cohort.clients))) +
ggtitle('Active clients by Cohort')
And we get the second amazing chart:
It seems like a lot of customers purchased once and left. It can be a reason why total revenue is mainly provided by new customers.
And finally, we can calculate and visualize the average revenue per client. The R code can be the following:
#we need to divide the data frames (excluding the cohort name)
rev.per.client <- cohort.sum[, c(2:13)] / cohort.clients[, c(2:13)]
rev.per.client[is.na(rev.per.client)] <- 0
rev.per.client <- cbind(cohort.sum[, 1], rev.per.client)
#define palette
greens <- colorRampPalette(c('light green', 'dark green'))
#melt and plot data
cohort.chart.per.cl <- melt(rev.per.client, id.vars = 'cohort.sum[, 1]')
colnames(cohort.chart.per.cl) <- c('cohort', 'month', 'average_revenue')
p <- ggplot(cohort.chart.per.cl, aes(x=month, y=average_revenue, group=cohort))
p + geom_area(aes(fill = cohort)) +
scale_fill_manual(values = greens(nrow(cohort.clients))) +
ggtitle('Average revenue per client by Cohort')
And we get the third chart:
It seems like Cohort02 customers increased their average purchases during the M5–M8 months. It can be a signal.
Note: the last chart shows the average revenue per customer of each cohort, but it isn’t a cumulative value as in the previous two charts; it doesn’t show the total average revenue for all clients. This chart can be used for comparing cohorts, not for summarizing. Please don’t be confused.
Cohort Analysis is one of the most powerful and in-demand techniques available to marketers for assessing long-term trends in customer retention and calculating lifetime value.
If you have studied Custora’s university, you might be interested in the amazing “layer-cake graph” they propose for Cohort Analysis.
Custora says: “The distinctive “layer-cake graph” produced by looking at cohorts in calendar time can provide powerful insights into the health of your business. At a given point in time, what percentage of your revenue or profit came from new vs. repeat customers? Tracking how that ratio has changed over time can give you insight into whether you’re fueling top-line growth solely through new customer acquisition – or whether you’re continuing to nurture those relationships with your existing customers over time.”
Usually, we focus on calculating lifetime value or comparing cohorts, but I was really impressed with this useful analytical approach and tried to create the same chart in R. Now, let’s see what I’ve got.
After we have processed a great deal of data, it should have the following structure. Cohort01, Cohort02, etc. are the cohorts’ names based on customer sign-up date or first purchase date, and M1, M2, etc. are the periods of a cohort’s lifetime (first month, second month, etc.):
For example, Cohort01 signed up in January (M1) and brought us $270,000 during the first month (M1). Cohort05 signed up in May (M5) and brought us $31,000 in September (M9).
Ok. Suppose you’ve processed the data and got the cohort.sum data frame as a result, and it looks like the table above. You can reproduce this data frame with the following code:
cohort.sum <- data.frame(cohort=c('Cohort01', 'Cohort02', 'Cohort03', 'Cohort04', 'Cohort05', 'Cohort06', 'Cohort07', 'Cohort08', 'Cohort09', 'Cohort10', 'Cohort11', 'Cohort12'),
M1=c(270000,0,0,0,0,0,0,0,0,0,0,0),
M2=c(85000,275000,0,0,0,0,0,0,0,0,0,0),
M3=c(72000,63000,277000,0,0,0,0,0,0,0,0,0),
M4=c(52000,42000,76000,361000,0,0,0,0,0,0,0,0),
M5=c(50000,45000,60000,80000,288000,0,0,0,0,0,0,0),
M6=c(51000,52000,55000,51000,58000,253000,0,0,0,0,0,0),
M7=c(51000,69000,48000,45000,42000,54000,272000,0,0,0,0,0),
M8=c(46000,85000,77000,41000,38000,37000,74000,352000,0,0,0,0),
M9=c(38000,42000,72000,41000,31000,30000,49000,107000,285000,0,0,0),
M10=c(39000,38000,45000,33000,34000,34000,46000,83000,69000,279000,0,0),
M11=c(38000,42000,31000,32000,26000,28000,43000,82000,51000,87000,282000,0),
M12=c(35000,35000,38000,45000,35000,32000,48000,44000,47000,52000,92000,500000))
Let’s create the “layercake” chart with the following R code:
#connect necessary libraries
library(ggplot2)
library(reshape2)
#we need to melt data
cohort.chart <- melt(cohort.sum, id.vars = "cohort")
colnames(cohort.chart) <- c('cohort', 'month', 'revenue')
#define palette
blues <- colorRampPalette(c('lightblue', 'darkblue'))
#plot data
p <- ggplot(cohort.chart, aes(x=month, y=revenue, group=cohort))
p + geom_area(aes(fill = cohort)) +
scale_fill_manual(values = blues(nrow(cohort.sum))) +
ggtitle('Total revenue by Cohort')
And we get this amazing chart:
You can see that monthly revenue is highly dependent on new customers who make their first purchases. But over time, the company accumulates several layers of income from existing (loyal) customers and reduces that dependence. Furthermore, it seems like there was some activity (e.g. a promo) in the eighth month (M8) and a few cohorts responded. A really helpful chart.
Let’s continue to dig into tweets. After reviewing how to count positive, negative and neutral tweets in the previous post, I discovered another great idea. Suppose a positive or negative mark is not enough and we want to understand the degree of positivity or negativity. For example, the word “good” has a rating of 4 points, but “perfect” has 6. In this way we can try to measure the rate of satisfaction or opinion in tweets and get a chart with a trend like the following:
We need another dictionary to manage this task, specifically a dictionary with ratings of words. We can create it ourselves or use the results of extensive research on affective ratings (e.g. here).
And of course, our algorithm should bypass Twitter’s API limitation by accumulating historical data. This approach was described in the previous post.
Note: I will use the average rating for evaluating a tweet, based on the ratings of the words it consists of. For example, if we’ve found “good” (4 points) and “perfect” (6 points) in the tweet, it would be evaluated as (4+6)/2=5. In this way we avoid the influence of several negative words that could have a higher total rating, e.g. one word “good” (4 points) should have a higher rating than three words “bad” (1.5 points each).
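A minimal Python sketch of this averaging rule (the article's implementation below is in R; the tiny dictionary here is made up for illustration, real affective-rating lexicons are far larger):

```python
# Toy valence dictionary; the words and ratings are illustrative only
valence = {'good': 4.0, 'perfect': 6.0, 'bad': 1.5}

def score_tweet(text, valence):
    """Average rating of the rated words in a tweet; None if no word is rated."""
    ratings = [valence[w] for w in text.lower().split() if w in valence]
    return sum(ratings) / len(ratings) if ratings else None

# "good" (4) + "perfect" (6) average to 5.0 ...
print(score_tweet("this service is good simply perfect", valence))  # 5.0
# ... while three occurrences of "bad" still average to 1.5, so they
# cannot outweigh a single strong positive word by sheer count
print(score_tweet("bad bad bad", valence))  # 1.5
```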
Let’s start. We need to create a Twitter Application (https://apps.twitter.com/) in order to have access to Twitter’s API. Then we will get the Consumer Key and Consumer Secret. And finally, our code in R:
#connect all libraries
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
#connect to API
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
consumerKey <- '____________' #put the Consumer Key from Twitter Application
consumerSecret <- '______________' #put the Consumer Secret from Twitter Application
Cred <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=reqURL, accessURL=accessURL, authURL=authURL)
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl'))
#There is a URL in the Console. You need to go to it, get the code, and enter it in the Console
save(Cred, file='twitter authentication.Rdata')
load('twitter authentication.Rdata') #Once you have launched the code the first time, you can start from this line in the future (libraries should be connected)
registerTwitterOAuth(Cred)
#the function for extracting and analyzing tweets
search <- function(searchterm)
{
#extract tweets and create storage file
list <- searchTwitter(searchterm, cainfo='cacert.pem', n=1500)
df <- twListToDF(list)
df <- df[, order(names(df))]
df$created <- strftime(df$created, '%Y-%m-%d')
if (file.exists(paste(searchterm, '_stack_val.csv'))==FALSE) write.csv(df, file=paste(searchterm, '_stack_val.csv'), row.names=F)
#merge the last extraction with storage file and remove duplicates
stack <- read.csv(file=paste(searchterm, '_stack_val.csv'))
stack <- rbind(stack, df)
stack <- subset(stack, !duplicated(stack$text))
write.csv(stack, file=paste(searchterm, '_stack_val.csv'), row.names=F)
#tweets evaluation function
score.sentiment <- function(sentences, valence, .progress='none')
{
require(plyr)
require(stringr)
scores <- laply(sentences, function(sentence, valence){
sentence <- gsub('[[:punct:]]', '', sentence) #cleaning tweets
sentence <- gsub('[[:cntrl:]]', '', sentence) #cleaning tweets
sentence <- gsub('\\d+', '', sentence) #cleaning tweets
sentence <- tolower(sentence) #cleaning tweets
word.list <- str_split(sentence, '\\s+') #separating words
words <- unlist(word.list)
val.matches <- match(words, valence$Word) #find words from the tweet in the "Word" column of the dictionary
val.match <- valence$Rating[val.matches] #evaluate the words which were found (suppose the rating is in the "Rating" column of the dictionary)
val.match <- na.omit(val.match)
val.match <- as.numeric(val.match)
score <- sum(val.match)/length(val.match) #rating of the tweet (average value of the evaluated words)
return(score)
}, valence, .progress=.progress)
scores.df <- data.frame(score=scores, text=sentences) #save results to the data frame
return(scores.df)
}
valence <- read.csv('dictionary.csv', sep=',', header=TRUE) #load dictionary from .csv file
Dataset <- stack
Dataset$text <- as.factor(Dataset$text)
scores <- score.sentiment(Dataset$text, valence, .progress='text') #start the score function
write.csv(scores, file=paste(searchterm, '_scores_val.csv'), row.names=TRUE) #save the evaluation results into the file
#modify evaluation
stat <- scores
stat$created <- stack$created
stat$created <- as.Date(stat$created)
stat <- na.omit(stat) #delete unvalued tweets
write.csv(stat, file=paste(searchterm, '_opin_val.csv'), row.names=TRUE)
#chart
ggplot(stat, aes(created, score)) +
geom_point(size=1) +
stat_summary(fun.data = 'mean_cl_normal', mult = 1, geom = 'smooth') +
ggtitle(searchterm)
ggsave(file=paste(searchterm, '_plot_val.jpeg'))
}
search("______") #enter keyword
Finally, we will get four files:
Recently I designed a relatively simple script in R for analyzing the content of Twitter posts by calculating the number of positive, negative and neutral words. The idea of processing tweets is based on the presentation http://www.slideshare.net/ajayohri/twitteranalysisbykaifyrais. The words in a tweet are matched against the words in dictionaries that you can find on the internet or create yourself. It is also possible to edit these dictionaries. Really great work, but I’ve discovered an issue.
There are some limitations in Twitter’s API. It depends on the total number of tweets you access via the API, but usually you can get tweets for the last 7-8 days (not longer, and it can be as few as 1-2 days). The 7 to 8 days time limit doesn’t allow us to analyze historical trends.
My idea is to create a storage file in order to accumulate historical data and bypass the API’s limitations. If you extract tweets regularly, you can analyze the dynamics of sentiment with a chart like this one:
Furthermore, this algorithm includes a function that allows you to extract tweets for each of the keywords that you are interested in. The process can be repeated several times a day, and the data set for each keyword will be saved separately. It can be helpful, for example, for doing competitor analysis.
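The storage-file logic above (append each new extraction, deduplicate by tweet text) can be sketched in Python as follows; the file name and the 'created'/'text' fields are illustrative choices, not taken from the Twitter API:

```python
import csv
import os

def accumulate(new_rows, storage='keyword_stack.csv'):
    """Merge freshly extracted rows into the storage file, dropping
    tweets whose text is already stored; returns all accumulated rows."""
    rows = []
    if os.path.exists(storage):
        with open(storage, newline='') as f:
            rows = list(csv.DictReader(f))
    seen = {r['text'] for r in rows}
    for r in new_rows:
        if r['text'] not in seen:  # keep only tweets we have not seen yet
            rows.append(r)
            seen.add(r['text'])
    with open(storage, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['created', 'text'])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling a function like this once per extraction grows the file beyond the API's 7-8 day window while keeping each tweet only once.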
Let’s start. We need to create a Twitter Application (https://apps.twitter.com/) in order to have access to Twitter’s API. Then we will get the Consumer Key and Consumer Secret.
#connect all libraries
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
#connect to API
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
consumerKey <- '____________' #put the Consumer Key from Twitter Application
consumerSecret <- '______________' #put the Consumer Secret from Twitter Application
Cred <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=reqURL, accessURL=accessURL, authURL=authURL)
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl'))
#There is a URL in the Console. You need to go to it, get the code, and enter it in the Console
save(Cred, file='twitter authentication.Rdata')
load('twitter authentication.Rdata') #Once you have launched the code the first time, you can start from this line in the future (libraries should be connected)
registerTwitterOAuth(Cred)
#the function for extracting and analyzing tweets
search <- function(searchterm)
{
#extract tweets and create storage file
list <- searchTwitter(searchterm, cainfo='cacert.pem', n=1500)
df <- twListToDF(list)
df <- df[, order(names(df))]
df$created <- strftime(df$created, '%Y-%m-%d')
if (file.exists(paste(searchterm, '_stack.csv'))==FALSE) write.csv(df, file=paste(searchterm, '_stack.csv'), row.names=F)
#merge the last extraction with storage file and remove duplicates
stack <- read.csv(file=paste(searchterm, '_stack.csv'))
stack <- rbind(stack, df)
stack <- subset(stack, !duplicated(stack$text))
write.csv(stack, file=paste(searchterm, '_stack.csv'), row.names=F)
#tweets evaluation function
score.sentiment <- function(sentences, pos.words, neg.words, .progress='none')
{
require(plyr)
require(stringr)
scores <- laply(sentences, function(sentence, pos.words, neg.words){
sentence <- gsub('[[:punct:]]', "", sentence)
sentence <- gsub('[[:cntrl:]]', "", sentence)
sentence <- gsub('\\d+', "", sentence)
sentence <- tolower(sentence)
word.list <- str_split(sentence, '\\s+')
words <- unlist(word.list)
pos.matches <- match(words, pos.words)
neg.matches <- match(words, neg.words)
pos.matches <- !is.na(pos.matches)
neg.matches <- !is.na(neg.matches)
score <- sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress)
scores.df <- data.frame(score=scores, text=sentences)
return(scores.df)
}
pos <- scan('C:/___________/positive-words.txt', what='character', comment.char=';') #folder with the positive dictionary
neg <- scan('C:/___________/negative-words.txt', what='character', comment.char=';') #folder with the negative dictionary
pos.words <- c(pos, 'upgrade')
neg.words <- c(neg, 'wtf', 'wait', 'waiting', 'epicfail')
Dataset <- stack
Dataset$text <- as.factor(Dataset$text)
scores <- score.sentiment(Dataset$text, pos.words, neg.words, .progress='text')
write.csv(scores, file=paste(searchterm, '_scores.csv'), row.names=TRUE) #save evaluation results
#total score calculation: positive / negative / neutral
stat <- scores
stat$created <- stack$created
stat$created <- as.Date(stat$created)
stat <- mutate(stat, tweet=ifelse(stat$score > 0, 'positive', ifelse(stat$score < 0, 'negative', 'neutral')))
by.tweet <- group_by(stat, tweet, created)
by.tweet <- summarise(by.tweet, number=n())
write.csv(by.tweet, file=paste(searchterm, '_opin.csv'), row.names=TRUE)
#chart
ggplot(by.tweet, aes(created, number)) +
geom_line(aes(group=tweet, color=tweet), size=2) +
geom_point(aes(group=tweet, color=tweet), size=4) +
theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1)) +
#stat_summary(fun.y = 'sum', fun.ymin='sum', fun.ymax='sum', colour = 'yellow', size=2, geom = 'line') +
ggtitle(searchterm)
ggsave(file=paste(searchterm, '_plot.jpeg'))
}
search("______") #enter keyword
Finally, we will get four files:
I was surfing GitHub when I found this repository: Awesome Data Science
It has an extensive list of data science bloggers, MOOCs and the diamond: a list of 24 free dataset sources. Excellent for studying and applying some data science techniques.
Some highlights:
Originally posted here. Check original article for most recent updates.
Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation in both Excel and Perl, and discuss our popular model-free confidence interval technique introduced in our original Analyticbridge article. This technique has the following advantages:
This is part of our series on data science techniques suitable for automation, usable by non-experts. The next one to be detailed (with source code) will be our Hidden Decision Trees.
Figure 1: Confidence bands based on our CI (bold red and blue curves) – Comparison with traditional normal model (light red and blue curves)
Figure 1 is based on simulated data that does not follow a normal distribution: see section 2 and Figure 2 in this article. It shows how sensitive CI’s are to model assumptions when using the traditional approach, leading to very conservative (and plainly wrong) CI’s. Classical CI’s are based on just 2 parameters: mean and variance. With the classical model, all data sets with the same mean and same variance have the same CI’s. On the contrary, our CI’s are based on k parameters – average values computed on k different bins – see the next section for details. In short, they are much better predictive indicators when your data is not normal. Yet they are so easy to understand and compute that you don’t even need to understand probability 101 to get started. The attached spreadsheet and Perl scripts have all the computations done for you.
1. General Framework
We assume that we have n observations from a continuous or discrete variable. We randomly assign a bin number to each observation: we create k bins (1 ≤ k ≤ n) that have similar or identical sizes. We compute the average value in each bin, then we sort these averages. Let p(m) be the mth lowest average (1 ≤ m ≤ k/2, with p(1) being the minimum average). Then our CI is defined as follows:
The confidence level represents the probability that a new observation (from the same data set) will fall between the lower and upper bounds of the CI. Note that this method produces asymmetrical CI’s. It is equivalent to designing percentile-based confidence intervals on aggregated data. In practice, k is chosen much smaller than n, say k = SQRT(n). Also, m is chosen so that 1 – 2m/(k+1) is as close as possible to a pre-specified confidence level, for instance 0.95. Note that the higher m, the more robust (insensitive to outliers) your CI.
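The procedure can be sketched in Python as below. The explicit bounds are not restated in the text above, so the sketch assumes the CI is [p(m), p(k+1−m)], the choice consistent with the stated confidence level 1 − 2m/(k+1):

```python
import random

def bin_average_ci(data, k, m):
    """Model-free CI: randomly assign observations to k bins, average each
    bin, sort the averages, and take the m-th lowest and m-th highest
    averages as bounds (an assumption consistent with the stated level)."""
    xs = list(data)
    random.shuffle(xs)                          # the reshuffling step
    bins = [xs[i::k] for i in range(k)]         # k bins of nearly equal size
    p = sorted(sum(b) / len(b) for b in bins)   # sorted bin averages
    level = 1 - 2 * m / (k + 1)                 # confidence level from the text
    return p[m - 1], p[k - m], level            # p(m), p(k+1-m), level

random.seed(1)
data = [random.gauss(0, 1) for _ in range(10000)]
lower, upper, level = bin_average_ci(data, k=100, m=3)  # level is about 0.94
```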
If you can’t find m and k to satisfy level = 0.95 (say), then compute a few CI’s (with different values of m) with confidence levels close to 0.95, then interpolate or extrapolate the lower and upper bounds to get a CI with a 0.95 confidence level. The concept is easy to visualize if you look at Figure 1. Also, do proper cross-validation: split your data in two, compute CI’s using the first half, and test them on the other half to see if they still make sense (same confidence level, etc.)
CI’s are extensively used in quality control, to check whether a batch of new products (say, batteries) has failure rates, lifetimes or other performance metrics that are within reason, that are acceptable. Or whether wine advertised with 12.5% alcohol content has an actual alcohol content reasonably close to 12.5% in each batch, year after year. By “acceptable” or “reasonable”, we mean between the upper and lower bounds of a CI with a pre-specified confidence level. CI’s are also used in scoring algorithms, to provide a CI for each score. The CI provides an indication of how accurate the score is. Very narrow CI’s correspond to data that is well understood, with all sources of variance perfectly explained. Conversely, large CI’s mean lots of noise and high individual variance in the data. Finally, if your data is stratified in multiple heterogeneous segments, compute separate CI’s for each stratum.
That’s it; you don’t need to know even rudimentary statistical science to understand this CI concept, or the concept of hypothesis testing (derived from CI’s) explained below in section 3.
When Big Data is Useful
If you look closely at Figure 1, it’s clear that you can’t compute accurate CI’s with a high (above 0.99) confidence level with just a small sample and (say) k=100 bins. The higher the level, the more volatile the CI. Typically, a 0.999 level requires 10,000 or more observations to get something stable. These high-level CI’s are needed especially in the context of assessing failure rates, food quality, fraud detection or sound statistical litigation. There are ways to work with much smaller samples by combining 2 tests; see section 3.
An advantage of big data is that you can create many different combinations of k bins (that is, test many values of m and k) to look at how the confidence bands in Figure 1 change depending on the bin selection – even allowing you to create CI’s for these confidence bands, just like you could with Bayesian models.
2. Computations: Excel, Source Code
The first step is to reshuffle your data to make sure that your observations are in perfectly random order: read the A New Big Data Theorem section in this article for an explanation of why reshuffling is necessary (look at the second theorem). In short, you want to create bins that have the same mix of values: if the first half of your data set consisted of negative values, and the second half of positive values, you might end up with bins filled with either only positive or only negative values. You don’t want that; you want each bin to be well balanced.
Reshuffling Step
Unless you know that your data is in an arbitrary order (this is the case most frequently), reshuffling is recommended. Reshuffling can easily be performed as follows:
Note that we use 100,000 + INT(10,000*RAND()) rather than just simply RAND() to make sure that all random numbers are integers with the same number of digits. This way, whether you sort alphabetically or numerically, the result will be identical, and correct. Sorting numbers of variable length alphabetically (without knowing it) is a source of many bugs in software engineering. This little trick helps you avoid this problem.
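A quick Python check of why the fixed-width trick works (the key range and observation count here are just for illustration):

```python
import random

random.seed(42)
observations = ['obs%d' % i for i in range(1000)]
# Keys in 100000..109999 all have six digits, so text order == numeric order
keys = [100000 + int(10000 * random.random()) for _ in observations]

by_text = [v for _, v in sorted(zip(map(str, keys), observations))]
by_number = [v for _, v in sorted(zip(keys, observations))]
print(by_text == by_number)  # True: fixed-width keys sort the same both ways

# With variable-length keys, alphabetical sorting goes wrong:
print(sorted(['2', '10']))  # ['10', '2'] because '1' < '2' as text
```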
If the order in your data set is very important, just add a column that has the original rank attached to each observation (in your initial data set), and keep it through the reshuffling process (after each observation has been assigned to a bin), so that you can always recover the original order if necessary, by sorting back according to this extra column.
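To see why the fixed-width trick matters, here is a small Python illustration (not from the original article; the numbers are my own toy values). With fixed-width keys, alphabetical and numerical sorting agree; with variable-length keys, they do not:

```python
import random

random.seed(0)

# Fixed-width keys: every value is a six-digit integer (100,000 to 109,999),
# so sorting the keys alphabetically gives exactly the numerical order.
fixed = [100000 + int(10000 * random.random()) for _ in range(1000)]
assert sorted(fixed, key=str) == sorted(fixed)

# Variable-length keys: alphabetical and numerical order disagree,
# because "10" < "9" as strings.
variable = [9, 10, 100]
print(sorted(variable, key=str))  # [10, 100, 9]
```

This is exactly the bug the 100,000 + INT(10,000*RAND()) formula is designed to prevent.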
The Spreadsheet
Download the Excel spreadsheet. Figures 1 and 2 are in the spreadsheet, as well as all CI computations, and more. The spreadsheet illustrates many little-known but useful analytic Excel functions, such as FREQUENCY, PERCENTILE, CONFIDENCE.NORM, RAND, AVERAGEIF, MOD (for bin creation) and RANK. The CI computations are in cells O2:Q27 in the Confidence Intervals tab. You can modify the data in column B, and all CI’s will automatically be recomputed. Beware if you change the number of bins (cell F2): this can break the RANK function in column J (some ranks will be missing) and in turn corrupt the CI’s.
For other examples of great spreadsheets (from a tutorial point of view), check the Excel section in our data science cheat sheet.
Simulated Data
The simulated data in our Excel spreadsheet (see the data simulation tab) represents a mixture of two uniform distributions, driven by the parameters in the orange cells F2, F3 and H2. The 1,000 original simulated values (see Figure 2) were stored in column D, and were subsequently hard-copied into column B of the Confidence Interval (results) tab (they still reside there), because otherwise, each time you modify the spreadsheet, new deviates produced by the RAND Excel function are automatically generated, changing everything and making our experiment non-reproducible. This is a drawback of Excel, though I’ve heard that it is possible to freeze numbers produced by the RAND function. The simulated data is remarkably non-Gaussian; see Figure 2. It provides a great example of data that causes big problems for traditional statistical science, as described in the following subsection.
In any case, this is an interesting tutorial on how to generate simulated data in Excel. Other examples can be found in our Data Science Cheat Sheet (see Excel section).
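For readers who prefer code to spreadsheets, here is a minimal Python sketch of the same idea: a mixture of two uniform distributions with a frozen seed, so the experiment stays reproducible (the mixture weight and ranges below are illustrative placeholders, not the values stored in the spreadsheet’s orange cells):

```python
import random

def simulate_mixture(n, p=0.3, low1=0.0, high1=1.0, low2=5.0, high2=6.0, seed=123):
    """Draw n deviates from a mixture of two uniform distributions.
    With probability p, draw from Uniform(low1, high1); otherwise from
    Uniform(low2, high2). Parameter values are illustrative only."""
    rng = random.Random(seed)  # fixed seed: freezes the deviates, unlike Excel's RAND
    data = []
    for _ in range(n):
        if rng.random() < p:
            data.append(rng.uniform(low1, high1))
        else:
            data.append(rng.uniform(low2, high2))
    return data

data = simulate_mixture(1000)
```

Using a seeded generator sidesteps the Excel reproducibility problem mentioned above: re-running the script always produces the same 1,000 deviates.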
Comparison with Traditional Confidence Intervals
We provide a comparison with standard CI’s (available in all statistical packages) in Figure 1, and in our spreadsheet. There are a few ways to compute traditional CI’s:
As you can see in Figure 1, traditional CI’s fail miserably if your data has either a short or long tail, compared with data originating from a Gaussian process.
Perl Code
Here’s some simple source code to compute the CI for given m and k:
$k=50; # number of bins
$m=5;  # determines the confidence level: level = 1 - 2*m/(k+1)
$n=0;  # observation counter
open(IN, "< data.txt");
while ($value=<IN>) {
  $value =~ s/\n//g;
  $binNumber = $n % $k; # round-robin bin assignment (assumes reshuffled data)
  $binSum[$binNumber] += $value;
  $binCount[$binNumber]++;
  $n++;
}
close(IN);
if ($n < $k) {
  print "Error: too few observations: n < k (choose a smaller k)\n";
  exit();
}
if ($m > $k/2) {
  print "Error: reduce m (must be <= k/2)\n";
  exit();
}
for ($binNumber=0; $binNumber<$k; $binNumber++) {
  $binAVG[$binNumber] = $binSum[$binNumber]/$binCount[$binNumber];
}
$binNumber=0;
foreach $avg (sort { $a <=> $b } @binAVG) { # sorting bin averages numerically
  $sortedBinAVG[$binNumber] = $avg;
  $binNumber++;
}
$CI_LowerBound = $sortedBinAVG[$m];
$CI_UpperBound = $sortedBinAVG[$k-$m-1];
$CI_level = 1 - 2*$m/($k+1);
print "CI = [$CI_LowerBound, $CI_UpperBound] (level = $CI_level)\n";
Exercise: write the code in R or Python.
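As one possible answer to this exercise, here is a Python sketch of the same computation (function and variable names are mine, not from the original article; it mirrors the Perl logic: round-robin assignment to k bins, bin averages sorted, bounds read off the m-th smallest and m-th largest averages):

```python
def bin_confidence_interval(values, k=50, m=5):
    """Compute a CI from k bin averages. Assumes the data has already been
    reshuffled into random order; level = 1 - 2*m/(k+1) is approximate."""
    n = len(values)
    if n < k:
        raise ValueError("Too few observations: n < k (choose a smaller k)")
    if m > k / 2:
        raise ValueError("Reduce m (must be <= k/2)")
    sums, counts = [0.0] * k, [0] * k
    for i, value in enumerate(values):
        b = i % k                      # round-robin bin assignment
        sums[b] += value
        counts[b] += 1
    averages = sorted(s / c for s, c in zip(sums, counts))
    lower, upper = averages[m], averages[k - m - 1]
    level = 1 - 2 * m / (k + 1)
    return lower, upper, level
```

For instance, with the values 0, 1, …, 999 and the defaults k=50, m=5, the bin averages are 475, 476, …, 524, giving the interval [480, 519] at level 1 - 10/51, roughly 0.804.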
3. Application to Statistical Testing
Rather than using p-values and other dangerous concepts (about to become extinct) that nobody but statisticians understand, here is an easy way to perform statistical tests. The method below is part of what we call rebel statistical science.
Let’s say that you want to test, with 99.5% confidence (level = 0.995), whether or not a wine manufacturer consistently produces a specific wine that has a 12.5% alcohol content. Maybe you are a lawyer, and the wine manufacturer is accused of lying on the bottle labels (claiming that the alcohol content is 12.5% when it is really 13%), perhaps to save some money. The test to perform is as follows: check out 100 bottles from various batches, and compute a 0.995-level CI for alcohol content. Is 12.5% between the upper and lower bounds? Note that you might not be able to get an exact 0.995-level CI if your sample size n is too small (say n=100); you will have to extrapolate from lower-level CI’s. The reason to use a high confidence level here is to give the defendant the benefit of the doubt, rather than wrongly accusing him based on a too-small confidence level. If 12.5% is found inside even a small 0.50-level CI (which will be the case if the wine is truly 12.5% alcohol), then a fortiori it will be inside a 0.995-level CI, because these CI’s are nested (see Figure 1 to understand these ideas). Likewise, if the wine truly has a 13% alcohol content, a tiny 0.03-level CI containing the value 13% will be enough to prove it.
One way to better answer these statistical tests (when your high-level CI’s don’t provide an answer) is to produce 2 or 3 tests (but no more, otherwise your results will be biased). Test whether the alcohol rate is
4. Miscellaneous
We include two figures in this section. The first one is about the data used in our test and Excel spreadsheet, to produce our confidence intervals. And the other figure shows the theorem that justifies the construction of our confidence intervals.
Figure 2: Simulated data used to compute CI’s: asymmetric mixture of non-normal distributions
Figure 3: Theorem used to justify our confidence intervals
Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.
Specifically designed in the context of big data in our research lab, the new and simple strong correlation synthetic metric proposed in this article should be used whenever you want to check if there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now, to avoid being accused of reckless data science and even being sued for wrongful analytic practice.
In this paper, the traditional correlation is referred to as the weak correlation, as it captures only a small part of the association between two variables: weak correlation results in capturing spurious correlations and in predictive modeling deficiencies, even with as few as 100 variables. In short, our strong correlation (with a value between 0 and 1) is high (say above 0.80) only if the weak correlation is high (in absolute value) and the internal structures (auto-dependencies) of the two variables X and Y that you want to compare exhibit a similar pattern or correlogram. Yet this new metric is simple and involves just one parameter a (with a = 0 corresponding to weak correlation, and a = 1 being the recommended value for strong correlation). This setting is designed to avoid overfitting.
Our strong correlation blends the concept of ordinary or weak regression (indeed, an improved, robust, outlier-resistant version of ordinary regression; or see my book, pages 130-140) with the concept of X and Y sharing similar bumpiness (or see my book, pages 125-128).
In short, even nowadays, what makes two variables X and Y seem related in most scientific articles, and in pretty much all articles written by journalists, is based on ordinary (weak) regression. But there are plenty of other metrics that you can use to compare two variables. Including bumpiness in the mix (together with weak regression, in one single blended metric called strong correlation, to boost accuracy) guarantees that a high strong correlation means the two variables are really associated: not just based on flawed, old-fashioned weak correlations, but also based on sharing similar internal auto-dependencies and structure. To put it differently, two variables can have a high weak correlation yet very different bumpiness coefficients, as shown in my original article, meaning that there might be no causal relationship (or see my book, pages 165-168) or hidden factors explaining the link. An artificial example is provided below in Figure 3.
Using strong, rather than weak correlation, eliminates the majority of these spurious correlations, as we shall see in the examples below. This strong correlation metric is designed to be integrated in automated data science algorithms.
1. Formal definition of strong correlation
Let’s define
Note that c1(X) and c1(Y) are the bumpiness coefficients (or see my book, pages 125-128) for X and Y. Also, d(X, Y), and thus r(X, Y), are between 0 and 1, with 1 meaning strong similarity between X and Y, and 0 meaning either dissimilar lag-1 autocorrelations for X and Y, or a lack of old-fashioned correlation.
The strong correlation between X and Y is, by definition, r(X, Y). This is an approximation to having both spectra identical, a solution mentioned in my article The curse of Big Data (see also my book, pages 41-45).
This definition of strong correlation was initially suggested in one of our weekly challenges.
2. Comparison with traditional (weak) correlation
When a = 0, weak and strong correlations are identical. Note that the strong correlation r(X, Y) still shares the same properties as the weak correlation c(X, Y): it is symmetric and invariant under linear transformations (such as rescaling) of variables X or Y, regardless of a.
Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.
Here I will discuss a general framework to process web traffic data. The concept of MapReduce will be introduced naturally. Let’s say you want to design a system to score Internet clicks, to measure the chance for a click to convert, or the chance for it to be fraudulent or unbillable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some conversions are soft, just a click-out, and the conversion rate is above 10%; some conversions are hard, for instance a credit card purchase, and the conversion rate is below 1%). Here, for now, we just ignore the conversion data and focus on the low-hanging fruit: click data. Other valuable data is impression data (for instance, a click not associated with an impression is very suspicious). But impression data is huge, 20 times bigger than click data. We ignore impression data here.
Here, we work with complete click data collected over a 7-day time period. Let’s assume that we have 50 million clicks in the data set. Working with a sample is risky, because much of the fraud is spread across a large number of affiliates, and involves clusters (small and large) of affiliates, and tons of IP addresses with few clicks per IP per day (low frequency).
The data set (ideally, a tab-separated text file, as CSV files can cause field misalignment here due to text values containing field separators) contains 60 fields: keyword (user query or advertiser keyword blended together, argh…), referral (actual referral domain or ad exchange domain, blended together, argh…), user agent (UA, a long string; UA is also known as the browser, but it can be a bot), affiliate ID, partner ID (a partner has multiple affiliates), IP address, time, city, and a bunch of other parameters.
The first step is to extract the relevant fields for this quick analysis (a few days of work). Based on domain expertise, we retained the following fields:
These 5 metrics are the base metrics to create the following summary table. Each (IP, Day, UA ID, Partner ID, Affiliate ID) represents our atomic (most granular) data bucket.
Building a summary table: the Map step
The summary table will be built as a text file (just like in Hadoop), the data key (for joins or groupings) being (IP, Day, UA ID, Partner ID, Affiliate ID). For each atomic bucket (IP, Day, UA ID, Partner ID, Affiliate ID) we also compute:
The list of UA’s, for a specific bucket, looks like ~67239~451~7842, meaning that in the bucket in question, there are three browsers (with ID 6723, 45 and 784), 12 clicks (9 + 1 + 2), and that (for instance) browser 6723 generated 9 clicks.
In Perl, these computations are easily performed, as you sequentially browse the data. The following updates the click count:
$hash_clicks{"$IP\t$Day\t$UA_ID\t$Partner_ID\t$Affiliate_ID"}++;
Updating the list of UA’s associated with a bucket is a bit less easy, but still almost trivial.
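To make this concrete, here is a hypothetical Python sketch of such summary hash tables: clicks are tallied per user agent (UA) within each bucket, and the UA list is then serialized for the summary table. The tab-separated bucket key (without UA) and the "~" and "|" separators are my assumptions, not necessarily the original implementation:

```python
from collections import Counter, defaultdict

# One hash table of counters: for each bucket, how many clicks per UA ID.
ua_lists = defaultdict(Counter)

def record_click(ip, day, ua_id, partner_id, affiliate_id):
    bucket = "\t".join([ip, day, partner_id, affiliate_id])
    ua_lists[bucket][ua_id] += 1  # one more click for this UA in this bucket

def ua_summary(bucket):
    # Serialize UA IDs with their click counts, e.g. three browsers, 12 clicks
    return "~" + "~".join(f"{ua}|{n}" for ua, n in ua_lists[bucket].items())
```

Recording 9 clicks for browser 6723, 1 for browser 45 and 2 for browser 784 in the same bucket yields the summary string ~6723|9~45|1~784|2, matching the example described above.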
The problem is that at some point, the hash table becomes too big and will slow your Perl script to a crawl. The solution is to split the big data set into smaller data sets (called subsets), and perform this operation separately on each subset. This is called the Map step in MapReduce. You need to decide which fields to use for the mapping. Here, the IP address is a good choice because it is very granular (good for load balancing), and it is the most important metric. We can split the IP address field into 20 ranges based on the first byte of the IP address, resulting in 20 subsets. The splitting into 20 subsets is easily done by sequentially browsing the big data set with a Perl script, looking at the IP field, and throwing each observation into the right subset based on its IP address.
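The Map step just described can be sketched as follows in Python (the output file names, field positions and the 20-way split rule are my assumptions, based on the description above):

```python
import os

def map_split(input_path, out_dir=".", n_subsets=20, ip_field=0, sep="\t"):
    """Map step: stream the big click file once and route each record to one
    of n_subsets files, chosen from the first byte of its IP address."""
    outputs = [open(os.path.join(out_dir, f"subset_{i:02d}.txt"), "w")
               for i in range(n_subsets)]
    try:
        with open(input_path) as f:
            for line in f:
                first_byte = int(line.split(sep)[ip_field].split(".")[0])
                # Map the byte range 0-255 onto subsets 0..n_subsets-1
                outputs[first_byte * n_subsets // 256].write(line)
    finally:
        for out in outputs:
            out.close()
```

Each subset can then be summarized independently with a small hash table, which is the whole point of the Map step.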
Building a summary table: the Reduce step
Now, after producing the 20 summary tables (one for each subset), we need to merge them together. We can’t simply use hash tables here, because they will grow too large and it won’t work, which is the reason why we used the Map step in the first place.
Here’s the workaround: …
Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.
This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables.
Our goal is to produce a regression tool that can be used as a black box, is very robust and parameter-free, and is usable and easy to interpret by non-statisticians. It is part of a bigger project: automating many fundamental data science tasks, to make them easy, scalable and cheap for data consumers, not just for data experts. Our previous attempts at automation include
Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.
1. Introduction
As in our previous paper, without loss of generality, we focus on linear regression with centered variables (with zero mean), and no intercept. Generalization to logistic regression or to non-centered variables is straightforward.
Thus we are still dealing with the following regression framework:
Y = a_1 * X_1 + … + a_n * X_n + noise
Remember that the solution proposed in our previous paper was
When cov(X_i, X_j) = 0 for i < j, my regression and the classical regression produce identical regression coefficients, and M = 1.
Terminology: Z is the noise, Y is the (observed) response, the a_i’s are the regression coefficients, and S = a_1 * X_1 + … + a_n * X_n is the estimated or predicted response. The X_i’s are the independent variables or features.
2. Revisiting our previous data set
I have added more cross-correlations to the previous simulated data set consisting of 4 independent variables, still denoted as x, y, z, u in the new, updated attached spreadsheet. Now corr(x, y) = 0.99.
Originally posted on Hadoop360, by Dr. Granville. Click here to read original article and comments.
The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a numerically unstable formula.
Synthetic Metrics
This new metric is synthetic: it was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop (see above picture). By synthetic, I mean that it was built to address issues with big data (outliers) and the way many big data computations are now done: the MapReduce framework, Hadoop being an implementation. It is a top-down approach to metric design (from data to theory), rather than the traditional bottom-up approach (from theory to data).
Other synthetic metrics designed in our research laboratory include:
Hadoop, numerical and statistical stability
There are two issues with the formula used for computing the variance in Hadoop. First, the formula used, namely Var(x1, … , xn) = {SUM(xi^2)/n} – {SUM(xi)/n}^2, is notoriously unstable. For large n, while the two terms nearly cancel each other out, each one taken separately can take a huge value, because of the squares aggregated over billions of observations. This results in numerical inaccuracies, with people having reported negative variances. Read the comments attached to my article The curse of Big Data for details. Besides, there are variance formulas that do not require two passes over the entire data set, and that are numerically stable.
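One such numerically stable, single-pass formula is Welford's algorithm. Here is a short Python sketch of the principle (an illustration, not the Hadoop implementation):

```python
def welford_variance(values):
    """Numerically stable one-pass (population) variance via Welford's
    algorithm: avoids the catastrophic cancellation of the textbook
    formula SUM(x^2)/n - (SUM(x)/n)^2."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n            # running mean
        m2 += delta * (x - mean)     # running sum of squared deviations
    return m2 / n if n else 0.0      # empty input: return 0.0 by convention
```

On data with a large offset, such as the three values 10^9, 10^9 + 1 and 10^9 + 2, the textbook formula loses almost all precision in the subtraction of two enormous terms, while the one-pass version above returns the correct variance 2/3 to full double precision.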
]]>Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.
In this article, I propose a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see a Hadoop methodology for the technology proposed here. It can easily be implemented in a MapReduce framework. It was developed by the author in the context of credit card fraud detection, and click/keyword scoring. This material will be part of our data science apprenticeship, and included in our Wiley book.
Feature selection is a methodology used to detect the best subset of features, out of dozens or hundreds of features (also called variables or rules). By “best”, we mean with highest predictive power, a concept defined in the following subsection. In short, we want to remove duplicate features, simplify the correlation structure (among features) a bit, and remove features that bring no value, such as a feature taking on random values (thus lacking predictive power), or features (rules) that are almost never triggered (except if they are perfect fraud indicators when triggered).
The problem is combinatorial in nature. You want a manageable, small set of features (say 20 features) selected from (say) a set of 500 features, to run our hidden decision trees (or some other classification / scoring technique) in a way that is statistically robust. But there are 2.7 * 10^35 combinations of 20 features out of 500, and you would need to compute all of them to find the one with maximum predictive power. This problem is computationally intractable, and you need to find an alternate solution. The good news is that you don’t need to find the absolute maximum; you just need to find a subset of 20 features that is good enough.
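The 2.7 * 10^35 figure is easy to verify directly, e.g. in Python:

```python
import math

# Number of ways to choose 20 features out of 500: an intractable search space.
combos = math.comb(500, 20)
print(f"{combos:.1e}")  # 2.7e+35
```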
One way to proceed is to compute the predictive power of each feature. Then, add one feature at a time to the subset (starting with 0 feature) until you reach either
At each iteration, choose the feature to be added from among the two remaining features with the highest individual predictive power: you will choose (among these two features) the one that increases the overall predictive power (of the subset under construction) the most. Now you have reduced your computations from 2.7 * 10^35 to 40 = 2 * 20.
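The greedy procedure above can be sketched as follows in Python. Here predictive_power is a placeholder for whatever predictive-power metric is used to score a feature subset, and max_features and min_gain are hypothetical names for the two stopping criteria (target subset size, and minimum improvement per added feature):

```python
def greedy_select(features, predictive_power, max_features=20, min_gain=1e-4):
    """Greedy forward feature selection: at each step, consider the two
    strongest remaining features and add the one that raises the overall
    predictive power of the subset the most."""
    # Rank features once by individual predictive power (strongest first).
    ranked = sorted(features, key=lambda f: predictive_power([f]), reverse=True)
    selected = []
    while ranked and len(selected) < max_features:
        candidates = ranked[:2]  # the two strongest remaining features
        best = max(candidates, key=lambda f: predictive_power(selected + [f]))
        gain = predictive_power(selected + [best]) - predictive_power(selected)
        if gain < min_gain:      # overall power no longer increases enough
            break
        selected.append(best)
        ranked.remove(best)
    return selected
```

With 2 candidate evaluations per step and at most 20 steps, this is the 40 = 2 * 20 evaluations mentioned above, instead of 2.7 * 10^35.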
Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.
This is a component often missing, yet valuable for most systems, algorithms and architectures that are dealing with online or mobile data, known as digital data: be it transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online search, plagiarism and spam detection, etc.
I will call it an Internet Topology Mapping. It might not be stored as a traditional database (it could be a graph database, a file system, or a set of lookup tables). It must be prebuilt (e.g. as lookup tables, with regular updates) to be efficiently used.
So what is the Internet Topology Mapping?
Essentially, it is a system that matches an IP address (Internet or mobile) with a domain name (ISP). When you process a transaction in real time in production mode (e.g. an online credit card transaction, to decide whether to accept or decline it), your system only has a few milliseconds to score the transaction to make the decision. In short, you only have a few milliseconds to call and run an algorithm (subprocess), on the fly, separately for each credit card transaction, to decide on accepting or rejecting it. If the algorithm involves matching the IP address with an ISP domain name (this operation is called nslookup), it won’t work: direct nslookups take between a few hundred and a few thousand milliseconds, and they will slow the system to a crawl.
Because of that, Internet Topology Mappings are missing in most systems. Yet there is a very simple workaround to build it:
When processing a transaction, access this lookup table (stored in memory, or at least with some caching available in memory) to detect the domain name. Now you can use a rule system that incorporates domain names.
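A minimal Python sketch of such a pre-built lookup table follows (the tab-separated file format, the function names and the sample domains are hypothetical; the point is that a dictionary lookup takes microseconds, versus hundreds of milliseconds for a live nslookup):

```python
# In-memory Internet Topology Mapping: IP address -> ISP domain name,
# pre-built offline and loaded once at startup.
ip_to_domain = {}

def load_topology(path):
    """Load the pre-built IP-to-domain table from a tab-separated file."""
    with open(path) as f:
        for line in f:
            ip, domain = line.rstrip("\n").split("\t")
            ip_to_domain[ip] = domain

def domain_for(ip):
    # Unresolved IPs are flagged so they can be nslookup'ed offline later.
    return ip_to_domain.get(ip, "UNKNOWN")
```

At transaction time, domain_for(ip) is a single hash-table access, fast enough to fit in the few-millisecond scoring budget.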
Examples of rules and metrics based on domain names are:
Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.
Hidden decision trees (HDT) is a technique patented by Dr. Granville to score large volumes of transaction data. It blends robust logistic regression with hundreds of small decision trees (each one representing, for instance, a specific type of fraudulent transaction), and offers significant advantages over both logistic regression and decision trees: robustness, ease of interpretation, no tree pruning, and no node-splitting criteria. This makes the methodology powerful and easy to implement, even for someone with no statistical background.
Hidden Decision Trees is a statistical and data mining methodology (just like logistic regression, SVM, neural networks or decision trees) to handle problems with large amounts of data, nonlinearity and strongly correlated independent variables.
The technique is easy to implement in any programming language. It is more robust than decision trees or logistic regression, and helps detect natural final nodes. Implementations typically rely heavily on large, granular hash tables.
No decision tree is actually built (thus the name hidden decision trees), but the final output of a hidden decision tree procedure consists of a few hundred nodes from multiple non-overlapping small decision trees. Each of these parent (invisible) decision trees corresponds, e.g., to a particular type of fraud in fraud detection models. Interpretation is straightforward, in contrast with traditional decision trees.
The methodology was first invented in the context of credit card fraud detection, back in 2003. It is not implemented in any statistical package at this time. Frequently, hidden decision trees are combined with logistic regression in a hybrid scoring algorithm, where 80% of the transactions are scored via hidden decision trees, while the remaining 20% are scored using a compatible logistic regression type of scoring.
Hidden decision trees take advantage of the structure of large multivariate features typically observed when scoring a large number of transactions, e.g. for fraud detection. The technique is not connected with hidden Markov fields.
Potential Applications
Implementation
The model presented here is used in the context of click scoring. The purpose is to create predictive scores, where score = f(response), that is, score is a function of the response. The response is sometimes referred to as the dependent variable in statistical and predictive models.
Originally posted on Analyticbridge, by Dr. Granville. Click here to read original article and comments.
With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of best predictive set of metrics.
In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following
Five conditions:
Note that R-squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers from the same drawbacks, and benefits from the same cures, as traditional correlation. So we will focus here on the correlation.
To illustrate the first condition (dependence on n), let’s consider the following made-up data set with two paired variables or time series X, Y: …