Author: Marc Borowczak.

Moving legacy data to a modern big data platform can be daunting at times. It doesn’t have to be. In this short tutorial, we’ll briefly review an approach and demonstrate it on my preferred data set. This isn’t an ML repository or a Kaggle competition data set, simply the data I accumulated over decades to keep track of my plastic model collection, and as such it definitely meets the legacy standard!

We’ll describe the steps followed on a laptop VirtualBox machine running Ubuntu 16.04.1 LTS Gnome. The following steps are required:

- Import the .csv file into MySQL, and optionally back up a compressed MySQL database file.
- Connect to the MySQL database in Spark 2.0.1 and then access the data: we’ll demonstrate an interactive Python approach using Jupyter PySpark in this post, and leave RStudio sparklyr access, based on existing methods, for another post.
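
As a rough illustration of the second step, the sketch below reads one MySQL table through Spark's generic JDBC data source. It is not the code used in the tutorial: the database name, table name, and credentials are hypothetical placeholders, and the driver class shown is the one shipped with MySQL Connector/J 5.x (newer Connector/J releases use `com.mysql.cj.jdbc.Driver`). The Connector/J jar is assumed to be available to Spark.

```python
def mysql_jdbc_options(database, table, user, password,
                       host="localhost", port=3306,
                       driver="com.mysql.jdbc.Driver"):
    """Build the option dict passed to Spark's generic JDBC reader."""
    return {
        "url": "jdbc:mysql://{}:{}/{}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,  # assumption: Connector/J 5.x driver class
    }


def load_mysql_table(spark, options):
    """Read one MySQL table into a Spark DataFrame via JDBC.

    `spark` is an existing SparkSession (available in Spark 2.x).
    """
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical usage inside a Jupyter PySpark session, where `spark` already
# exists and the MySQL Connector/J jar was supplied when Spark was launched:
#   opts = mysql_jdbc_options("models", "kits", "spark_user", "secret")
#   kits = load_mysql_table(spark, opts)
#   kits.printSchema()
```

Keeping the connection options in a plain dictionary makes it easy to reuse them for several tables by changing only `dbtable`.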

There’s really no need to abandon legacy data: migrating data to a new platform will enable businesses to extract and analyze data on a broader time scale, and open new ways to leverage ML techniques, analyze results and act on findings.

Additional methods to import CSV data will be discussed in a forthcoming post.

Author: Leandro Guerra

I was surfing GitHub when I found this repository: Awesome Data Science

It has an extensive list of data science bloggers, MOOCs and, the diamond, a list of 24 free dataset sources. Excellent for studying and applying some data science techniques.

**Some highlights**:

- Google Making Sense of Data
- Coursera Introduction to Data Science
- Data Science – 9 Steps Courses, A Specialization on Coursera
- Data Mining – 5 Steps Courses, A Specialization on Coursera
- CS 109 Data Science
- Schoolofdata
- OpenIntro
- Data science MOOC
- CS 171 Visualization
- Process Mining: Data science in Action

- Academic Torrents
- hadoopilluminated.com
- data.gov
- freebase.com
- usgovxml.com
- enigma.io
- datahub.io
- aws.amazon.com/datasets
- databib.org
- datacite.org
- quandl.com
- figshare.com
- GeoLite Legacy Downloadable Databases
- Quora’s Big Datasets Answer
- Public Big Data Sets
- Houston Data Portal
- Kaggle Data Sources
- A Deep Catalog of Human Genetic Variation
- A community-curated database of well-known people, places, and things
- Google Public Data
- World Bank Data
- NYC Taxi data
- Open Data Philly Connecting people with data for Philadelphia
- A list of useful sources A blog post includes many data set databases


Author: Vincent Granville

The following is a selection of featured articles that were posted in our previous weekly digests, in short, the best of the best on DSC. **Single-starred articles** are written by external/guest bloggers. Older popular articles are being added regularly, so please check out this page once a week!

Our upcoming book on data science 2.0 (or *data science automation* or *data science handbook* or *the little data science book*, not sure yet about the title) will be based on some of these (edited and revised) articles: **these articles are double-starred** below, with red starring **.

Double-starred articles, with blue starring **, were published in our Wiley book.

A great selection of articles prior to 2014, broken down by category, can be found here.

**How to access these articles?**

- Articles published after October 2015, click here.
- Articles published between February 2015 and October 2015, click here.
- Articles published prior to February 2015: see below.

**January 2015**

- The impact of asking the wrong question ** – 01/26
- Mysterious gaps in Google Analytics numbers ** – 01/26
- The 5 Essential Skills Any Data Scientist Needs * – 01/26
- Text Mining: Clustering and Unsupervised Methods * – 01/26
- Why Topological Data Analysis Works * – 01/19
- Statistician versus data scientist arrogance * – 01/19
- Top 10 Big Data and Analytics References * – 01/19
- In Big Data, Preparing the Data is Most of the Work * – 01/19
- VC investment analytics with data visualization * – 01/19
- Is Big Data the Single Biggest Threat To Your Job * – 01/19
- Data Science Art ** – 01/19
- 7 Traps to Avoid Being Fooled by Statistical Randomness* – 01/12
- Three Types Of Analytic Talent You Need * – 01/12
- Is Big Data the Single Biggest Threat To Your Job * – 01/12
- 9 tips for effective data mining * – 01/12
- How to Lie with Visualizations: Statistics, Causation vs Correlation, and Intuition * – 01/12
- Text Analysis 101; A Basic Understanding for Business Users * – 01/12
- Insight-driven vs. Intuition-driven Decision Making * – 01/12
- Understanding Linear Regression * – 01/12
- Latest online courses on Data Science * – 01/12
- The 2 Types Of Data Scientists Everyone Should Know About * – 01/05
- Get started with Hadoop and Spark in 10 minutes * – 01/15
- Most popular data science skills ** – 01/05
- What’s Hot & What’s Not in Data Science 2015 – 01/05
- Common Problems with Data * – 01/05
- Some statisticians have a biased view on data science ** – 01/05
- Engineering a far worse attack than Sony, without hacking ** – 01/05
- New Model for Scientific Research ** – 01/05
- Data Science Software Tools That Cost Nothing * – 01/05
- Comparison of statistical software * – 01/05

**December 2014**

- Regression Analysis using R explained * – 12/29
- 200 machine learning and data science resources – 12/29
- Choosing the Right BI Tool * – 12/29
- What can be predicted, and what can’t? ** – 12/29
- Are Earthquakes becoming more severe? ** – 12/29
- Update about the Data Science Apprenticeship ** – 12/29
- The data science project lifecycle * – 12/22
- A Statistician’s View on Big Data and Data Science * – 12/22
- Data Visualization of Employee metrics at the top Tech companies * – 12/22
- Best solution to a problem: data science versus statistical paradigm ** – 12/22
- Discover, Access, Distill: The Essence of Data Science** – 12/15
- 10 data science predictions for 2015 – 12/15
- Big Data: The Key Vocabulary Everyone Should Understand * – 12/15
- Great Machine Learning Infographics * – 12/15
- Big Data, IOT and Security – OH MY! * – 12/15
- Sentiment Analysis of 11 Million Tweets * – 12/15
- What is the future of Data visualization and Dashboard solutions? * – 12/15
- Data science without statistics is possible, even desirable ** – 12/15
- 5 basic rules of data organization ** – 12/15
- Introducing the Linked Data Business Cube * – 12/8
- Data Scientist: Owning Up to the Title * – 12/8
- Trends in Big Data Vs Hadoop Vs Business Intelligence * – 12/8
- Small versus big data, to choose a used car ** – 12/8
- Highest Paying Programming Skills – 12/01
- Text Analysis for Business Users * – 12/01
- 4 easy steps to becoming a data scientist ** – 12/01
- Easiest way to learn machine learning * – 12/01
- Implementing a Distributed Deep Learning Network over Spark * – 12/01
- High versus low-level data science ** – 12/01

**November 2014**

- Popular Software Skills in Data Science Job postings * – 11/24
- Don’t Expect A Large Salary Increase If You Didn’t Go To College * – 11/24
- A Study on Romantic Breakups on Twitter Using Data Science * – 11/24
- Trust The Algorithms, Not The Data * – 11/24
- Four great data science, big data, and deep machine learning books – 11/24
- The Only Skill you Should be Concerned With * – 11/17
- Unicorn Data Scientist Shares his Secrets with You ** – 11/17
- Why Data Scientists create poor Data Products ? * – 11/17
- 13 New Trends in Big Data and Data Science ** – 11/17
- 20 most commented blog posts on DSC – 11/17
- Why you should stay away from the stock market – 11/17
- 22 tips for better data science ** – 11/10
- My thoughts on data science and big data ** – 11/10
- The growth of data science over the last two years: 300% ** – 11/10
- Start with Good Science on Good Data, Then we’ll Talk Big Data * – 11/10
- Adversarial analytics and business hacking: Amazon case study ** – 11/10
- 3 Must-Ask Questions Before Choosing That Machine Learning Algorithm! * – 11/10
- Pseudo data science funded by political money, on Facebook ** – 11/10
- Social Influence Analysis * – 11/10
- My Data Science Apprenticeship Project ** – 11/03
- The Single Best Predictive Modeling Technique * – 11/03
- Data science versus statistics, to solve problems: case study ** – 11/03
- 13 Machine Learning Books – 11/03

**October 2014**

- Prescriptive versus Predictive Analytics * – 10/27
- Data Science 2.0 ** – 10/27
- List of data set repositories for cool data science projects ** – 10/27
- 15 timeless data science articles – 10/27
- Fake traffic un-detected by Google Analytics – 10/27
- 50 Face Recognition APIs – 10/27
- A Comprehensive List of Big Data Statistics – 10/27
- Bottom line on Data Visualization * – 10/20
- Data Science is Dead – Long Live the Data Scientist * – 10/20
- A data scientist shares his passions ** – 10/20
- Is data science a new paradigm, or recycled material? ** – 10/20
- 2-D random walks: simulation, video with R source code, curious facts ** – 10/20
- Data scientists face burnout due to work-related stress – 10/20
- How to Compute Moving Average in R Language and Python * – 10/20
- What is Map-Reduce? – 10/20
- Data scientists face burnout due to work-related stress – 10/13
- A new version of the famous 3V diagram from Drew Conway – 10/13
- Popular predictive apps and APIs ** – 10/13
- Do you know what is bigger than Big Data? * – 10/13
- 200 Top Bloggers on Data Science Central ** – 10/13
- Unraveling Real-Time Predictive Analytics * – 10/13
- Political orientation of 100,000 websites ** – 10/13
- Top 30 DSC blogs, based on new scoring technology ** – 10/6
- The end of the Data Scientist Bubble ** – 10/6
- Apache Spark: distributed data processing faster than Hadoop * – 10/6
- The 22 Skills of a Data Scientist – 10/6
- 10 most popular data science presentations on Slideshare – 10/6
- 38 Seminal Articles Every Data Scientist Should Read ** – 10/6
- Hadoop is Dead. DataFlow is Alive! – 10/6

**September 2014**

- 100+ leading blogs for statisticians and like-minded professionals – 9/29
- Great list of resources: data science, visualization, machine learn…** – 9/29
- Difference between data engineers and data scientists ** – 9/29
- 20 Big Data Repositories You Should Check Out ** – 9/29
- 50 Data Science and Statistics Blogs Worth Reading – 9/29
- Preliminary findings about Zipf’s Law (the thick tail distribution) * – 9/29
- 43 Data Science Thought Leaders, According to Berkeley University – 9/29
- Top 2,500 Data Science, Big Data and Analytics Websites ** – 9/22
- Top 2,500 Websites – top of the top ** – 9/22
- Job skills required to get hired by data science startups ** – 9/22
- Top Cities and Other Demographics for Data Scientists ** – 9/22
- Defining Big Data ** – 9/22
- Big data disguised as small data, causing dangerous side effects ** – 9/22
- Web crawler for clustering 2,500 data science websites ** – 9/22
- How to find the real web domain hidden behind a bit.ly shortened URL? ** – 9/22
- Web traffic statistics from competitors – which vendor do you trust? ** – 9/22
- Big Data Technology Vendor Consolidation * – 9/15
- Interactive visualization of growing Data Science / Big Data profil…** – 9/15
- Lesson 8: Graph Databases * – 9/15
- 180 leading data science, big data and analytics bloggers ** – 9/15
- Skills you need to become a data scientist * – 9/15
- NoSQL Databases are Good for Everything – Except Maybe this One Thing * – 9/8
- How to get published on Data Science Central ** – 9/8
- Curious formula generating all digits of square root numbers ** – 9/8
- 9 Lessons: Picking the Right NoSQL Tools * – 9/8
- Are You A Data Scientist ? * – 9/8
- Some software and skills that every Data Scientist should know * – 9/8
- Frozen versus liquid analytics ** – 9/1
- 33 unusual problems that can be solved with data science ** – 9/1
- How to Become a Data Scientist * – 9/1
- Synthetic criterion to choose the right variables for your predicti…** – 9/1
- Crazy Data Science Tutorial: Classification and Clustering * – 9/1
- Grouping and Summarizing Data of Big Text Files in R * – 9/1
- How to design better search engines? ** – 9/1

**August 2014**

- Network graph analysis for fraud detection and mitigation * – 08/25
- Data analysis software compared * – 08/25
- 21 great charts – 08/25
- Easiest way to learn machine learning * – 08/25
- From Data Analyst to Predictive Modelers to Data Scientists * – 08/25
- Why Zipf’s law explains so many big data and physics phenomenons ** – 08/25
- Data Science Projects ** – 08/18
- Challenge of the week – Modeling and explaining the law of series ** – 08/11
- Your Data Science Portfolio: Math Skills Don’t Matter * – 08/11
- Data Science: Fixing the Talent Shortage ** – 08/11
- Is Python or Perl faster than R? ** – 08/11
- Black-box Confidence Intervals: Excel and Perl Implementation ** – 08/11
- Word Clouds of Big Data, Data Science and Other Buzz Words * – 08/04
- Huge Trello List of Great Data Science Resources * – 08/04
- The law of series: why 4 plane crashing in 6 months is a coincidence ** – 08/04
- Data Science Cheat Sheet ** – 08/04

**July 2014**

- Can A 50-Person Startup Threaten Oracle, IBM, And Microsoft? – 07/28
- 5 Industries That Need Big Data * – 07/28
- 10 types of regressions. Which one to use? ** – 07/28
- Data Scientist Core Skills * – 07/28
- Challenge of the Week – Time Series and Spatial Processes ** – 07/28
- Rants from a great under-paid data scientist – 07/28
- 16 analytic disciplines compared to data science ** – 07/28
- 15 interviews with 15 data scientists – 07/21
- 10 Features all Dashboards Should Have ** – 07/21
- How to Process Text Files in the Data Analytics * – 07/21
- The fastest growing data science / big data profiles on Twitter ** – 07/21
- Great list of resources – NoSQL, Big Data, Machine Learning and more – 07/21
- 10 Features any Great Dashboard Should Have ** – 07/14
- Twelve Emerging Trends in Data Analytics (part 1 of 4) * – 07/14
- Top Data Scientists on Twitter ** – 07/14
- Beyond The Visualization Zoo * – 07/07
- 100 Big data analytics companies, tools, software – 07/07
- 25 Data Scientists Popular on LinkedIn – 07/07
- 35 books on Data Visualization – 07/07
- 12 Books and other resources to learn R – 07/07
- Comparison of Tableau, Qlikview and Omniscope * – 07/07
- Challenge of the Week – Random Numbers ** – 07/07

**June 2014**

- Clustering Similar Images Using MapReduce Style Feature Extraction … * – 06/30
- Unsolicited data scientists solving your problems without using you… ** – 06/30
- Three Fundamental Google rules detected thanks to data science ** – 6/30
- Internet of Things? Maybe. Maybe Not * – 06/30
- Is data science a sin against the norms of statisticians? ** – 06/23
- Must read before attending any data science interview ** – 06/23
- Best kept secret about data science competitions ** – 06/23
- The cost of underestimating the power big data – 06/23
- Challenge of the week: Piecewise linear clustering versus SVM – 06/23
- Data Science Summer Reading List 2014 * – 06/16
- Should you Tell Your Kids to be Data Scientists – Not Doctors? – 06/16
- About the Curse of Dimensionality * – 06/16
- Big Data Poster – 06/16
- Data science comic strips – 06/16
- A Tour of Machine Learning Algorithms * – 06/09
- Build basic recommendation engine using R * – 06/09
- 40 maps that explain the Internet * – 06/09
- 100+ Interesting Data Sets for Data Science ** – 06/09
- Data Science Has Been Using Rebel Statistics for a Long Time ** – 06/09
- New Data Science Projects Added to Data Science Apprenticeship ** – 06/09
- The Greatest Database Ever * – 06/02
- The 10 Algorithms That Dominate Our World * – 06/02
- Top Ten Big Data Analytics Tips * – 06/02
- Three interesting but little known programming languages * – 06/02
- Being a data scientist in a small country: challenges and solutions – 06/02
- List of NoSQL Databases * – 06/02

**May 2014**

- Tutorial: How to detect spurious correlations, and how to find the … ** – 05/26
- 77 People Who Truly Have Written Interesting Things About Data – 05/26
- Automatic Identification of Replicated Criminal Websites Using Comb… * – 05/26
- Research Brief: Four Functional Clusters of Analytics Professionals * – 05/26
- 50 big data companies to follow * – 05/26
- More than 100 data science, analytics, big data, visualization books – 05/26
- Proposal for a new type of scoring system ** – 05/26
- Resource: Tons of data sets ** – 05/26
- My answer to spurious correlations (previous challenge of the week) ** – 05/26
- 15+ Great Books for Hadoop – 05/26
- Journey of a data scientist ** – 05/19
- 20 Excel Spreadsheet Secrets – 05/19
- 50 copies of data science book, signed by the author: get yours! – 05/19
- The salary of a data science author – 05/19
- Data Science Cheat Sheet ** – 05/19
- Academic salaries exposed – 05/19
- Which one is best: R, SAS or Python, for data science? – 05/12
- 30 Basic Tools For Data Visualization * – 05/12
- How the gap between data science and statistics grew over time – 05/12
- A Statistician’s View on Data and Data Science * – 05/12
- Statisticians, big data gurus, data scientists, data miners: we’re … – 05/12
- Large set of Machine Learning and Related Resources – 05/12
- Why the human species hasn’t produced trillionaires yet? – 05/12

**April 2014**

- Data science displacing traditional science – 04/28
- Good and not so good companies for data scientists – 04/28
- 16 resources to learn and understand Hadoop ** – 04/21
- How to identify the right data scientist for your company ** – 04/21
- Big data: are we making a big mistake? My reaction ** – 04/14
- Data Science for business hacking ** – 04/14
- The Data Science Venn Diagram Revisited – 04/07
- Selection of must-read articles – 04/07
- Foundations of classical statistical theory being questioned – 04/07

**March 2014**

- Is Data Scientist the right career path for you? – 03/31
- Nate Silver’s famous run of successful predictions came to an halt – 03/31
- Top 10 Capabilities for Exploring Complex Relationships in Data for… * – 03/31
- The Data Science Toolkit – My Boot Camp Ciriculum * – 03/31
- From the trenches: 360-degree data science ** – 03/31
- Foundations of classical statistical theory being questioned ** – 03/31
- Comparing apples and oranges – 03/31
- Jackknife logistic and linear regression for clustering and predict…** – 3/24
- Big Data A to ZZ – A Glossary of my Favorite Data Science Things * – 3/24
- Machine Learning in Parallel with Support Vector Machines, Generali…* – 3/20
- Big data, big pay: 10 data jobs with climbing salaries – 3/24
- Learn experimental design with our live, real-time ongoing analysis ** – 3/24
- Another interesting ‘data science is dead’ article – 3/24
- New book: Data Just Right – 3/24
- Analytics Handbook – 3/24
- The best kept secret about linear and logistic regression ** – 3/17
- 7 Key Skills of Effective Data Scientists * – 3/17
- The Ideal Data Science Team * – 3/17
- Two big datasets to challenge your data science expertise – 3/17
- Great example of root cause analysis ** – 3/17
- Life Cycle of Data Science Projects ** – 3/17
- 17 areas to benefit from big data analytics in next 10 years ** – 3/17
- R in the cloud – 3/17
- Predictive model used in air traffic cancellator ** – 3/10
- Sometimes outliers are real data ** – 3/10
- The Growth of Hadoop from 2006 to 2014 – 3/10
- How to compete against data scientists charging $30/hour ** – 3/10
- Recommender Systems – past, present, and future * – 3/3
- Introduction to my data science book – 3/3
- Forecasting with the Baum-Welch Algorithm and Hidden Markov Models * – 3/3
- The Data Science Toolkit – The Future Web Toolkit * – 3/3

**February 2014**

- 20 short tutorials all data scientists should read (and practice) ** – 2/24
- Salary history and career path of a data scientist ** – 2/24
- Big Data Vendor Revenue and Market Forecast 2013-2017 – 2/24
- Two periodic tables for data scientists – 2/24
- How much is big data compressible? An interesting theorem ** – 2/24
- One Page R: A Survival Guide to Data Science with R – 2/17
- Interview with Dr. Roy Marsten, the Man Shaping Big Data – 2/17
- The top 1% data users consume 99% of all the data being produced – 2/17
- Big data is cheap and easy ** – 2/17
- R skills attract the highest salaries – 2/17
- Proposal for bulk email processing ** – 2/17
- 10 questions about big data and data science ** – 2/10
- How analytics will drive the future * – 2/10
- R + Python * – 2/10
- 2013 Data Science Salary Survey, by O’Reilly – 2/10
- Exploratory Data Analysis – Kernel Density Estimation and Rug Plots…* – 2/10
- Interview with David Cox, the most famous statistician still alive – 2/10
- California regulator seeks to shut down ‘learn to code’ bootcamps – 2/10
- Scary fraud scheme to empty your bank account – 2/10
- Big data misused to justify vaccination ** – 2/10
- Ingredients Of Data Science * – 2/3
- Predicting the Super Bowl * – 2/3
- Machine learning: an overview * – 2/3
- The Data Science Toolkit – first steps towards becoming a Data Scie…* – 2/3

**January 2014**

- Practical illustration of Map-Reduce (Hadoop-style), on real data ** – 1/27
- My thoughts on big data and data science: no, it’s not hype ** – 1/27
- Three myths about data scientists and big data ** – 1/27
- Why Companies can’t find analytic talent ** – 1/27
- Six categories of Data Scientists ** – 1/20
- 2014 Analytics Salary Guide – 1/20
- Machine Learning and Data Mining Books * – 1/20
- Big Data and Data Science Books * – 1/20
- Data Scientist versus Data Architect ** – 1/20
- Data Scientist versus Data Engineer ** – 1/20
- Data Scientist versus Statistician ** – 1/20
- Data Scientist versus Business Analyst ** – 1/20
- Big Data & Natural Language Processing – 1/20
- 10 ways for Banks to achieve greater profit and customer satisfaction * – 1/13
- Boosting Algorithms for Better Predictions * – 1/6
- Big data sets available for free ** – 1/6
- 6,000 Companies Hiring Data Scientists – 1/6
- What is Wrong with the Definition of Data Science ** – 1/6

**December 2013**

- A synthetic variance designed for Hadoop and big data ** – 12/30
- Facebook missing revenue because of poor data science integration ** – 12/30
- The Youngest Data Scientist ** – 12/23
- Harvard classes on data science – 12/23
- Uniquely identify a human being with two questions ** – 12/23
- Operational Data Science: excerpt from 2 great articles * – 12/16
- Has the pace of information growth started to slow? ** – 12/16
- Detecting Patterns with the Naked Eye ** – 12/16
- Retailers Using Big Data: The Secret Behind Amazon and Nordstrom’s …* – 12/16
- Why statistical community is disconnected from Big Data and how to …** – 12/9
- Lambda Architecture for Big Data Systems * – 12/9
- How to estimate how well connected your colleagues are ** – 12/9
- New in Plotly: Interactive Graphs with IPython * – 12/9
- Predictive Analytics for Financial Services * – 12/9
- How to cut everyone’s commute time by a factor two ** – 12/9
- A New Source of Revenue for Data Scientists: Selling Data ** – 12/2
- Moore’s law applied to big data ** – 12/2
- Attribution Modeling ** – 12/2
- Salaries for Hadoop professionals ** – 12/2
- A Statistician’s View on Big Data and Data Science * – 12/2
- 125 Years of Public Health Data to Help Fight Contagious Diseases – 12/2

**November 2013**

- Java Coding Samples for Online Data-mining * – 11/25
- Taxonomy of Data Scientists ** – 11/25
- The Data Science Equation ** – 11/25
- ETL, ELT and Data Hub: Where Hadoop is the right fit ? * – 11/25
- 23 Great Schools with Master’s Programs in Data Science – 11/25
- Big data set – 3.5 billion web pages – made available for all of us ** – 11/25
- Statistics needed for DS and data mining * – 11/25
- Another large data set – 250 million data points ** – 11/25
- Fast Combinatorial Feature Selection with New Definition of Predict… ** – 11/18
- How to compare and rank data science programs? ** – 11/18
- Hidden decision trees revisited ** – 11/18
- Zipfian Academy versus Data Science Apprenticeship – 11/11
- Hadoop as a Data Management Hub * – 11/11
- A Practical Introduction to Data Science from Zipfian Academy * – 11/11
- More than 100 data science, analytics, big data, visualization books ** – 11/11
- Predictive Analytics: Man versus Machine Competition – 11/04
- Data Science Project: Captcha Attack – 11/04
- 16 Reasons Data Scientists are Difficult to Manage ** – 11/04
- Interesting Data Science Application: Steganography ** – 11/04

**October 2013**

- IBM Distinguished Engineer solves Big Data Conjecture ** – 10/28
- R Tutorial for Beginners: A Quick Start-Up Kit * – 10/28
- The Professionalization of Data Science* – 10/28
- Big Data Micro-Segmentation * – 10/28
- Difference between data engineers and data scientists ** – 10/21
- A little known component that should be part of most data science a… ** – 10/21
- Warm-up exercise before data science * – 10/21
- Get state and region from zip code, with simple Perl, Python, or R – 10/21
- Credit card number and password encoder / decoder ** – 10/21
- WEKA: Pluses and minuses – 10/21
- Oil n Gas Sensor Data + Big Data Analytics = Game Changer * – 10/21
- 11 Features any database, SQL or NoSQL, should have ** – 10/14
- Basic Understanding of Big Data * – 10/14
- Python Scikit-learn to simplify Machine learning * – 10/07
- Top Big Data Skills In Demand * – 10/07
- Data Science programs and training currently available – 10/07
- Sample source code for various data science tasks and projects – 10/07
- Wine and alcohol analytics ** – 10/07

**September 2013**

- Data Scientist vs. Statistician ** – 09/30
- Random Forests Algorithm * – 09/30
- Clustering idea for very large datasets ** – 09/30
- Analytics for kids ** – 09/30
- A Data Science Example: Deciding When to Sell Your House * – 09/23
- 50+ Open Source Tools for Big Data – 09/23
- Deriving Value with Data Visualization Tools * – 09/23
- Our best data visualization articles – 09/23
- Google F1 Database: One Step Closer To Discovering The DB Holy Grail – 09/23
- The Purple People: finding business expertise among Data Scientists * – 09/23
- Building better search tools: problems and solutions ** – 9/16
- Hadoop vs. NoSql vs. Sql vs. NewSql By Example * – 9/16
- The Best Of Open Source For Big Data * – 9/16
- The dangers of pseudo analytic science ** – 9/16
- Can you win a Facebook data science job? Take the test! – 9/9
- Marrying computer science, statistics and domain expertize ** – 9/9
- What will America pay for H1-B Jobs? * – 9/9
- Question: Career change into data science (many comments) * – 9/9
- Is the peanut war fueled by lack of analytic thinking? – 9/9
- Conditional Formatting in Excel – Highest Number in Each Row * – 9/9
- How to eliminate a trillion dollars in healthcare costs – 9/2
- Predictive Analytics: Harnessing the Power of Big Data * – 9/2
- An indispensable Python : Data sourcing to Data science * – 9/2
- Data Scientist Core Skills * – 9/2
- Prescriptive Analytics * – 9/2
- Top Languages for analytics, data mining, data science – 9/2
- The death of the statistician – 9/2
- Same dataset – Two different kind of visualizations * – 9/2
- Predictive modeling is useless! Here’s why * – 9/2

**August 2013**

- BI vs. Big Data vs. Data Analytics By Example * – 8/26
- Data Science / Big Data Salary Survey by Burtch Works – 8/26
- 40 maps that explain the world – 8/26
- A Data Scientist’s Guide to Making Money from Start-Ups – 8/26
- How do I forecast a timeseries of data using GARCH(1,1)? * – 8/26
- How is big data used in the porn industry? – 8/26
- Will Data Science Forever Change Branding Strategies? – 8/19
- Batch vs. Real Time Data Processing * – 8/19
- Who Has The Largest Predictive Data Analytics? * – 8/19
- Underfitting/Overfitting Problem in Machine learning * – 8/19
- Why is Vlookup (in Excel) 1,000 times slower than hash tables in Py…** – 8/19
- 101 prime resources on mathematics – 8/19
- SQL: optimizing or eliminating joins? ** – 8/12
- Hadoop: What It Is And Why It’s Such A Big Deal * – 8/12
- A new type of weapons-grade secure email ** – 8/12
- 60+ R resources – 8/12
- 10 Enterprise Predictive Analytics Platforms Compared (2013) – 8/12
- What Makes a Good Data Scientist? – 8/12

Author: Vincent Granville

Here is a selection containing both external and internal papers, focusing on various technical aspects of data science and big data. Feel free to add your favorites.

*Complex Open Text Analysis: Source: Avinash Kaushik*

**External Papers**

- Bigtable: A Distributed Storage System for Structured Data
- A Few Useful Things to Know about Machine Learning
- Random Forests
- A Relational Model of Data for Large Shared Data Banks
- Map-Reduce for Machine Learning on Multicore
- Pasting Small Votes for Classification in Large Databases and On-Line
- Recommendations Item-to-Item Collaborative Filtering
- Recursive Deep Models for Semantic Compositionality Over a Sentimen…
- Spanner: Google’s Globally-Distributed Database
- Megastore: Providing Scalable, Highly Available Storage for Interac…
- F1: A Distributed SQL Database That Scales
- APACHE DRILL: Interactive Ad-Hoc Analysis at Scale
- A New Approach to Linear Filtering and Prediction Problems
- Top 10 algorithms on Data mining
- The PageRank Citation Ranking: Bringing Order to the Web
- MapReduce: Simplified Data Processing on Large Clusters
- The Google File System
- Amazon’s Dynamo

**DSC Internal Papers**

- How to detect spurious correlations, and how to find the …
- Automated Data Science: Confidence Intervals
- 16 analytic disciplines compared to data science
- From the trenches: 360-degree data science
- 10 types of regressions. Which one to use?
- Practical illustration of Map-Reduce (Hadoop-style), on real data
- Jackknife logistic and linear regression for clustering and predict…
- A synthetic variance designed for Hadoop and big data
- Fast Combinatorial Feature Selection with New Definition of Predict…
- Internet topology mapping
- 11 Features any database, SQL or NoSQL, should have
- 10 Features all Dashboards Should Have
- Clustering idea for very large datasets
- Hidden decision trees revisited
- Correlation and R-Squared for Big Data
- What Map Reduce can’t do
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- The curse of big data
- Interesting Data Science Application: Steganography

Author: Vincent Granville

*Originally posted here. Check original article for most recent updates.*

Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation in both Excel and Perl, and discuss our popular model-free confidence interval technique introduced in our original Analyticbridge article. This technique has the following advantages:

- Very easy to understand by non-statisticians (business analysts, software engineers, programmers, data architects)
- Simple (if not basic) to code; no need to use tables of Gaussian, Student or other statistical distributions
- Robust, not sensitive to outliers
- Model-independent, data-driven: no assumptions required about the data set; it works with non-normal data, and produces asymmetrical confidence intervals
- Therefore, suitable for black-box implementation or automated data science

This is part of our series on data science techniques suitable for automation, usable by non-experts. The next one to be detailed (with source code) will be our Hidden Decision Trees.

**Figure 1**: Confidence bands based on our CI (bold red and blue curves) – *Comparison with traditional normal model (light red and blue curves)*

Figure 1 is based on simulated data that does not follow a normal distribution: see section 2 and Figure 2 in this article. It shows how sensitive CI’s are to model assumptions, when using the traditional approach, leading to very conservative (and plain wrong) CI’s. Classical CI’s are just based on 2 parameters: mean and variance. With the classical model, all data sets with the same mean and the same variance have the same CI’s. To the contrary, our CI’s are based on k parameters – average values computed on k different bins – see next section for details. In short, they are much better predictive indicators when your data is not normal. Yet they are so easy to understand and compute, you don’t even need to understand probability 101 to get started. The attached spreadsheet and Perl scripts have all computations done for you.

**1. General Framework**

We assume that we have n observations from a continuous or discrete variable. We randomly assign a bin number to each observation: we create k bins (1 ≤ k ≤ n) that have similar or identical sizes. We compute the average value in each bin, then we sort these averages. Let p(m) be the m-th lowest average (1 ≤ m ≤ k/2, with p(1) being the minimum average). Then our CI is defined as follows:

- Lower bound: p(m)
- Upper bound: p(k-m+1)
- Confidence level, also called level or CI level: equal to 1 – 2m/(k+1)

The confidence level represents the probability that a new observation (from the same data set) will be between the lower and upper bounds of the CI. Note that this method produces asymmetrical CI’s. It is equivalent to designing percentile-based confidence intervals on aggregated data. In practice, k is chosen much smaller than n, say k = SQRT(n). Also, m is chosen so that 1 – 2m/(k+1) is as close as possible to a pre-specified confidence level, for instance 0.95. Note that the higher m, the more robust (less sensitive to outliers) your CI.
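The choice of k and m described above can be sketched in a few lines of Python. This is only an illustration: the function name `choose_bins` is made up for this example, and it assumes k is taken as the integer part of SQRT(n).

```python
import math


def choose_bins(n, target_level=0.95):
    """Pick k ~ sqrt(n) and the m whose level 1 - 2m/(k+1) is closest to the target.

    Hypothetical helper (not from the original article).
    """
    if n < 4:
        raise ValueError("need at least 4 observations to get k >= 2")
    k = math.isqrt(n)                      # k = SQRT(n), rounded down
    candidates = range(1, k // 2 + 1)      # m must satisfy 1 <= m <= k/2
    m = min(candidates,
            key=lambda j: abs((1 - 2 * j / (k + 1)) - target_level))
    level = 1 - 2 * m / (k + 1)
    return k, m, level


k, m, level = choose_bins(10_000, target_level=0.95)
print(f"k={k}, m={m}, level={level:.4f}")
```

With n = 10,000 the closest achievable level is not exactly 0.95, which is where the interpolation between neighboring values of m comes in.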

If you can’t find m and k to satisfy level = 0.95 (say), then compute a few CI’s (with different values of m), with confidence levels close to 0.95. Then interpolate or extrapolate the lower and upper bounds to get a CI with 0.95 confidence level. The concept is easy to visualize if you look at Figure 1. Also, do proper cross-validation: split your data in two; compute CI’s using the first half, and test them on the other half, to see if they still make sense (same confidence level, etc.)

CI’s are extensively used in quality control, to check if a batch of new products (say, batteries) has failure rates, lifetime or other performance metrics that are within reason, that are acceptable. Or if wine advertised with 12.5% alcohol content has an actual alcohol content reasonably close to 12.5% in each batch, year after year. By “acceptable” or “reasonable”, we mean between the upper and lower bound of a CI with pre-specified confidence level. CI’s are also used in scoring algorithms, to provide a CI for each score. The CI provides an indication about how accurate the score is. Very small confidence levels (that is, narrow CI’s) correspond to data that is well understood, with all sources of variance perfectly explained. Conversely, large CI’s mean lots of noise and high individual variance in the data. Finally, if your data is stratified in multiple heterogeneous segments, compute separate CI’s for each stratum.

That’s it, no need to know even rudimentary statistical science to understand this CI concept, as well as the concept of hypothesis testing (derived from CI) explained below in section 3.

**When Big Data is Useful**

If you look closely at Figure 1, it’s clear that you can’t compute accurate CI’s with a high (above 0.99) level, with just a small sample and (say) k=100 bins. The higher the level, the more volatile the CI. Typically, an 0.999 level requires 10,000 or more observations to get something stable. These high-level CI’s are needed especially in the context of assessing failure rates, food quality, fraud detection or sound statistical litigation. There are ways to work with much smaller samples by combining 2 tests, see section 3.

An advantage of big data is that you can create many different combinations of k bins (that is, test many values of m and k) to look at how the confidence bands in Figure 1 change depending on the bin selection – even allowing you to create CI’s for these confidence bands, just like you could do with Bayesian models.

**2. Computations: Excel, Source Code**

The first step is to re-shuffle your data to make sure that your observations are in perfect random order: read *A New Big Data Theorem* section in this article for an explanation why reshuffling is necessary (look at the second theorem). In short, you want to create bins that have the same mix of values: if the first half of your data set consisted of negative values, and the second half of positive values, you might end up with bins either filled with positive or negative values. You don’t want that; you want each bin to be well balanced.

**Reshuffling Step**

Unless you know that your data is in an arbitrary order (this is the case most frequently), reshuffling is recommended. Reshuffling can easily be performed as follows:

- Add a column or variable called RAN, made up of simulated random numbers, using a function such as 100,000 + INT(10,000*RAND()) where RAND() returns a random number between 0 and 1.
- Sort your data by column RAN
- Delete column RAN

Note that we use 100,000 + INT(10,000*RAND()) rather than just simply RAND() to make sure that all random numbers are integers with the same number of digits. This way, whether you sort alphabetically or numerically, the result will be identical, and correct. Sorting numbers of variable length alphabetically (without knowing it) is a source of many bugs in software engineering. This little trick helps you avoid this problem.

If the order in your data set is very important, just add a column that has the original rank attached to each observation (in your initial data set), and keep it through the reshuffling process (after each observation has been assigned to a bin), so that you can always recover the original order if necessary, by sorting back according to this extra column.
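For readers working outside Excel, the reshuffling step can be sketched in Python as follows. The function name `reshuffle` is an assumption of this example. It mimics the RAN column with fixed-width integer keys and keeps the original rank next to each observation so the initial order can be recovered.

```python
import random


def reshuffle(rows, seed=None):
    """Reshuffle rows by sorting on a simulated RAN column.

    Returns (original_rank, row) pairs. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    keyed = [
        # RAN = 100,000 + INT(10,000 * RAND()): always a 6-digit integer
        (100_000 + int(10_000 * rng.random()), rank, row)
        for rank, row in enumerate(rows)
    ]
    keyed.sort(key=lambda item: item[0])   # sort by RAN only
    # Drop RAN, keep the original rank attached to each observation
    return [(rank, row) for _, rank, row in keyed]


shuffled = reshuffle([3.2, -1.0, 7.5, 0.4, 2.2], seed=42)
restored = [row for _, row in sorted(shuffled)]   # sort back by original rank
print(shuffled)
print(restored)
```

In Python itself, `random.shuffle` achieves the same result directly; the key-based version is shown only because it mirrors the spreadsheet procedure.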

**The Spreadsheet**

Download the Excel spreadsheet. Figures 1 and 2 are in the spreadsheet, as well as all CI computations, and more. The spreadsheet illustrates many not so well known but useful analytic Excel functions, such as: FREQUENCY, PERCENTILE, CONFIDENCE.NORM, RAND, AVERAGEIF, MOD (for bin creations) and RANK. The CI computations are in cells O2:Q27 in the Confidence Intervals tab. You can modify the data in column B, and all CI’s will automatically be re-computed. Beware if you change the number of bins (cell F2): this can screw up the RANK function in column J (some ranks will be missing) and then screw up the CI’s.

For other examples of great spreadsheets (from a tutorial point of view), check the Excel section in our data science cheat sheet.

**Simulated Data**

The simulated data in our Excel spreadsheet (see the *data simulation* tab), represents a mixture of two uniform distributions, driven by the parameters in the orange cells F2, F3 and H2. The 1,000 original simulated values (see Figure 2) were stored in column D, and were subsequently hard-copied into column B in the *Confidence Interval* (results) tab (they still reside there), because otherwise, each time you modify the spreadsheet, new deviates produced by the RAND Excel function are automatically updated, changing everything and making our experiment non-reproducible. This is a drawback of Excel, though I’ve heard that it is possible to freeze numbers produced by the function RAND. The simulated data is remarkably non-Gaussian, see Figure 2. It provides a great example of data that causes big problems with traditional statistical science, as described in our following subsection.

In any case, this is an interesting tutorial on how to generate simulated data in Excel. Other examples can be found in our Data Science Cheat Sheet (see Excel section).

**Comparison with Traditional Confidence Intervals**

We provide a comparison with standard CI’s (available in all statistical packages) in Figure 1, and in our spreadsheet. There are a few ways to compute traditional CI’s:

- Simulate Gaussian deviates with pre-specified variance matching your data variance, by (1) generating (say) 10 million uniform deviates on [-1, +1] using a great random generator, (2) randomly grouping these generated values in 10,000 buckets each having 1,000 deviates, and (3) computing averages in each bucket. These 10,000 averages will approximate a Gaussian distribution very well; all you need to do is to scale them so that the variance matches the variance in your data set. Then compute intervals that contain 99%, 95%, or 90% of all the scaled averages: these are your standard Gaussian CI’s.
- Use libraries to simulate Gaussian deviates, rather than the cave-man approach mentioned above. Source code and simulators can be found in books such as Numerical Recipes.
- In our Excel spreadsheet, we used the Confidence.norm function.
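The first (simulation-based) route in the list above might look like this in Python. It is a rough sketch with much smaller default sizes than the 10 million deviates mentioned, so that it runs quickly. The function name is invented for this example, and re-centering the interval on the data mean is an assumption, since the text only discusses scaling.

```python
import random
import statistics


def simulated_gaussian_ci(data, level=0.95, buckets=2000, per_bucket=200, seed=0):
    """Gaussian-style CI built from bucket averages of uniform deviates.

    Hypothetical helper; sizes are reduced for speed.
    """
    rng = random.Random(seed)
    # Steps (1)-(3): bucket averages of uniform deviates on [-1, +1]
    averages = [
        statistics.fmean(rng.uniform(-1, 1) for _ in range(per_bucket))
        for _ in range(buckets)
    ]
    # Scale so the simulated spread matches the data spread,
    # then shift to the data mean (assumption of this sketch)
    scale = statistics.pstdev(data) / statistics.pstdev(averages)
    center = statistics.fmean(data)
    scaled = sorted(center + scale * a for a in averages)
    # Keep the central `level` fraction of the scaled averages
    tail = (1 - level) / 2
    lower = scaled[int(tail * buckets)]
    upper = scaled[int((1 - tail) * buckets) - 1]
    return lower, upper


sample = [float(i % 7) for i in range(700)]   # toy data: mean 3, variance 4
print(simulated_gaussian_ci(sample, level=0.95))
```

Increasing `buckets` and `per_bucket` toward the values quoted in the text makes the percentiles more stable, at the cost of run time.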

As you can see in Figure 1, traditional CI’s fail miserably if your data has either a short or long tail, compared with data originating from a Gaussian process.

**Perl Code**

Here’s some simple source code to compute CI for given m and k:

    use strict;
    use warnings;

    my $k = 50;   # number of bins
    my $m = 5;    # rank of the bin average used as bound (1 <= m <= k/2)
    my $n = 0;    # number of observations read so far
    my (@binSum, @binCount, @binAVG);

    open(my $in, '<', 'data.txt') or die "Cannot open data.txt: $!";
    while (my $value = <$in>) {
        chomp $value;                  # remove the trailing newline
        my $binNumber = $n % $k;       # assign observations to bins in turn
        $binSum[$binNumber] += $value;
        $binCount[$binNumber]++;
        $n++;
    }
    close($in);

    if ($n < $k) {
        print "Error: Too few observations: n < k (choose a smaller k)\n";
        exit();
    }
    if ($m > $k/2) {
        print "Error: reduce m (must be <= k/2)\n";
        exit();
    }

    for (my $binNumber = 0; $binNumber < $k; $binNumber++) {
        $binAVG[$binNumber] = $binSum[$binNumber] / $binCount[$binNumber];
    }

    my @sortedBinAVG = sort { $a <=> $b } @binAVG;   # sorting bin averages numerically

    # Perl arrays start at 0: p(m) is element m-1, p(k-m+1) is element k-m
    my $CI_LowerBound = $sortedBinAVG[$m - 1];
    my $CI_UpperBound = $sortedBinAVG[$k - $m];
    my $CI_level = 1 - 2*$m/($k + 1);
    print "CI = [$CI_LowerBound,$CI_UpperBound] (level = $CI_level)\n";

**Exercise**: write the code in R or Python.
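One possible Python answer to this exercise is sketched below. It follows the definitions of section 1, with p(m) as lower bound, p(k-m+1) as upper bound, and level 1 – 2m/(k+1). It assumes the observations have already been reshuffled, so assigning them to bins in turn is equivalent to a random assignment. The function name `model_free_ci` is made up for this example.

```python
def model_free_ci(values, k, m):
    """Return (lower bound, upper bound, level) of the bin-average CI."""
    n = len(values)
    if n < k:
        raise ValueError("too few observations: n < k (choose a smaller k)")
    if not 1 <= m <= k / 2:
        raise ValueError("m must satisfy 1 <= m <= k/2")

    bin_sum = [0.0] * k
    bin_count = [0] * k
    for i, value in enumerate(values):
        bin_sum[i % k] += value        # observation i goes to bin i mod k
        bin_count[i % k] += 1

    averages = sorted(s / c for s, c in zip(bin_sum, bin_count))
    lower = averages[m - 1]            # p(m), the m-th lowest average
    upper = averages[k - m]            # p(k-m+1), the m-th highest average
    level = 1 - 2 * m / (k + 1)
    return lower, upper, level


lower, upper, level = model_free_ci(list(range(100)), k=10, m=1)
print(f"CI = [{lower}, {upper}] (level = {level:.3f})")
```

Reading the data from a file, as the Perl script does, only requires replacing the list passed to the function with the parsed lines of `data.txt`.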

**3. Application to Statistical Testing**

Rather than using *p*-values and other dangerous concepts (about to become extinct) that nobody but statisticians understand, here is an easy way to perform statistical tests. The method below is part of what we call rebel statistical science.

Let’s say that you want to test, with 99.5% confidence (level = 0.995), whether or not a wine manufacturer consistently produces a specific wine that has a 12.5% alcohol content. Maybe you are a lawyer, and the wine manufacturer is accused of lying on the bottle labels (claiming that alcohol content is 12.5% when in fact it is 13%), maybe to save some money. The test to perform is as follows: check out 100 bottles from various batches, and compute a 0.995-level CI for alcohol content. Is 12.5% between the upper and lower bounds? Note that you might not be able to get an exact 0.995-level CI if your sample size n is too small (say n=100); you will have to extrapolate from lower level CI’s. The reason to use a high confidence level here is to give the defendant the *benefit of the doubt* rather than wrongly accusing him based on too small a *confidence level*. If 12.5% is found inside even a small 0.50-level CI (which will be the case if the wine is truly 12.5% alcohol), then a fortiori it will be inside a 0.995-level CI, because these CI’s are nested (see Figure 1 to understand these ideas). Likewise, if the wine truly has a 13% alcohol content, a tiny 0.03-level CI containing the value 13% will be enough to prove it.

One way to better answer these statistical tests (when your high-level CI’s don’t provide an answer) is to produce 2 or 3 tests (but no more, otherwise your results will be biased). Test whether the alcohol rate is

- As declared by the defendant (test #1)
- Equal to a pre-specified value (the median computed on a decent sample, test #2)

**4. Miscellaneous**

We include two figures in this section. The first one is about the data used in our test and Excel spreadsheet, to produce our confidence intervals. And the other figure shows the theorem that justifies the construction of our confidence intervals.

**Figure 2**: Simulated data used to compute CI’s: asymmetric mixture of non-normal distributions

**Figure 3**: Theorem used to justify our confidence intervals

Author: Vincent Granville

*Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.*

Specifically designed in the context of big data in our research lab, the new and simple *strong correlation* synthetic metric proposed in this article should be used, whenever you want to check if there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now, to avoid being accused of reckless data science and even being sued for wrongful analytic practice.

In this paper, the traditional correlation is referred to as the *weak correlation*, as it captures only a small part of the association between two variables: *weak correlation* results in capturing spurious correlations and predictive modeling deficiencies, even with as few as 100 variables. In short, our *strong correlation* (with a value between 0 and 1) is high (say above 0.80) only if the *weak correlation* is high (in absolute value) and the internal structures (auto-dependencies) of both variables X and Y that you want to compare exhibit a similar pattern or correlogram. Yet this new metric is simple and involves just one parameter **a** (with **a** = 0 corresponding to *weak correlation*, and **a** = 1 being the recommended value for *strong correlation*). This setting is designed to avoid over-fitting.

Our *strong correlation* blends together the concept of ordinary or *weak regression* – indeed, an improved, robust, outlier-resistant version of ordinary regression (or see my book pages 130-140) – together with the concept of X and Y sharing similar bumpiness (or see my book pages 125-128).

In short, even nowadays, what makes two variables X and Y *seem* related in most scientific articles and pretty much all articles written by journalists, is based on ordinary (weak) regression. But there are plenty of other metrics that you can use to compare two variables. Including bumpiness in the mix (together with weak regression in just one single blended metric called *strong correlation* to boost accuracy) guarantees that high *strong* correlation means that the two variables are really associated, not just based on flawed, old-fashioned *weak* correlations, but also associated based on sharing similar internal auto-dependencies and structure. To put it differently, two variables can be highly *weakly* correlated yet have very different bumpiness coefficients, as shown in my original article – meaning that there might be no causal relationship (or see my book pages 165-168) or hidden factors explaining the link. An artificial example is provided below in figure 3.

Using *strong*, rather than *weak* correlation, eliminates the majority of these spurious correlations, as we shall see in the examples below. This *strong correlation* metric is designed to be integrated in automated data science algorithms.

**1. Formal definition of strong correlation**

Let’s define

- *Weak correlation* c(X, Y) as the absolute value of the ordinary correlation, with value between 0 and 1. This number is high (close to 1) if X and Y are highly correlated. I recommend using my rank-based, L-1 correlation (or see my book pages 130-140) to eliminate problems caused by outliers.
- c1(X) as the lag-1 auto-correlation for X, that is, if we have n observations X_1 … X_n, then c1(X) = c(X_1 … X_{n-1}, X_2 … X_n)
- c1(Y) as the lag-1 auto-correlation for Y
- *d-correlation* d(X, Y) = exp{ –**a** * | ln( c1(X) / c1(Y) ) | }, with possible adjustment if numerator or denominator is zero; the parameter **a** must be positive or zero. This number, with value between 0 and 1, is high (close to 1) if X and Y have similar lag-1 auto-correlations.
- *Strong correlation* r(X, Y) = min{ c(X, Y), d(X, Y) }

Note that c1(X), and c1(Y) are the bumpiness coefficients (or see my book pages 125-128) for X and Y. Also, d(X, Y) and thus r(X, Y) are between 0 and 1, with 1 meaning strong similarity between X and Y, and 0 meaning either dissimilar lag-1 auto-correlations for X and Y, or lack of old-fashioned correlation.
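The definition above translates into a short Python sketch. Two choices here are assumptions of this example, not part of the definition: c(X, Y) is computed with the ordinary Pearson correlation rather than the rank-based L-1 correlation recommended in the text, and the "possible adjustment" for a zero lag-1 auto-correlation is handled with a small floor `eps`.

```python
import math


def pearson(x, y):
    """Ordinary (Pearson) correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def lag1(x):
    """c1(X) = c(X_1 ... X_{n-1}, X_2 ... X_n), in absolute value."""
    return abs(pearson(x[:-1], x[1:]))


def strong_correlation(x, y, a=1.0, eps=1e-12):
    """r(X, Y) = min{ c(X, Y), d(X, Y) }; eps is an assumed zero-adjustment."""
    c = abs(pearson(x, y))
    c1x = max(lag1(x), eps)
    c1y = max(lag1(y), eps)
    d = math.exp(-a * abs(math.log(c1x / c1y)))
    return min(c, d)


smooth = [1, 2, 3, 4, 5, 6, 7, 8]
bumpy = [1, 8, 2, 7, 3, 6, 4, 5]
print(strong_correlation(smooth, bumpy, a=0))   # weak correlation only
print(strong_correlation(smooth, bumpy, a=1))   # penalized by differing bumpiness
```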

The *strong correlation* between X and Y is, by definition, r(X, Y). This is an approximation to having both spectra identical, a solution mentioned in my article The curse of Big Data (see also my book pages 41-45).

This definition of strong correlation was initially suggested in one of our weekly challenges.

**2. Comparison with traditional (weak) correlation**

When **a** = 0, weak and strong correlations are identical. Note that the *strong correlation* r(X, Y) still shares the same properties as the *weak correlation* c(X, Y): it is symmetric and invariant under linear transformations (such as re-scaling) of variables X or Y, regardless of **a**.

Author: Vincent Granville

*Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.*

Here I will discuss a general framework to process web traffic data. The concept of Map-Reduce will be naturally introduced. Let’s say you want to design a system to score Internet clicks, to measure the chance for a click to convert, or the chance to be fraudulent or un-billable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some conversions are soft, just a click-out, and conversion rate is above 10%; some conversions are hard, for instance a credit card purchase, and conversion rate is below 1%). Here, for now, we just ignore the conversion data and focus on the low hanging fruits: click data. Other valuable data is impression data (for instance a click not associated with an impression is very suspicious). But impression data is huge, 20 times bigger than click data. We ignore impression data here.

Here, we work with complete click data collected over a 7-day time period. Let’s assume that we have 50 million clicks in the data set. Working with a sample is risky, because much of the fraud is spread across a large number of affiliates, and involves clusters (small and large) of affiliates, and tons of IP addresses but few clicks per IP per day (low frequency).

The data set (ideally, a tab-separated text file, as CSV files can cause field misalignment here due to text values containing field separators) contains 60 fields: keyword (user query or advertiser keyword blended together, argh…), referral (actual referral domain or ad exchange domain, blended together, argh…), user agent (UA, a long string; UA is also known as browser, but it can be a bot), affiliate ID, partner ID (a partner has multiple affiliates), IP address, time, city and a bunch of other parameters.

The first step is to extract the relevant fields for this quick analysis (a few days of work). Based on domain expertise, we retained the following fields:

- IP address
- Day
- UA (user agent) ID – so we created a look-up table for UA’s
- Partner ID
- Affiliate ID

These 5 metrics are the base metrics to create the following summary table. Each (IP, Day, UA ID, Partner ID, Affiliate ID) represents our atomic (most granular) data bucket.

**Building a summary table: the Map step**

The summary table will be built as a text file (just like in Hadoop), the data key (for joins or groupings) being (IP, Day, UA ID, Partner ID, Affiliate ID). For each atomic bucket (IP, Day, UA ID, Partner ID, Affiliate ID) we also compute:

- number of clicks
- number of unique UA’s
- list of UA

The list of UA’s, for a specific bucket, looks like ~6723|9~45|1~784|2, meaning that in the bucket in question, there are three browsers (with ID 6723, 45 and 784), 12 clicks (9 + 1 + 2), and that (for instance) browser 6723 generated 9 clicks.
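To make the encoding concrete, here is a small Python sketch that parses and rebuilds such a UA list. The helper names `parse_ua_list` and `format_ua_list` are invented for this example.

```python
def parse_ua_list(ua_list):
    """Turn '~6723|9~45|1~784|2' into {6723: 9, 45: 1, 784: 2}."""
    counts = {}
    for item in ua_list.split("~"):
        if not item:                     # skip the empty piece before the first '~'
            continue
        ua_id, clicks = item.split("|")
        counts[int(ua_id)] = int(clicks)
    return counts


def format_ua_list(counts):
    """Inverse operation: rebuild the '~id|clicks' string."""
    return "".join(f"~{ua_id}|{clicks}" for ua_id, clicks in counts.items())


counts = parse_ua_list("~6723|9~45|1~784|2")
print(len(counts), "browsers,", sum(counts.values()), "clicks")
```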

In Perl, these computations are easily performed, as you sequentially browse the data. The following updates the click count:

$hash_clicks{"IP\tDay\tUA_ID\tPartner_ID\tAffiliate_ID"}++;

Updating the list of UA’s associated with a bucket is a bit less easy, but still almost trivial.
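The same click-count update can be sketched in Python with a dictionary playing the role of the Perl hash. The function name `count_clicks` and the sample field values are made up for this illustration; the key is the tab-separated bucket, as in the summary table.

```python
from collections import defaultdict


def count_clicks(clicks):
    """clicks: iterable of (ip, day, ua_id, partner_id, affiliate_id) tuples."""
    click_count = defaultdict(int)
    for fields in clicks:
        key = "\t".join(str(field) for field in fields)   # tab-separated data key
        click_count[key] += 1                              # update the click count
    return dict(click_count)


sample_clicks = [
    ("1.2.3.4", "day1", 6723, "partnerA", "affil7"),
    ("1.2.3.4", "day1", 6723, "partnerA", "affil7"),
    ("1.2.3.4", "day1", 45, "partnerA", "affil7"),
]
for key, n_clicks in count_clicks(sample_clicks).items():
    print(key.replace("\t", " | "), "->", n_clicks)
```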

The problem is that at some point, the hash table becomes too big and will slow down your Perl script to a crawl. The solution is to split the big data in smaller data sets (called subsets), and perform this operation separately on each subset. This is called the *Map* step, in Map-Reduce. You need to decide which fields to use for the mapping. Here, IP address is a good choice because it is very granular (good for load balance), and the most important metric. We can split the IP address field in 20 ranges based on the first byte of the IP address. This will result in 20 subsets. The splitting in 20 subsets is easily done by browsing sequentially the big data set with a Perl script, looking at the IP field, and throwing each observation in the right subset based on the IP address.
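A minimal Python sketch of this splitting step follows. It assumes 20 equal-width ranges of the first byte (the text does not say how the ranges are chosen) and that the IP address is the first tab-separated field. In production each subset would be written to its own file rather than kept in memory. Function names are hypothetical.

```python
def subset_index(ip, n_subsets=20):
    """Map an IP address to one of n_subsets ranges of its first byte."""
    first_byte = int(ip.split(".")[0])      # 0..255
    return first_byte * n_subsets // 256    # 0..n_subsets-1


def map_step(lines, ip_field=0, n_subsets=20):
    """Throw each tab-separated observation into the subset for its IP range."""
    subsets = [[] for _ in range(n_subsets)]
    for line in lines:
        ip = line.split("\t")[ip_field]
        subsets[subset_index(ip, n_subsets)].append(line)
    return subsets


subsets = map_step(["10.0.0.1\tday1\t6723", "200.1.1.1\tday1\t45"])
print([len(s) for s in subsets])
```

Because real traffic is not uniform over the first byte, ranges with equal click volume would balance the load better than equal-width ranges.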

**Building a summary table: the Reduce step**

Now, after producing the 20 summary tables (one for each subset), we need to merge them together. We can’t simply use a hash table here, because it will grow too large and it won’t work – the reason why we used the *Map* step in the first place.

Here’s the work around: …

Author: Vincent Granville

*Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.*

This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables.

Our goal is to produce a regression tool that can be used as a black box, be very robust and parameter-free, and usable and easy-to-interpret by non-statisticians. It is part of a bigger project: automating many fundamental data science tasks, to make it easy, scalable and cheap for data consumers, not just for data experts. Our previous attempts at automation include

Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.

**1. Introduction**

As in our previous paper, without loss of generality, we focus on linear regression with centered variables (with zero mean), and no intercept. Generalization to logistic or non-centered variables is straightforward.

Thus we are still dealing with the following regression framework:

Y = a_1 * X_1 + … + a_n * X_n + noise

Remember that the solution proposed in our previous paper was

- b_i = cov(Y, X_i) / var(X_i), i = 1, …, n
- a_i = M * b_i, i = 1, …, n
- M (a real number, not a matrix) is chosen to minimize var(Z), with Z = Y – (a_1 * X_1 + … + a_n * X_n)

When cov(X_i, X_j) = 0 for i < j, my regression and the classical regression produce identical regression coefficients, and M = 1.
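The recipe above can be sketched in plain Python. One step is derived here rather than taken from the text: writing S0 = b_1 * X_1 + … + b_n * X_n, the quantity var(Y – M * S0) is a quadratic in M, so its minimum is at M = cov(Y, S0) / var(S0). Function names are illustrative only.

```python
def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Population covariance; cov(u, u) is the variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def approximate_regression(y, xs):
    """Return (a, M) with a_i = M * b_i and b_i = cov(Y, X_i) / var(X_i)."""
    b = [cov(y, x) / cov(x, x) for x in xs]
    # S0 = sum of b_i * X_i, observation by observation
    s0 = [sum(bi * x[t] for bi, x in zip(b, xs)) for t in range(len(y))]
    # Minimizer of var(Y - M * S0), from setting the derivative in M to zero
    M = cov(y, s0) / cov(s0, s0)
    a = [M * bi for bi in b]
    return a, M


# Two uncorrelated, centered features: M should be 1 and a should be [2, 3]
x1 = [1, -1, 1, -1]
x2 = [1, 1, -1, -1]
y = [2 * u + 3 * v for u, v in zip(x1, x2)]
print(approximate_regression(y, [x1, x2]))
```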

Terminology: Z is the noise, Y is the (observed) response, the a_i’s are the regression coefficients, and S = a_1 * X_1 + … + a_n * X_n is the estimated or predicted response. The X_i’s are the independent variables or features.

**2. Re-visiting our previous data set**

I have added more cross-correlations to the previous simulated dataset consisting of 4 independent variables, still denoted as x, y, z, u in the new, updated attached spreadsheet. Now corr(x, y) = 0.99.

Author: Vincent Granville

*Originally posted on Hadoop360, by Dr. Granville. Click here to read original article and comments.*

The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a numerically unstable formula.

**Synthetic Metrics**

This new metric is *synthetic*: It was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop (see above picture). By *synthetic*, I mean that it was built to address issues with big data (outliers) and the way many big data computations are now done: Map Reduce framework, Hadoop being an implementation. It is a top-down approach to metric design – from data to theory, rather than the bottom-up traditional approach – from theory to data.

Other synthetic metrics designed in our research laboratory include:

**Hadoop, numerical and statistical stability**

There are two issues with the formula used for computing Variance in Hadoop. First, the formula used, namely Var(x1, … , xn) = {SUM(xi^2)/n} – {SUM(xi)/n}^2, is notoriously unstable. For large n, while both terms cancel out somewhat, each one taken separately can take a huge value, because of the squares aggregated over billions of observations. It results in numerical inaccuracies, with people having reported negative variances. Read the comments attached to my article The curse of Big Data for details. Besides, there are variance formulas that do not require two passes of the entire data set, and that are numerically stable.
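To illustrate the instability, the sketch below compares the formula quoted above with Welford's one-pass update, a well-known numerically stable alternative. Welford's method is brought in here only as an example of a stable single-pass formula; it is not the synthetic variance proposed in this article.

```python
def naive_variance(xs):
    """Var = SUM(x^2)/n - (SUM(x)/n)^2, the unstable formula quoted above."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2


def welford_variance(xs):
    """One-pass, numerically stable population variance (Welford's update)."""
    n = 0
    mean = 0.0
    m2 = 0.0                           # running sum of squared deviations
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n


# Small spread around a huge offset: the true variance is 22.5
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
print("naive  :", naive_variance(data))
print("welford:", welford_variance(data))
```

The naive version subtracts two numbers close to 10^18, so the digits that carry the variance are lost to rounding; the running update never forms those large squares.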

Author: Vincent Granville

*Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.*

In this article, I propose a simple metric to measure **predictive power**. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see a Hadoop methodology for the technology proposed here. It can easily be implemented in a Map Reduce framework. It was developed by the author in the context of credit card fraud detection, and click/keyword scoring. This material will be part of our data science apprenticeship, and included in our Wiley book.

*Feature selection* is a methodology used to detect the best subset of features, out of dozens or hundreds of features (also called variables or rules). By “best”, we mean with highest *predictive power*, a concept defined in the following subsection. In short, we want to remove duplicate features, simplify a bit the correlation structure (among features) and remove features that bring no value, such as features taking on random values, thus lacking predictive power, or features (rules) that are almost never triggered (except if they are perfect fraud indicators when triggered).

The problem is combinatorial in nature. You want a manageable, small set of features (say 20 features) selected from (say) a set of 500 features, to run our *hidden decision trees* (or some other classification / scoring technique) in a way that is statistically robust. But there are 2.7 * 10^35 combinations of 20 features out of 500, and you need to compute all of them to find the one with maximum predictive power. This problem is computationally intractable, and you need to find an alternate solution. The good thing is that you don’t need to find the absolute maximum; you just need to find a subset of 20 features that is good enough.

One way to proceed is to compute the predictive power of each feature. Then, add one feature at a time to the subset (starting with 0 features) until you reach either

- 20 features (your limit)
- Adding a new feature does not significantly improve the overall predictive power of the subset (in short, convergence has been attained)

At each iteration, choose the feature to be added among the two remaining features with the highest predictive power: you will choose (among these two features) the one that increases the overall predictive power (of the subset under construction) most. Now you have reduced your computations from 2.7 * 10^35 to 40 = 2 * 20.
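The stepwise procedure above can be sketched in Python as follows. Since the predictive power metric itself is defined elsewhere in the article, it is represented here by two caller-supplied functions, `individual_power` and `subset_power`. Both are placeholders, as are the toy weights in the usage example, the `min_gain` threshold, and the assumption that the empty subset has power 0.

```python
import math


def greedy_select(features, individual_power, subset_power,
                  max_features=20, min_gain=1e-3):
    """Add one feature at a time, choosing among the two strongest remaining ones."""
    remaining = sorted(features, key=individual_power, reverse=True)
    selected = []
    current = 0.0                        # assumed power of the empty subset
    while remaining and len(selected) < max_features:
        candidates = remaining[:2]       # two remaining features with highest power
        scores = {f: subset_power(selected + [f]) for f in candidates}
        best = max(scores, key=scores.get)
        if scores[best] - current < min_gain:
            break                        # convergence: no significant improvement
        selected.append(best)
        current = scores[best]
        remaining.remove(best)
    return selected


print(f"{math.comb(500, 20):.2e} subsets of 20 features out of 500")

# Toy usage with made-up, additive predictive powers
toy_power = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.0}
print(greedy_select(list(toy_power), toy_power.get,
                    lambda subset: sum(toy_power[f] for f in subset)))
```

Each iteration evaluates the subset power at most twice, which is where the count of 2 * 20 = 40 evaluations comes from.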

Author: Vincent Granville

*Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.*

This is a component often missing, yet valuable for most systems, algorithms and architectures that are dealing with online or mobile data, known as digital data: be it transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online search, plagiarism and spam detection, etc.

I will call it an **Internet Topology Mapping**. It might not be stored as a traditional database (it could be a graph database, a file system, or a set of look-up tables). It must be pre-built (e.g. as look-up tables, with regular updates) to be efficiently used.

**So what is the Internet Topology Mapping?**

Essentially, it is a system that matches an IP address (Internet or mobile) with a domain name (ISP). When you process a transaction in real time in production mode (e.g. an online credit card transaction, to decide whether to accept or decline it), your system only has a few milliseconds to score the transaction to make the decision. In short, you only have a few milliseconds to call and run an algorithm (sub-process), on the fly, separately for each credit card transaction, to decide on accepting/rejecting. If the algorithm involves matching the IP address with an ISP domain name (this operation is called *nslookup*), it won’t work: direct nslookups take between a few hundred and a few thousand milliseconds, and they will slow the system to a crawl.

Because of that, Internet Topology Mappings are missing in most systems. Yet there is a very simple workaround to build it:

- Look at all the IP addresses in your database. Chances are, even if you are Google, 99.9% of your traffic is generated by fewer than 100,000,000 IP addresses. Indeed, the total number of IP addresses (the whole universe) consists of less than 256^4 = 4,294,967,296 IP addresses. That’s about 4 billion, not that big of a number in the real scheme of big data. Also, many IP addresses are clustered: 120.176.231.xxx are likely to be part of the same domain, for xxx in the range (say) 124-180. In short, you need to store a lookup table possibly as small as 20,000,000 records (IP ranges / domain mapping) to solve the nslookup issue for 99.9% of your transactions. For the remaining 0.1%, you can either assign ‘Unknown Domain’ (not recommended, since quite a few IP addresses actually have unknown domain), or ‘missing’ (better) or perform the cave-man, very slow nslookup on the fly.
- Create the look-up table that maps IP ranges to domain names, for 99.9% of your traffic.

When processing a transaction, access this look-up table (stored in memory, or at least with some caching available in memory) to detect the domain name. Now you can use a rule system that does incorporate domain names.
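An in-memory look-up table of this kind could be sketched in Python with a sorted list of IP ranges and a binary search. The class name, the domain names and the ranges below are invented for illustration, and the ranges are assumed not to overlap. Addresses outside every range get the value 'missing', as recommended above.

```python
import bisect


def ip_to_int(ip):
    """Convert a dotted IPv4 address to an integer between 0 and 256^4 - 1."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


class IpRangeTable:
    """Hypothetical look-up table mapping IP ranges to domain names."""

    def __init__(self, ranges):
        # ranges: iterable of (first_ip, last_ip, domain), assumed non-overlapping
        self._rows = sorted(
            (ip_to_int(first), ip_to_int(last), domain)
            for first, last, domain in ranges
        )
        self._starts = [row[0] for row in self._rows]

    def lookup(self, ip, default="missing"):
        value = ip_to_int(ip)
        # Right-most range whose first address is <= value
        i = bisect.bisect_right(self._starts, value) - 1
        if i >= 0 and value <= self._rows[i][1]:
            return self._rows[i][2]
        return default


table = IpRangeTable([
    ("120.176.231.124", "120.176.231.180", "example-isp.net"),
    ("10.0.0.0", "10.0.0.255", "internal.example"),
])
print(table.lookup("120.176.231.150"))
print(table.lookup("120.176.231.200"))
```

Each lookup is a binary search over the range starts, so even a table with 20,000,000 records needs only about 25 comparisons per transaction.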

**Examples of rules and metrics based on domain names are**:

- domain extension (.com, .cc etc.)
- length of domain name

Author: Vincent Granville

Hidden decision trees (HDT) is a technique patented by Dr. Granville, to *score* large volumes of transaction data. It blends robust logistic regression with hundreds of small decision trees (each one representing for instance a specific type of fraudulent transaction) and offers significant advantages over both logistic regression and decision trees: robustness, ease of interpretation, and no tree pruning, no node splitting criteria. It makes this methodology powerful and easy to implement even for someone with no statistical background.

*Hidden Decision Trees* is a statistical and data mining methodology (just like logistic regression, SVM, neural networks or decision trees) to handle problems with large amounts of data, non-linearity and strongly correlated independent variables.

The technique is easy to implement in any programming language. It is more robust than decision trees or logistic regression, and helps detect natural final nodes. Implementations typically rely heavily on large, granular hash tables.

No decision tree is actually built (thus the name hidden decision trees), but the final output of a hidden decision tree procedure consists of a few hundred nodes from multiple non-overlapping small decision trees. Each of these parent (invisible) decision trees corresponds e.g. to a particular type of fraud, in fraud detection models. Interpretation is straightforward, in contrast with traditional decision trees.

The methodology was first invented in the context of credit card fraud detection, back in 2003. It is not implemented in any statistical package at this time. Frequently, hidden decision trees are combined with logistic regression in a hybrid scoring algorithm, where 80% of the transactions are scored via hidden decision trees, while the remaining 20% are scored using a compatible logistic regression type of scoring.
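
The text does not spell out how the nodes are built, so the following Python sketch is only one possible reading of the hash-table and hybrid-scoring idea, not the patented procedure. It assumes that a node is a combination of feature values, that a node is kept only when it has enough transactions (`MIN_SUPPORT`, a hypothetical threshold), and that everything else goes to a fallback scorer standing in for the logistic regression. All names and data are made up for illustration.

```python
from collections import defaultdict

MIN_SUPPORT = 3  # hypothetical: minimum transactions for a node to be kept

def build_nodes(transactions, feature_names):
    """Aggregate transactions in a hash table keyed by feature-value tuples."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_transactions, n_fraud]
    for row in transactions:
        key = tuple(row[name] for name in feature_names)
        counts[key][0] += 1
        counts[key][1] += row["fraud"]
    # Keep only well-populated nodes; store (size, observed fraud rate).
    return {k: (n, bad / n) for k, (n, bad) in counts.items() if n >= MIN_SUPPORT}

def score(row, nodes, feature_names, fallback):
    """Score via a retained node when possible, otherwise via the fallback model."""
    key = tuple(row[name] for name in feature_names)
    if key in nodes:
        return nodes[key][1], "node"
    return fallback(row), "fallback"

FEATURES = ["extension", "period"]  # hypothetical features
history = (
    [{"extension": ".cc", "period": "night", "fraud": 1}] * 3
    + [{"extension": ".cc", "period": "night", "fraud": 0}]
    + [{"extension": ".com", "period": "day", "fraud": 0}] * 3
    + [{"extension": ".com", "period": "day", "fraud": 1}]
    + [{"extension": ".org", "period": "day", "fraud": 0}]
)
nodes = build_nodes(history, FEATURES)

def fallback_model(row):
    """Placeholder for a logistic-regression score (assumption)."""
    return 0.5

print(score({"extension": ".cc", "period": "night"}, nodes, FEATURES, fallback_model))
print(score({"extension": ".org", "period": "day"}, nodes, FEATURES, fallback_model))
```

In this reading, the share of transactions that land in a retained node plays the role of the 80% mentioned above, and the rest is handled by the fallback.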

Hidden decision trees take advantage of the structure of large multivariate features typically observed when scoring a large number of transactions, e.g. for fraud detection. The technique is not connected with hidden Markov fields.

**Potential Applications**

- Fraud detection, spam detection
- Web analytics
- Keyword scoring/bidding (ad networks, paid search)
- Transaction scoring (click, impression, conversion, action)
- Click fraud detection
- Web site scoring, ad scoring, landing page / advertiser scoring
- Collective filtering (social network analytics)
- Relevancy algorithms

- Text mining
- Scoring and ranking algorithms
- Infringement detection
- User feedback: automated clustering

**Implementation**

The model presented here is used in the context of click scoring. The purpose is to create predictive scores, where *score* = f(*response*), that is, score is a function of the response. The response is sometimes referred to as the *dependent variable* in statistical and predictive models.

- Examples of Response:
  - Odds of converting (Internet traffic data – hard/soft conversions)
  - CR (conversion rate)
  - Probability that transaction is fraudulent
- Independent variables: called *features* or rules. They are highly correlated.

Author: Vincent Granville

*Originally posted on Analyticbridge, by Dr. Granville. Click here to read original article and comments.*

With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance, a data bucket can correspond to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of the best predictive set of metrics.

In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following

**Five conditions:**

- Independent of sample size to allow comparisons across buckets of different sizes: a correlation of (say) 0.6 must always correspond to (say) the 74-th percentile among all potential paired series (X, Y) of size n, regardless of n
- Same bounds as old-fashioned correlation for backward compatibility: it must be between -1 and +1, with -1 and +1 attained by extreme, singular data sets, and 0 meaning no correlation
- More general than traditional correlation: it measures the degree of monotonicity between two variables X and Y (does X grow when Y grows?) rather than linearity (Y = a + b*X + noise, with a, b chosen to minimize noise). Yet not as general as the distance correlation (equal to zero if and only if X and Y are independent) or my structuredness coefficient.
- Not sensitive to outliers, robust
- More intuitive, more compatible with the way the human brain perceives correlations
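
To make the difference between linearity and monotonicity concrete, the sketch below compares the traditional (Pearson) correlation with the classical Spearman rank correlation on a small made-up series containing one outlier. Spearman is used here only as a familiar stand-in that is bounded by -1 and +1, measures monotonicity and is less sensitive to outliers; it is not the metric proposed in this article, and it is not claimed to satisfy the first condition.

```python
from statistics import mean

def pearson(x, y):
    """Traditional linear correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: y always grows with x, but the last point is an outlier.
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 1000]

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

On this series the rank-based value reflects the perfectly monotonic relationship, while the Pearson value is pulled down by the single outlier.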

Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers the same drawbacks, and benefits from the same cures as traditional correlation. So we will focus here on the correlation.

To illustrate the first condition (dependence on n), let’s consider the following made-up data set with two paired variables or time series X, Y: …
