Category: Data Science Techniques
Migrating an Excel Spreadsheet to MySQL and to Spark 2.0.1 (Part 1)
Feed: Resources Discussions - Data Science Central. Author: Marc Borowczak. Moving legacy data to a modern big data platform can be daunting at times. It doesn’t have to be. In this short tutorial, we’ll briefly review an approach and demonstrate it on my preferred data set. This isn’t an ML repository or a Kaggle competition data set, simply the data I accumulated over decades to keep track of my plastic model collection, and as such it definitely meets the legacy standard! We’ll describe the steps followed on a laptop VirtualBox machine running Ubuntu 16.04.1 LTS Gnome. The following steps are then required: Import the .csv file into MySQL, ... Read More
Awesome Data Science Repository
Feed: Data Science Research Discussions - Data Science Central. Author: Leandro Guerra. I was surfing GitHub when I found this repository: Awesome Data Science. It has an extensive list of data science bloggers, MOOCs and the diamond: a list of 24 free dataset sources. Excellent to study and apply some data science techniques. Some highlights: MOOCs: Google Making Sense of Data Coursera Introduction to Data Science Data Science - 9 Steps Courses, A Specialization on Coursera Data Mining - 5 Steps Courses, A Specialization on Coursera CS 109 Data Science Schoolofdata OpenIntro Data science MOOC CS 171 Visualization Process ... Read More
Selection of best articles from our past weekly digests

Feed: Resources Discussions - Data Science Central. Author: Vincent Granville. The following is a selection of featured articles that were posted in our previous weekly digests: in short, the best of the best on DSC. Single-starred articles are written by external/guest bloggers. Older popular articles are being added regularly, so please check out this page once a week! Our upcoming book on data science 2.0 (or data science automation or data science handbook or the little data science book, not sure yet about the title) will be based on some of these (edited and revised) articles: these articles are double-starred ... Read More
38 Seminal Articles Every Data Scientist Should Read

Feed: Resources Discussions - Data Science Central. Author: Vincent Granville. Here is a selection containing both external and internal papers, focusing on various technical aspects of data science and big data. Feel free to add your favorites. Complex Open Text Analysis (Source: Avinash Kaushik). External Papers: Bigtable: A Distributed Storage System for Structured Data; A Few Useful Things to Know about Machine Learning; Random Forests; A Relational Model of Data for Large Shared Data Banks; Map-Reduce for Machine Learning on Multicore; Pasting Small Votes for Classification in Large Databases and On-Line; Recommendations Item-to-Item Collaborative Filtering; Recursive Deep Models for Semantic Compositionality ... Read More
Black-box Confidence Intervals: Excel and Perl Implementation

Feed: Data Science Research Discussions - Data Science Central. Author: Vincent Granville. Originally posted here. Check original article for most recent updates. Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation in both Excel and Perl, and discuss our popular model-free confidence interval technique introduced in our original Analyticbridge article. This technique has the following advantages: Very easy to understand by non-statisticians (business analysts, software engineers, programmers, data architects); Simple (if not basic) to code; no need to use tables of Gaussian, Student or ... Read More
How to detect spurious correlations, and how to find the real ones
Feed: Data Science Research Discussions - Data Science Central. Author: Vincent Granville. Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Specifically designed in the context of big data in our research lab, the new and simple strong correlation synthetic metric proposed in this article should be used whenever you want to check if there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now, to avoid being accused of reckless data science and even being sued for wrongful analytic practice. In this paper, the traditional correlation is referred to ... Read More
Practical illustration of Map-Reduce (Hadoop-style), on real data
Feed: Data Science Research Discussions - Data Science Central. Author: Vincent Granville. Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. Here I will discuss a general framework to process web traffic data. The concept of Map-Reduce will be naturally introduced. Let's say you want to design a system to score Internet clicks, to measure the chance for a click to convert, or the chance to be fraudulent or un-billable. The data comes from a publisher or ad network; it could be Google. Conversion data is limited and poor (some conversions are tracked, some are not; some ... Read More
Jackknife logistic and linear regression for clustering and predictions
Feed: Data Science Research Discussions - Data Science Central. Author: Vincent Granville. Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables. Our goal is to produce a regression tool that can be used as a black box, is very robust and parameter-free, and is usable and easy to interpret by non-statisticians. It is part of a ... Read More
A synthetic variance designed for Hadoop and big data
Feed: Data Science Research Discussions - Data Science Central. Author: Vincent Granville. Originally posted on Hadoop360, by Dr. Granville. Click here to read original article and comments. The new variance introduced in this article fixes two big data problems associated with the traditional variance and the way it is computed in Hadoop, using a numerically unstable formula. Synthetic Metrics: This new metric is synthetic: it was not derived naturally from mathematics like the variance taught in any statistics 101 course, or the variance currently implemented in Hadoop (see above picture). By synthetic, I mean that it was built to address issues with big data (outliers) and the ... Read More
Fast Combinatorial Feature Selection with New Definition of Predictive Power
Feed: Data Science Research Discussions - Data Science Central. Author: Vincent Granville. Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments. In this article, I propose a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see a Hadoop methodology for the technology proposed here. It can easily be implemented in a Map-Reduce ... Read More