Practical Machine Learning with R and Python – Part 6
Feed: R-bloggers.
Author: Tinniam V Ganesh.
This is the final and concluding part of my series on ‘Practical Machine Learning with R and Python’. In this series I included implementations of the most common Machine Learning algorithms in R and Python. The algorithms implemented were:
1. Practical Machine Learning with R and Python – Part 1: In this initial post, I touch upon regression of a continuous target variable. Specifically, I cover Univariate, Multivariate and Polynomial regression, as well as KNN regression, in both R and Python.
2. Practical Machine Learning with R and Python – Part 2: In this post, I discuss Logistic Regression, KNN classification and Cross Validation error for both LOOCV and K-Fold, in both R and Python.
3. Practical Machine Learning with R and Python – Part 3: This third part covers feature selection in Machine Learning. Specifically, I touch upon best fit, forward fit, backward fit, ridge (L2 regularization) and lasso (L1 regularization). The post includes equivalent code in R and Python.
4. Practical Machine Learning with R and Python – Part 4: In this part I discuss SVMs, Decision Trees, Validation, Precision-Recall, AUC and ROC curves.
5. Practical Machine Learning with R and Python – Part 5: In this penultimate part, I touch upon B-splines, natural splines, smoothing splines, Generalized Additive Models (GAMs), Decision Trees, Random Forests and Gradient Boosted Trees.
In this last part I cover Unsupervised Learning. Specifically, I cover implementations of Principal Component Analysis (PCA), K-Means and Hierarchical Clustering. You can download this R Markdown file from Github at MachineLearning-RandPython-Part6
1.1a Principal Component Analysis (PCA) – R code
Principal Component Analysis is used to reduce the dimensionality of the input. In the code below, the 8 x 8 pixel images of handwritten digits are reduced to their principal components. A scatter plot of the first 2 principal components then gives a very good visual representation of the data.
library(dplyr)
library(ggplot2)
# Read the 8 x 8 pixel handwritten digits dataset
digits <- read.csv("digits.csv")
# The first column holds the digit class
digitClasses <- factor(digits$X0.000000000000000000e.00.29)
# Compute the principal components of the 64 pixel columns
digitsPCA <- prcomp(digits[, 1:64])
df <- data.frame(digitsPCA$x)
df1 <- cbind(df, digitClasses)
# Scatter plot of the first 2 principal components, colored by digit class
# (plot call reconstructed for completeness; exact aesthetics are assumed)
ggplot(df1, aes(x = PC1, y = PC2, col = digitClasses)) + geom_point()
1.1b Variance explained vs. no. of principal components – R code
In the code below, the variance explained is plotted against the number of principal components. It can be seen that with 20 principal components, almost 90% of the variance is explained by this reduced-dimensional model.
digits <- read.csv("digits.csv")
digitClasses <- factor(digits$X0.000000000000000000e.00.29)
digitsPCA <- prcomp(digits[, 1:64])
sd <- digitsPCA$sdev
# Variance explained by each principal component
digitsVar <- digitsPCA$sdev^2
percentVarExp <- digitsVar / sum(digitsVar)
# Cumulative % variance explained vs number of principal components
# (plot call reconstructed for completeness; exact styling is assumed)
plot(cumsum(percentVarExp) * 100, type = "b",
     xlab = "No of principal components", ylab = "% variance explained")
1.1c Principal Component Analysis (PCA) – Python code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load the 8 x 8 pixel handwritten digits and project them onto the first 2 principal components
digits = load_digits()
pca = PCA(2)
projected = pca.fit_transform(digits.data)
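The first two principal components can then be shown as a scatter plot colored by the digit label, mirroring the R plot above. The lines below are a minimal matplotlib sketch and are not part of the original listing:
# Sketch: scatter plot of the first 2 principal components, colored by digit class
plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, cmap='viridis', s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.colorbar(label='digit')
plt.show()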
1.1d Variance explained vs. no. of principal components – Python code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Fit PCA with all 64 components and compute the cumulative % variance explained
digits = load_digits()
pca = PCA(64)
projected = pca.fit_transform(digits.data)
varianceExp = pca.explained_variance_ratio_
totVarExp = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4) * 100)
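Plotting totVarExp against the number of components shows how quickly the explained variance saturates (roughly 90% with about 20 components, as noted above). The following is a minimal matplotlib sketch and not part of the original listing:
# Sketch: cumulative % variance explained vs number of principal components
plt.plot(totVarExp)
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative % variance explained')
plt.show()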
1.2a K-Means – R code
In the code below, the first 2 principal components of the handwritten digits are first shown as a scatter plot. The 10 centroids of the 10 clusters, corresponding to the 10 different digits, are then plotted over this scatter plot.
library(ggplot2)
digits <- read.csv("digits.csv")
digitClasses <- factor(digits$X0.000000000000000000e.00.29)
digitsPCA <- prcomp(digits[, 1:64])
df <- data.frame(digitsPCA$x)
df1 <- cbind(df, digitClasses)
# K-Means with 10 clusters on the first 2 principal components
a <- df[, 1:2]
k <- kmeans(a, 10, 1000)
# Centroids of the 10 clusters
df2 <- data.frame(k$centers)
# Overlay the centroids on the scatter plot of the first 2 principal components
# (plot call reconstructed for completeness; exact aesthetics are assumed)
ggplot(df1, aes(x = PC1, y = PC2, col = digitClasses)) + geom_point() +
  geom_point(data = df2, aes(x = PC1, y = PC2), inherit.aes = FALSE, size = 4)
1.2b K-Means – Python code
The centroids of the 10 different handwritten digits are plotted over the scatter plot of the first 2 principal components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

# Project the digits onto the first 2 principal components
digits = load_digits()
pca = PCA(2)
projected = pca.fit_transform(digits.data)
# Fit K-Means with 10 clusters and retrieve the cluster centroids
kmeans = KMeans(n_clusters=10)
kmeans.fit(projected)
y_kmeans = kmeans.predict(projected)
centers = kmeans.cluster_centers_
centers
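The cluster centroids can then be overlaid on the scatter plot of the projected points. The following lines are a minimal matplotlib sketch and not part of the original listing:
# Sketch: projected points colored by cluster, with the 10 centroids overlaid in black
plt.scatter(projected[:, 0], projected[:, 1], c=y_kmeans, cmap='viridis', s=10)
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.6)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()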
1.3a Hierarchical clustering – R code
Hierarchical clustering is another type of unsupervised learning. It successively joins the closest pair of objects (points or clusters) based on some ‘distance’ metric. In this type of clustering we do not have to choose the number of centroids up front; we can cut the resulting dendrogram at an appropriate height to get a desired and reasonable number of clusters. The following linkage methods determine the ‘distance’ used when combining successive objects:
- Ward
- Complete
- Single
- Average
- Centroid
iris <- datasets::iris
iris2 <- iris[, -5]
species <- iris[, 5]
# Distance matrix and average-linkage hierarchical clustering
d_iris <- dist(iris2)
hc_iris <- hclust(d_iris, method = "average")
# Cut the dendrogram into 3 clusters
sub_grp <- cutree(hc_iris, k = 3)
table(sub_grp)
## sub_grp
## 1 2 3
## 50 64 36
1.3b Hierarchical clustering – Python code
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Average-linkage hierarchical clustering of the iris data
iris = load_iris()
Z = linkage(iris.data, 'average')
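The linkage matrix Z can be visualized as a dendrogram and, analogous to cutree in R, cut into a chosen number of clusters with scipy's fcluster. The lines below are a minimal sketch and not part of the original listing:
# Sketch: plot the dendrogram and cut it into 3 clusters (analogous to cutree in R)
from scipy.cluster.hierarchy import fcluster
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.show()
sub_grp = fcluster(Z, t=3, criterion='maxclust')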