## Posts by chris2016

#### Running UMAP for data visualisation in R

Feed: R-bloggers. Author: chris2016. UMAP is a non-linear dimensionality reduction algorithm in the same family as t-SNE. In the first phase of UMAP a weighted k-nearest-neighbour graph is computed; in the second, a low-dimensional layout of this graph is calculated. The embedded data points can then be visualised in the new space and compared with other variables of interest. It can be used for the analysis of many types of data, including single-cell RNA-seq and cancer omic data. One easy way to run UMAP on your data and visualise the results is to make a ... Read More
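
The two phases described above can be illustrated in miniature. The post works in R; purely as an illustration, here is a toy Python sketch of phase one, building a weighted k-nearest-neighbour graph. The `exp(-distance)` weighting is a simplification: real UMAP uses an adaptive, per-point scheme.

```python
import math

def knn_graph(points, k=3):
    """Toy sketch of UMAP's first phase: for each point, keep its k nearest
    neighbours (Euclidean) with a weight that decays with distance.
    (Real UMAP uses an adaptive, per-point weighting; exp(-d) is a stand-in.)"""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    graph = {}
    for i, p in enumerate(points):
        nearest = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        graph[i] = [(j, math.exp(-d)) for d, j in nearest]
    return graph

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(knn_graph(pts, k=2)[0])  # point 0's neighbours: points 1 and 2
```

Phase two, which UMAP itself handles, would then optimise a 2D layout whose pairwise similarities approximate this graph.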

#### Quick and easy t-SNE analysis in R

Feed: R-bloggers. Author: chris2016. t-SNE is a useful dimensionality reduction method that allows you to visualise data embedded in a lower number of dimensions, e.g. 2, in order to see patterns and trends in the data. It can deal with more complex patterns of Gaussian clusters in multidimensional space than PCA can. It is not suited to finding outliers, however, because the arrangement of the samples does not directly represent distance, as it does in PCA. An easy way to run t-SNE on your data is to use a pre-made wrapper function that uses the Rtsne package and ggplot2. Like the one that ... Read More

#### Easy quick PCA analysis in R

Feed: R-bloggers. Author: chris2016. Principal component analysis (PCA) is very useful for doing some basic quality control (e.g. looking for batch effects) and assessing how the data are distributed (e.g. finding outliers). A straightforward way is to make your own wrapper function for prcomp and ggplot2; another is to use the one that comes with M3C (https://bioconductor.org/packages/devel/bioc/html/M3C.html) or another package. M3C is a consensus clustering tool that makes some major modifications to the Monti et al. (2003) algorithm so that it behaves better, and it also provides functions for data visualisation. Let’s have a go on an example cancer ... Read More
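
As a reminder of what prcomp computes under the hood (the post itself uses R wrappers around prcomp and ggplot2), here is a minimal, purely illustrative Python sketch that recovers the first principal component of a two-feature dataset from the closed-form eigendecomposition of its 2×2 covariance matrix:

```python
import math
import random

def first_pc(points):
    """Return (unit eigenvector, eigenvalue) for the largest eigenvalue of the
    2x2 covariance matrix of (x, y) points: the first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # sample covariance matrix entries (centred data)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam = tr / 2 + math.sqrt(max(tr * tr / 4 - det, 0.0))  # largest eigenvalue
    # corresponding eigenvector of [[sxx, sxy], [sxy, syy]]
    vx, vy = ((lam - syy, sxy) if abs(sxy) > 1e-12
              else ((1.0, 0.0) if sxx >= syy else (0.0, 1.0)))
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam

random.seed(0)
ts = [random.uniform(-1, 1) for _ in range(200)]
pts = [(t, 2 * t + random.gauss(0, 0.1)) for t in ts]
pc, var = first_pc(pts)
print(pc)  # close to (1, 2)/sqrt(5), the direction of maximum variance
```

For real data you would of course stick with prcomp (which also centres, optionally scales, and returns all components); this only shows the linear-algebra core.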

#### Using clusterlab to benchmark clustering algorithms

Feed: R-bloggers. Author: chris2016. Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (datasets with more than one cluster) and negative controls (datasets with a single cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know about ones that behave strangely. I’ve found in many of my own experiments that the algorithms many people are using are not necessarily the ones that provide the most sensible results. I can give a good example below. I was interested ... Read More

#### Part 5: Code corrections to optimism corrected bootstrapping series

Feed: R-bloggers. Author: chris2016. The truth is out there, R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) with increasing p (completely random features). Two implementations of the method were given: the first has a slight error, while the second seems fine. The trend is still the same with the corrected code; the problem with my code was that I did not set ‘replace=TRUE’ in the call to the ‘sample’ function. Thanks to ECOQUANT ... Read More
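
The ‘replace=TRUE’ point matters because a bootstrap resample must draw n items with replacement; without it, R’s sample() of n items just permutes the data. A small sketch of the difference (in Python rather than the post’s R):

```python
import random

random.seed(1)
data = list(range(1000))

# Bootstrap resample: n draws WITH replacement (R: sample(x, replace=TRUE)).
boot = random.choices(data, k=len(data))

# WITHOUT replacement (R's sample() default), n draws are just a permutation.
perm = random.sample(data, k=len(data))

print(len(set(boot)) / len(data))  # ~0.632 unique: expected fraction is 1 - 1/e
print(len(set(perm)) / len(data))  # exactly 1.0 -- every item appears once
```

A resample with no duplicates trains on (a reordering of) the whole dataset, which defeats the purpose of the bootstrap.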

#### Part 4: Why does bias occur in optimism corrected bootstrapping?

Feed: R-bloggers. Author: chris2016. In the previous parts of the series we demonstrated a positive results bias in optimism corrected bootstrapping simply by adding random features to our labels. This problem is due to an ‘information leak’ in the algorithm, meaning the training and test datasets are not kept separate when estimating the optimism. Because of this, the optimism can, under some conditions, be severely underestimated. Let’s analyse the code; it is pretty straightforward to understand, and then we can see where the problem originates. Fit a model M to entire data S and estimate predictive ability C. ## this ... Read More
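
The procedure being analysed runs: (1) compute the apparent performance C_app by fitting and testing on all of S; (2) for each bootstrap resample, fit a model and record the gap between its performance on the resample and on the original data; (3) subtract the mean gap (the optimism) from C_app. As a purely illustrative stand-in for the post’s caret models, this Python sketch uses a toy nearest-centroid classifier on noise data; note how step 2 evaluates on the full original data, which overlaps the resample — the training/test overlap discussed above:

```python
import random
from statistics import mean

random.seed(42)

# Toy data: n samples, p pure-noise features, random binary labels.
n, p = 40, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.randint(0, 1) for _ in range(n)]

def fit(X, y):
    """'Model' = per-class feature centroids (a toy stand-in for caret models)."""
    return {c: [mean(col) for col in zip(*[x for x, lab in zip(X, y) if lab == c])]
            for c in (0, 1)}

def accuracy(model, X, y):
    def predict(x):
        return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))
    return mean(predict(x) == lab for x, lab in zip(X, y))

# 1) Apparent performance: fit and test on the SAME data (overfitted).
c_app = accuracy(fit(X, y), X, y)

# 2) Optimism from B bootstrap resamples (drawn with replacement).
B, optimism = 30, []
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]
    Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
    m = fit(Xb, yb)
    # in-sample score on the resample vs score on ALL of the original data --
    # the original data overlaps the resample, so these sets are not separate
    optimism.append(accuracy(m, Xb, yb) - accuracy(m, X, y))

# 3) Corrected estimate = apparent performance minus mean optimism.
c_corr = c_app - mean(optimism)
print(round(c_app, 2), round(c_corr, 2))
```

The structure mirrors the steps quoted above; the model, data sizes, and metric here are invented for the demonstration, not taken from the post.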

#### Part 3: Two more implementations of optimism corrected bootstrapping show shocking bias

Feed: R-bloggers. Author: chris2016. Welcome to part III of debunking the optimism corrected bootstrap in high dimensions (a quite high number of features) over the Christmas holidays. Previously we saw, with a reproducible code implementation, that this method is very biased when we have many features (50-100 or more). I suggest avoiding this method until it has at some point been reassessed thoroughly to find out how bad the situation is with different numbers of dimensions. Yes, I know this is the favourite method of some statisticians, who tell people how other methods are lacking in statistical power, but clearly this ... Read More

#### Part 2: Optimism corrected bootstrapping is definitely biased, further evidence

Feed: R-bloggers. Author: chris2016. Some people are very fond of the technique known as ‘optimism corrected bootstrapping’; however, this method is biased, and this becomes apparent as we increase the number of noise features to high numbers (as shown very clearly in my previous blog post). This needs exposing; I have neither the time nor the interest to do a publication on it, hence this article. Now, I have reproduced the bias with my own code. Let’s first just show the plot from last time again to recap: as the number of features that are just noise increases, this method gives ... Read More

#### Optimism corrected bootstrapping: a problematic method

Feed: R-bloggers. Author: chris2016. There are lots of ways to assess how predictive a model is while correcting for overfitting. In Caret the main methods I use are leave-one-out cross validation, for when we have relatively few samples, and k-fold cross validation when we have more. There is also another method called ‘optimism corrected bootstrapping’, which attempts to save statistical power by first getting the overfitted result, then trying to correct it by bootstrapping the data to estimate the degree of optimism. A few simple tests in Caret can demonstrate this method is bunk. This is ... Read More
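
For contrast with the bootstrap approach, k-fold cross validation keeps each test fold strictly disjoint from its training data (leave-one-out is just the k = n case). A minimal illustrative splitter in Python — a hypothetical helper, not Caret’s API:

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.
    Each sample lands in exactly one test fold; train and test never overlap."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal disjoint folds
    for i, test in enumerate(folds):
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, test

splits = list(kfold_indices(20, k=5))
print([len(test) for _, test in splits])  # [4, 4, 4, 4, 4]
```

Because no test sample is ever seen during the corresponding fit, there is no optimism term to estimate in the first place.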

#### Simulating NXN dimensional Gaussian clusters in R

Feed: R-bloggers. Author: chris2016. Gaussian clusters are found in a range of fields, and simulating them is important, as often we will want to test a given class discovery tool’s performance under conditions where the ground truth is known (e.g. K=6). However, a flexible simulator for Gaussian clusters with defined variance, spacing, and size did not exist. This is why we made ‘clusterlab’, a new CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html). It was initially based on the Scikit-Learn make_blobs function, but is now much more sophisticated. Clusterlab works in 2D space initially, because it is easy to work in mathematically ... Read More
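
The 2D idea can be sketched in a few lines: place K cluster centres evenly on a circle (the radius controls spacing), then draw Gaussian points around each centre (the standard deviation controls variance). This stdlib-Python toy is only an illustration of the concept; the function and parameter names are invented, not clusterlab’s actual API:

```python
import math
import random

random.seed(7)

def simulate_clusters(k=4, n_per=50, radius=8.0, sd=1.0):
    """Simulate k 2D Gaussian clusters with centres evenly spaced on a circle.
    radius sets cluster spacing, sd sets within-cluster variance.
    Returns (points, labels) with known ground-truth K = k."""
    points, labels = [], []
    for c in range(k):
        angle = 2 * math.pi * c / k
        cx, cy = radius * math.cos(angle), radius * math.sin(angle)
        for _ in range(n_per):
            points.append((random.gauss(cx, sd), random.gauss(cy, sd)))
            labels.append(c)
    return points, labels

pts, labs = simulate_clusters()
print(len(pts))  # 200 points, ground-truth K = 4
```

With the ground truth in hand, a clustering algorithm can then be scored on whether it recovers K and the labels — the positive/negative-control idea from the clusterlab post above.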
