#### Setting up RStudio Server on a Cloud for Collaboration and Reproducibility

Feed: R-bloggers. Author: R Views. Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn. When setting up R and RStudio Server on a cloud Linux instance, some thought should be given to implementing a workflow that facilitates collaboration and ensures R project reproducibility. There are many possible workflows to accomplish this. In this post, we offer an "opinionated" solution based on what we have found to work in a production environment. We assume all development takes place on an RStudio Server cloud Linux instance, ensuring that only one operating system needs to be supported.

#### Vectorizing functions in R is easy

Feed: R-bloggers. Author: Roman Luštrik. Imagine you have a function that only takes one argument, but you would really like to work on a vector of values. Here is a short example of how the function Vectorize() can accomplish this. Let's say we have a data.frame xy:

    xy <- data.frame(sample = c("C_pre_sample1", "C_post_sample1", "T_pre_sample2",
                                "T_post_sample2", "NA_pre_sample1"),
                     value = runif(5))
    #           sample     value
    # 1  C_pre_sample1 0.3048032
    # 2 C_post_sample1 0.3487163
    # 3  T_pre_sample2 0.3359707
    # 4 T_post_sample2 0.6698358
    # 5 NA_pre_sample1 0.9490707

and you want to subset only samples that start with C_pre or T_pre. Of course you can construct a nice regular expression, implement
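The idea described above can be sketched as follows. This is a minimal illustration, not the author's code: the helper name `starts_with_prefix` is hypothetical, and the `value` column uses fixed numbers instead of `runif(5)` so that the result is reproducible.

```r
# Hypothetical scalar helper: TRUE if a single string starts with
# "C_pre" or "T_pre". `if` expects a length-one condition, so this
# function is not meant to be called on a whole vector directly.
starts_with_prefix <- function(x) {
  if (startsWith(x, "C_pre") || startsWith(x, "T_pre")) TRUE else FALSE
}

# Vectorize() wraps the helper with mapply(), so it is called once per element.
v_starts_with_prefix <- Vectorize(starts_with_prefix, USE.NAMES = FALSE)

xy <- data.frame(
  sample = c("C_pre_sample1", "C_post_sample1", "T_pre_sample2",
             "T_post_sample2", "NA_pre_sample1"),
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),  # fixed values for illustration
  stringsAsFactors = FALSE
)

# Logical index, one entry per row of xy.
keep <- v_starts_with_prefix(xy$sample)
subset_xy <- xy[keep, ]
print(subset_xy)
```

Only the rows whose `sample` begins with one of the two prefixes are kept. `USE.NAMES = FALSE` is a design choice here: without it, the result would carry the input strings as names, which is harmless for indexing but clutters the output.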

#### Free Python Data Science coding Book series

Feed: Featured Blog Posts - Data Science Central. Author: ajit jaokar. In this post, I explain how you can participate further in the free book series that we are launching based on our early experiences, and which useful resources we recommend, based on our experience, for learning coding for data science (using Python, TensorFlow and Keras). To provide some context, I posted about the idea of learning coding for machine learning / deep learning in a weekend. We have had considerable success with this, and now we are planning the next stage. To participate in this book series and

#### Create your own version of Anscombe’s quartet: Dissimilar data that have similar statistics

Feed: SAS Blogs. Author: Rick Wicklin. I think every course in exploratory data analysis should begin by studying Anscombe's quartet. Anscombe's quartet is a set of four data sets (N=11) that have nearly identical descriptive statistics but different graphical properties. They are a great reminder of why you should graph your data. You can read about Anscombe's quartet on Wikipedia, or check out a quick visual summary by my colleague Robert Allison. Anscombe's first two examples are shown below: The Wikipedia article states, "It is not known how Anscombe created his data sets." Really? Creating different data sets that have
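For readers who want to check the "nearly identical descriptive statistics" claim themselves, here is a minimal sketch. It is written in R rather than SAS (an assumption of this digest, not the original post) and relies on the `anscombe` data frame that ships with R's `datasets` package.

```r
# The built-in `anscombe` data frame has columns x1..x4 and y1..y4,
# one (x, y) pair per data set, with 11 rows.
data(anscombe)

# One row of summary statistics per data set.
summary_stats <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # least-squares regression line
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor_xy = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
}))

print(round(summary_stats, 2))
```

The four rows of the printed table are nearly identical, even though scatter plots of the four data sets look very different.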

#### Bayes vs. the Invaders! Part Three: The Parallax View

Feed: R-bloggers. Author: tealeaf. In the previous post of this series unveiling the relationship between UFO sightings and population, we crossed the threshold of normality underpinning linear models to construct a generalised linear model based on the more theoretically satisfying Poisson distribution. On inspection, however, this model revealed itself to be less well suited to the data than we had, in our tragic ignorance, hoped. While it appeared, on visual inspection, to capture some features of the data, the predictive posterior density plot demonstrated that it still fell short of addressing the subtleties of the original. In this post, we
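As a rough illustration of the kind of model mentioned above, here is a minimal sketch of a Poisson generalised linear model fitted with base R's `glm()`. The data are simulated and the variable names (`population`, `sightings`) are placeholders. This is not the authors' model: the series appears to take a Bayesian approach, whereas this sketch uses plain maximum likelihood.

```r
set.seed(42)

# Simulated stand-in data: counts whose log-mean is linear in log-population.
n <- 200
population <- runif(n, min = 1, max = 10)   # hypothetical units
true_rate  <- exp(0.5 + 0.8 * log(population))
sightings  <- rpois(n, lambda = true_rate)

# Poisson GLM with the canonical log link.
fit <- glm(sightings ~ log(population), family = poisson(link = "log"))
print(coef(fit))

# Crude overdispersion check: for a well-specified Poisson model, the ratio of
# residual deviance to residual degrees of freedom should be close to 1.
dispersion <- deviance(fit) / df.residual(fit)
print(dispersion)
```

On real count data, a dispersion ratio well above 1 is one common sign that the Poisson assumption is too restrictive.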

#### Writing a letter to DataCamp

Feed: R-bloggers. Author: Rstats on Julia Silge. Since 2017 I have been an instructor for DataCamp, the VC-backed online data science education platform. What this means is that I am not an employee, but I have developed content for the company as a contractor. I have two courses there, one on text mining and one on practical supervised machine learning. About two weeks ago, DataCamp published a blog post outlining an incident of sexual misconduct at the company. The post was published one day after a group of over 100 instructors sent a letter to DataCamp saying that the way

#### Customize Your Interactive EDA: Explore the Fuel Economy of the U.S. Car Market

Feed: R-bloggers. Author: An Accounting and Data Science Nerd's Corner. Interactive EDA is nice but customized interactive EDA is even nicer. To celebrate the new CRAN version of my 'ExPanDaR' package I prepare a customized variant of 'ExPanD' to explore the U.S. EPA data on fuel economy. Our objective is to develop an interactive display that guides the reader on how to explore the fuel economy data in an intuitive way. First, let's load the packages and the data from EPA's web page. In addition, I prepared a small data set containing the countries of domicile for the car producers

#### Bioconductor S4 classes for high-throughput omics data

Feed: R-bloggers. Author: Dror Berel's R Blog. Motivation: Multi-omics data integration and analysis. What a beast! It is one of the major challenges in the era of personalized/precision medicine (or whatever you want to call it). Definitely mine, as someone who is expected to grab such messy data from multiple sources, align and annotate it all together, and then test some interesting hypotheses with it. If you thought analysis of a single omic (microarray, RNAseq etc.) is overwhelming, what about integration of multiple single-omic data? Multi-omics data adds another layer of priceless high-throughput biologic

#### A Detailed Guide to Plotting Line Graphs in R using ggplot geom_line

Feed: R-bloggers. Author: Michael Toth. When it comes to data visualization, it can be fun to think of all the flashy and exciting ways to display a dataset. But if you're trying to convey information, flashy isn't always the way to go. In fact, one of the most powerful ways to communicate the relationship between two variables is the simple line graph. A line graph is a type of graph that displays information as a series of data points connected by straight line segments. Figure: The price of Netflix stock (NFLX) displayed as a line graph. Figure: Line graph of average monthly

#### Two interesting facts about high-dimensional random projections

Feed: R-bloggers. Author: kjytay. John Cook recently wrote an interesting blog post on random vectors and random projections. In the post, he states two surprising facts of high-dimensional geometry and gives some intuition for the second fact. In this post, I will provide R code to demonstrate both of them. Fact 1: Two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal. Cook does not discuss this fact as it is "well known". Let me demonstrate it empirically. Below, the first function generates a unit vector of a given dimension uniformly at random. The second function takes in
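Fact 1 can be checked empirically with a minimal sketch like the one below. This is not the author's code: the function names and the dimension `p = 10000` are illustrative assumptions. It uses the standard approach of normalising a vector of independent standard normal draws, which gives a direction uniformly distributed on the unit sphere.

```r
set.seed(1)

# Draw a unit vector uniformly at random from the sphere in p dimensions
# by normalising a vector of independent N(0, 1) draws.
random_unit_vector <- function(p) {
  v <- rnorm(p)
  v / sqrt(sum(v^2))
}

# Cosine of the angle between two independent random unit vectors;
# for unit vectors this is simply the dot product.
random_cosine <- function(p) {
  sum(random_unit_vector(p) * random_unit_vector(p))
}

p <- 10000                                  # illustrative dimension
cosines <- replicate(500, random_cosine(p))

# The cosines concentrate around 0 with a spread of order 1 / sqrt(p),
# i.e. the angles concentrate around 90 degrees.
print(summary(cosines))
print(sd(cosines))
```

Rerunning with a smaller `p` (for example 10) shows a much wider spread of cosines, which is why the near-orthogonality is a high-dimensional effect.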

