- Obstacles to performance in parallel programming
Making your code run faster is often the primary goal of using parallel programming techniques in R, but sometimes the effort of converting your code to a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:
- Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks;
- Load balance, where the computing resources aren’t contributing equally to the problem;
- Impacts from use of RAM and virtual memory, such as cache misses and page faults;
- Network effects, such as latency and bandwidth limits, which add to communication overhead and degrade performance;
- Interprocess conflicts and thread scheduling;
- Data access and other I/O considerations.
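As a rough illustration of the first point, here is a minimal sketch using base R's parallel package; the cluster size, vector length, and chunk count are arbitrary choices for demonstration, not recommendations:

```r
# Minimal sketch of why fine-grained parallelism suffers from
# communication overhead: each task ships its input to a worker and its
# result back, so thousands of tiny tasks can spend more time on
# messaging than on actual computation. Chunking amortizes that cost.
library(parallel)

cl <- makeCluster(2)   # two worker processes
x  <- 1:10000

# Fine-grained: one tiny task per element of x.
fine_res <- parLapply(cl, x, function(i) i^2)

# Coarse-grained: split x into a handful of large chunks instead.
chunks     <- split(x, cut(seq_along(x), 4, labels = FALSE))
coarse_res <- parLapply(cl, chunks, function(ch) ch^2)

stopCluster(cl)

# Both approaches give the same answer; on most machines the chunked
# version is much faster because far fewer messages cross process
# boundaries.
identical(unlist(fine_res), unlist(coarse_res, use.names = FALSE))
```

For the load-balance issue in the second bullet, the same package provides parLapplyLB(), which hands tasks to workers dynamically as each one finishes rather than pre-assigning equal shares.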
The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It’s also worth checking out Norm Matloff’s keynote from the useR!2017 conference, embedded below.
Norm Matloff: Understanding overhead issues in parallel computation