March 19, 2017
0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:
- One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
- Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
- A third way is needed.
Cloudera’s idea for a third way is:
- You don’t run anything on your desktop/laptop machine except a browser.
- The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
- The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.
In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.
1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:
- Hadoop worker nodes.
- Hadoop master nodes.
- Nodes that run Cloudera Manager.
The fourth kind are edge/gateway nodes. Those handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.
2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.
And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place for you to use other tools your personally like and bring in.
Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.
3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.
- As noted above, a lot of the hassle of Hadoop-based data science relates to security.
- As also noted above, evading the hassle by extracting data is a huge security risk. (If you lose customer data, you’re going to have a very, very bad day.)
- According to Matt, standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is. Cloudera’s otherwise similar notebook-style interface evidently avoids that flaw. (Presumably, it you want to see the output, you rerun the script against the data store yourself.)
4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:
- Sufficiently good quantitative skills to do the analysis.
- Sufficiently good computer skills to do SQL queries and so on, but not a lot more than that.
Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.
5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.
6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.