June 14, 2017
I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:
- Provide all the important advantages of on-premises Cloudera.
- Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce, or at least come sufficiently close to that goal.
- Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.
In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.
*Or, if you prefer, improving on early versions of the port.
Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:
- Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
- One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
- Cloudera Director was Cloudera’s first foray into cloud-specific management.
- Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
- Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
- Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
- Price. When price is the major determinant, Cloudera is sad.
- Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.
Of course, “core” kinds of considerations are present to some extent too, including:
- Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
- Interacting with S3 data stores.
- Spinning instances up and down.
- Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.
Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:
- Cloudera Altus doesn’t manage data for you.
- Rather, it launches and manages jobs on a separate Hadoop cluster.
Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.
Where things get a bit more complicated is some features for workload analysis.
- Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
- Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.
The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.
So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.