Category: Cloudera
New in Cloudera Enterprise 5.8: SQL Editor and Other Productivity Improvements – Cloudera Engineering Blog

Cloudera Enterprise 5.8 includes the latest release of Hue (3.10), the web UI that makes Apache Hadoop easier to use. As part of Cloudera’s continuing investments in user experience and productivity, this release of Hue makes several common tasks much easier. In the remainder of this post, we’ll provide a summary of the main improvements. (Hue 3.10 is also available for a quick try in one click on demo.gethue.com.)

New SQL Editor

Hue’s new code editor is a single-page app that is much simpler to use than the previous editor. (Although the editor currently focuses ... Read More
Resolving Lock Contention in Apache Solr: A Performance-Analysis Detective Story – Cloudera Engineering Blog

This case study is an instructive example of how performance analysis is a multi-faceted process that often leads one in surprising directions. Apache Solr Near Real Time (NRT) Search allows Solr users to search documents indexed just seconds ago. It’s a critical feature in many real-time analytics applications. As Solr indexes more and more documents in near real time, end-user expectations for performance get higher and higher. Recently, however, the Cloudera Search team found that Solr NRT indexing throughput often hit a bottleneck even when plenty of CPU, disk, and network resources were available. Latency was average, in the ... Read More
Analytics and BI on Amazon S3 with Apache Impala (Incubating) – Cloudera Engineering Blog

Thanks to new optimizations for running Impala on Amazon S3, doubling cluster size on AWS doubles multi-user performance while keeping total workload cost roughly the same. With public-cloud deployments becoming increasingly popular, Cloudera is continuing to build out the capabilities of its platform to best take advantage of the cost-effective and flexible nature of the cloud. The current release of Cloudera’s platform (5.8) includes a major step forward in that area with Impala 2.6 able to store and query data directly from the Amazon S3 object store. By decoupling data and compute, Impala enables high-performance analytics across heterogeneous data stores at ... Read More
Securing Apache Spark Shuffle using Apache Commons Crypto – Cloudera Engineering Blog

Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the current approach. When running a big data computing job, the data being processed may contain sensitive information that users don’t want anyone else to access. Encrypting that sensitive data is becoming more and more important, especially for enterprise users. For Apache Spark, which is the emerging standard for big data processing, data is transferred via network and also spilled to disk during shuffles—thus, unencrypted shuffle data will result in an unprotected Spark job. And because shuffle performance is important for ... Read More
BI and SQL Analytics with Apache Impala (Incubating) in CDH 5.8: 3x Faster on Secure Clusters – Cloudera Engineering Blog

Released with CDH 5.8, Impala 2.6 brings solid performance improvements, particularly for clusters secured by Kerberos running BI workloads on Apache Hadoop. Just a few months back, we showed you how Impala 2.5 delivered a 4x performance boost compared to Impala 2.3 for BI workloads on Hadoop via the introduction of several features like runtime filters. Here’s an update: Compared to two releases ago, Impala 2.6 delivers 12x better performance on secure workloads and continues this drumbeat of consistent performance improvement. We are excited to share details on performance improvements in Impala 2.6 with you here. (Impala 2.6 also brings ... Read More
Multi-node Clusters with Cloudera QuickStart for Docker – Cloudera Engineering Blog

Getting hands-on with a multi-node cluster for self-learning or testing is now even easier. Last December, we introduced the Cloudera QuickStart Docker image to make it easier than ever before to explore Cloudera’s distributed data processing platform, including tools such as Apache Impala (incubating), Apache Spark, and Apache Solr. While the single-node getting-started image was well received, we noted a large number of requests from the community for a multi-node CDH deployment via Docker. Today, we are excited to announce the new-and-improved Cloudera QuickStart for Docker. To enable a multi-node cluster deployment on the same Docker host, we created a CDH ... Read More
How-to: Ingest Email into Apache Hadoop in Real Time for Analysis – Cloudera Engineering Blog

Apache Hadoop is a proven platform for long-term storage and archiving of structured and unstructured data. Related ecosystem tools, such as Apache Flume and Apache Sqoop, allow users to easily ingest structured and semi-structured data without requiring the creation of custom code. Unstructured data, however, is a more challenging subset of data that typically lends itself to batch-ingestion methods. Although such methods are suitable for many use cases, with the advent of technologies like Apache Spark, Apache Kafka, and Apache Impala (Incubating), Hadoop is also increasingly a real-time platform. In particular, compliance-related use cases centered on electronic forms of communication, ... Read More
Livy, the Open Source REST Service for Apache Spark, Joins Cloudera Labs – Cloudera Engineering Blog

Livy, which streamlines Spark architecture for web/mobile apps, is the newest addition to Cloudera Labs. With respect to the impact of Apache Spark on the Apache Hadoop ecosystem, its virtual overnight adoption as the default data processing engine, and as a standard for powering advanced analytic applications, speaks for itself. But that’s not to say that there isn’t work yet to be done, particularly in the areas of performance at scale/under multi-tenancy, developer productivity, and extensibility. For example, architectural options for Spark-based applications have been limited by, among other things, a lack of direct access to Spark resources by remote applications without ... Read More
Cloudera Enterprise 5.8 is Now Available – Cloudera Engineering Blog

Cloudera Enterprise 5.8 is now generally available (comprising CDH 5.8, Cloudera Manager 5.8, and Cloudera Navigator 2.7). Cloudera is excited to announce the general availability of Cloudera Enterprise 5.8! Main highlights of this release include Impala read/write support on Amazon S3, a redesigned SQL query editor GUI, the expansion of role-based access control functionality to Cloudera Search, and the GA of Cloudera Navigator Optimizer to facilitate and optimize workload migrations. For those new to it, Cloudera Navigator Optimizer (previously in beta) is a cloud-based service that helps with offload planning and active data optimization for Apache Hadoop. For example, it ... Read More
List Of NoSQL Databases [currently >225]
Your Ultimate Guide to the Non-Relational Universe! [including a historic Archive 2009-2011] News Feed covering some changes here! NoSQL DEFINITION: Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing [...] ... Read More