Posts tagged Cloudera
Top 30 Data Scientist to follow in 2019
Feed: Big Data Made Simple. Author: Neeraj R. The field of data is ever-growing, and the need for organized data is on the rise. But data is rarely organized from the start. Raw data is largely unstructured and mixed up: much of it has no bearing on the actual work and has to be filtered out. That is where the job of a data scientist comes into play. The primary job is organizing and maintaining proper data using scientific methods, processes, ... Read More
Under the hood: Performance, scale, security for cloud analytics with ADLS Gen2

Feed: Microsoft Azure Blog. Author: James Baker. On February 7, 2019 we announced the general availability of Azure Data Lake Storage (ADLS) Gen2. Azure is now the only cloud provider to offer a no-compromise cloud storage solution that is fast, secure, massively scalable, cost-effective, and fully capable of running the most demanding production workloads. In this blog post we’ll take a closer look at the technical foundation of ADLS that will power the end to end analytics scenarios our customers demand. ADLS is the only cloud storage service that is purpose-built for big data analytics. It is designed to integrate ... Read More
Using Native Math Libraries to Accelerate Spark Machine Learning Applications

Feed: Cloudera Engineering Blog. Author: Tom Wheeler. Spark ML is one of the dominant frameworks for many major machine learning algorithms, such as the Alternating Least Squares (ALS) algorithm for recommendation systems, the Principal Component Analysis algorithm, and the Random Forest algorithm. However, the complexity of configuring it optimally means that Spark ML is frequently underutilized. Using native math libraries for Spark ML can help unlock the full potential of Spark ML. This article discusses how to accelerate model training speed by using native libraries for Spark ML. It also discusses why Spark ML benefits from native libraries, how to enable ... Read More
Integrating Machine Learning Models into Your Big Data Pipelines in Real-Time With No Coding

Feed: Cloudera Engineering Blog. Author: Tom Wheeler. [Editor’s note: This article was originally published on the Hortonworks Community Connection, but reproduced here because CDSW is now available on both Cloudera and Hortonworks platforms.] Using Deployed Models as a Function as a Service Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html), but it will work for all CDSW regardless of install type. In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against ... Read More
Using Sqoop to Import Data from MySQL to Cloudera Data Warehouse

Feed: Cloudera Engineering Blog. Author: Tom Wheeler. Cloudera Data Warehouse offers a powerful combination of flexibility and cost-savings. Using Cloudera Data Warehouse, you can transform and optimize your current traditional data warehouse by moving select workloads to your CDH cluster. This article shows you how to transform your current setup into a modern data warehouse by moving some initial data over to Impala on your CDH cluster. Prerequisites To use the following data import scenario, you need the following: A moderate-sized CDH cluster that is managed by Cloudera Manager, which configures everything properly so you don’t need to worry about ... Read More
Scalability Improvement of Apache Impala 2.12.0 in CDH 5.15.0

Feed: Cloudera Engineering Blog. Author: Tom Wheeler. Key Takeaways We have significantly improved Impala in CDH 5.15.0 to address some of the scalability bottlenecks in query execution. 64 concurrent streams of TPC-DS queries at 10TB scale in a 135-node cluster now run at 6x query throughput compared to previous releases. In addition to running faster, the query success rate also improved from 73% to 100%. Overall, Impala in CDH 5.15.0 provides massive improvements in throughput and reliability while reducing the resource usage significantly. It can now reliably handle concurrent complex queries on large data sets that were not possible before ... Read More
Network Security with Cloudera Altus and Apache Spot

Feed: Cloudera Engineering Blog. Author: Tom Wheeler. In the last few years, IT security threats to enterprise systems have increased, which has necessitated installing log ingestion and analysis solutions in any enterprise network. This blog post illustrates how Cloudera built its own scalable solution for log ingestion and analytics using Apache Spot and Cloudera Altus. By leveraging transient workloads in the cloud, Cloudera reduced the solution’s operational costs by 50% when compared to traditional, persistent cluster approaches. At Cloudera, the Infosec team needed to build a centralized log ingestion and analytical solution to monitor all key systems in the company ... Read More
SMM 1.2 Released with Powerful New Alerting and Topic Lifecycle Management Features with Schema Registry Integration

Feed: Cloudera Engineering Blog. Author: Tom Wheeler. [Editor’s note: Now that the recent merger is complete, the Cloudera Engineering blog will expand to cover products, such as this, originally developed for the Hortonworks platform. Please stay tuned for future product announcements regarding availability of these products on the Cloudera platform.] Since the release of Streams Messaging Manager (SMM) at the end of last summer, our customers have started to cure the Kafka Blindness within their organizations by using SMM to monitor their Kafka clusters and streaming microservices applications. With the release of SMM 1.2, we have delivered on the top ... Read More
Individually great, collectively unmatched: Announcing updates to 3 great Azure Data Services

Feed: Microsoft Azure Blog. Author: Jurgen Willis. As Julia White mentioned in her blog today, we’re pleased to announce the general availability of Azure Data Lake Storage Gen2 and Azure Data Explorer. We also announced the preview of Azure Data Factory Mapping Data Flow. With these updates, Azure continues to be the best cloud for analytics with unmatched price-performance and security. In this blog post we’ll take a closer look at the technical capabilities of these new features. Azure Data Lake Storage - The no compromise Data Lake Azure Data Lake Storage (ADLS) combines the scalability, cost effectiveness, security model, ... Read More
Announcing the general availability of Lsv2-series Azure Virtual Machines

Feed: Microsoft Azure Blog. Author: Joel Pelley. After wrapping up a successful preview with fantastic customer engagement, we are excited to officially announce the general availability of the Lsv2-series Azure Virtual Machines (VMs). Customers from all over the globe and across a broad range of industries participated in the Lsv2-series VMs preview during the second half of 2018. General overview The Lsv2-series features high throughput, low latency, and directly mapped local NVMe storage. The Lsv2 VMs run on the AMD EPYC™ 7551 processor with an all core boost of 2.55GHz. The Lsv2-series VMs offer various configurations from 8 to 80 ... Read More