Posts tagged Hive
Amazon Athena now supports views in Apache Hive metastores
Feed: Recent Announcements. You can now use Amazon Athena to query views stored in your self-managed Apache Hive metastores. Hive views are defined using the Hive Query Language (HiveQL), which is not fully compatible with Athena's standard SQL. With this new capability, Athena automatically handles HiveQL syntax differences so you can query Hive views without changing your view definitions or maintaining a complex translation layer. A view is a logical table created using the results of a query that executes against a physical table each time the view is referenced. Views are commonly used to focus, simplify, and optimize access ... Read More
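A minimal sketch of querying such a view through boto3's Athena client; the database, view, and results bucket names are hypothetical placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against a Hive view registered in a self-managed
# Hive metastore (all names below are illustrative placeholders).
resp = athena.start_query_execution(
    QueryString="SELECT * FROM sales_summary_view LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```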
Optimizing Hive on Tez Performance
Feed: Cloudera Blog. Author: Jay Desai. Posted in Technical | May 09, 2022 8 min read There is no one-size-fits-all approach to tuning Hive on Tez queries. Query performance depends on the size of the data, file types, query design, and query patterns. During performance testing, evaluate and validate configuration parameters and any SQL modifications. It is advisable to make one change at a time during performance testing of the workload, and to assess the impact of each tuning change in your development and QA environments before using it in production. Cloudera WXM ... Read More
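As an illustration of the one-change-at-a-time approach, here is a sketch using PyHive to set a single Tez session parameter before running a test query; the host, credentials, and table name are placeholders, and the parameter value is an example rather than a recommendation.

```python
from pyhive import hive

# Connect to HiveServer2 (hostname and username are placeholders).
conn = hive.Connection(host="hs2.example.com", port=10000, username="etl_user")
cursor = conn.cursor()

# Change exactly one parameter per test run so its impact can be isolated.
# hive.tez.container.size is a real Hive-on-Tez setting; 4096 MB is only
# an example value, not a recommendation.
cursor.execute("SET hive.tez.container.size=4096")

# Run the workload under test, and measure it (e.g. via query logs)
# before trying the next parameter value.
cursor.execute("SELECT COUNT(*) FROM sales_fact WHERE sale_date = '2022-05-09'")
print(cursor.fetchone())
```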
Up to 15 times improvement in Hive write performance with the Amazon EMR Hive zero-rename feature

Feed: AWS Big Data Blog. Our customers use Apache Hive on Amazon EMR for large-scale data analytics and extract, transform, and load (ETL) jobs. Amazon EMR Hive uses Apache Tez as the default job execution engine, which creates Directed Acyclic Graphs (DAGs) to process data. Each DAG can contain multiple vertices from which tasks are created to run the application in parallel. Their final output is written to Amazon Simple Storage Service (Amazon S3). Hive initially writes data to staging directories and then moves it to the final location after a series of rename operations. This design of Hive renames ... Read More
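EMR features like this are typically switched on through a hive-site configuration classification at cluster launch, as in the boto3 sketch below. The specific property name here is a hypothetical placeholder, since the excerpt does not name the real key; check the Amazon EMR release documentation for the actual setting and supported release labels.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a cluster with a hive-site override. The property key below is a
# hypothetical placeholder for the zero-rename switch; consult the Amazon
# EMR documentation for the real key before using this.
response = emr.run_job_flow(
    Name="hive-zero-rename-demo",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Hive"}, {"Name": "Tez"}],
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {"hive.example.zero-rename.enabled": "true"},  # placeholder key
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```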
Migrate Hive data from CDH to CDP public cloud

Feed: Cloudera Blog. Author: Shailesh Shiwalkar. Posted in Technical | June 25, 2021 7 min read Introduction Many Cloudera customers are making the transition from fully on-premises deployments to the cloud, either by backing up their data in the cloud or by running multi-functional analytics on CDP Public Cloud in AWS or Azure. The Replication Manager service facilitates both disaster recovery and data migration across different environments. Using easy-to-define policies, Replication Manager removes one of the biggest barriers for customers in their cloud adoption journey by allowing them to move both tables/structured data and files/unstructured data to the CDP cloud of ... Read More
Amazon VPC Flow Logs now supports Apache Parquet, Hive-compatible prefixes and Hourly partitioned files
Feed: Recent Announcements. Amazon Virtual Private Cloud (VPC) is introducing three new features to make it faster, easier, and more cost-efficient to store and run analytics on your Amazon VPC Flow Logs. First, VPC Flow Logs can now be delivered to Amazon S3 in the Apache Parquet file format. Second, they can be stored in S3 with Hive-compatible prefixes. And third, your VPC Flow Logs can be delivered as hourly partitioned files. All of these features are available when you choose S3 as the destination for your VPC Flow Logs. Queries on VPC Flow Logs stored in Apache Parquet ... Read More
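A sketch of enabling all three options at once through boto3's create_flow_logs call; the VPC ID and bucket ARN are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Deliver flow logs to S3 as Parquet, with Hive-compatible prefixes and
# hourly partitioned files (VPC ID and bucket ARN are placeholders).
response = ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket/flow-logs/",
    DestinationOptions={
        "FileFormat": "parquet",
        "HiveCompatiblePartitions": True,
        "PerHourPartition": True,
    },
)
print(response["FlowLogIds"])
```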
Amazon EMR now supports Apache Spark SQL to insert data into and update Apache Hive metadata tables when Apache Ranger integration is enabled
Feed: Recent Announcements. This January, we launched Amazon EMR integration with Apache Ranger, a feature that allows you to define and enforce database, table, and column-level permissions when Apache Spark users access data in Amazon S3 through the Hive Metastore. Previously, when Apache Ranger integration was enabled, you were limited to reading data using Spark SQL statements such as SHOW DATABASES and DESCRIBE TABLE. Now, you can also insert data into, or update, the Apache Hive metadata tables with these statements: INSERT INTO, INSERT OVERWRITE, and ALTER TABLE. This feature is enabled on Amazon EMR 6.4 in the following ... Read More
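A minimal PySpark sketch of the newly permitted write statements; the database and table names are placeholders, and the configured Ranger policies determine whether each call succeeds.

```python
from pyspark.sql import SparkSession

# On an EMR cluster with the Ranger integration enabled, Hive metastore
# access is governed by the configured Ranger policies.
spark = (SparkSession.builder
         .appName("ranger-write-demo")
         .enableHiveSupport()
         .getOrCreate())

# Reads were already permitted; statements like these now work too,
# subject to the caller's Ranger permissions (names are placeholders).
spark.sql("INSERT INTO sales_db.orders VALUES (1001, 'widget', 19.99)")
spark.sql("INSERT OVERWRITE TABLE sales_db.daily_totals "
          "SELECT order_date, SUM(amount) FROM sales_db.orders GROUP BY order_date")
spark.sql("ALTER TABLE sales_db.orders SET TBLPROPERTIES ('comment' = 'demo')")
```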
Now use Apache Spark, Hive, and Presto on Amazon EMR clusters directly from Amazon SageMaker Studio for large-scale data processing and machine learning
Feed: Recent Announcements. You can now use open source frameworks such as Apache Spark, Apache Hive, and Presto running on Amazon EMR clusters directly from Amazon SageMaker Studio notebooks to run petabyte-scale data analytics and machine learning. Amazon EMR automatically installs and configures open source frameworks and provides a performance-optimized runtime that is compatible with and faster than standard open source. For example, Spark 3.0 on Amazon EMR is 1.7x faster than its open-source equivalent. Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as ... Read More
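Connecting a Studio notebook to an EMR cluster is typically done with the sagemaker-studio-analytics-extension magics; the sketch below assumes that extension is installed in the notebook image, and the cluster ID is a placeholder.

```python
# In a SageMaker Studio notebook cell (assumes the
# sagemaker-studio-analytics-extension package is installed).
%load_ext sagemaker_studio_analytics_extension.magics

# Connect to an existing EMR cluster; the cluster ID is a placeholder.
%sm_analytics emr connect --cluster-id j-1K2345ABCDEFG --auth-type None

# Once connected, subsequent cells run PySpark against the cluster, e.g.:
df = spark.sql("SELECT * FROM default.web_logs LIMIT 10")
df.show()
```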
How WANdisco LiveData Migrator Can Migrate Apache Hive Metastore to AWS Glue Data Catalog

Feed: AWS Partner Network (APN) Blog. By Paul Scott-Murphy, Chief Technology Officer – WANdisco, and Roy Hasson, Principal Product Manager – AWS Glue / AWS Lake Formation. WANdisco is the LiveData company and an AWS ISV Partner with the Migration Competency that provides technology and software products used to simplify and automate the migration of big data to the cloud. In this post, we'll explain the challenges of migrating large, complex, actively used structured datasets to Amazon Web Services (AWS), and how the combination of WANdisco LiveData Migrator, Amazon Simple Storage Service (Amazon S3), and AWS Glue ... Read More
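After a metastore migration, one way to verify the results is to list the migrated tables with boto3's Glue client, as in this sketch; the database name is a placeholder.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List tables that landed in the Glue Data Catalog after the migration
# (database name is a placeholder).
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="migrated_hive_db"):
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(f"{table['Name']}: {location}")
```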
Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache Hudi on Amazon EMR

Feed: AWS Big Data Blog. Organizations across the globe are striving to improve the scalability and cost efficiency of the data warehouse. Offloading data and data processing from a data warehouse to a data lake empowers companies to introduce new use cases like ad hoc data analysis and AI and machine learning (ML), reusing the same data stored on Amazon Simple Storage Service (Amazon S3). This approach avoids data silos and allows you to process the data at very large scale while keeping data access cost-effective. Starting off with this new approach can bring several challenges: Choosing ... Read More
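A minimal sketch of the Hudi upsert at the heart of an SCD2 pipeline, in PySpark; the table name, key fields, and S3 path are placeholders, and a full SCD2 implementation would also manage effective/expiry dates before writing.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is on the classpath, e.g. started with
# spark-submit --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>
spark = SparkSession.builder.appName("scd2-hudi-demo").getOrCreate()

updates = spark.createDataFrame(
    [(42, "ACME Corp", "2021-06-01")],
    ["customer_id", "customer_name", "effective_date"],
)

# Core Hudi write options; record key identifies the row, precombine
# field picks the latest version on conflict (names are placeholders).
hudi_options = {
    "hoodie.table.name": "customers_scd2",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "effective_date",
    "hoodie.datasource.write.operation": "upsert",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-data-lake/customers_scd2/"))
```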
Real-Time Big Data Analytics: How to Replicate from MySQL to Hadoop

Feed: Planet MySQL; Author: Continuent; First off: Happy 15th birthday, Hadoop! It wasn't an April Fool's joke then, and it isn't today either: Hadoop's initial release was on the 1st of April 2006 :-) As most of you will know, Apache Hadoop is a powerful and popular tool, which has been driving much of the Big Data movement over the years. It is generally understood to be a system that provides a (distributed) file system, which in turn stores data to be used by applications without knowing about the structure of the data. In other words, it's a file system ... Read More