Posts tagged Apache
Apache Iceberg: An Introduction from Rackspace on Running the New Open Table Format on AWS

Feed: AWS Partner Network (APN) Blog. Author: Chaitanya Varma Mudundi, Professional Services Big Data Engineer at Rackspace. Data-driven decision making is accelerating and defining the way organizations work. With this transformation, there has been a rapid adoption of data lakes across the industry. To fuel this transformation, data lakes have evolved over the last decade. Apache Hive is a standard for data lakes, but while Apache Hive can solve some of the issues with the processing of data, it falls short of a few other objectives for next-generation data processing. In this post, I will ... Read More
Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR

Feed: AWS Big Data Blog. Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Apache Iceberg is an open table format for huge analytic datasets. Table formats typically indicate the format and location of individual table files. Iceberg adds functionality on top of that to help manage petabyte-scale datasets as well as newer data lake requirements such as transactions, upsert/merge, time travel, and schema and partition evolution. Iceberg adds tables to compute engines including ... Read More
Amazon MSK adds support for Apache Kafka version 3.1.1 and 3.2.0
Feed: Recent Announcements. Amazon MSK is a fully managed service for Apache Kafka that makes it easier for you to build and run applications that use Apache Kafka as a data store. Amazon MSK is 100% compatible with Apache Kafka, which enables you to quickly migrate your existing Apache Kafka workloads to Amazon MSK with confidence or build new ones from scratch. With Amazon MSK, you can spend more time innovating on applications and less time managing clusters. To learn how to get started, see the Amazon MSK Developer Guide ... Read More
Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel

Feed: AWS Big Data Blog. Nowadays, many customers have built their data lakes as the core of their data analytic systems. In a typical use case of data lakes, many concurrent queries run to retrieve consistent snapshots of business insights by aggregating query results. A large volume of data constantly comes from different data sources into the data lakes. There is also a common demand to reflect the changes occurring in the data sources into the data lakes. This means that not only inserts but also updates and deletes need to be replicated into the data lakes. Apache Iceberg provides ... Read More
Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue

Feed: AWS Big Data Blog. Most businesses store their critical data in a data lake, where you can bring data from various sources to a centralized storage. The data is processed by specialized big data compute engines, such as Amazon Athena for interactive queries, Amazon EMR for Apache Spark applications, Amazon SageMaker for machine learning, and Amazon QuickSight for data visualization. Apache Iceberg is an open-source table format for data stored in data lakes. It is optimized for data access patterns in Amazon Simple Storage Service (Amazon S3) cloud object storage. Iceberg helps data engineers tackle complex challenges in data ... Read More
Implement a CDC-based UPSERT in a data lake using Apache Iceberg and AWS Glue

Feed: AWS Big Data Blog. As the adoption of data lakes and modern data architecture increases, customers’ expectations of their features also grow, including ACID transactions, UPSERT, time travel, schema evolution, auto compaction, and many more. By default, Amazon Simple Storage Service (Amazon S3) objects are immutable, which means you can’t update records in your data lake because it supports append-only transactions. But there are use cases where you might be receiving incremental updates with change data capture (CDC) from your source systems, and you might need to update existing data in Amazon S3 to have a golden copy ... Read More
Amazon EMR 6.6 adds support for Apache Spark 3.2, HUDI 0.10.1, Iceberg 0.13, Trino 0.367, PrestoDB 0.267, and more
Feed: Recent Announcements. Amazon EMR release 6.6 now supports Apache Spark 3.2, Apache Spark RAPIDS 22.02, CUDA 11, Apache Hudi 0.10.1, Apache Iceberg 0.13, Trino 0.367, and PrestoDB 0.267. You can use the performance-optimized version of Apache Spark 3.2 on EMR on EC2, EMR on EKS, and the recently released EMR Serverless. In addition, Apache Hudi 0.10.1 and Apache Iceberg 0.13 are available on EC2, EKS, and Serverless. Apache Hive 3.1.2 is available on EMR on EC2 and EMR Serverless. Trino 0.367 and PrestoDB 0.267 are only available on EMR on EC2. Each Amazon EMR release version uses a default Amazon Linux 2 ... Read More
Amazon Managed Streaming for Apache Kafka is now FedRAMP compliant
Feed: Recent Announcements. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is now authorized as FedRAMP Moderate in US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon) and as FedRAMP High in AWS GovCloud (US) Regions. The Federal Risk and Authorization Management Program (FedRAMP) is a US government-wide program that delivers a standard approach to the security assessment, authorization, and continuous monitoring for cloud products and services. FedRAMP uses the National Institute of Standards and Technology (NIST) Special Publication 800 series and requires cloud service providers to receive an independent security assessment conducted by a third-party ... Read More
AWS Glue now supports SASL authentication for Apache Kafka
Feed: Recent Announcements. AWS Glue can now connect to Apache Kafka using additional client authentication mechanisms. AWS Glue now supports SASL (Simple Authentication and Security Layer) using either SCRAM (Salted Challenge Response Authentication Mechanism) or GSSAPI (Kerberos). AWS Glue supports data streams including Amazon Kinesis and Apache Kafka, applies complex transformations in flight, and loads the results into a target data store for analytics and machine learning. With this feature, you can now stream data from Apache Kafka producers that use SASL (SCRAM and GSSAPI) for client authentication. You can choose from these client authentication mechanisms when creating a Kafka connection in ... Read More
Amazon Athena now supports views in Apache Hive metastores
Feed: Recent Announcements. You can now use Amazon Athena to query views stored in your self-managed Apache Hive metastores. Hive views are defined using the Hive Query Language (HiveQL) which is not fully compatible with Athena's standard SQL. With this new capability, Athena automatically handles HiveQL syntax differences so you can query Hive views without changing your view definitions or maintaining a complex translation layer. A view is a logical table created using the results of a query that executes against a physical table each time the view is referenced. Views are commonly used to focus, simplify, and optimize access ... Read More