Posts tagged Spark
Disaster recovery considerations with Amazon EMR on Amazon EC2 for Spark workloads

Feed: AWS Big Data Blog. Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR launches all nodes for a given cluster in the same Amazon Elastic Compute Cloud (Amazon EC2) Availability Zone to improve performance. During an Availability Zone failure or any other unexpected interruption, Amazon EMR may become inaccessible, so a disaster recovery (DR) strategy is needed to mitigate this problem. Part of architecting a resilient, highly available Amazon ... Read More
Amazon EMR 6.6 adds support for Apache Spark 3.2, HUDI 0.10.1, Iceberg 0.13, Trino 0.367, PrestoDB 0.267, and more
Feed: Recent Announcements. Amazon EMR release 6.6 now supports Apache Spark 3.2, Apache Spark RAPIDS 22.02, CUDA 11, Apache Hudi 0.10.1, Apache Iceberg 0.13, Trino 0.367, and PrestoDB 0.267. You can use the performance-optimized version of Apache Spark 3.2 on EMR on EC2, EKS, and the recently released EMR Serverless. In addition, Apache Hudi 0.10.1 and Apache Iceberg 0.13 are available on EC2, EKS, and Serverless. Apache Hive 3.1.2 is available on EMR on EC2 and EMR Serverless. Trino 0.367 and PrestoDB 0.267 are only available on EMR on EC2. Each Amazon EMR release version uses a default Amazon Linux 2 ... Read More
Amazon EMR on Amazon EKS provides up to 61% lower costs and up to 68% performance improvement for Spark workloads

Feed: AWS Big Data Blog. Amazon EMR on Amazon EKS is a deployment option offered by Amazon EMR that enables you to run Apache Spark applications on Amazon Elastic Kubernetes Service (Amazon EKS) in a cost-effective manner. It uses the EMR runtime for Apache Spark to increase performance so that your jobs run faster and cost less. In our benchmark tests using TPC-DS datasets at 3 TB scale, we observed that Amazon EMR on EKS provides up to 61% lower costs and up to 68% improved performance compared to running open-source Apache Spark on Amazon EKS with equivalent configurations. In ... Read More
Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark

Feed: AWS Big Data Blog. Data created in the cloud is growing rapidly, so scalability is a key factor in distributed data processing. Many customers benefit from the scalability of the AWS Glue serverless Spark runtime. Today, we’re pleased to announce the release of AWS Glue Auto Scaling, which helps you scale your AWS Glue Spark jobs automatically based on the requirements calculated dynamically during the job run, and accelerate job runs at lower cost without detailed capacity planning. Before AWS Glue Auto Scaling, you had to predict workload patterns in advance. For example, in cases when ... Read More
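For readers who want to try the feature, here is a minimal sketch of enabling Auto Scaling on a Glue 3.0 Spark job with boto3, assuming the `--enable-auto-scaling` job argument described for Glue 3.0; the job name, IAM role, script location, and worker count are placeholders, not values from the post.

```python
# Hypothetical sketch: creating an AWS Glue 3.0 Spark job with Auto Scaling enabled.
# Job name, role ARN, script location, and worker count are placeholder values.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-autoscaling-job",                            # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",    # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl.py",  # placeholder script path
    },
    GlueVersion="3.0",            # Auto Scaling is described for Glue 3.0 Spark jobs
    WorkerType="G.1X",
    NumberOfWorkers=50,           # acts as the upper bound when Auto Scaling is on
    DefaultArguments={
        "--enable-auto-scaling": "true",  # assumed job argument that turns the feature on
    },
)
```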
Amazon Keyspaces now helps you read and write data in Apache Spark more easily
Feed: Recent Announcements. Apache Spark is an open-source engine for large-scale data analytics. Customers use Apache Spark to perform analytics on data stored in Amazon Keyspaces more efficiently. Customers also use Amazon Keyspaces to provide applications with consistent, single-digit-millisecond read access to analytics data from Spark. Now, you can read and write data between Amazon Keyspaces and Spark more easily by using the open-source Spark Cassandra Connector. Amazon Keyspaces support for the Spark Cassandra Connector helps you run Cassandra workloads in Spark-based analytics pipelines more easily by using a fully managed and serverless database service. With Amazon Keyspaces, you don’t need ... Read More
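As an illustration of the pattern (not taken from the announcement), the sketch below reads from and writes to a Keyspaces table through the open-source Spark Cassandra Connector. The keyspace, table names, and endpoint are assumptions, and authentication/SSL settings required for Keyspaces are omitted for brevity.

```python
# Illustrative sketch: using the Spark Cassandra Connector against Amazon Keyspaces.
# The connector jar is normally supplied via --packages
# com.datastax.spark:spark-cassandra-connector_2.12:<version>.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("keyspaces-example")
    .config("spark.cassandra.connection.host", "cassandra.us-east-1.amazonaws.com")  # assumed endpoint
    .config("spark.cassandra.connection.port", "9142")                               # assumed TLS port
    # Authentication and SSL options omitted; configure them per the Keyspaces docs.
    .getOrCreate()
)

# Read a Keyspaces table through the connector's DataSource
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="orders")   # placeholder names
    .load()
)

# Write the (possibly transformed) DataFrame back to another table
(
    df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="orders_copy")  # placeholder names
    .mode("append")
    .save()
)
```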
Amazon EMR Managed Scaling is now Spark shuffle data aware
Feed: Recent Announcements. Amazon EMR Managed Scaling automatically resizes EMR clusters for best performance and resource utilization. Today, we are excited to announce a new capability in Managed Scaling that prevents it from scaling down instances that store intermediate shuffle data for Apache Spark. Intelligently scaling down clusters without removing the instances that store intermediate shuffle data prevents job re-attempts and re-computations, which leads to better performance and lower cost. With EMR Managed Scaling, you specify the minimum and maximum compute limits for your clusters. EMR Managed Scaling can be used with Amazon EC2 Spot Instances, which let you take advantage ... Read More
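To show what "minimum and maximum compute limits" look like in practice, here is a rough boto3 sketch that attaches a Managed Scaling policy to an existing cluster; the cluster ID and capacity numbers are placeholders, not values from the announcement.

```python
# Rough sketch: attaching an EMR Managed Scaling policy with min/max compute limits.
import boto3

emr = boto3.client("emr")

emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLECLUSTERID",             # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",            # limits expressed as instance counts
            "MinimumCapacityUnits": 2,          # never scale below 2 instances
            "MaximumCapacityUnits": 10,         # never scale above 10 instances
            "MaximumOnDemandCapacityUnits": 4,  # the rest can come from Spot Instances
        }
    },
)
```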
Introducing the new ArangoDB Datasource for Apache Spark
Feed: ArangoDB blog: latest news from the NoSQL multi-model database. Author: Michele Rastelli. Estimated reading time: 8 minutes. We are proud to announce the general availability of ArangoDB Datasource for Apache Spark: a new-generation Spark connector for ArangoDB. Nowadays, Apache Spark is one of the most popular analytics frameworks for large-scale data processing. It is designed to process in parallel data that is too large or complex for traditional databases, providing high performance by optimizing query execution, caching data in memory, and controlling data distribution. It exposes a pluggable DataSource API that can be implemented to allow interaction with ... Read More
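As a sketch of what a DataSource-style read might look like with this connector, the example below loads an ArangoDB collection into a Spark DataFrame. The format string and option names ("com.arangodb.datasource", endpoints/database/table) as well as the endpoint and credentials are assumptions; verify them against the ArangoDB Datasource documentation.

```python
# Illustrative sketch only: reading an ArangoDB collection via a Spark DataSource.
# The connector package must be on the Spark classpath (e.g. via --packages).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arangodb-example").getOrCreate()

df = (
    spark.read
    .format("com.arangodb.datasource")       # assumed DataSource short name
    .option("endpoints", "localhost:8529")   # placeholder ArangoDB endpoint
    .option("database", "myDatabase")        # placeholder database
    .option("table", "myCollection")         # placeholder collection
    .option("user", "root")                  # placeholder credentials
    .option("password", "secret")
    .load()
)

df.printSchema()
df.show(5)
```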
Amazon FinSpace managed Apache Spark clusters now support Apache Spark 3
Feed: Recent Announcements. Amazon FinSpace managed Spark clusters now support Apache Spark 3.1.2. Apache Spark 3 has query optimization features like dynamic partition pruning to optimize joins, such as joining a large fact table of trades with a smaller dimension table of execution centers. It also includes changes to be more compatible with the ANSI SQL standard and features 30 new built-in functions. FinSpace Spark clusters make it simple for analysts to launch, connect to, resize, and terminate clusters. FinSpace Spark clusters are available in five sizes, so you can select a configuration that is suitable for your workload, and they ... Read More
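For context on the join pattern that dynamic partition pruning targets, here is a generic example (not from the announcement) of a large partitioned fact table joined to a small filtered dimension table; the table and column names are placeholders and the tables are assumed to already exist in the catalog.

```python
# Generic illustration of a fact/dimension join that Spark 3's dynamic partition
# pruning can optimize. Assumes "trades" is partitioned by execution_center_id and
# that both tables are registered in the metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-example").getOrCreate()

# With dynamic partition pruning, the filter on the dimension table is propagated
# at runtime so only the matching partitions of the trades table are scanned.
result = spark.sql("""
    SELECT t.trade_id, t.quantity, c.center_name
    FROM trades t
    JOIN execution_centers c
      ON t.execution_center_id = c.center_id
    WHERE c.region = 'EMEA'
""")
result.show()
```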
Amazon EMR now supports Apache Spark SQL to insert data into and update Glue Data Catalog tables when Lake Formation integration is enabled
Feed: Recent Announcements. Amazon EMR integration with AWS Lake Formation allows you to define and enforce database, table, and column-level permissions when Apache Spark users access data in Amazon S3 through the Glue Data Catalog. Previously, when AWS Lake Formation integration was enabled, you could only read data using Spark SQL statements such as SHOW DATABASES and DESCRIBE TABLE. Now, you can also insert data into, or update, the Glue Data Catalog tables with these statements: INSERT INTO, INSERT OVERWRITE, and ALTER TABLE. This feature is enabled on Amazon EMR 5.34 in the following AWS ... Read More
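The sketch below shows the three newly supported statement types issued through Spark SQL; the database, table, and column names are placeholders, and what actually succeeds still depends on the Lake Formation permissions granted to the user.

```python
# Sketch of the write statements now supported against Glue Data Catalog tables
# on an EMR cluster with Lake Formation integration enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lf-write-example").enableHiveSupport().getOrCreate()

# INSERT INTO a catalog table (placeholder table with three columns)
spark.sql("INSERT INTO sales_db.orders VALUES (1001, 'widget', 3)")

# INSERT OVERWRITE a summary table from a query
spark.sql("""
    INSERT OVERWRITE TABLE sales_db.orders_summary
    SELECT product, SUM(quantity) AS total_quantity
    FROM sales_db.orders
    GROUP BY product
""")

# ALTER TABLE to evolve the schema
spark.sql("ALTER TABLE sales_db.orders ADD COLUMNS (discount DOUBLE)")
```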
Amazon SageMaker Feature Store connector for Apache Spark for easy batch data ingestion
Feed: Recent Announcements. Amazon SageMaker Feature Store is announcing a new enhancement, a connector for Apache Spark that makes batch data ingestion easier for customers. Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning (ML) model features. There are various ways to ingest data into SageMaker Feature Store, including the PutRecord API, the SageMaker Python SDK’s FeatureGroup.ingest functionality, and SageMaker Processing Job ... Read More
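As a rough illustration of batch ingestion through the Spark connector (not code from the announcement), the sketch below pushes a small DataFrame into a feature group. The import path, FeatureStoreManager class, ingest_data signature, and the feature group ARN are assumptions to be checked against the connector's documentation.

```python
# Illustrative sketch: batch-ingesting a Spark DataFrame into a SageMaker Feature
# Store feature group with the Feature Store Spark connector (assumed API names).
from pyspark.sql import SparkSession
from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager  # assumed import path

spark = SparkSession.builder.appName("feature-store-ingest").getOrCreate()

# Placeholder feature data; a real feature group expects a record identifier and an
# event-time column that match the feature group's definition.
df = spark.createDataFrame(
    [("customer-1", 42.0, "2022-04-01T00:00:00Z")],
    ["customer_id", "lifetime_value", "event_time"],
)

manager = FeatureStoreManager()
manager.ingest_data(
    input_data_frame=df,
    feature_group_arn="arn:aws:sagemaker:us-east-1:123456789012:feature-group/customers",  # placeholder ARN
)
```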