All news tagged #apache spark

Tue, December 2, 2025

AWS launches Apache Spark Upgrade Agent for Amazon EMR

🛠️ AWS announced the Apache Spark upgrade agent, a capability that automates and accelerates Spark version upgrades for Amazon EMR on EC2 and EMR Serverless. The agent performs automated code analysis across PySpark and Scala, identifies API and behavioral changes for Spark 2.4→3.5, and suggests precise code transformations. Engineers can invoke the agent from SageMaker Unified Studio, the Kiro CLI, or any MCP-compatible IDE, interact via natural-language prompts, review proposed edits, and approve implementations. Functional correctness is validated through data quality checks to help maintain processing accuracy during migration.

Wed, November 26, 2025

AWS Glue 5.1 GA: Spark 3.5, Iceberg 3.0, Lake Formation

⚡ AWS Glue 5.1 is now generally available, upgrading core engines to Apache Spark 3.5.6, Python 3.11, and Scala 2.12.18 to deliver performance and security improvements. The release refreshes open table format support (Apache Hudi 1.0.2, Apache Iceberg 1.10.0, Delta Lake 3.3.2) and adds Apache Iceberg format version 3 (v3) features such as default column values and deletion vectors. AWS Lake Formation now enforces fine-grained write control for Spark DDL/DML, and Glue adds full-table access control for Hudi and Delta tables in Spark.

Fri, November 21, 2025

Amazon EMR Serverless Adds Apache Spark 4.0.1 (Preview)

🚀 Amazon EMR Serverless now supports Apache Spark 4.0.1 (preview), enabling teams to build data pipelines using standard ANSI SQL and native VARIANT types for semi-structured data. The release adds Apache Iceberg v3 table format to provide transactional guarantees and audit-ready change tracking. Improved streaming controls make it easier to manage stateful, real-time applications and monitor streaming jobs.

Fri, November 21, 2025

Amazon Athena for Apache Spark Integrated with SageMaker

🚀 Amazon SageMaker now supports Amazon Athena for Apache Spark, combining a new notebook experience with a fast serverless Spark runtime in a single workspace. Data engineers, analysts, and data scientists can query data, run Python, develop jobs, train models, and visualize results with no infrastructure to manage and per-second billing. The service runs Spark 3.5.6, is optimized for Apache Iceberg and Delta Lake, and adds debugging, real-time Spark UI monitoring, and secure Spark Connect communication. Table-level access controls are enforced through AWS Lake Formation.

Fri, November 21, 2025

Amazon EMR 7.12 Adds Apache Iceberg v3 Table Format

🆕 Amazon EMR 7.12 now supports the Apache Iceberg v3 table format (Iceberg 1.10) and includes Apache Spark 3.5.6. This update reduces storage and pipeline costs by marking deleted rows instead of rewriting files, while adding automatic row-level history for stronger governance and change-data capture. It also introduces table-level encryption and integrates with AWS Lake Formation. Trino 476 is included, and EMR 7.12 is available in all Regions that support EMR.
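
The idea of marking deleted rows rather than rewriting files can be illustrated with a small conceptual toy in plain Python. This is only a sketch of the general technique, not Iceberg's implementation or on-disk format: the `DataFile` class, its `delete_where` and `scan` methods, and the sample rows are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy stand-in for an immutable data file (hypothetical, not an Iceberg class)."""
    rows: list
    # Positions of rows marked as deleted; the stored rows are never rewritten.
    deleted_positions: set = field(default_factory=set)

    def delete_where(self, predicate):
        """Mark matching rows as deleted and return how many were newly marked."""
        newly_marked = {
            pos for pos, row in enumerate(self.rows)
            if pos not in self.deleted_positions and predicate(row)
        }
        self.deleted_positions |= newly_marked
        return len(newly_marked)

    def scan(self):
        """Return live rows, skipping positions recorded as deleted."""
        return [
            row for pos, row in enumerate(self.rows)
            if pos not in self.deleted_positions
        ]


# Illustrative sample data (made up for this sketch).
data_file = DataFile(rows=[
    {"id": 1, "region": "eu"},
    {"id": 2, "region": "us"},
    {"id": 3, "region": "eu"},
])

marked = data_file.delete_where(lambda row: row["region"] == "eu")
live_rows = data_file.scan()
```

In this toy, a delete only adds positions to a small side structure, and readers filter those positions out at scan time. The stored rows stay untouched, which is the intuition behind avoiding a full file rewrite for each delete.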

Fri, November 21, 2025

AWS Glue adds DynamoDB connector with Spark DataFrame

🚀 AWS Glue now includes a new Amazon DynamoDB connector that natively supports Apache Spark DataFrames. This enables developers to reuse existing Spark DataFrame code across AWS Glue, Amazon EMR, and other Spark environments with minimal modification, replacing prior reliance on Glue-specific DynamicFrame objects. The connector exposes the full range of DataFrame operations and current Spark performance optimizations and is available in all AWS Commercial Regions where Glue runs.

Fri, October 3, 2025

Dataproc ML library: Connect Spark to Gemini and Vertex

🔗 Google has released an open-source Python library, Dataproc ML, to streamline running ML and generative-AI inference from Apache Spark on Dataproc. The library uses a SparkML-style builder pattern so users can configure a model handler (for example, GenAiModelHandler) and call .transform() to apply Gemini or other Vertex AI models directly to DataFrames. It also supports loading PyTorch and TensorFlow model artifacts from GCS for large-scale batch inference and includes performance optimizations such as vectorized data transfer, connection reuse, and automatic retry/backoff.
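
The builder pattern described above can be sketched in plain Python. This is not the Dataproc ML API: `SketchModelHandler`, its `input_col` and `output_col` methods, and `fake_model` are hypothetical names, and a list of dictionaries stands in for a Spark DataFrame. It only illustrates how chained configuration followed by a single `transform()` call might look.

```python
class SketchModelHandler:
    """Hypothetical stand-in illustrating a builder-style handler.

    This is NOT the Dataproc ML API; every method name here is an assumption.
    """

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn  # stand-in for a model inference call
        self._input_col = "input"
        self._output_col = "prediction"

    def input_col(self, name):
        self._input_col = name
        return self  # returning self is what enables method chaining

    def output_col(self, name):
        self._output_col = name
        return self

    def transform(self, rows):
        """Return new rows with the model output added under the output column."""
        return [
            {**row, self._output_col: self._predict_fn(row[self._input_col])}
            for row in rows
        ]


def fake_model(text):
    # Placeholder for inference; a real handler would call a hosted model.
    return text.upper()


handler = (
    SketchModelHandler(fake_model)
    .input_col("city")
    .output_col("shouted")
)
result = handler.transform([{"city": "paris"}, {"city": "lima"}])
```

Each configuration method returns the handler itself, so settings read as one chained expression and the data is only touched when `transform()` is called.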

Fri, August 29, 2025

Amazon EMR Adds Spark FGAC and Glue Data Catalog Views

🔒 Amazon EMR on EC2 now supports Apache Spark native fine-grained access control (FGAC) through AWS Lake Formation and adds support for AWS Glue Data Catalog views. These capabilities let administrators define and enforce granular Lake Formation policies once and apply them consistently to Spark jobs and interactive sessions, reducing administrative overhead and security risk. Access checks support named resource grants, data filters, and tag-based controls and are logged in AWS CloudTrail for auditing.
