Tag Banner

All news with #apache spark tag

Fri, October 3, 2025

Dataproc ML library: Connect Spark to Gemini and Vertex

🔗 Google has released an open-source Python library, Dataproc ML, to streamline running ML and generative-AI inference from Apache Spark on Dataproc. The library uses a SparkML-style builder pattern so users can configure a model handler (for example, GenAiModelHandler) and call .transform() to apply Gemini or other Vertex AI models directly to DataFrames. It also supports loading PyTorch and TensorFlow model artifacts from GCS for large-scale batch inference and includes performance optimizations such as vectorized data transfer, connection reuse, and automatic retry/backoff.

read more →

Fri, August 29, 2025

Amazon EMR Adds Spark FGAC and Glue Data Catalog Views

🔒 Amazon EMR on EC2 now supports Apache Spark native fine-grained access control (FGAC) through AWS Lake Formation and adds support for AWS Glue Data Catalog views. These capabilities let administrators define and enforce granular Lake Formation policies once and apply them consistently to Spark jobs and interactive sessions, reducing administrative overhead and security risk. Access checks support named resource grants, data filters, and tag-based controls and are logged in AWS CloudTrail for auditing.

read more →