All news with #apache spark tag
Fri, October 3, 2025
Dataproc ML library: Connect Spark to Gemini and Vertex
🔗 Google has released an open-source Python library, Dataproc ML, to streamline running ML and generative-AI inference from Apache Spark on Dataproc. The library uses a SparkML-style builder pattern: users configure a model handler (for example, GenAiModelHandler) and call .transform() to apply Gemini or other Vertex AI models directly to DataFrames. It also supports loading PyTorch and TensorFlow model artifacts from Google Cloud Storage (GCS) for large-scale batch inference, and it includes performance optimizations such as vectorized data transfer, connection reuse, and automatic retry with backoff.
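To make the builder pattern and retry-with-backoff idea concrete, here is a minimal pure-Python sketch. It is not the Dataproc ML API: SketchModelHandler, its methods other than the .transform() name, and fake_model are hypothetical stand-ins, and plain lists of dicts take the place of Spark DataFrames.

```python
import time
from typing import Callable, Dict, List


class SketchModelHandler:
    """Hypothetical stand-in for a builder-style handler (NOT the Dataproc ML API)."""

    def __init__(self, predict_fn: Callable[[str], str]):
        self._predict_fn = predict_fn      # the "model": maps a prompt to a response
        self._prompt = "{text}"            # template filled from each row's fields
        self._output_col = "prediction"
        self._max_retries = 3
        self._base_delay = 0.01            # seconds; doubled after each failed attempt

    # Each setter returns self so that calls can be chained, builder style.
    def prompt(self, template: str) -> "SketchModelHandler":
        self._prompt = template
        return self

    def output_col(self, name: str) -> "SketchModelHandler":
        self._output_col = name
        return self

    def retries(self, max_retries: int, base_delay: float) -> "SketchModelHandler":
        self._max_retries = max_retries
        self._base_delay = base_delay
        return self

    def _call_with_backoff(self, prompt: str) -> str:
        for attempt in range(self._max_retries + 1):
            try:
                return self._predict_fn(prompt)
            except RuntimeError:
                if attempt == self._max_retries:
                    raise
                time.sleep(self._base_delay * 2 ** attempt)  # exponential backoff
        raise RuntimeError("unreachable")

    def transform(self, rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Return a copy of each row with the model output added as a new column."""
        return [
            {**row, self._output_col: self._call_with_backoff(self._prompt.format(**row))}
            for row in rows
        ]


calls = {"n": 0}


def fake_model(prompt: str) -> str:
    """Hypothetical model: fails once with a transient error, then upper-cases the prompt."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient error")
    return prompt.upper()


handler = (
    SketchModelHandler(fake_model)
    .prompt("Summarize: {review}")
    .output_col("summary")
    .retries(2, 0.001)
)
result = handler.transform([{"review": "great"}, {"review": "slow"}])
```

The first row triggers one retry before succeeding, and the input columns are preserved alongside the new output column. The real library applies the same idea to DataFrames; consult its documentation for the actual class and method names.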
Fri, August 29, 2025
Amazon EMR Adds Spark FGAC and Glue Data Catalog Views
🔒 Amazon EMR on EC2 now supports Apache Spark native fine-grained access control (FGAC) through AWS Lake Formation and adds support for AWS Glue Data Catalog views. These capabilities let administrators define granular Lake Formation policies once and have them enforced consistently across Spark jobs and interactive sessions, reducing administrative overhead and security risk. Access checks support named resource grants, data filters, and tag-based controls, and they are logged in AWS CloudTrail for auditing.
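As an illustration of a named resource grant, the sketch below builds the request body for the Lake Formation GrantPermissions operation, limiting a principal to SELECT on specific columns. The helper build_column_grant, the role ARN, and the database, table, and column names are hypothetical; the snippet only constructs the payload and does not call AWS.

```python
from typing import Dict, List


def build_column_grant(
    principal_arn: str, database: str, table: str, columns: List[str]
) -> Dict:
    """Build a GrantPermissions request that allows SELECT on the listed columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and table names, for illustration only.
request = build_column_grant(
    "arn:aws:iam::111122223333:role/analyst",
    "sales",
    "orders",
    ["order_id", "amount"],
)

# With credentials configured, the payload could be submitted via boto3:
#   boto3.client("lakeformation").grant_permissions(**request)
```

Once a grant like this exists in Lake Formation, a Spark job on an FGAC-enabled EMR cluster running as that principal would be expected to see only the granted columns.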