All news with #dataproc tag
Mon, October 20, 2025
Dataproc 2.3 on Google Compute Engine: Lightweight Security
🔐 Dataproc 2.3 on Google Compute Engine provides a streamlined image that includes only the essential core components for Spark and Hadoop, reducing the attack surface and simplifying compliance. The image is FedRAMP High compliant and combines automated CVE remediation with manual engineering intervention for complex fixes. Optional tools like Flink, Hudi, Ranger, and Zeppelin are available on demand during cluster creation, or can be pre-baked into custom images to speed provisioning while preserving the security benefits of the lightweight base.
Fri, October 3, 2025
Dataproc ML library: Connect Spark to Gemini and Vertex
🔗 Google has released an open-source Python library, Dataproc ML, to streamline running ML and generative-AI inference from Apache Spark on Dataproc. The library uses a SparkML-style builder pattern so users can configure a model handler (for example, GenAiModelHandler) and call .transform() to apply Gemini or other Vertex AI models directly to DataFrames. It also supports loading PyTorch and TensorFlow model artifacts from GCS for large-scale batch inference and includes performance optimizations such as vectorized data transfer, connection reuse, and automatic retry/backoff.
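The configure-then-transform flow described above can be pictured with a small, library-free sketch. Everything in it is hypothetical: `SketchModelHandler`, its method names, and the `make_flaky_model` helper are stand-ins invented for illustration and are not the Dataproc ML API, and rows are plain dictionaries rather than Spark DataFrames. It also shows retry with exponential backoff, one of the optimizations the summary mentions.

```python
import time
from typing import Callable, Dict, List

Row = Dict[str, str]


class SketchModelHandler:
    """Hypothetical builder-style handler; NOT the real Dataproc ML API."""

    def __init__(self) -> None:
        self._model: Callable[[str], str] = lambda text: text
        self._input_col = "input"
        self._output_col = "output"
        self._max_retries = 3
        self._base_delay = 0.0

    # Each builder method stores one setting and returns self so calls chain.
    def model(self, fn: Callable[[str], str]) -> "SketchModelHandler":
        self._model = fn
        return self

    def input_col(self, name: str) -> "SketchModelHandler":
        self._input_col = name
        return self

    def output_col(self, name: str) -> "SketchModelHandler":
        self._output_col = name
        return self

    def retries(self, max_retries: int, base_delay: float) -> "SketchModelHandler":
        self._max_retries = max_retries
        self._base_delay = base_delay
        return self

    def _call_with_backoff(self, text: str) -> str:
        # Retry transient failures, doubling the wait after each attempt.
        for attempt in range(self._max_retries + 1):
            try:
                return self._model(text)
            except ConnectionError:
                if attempt == self._max_retries:
                    raise
                time.sleep(self._base_delay * (2 ** attempt))
        raise RuntimeError("unreachable")

    def transform(self, rows: List[Row]) -> List[Row]:
        # Return new rows with the model output added under the output column.
        return [
            {**row, self._output_col: self._call_with_backoff(row[self._input_col])}
            for row in rows
        ]


def make_flaky_model(failures: int) -> Callable[[str], str]:
    """Hypothetical model that fails `failures` times, then upper-cases text."""
    state = {"remaining": failures}

    def fake_model(text: str) -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("transient failure")
        return text.upper()

    return fake_model


handler = (
    SketchModelHandler()
    .model(make_flaky_model(2))
    .input_col("prompt")
    .output_col("answer")
    .retries(3, 0.0)  # zero delay keeps the sketch fast
)
result = handler.transform([{"prompt": "hello"}, {"prompt": "spark"}])
```

In this sketch the first two model calls fail and are retried, after which every row receives an `answer` value alongside its original `prompt`.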
Tue, September 9, 2025
Dataproc Multi-Tenant Clusters for Notebook Workloads
🚀 Google Cloud announced Dataproc multi-tenant clusters, which let many data scientists share a single cluster for interactive notebook workloads while preserving per-user authorization. The feature maps individual Google identities to service accounts, externalizes the mappings to a YAML file, and supports updating them on running clusters. Jupyter kernels launch via the Jupyter Kernel Gateway across worker nodes, with optional Vertex AI Workbench integration and the BigQuery JupyterLab Extension. Administrators retain IAM-based least-privilege control, and cluster hardening isolates credentials and OS users.
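The identity-to-service-account mapping can be pictured as a lookup that fails closed for unmapped users. The sketch below rests on assumptions: the mapping contents, the example addresses, and the `resolve_service_account` and `update_mapping` helpers are hypothetical, and a Python dictionary stands in for the YAML file, whose actual schema is not given here.

```python
from typing import Dict

# Hypothetical mapping of user identities to service accounts (illustrative values).
USER_TO_SERVICE_ACCOUNT: Dict[str, str] = {
    "alice@example.com": "sa-analytics@example-project.iam.gserviceaccount.com",
    "bob@example.com": "sa-research@example-project.iam.gserviceaccount.com",
}


def resolve_service_account(user_email: str, mapping: Dict[str, str]) -> str:
    """Return the service account for a user, denying anyone not in the mapping."""
    key = user_email.strip().lower()
    try:
        return mapping[key]
    except KeyError:
        raise PermissionError(f"no service account mapped for {key}") from None


def update_mapping(
    mapping: Dict[str, str], user_email: str, service_account: str
) -> Dict[str, str]:
    """Return a new mapping with one entry added or replaced.

    Loosely mimics changing the mapping while the cluster keeps running;
    the original dictionary is left untouched.
    """
    updated = dict(mapping)
    updated[user_email.strip().lower()] = service_account
    return updated
```

Denying unmapped users by default keeps the sketch in line with the least-privilege intent described above: a user gains access only through an explicit mapping entry.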
Fri, September 5, 2025
Gemini Cloud Assist for Dataproc: Troubleshoot Apache Spark
🛠️ Gemini Cloud Assist Investigations is now in public preview to help troubleshoot Dataproc and Serverless for Apache Spark workloads by automatically analyzing driver and executor logs, Spark UI metrics, configurations, and cross-product telemetry. Accessible from the Google Cloud console and via API, it produces prioritized summaries and clear remediation steps. The tool is tailored to data engineers, data scientists, SREs, and managers to reduce investigation time and accelerate fixes.