Platform Operations: Administer Databricks workspaces, including user provisioning, cluster governance, workspace configuration, job orchestration, and usage policy enforcement.
AI/ML Enablement: Support MLflow, MLOps pipelines, and Mosaic AI for experimentation, deployment, and observability.
Resilience & Availability: Design and implement disaster recovery and high availability strategies, including multi-region backups and failover planning.
Infrastructure Automation: Automate provisioning and lifecycle management using Terraform and Python.
Access & Security Governance: Manage access control via Unity Catalog, SCIM-based identity management, and workspace isolation while maintaining audit readiness.
Performance & Cost Optimization: Monitor platform usage, enforce cluster policies, and optimize job and resource performance.
Standardization: Establish reusable patterns, runbooks, cluster templates, and ML lifecycle standards.
User Support & Enablement: Provide onboarding and operational support to data engineering, data science, and analytics teams.
Feature Rollouts: Lead adoption of new features (e.g., Mosaic AI, Unity Catalog, Delta Live Tables) with documentation and change control.
Training & Evangelism: Deliver training and promote responsible platform usage with a focus on automation and reliability.
Required Qualifications
10+ years in cloud infrastructure, platform operations, or data platform administration roles.
Proven experience managing Databricks or similar cloud data platforms at scale.
Experience administering or architecting data lake/lakehouse environments.
Strong cross-functional communication and collaboration skills.
Demonstrated focus on platform stability, automation, and enablement of data/ML workflows.