Description

Key Responsibilities

  • Platform Operations: Administer Databricks workspaces, including user provisioning, cluster governance, workspace configuration, job orchestration, and usage policy enforcement.
  • AI/ML Enablement: Support MLflow, MLOps pipelines, and Mosaic AI for experimentation, deployment, and observability.
  • Resilience & Availability: Design and implement disaster recovery and high availability strategies, including multi-region backups and failover planning.
  • Infrastructure Automation: Automate provisioning and lifecycle management using Terraform and Python.
  • Access & Security Governance: Manage access control via Unity Catalog, SCIM-based identity management, and workspace isolation with audit readiness.
  • Performance & Cost Optimization: Monitor platform usage, enforce cluster policies, and optimize job and resource performance.
  • Standardization: Establish reusable patterns, runbooks, cluster templates, and ML lifecycle standards.
  • User Support & Enablement: Provide onboarding and operational support to data engineering, data science, and analytics teams.
  • Feature Rollouts: Lead adoption of new features (e.g., Mosaic AI, Unity Catalog, Delta Live Tables) with documentation and change control.
  • Training & Evangelism: Deliver training and promote responsible platform usage with a focus on automation and reliability.

Required Qualifications

  • 10+ years in cloud infrastructure, platform operations, or data platform administration roles.
  • Proven experience managing Databricks or similar cloud data platforms at scale.
  • Experience administering or architecting data lake/lakehouse environments.
  • Strong cross-functional communication and collaboration skills.
  • Demonstrated focus on platform stability, automation, and enablement of data/ML workflows.

Technical Expertise

  • Databricks Platform: Unity Catalog, MLflow, Mosaic AI, Delta Live Tables, workspace management.
  • Disaster Recovery: Multi-region backups, failover testing, high availability architecture.
  • Infrastructure as Code: Terraform and Python for provisioning and lifecycle automation.
  • Security & Compliance: Role-based access, SCIM provisioning, audit logging, governance enforcement.
  • Cost & Performance: Cluster policy tuning, tagging, monitoring, and analytics.
  • CI/CD: Automated deployment pipelines using GitHub Actions or similar tools.
  • Cloud Integration: AWS services (S3, IAM, networking) and Databricks integrations.
  • Observability: Monitoring, alerting, logging, and platform metrics dashboards.

Education

Any Graduate