Job Description:
Responsibilities:
- Design, implement, and maintain CI/CD pipelines for machine learning applications using AWS CodePipeline, CodeCommit, and CodeBuild.
- Automate the deployment of ML models into production using Amazon SageMaker, Databricks, and MLflow for model versioning, tracking, and lifecycle management.
- Develop, test, and deploy AWS Lambda functions for triggering model workflows, automating pre/post-processing, and integrating with other AWS services.
- Maintain and monitor Databricks Model Serving endpoints, ensuring scalable, low-latency inference.
- Use Airflow (MWAA) or Databricks Workflows to orchestrate complex, multi-stage ML pipelines, including data ingestion, model training, evaluation, and deployment.
- Collaborate with Data Scientists and ML Engineers to productionize models and convert notebooks into reproducible and version-controlled ML pipelines.
- Integrate and automate model monitoring (drift detection, performance logging) and alerting mechanisms using tools like CloudWatch, Prometheus, or Datadog.
- Manage infrastructure as code (IaC) with CloudFormation or Terraform to optimize compute workloads and enable reproducible, secure deployments across environments.
- Ensure secure and compliant deployment pipelines using IAM roles, VPC isolation, and secrets management via AWS Secrets Manager or SSM Parameter Store.
- Champion DevOps best practices across the ML lifecycle, including canary deployments, rollback strategies, and audit logging for model changes.
Minimum Requirements:
- Hands-on experience in MLOps, deploying ML applications to production at scale.
- Proficient in AWS services: SageMaker, Lambda, CodePipeline, CodeCommit, ECR, ECS/Fargate, and CloudWatch.
- Strong experience with Databricks workflows and Databricks Model Serving, including MLflow for model tracking, packaging, and deployment.
- Proficient in Python and shell scripting with the ability to containerize applications using Docker.
- Deep understanding of CI/CD principles for ML, including testing ML pipelines, data validation, and model quality gates.
- Hands-on experience orchestrating ML workflows using Airflow (open-source or MWAA) or Databricks Workflows.
- Familiarity with model monitoring and logging stacks (e.g., Prometheus, ELK, Datadog, or OpenTelemetry).
- Experience deploying models as REST endpoints, batch jobs, and asynchronous workflows.
- Version control expertise with Git/GitHub and experience in automated deployment reviews and rollback strategies.
Nice to Have:
- Experience with a feature store (e.g., Amazon SageMaker Feature Store or Feast).
- Familiarity with Kubeflow, SageMaker Pipelines, or Vertex AI (for multi-cloud environments).
- Exposure to large language models (LLMs), vector databases, or retrieval-augmented generation (RAG) pipelines.
- Knowledge of Terraform or AWS CDK for infrastructure automation.
- Experience with A/B testing or shadow deployments for ML models.