Job Description:
Responsibilities:
- Design, implement, and maintain CI/CD pipelines for machine learning applications using AWS CodePipeline, CodeCommit, and CodeBuild.
- Automate the deployment of ML models into production using Amazon SageMaker, Databricks, and MLflow for model versioning, tracking, and lifecycle management.
- Develop, test, and deploy AWS Lambda functions for triggering model workflows, automating pre/post-processing, and integrating with other AWS services.
- Maintain and monitor Databricks Model Serving endpoints, ensuring scalable, low-latency inference.
- Use Airflow (MWAA) or Databricks Workflows to orchestrate complex, multi-stage ML pipelines, including data ingestion, model training, evaluation, and deployment.
- Collaborate with Data Scientists and ML Engineers to productionize models and convert notebooks into reproducible and version-controlled ML pipelines.
- Integrate and automate model monitoring (drift detection, performance logging) and alerting mechanisms using tools like CloudWatch, Prometheus, or Datadog.
- Manage infrastructure as code (IaC) with CloudFormation or Terraform to optimize compute workloads and enable reproducible, secure deployments across environments.
- Ensure secure and compliant deployment pipelines using IAM roles, VPC isolation, and secrets management via AWS Secrets Manager or SSM Parameter Store.
- Champion DevOps best practices across the ML lifecycle, including canary deployments, rollback strategies, and audit logging for model changes.
Minimum Requirements:
- Hands-on experience in MLOps, deploying ML applications to production at scale.
- Proficient in AWS services: SageMaker, Lambda, CodePipeline, CodeCommit, ECR, ECS/Fargate, and CloudWatch.
- Strong experience with Databricks workflows and Databricks Model Serving, including MLflow for model tracking, packaging, and deployment.
- Proficient in Python and shell scripting with the ability to containerize applications using Docker.
- Deep understanding of CI/CD principles for ML, including testing ML pipelines, data validation, and model quality gates.
- Hands-on experience orchestrating ML workflows using Airflow (open-source or MWAA) or Databricks Workflows.
- Familiarity with model monitoring and logging stacks (e.g., Prometheus, ELK, Datadog, or OpenTelemetry).
- Experience deploying models as REST endpoints, batch jobs, and asynchronous workflows.
- Version control expertise with Git/GitHub and experience in automated deployment reviews and rollback strategies.
Nice to Have:
- Experience with a feature store (e.g., Amazon SageMaker Feature Store or Feast).
- Familiarity with Kubeflow, SageMaker Pipelines, or Vertex AI (for multi-cloud environments).
- Exposure to large language models (LLMs), vector databases, or retrieval-augmented generation (RAG) pipelines.
- Knowledge of Terraform or AWS CDK for infrastructure automation.
- Experience with A/B testing or shadow deployments for ML models.