Description

Reliability and Availability:

  • Ensure high availability and reliability of production systems.
  • Implement and maintain robust monitoring and alerting systems.
  • Participate in on-call rotations to respond to incidents and outages.
  • Conduct post-incident reviews and implement preventative measures.

Automation and Infrastructure as Code (IaC):

  • Automate infrastructure provisioning, configuration, and deployment using IaC tools (e.g., Terraform, Ansible).
  • Develop and maintain CI/CD pipelines to streamline software releases.
  • Optimize and automate data pipelines and workflows.

Apache Druid Management:

  • Manage and optimize Apache Druid clusters for high performance and scalability.
  • Troubleshoot Druid performance issues and implement solutions.
  • Design and implement Druid data ingestion and query optimization strategies.

Apache Airflow Orchestration:

  • Design, develop, and maintain Airflow DAGs for data orchestration and workflow automation.
  • Monitor Airflow performance and troubleshoot issues.
  • Optimize Airflow workflows for efficiency and reliability.

Monitoring and Logging:

  • Implement and maintain comprehensive monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack).
  • Analyze metrics and logs to identify performance bottlenecks and potential issues.
  • Create and maintain dashboards for visualizing system health and performance.

Collaboration and Communication:

  • Collaborate with development, data, and operations teams to keep systems running smoothly.
  • Communicate effectively with stakeholders regarding system status and incidents.
  • Document processes and procedures.

Education

Any Graduate