Site Reliability Engineer

Tap Growth AI

Tap Growth AI is an AI-powered platform that helps recruiters find and hire the right talent faster.

About the Company

A leading technology organization focused on delivering highly reliable and scalable digital solutions seeks to expand its engineering team. With a commitment to innovation and operational excellence, the company ensures high-performance systems and optimal experiences for users across critical platforms. The position is based in Scottsdale, United States, and is in-office, offering direct collaboration and hands-on engagement with cutting-edge infrastructure.

About the Role

The Site Reliability Engineer (SRE) will ensure the availability, scalability, and efficiency of mission-critical systems. This role bridges development and operations, focusing on automation, monitoring, incident response, and infrastructure optimization to maintain high system reliability and performance.

Responsibilities

  • Design, implement, and maintain monitoring, alerting, and observability solutions.
  • Automate infrastructure provisioning and deployment pipelines to streamline operations.
  • Troubleshoot and resolve complex production incidents efficiently.
  • Analyze system performance, conduct capacity planning, and identify optimizations.
  • Apply security best practices and maintain disaster recovery procedures.
  • Collaborate with development teams to design and improve system architecture.

Required Skills

  • 5+ years of experience in Site Reliability Engineering, DevOps, or systems engineering.
  • Strong proficiency in cloud platforms such as AWS, GCP, or Azure.
  • Expertise in scripting languages including Python, Bash, or Go.
  • Hands-on experience with containerization and orchestration tools (Docker, Kubernetes, etc.).
  • Familiarity with monitoring and observability tools such as Prometheus, Grafana, or the ELK stack.
  • Solid understanding of CI/CD pipelines and infrastructure-as-code practices.

Preferred Qualifications

  • Advanced troubleshooting and analytical problem-solving skills.
  • Experience with high-availability systems and large-scale production environments.
  • Knowledge of security practices and disaster recovery strategies.
  • Familiarity with performance tuning and capacity planning in cloud-based environments.
  • Experience collaborating with cross-functional teams in dynamic and fast-paced settings.

Copyright © 2025 SRE-Jobs.com. All Rights Reserved.