Tap Growth AI is an AI-powered platform that helps recruiters find and hire the right talent faster.
About the Company
A leading technology organization focused on delivering highly reliable and scalable digital solutions is expanding its engineering team. With a commitment to innovation and operational excellence, the company maintains high-performance systems and a dependable experience for users across critical platforms. The position is based in-office in Scottsdale, United States, with in-person collaboration and hands-on work with cutting-edge infrastructure.
About the Role
The Site Reliability Engineer (SRE) will ensure the availability, scalability, and efficiency of mission-critical systems. This role bridges development and operations, focusing on automation, monitoring, incident response, and infrastructure optimization to maintain high system reliability and performance.
Responsibilities
- Design, implement, and maintain monitoring, alerting, and observability solutions.
- Automate infrastructure provisioning and deployment pipelines to streamline operations.
- Troubleshoot and resolve complex production incidents efficiently.
- Analyze system performance, conduct capacity planning, and identify optimizations.
- Apply security best practices and maintain disaster recovery procedures.
- Collaborate with development teams to design and improve system architecture.
Required Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, or systems engineering.
- Strong proficiency in cloud platforms such as AWS, GCP, or Azure.
- Expertise in at least one scripting or programming language such as Python, Bash, or Go.
- Hands-on experience with containerization and orchestration tools (Docker, Kubernetes, etc.).
- Familiarity with monitoring and observability tools such as Prometheus, Grafana, or the ELK stack.
- Solid understanding of CI/CD pipelines and infrastructure-as-code practices.
Preferred Qualifications
- Advanced troubleshooting and analytical problem-solving skills.
- Experience with high-availability systems and large-scale production environments.
- Knowledge of security practices and disaster recovery strategies.
- Familiarity with performance tuning and capacity planning in cloud-based environments.
- Experience collaborating with cross-functional teams in dynamic and fast-paced settings.