
Title: Site Reliability Engineer II
Location: Alpharetta, GA (3 days a week onsite)
Duration: 6 months
Job Description:
We are seeking a skilled Site Reliability Engineer to join our team and help build, maintain, and scale our cloud-native infrastructure. You will work closely with development and operations teams to ensure our systems are reliable, scalable, and efficient. The ideal candidate is passionate about automation, observability, and infrastructure-as-code, and thrives in a collaborative, fast-paced environment.
Key Responsibilities
* Design, implement, and manage cloud infrastructure on Azure using Terraform and Terragrunt.
* Maintain and optimize Kubernetes clusters on Azure Kubernetes Service (AKS).
* Build and manage CI/CD pipelines using GitHub Actions/Workflows and ArgoCD for GitOps deployments.
* Enhance system reliability by implementing monitoring, alerting, and observability solutions with Grafana.
* Automate operational tasks to reduce toil and improve team efficiency.
* Participate in on-call rotations, incident response, and post-mortem analysis.
* Collaborate with development teams to improve application performance, scalability, and resilience.
* Implement and advocate for SRE best practices, including SLIs, SLOs, and error budgets.
* Continuously improve system performance, cost efficiency, and security.
Required Skills & Qualifications
* 3+ years of experience in an SRE, DevOps, or cloud infrastructure role.
* Strong experience with Azure cloud services and infrastructure.
* Hands-on experience with Java, and with Terraform and Terragrunt for infrastructure-as-code.
* Proficiency with Kubernetes (preferably AKS) and container orchestration.
* Experience with CI/CD tools, especially GitHub Workflows/Actions and ArgoCD.
* Solid understanding of observability tools like Grafana (experience with Prometheus, Loki, and Tempo is a plus).
Education Requirements
* Bachelor’s degree required (Master’s preferred).