
Overview
As a Site Reliability Engineer (SRE) Level II, you will play a key role in maintaining the availability, scalability, and performance of critical infrastructure and services. You will build and automate solutions that enhance system reliability and support continuous delivery. In this role, you will manage complex operational tasks and incidents, mentor junior SREs, and collaborate with development teams to ensure systems are designed for reliability from the ground up.
Responsibilities
- Respond to complex incidents and ensure service uptime.
- Lead troubleshooting for high‑impact production issues, performing root‑cause analysis and implementing preventive measures.
- Participate in on‑call rotations, acting as an escalation point for Level I SREs during major incidents.
- Build and maintain automation scripts and infrastructure using Terraform, Ansible, or CloudFormation.
- Implement automation to eliminate manual tasks and improve system reliability, scalability, and performance.
- Analyze system performance and recommend optimizations for scalability and reliability.
- Support capacity planning by monitoring metrics, traffic patterns, and usage trends to predict future resource needs.
- Collaborate with software engineering teams to influence the design of new services, ensuring they are scalable, reliable, and resilient.
- Contribute to architectural decisions, aligning with best practices in fault tolerance, redundancy, and recovery.
- Build and maintain robust monitoring, alerting, and observability solutions; optimize existing tools and build dashboards for better visibility.
- Ensure systems and infrastructure are secure and compliant; assist with vulnerability management, patching, and security implementation.
- Lead continuous improvement of operational processes, tools, and workflows.
- Implement and enforce best practices in deployment, monitoring, and incident management to reduce downtime.
Qualifications
- Minimum 5 years of experience in site reliability engineering, DevOps, systems administration, or related roles.
- Strong experience with Linux/Unix administration and proficiency in scripting languages such as Python, Bash, or Go.
- Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (EC2, S3, Lambda, Kubernetes, etc.).
- Experience with containerization and orchestration technologies like Docker and Kubernetes.
- Proficiency with monitoring and observability tools such as Dynatrace, Prometheus, Grafana, Datadog, or ELK Stack.
- Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs.
- Experience with CI/CD tools (Jenkins, GitLab CI, CircleCI) and infrastructure automation (Terraform, Ansible, Puppet).
- Familiarity with distributed systems and microservices architecture.
- Excellent problem‑solving and troubleshooting skills, especially in diagnosing production issues in high‑scale environments.
Preferred
- Background in MLOps, data engineering, and/or cloud‑native AI deployment.
- Strong communication and documentation abilities.
- Knowledge of security best practices for AI and cloud infrastructure.
- Contributions to open‑source AI/SRE projects or relevant technical communities.
- Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance.
Exempt Status
Yes (not eligible for overtime pay)
Workplace Type
Office. Certain positions outside our branch network may be eligible for a flexible work arrangement, combining in‑office and work‑from‑home. Remote roles may also have the opportunity to come together in our offices for moments that matter. Specific work arrangements will be provided by the hiring team.
Huntington is an Equal Opportunity Employer.
Tobacco‑Free Hiring Practice: Visit Huntington’s Career Web Site for more details.