Senior Site Reliability Engineer

Cordial

Cordial automates billions of emails, SMS, and mobile app messages using all of your data.

About the Company

Cordial is a leading software company specializing in data-driven, personalized communication solutions. With clients such as PacSun, Revolve, Abercrombie & Fitch, and Forbes, Cordial helps brands enhance customer relationships and drive revenue growth through improved messaging. Founded on principles of transparency, collaboration, and trust, Cordial fosters a culture of growth, continuous improvement, and innovation. Join a passionate team committed to shaping the future of digital communication.

About the Role

Cordial is seeking a skilled Site Reliability Engineer (SRE) to enhance the stability, performance, and scalability of the Cordial platform. This is an exciting opportunity to work with cutting-edge technologies like AWS, Kubernetes, Consul, and Vault in a collaborative, agile environment. The role is ideal for an individual with strong experience in cloud infrastructure and an eagerness to help monitor and optimize critical systems while ensuring a seamless experience for end-users.

Responsibilities

  • Administer, monitor, and troubleshoot cloud-based application and network components across web, application, server, storage, and security technologies.
  • Design, deploy, and monitor Kubernetes clusters, Helm charts, and service mesh configurations.
  • Collaborate with Product and DevOps teams to troubleshoot production data corruption or performance issues.
  • Provide production support, participate in on-call rotations, and assist in troubleshooting complex system issues.
  • Contribute to platform infrastructure design and implementation.
  • Develop monitoring and alerting solutions for system performance and stability.
  • Assist with the creation and monitoring of Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
  • Implement best practices in security and performance across all production systems.

Required Skills

  • 5+ years of experience in UNIX/Linux Systems and Network Administration (DNS, IPsec, VPN, Load Balancing).
  • Expertise in AWS (EC2, EKS) and in operating Kubernetes clusters.
  • Hands-on experience with Helm charts and service meshes (AWS App Mesh, Istio, Linkerd).
  • Experience with monitoring, logging, and alerting tools like Prometheus, Grafana, and ELK.
  • Proficiency in infrastructure as code (IaC) tools like Terraform and CloudFormation.
  • Strong knowledge of networking fundamentals and cloud security best practices.
  • Solid understanding of observability principles and distributed tracing tools.
  • Previous experience in a Site Reliability Engineering or DevOps role.
  • Familiarity with CI/CD tools like Jenkins, GitLab CI, or ArgoCD.

Preferred Qualifications

  • Development experience in PHP.
  • Experience with Docker, containers, and Kubernetes.
  • Knowledge of HashiCorp products like Consul and Vault.
  • Strong problem-solving skills with a systematic approach to debugging.
  • Fluency in English (both verbal and written).

Copyright © 2025 SRE-Jobs.com. All Rights Reserved.