Cordial automates billions of emails, SMS, and mobile app messages using all of your data.
About the Company
Cordial is a leading software company specializing in data-driven, personalized communication solutions. With clients such as PacSun, Revolve, Abercrombie & Fitch, and Forbes, Cordial helps brands enhance customer relationships and drive revenue growth through improved messaging. Founded on principles of transparency, collaboration, and trust, Cordial fosters a culture of growth, continuous improvement, and innovation. Join a passionate team committed to shaping the future of digital communication.
About the Role
Cordial is seeking a skilled Site Reliability Engineer (SRE) to enhance the stability, performance, and scalability of the Cordial platform. This is an opportunity to work with technologies such as AWS, Kubernetes, Consul, and Vault in a collaborative, agile environment. The role is ideal for someone with strong cloud infrastructure experience who is eager to monitor and optimize critical systems while ensuring a seamless experience for end users.
Responsibilities
- Administer, monitor, and troubleshoot cloud-based application and network components across the web, application, server, storage, and security layers.
- Design, deploy, and monitor Kubernetes clusters, Helm charts, and service mesh configurations.
- Collaborate with Product and DevOps teams to troubleshoot production data corruption or performance issues.
- Provide production support, participate in on-call rotations, and assist in troubleshooting complex system issues.
- Contribute to platform infrastructure design and implementation.
- Develop monitoring and alerting solutions for system performance and stability.
- Assist with the creation and monitoring of Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Implement best practices in security and performance across all production systems.
Required Skills
- 5+ years of experience in UNIX/Linux systems and network administration (DNS, IPsec, VPN, load balancing).
- Expertise in AWS (EC2, EKS), including running Kubernetes clusters on EKS.
- Hands-on experience with Helm charts and service meshes (AWS App Mesh, Istio, Linkerd).
- Experience with monitoring, logging, and alerting tools like Prometheus, Grafana, and ELK.
- Proficiency in infrastructure as code (IaC) tools like Terraform or CloudFormation.
- Strong knowledge of networking fundamentals and cloud security best practices.
- Solid understanding of observability principles and distributed tracing tools.
- Previous experience in a Site Reliability Engineering or DevOps role.
- Familiarity with CI/CD tools like Jenkins, GitLab CI, or ArgoCD.
Preferred Qualifications
- Development experience in PHP.
- Experience with Docker, containers, and Kubernetes.
- Knowledge of HashiCorp products like Consul and Vault.
- Strong problem-solving skills with a systematic approach to debugging.
- Fluency in English (both verbal and written).