Full Time
USA (Remote)
Posted 4 days ago

Cordial

Cordial automates billions of emails, SMS, and mobile app messages using all of your data.


About the Company

Cordial is a leading software company specializing in data-driven, personalized communication solutions. With clients such as PacSun, Revolve, Abercrombie & Fitch, and Forbes, Cordial helps brands enhance customer relationships and drive revenue growth through improved messaging. Founded on principles of transparency, collaboration, and trust, Cordial fosters a culture of growth, continuous improvement, and innovation. Join a passionate team committed to shaping the future of digital communication.

About the Role

Cordial is seeking a skilled Site Reliability Engineer (SRE) to enhance the stability, performance, and scalability of the Cordial platform. This is an exciting opportunity to work with cutting-edge technologies like AWS, Kubernetes, Consul, and Vault in a collaborative, agile environment. The role is ideal for an individual with strong experience in cloud infrastructure and an eagerness to help monitor and optimize critical systems while ensuring a seamless experience for end-users.

Responsibilities

Administer, monitor, and troubleshoot cloud-based application and network components spanning web, application, server, storage, and security technologies.
Design, deploy, and monitor Kubernetes clusters, Helm charts, and service mesh configurations.
Collaborate with Product and DevOps teams to troubleshoot production data corruption or performance issues.
Provide production support, participate in on-call rotations, and assist in troubleshooting complex system issues.
Contribute to platform infrastructure design and implementation.
Develop monitoring and alerting solutions for system performance and stability.
Assist with the creation and monitoring of Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Implement best practices in security and performance across all production systems.

Required Skills

5+ years of experience in UNIX/Linux Systems and Network Administration (DNS, IPsec, VPN, Load Balancing).
Expertise in AWS (EC2, EKS) and Kubernetes clusters.
Hands-on experience with Helm charts and service meshes (AWS App Mesh, Istio, Linkerd).
Experience with monitoring, logging, and alerting tools like Prometheus, Grafana, and ELK.
Proficiency in infrastructure as code (IaC) tools such as Terraform or CloudFormation.
Strong knowledge of networking fundamentals and cloud security best practices.
Solid understanding of observability principles and distributed tracing tools.
Previous experience in a Site Reliability Engineering or DevOps role.
Familiarity with CI/CD tools like Jenkins, GitLab CI, or ArgoCD.

Preferred Qualifications

Development experience in PHP.
Experience with Docker, containers, and Kubernetes.
Knowledge of HashiCorp products like Consul and Vault.
Strong problem-solving skills with a systematic approach to debugging.
Fluency in English (both verbal and written).

See the official Cordial website for the full vacancy description and requirements.