Site Reliability Engineer

LogRocket

Stop guessing about your digital experience - Session Replay | Product Analytics | Error Tracking

About the Company

LogRocket is a leading provider of AI-powered solutions for product managers and developers, improving the web app user experience. Founded in 2016, the company solves the challenge of understanding user interactions by offering detailed pixel-perfect replays of user sessions, with insights into logs, errors, and network activity. With a clientele that includes top-tier organizations such as ClassPass, Capital One, and Cisco, LogRocket is on a mission to enhance software experiences across the globe. Backed by investors like Matrix Partners and Battery Ventures, the company has raised $55M in funding and is expanding rapidly.

About the Role

The Site Reliability Engineer (SRE) at LogRocket will play a crucial role in enhancing the stability, performance, and security of the platform’s infrastructure. This position offers the opportunity to improve operational systems, reduce noise in pager alerts, and ensure efficient, secure, and scalable operations. The role is designed for individuals passionate about cloud technologies, container orchestration, and operational security.

Responsibilities

  • Improve the quality of pager alerts while minimizing noise and false positives.
  • Monitor the impact of engineering initiatives across the organization, focusing on stability, cost, and performance.
  • Keep infrastructure up to date by applying security patches and adopting new features.
  • Enhance operational security while ensuring engineering teams maintain autonomy.
  • Participate in improving cloud infrastructure and optimizing performance, scalability, and availability.
  • Design solutions for high-availability systems, ensuring minimal downtime during updates and maintenance.
  • Automate cloud infrastructure scaling and improve cost efficiency during traffic spikes.
  • Collaborate with the product and engineering teams to resolve system performance issues.
  • Support incident management and provide proactive solutions to prevent service disruptions.
  • Help build tools for better onboarding and deployment processes for on-premises customers.

Required Skills

  • 5+ years of experience as a Site Reliability Engineer or in a related role.
  • Experience with cloud technologies, including common providers and related tools.
  • Expertise in modern container orchestration (Kubernetes on GKE preferred).
  • Strong knowledge of cloud systems performance, architecture, and cost optimization.
  • Deep understanding of incident response, security best practices, and risk mitigation.
  • Proficiency in scripting and automation tools for streamlining DevOps processes.
  • Ability to read and understand product code; coding experience is a plus.
  • Strong collaboration skills and ability to work cross-functionally with engineering teams.
  • Experience in handling applications and databases with demanding scalability or availability requirements.

Preferred Qualifications

  • Familiarity with Nginx load balancers, latency management, and database scaling.
  • Experience with tools like Prometheus and Grafana for system monitoring.
  • Hands-on experience with database performance optimization.
  • Knowledge of automation frameworks like Terraform, Chef, or Ansible.
  • Prior experience in improving performance and reliability in a SaaS environment.
  • Familiarity with cloud security certifications and industry standards (e.g., SOC 2, PCI).

Benefits & Perks

  • Comprehensive health, dental, and vision coverage.
  • Open vacation policy with flexible time off.
  • 3 months of fully paid parental leave.
  • 401(k) and commuter benefits.
  • Generous stock options.
  • Regular team outings, activities, and employee gifts.
  • Catered lunches for in-office staff and a fully stocked kitchen.
  • Flexible working hours and location.
