The AI & HPC Cloud powered by AMD Instinct™ Series GPUs. 🌊
About the Company
TensorWave is at the forefront of AI computing, delivering a versatile cloud platform for the next generation of AI workloads. The company focuses on building a robust foundation for AI innovation, enabling cutting-edge developments in artificial intelligence.
About the Role
The Senior Site Reliability Engineer (SRE) will design, build, and maintain highly scalable, secure, and resilient infrastructure. This role blends systems programming and DevOps, providing opportunities to work on low-level infrastructure, automation, and cloud orchestration while supporting high-performance AI workloads. Ideal candidates thrive in environments that demand both coding expertise and infrastructure management skills.
Responsibilities
- Design, deploy, and maintain infrastructure systems on Linux and NixOS.
- Provision and manage infrastructure with Terraform to support automation and scalability.
- Architect and operate Kubernetes clusters with a focus on security, performance, and automation.
- Develop internal tooling and high-performance utilities in Go, Rust, C, Zig, or JavaScript.
- Build and maintain CI/CD pipelines to support code and infrastructure deployments.
- Monitor system performance, troubleshoot issues, and enhance platform reliability using observability tools.
- Collaborate with engineering teams to support development workflows and deployment strategies.
Required Skills
- 5+ years in DevOps, Site Reliability, or Infrastructure Engineering roles.
- Advanced experience with Linux systems and configuration management, preferably NixOS.
- Hands-on experience with Terraform, Kubernetes, and containerized environments.
- Proficiency in one or more of the following programming languages: Go, Rust, C, Zig, or JavaScript.
- Strong knowledge of systems programming, OS internals, and performance optimization.
- Familiarity with CI/CD best practices and monitoring/alerting tools.
Preferred Qualifications
- Demonstrated ability to design scalable, secure, and resilient infrastructure for large-scale applications.
- Experience developing automation and internal tools for operational efficiency.
- Strong problem-solving skills and collaborative mindset in cross-functional teams.
- Enthusiasm for tackling complex technical challenges in AI and cloud computing.
Benefits
- Stock options and equity opportunities.
- 100% paid medical, dental, and vision insurance.
- Life insurance, voluntary supplemental insurance, and short-term disability coverage.
- Flexible spending account and 401(k) plan.
- Paid holidays, flexible PTO, and parental leave.
- Mental health support through Spring Health.