
Overview
Title: Lead Site Reliability Engineer (SRE) / Principal Site Reliability Engineer (SRE)
Location: Irving, TX & Charlotte, NC – Hybrid Role
Duration: 18+ months, contract-to-hire, with possibility of extension
We are seeking a Senior Site Reliability Engineer (SRE) with a strong background in software engineering and a passion for solving complex problems at scale. This role blends software engineering with operational expertise to deliver stable, scalable, and resilient services, while reducing toil and shifting operations left.
The team runs support for Shared Services Operations Technology, split among Payment Evaluations, Regulatory Operations, Financial Crimes, and Business and Real Estate Evaluation. It supports systems that perform KYC and AML functions for financial crimes. The team supports about 85 applications, about 75 of which have no SLOs or SLIs, and it would like those defined. The team is also moving into automation with RPA and chatbots. The ideal candidate could work in any one of these domains. Ticket volume in the organization is high, but this person is expected to work more proactively on projects. Today the role may involve “firefighting” about 60% of the time and prevention the other 40%; the goal is to reach 80% prevention.
OCP experience is strongly preferred for cloud work, since OCP is being implemented across the organization.
This position backfills a full-time employee (FTE) role on a trial basis. Some weekends may require system support, and occasional overtime is possible. Weekend work may fall once every month or two on a rotation, depending on whether the person is assigned to that rotation as an SRE.
Key Responsibilities
- Design and implement automated tooling to eliminate manual toil and optimize operations.
- Build and enhance monitoring, alerting and overall observability.
- Champion the SRE practice within COO Technology by modeling best practices, mentoring peers, and collaborating with embedded platform SRE teams.
- Enhance system availability in a multi-cloud environment by evolving resiliency patterns.
- Introduce and scale AIOps, including self-healing and autonomic systems using AI/ML, RPA, and unified communications.
- Automate key SRE metrics and IT service operations processes, including customer impact analysis, availability tracking, SLO/SLI adherence, error budgeting, and incident response.
- Support critical applications and customer journeys, lead Agile-based remediation efforts, conduct blameless postmortems, and drive root cause analysis to eliminate recurring issues.
- Implement Non-Functional Requirements (NFRs) and guide teams through them during modernization and uplift initiatives.
- Help define, govern and enforce Permit to Operate.
Top Skills
- Minimum of 8 years of SRE experience
- Database knowledge
- Observability tools
Nice to Have
- Autosys
- Interest in AI
Infrastructure & Cloud
- Expertise in Linux and container platforms (Kubernetes)
- Experience with cloud platforms: PCF, AWS, GCP, or Azure
CI/CD & Automation
Observability & AIOps
Operations & Data
- Data platforms: Oracle, DB2, SQL, MongoDB, Hadoop, Cloudera, Spark, Teradata
EEO
Mindlance is an Equal Opportunity Employer and does not discriminate in employment on the basis of Minority/Gender/Disability/Religion/LGBTQI/Age/Veterans.