JOB TITLE: Site Reliability Engineer
LOCATION: Pleasanton, CA
DURATION: 4-6 week contract to hire
RATE RANGE: $90+ per hour
POSITION SUMMARY:
The Senior Site Reliability Engineer (SRE) plays a vital role in ensuring the reliability, scalability, and performance of our enterprise software platform. This is a senior-level position that requires deep technical expertise, strong problem-solving skills, and the ability to collaborate effectively in a fast-paced, demanding environment. Our customers, the largest enterprises in the world, expect 24/7 platform availability and top-tier performance.
The ideal candidate has strong expertise in AWS cloud technologies, a deep understanding of serverless architectures (AWS Lambda), and a passion for building resilient systems to enhance the customer experience.
RESPONSIBILITIES:
Platform Reliability:
- Design, implement, and manage highly available and scalable systems to meet customer expectations for 24/7 uptime.
- Monitor, troubleshoot, and resolve platform incidents using tools such as Sentry, New Relic, and custom monitoring frameworks.
- Lead post-incident reviews to identify root causes and ensure preventative measures are in place.
Automation and Optimization:
- Develop and maintain automation for infrastructure management, monitoring, and incident response.
- Optimize platform performance and scalability, proactively identifying and addressing bottlenecks.
- Contribute to the development of CI/CD pipelines to improve deployment reliability and speed.
Collaboration:
- Partner with L2 engineers to resolve complex customer issues, providing guidance and technical expertise as needed.
- Work closely with product engineering to ensure platform improvements align with customer needs.
- Actively contribute to the documentation and sharing of best practices to improve team performance and customer outcomes.
Leadership:
- Mentor junior engineers and provide technical leadership in reliability engineering.
- Drive cross-functional initiatives to improve platform stability and customer satisfaction.
QUALIFICATIONS:
- Bachelor's degree in Computer Science or related discipline.
- 8+ years in a Site Reliability Engineering or DevOps role, with experience supporting enterprise-grade software platforms.
- 3+ years of experience with cloud services, particularly AWS.
- Experience building observability systems on New Relic, CloudWatch, or similar.
- Experience implementing rate-limiting, API gateways, and load balancing for highly available systems.
- Exposure to security best practices and compliance frameworks (e.g., SOC 2, ISO 27001).
- Proficient in infrastructure as code (IaC) using tools such as Terraform or CloudFormation.
- Hands-on experience with scripting and programming languages like Python, Go, or Bash.
- Strong troubleshooting and debugging skills.
- Excellent communication and collaboration skills.
- Experience with incident management and post-mortem practices.
Soft Skills:
- Exceptional problem-solving and critical thinking abilities.
- Strong verbal and written communication skills, with the ability to navigate ambiguity and provide clarity.
- Ability to work collaboratively in cross-functional teams under pressure.