Job Title: Site Reliability Engineer
Duration: Direct hire
Location: Hybrid Role - must be able to commit to 3 days/week in our Bloomington office
What you’ll be doing:
- Collaborate with development and operations teams to design, implement, and maintain observability frameworks that provide deep insights into system performance, particularly for data and ML pipelines.
- Lead the establishment of Service Level Objectives (SLOs) and Service Level Indicators (SLIs), ensuring they align with business goals and drive continuous performance improvements.
- Partner with stakeholders to understand system performance requirements and translate them into actionable performance engineering strategies.
- Proactively identify performance bottlenecks and collaborate with teams to implement solutions that enhance system scalability and reliability.
- Design and execute performance regression test suites, focusing on data-intensive and ML workloads, to ensure continuous performance optimization.
- Own the reliability and performance metrics of our systems, driving a culture of performance excellence and proactive issue resolution.
- Collaborate with subject matter experts to gain a deep understanding of domain-specific performance challenges, particularly in data and ML pipelines.
- Use tools such as Datadog, Jira, and GitHub to monitor system performance, manage projects, and track issues, with a strong emphasis on performance-related metrics.
- Define and monitor success metrics, ensuring our systems consistently meet or exceed performance and reliability targets.
- Actively contribute to the continuous improvement of performance engineering practices across the team, fostering a culture of excellence in observability and system performance.
- Perform other duties as assigned.
What you’ll bring to us:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- Five years of experience in a site-reliability-focused role responsible for establishing reliability standards in a cloud-native environment.
- Strong expertise in establishing SLOs/SLIs and building observability frameworks for complex systems.
- Proficiency with cloud services, particularly AWS, and experience in designing scalable and reliable architectures.
- Hands-on experience with performance monitoring and observability tools like Datadog.
- Proficiency in version control systems like Git/GitHub and infrastructure as code tools like Terraform.