The Aspen Group (TAG) is one of the largest and most trusted retail healthcare business support organizations in the U.S. and has supported over 20,000 healthcare professionals and team members with close to 1,500 health and wellness offices across 48 states in four distinct categories: dental care, urgent care, medical aesthetics, and animal health. Working in partnership with independent practice owners and clinicians, the team is united by a single purpose: to prove that healthcare can be better and smarter for everyone. TAG provides a comprehensive suite of centralized business support services that power the impact of five consumer-facing businesses: Aspen Dental, ClearChoice Dental Implant Centers, WellNow Urgent Care, Chapter Aesthetic Studio, and AZPetVet. Each brand has access to a deep community of experts, tools and resources to grow their practices, and an unwavering commitment to delivering high-quality consumer healthcare experiences at scale.
As a reflection of our current needs and planned growth, we are very pleased to offer a new opportunity to join our dedicated team as a Senior Site Reliability Engineer.
The Senior Site Reliability Engineer (SRE) & Monitoring Specialist will be responsible for ensuring the reliability, performance, and scalability of our systems. This role involves implementing and managing monitoring solutions, responding to incidents, and optimizing system performance to meet business objectives.
Responsibilities:
Site Reliability Engineering:
- Design, build, and maintain scalable and reliable systems to support our applications and services.
- Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure systems meet reliability targets.
- Drive improvements in system reliability, availability, and performance through proactive measures and automation.
Monitoring & Observability:
- Implement and manage comprehensive monitoring and alerting solutions to ensure full visibility into system health and performance.
- Develop and maintain dashboards and reporting tools that provide actionable insights for troubleshooting and performance optimization.
- Evaluate and integrate new monitoring tools and technologies as needed to enhance observability.
Incident Management:
- Lead and participate in incident response efforts, including troubleshooting, root cause analysis, and resolution.
- Develop and maintain incident management processes to improve response times and minimize service disruptions.
- Conduct post-incident reviews to identify areas for improvement and implement preventive measures.
Performance Optimization:
- Analyze performance metrics and logs to identify and address bottlenecks and inefficiencies in the system.
- Collaborate with development teams to optimize code and infrastructure for better performance and reliability.
- Perform capacity planning to ensure systems can handle current and future loads.
Automation & Process Improvement:
- Develop and implement automation solutions to streamline operations and reduce manual intervention.
- Identify and drive process improvements to enhance operational efficiency and effectiveness.
- Maintain documentation related to monitoring, incident management, and SRE best practices.
Collaboration & Communication:
- Work closely with engineering, operations, and product teams to align on reliability and monitoring goals.
- Communicate effectively with stakeholders, providing regular updates on system health, incidents, and performance improvements.
- Foster a culture of collaboration and knowledge sharing within the team and across the organization.
Requirements:
- Bachelor's degree in Computer Science or a related field.
- At least 5 years of experience in Site Reliability Engineering or a similar role.
- Strong proficiency in at least one programming language such as Python, Java, or Go.
- Experience with containerization technologies such as Docker and Kubernetes.
- Strong understanding of networking, distributed systems, and cloud infrastructure.
- Familiarity with monitoring and logging tools such as Prometheus, Grafana, ELK Stack, and Splunk.
- Excellent problem-solving skills and the ability to work both independently and in a team environment.
- Experience with incident management and root cause analysis.
If you are a Senior Site Reliability Engineer with a passion for ensuring the reliability and performance of production systems, we encourage you to apply for this exciting opportunity.