Location: Montreal (Hybrid 3 days)
Duration: 12+ Months
Job Profile
Systems Reliability Engineering (SRE) is a discipline focused on improving system service availability, observability, scalability, performance, and resilience across *** by applying sound software engineering principles and adopting the latest technology and tooling.
You:
- Are interested in distributed systems and working with highly scalable and reliable services.
- Like to work in a fast-moving environment and aren't afraid to change things to make them better.
- Enjoy new technological challenges and solving hard problems.
- Believe a team working well together is smarter than the single smartest person on that team.
- Have grit, drive, and a deep sense of ownership.
Responsibilities:
- Working closely with engineering/development teams to design, build, and maintain systems.
- Troubleshooting issues across the entire technology stack: hardware, software, application, and network.
- Identifying and driving opportunities to improve automation for our platforms; scoping and creating automation for deployment, management, and visibility of our services.
- Proactively identifying and addressing systems reliability risks.
- Working alongside existing global and regional team members on a follow-the-sun basis.
- Representing the RPE organization in design reviews and operational readiness exercises for new and existing services.
Qualifications - Skill Set
- Demonstrated ability to troubleshoot problems and debug to identify root cause.
- Hands-on experience with enterprise tools such as AppDynamics, Grafana, Splunk, and Dynatrace.
- Experience with Ansible, GitHub, or other automation, configuration, or release management tools.
- Automation experience using scripting languages such as Python, Bash, or Perl is particularly valued. Experience with one higher-level language is desired.
- Awareness of, and ability to reason about, modern software and systems architectures, including load balancing, databases, queueing, caching, distributed systems failure modes, microservices, cloud, etc.
- Practical experience running large scale systems is an advantage.
- Ability to contribute to system design and architecture, with strong database knowledge.
Experience: Intermediate (2 to 5 years)
Top 3 Must-haves:
1. Strong experience with Python and/or Shell scripting
2. Strong experience with databases (DB2 knowledge is a plus)
3. Strong communication skills. The consultant will work with business users on a day-to-day basis.
Top 2 Nice-to-haves:
1. Good knowledge of Grafana and Prometheus
2. Good experience with debugging