Site Reliability Engineer
My client is seeking a Site Reliability Engineer to join their Operations team. Your goal will be to ensure our platforms and products operate at peak performance. You'll focus on optimizing reliability and minimizing manual work for our 24/7 internet-based solutions.
This role is highly collaborative, working closely with Engineering teams to address existing issues and build reliability into new solutions.
What You'll Do
As a Site Reliability Engineer, you will:
- Monitor and optimize system performance.
- Troubleshoot hardware, software, and network issues.
- Collaborate with the Security team to protect cloud solutions.
- Work with Developers and Product Owners to plan upgrades and improvements.
- Support projects by identifying risks and mitigation strategies.
- Provide on-call support on a rota basis.
- Automate alerts and responses with scripts and tools.
- Help Engineering teams maintain high-performance production environments.
What We're Looking For
We're looking for candidates with experience in:
- Ensuring system reliability and performance.
- Working in Operations or Site Reliability roles.
- Collaborating across teams and taking ownership of tasks.
- Monitoring tools like Datadog, Dynatrace, or New Relic.
- Configuration management tools (e.g., Ansible or Chef).
- Scripting languages like PowerShell, Bash, Python, or Ruby.
- Supporting web-based applications, including firewalls, load balancers, and availability checks.
- Linux and Windows Server operating systems.
Bonus skills include:
- Knowledge of Microsoft Azure (especially PaaS tools like Web Apps or Functions).
- Familiarity with tools like Terraform, Jenkins, or Proxmox.
- Understanding of DNS, load balancer setup, and cloud-based networks.
- Experience with Agile methodologies and microservice architectures.
- Knowledge of cloud security best practices.