Join our team as a Site Reliability Engineer, where you'll work with a diverse range of technologies, from IoT communication protocols to cloud-scalable connectivity and big data processing. You will be instrumental in ensuring constant uptime, seamless scalability, and a robust foundation for our critical systems, enabling the growth of new applications and services. This role goes beyond traditional operations: you will collaborate closely with developers and architects to enhance stability, security, and scalability from the design phase onwards.
Responsibilities:
- Collaborate with developers and architects to improve system design and implementation for enhanced stability, security, and scalability.
- Implement and enhance monitoring and observability for AI infrastructure and applications to maximize reliability.
- Partner with application engineering teams to improve service operability, reliability, on-call efficiency, incident management, and post-mortem analysis.
- Drive production readiness and improve key areas such as capacity planning, configuration management, and observability.
- Design and refine architectures for new and existing systems based on reliability and high availability principles, incorporating comprehensive logging and observability.
- Develop and apply expertise in client infrastructure and best practices to enhance platforms for world-class distributed system performance.
- Develop tooling and automation to streamline infrastructure and application operations.
- Gather and analyze metrics from operating systems and applications to optimize performance and facilitate fault finding.
- Lead deep-dive troubleshooting of production issues and actively participate in diagnostic calls.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience supporting internet-facing production services and distributed systems.
- Strong expertise in AWS managed services for Kafka, Redis (ElastiCache), PostgreSQL, and AMQP brokers.
- Hands-on experience with Pulumi, Terraform, and Terragrunt for Infrastructure as Code (IaC).
- Advanced proficiency in Kubernetes, with hands-on experience managing large-scale, production-grade clusters, optimizing workloads, and implementing best practices for scalability and high availability.
- Experience with Argo CD and GitLab for CI/CD pipelines.
- Expertise in Linux systems, particularly Red Hat and Debian distributions.
- Experience implementing Prometheus for monitoring and observability.
- Strong scripting skills (Bash, PowerShell) and command-line interface proficiency.
- Proven ability to troubleshoot complex technical problems in distributed systems, networking, and security, especially on AWS.
- Self-starter with a focus on continuous improvement and operational optimization.
- Strong programming skills in at least one interpreted, dynamically typed language (e.g., Python, JavaScript on Node.js) and one compiled, statically typed language (e.g., C#, Java).
- Relevant SRE training and certifications are a plus.
- Excellent verbal and written communication skills in English.
This is an excellent opportunity to contribute to a cutting-edge AI product and work with a talented team. If you are passionate about reliability, scalability, and automation, we encourage you to apply.