Senior Cloud Operations Engineer - PyTorch

Job Summary
Location: San Francisco, CA 94199
Job Type: Contract
Visa: Any Valid Visa
Salary: PayRate
Qualification: BCA
Experience: 2 Years - 10 Years
Posted: 28 Feb 2025
Job Description
Company Description

The Linux Foundation is a driving force in fostering open-source collaboration and supporting communities across a range of projects, including PyTorch. We're dedicated to enhancing and expanding our infrastructure to meet the growing demands of PyTorch and related AI projects. We are seeking a Senior Cloud Operations Engineer who will focus on the infrastructure operations of the PyTorch project, automating processes, optimizing cloud tools, and ensuring a robust and scalable cloud environment.

Job Description

The Senior Cloud Operations Engineer will play a crucial role in managing and optimizing our multi-cloud infrastructure and DevOps practices. This position is essential for maintaining and scaling our cloud operations across multiple cloud provider platforms and accelerator technologies. The ideal candidate will combine deep expertise in cloud technologies, hardware accelerators, and DevOps methodologies to ensure our infrastructure remains robust, efficient, and future-proof.

Responsibilities:

Cloud Infrastructure Management
  • Design and manage multi-cloud environments across AWS, GCP, and Azure
  • Optimize instance selection and utilization across various compute types including AMD and Intel CPU-based instances
  • Configure and manage GPU-accelerated instances (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
  • Implement and maintain infrastructure-as-code using Terraform and other IaC tools
  • Optimize cloud resource utilization and implement FinOps practices for cost management
  • Design and implement high-availability solutions across multiple cloud providers

CI/CD and DevOps
  • Design, implement, and maintain CI/CD pipelines using GitHub Actions
  • Configure and manage both GitHub-hosted and self-hosted runners
  • Implement and maintain non-blocking and out-of-tree CI jobs
  • Design and implement matrix testing strategies across different hardware configurations
  • Develop and maintain automated testing frameworks for various testing types (unit, integration, performance)
  • Implement best practices for version control management and branching strategies
  • Experience with agile methodologies and scrum practices

Performance Optimization and Testing
  • Develop and implement performance testing frameworks for various hardware accelerators
  • Optimize workload distribution across different types of compute instances
  • Implement automated performance regression testing
  • Design and maintain benchmarking systems for various hardware configurations

Infrastructure Security and Monitoring
  • Implement security best practices across multi-cloud environments
  • Develop comprehensive monitoring solutions using cloud tools
  • Participate in on-call rotations supporting operations and incident response
  • Establish and maintain escalation procedures and resolution processes
  • Manage access control and security policies across cloud platforms

Qualifications

Required:
  • Bachelor's degree in Computer Science, Engineering, or related field
  • 7+ years of experience in cloud operations with extensive multi-cloud expertise (AWS, GCP, Azure)
  • Demonstrated experience with GPU computing (AMD and NVIDIA) and specialized accelerators (TPUs, NPUs)
  • Strong knowledge of CPU architectures and instance type optimization (AMD, Intel)
  • Advanced experience with GitHub Actions, including custom runner configuration and management
  • Expertise in implementing non-blocking and out-of-tree CI jobs
  • Strong background in version control systems and branching strategies
  • Experience with agile methodologies and scrum practices
  • Proficiency in infrastructure-as-code tools, particularly Terraform
  • Strong scripting abilities (Python, Bash, PowerShell, TypeScript)
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Demonstrated experience in implementing automated testing frameworks

Preferred:
  • Experience optimizing workloads across different hardware accelerators
  • Background in performance testing and optimization
  • Contributions to open-source projects
  • Experience mentoring other engineers
  • Background in machine learning infrastructure
  • Experience with Datadog is a plus

Benefits:
  • Competitive salary
  • Comprehensive health, dental, and vision insurance
  • Flexible PTO policy
  • Remote work environment
  • Professional development opportunities
  • 401(k) matching
  • Home office stipend

Additional Information

Open to US-based employees only. Preference for West Coast candidates.
Salary: $125,000 - $165,000 USD

About Us:
We maintain a predominantly remote workforce and are committed to hiring top-notch talent. We are passionate about providing a flexible and supportive work culture. Our team values collaboration, innovation, and continuous learning. We embrace and believe in creating an inclusive environment where all team members can thrive.

The Linux Foundation is an Equal Opportunity Employer.
