Staff Machine Learning Engineer | Large Scale AI Infrastructure

  • Glocomms
Job Summary
Location: Palo Alto, CA 94306
Job Type: Contract
Visa: Any Valid Visa
Salary: PayRate
Qualification: BCA
Experience: 2 Years - 10 Years
Posted: 16 Jan 2025
Job Description

This position sits within a company that is pioneering a new era of biomedicine.

Role Overview:

  • GPU Cluster Management: Architect, deploy, and sustain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Oversee and manage cluster resources to maximize efficiency and utilization.
  • Distributed/Parallel Training: Apply distributed computing techniques to train large deep learning models in parallel across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and reduced training times.
  • Performance Optimization: Enhance GPU clusters and deep learning frameworks to achieve peak performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
  • Deep Learning Framework Integration: Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.
  • Scalability and Resource Management: Ensure GPU clusters can scale effectively to meet growing computational demands. Develop strategies for resource management to prioritize and allocate computing resources based on project needs.
  • Troubleshooting and Support: Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and efficiently resolve technical challenges.
  • Documentation: Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.

Qualifications:

  • Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
  • Over 2 years of proven experience in managing GPU clusters, including installation, configuration, and optimization.
  • Strong expertise in distributed deep learning and parallel training techniques.
  • Proficiency in popular deep learning frameworks and training libraries such as PyTorch, Megatron-LM, and DeepSpeed.
  • Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
  • Knowledge of performance profiling and optimization tools for HPC and deep learning.
  • Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
  • Solid background in distributed systems, cloud computing (AWS, GCP), and containerization and orchestration (Docker, Kubernetes).
  • Currently holds or previously held a Staff or equivalent title, or has spent 3+ years at a Senior-level title.


The company will provide a relocation package for candidates open to relocating.
