Job Intersection | Contract Job: Staff Machine Learning Engineer

Job Summary

Location

Palo Alto, CA 94306

Job Type

Contract

Visa

Any Valid Visa

Salary

PayRate

Qualification

BCA

Experience

2 Years - 10 Years

Posted

02 Jan 2025

Job Description

This position will sit within a company that is pioneering a new era of Biomedicine!

Role Overview:

GPU Cluster Management: Architect, deploy, and sustain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Oversee and manage cluster resources to maximize efficiency and utilization.
Distributed/Parallel Training: Apply distributed computing techniques to facilitate parallel training of extensive deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and reduced training times.
Performance Optimization: Enhance GPU clusters and deep learning frameworks to achieve peak performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
Deep Learning Framework Integration: Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.
Scalability and Resource Management: Ensure GPU clusters can scale effectively to meet growing computational demands. Develop strategies for resource management to prioritize and allocate computing resources based on project needs.
Troubleshooting and Support: Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and efficiently resolve technical challenges.
Documentation: Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.

Qualifications:

Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
Over 2 years of proven experience in managing GPU clusters, including installation, configuration, and optimization.
Strong expertise in distributed deep learning and parallel training techniques.
Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
Knowledge of performance profiling and optimization tools for HPC and deep learning.
Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
Solid background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
Currently or previously held a Staff or equivalent title, or has spent 3+ years at a Senior-level title.

The company will provide a relocation package for candidates open to relocating!

Staff Machine Learning Engineer | Large Scale AI Infrastructure