Senior AI Infrastructure Engineer

  • Signify Technology
Job Summary
Location: San Francisco, CA 94199
Job Type: Contract
Visa: Any Valid Visa
Salary: PayRate
Qualification: BCA
Experience: 2 to 10 years
Posted: 23 Jan 2025
Job Description

Job Title: Senior AI Infrastructure Engineer

Location: Remote but must be located in the Bay Area

Salary Range: $200,000-$250,000 + Equity



About the Company


They are a fast-growing startup in the 3D generation space, focused on creating tools for 3D artists and game developers. With over 1 million users, their platform is at the forefront of revolutionizing the creation of 3D content using advanced AI and machine learning. Their products enable game developers to quickly generate high-quality 3D models. As they continue to expand, they are looking for an experienced Senior AI Infrastructure Engineer to help scale their AI and machine learning infrastructure.




About the Role


In this role, the engineer will be responsible for managing GPU clusters and training models on them, scaling data processing workflows, and optimizing the performance of AI models on cloud infrastructure. They will work hands-on with large-scale datasets and GPUs to build and scale the infrastructure required to support cutting-edge AI applications such as Text-to-3D and Image-to-3D generation. The ideal candidate will have experience managing their own GPU clusters (8+ GPUs), scaling workloads, and working with large image datasets in a cloud environment.




Responsibilities


  • GPU Cluster Management: Lead training and inference for image-based AI models on GPU clusters. Manage and scale clusters of 8+ GPUs, ensuring efficient operation and optimal performance. This includes setup, monitoring, and troubleshooting of GPU resources.
  • Data Processing & Scaling: Work directly with large-scale data processing workflows. Ensure data is processed, cleaned, and ready for training. Scale data pipelines to support high throughput in cloud environments such as AWS or Azure.
  • Model Tuning & Training: Work with teams to fine-tune AI models on large image datasets. Train models from scratch or fine-tune pre-trained models for specific use cases, ensuring high performance and scalability. Tuning multi-GPU training setups will be a critical part of the role.
  • Cloud Infrastructure: Utilize cloud platforms like AWS or Azure to manage and scale GPU clusters. Optimize cloud resources for large-scale training jobs and ensure infrastructure supports the growing demands of their AI models.
  • Collaboration & Innovation: Collaborate closely with AI and ML teams to deploy new algorithms, experiment with distributed training, and enhance infrastructure. Play a key role in scaling their GenAI products and ensuring systems can handle millions of AI operations per month.


Required Skills


  • Experience with GPU Clusters: Proven hands-on experience managing and training models on GPU clusters of 8+ GPUs, ideally managing the infrastructure independently (not via a company). Comfortable with both training and inference on large-scale systems.
  • Large-Scale Data Experience: Experience processing large image datasets for machine learning tasks, including data preprocessing, scaling data workflows, and ensuring smooth pipelines for large training jobs.
  • Model Training & Tuning: Experience in training and fine-tuning deep learning models (primarily image-based models) using frameworks like PyTorch, TensorFlow, or similar. Proficiency in tuning models on GPUs to maximize performance.
  • Cloud Platforms & Tools: Experience working with cloud platforms like AWS or Azure to scale GPU clusters for deep learning workloads. Knowledge of cloud-based orchestration tools (e.g., Ray) is a plus.
  • Programming Skills: Proficiency in Python for developing and optimizing training pipelines. Experience with distributed computing and parallel processing tools is highly valued. Familiarity with JAX, PyTorch, or similar libraries for model training is beneficial.
