We're seeking curious, challenge-driven research interns for a 6-month program focused on Kubernetes and Large Language Model (LLM) training on a supercomputer.
Roles and Responsibilities:
1. Learn and work under the guidance of our team lead
2. Gradually build proficiency in Kubernetes and AI/ML training
3. Run network diagnostics and implement corrective actions
4. Assist in ongoing research and development projects
Key Learning Areas:
1. Networking concepts (basic to advanced)
2. Docker fundamentals and advanced concepts
3. Kubernetes principles, workload deployment, and service creation
4. Autoscaling and self-healing in Kubernetes
5. AI/ML training techniques
6. Single-node fine-tuning on Kubernetes using Axolotl
7. Distributed training using PyTorch Distributed and Colossal-AI
Perks and Benefits:
1. Immigration sponsorship provided
2. OPT sponsorship for recent graduates (CPT considered for eligible students)
3. Accommodation and meals provided
4. Flight covered from anywhere in the US to Utah
5. Monthly pay of $250 USD (subject to EAD approval)
6. Opportunity to take the CCNA exam (certification costs covered by Data Care)
Only candidates who meet the following criteria can apply:
1. are available for a full-time (in-office) internship
2. can start the internship between 12th Jul'24 and 16th Aug'24
3. are available for the full 4-month duration
4. are from Price, Utah, or are open to relocating there
5. have relevant skills and interests
Qualifications:
1. Strong problem-solving skills
2. Excellent English comprehension
3. Preference is given to MS graduates in Computer Science or Data Science
4. Graduates from other STEM fields are also encouraged to apply
5. Must be willing to work on-site (no remote positions available)
Data Care LLC is a cutting-edge Kubernetes and HPC cloud provider operating from its own data center, equipped with thousands of NVIDIA and AMD GPUs. We manage our clusters with Kubernetes and run a range of applications on top of them, including:
1. Knative for autoscaling inference workloads
2. RAPIDS for data science
3. Colossal-AI for distributed training
We provide GPU computing services to various AI/ML startups and academic researchers. Our clients primarily use our clusters for AI inference and fine-tuning of models.
We're currently developing an innovative HPC design that integrates Slurm with Kubernetes for distributed neural network training.