Hunyuan Large Model Infra Stability R&D Engineer (Shenzhen/Beijing/Shanghai/Hangzhou)
Job Source |
Tencent Group |
Location |
China, Shenzhen |
Salary |
Negotiable |
Designation |
Internet/AI |
Job Type |
Full Time |
Language |
|
Job Posted Date |
01-09-2025 |
Job Description |
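1. Take charge of stability governance and standards development for Hunyuan infra-related pipelines;
2. Work with the framework, compute, and network modules to improve collection of key metrics;
3. Systematically build platform-level capabilities for detecting faulty nodes and slow nodes;
4. Work with the Hunyuan one-stop platform to build a unified capability for automatically resuming training jobs;
5. Respond to and resolve failures in day-to-day Hunyuan large model jobs.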
Job Requirements |
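1. Familiarity with the basic training workflow of frameworks such as Megatron and PyTorch;
2. Solid understanding of how GPUs, NPUs, and similar accelerators work, and of their common operational commands;
3. Familiarity with the hardware characteristics of RDMA networks and with the principles of collective communication such as all2all and allGather;
4. Basic knowledge of docker containers, storage mounting, and related topics;
5. Experience troubleshooting, analyzing, and resolving failures in large-scale job systems is preferred;
6. Good communication and teamwork skills.
Bonus points: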