Connecting World's top Talents with Premier Jobs and Networking.

Hunyuan Large Model Infra Stability R&D Engineer (Shenzhen/Beijing/Shanghai/Hangzhou)


Job Source

Tencent Group

Location

China, Shenzhen

Salary

Negotiable

Designation

Internet/AI

Job Type

Full Time

Language

Job Posted Date

01-09-2025

Job Description

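1. Own stability governance and standards for the Hunyuan infra pipelines;
2. Work with the framework, compute, and network modules to complete the collection of key metrics;
3. Systematically build platform capabilities for detecting faulty nodes and slow nodes;
4. Work with the Hunyuan one-stop platform to build a unified automatic training-resume capability for jobs;
5. Respond to and resolve failures in day-to-day Hunyuan large model jobs.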

Job Requirements

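1. Familiar with the basic training workflow of frameworks such as Megatron and PyTorch;
2. Understand how GPUs/NPUs work and know their common operational commands;
3. Familiar with the hardware characteristics of RDMA networks and with the principles of collective communication such as all2all and allGather;
4. Basic knowledge of docker containers, storage mounts, and related fundamentals;
5. Experience troubleshooting, analyzing, and resolving failures in large-scale job systems is preferred;
6. Good communication and teamwork skills.
Bonus points: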






