The project: Our goal is to benchmark AI/ML cluster fabrics with realistic workloads, which typically require investments in compute systems with GPUs and RDMA NICs that are costly and time-consuming to build and operate. To achieve this goal, we’re creating a product that emulates realistic collective communication and tests AI/ML fabrics for both conformance and performance.
The Team:
Was founded to fill a gap in the market for testing AI/ML data center workloads.
Currently has 15 members and is part of a larger team spread across multiple geographies.
We work in a virtual environment, follow Agile practices, and meet daily to plan and track our objectives.
We work directly and dynamically with our partners as well as with end customers.
We are always designing and developing new features, as well as resolving customers’ enquiries.
Responsibilities
Participate in the analysis, design, development and maintenance of AI/ML-related products.
Maintain and enhance current products and participate in the design and development of applications for both internal and external use.
Interact with project management, leads, testers and other developers to understand features, plan the schedule, design and implement solutions, optimize, perform development testing and fix bugs in order to deliver high-quality releases on time.
Requirements
Required Qualifications:
Strong knowledge of Linux user space. Kernel space development knowledge is a big plus
Strong knowledge of the C, C++ and Python programming languages
Network programming expertise
Good algorithms/data structures knowledge
Ability to quickly learn and grasp new technologies
Desire and ability to work in a highly collaborative, team-oriented environment
Performance-driven, with a proactive attitude
Bonus skills (not mandatory but appreciated)
Good knowledge of virtualization technologies (QEMU/KVM)
Good knowledge of Linux container technology and Docker/Kubernetes