Training Performance Engineer - Acceleration
About kaiko.ai
Kaiko is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics.
Healthcare decisions are rarely made by a single person or from a single data source. kaiko's assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.
Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.
We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.
About the role
Kaiko trains its own foundation models for clinical work. The program runs on open-weight MoE bases in the hundreds-of-billions to trillion-parameter range.
You own throughput on our Blackwell training cluster: you instrument runs, identify utilization gaps, and ship optimizations that raise MFU and uptime and cut wall-clock time. You work alongside research as new architectures and phases land on the cluster.
The hard problems are low-precision training, modern attention variants on open-weight MoE bases at the kernel level, and MoE parallelism tuned to the cluster fabric.
You will be based in either the Netherlands or Switzerland, with the expectation of spending at least 50% of your time at the office.
Some areas of responsibility
Instrument and analyze runs (MFU, throughput, uptime) and close gaps against predicted wall-clock times.
Benchmark NCCL collectives over InfiniBand and NVLink, including rail/topology behaviour and congestion at scale, and keep a current picture of what the fabric delivers.
Drive low-precision training in our stack and validate the speed-up.
Tune MoE parallelism (TP / PP / CP / EP / DP) per phase and characterise expert-parallel comm cost on the cluster fabric.
Land custom attention-variant kernels (e.g. hybrid, latent-attention) into the training stack.
About you
Deep GPU systems experience, with kernel-level CUDA / Triton work and comfort with CUTLASS, FlashAttention, PyTorch, and Nsight profiling.
Production experience with NCCL on InfiniBand or equivalent high-bandwidth interconnects.
Parallelism literacy: TP / PP / CP / EP / DP under memory, comm, and MFU constraints.
Tracks the relevant systems literature and brings it into the stack.
Nice to have:
Low-precision training (FP8, expert-only quant, dynamic loss scaling).
Sparse / hybrid / MLA attention at the kernel level.
Has shipped large-scale MoE training in production — pre-training, SFT, or RL.
Stack experience with Megatron, NeMo, or comparable.
We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you're excited about us but don't fit every single qualification, we still encourage you to apply: we've had incredible team members join us who didn't check every box!
Why kaiko
At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We've built a team of international experts where your work has a direct impact. Here's what we value:
Ownership: You'll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.
Collaboration: You'll have to approach disagreement with curiosity, build on common ground, and create solutions together.
Ambition: You'll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.
In addition, we offer:
An attractive and competitive salary, a good pension plan, and 25 vacation days per year.
Great offsites and team events to strengthen the team and celebrate successes together.
A EUR 1000 learning and development budget to help you grow.
Autonomy to do your work the way that works best for you, whether you have a kid or prefer early mornings.
An annual commuting subsidy.
Our interview process
Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:
Screening call: A short conversation to align on your motivation, professional goals, and initial fit for the role.
Technical take-home assessment: A deep dive into your problem-solving approach through a technical challenge.
Technical assessment debrief: You'll meet one of our team members to discuss your approach to the technical take-home assessment. This is also a good opportunity to explore collaboration dynamics, team fit, and day-to-day context.
Final onsite interview: A chance to visit the office, meet more of our team members and have a chat focused on long-term alignment and shared expectations for impact.
Locations
Amsterdam, Zürich (Puls 5)