Senior/Staff RL Engineer - ML R&D
About kaiko.ai
kaiko is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics.
Healthcare decisions are rarely made by a single person or from a single data source. kaiko's assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.
Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.
We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.
About the role
kaiko trains its own foundation models for clinical work on a dedicated GPU cluster. RL is the engine driving alignment, reasoning, and agentic capability across our stack.
You own the RL training infrastructure end-to-end: the distributed training stack, the reward pipelines, and the experiment infrastructure that lets researchers iterate fast. The hard problems are real: reward hacking and objective-level instability; entropy collapse as policies converge prematurely; sparse and delayed rewards that make credit assignment across long reasoning traces extremely difficult; and exploration failures on hard problems, where the model rarely samples a correct trace and learning stalls entirely. You diagnose these at root cause, fix them, and contribute back upstream where you can. You also explore new algorithms, from policy gradient variants and offline RL to agentic RL with tool use, and bring what matters into production.
You will be based in either the Netherlands or Switzerland, with the expectation of spending at least 50% of your time in the office.
Some areas of responsibility
Own the RL training stack end-to-end and keep it scaling cleanly across large MoE models and long contexts.
Build and maintain reward pipelines: verifiable reward signals, LLM-based reward models, and reward shaping strategies for complex clinical reasoning tasks.
Debug training instabilities at root cause — reward hacking, entropy collapse, credit assignment failures, gradient issues — and ship fixes, not workarounds.
Explore new RL algorithms and reward designs; run controlled experiments and translate promising results into the main training stack.
Scale runs across more nodes, longer contexts, and more complex parallelism as models and tasks grow.
Contribute upstream to open-source frameworks when you find bugs or missing features.
About you
Deep hands-on experience with RL training systems: you have shipped and scaled RL or post-training runs, not just run tutorials.
Fluent in at least one distributed training framework at a level where you can read the source and debug silent failures.
Strong understanding of core RL challenges: reward hacking, credit assignment, exploration, entropy collapse, sample efficiency — and practical ways to address them.
Comfortable at the intersection of research and engineering: you read papers, implement ideas, and know when something is worth productionising.
Excellent software engineering: clean Python, typed code, reproducible experiments, good test coverage.
Independent operator: you don't need prescribed task lists; you take a system from "running" to "stable, fast, and understood."
Nice to have:
Experience with verifiable reward signals or LLM-as-judge reward pipelines.
Familiarity with inference serving systems as part of an RL rollout loop.
Experience with MoE training and the additional complexity it introduces.
Contributions to open-source training frameworks.
Exposure to agentic or tool-use RL — web search, code execution, multi-step reasoning.
Healthcare or regulated-deployment context.
Why kaiko
At kaiko, we believe the best ideas come from collaboration, ownership, and ambition. We've built a team of international experts where your work has a direct impact. Here's what we value:
Ownership: You'll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.
Collaboration: You'll approach disagreement with curiosity, build on common ground, and create solutions together.
Ambition: You'll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.
In addition, we offer:
An attractive and competitive salary, a good pension plan, and 25 vacation days per year.
Great offsites and team events to strengthen the team and celebrate successes together.
A EUR 1000 learning and development budget to help you grow.
Autonomy to work in the way that suits you best, whether you have children or prefer early mornings.
An annual commuting subsidy.
Our interview process
Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:
Screening call: A short conversation to align on your motivation, professional goals, and initial fit for the role.
Technical interview with offline assignment: A deep dive into your problem-solving approach through a technical challenge, case study, or role-specific scenario.
Onsite meeting (optional): You'll meet team members across functions to explore collaboration dynamics, team fit, and day-to-day context.
Final executive conversation: A discussion with a member of the executive team focused on long-term alignment and shared expectations for impact.
Locations: Amsterdam, Zürich (Puls 5)
Remote status: Hybrid