Senior ML Engineer (Evaluation)
About kaiko.ai
kaiko.ai is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics.
Healthcare decisions are rarely made by a single person or from a single data source. kaiko’s assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.
Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.
We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.
About the role
kaiko’s Multimodal Large Language Model (MLLM) is trained on domain-specific, high-complexity medical data. Reaching clinical-grade performance demands a comprehensive evaluation stack that is fast, reliable, and deeply integrated with our model development loop.
As a Senior ML Engineer in Evaluation, you will own the engineering stack that runs evaluations at scale: efficient inference across a growing set of frontier models, large-scale automated pipeline execution across a broad range of clinical benchmarks, and strong observability and production-grade system organisation throughout. You will work closely with other ML researchers and Product to translate research and clinical requirements into reliable, well-engineered eval signals.
As a Senior ML Engineer in Evaluation, you will
• Design, operate, and mature the automated pipelines and workflows that run large-scale evaluation jobs, and extend automation across the eval stack wherever possible.
• Maintain and mature the inference and eval services that form the backbone of our evaluation stack, ensuring correctness, reproducibility, and throughput as the model and benchmark zoo grows.
• Ensure the functional integrity of the eval stack through rigorous testing and validation: automated model/benchmark integration testing, expected-output validation across configurations, and helping ML researchers interpret evaluation outputs.
• Own Eval/MLOps end-to-end: service deployments, model and artifact versioning, eval data organisation, and post-deployment observability.
• Develop towards a technical lead: set engineering direction, make architectural decisions, and support other engineers in execution.
You will be based in Zurich or Amsterdam, with the expectation of spending ~50% of your time in the office.
About you
Essential:
• Excellent Python skills and strong software engineering fundamentals: testing, modular design, CI/CD, code review, and monorepo tooling.
• Experience designing and operating workflow orchestration and automated pipelines, with a strong grasp of the full deployment lifecycle: containerisation, config management, observability, and incident response.
• Proven experience building and operating ML infrastructure at scale, ideally for large language or multimodal models.
• Solid understanding of distributed compute systems and GPU workloads, including cluster scheduling and resource management.
• Ability to read and reason about model internals at a low level: tokenisation, numerical precision, tensor shapes, and inference-time behaviour.
• Prior experience in the medical domain is not required, but a strong motivation to push the frontier of clinical AI through excellent engineering is.
Nice to Have:
• Experience acting as a technical lead: setting direction on an engineering sub-system, making architectural trade-offs, and guiding other engineers.
• Hands-on experience with our stack: Dagster for orchestration, Ray for distributed compute, vLLM or similar for model serving.
• Prior experience with eval harness orchestration is a plus (e.g. lm-eval-harness, HF Evaluate, or similar frameworks).
• Safety and reliability engineering mindset: experience with red-teaming, load testing, or quality practices for production AI systems.
We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you’re excited about us but don’t fit every single qualification, we still encourage you to apply: we’ve had incredible team members join us who didn’t check every box.
Why kaiko
At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value:
Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.
Collaboration: You’ll approach disagreement with curiosity, build on common ground, and create solutions together.
Ambition: You’ll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.
In addition, we offer
An attractive and competitive salary, a good pension plan and 25 vacation days per year.
Great offsites and team events to strengthen the team and celebrate successes together.
A EUR 1,000 learning and development budget to help you grow.
Autonomy to do your work the way that works best for you, whether you have a kid or prefer early mornings.
An annual commuting subsidy.
Our interview process
Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:
Screening call: A short conversation to align on your motivation, career goals, and initial fit for the role.
Technical interview: A deep dive into your problem-solving approach through a technical challenge, case study, or role-specific scenario.
Onsite meeting (optional): You’ll meet team members across functions to explore collaboration dynamics, team fit, and day-to-day context.
Final executive conversation: A discussion with a member of the executive team focused on long-term alignment, cultural fit, and shared expectations for impact.
- Department
- ML R&D
- Locations
- Amsterdam, Zürich (Puls 5)
- Remote status
- Hybrid