Senior Platform Engineer – Observability & SRE

About kaiko

Delivering high-quality cancer care is complex; specialists form a view of each patient's condition by reasoning across different data: CT scans, genomic context, treatment history and clinical notes.

Current AIs are powerful within domains but fall short when it comes to reasoning across data or domain areas. kaiko.w, our AI assistant for oncology, aims to equip every clinician with a full understanding of their patients, helping them to reason across data as they assess each case.

We’re building this in close collaboration with the Netherlands Cancer Institute (NKI) and a growing network of hospitals and research centers. We’ve raised significant long-term funding and have nearly doubled our team over the past year. We’re now 80+ people representing 25 nationalities, based across our offices in Zurich and Amsterdam.

About the role

You will be joining our Core Infrastructure Team as a key contributor in scaling the observability and Site Reliability Engineering (SRE) systems essential for maintaining exceptional reliability and developer experience. This role sits at the heart of our technical stack, ensuring we deliver a unified observability platform capable of providing clear and actionable insights into the health, performance, and cost-efficiency of our rapidly expanding services.

Working at the intersection of healthcare AI and observability solutions for large-scale systems, you will manage and enhance the observability stack that provides developers a reliable and intuitive "single pane of glass" for service monitoring. Your efforts directly impact service reliability and the developer experience by building robust incident response workflows and observability tools used across the entire organization.

This is a platform-first role, where you'll spend approximately 70% of your time designing, developing, and optimizing our observability infrastructure, and 30% proactively ensuring system reliability and performance. You will collaborate extensively with the Observability and SRE teams, developers, and internal stakeholders to onboard services to the observability platform and continuously refine our SRE practices.

You’ll be located in either Amsterdam or Zurich, with the expectation of spending more than 50% of your time at the office.

Some areas of responsibility

You will provision, deploy, and optimize observability solutions and tools that cater to ML workloads (Prometheus, OpenTelemetry, Grafana, Loki, VictoriaMetrics, etc.) in Kubernetes-based container orchestration environments.

You will design, build, and maintain Infrastructure-as-Code (Terraform) and configuration-as-code (Ansible, Helm, Kustomize) solutions for reliable, reproducible, and scalable deployments.

You will develop and maintain unified dashboards, SLO-based alerting strategies, and automated workflow monitoring to ensure reliable and actionable system observability, scoped to both container runtime environments and cloud systems.

You will develop and standardize AI observability practices, leveraging your prior experience to ensure effective monitoring, logging, tracing, and alerting specifically tailored for machine learning services and workflows.

You will implement robust, data-driven incident response processes aligned with best-in-class SRE methodologies (including runbooks, playbooks, and post-incident reviews).

You will collaborate closely with developers and other stakeholders, providing mentorship and fostering a culture of continuous improvement and proactive reliability.

If you are a (Senior) Platform or Infrastructure Engineer who specializes in observability and reliability, has hands‑on experience designing, scaling, and operating large‑scale observability platforms, is excited about the challenges of healthcare AI, and wants to contribute towards advancing cancer research, we believe you will excel in this role!

About you

Minimum requirements:

4+ years of hands-on experience in Observability and Site Reliability Engineering (SRE), with proficiency in at least one of the following domains and solid exposure to at least one other:

Container Workload Observability: Managing container-native monitoring and logging stacks (e.g. Prometheus, Grafana, Loki, OpenTelemetry, VictoriaMetrics, etc.) specifically for Kubernetes-based environments to support reliable, observable, and efficient workloads.

Cloud Observability: Deploying and managing cloud-native monitoring, tracing, and logging solutions (e.g. Datadog, AWS CloudWatch, Azure Monitor, GCP Operations Suite, etc.) to provide actionable insights into cloud infrastructure health and performance.

Application Performance Monitoring: Implementing and operating APM tools (e.g. Sentry, New Relic, Dynatrace, etc.) to ensure real-time visibility and optimization of application performance, identifying bottlenecks and reducing latency.

Strong experience in developing and standardizing SRE workflows, including defining and tracking SLIs/SLOs, managing error budgets, establishing on-call processes, and running incident response (playbooks, runbooks, and post-incident reviews) aligned with industry best practices.

Infrastructure-as-Code experience (Terraform or similar) and experience maintaining observability dashboards.

Nice to have:

Experience in machine learning observability (MLflow, Kubeflow, WandB, etc.)

Experience with coding and scripting (Python, Go, or Bash)

Experience with on-call rotations

We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you’re excited about us but don’t fit every single qualification, we still encourage you to apply: we’ve had incredible team members join us who didn’t check every box!

Why kaiko

At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value:

Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.

Collaboration: You’ll approach disagreement with curiosity, build on common ground and create solutions together.

Ambition: You’ll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.

 
In addition, we offer:

An attractive and competitive salary, a good pension plan and 25 vacation days per year.

Great offsites and team events to strengthen the team and celebrate successes together.

A EUR 1000 learning and development budget to help you grow.

Autonomy to do your work the way that works best for you, whether you have a kid or prefer early mornings.

An annual commuting subsidy.

Our interview process

Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:

Screening call: A short conversation to align on your motivation, career goals, and initial fit for the role.

Technical interview: A deep dive into your problem-solving approach through a technical challenge, case study, or role-specific scenario.

Onsite meeting (optional): You’ll meet team members across functions to explore collaboration dynamics, team fit, and day-to-day context.

Final executive conversation: A discussion with a member of the executive team focused on long-term alignment, cultural fit, and shared expectations for impact.
