Medior Platform Engineer – Observability & SRE
About kaiko
Delivering high-quality cancer care is complex: specialists form a view of each patient's condition by reasoning across different data, including CT scans, genomics context, treatment history and clinical notes.
Current AIs are powerful within domains but fall short when it comes to reasoning across data or domain areas. kaiko.w, our AI assistant for oncology, aims to equip every clinician with a full understanding of their patients, helping them to reason across data as they assess each case.
We’re building this in close collaboration with the Netherlands Cancer Institute (NKI) and a growing network of hospitals and research centers. We’ve raised significant long-term funding and have nearly doubled our team over the past year. We’re now 80+ people representing 25 nationalities, based across our offices in Zurich and Amsterdam.
About the role
You will be joining our Core Infrastructure Team as a key contributor in scaling the observability and Site Reliability Engineering (SRE) systems essential for maintaining exceptional reliability and developer experience. This role sits at the heart of our technical stack, ensuring we deliver a unified observability platform capable of providing clear and actionable insights into the health, performance, and cost-efficiency of our rapidly expanding services.
Working at the intersection of healthcare AI and observability solutions for large-scale systems, you will manage and enhance the observability stack that gives developers a reliable and intuitive "single pane of glass" for service monitoring. Your work will directly affect service reliability and the developer experience, as you build robust incident response workflows and observability tools used across the entire organization.
This is an operations-first role: you'll spend approximately 70% of your time proactively ensuring system reliability, building efficient workflows, automating operations, and managing stakeholder requests, and the remaining 30% designing, developing, and optimizing our observability infrastructure. You will collaborate extensively with the Observability and SRE teams, developers, and internal stakeholders to onboard services to the observability platform and continuously refine our SRE practices.
You’ll be located in either Amsterdam or Zurich, and we expect you to spend more than 50% of your time at the office.
Some areas of responsibility
- Keep telemetry flowing: ensure metrics, logs, traces (and, where useful, profiles) are collected, routed, stored, and queryable with predictable performance and retention.
- Maintain healthy signals: curate SLO-mapped alerts, reduce noise, tune thresholds, and validate alert reliability through synthetic checks.
- Own day-to-day reliability: patching, upgrades, backup/restore, capacity/retention planning, cost and storage hygiene, access control, and tenancy where applicable.
- Manage change safely: plan and execute maintenance windows, rollouts/rollbacks, and configuration changes using infrastructure- and configuration-as-code.
- Support dependents: ensure downstream systems (e.g., deployment pipelines, incident tooling, service catalog, ML workloads) receive the right telemetry and metadata.
- Run incident management: participate in on-call, triage issues, coordinate responders, and drive quick, high-signal communications; capture learnings in post-incident reviews.
- Automate toil: turn repeatable ops work into scripts, jobs, or pipelines; invest in self-service flows for product teams (dashboards, alert packs, instrumentation templates).
We’re tool-friendly, not tool-fixated. If you’ve operated similar systems, your experience will transfer.
If you are a Medior Platform or Infrastructure Engineer who specializes in observability and reliability, has hands‑on experience in maintaining systems in production using large‑scale observability platforms, is excited about the challenges of healthcare AI, and wants to contribute towards advancing cancer research, we believe you will excel in this role!
About you
Minimum requirements:
- 2–4 years in Observability, SRE, or Production Operations (or equivalent hands-on experience) with responsibility for running services in production.
- Strength in at least one of:
  - Containerized workload observability for distributed systems.
  - Cloud/platform observability for infrastructure and services.
  - Application-level insight (request tracing, latency, error rates).
- Experience operating on-call: actionable alerts, incident triage, stakeholder comms, and writing/updating runbooks.
- Comfortable with infrastructure/configuration as code and treating dashboards/alerts as code.
- Clear, calm communicator who collaborates well with developers, data/ML teams, and security/compliance partners.
Nice to have:
- Familiarity with ML/AI operations signals (model performance, data/feature drift, GPU/accelerator utilization).
- Experience in regulated or safety-critical environments and an understanding of privacy-by-design practices.
We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you’re excited about us but don’t fit every single qualification, we still encourage you to apply: we’ve had incredible team members join us who didn’t check every box!
Why kaiko
At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value:
- Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.
- Collaboration: You’ll approach disagreement with curiosity, build on common ground and create solutions together.
- Ambition: You’ll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.
In addition, we offer:
- An attractive and competitive salary, a good pension plan and 25 vacation days per year.
- Great offsites and team events to strengthen the team and celebrate successes together.
- A EUR 1000 learning and development budget to help you grow.
- Autonomy to do your work the way that works best for you, whether you have a kid or prefer early mornings.
- An annual commuting subsidy.
Our interview process
Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:
- Screening call: A short conversation to align on your motivation, career goals, and initial fit for the role.
- Technical interview: A deep dive into your problem-solving approach through a technical challenge, case study, or role-specific scenario.
- Onsite meeting (optional): You’ll meet team members across functions to explore collaboration dynamics, team fit, and day-to-day context.
- Final executive conversation: A discussion with a member of the executive team focused on long-term alignment, cultural fit, and shared expectations for impact.
- Department: Platform Engineering
- Locations: Amsterdam, Zürich (Puls 5)
- Remote status: Hybrid