Site Reliability Engineer
About kaiko.ai
kaiko is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics.
Healthcare decisions are rarely made by a single person or from a single data source. kaiko's assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.
Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.
We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.
About the role
This is a frontline reliability role, and we want to be upfront about that: you’ll spend the majority of your time keeping the platform healthy by responding to alerts, triaging and resolving incidents, and carrying on-call as a core, respected part of the work. The rest of your time goes into making that work better: sharpening alerts, removing toil, and strengthening the SRE programme so the same problems don’t keep paging us.
If you’re energized by understanding why something broke and designing the page out of existence, you’ll do well here. If on-call feels like a tax you’d rather avoid, this isn’t the right fit.
This is net-new headcount on our Observability & SRE team, so you’ll have room to shape how reliability works at kaiko rather than inherit a rigid process.
Some areas of responsibility
Own the reactive frontline. Carry on-call, triage alerts quickly and methodically, and drive incidents to resolution with clear communication and clean handoffs.
Make our alerts trustworthy. Treat noisy or low-value alerts as defects to fix, not background noise. Help move us toward structured, queryable signals and precise, actionable alerting.
Turn incidents into durable fixes. Run and contribute to blameless postmortems, then follow through across teams so that action items actually land in our services.
Strengthen the SRE programme. Write automation and tooling that removes toil, improve runbooks and dashboards, and leave the on-call rotation better documented than you found it.
About you
These are the things we consider essential for you to thrive in this role:
Deep Kubernetes experience. You can reason about and debug real failure modes under pressure (scheduling, networking, resource pressure, control-plane vs. workload issues), not just operate a cluster through a dashboard.
Strong Linux fundamentals. You’re comfortable debugging at the OS level: processes, networking, filesystems, resource limits.
Solid programming ability. You reach for code to eliminate toil and write maintainable automation and tooling. Demonstrable experience building products is a plus: it shows strong problem-solving skills and that you are comfortable maintaining complex software.
Observability and incident-response fluency. You live comfortably in metrics, logs, and traces, can write the query that isolates a problem, and stay composed running an incident.
Internalized SRE principles. SLO/SLI and error-budget thinking, a bias toward prevention, and an instinct for reducing toil — so your improvements make the system genuinely more reliable.
We’re thinking of someone with roughly 3-6 years of production SRE, platform, or infrastructure experience, but we care far more about demonstrated skill than years served.
Soft skills
Solution-oriented and pragmatic; not afraid to build solutions from the ground up whenever the problem calls for it.
Collaborative communicator who coaches teams and writes clear, actionable guidance.
Bias to automate and remove toil.
Nice to have
Experience with structured logging and modern alerting practices
Infrastructure-as-code and CI/CD fluency (Terraform, Helm, GitOps, and similar)
Familiarity with incident-management tooling (Rootly, PagerDuty, incident.io) and the postmortem discipline around it
Exposure to regulated or high-stakes domains (health, fintech, critical infrastructure)
Why kaiko
At kaiko, we believe the best ideas come from collaboration, ownership, and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value:
Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.
Collaboration: You’ll approach disagreement with curiosity, build on common ground, and create solutions together.
Ambition: You’ll be surrounded by people who set high standards, see obstacles as opportunities, and work relentlessly to create better outcomes for patients.
In addition, we offer
An attractive and competitive salary, a good pension plan, and 25 vacation days per year.
Great offsites and team events to strengthen the team and celebrate successes together.
A EUR 1000 learning and development budget to help you grow.
Autonomy to do your work in the way that suits you best, whether you have children to care for or simply prefer early mornings.
An annual commuting subsidy.
Apply
If this sounds like your kind of work, we’d love to hear from you, even if you don’t tick every single box. If you’re strong on the fundamentals and energized by reliability work, please reach out.
kaiko is committed to building a diverse team and to an inclusive, equitable hiring process. We welcome applicants of all backgrounds.
Locations: Amsterdam, Zürich (Puls 5)
Remote status: Hybrid