We are looking for an experienced and ambitious Senior Site Reliability Engineer. In this role, you will use your skills and knowledge to handle our workflows from the technical side and help us set technical direction and standards for handling our production deployments. We are looking for a self-driven and proactive specialist who has worked on medium-sized and bigger production systems from the reliability and maintainability perspective and can use this experience to influence development.

At Neptune, we have quite an ambitious goal of becoming an MLOps standard for data scientists worldwide. Our platform is a lightweight experiment tracker for ML teams that struggle with debugging and reproducing experiments, sharing results, and messy model handover. Don’t worry, you don’t need to have ML skills. Our customers handle that part.

We design Neptune to be extensible, flexible, and lightweight to ensure it fits any workflow and keeps up with this fast-paced field. Some of the things we do are pretty run-of-the-mill engineering work (REST, SQL, NoSQL), but we often stumble upon a bigger challenge! After all, how many companies have implemented a custom scalable time-series storage moving data between various underlying storages?

 

Our tech stack:

  • Java (+ Spring), Scala, Kotlin, Python
  • Kubernetes, Terraform, Helm
  • Google Cloud Platform, Microsoft Azure
  • MySQL, Elasticsearch, Kafka

 

In this role, you will:

  • Manage Neptune deployments for our customers globally. This includes new deployments, upgrades, resource optimisation, and troubleshooting across various platforms (GCP, Azure, AWS, on-prem) using K8s, Helm, and Terraform.
  • Participate in on-call rotation for incident management, ensuring high availability and timely resolution of customers’ issues.
  • Oversee services (ClickHouse, Elasticsearch, Kafka, MySQL, Redis) to ensure performance and availability.
  • Maintain CI/CD pipelines, enhance automation, and implement security measures.
  • Monitor and troubleshoot network, resource utilization, and system issues to ensure high availability.

 

We are looking for:

  • 5+ years in site reliability engineering or related roles;
  • Demonstrated flexibility and resilience in the dynamic landscape of software development, maintaining positivity and determination even after setbacks;
  • Ability to plan work and basic knowledge of project management tools;
  • Strong experience with Linux systems and network administration;
  • Expertise in managing distributed computing and near real-time data streaming systems;
  • Proficiency in managing Kubernetes across multiple environments (GCP, Azure, AWS, and self-hosted) using Terraform and Helm;
  • Solid scripting abilities in Bash and Python;
  • Fluency in English, with excellent communication skills for interacting with global customers.

Nice to have:

  • Certifications in Linux, Kubernetes, GCP, Azure, or AWS.
  • Experience in high-traffic, petabyte-scale data environments.
  • Experience in managing ClickHouse, Elasticsearch, Redis, and Kafka deployments.

 

We offer:

  • Flexibility: 100% remote work with an office in Warsaw available and flexible working hours;
  • Share in our success: Participate in the Employee Stock Option Plan and be part of our growth journey;
  • Time off: 20 paid service-free days per year;
  • Ownership and impact: Space to take action, bring your ideas to life, and make a real impact;
  • Perks: Co-financing of private medical care and a Multisport card, regular team-building events, and free lunch when you’re at the office.

 

Any questions?

Check out our ultimate guide for candidates to the neptune.ai Engineering team.

Don’t hesitate to contact our Talent Acquisition team, and check out our About us page to get to know the story and faces behind Neptune.

 

By applying, you consent to neptune.ai processing your personal data to assess your suitability for the role you have applied for, in accordance with the General Data Protection Regulation (GDPR). Your personal data will remain confidential and will be shared only with authorized personnel involved in the recruitment process. You have the right to access, rectify, or delete your personal data at any time.
With your optional consent, we can retain your data for up to 12 months after the application to consider you for future suitable roles if you’re not a match for the current position.
