Optimove is a global marketing tech company, recognized as a Leader by Forrester and a Challenger by Gartner. We work with some of the world's most exciting brands, such as Sephora, Staples, and Entain, who love our thought-provoking combination of art and science. With a strong product, a proven business, and the DNA of a vibrant, fast-growing startup, we're on the cusp of our next growth spurt. It's the perfect time to join our team of ~450 thinkers and doers across NYC, LDN, TLV, and other locations, where 2 of every 3 managers were promoted from within. Growing your career with Optimove is basically guaranteed. 

Are you passionate about ensuring system reliability, scalability, and performance? Do you thrive in a dynamic environment where automation and operational excellence are key?
Optimove is looking for a Site Reliability Engineer (SRE) to join our team and play a crucial role in designing, implementing, and maintaining our cloud-based infrastructure. In this role, you will collaborate across teams to drive automation, improve system resilience, and optimize performance while fostering a culture of reliability.

Responsibilities:

  • System Reliability – Ensure high availability and performance of services through effective monitoring, incident management, and root cause analysis.
  • Automation & Tooling – Develop and maintain automation for infrastructure provisioning, configuration management, and application deployment.
  • Performance Optimization – Analyze and enhance system performance, including load balancing, caching, and database tuning. Conduct regular capacity planning.
  • Incident Response & Troubleshooting – Lead incident response efforts, participate in on-call rotations, and troubleshoot complex infrastructure issues.
  • Security & Compliance – Collaborate with security teams to implement best practices and ensure compliance with relevant standards (ISO 27001, SOC 2, etc.).
  • Collaboration & Mentorship – Work closely with developers, DevOps, Support, and product teams to enhance application reliability and implement SRE best practices.

Requirements:

  • 5+ years in site reliability engineering, DevOps, or related roles.
  • Proven experience managing large-scale, cloud-based infrastructure in GCP, AWS, or Azure.
  • Expertise in containers and container orchestration (Docker, Kubernetes) and microservices architecture.
  • Strong proficiency in scripting and programming languages (Python, Go, Bash, etc.).
  • Experience with CI/CD pipelines, infrastructure as code (Terraform, CloudFormation), and configuration management (Ansible, Puppet, Chef).
  • Hands-on experience with monitoring and observability tools (Datadog, Prometheus, Grafana, ELK Stack).
  • Deep understanding of networking concepts, DNS, load balancing, and distributed systems.
  • Strong problem-solving skills, excellent communication, and a proactive mindset.

Advantages:

  • Certifications – AWS Certified Solutions Architect, GCP Professional Cloud Architect, or Kubernetes certifications (CKA, CKAD).

Why Join Us?

In this role, you will have the opportunity to work on cutting-edge technology, solve challenging problems, and make a tangible impact on the reliability and scalability of our systems. Join a team that values collaboration, innovation, and continuous learning, and be part of an exciting journey as we scale our platform to new heights!

 
