Optimove is a global marketing tech company, recognized as a Leader by Forrester and a Challenger by Gartner. We work with some of the world's most exciting brands, such as Sephora, Staples, and Entain, who love our thought-provoking combination of art and science. With a strong product, a proven business, and the DNA of a vibrant, fast-growing startup, we're on the cusp of our next growth spurt. It's the perfect time to join our team of ~450 thinkers and doers across NYC, LDN, TLV, and other locations, where 2 of every 3 managers were promoted from within. Growing your career with Optimove is basically guaranteed.
Are you passionate about ensuring system reliability, scalability, and performance? Do you thrive in a dynamic environment where automation and operational excellence are key?
Optimove is looking for a Site Reliability Engineer (SRE) to join our team and play a crucial role in designing, implementing, and maintaining our cloud-based infrastructure. In this role, you will collaborate across teams to drive automation, improve system resilience, and optimize performance while fostering a culture of reliability.
Responsibilities:
- System Reliability – Ensure high availability and performance of services through effective monitoring, incident management, and root cause analysis.
- Automation & Tooling – Develop and maintain automation for infrastructure provisioning, configuration management, and application deployment.
- Performance Optimization – Analyze and enhance system performance, including load balancing, caching, and database tuning. Conduct regular capacity planning.
- Incident Response & Troubleshooting – Lead incident response efforts, participate in on-call rotations, and troubleshoot complex infrastructure issues.
- Security & Compliance – Collaborate with security teams to implement best practices and ensure compliance with relevant standards (ISO 27001, SOC 2, etc.).
- Collaboration & Mentorship – Work closely with developers, DevOps, Support, and product teams to enhance application reliability and implement SRE best practices.
Requirements:
- 5+ years in site reliability engineering, DevOps, or related roles.
- Proven experience managing large-scale, cloud-based infrastructure on GCP, AWS, or Azure.
- Expertise in containers and container orchestration (Docker, Kubernetes) and microservices architecture.
- Strong proficiency in scripting and programming languages (Python, Go, Bash, etc.).
- Experience with CI/CD pipelines, infrastructure as code (Terraform, CloudFormation), and configuration management (Ansible, Puppet, Chef).
- Hands-on experience with monitoring and observability tools (Datadog, Prometheus, Grafana, ELK Stack).
- Deep understanding of networking concepts, DNS, load balancing, and distributed systems.
- Strong problem-solving skills, excellent communication, and a proactive mindset.
Advantages:
- Certifications – AWS Certified Solutions Architect, GCP Professional Cloud Architect, or Kubernetes certifications (CKA, CKAD).
Why Join Us?
In this role, you will have the opportunity to work on cutting-edge technology, solve challenging problems, and make a tangible impact on the reliability and scalability of our systems. Join a team that values collaboration, innovation, and continuous learning, and be part of an exciting journey as we scale our platform to new heights!