The Cloud Persistence team, part of the Cloud Foundation group, delivers the mission-critical storage layer for AEM Cloud Service. We make sure content is always consistent, durable, and lightning-fast to access—so our customers can create and deliver personalized, high-quality experiences without a hitch.
We operate at serious scale, managing thousands of databases and storage accounts holding terabytes of content. We like solving big problems, we like doing it together, and we’re growing fast.
What You’ll Do
· Own the reliability, performance, and operational readiness of the storage components.
· Build and improve monitoring, alerting, dashboards, and on-call playbooks (Prometheus, Grafana, Splunk).
· Define, track, and improve SLIs/SLOs.
· Drive incident analysis and root-cause fixes, and push long-term reliability improvements.
· Partner with engineers to design systems that are observable, scalable, and easy to operate at thousands-of-clusters scale.
· Automate repetitive operational work.
What You’ll Need
· Bachelor’s or Master’s degree in Computer Science, or equivalent experience.
· 6+ years of SRE/production engineering experience (or a strong backend background with clear SRE depth).
· Strong experience with observability stacks: Prometheus, Grafana, alerting pipelines, log analysis (Splunk, ELK, etc.).
· Solid understanding of SLIs/SLOs, error budgets, system scaling, and incident management.
· Hands-on experience with Kubernetes and at least one major cloud (Azure preferred).
· Curiosity, willingness to learn, and a collaborative mindset—because we win as a team.
· Comfortable running and debugging distributed systems (Java/JVM familiarity is a plus).