The SRE team is responsible for automating the processes for building and deploying software in our own DC and Azure, and for developing the scripts and software that the team's activities require. The SR Engineer will also handle operations related to monitoring SLA-critical production platforms, resolving issues, and performing manual interventions. All of this work is done in close cooperation with the software development teams.
We are looking for an SR Engineer to work on the company’s data traceability solutions for the Connectivity & IoT domain. These solutions provide our customers with end-to-end eSIM lifecycle management for consumer devices or for industrial and IoT use cases; they comply with GSMA specifications and go beyond them with a set of value-added services.
- Run the production environment by monitoring availability and taking a holistic view of system health in both Azure and private DCs.
- Recover platforms during production incidents to meet targeted SLOs; perform detailed root cause analysis to prevent regressions.
- Troubleshoot, evaluate and resolve operational challenges, and support escalations
- Maintain platforms after go-live by measuring and monitoring their availability, performance and overall system health.
- Scale systems through automation, improving change velocity and reliability
- Leverage technical skills to partner with team members and be comfortable diving into a problem as needed
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Participate in system design consulting, platform management, and capacity planning
- Work with feature teams on day-to-day design and development activities (e.g. review architectural changes and their impact on platform OA&M, challenge security decisions, provide feedback and propose improvements related to operational aspects of the applications)
- Develop auxiliary tools automating or simplifying platform Ops
- Take responsibility for platform availability, performance and overall system health; manage the platform’s error budget
- Provide technical expertise on company products and support processes to internal and external customers, including defining SLIs/SLOs acceptable to all involved parties
- Validate readiness and maturity of new rollouts through development, execution and verification of automated smoke test suites
- Experience in one or more of the following: Java, Python, Go, Perl, or shell scripting.
- Experience with Azure Cloud services
- Experience with Unix/Linux operating system internals and administration.
- Expertise in analyzing and troubleshooting large-scale distributed systems.
Nice to have:
- Experience with Kubernetes, Docker, Helm, Ansible, Kong, OpenStack, Puppet, and other cloud-based deployment tools and services.
- Ability to debug and optimize a variety of code, languages, and automation tools.
- Knowledge and experience designing and developing applications that take into account scalability, reliability, extensibility, etc.
- Test automation experience with either unit/integration or functional API testing harnessed in a continuous delivery tool.
- Experience in production environments supporting mission-critical applications.