PW - Sr. SRE B. - Job3730
Multiple Countries
Full Time
Manager/Supervisor
Summary
We are looking for a seasoned Site Reliability Engineer (SRE) to join our team and support our strategy of driving products and technology to accelerate business growth. As an SRE, you will work alongside a team of problem solvers, helping to solve complex business issues from strategy to execution.
Responsibilities
- Defining reliability and resilience standards for infrastructure and application components.
- Proactively optimizing redundancies, monitoring practices, and alerting patterns.
- Developing resilient and highly available distributed systems.
- Building infrastructure as code tools for cloud environments.
- Monitoring systems and services, providing incident response to triage and resolve system or client issues.
- Managing the application ecosystem and improving platform infrastructure and applications for high reliability, resiliency, performance, and quality.
- Creating documentation, knowledge articles, and runbooks.
- Designing and implementing SRE patterns that adhere to our client's security guidelines and policies.
Requirements
- Bachelor's degree in Computer Science or related field (or equivalent work experience).
- At least 4 years of relevant working experience as a Site Reliability Engineer or similar role.
- Advanced Kubernetes expertise - Strong skills in Kubernetes at scale using AKS, EKS, or GKE. Experience with kubectl and Helm. Familiarity with tools like Lens or Rancher.
- Observability: experience in setting up tools like Datadog & Splunk for actionable insights on microservice environments including synthetics, application performance monitoring, logging, and alerting (PagerDuty/OpsGenie integrations).
- Good CI/CD expertise - Experience using Azure DevOps and GitHub Actions for continuous integration and continuous deployment processes.
- SCM proficiency - Working with tools like GitHub for source code management, along with experience in branching strategies like GitFlow or trunk-based development.
- Strong troubleshooting skills - Ability to dive deep into code-level analysis to provide development teams with a head start on resolving application issues. Effective contribution to root cause analysis exercises.
- Good communication skills - Active listening, verbal and non-verbal communication, clarity, concision, confidence, open-mindedness, and respect.
- Good documentation skills - Ability to document automation and technical efforts clearly so that solutions can be easily adapted.
- Collaboration skills - Ability to work effectively with Scrum/Dev teams using a push/pull philosophy, managing expectations and contributing to the stability and improvement of the platform.
Nice to Have
- Infrastructure as Code tools (Terraform, Pulumi). Experience developing modules, rather than only using them, is preferred.
- Security practices including encryption at rest/in transit with tools like Azure Key Vault, HashiCorp Vault, Google Cloud KMS.
- Containerization experience deploying Java (Spring Boot) microservices in Docker environments.
- Automation – Ability to identify toil and opportunities to reduce it within the team.
- Authentication/Authorization – Familiarity with Authn/Authz schemes like OpenID, OAuth 2.0, SAML.
- Scripting and Programming – Experience with Python, PowerShell, Java, or Node.js.
- Familiarity with event-driven and event-sourcing architectures using platforms like Kafka, Azure Event Hubs, or RabbitMQ, and patterns like CQRS.