PW - Sr. SRE B. - Job3730

Multiple Countries
Full Time
Manager/Supervisor

Summary

We are looking for a seasoned Site Reliability Engineer (SRE) to join our team and support our strategy of driving products and technology to accelerate business growth. As an SRE, you will work alongside a team of problem solvers, helping to solve complex business issues from strategy to execution.

Responsibilities

  • Defining reliability and resilience standards for infrastructure and application components.
  • Proactively optimizing redundancies, monitoring practices, and alerting patterns.
  • Developing resilient and highly available distributed systems.
  • Building infrastructure as code tools for cloud environments.
  • Monitoring systems and services, providing incident response to triage and resolve system or client issues.
  • Managing the application ecosystem and improving the reliability, resiliency, performance, and quality of platform infrastructure and applications.
  • Creating documentation, knowledge articles, and runbooks.
  • Designing and implementing SRE patterns that adhere to our client's security guidelines and policies.

Requirements

  • Bachelor's degree in Computer Science or related field (or equivalent work experience).
  • At least 4 years of relevant working experience as a Site Reliability Engineer or in a similar role.
  • Advanced Kubernetes expertise - Strong skills in Kubernetes at scale using AKS, EKS, or GKE. Experience with kubectl and Helm. Familiarity with tools like Lens or Rancher.
  • Observability: experience in setting up tools like Datadog & Splunk for actionable insights on microservice environments including synthetics, application performance monitoring, logging, and alerting (PagerDuty/OpsGenie integrations).
  • CI/CD expertise - Experience using Azure DevOps and GitHub Actions for continuous integration and continuous deployment.
  • SCM proficiency - Working with tools like GitHub for source code management, along with experience in branching strategies like GitFlow or trunk-based development.
  • Strong troubleshooting skills - Ability to dive deep into code-level analysis to provide development teams with a head start on resolving application issues. Effective contribution to root cause analysis exercises.
  • Good communication skills - Active listening, verbal and non-verbal communication, clarity, concision, confidence, open-mindedness, and respect.
  • Good documentation skills - Ability to document automation and technical work clearly so that solutions can be easily adapted.
  • Collaboration skills - Ability to work effectively with Scrum/Dev teams using a push/pull philosophy, managing expectations and contributing to the stability and improvement of the platform.

Nice to Have

  • Infrastructure as Code tools (Terraform, Pulumi). Experience developing modules, rather than only using them, is preferred.
  • Security practices including encryption at rest/in transit with tools like Azure Key Vault, HashiCorp Vault, Google Cloud KMS.
  • Containerization experience deploying Java (Spring Boot) microservices in Docker environments.
  • Automation – Ability to identify toil and opportunities to reduce it within the team.
  • Authentication/Authorization – Familiarity with Authn/Authz schemes like OpenID, OAuth 2.0, SAML.
  • Scripting and Programming – Experience with Python, PowerShell, Java or Node.js.
  • Familiarity with event-driven/event sourcing architectures using platforms like Kafka, Azure Event Hubs, RabbitMQ and patterns like CQRS.