HE - SRE - New API pods - Job2382

Multiple Countries
Full Time
Manager/Supervisor

Summary

We are seeking a skilled Site Reliability Engineer (SRE) with 3–4 years of Azure Kubernetes experience and 7+ years in event-driven microservices to design, deploy, and operate scalable, secure AKS clusters. You’ll use GitHub Copilot to automate AKS provisioning, generate deployment templates, and optimize CI/CD pipelines, ensuring our healthcare platform (HQY) delivers high performance and reliability. This role is ideal for an SRE passionate about Kubernetes, Azure services, and security standards (e.g., ISO 27000, NIST 800), who thrives in fast-paced, remote environments and is excited to leverage GenAI for operational efficiency.

Responsibilities

As a Senior Site Reliability Engineer for the New API pods project, you will:

  • Lead the design, deployment, and ongoing operations of Azure Kubernetes Service (AKS) clusters that support event-driven microservices critical to our healthcare platform.
  • Utilize GitHub Copilot to automate AKS provisioning, generate deployment templates, and optimize CI/CD pipelines, enhancing operational efficiency and reducing manual overhead.
  • Develop and maintain infrastructure as code using Terraform and PowerShell scripts to automate deployments and manage cloud resources consistently and securely.
  • Build and maintain continuous integration and continuous delivery (CI/CD) pipelines using Azure DevOps, ensuring rapid, reliable, and repeatable software releases.
  • Implement and enforce security and compliance standards aligned with ISO 27000 and NIST 800 frameworks, safeguarding sensitive healthcare data and infrastructure.
  • Enhance system observability and monitoring by integrating Azure Event Hubs, Azure Application Insights, and Prometheus, enabling proactive incident detection and resolution.
  • Manage authentication and authorization mechanisms using OAuth2, Pod Security Policies, TLS, Managed Identities, and Service Principals to secure microservices and APIs.
  • Optimize cluster performance and resource utilization by configuring autoscalers and tuning Kubernetes components.
  • Support microservices deployment and traffic management using service mesh technologies such as Istio and Envoy, ensuring secure and reliable inter-service communication.
  • Collaborate closely with software engineering, security, and operations teams to align infrastructure capabilities with application requirements and business goals.
  • Leverage GenAI tools and automation to continuously improve operational workflows and reduce toil.
  • Participate in agile ceremonies and contribute to a culture of continuous improvement, knowledge sharing, and innovation.

Requirements

Must-Have Skills

  • Proven AKS Expertise: 3–4 years of hands-on experience designing, deploying, and operating Azure Kubernetes Service (AKS), with 7+ years in scalable, secure, event-driven microservices.

    • Strong hands-on knowledge of Istio, Kusto, Helm, and Envoy for service mesh and observability.
    • Proficiency in Azure services (e.g., App Service, Service Bus, Event Hubs, ACR) and serverless computing.
    • Experience with database technologies like Azure SQL Server, MongoDB, and PostgreSQL.
    • Expertise in REST APIs and Swagger/OpenAPI for API specification and integration.
    • Strong understanding of Agile, Scrum, or Kanban software development life cycles.
  • Automation & Tooling:

    • 2+ years building CI/CD pipelines with Jenkins, Terraform, and Ansible Playbooks, with expertise in writing YAML for AKS manifests and automation.
    • Experience with Git or SVN for code versioning and deployment automation.
    • Familiarity with ARM templates, PowerShell, or alternatives (CloudFormation, Ansible, Chef, Puppet) for infrastructure automation.
  • GenAI Proficiency: Hands-on experience with GitHub Copilot or similar GenAI tools to accelerate scripting (e.g., YAML, Terraform, PowerShell), debug AKS configurations, and generate observability queries or test data.

  • Security & Compliance: Deep knowledge of NIST, FedRAMP, CSA, and ISO 27000 standards, with experience implementing Pod Security Policies (PSP), node-to-node encryption, and HTTPS Ingress with TLS certificate management.

  • Observability: Expertise in App Insights, Prometheus, and Kusto for real-time monitoring and diagnostics.

  • Authentication/Authorization: Strong fundamentals in managed identities, service principals, and certificates for secure AKS access.

Nice-to-Have Skills

  • Experience with Azure Active Directory or other IAM platforms for Kubernetes authentication.
  • Familiarity with healthcare compliance (e.g., HIPAA) for secure data handling.
  • Knowledge of Microsoft AutoGen for agentic automation workflows.
  • Prior work with Azure Container Service (ACS) or Linux/Windows Kubernetes clusters.