See all the jobs at InfraCloud Technologies here:
Full-time | Fully remote
Position Summary
We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, automate, and operate scalable, secure, and highly available cloud-native platforms. The ideal candidate will have strong expertise in the Kubernetes ecosystem, Google Cloud Platform (GCP), Infrastructure as Code (Terraform), GitOps, observability, service mesh, and secrets management.
The SRE will work closely with Development, Platform Engineering, Security, and DevOps teams to ensure reliability, performance, scalability, and operational excellence across production environments.
Key Responsibilities
Kubernetes Platform Engineering
- Design, deploy, and manage large-scale Kubernetes clusters in production environments.
- Administer and optimize Kubernetes networking using:
  - Cilium
  - Istio Service Mesh
  - Kubernetes Ingress Controllers
- Build highly available and resilient container platforms.
- Implement cluster lifecycle management, upgrades, scaling, and capacity planning.
- Troubleshoot complex Kubernetes infrastructure and application issues.
Cloud Infrastructure (GCP)
- Design and operate cloud-native infrastructure on Google Cloud Platform.
- Manage services such as:
  - GKE (Google Kubernetes Engine)
  - VPC Networking
  - IAM
  - Cloud Load Balancers
  - Cloud Storage
  - Monitoring and Logging services
- Ensure security, scalability, and cost optimization of cloud environments.
- Implement multi-environment and multi-region deployment strategies.
Infrastructure as Code (Terraform)
- Develop and maintain reusable Terraform modules.
- Automate provisioning and management of cloud infrastructure.
- Implement infrastructure standards and governance.
- Maintain version-controlled infrastructure repositories.
- Ensure repeatable, auditable, and scalable infrastructure deployments.
Kubernetes Package Management (Helm)
- Create and maintain Helm charts for platform and application deployments.
- Standardize deployment practices across teams.
- Manage Helm repositories and release strategies.
- Support blue-green, canary, and rolling deployment methodologies.
GitOps & Continuous Delivery
- Build and maintain GitOps workflows using ArgoCD.
- Automate application deployment pipelines.
- Implement environment promotion strategies.
- Maintain deployment compliance and auditability.
- Drive CI/CD best practices across engineering teams.
Secrets & Service Discovery Management
- Manage secrets, certificates, and application credentials using Vault.
- Implement secure secret injection patterns for Kubernetes workloads.
- Configure and maintain Consul for service discovery and service networking.
- Establish access control and security policies for sensitive workloads.
Monitoring, Observability & Reliability Engineering
- Build comprehensive observability solutions using:
  - Prometheus
  - Prometheus Operator
  - Grafana
  - Loki
  - Tempo
  - Alloy
  - Mimir
  - Pyroscope
- Define and implement:
  - Service Level Indicators (SLIs)
  - Service Level Objectives (SLOs)
  - Error Budgets
- Create dashboards, alerts, and operational runbooks.
- Conduct root cause analysis (RCA) and postmortems.
- Improve system reliability, performance, and operational visibility.
Incident Response & Operations
- Participate in on-call rotations.
- Lead incident management during production outages.
- Troubleshoot infrastructure, networking, application, and platform issues.
- Develop automation to reduce operational toil.
- Create disaster recovery and business continuity procedures.
Automation & Platform Engineering
- Develop automation scripts and operational tooling.
- Improve platform self-service capabilities.
- Drive reliability engineering best practices.
- Eliminate manual operational processes through automation.
Required Technical Skills
Container & Kubernetes Ecosystem
- Kubernetes (Production-grade administration)
- Cilium
- Istio Service Mesh
- Kubernetes Ingress Controllers
- Container Networking
- Cluster Security and RBAC
Cloud Platforms
- Google Cloud Platform (GCP)
- GKE
- Cloud Networking
- IAM and Security Controls
Infrastructure as Code
- Terraform
- Infrastructure Automation
- Configuration Management Concepts
Deployment & GitOps
- ArgoCD
- GitOps Methodologies
- GitLab
- CI/CD Pipelines
Secrets & Service Networking
- HashiCorp Vault
- Consul
Monitoring & Observability
- Prometheus
- Prometheus Operator
- Grafana
- Loki
- Tempo
- Alloy
- Mimir
- Pyroscope
Operating Systems & Networking
- Linux Administration
- TCP/IP
- DNS
- Load Balancing
- SSL/TLS
- Network Troubleshooting
Preferred Qualifications
- Experience managing large-scale Kubernetes platforms.
- Experience supporting mission-critical production systems.
- Strong understanding of distributed systems concepts.
- Knowledge of cloud security best practices.
- Experience implementing SRE principles such as:
  - SLI/SLO/Error Budgets
  - Capacity Planning
  - Incident Management
  - Reliability Engineering
- Experience with multi-cluster Kubernetes environments.
- Relevant certifications such as:
  - Certified Kubernetes Administrator (CKA)
  - Certified Kubernetes Security Specialist (CKS)
  - Google Cloud Professional Certifications
  - HashiCorp Terraform Associate
Experience
- 5–10+ years of overall infrastructure/platform engineering experience.
- 3–5+ years of hands-on Kubernetes production experience.
- Strong experience in cloud-native platforms, observability, automation, and GitOps-driven operations.