Site Reliability Engineer

Bengaluru, Karnataka, India | Full-time | Partially remote


Experience: 4 to 8 years
Location: Bangalore (Hybrid)


We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, and operate scalable, reliable, and secure cloud-native platforms. The ideal candidate will have strong experience with Kubernetes ecosystems, cloud infrastructure, automation, observability, and GitOps practices.

Key Responsibilities

  • Manage and optimize Kubernetes-based platforms, including Cilium, Istio, Ingress Controllers, and related ecosystem components.
  • Design, deploy, and maintain infrastructure on Google Cloud Platform (GCP).
  • Automate infrastructure provisioning and lifecycle management using Terraform.
  • Implement and manage GitOps workflows using ArgoCD and GitLab.
  • Deploy and maintain Helm charts for Kubernetes applications.
  • Manage secrets and service discovery across distributed systems using Vault and Consul.
  • Build and maintain monitoring, logging, and observability platforms using Prometheus Operator and the Grafana Stack (Grafana, Mimir, Loki, Alloy, Tempo, and Pyroscope).
  • Collaborate with development teams to improve platform reliability, performance, scalability, and operational excellence.
  • Develop CI/CD pipelines and automation to support modern cloud-native deployments.

Required Skills

  • Strong hands-on experience with Kubernetes (K8s) and cloud-native technologies.
  • Experience with GCP, Terraform, Helm, and ArgoCD.
  • Knowledge of service mesh technologies, particularly Istio and Cilium.
  • Experience with Vault, Consul, and infrastructure security best practices.
  • Strong expertise in observability tools including Prometheus and the Grafana ecosystem.
  • Proficiency with GitOps, GitLab, CI/CD pipelines, and automation.
  • Good understanding of Linux systems, networking, and troubleshooting in distributed environments.

Preferred Qualifications

  • Experience operating large-scale production environments.
  • Knowledge of SRE principles, incident management, capacity planning, and reliability engineering.
  • Relevant cloud-native certifications (CKA, GCP, Terraform, etc.) are a plus.