See all the jobs at InfraCloud Technologies here:
Full-time | Fully remote
QA Engineer - Job Description
Key Responsibilities
- Test product-specific use cases and validate end-to-end alerting workflows across monitoring systems.
- Simulate incidents and test scenarios that trigger alerts in tools like Datadog, Prometheus, or similar monitoring platforms.
- Verify that alerts raised in monitoring tools are correctly consumed and acted upon by downstream systems or automated workflows.
- Understand alert rules so test cases are easier to design, execute, debug, and maintain (alert configuration will be handled by Developers/SREs, but QA must understand them).
- Collaborate closely with engineering teams (Developers, SREs/DevOps) to improve detection, investigation, and automated incident response.
- Analyze alert behaviour, validate incident pipelines, and ensure seamless integration across all monitoring and automation tools.
- Identify gaps in monitoring, logging, and alert workflows and provide clear, actionable QA feedback.
- Document test scenarios, alert behaviour, and monitoring workflows in a clear and reproducible manner.
Mandatory Skills
- Monitoring Tools Expertise: Hands-on experience with at least one major monitoring system (Datadog or Prometheus), including working with alerts, dashboards, and troubleshooting.
- Alert Simulation & Validation: Ability to trigger, simulate, and validate alert events end-to-end.
- Incident Workflow Understanding: Strong understanding of how alerts propagate through monitoring systems and how automated systems respond to them.
- Automation Mindset: Ability to use or write simple scripts (Python, Shell, etc.) to simulate workloads or events that trigger alerts.
- Communication & Problem Solving: Ability to collaborate effectively with Developers and SRE/DevOps teams to ensure monitoring accuracy.
Good to Have
- Experience with automated incident investigation or remediation tools.
- Familiarity with CI/CD pipelines and integrating monitoring validation into pipelines.
- Understanding of observability fundamentals: metrics, logs, and traces.
- Exposure to infrastructure or SRE environments.
- Basic knowledge of Kubernetes, Docker, or cloud platforms (AWS/GCP/Azure).