Senior Site Reliability Engineer, Observability
R5590
Location: Mumbai
Career Track: Technology
This role is eligible for our hybrid work model: two days in-office.
Our Technology team is the backbone of our company: constantly creating, testing, learning and iterating to better meet the needs of our customers. If you thrive in a fast-paced, ideas-led environment, you’re in the right place.
Why this job’s a big deal:
As Priceline continues to scale globally, reliable production visibility is critical to delivering seamless customer experiences. We are investing in strengthening our observability foundations to improve detection, diagnosis, and overall system reliability.
This role plays a key part in maturing our observability capabilities—standardizing instrumentation, improving telemetry quality, and enabling faster root cause analysis that directly impacts MTTR and MTTD.
In this role you will get to:
- Support and evolve end-to-end observability solutions for collecting, shipping, storing, and querying OpenTelemetry signals (metrics, logs, and traces) across infrastructure, containers, and Kubernetes environments, while influencing architectural decisions for scalability and long-term sustainability.
- Administer and operate core observability platforms (Splunk, New Relic, ClickHouse, Grafana, Lightrun), including onboarding, access management, configuration, and upgrades, while ensuring platform reliability, performance, and adherence to SLAs.
- Drive the adoption and standardization of instrumentation practices across services, establishing consistent logging, metrics, and distributed tracing standards, schemas, and conventions.
- Partner with product, platform, and engineering teams to enhance production visibility, support SLO-driven reliability practices, and act as a subject matter expert for observability.
- Optimize telemetry pipelines for performance, data quality, scalability, and cost efficiency, including implementing strategies such as sampling, filtering, and data lifecycle management.
- Define and support observability governance standards, driving consistency and adoption through documentation, tooling, and enablement.
- Lead complex incident investigations and postmortems, identifying observability gaps and driving improvements to reduce MTTR and MTTD while improving alert quality and signal-to-noise ratio.
- Contribute to advancing the observability platform toward intelligent and AI-enabled capabilities, exploring MCP-based and other solutions to improve signal quality, incident triage, and operational efficiency.
Who You Are:
- Bachelor’s degree in Computer Science or equivalent practical experience.
- 7+ years of experience in Observability, SRE, DevOps, or platform engineering roles supporting production systems.
- Strong understanding of APM and SRE fundamentals, including MELT (Metrics, Events, Logs, Traces), latency analysis, error rate monitoring, service dependency mapping, SLIs/SLOs, alert tuning, and root cause analysis, with demonstrated application in large-scale distributed systems.
- Hands-on experience administering at least one modern observability/APM platform (e.g., Splunk, New Relic, Grafana), with practical exposure to metrics, logs, distributed tracing, and platform configuration. Experience supporting full-stack observability coverage across infrastructure, application, and browser monitoring layers, including operating platforms at scale.
- Experience building dashboards and actionable alerts, including configuring alert workflows and integrations with incident management tools such as PagerDuty. Experience implementing or supporting OpenTelemetry-based instrumentation and improving telemetry quality across services, with a focus on reducing alert fatigue and improving signal-to-noise ratio.
- Familiarity with Kubernetes and cloud-native environments, with an understanding of how applications are deployed, monitored, and scaled, including troubleshooting complex production issues in distributed environments.
- Experience managing telemetry pipelines and agents (e.g., collectors, forwarders, sidecars), including onboarding services, troubleshooting ingestion issues, and optimizing pipelines for scale and efficiency.
- Working knowledge of scripting or automation (e.g., Shell, Python) and CI/CD concepts. Experience or familiarity with infrastructure-as-code tools such as Terraform for managing platform configurations and integrations is a plus.
- Comfortable collaborating with engineering teams to improve monitoring standards, instrumentation quality, and overall production visibility, with a proven ability to influence.
- Ability to analyze trade-offs between observability depth, performance, and cost, and make recommendations aligned with business and engineering priorities.
- Experience leading or contributing to incident investigations and postmortems, identifying observability gaps and driving continuous improvement.
- Relevant certifications such as New Relic APM Professional, Reliability Engineer – Professional, Splunk Admin, or GCP Associate Cloud Engineer are a plus.
- Demonstrated history of living the values important to Priceline: Customer, Innovation, Team, Accountability and Trust.
- The Right Results, the Right Way is not just a motto at Priceline; it’s a way of life. It’s therefore essential that you also meet our high standard of ethics, honesty, transparency and compliance.