Site Reliability Engineering Specialist Job Details

Req ID: 57936

Posting Start Date: 11/05/2026

Job Function: Software Engineering

Division: BT International

Job Location: IND-Bengaluru-Pritech

Advertised Salary: Competitive

At BT International, our purpose is to keep the world connected. As part of BT, we build on almost 180 years of innovation and expertise to deliver secure connectivity and digital services to some of the world’s leading multinational businesses and organisations. Our customers trust us to safeguard their data, drive their digital transformation and keep their businesses running. With colleagues on the ground across the world and supporting customers wherever they need to operate, BT International offers a truly global experience. Whether it’s about providing cloud connectivity, helping organisations collaborate, or enabling innovation in cybersecurity and digital services, you’ll be part of a team that shapes how businesses succeed in a world that is being transformed by AI. If you have the drive and ambition to make an impact on a global stage, BT International is where it happens.

About the role

As a Site Reliability Engineer (SRE) within the Network Operations team at BT International, you will be responsible for ensuring the reliability, resilience and performance of our Global Platforms, including Global Fabric. You will collaborate closely with Engineering, Product and Service teams to embed SRE principles such as automation, observability and proactive incident reduction into day-to-day operations. By improving how we monitor, maintain and evolve our services, you will help reduce risk, improve service quality and increase operational efficiency. Through this role, you will support BT International’s strategy by enabling stable, secure and scalable platforms that support business growth, accelerate delivery of new capabilities and protect customer experience.

What you’ll be doing

• Own the operational reliability, performance and resilience of the Global Fabric NaaS platform.
• Support and troubleshoot microservices, APIs and integrations across the NaaS ecosystem.
• Diagnose and resolve production issues across Kubernetes-hosted applications, Linux systems, networking, Kafka, APIs and service integrations.
• Support safe, automated change into production using CI/CD, GitOps and automated testing.
• Improve observability, monitoring and traceability across the platform using Dynatrace, Prometheus, Grafana, Elasticsearch and Kafka.
• Support BT’s move towards end-to-end tracing and service traceability by helping to implement and improve synthetic monitoring, tracing and service flow visibility.
• Participate in major incident resolution, root cause analysis and post-incident improvement activities.
• Manage incidents, problems and changes through ServiceNow and track defects and improvements in Jira.
• Drive automation through Ansible, Python, Bash or similar tooling to reduce manual effort and operational risk.
• Mentor and support L2 engineers by improving troubleshooting practices, runbooks and operational readiness.
• Build strong knowledge of the end-to-end customer journey and ensure operational decisions are aligned to customer impact.

Essential Skills / Experience

• Strong Linux and system administration experience, including server and compute management.
• Experience deploying, supporting and troubleshooting containerised applications in Kubernetes.
• Experience using monitoring tools such as Dynatrace, Prometheus, Grafana, Elasticsearch and Kafka.
• Experience supporting large-scale, high-availability services in an ISP, telecom, NaaS or network-centric environment.
• Experience with CI/CD, GitOps and safe production deployments.
• Experience with scripting and automation using Python, Bash, Ansible or similar.
• Growth mindset: a self-driven attitude towards learning new skills and aiding the development of others.

Desirable Skills / Experience

• In-depth knowledge of network protocols, including BGP, IS-IS and MPLS.
• Understanding of synthetic monitoring, telemetry and end-to-end service visibility.
• Experience of resilience, disaster recovery, chaos engineering or high availability testing.
• Ability to manage incidents through ServiceNow and to track defects and continuous improvements in Jira.
