At BT International, our purpose is to keep the world connected. As part of BT, we build on almost 180 years of innovation and expertise to deliver secure connectivity and digital services to some of the world’s leading multinational businesses and organisations. Our customers trust us to safeguard their data, drive their digital transformation and keep their businesses running. With colleagues on the ground across the world and supporting customers wherever they need to operate, BT International offers a truly global experience. Whether it’s about providing cloud connectivity, helping organisations collaborate, or enabling innovation in cybersecurity and digital services, you’ll be part of a team that shapes how businesses succeed in a world that is being transformed by AI. If you have the drive and ambition to make an impact on a global stage, BT International is where it happens.
About the role
As a Site Reliability Engineer (SRE) within the Network Operations team at BT International, you will be responsible for ensuring the reliability, resilience and performance of our Global Platforms, including Global Fabric. You will collaborate closely with Engineering, Product and Service teams to embed SRE principles such as automation, observability and proactive incident reduction into day-to-day operations. By improving how we monitor, maintain and evolve our services, you will help reduce risk, improve service quality and increase operational efficiency. Through this role, you will support BT International’s strategy by enabling stable, secure and scalable platforms that support business growth, accelerate delivery of new capabilities and protect customer experience.
What you’ll be doing
• Own the operational reliability, performance and resilience of the Global Fabric NaaS platform.
• Support and troubleshoot microservices, APIs and integrations across the NaaS ecosystem.
• Diagnose and resolve production issues across Kubernetes-hosted applications, Linux systems, networking, Kafka, APIs and service integrations.
• Support safe, automated change into production using CI/CD, GitOps and automated testing.
• Improve observability, monitoring and traceability across the platform using Dynatrace, Prometheus, Grafana, Elasticsearch and Kafka.
• Support BT’s move towards end-to-end tracing and service traceability by helping to implement and improve synthetic monitoring, tracing and service flow visibility.
• Participate in major incident resolution, root cause analysis and post-incident improvement activities.
• Manage incidents, problems and changes through ServiceNow and track defects and improvements in Jira.
• Drive automation through Ansible, Python, Bash or similar tooling to reduce manual effort and operational risk.
• Mentor and support L2 engineers by improving troubleshooting practices, runbooks and operational readiness.
• Build strong knowledge of the end-to-end customer journey and ensure operational decisions are aligned to customer impact.
Essential Skills / Experience
• Strong Linux and system administration experience, including server and compute management.
• Experience deploying, supporting and troubleshooting containerised applications in Kubernetes.
• Experience using monitoring tools such as Dynatrace, Prometheus, Grafana, Elasticsearch and Kafka.
• Experience supporting large-scale, high-availability services in an ISP, telecom, NaaS or network-centric environment.
• Experience with CI/CD, GitOps and safe production deployments.
• Experience with scripting and automation using Python, Bash, Ansible or similar.
• Growth mindset: a self-driven attitude towards learning new skills and aiding the development of others.
Desirable Skills / Experience
• In-depth knowledge of network protocols, including BGP, IS-IS and MPLS.
• Understanding of synthetic monitoring, telemetry and end-to-end service visibility.
• Experience of resilience, disaster recovery, chaos engineering or high availability testing.
• Ability to manage incidents through ServiceNow and to track defects and continuous improvements in Jira.