About the position
Lead DevOps Engineer (GCP – Google Cloud)
Purpose of Role:
As the Lead SRE (GCP – Google Cloud), you will drive reliability and scalability across production environments by leading a high-performing SRE team and implementing robust monitoring, automation, and DevOps practices on Google Cloud Platform. You will ensure system uptime, efficiency, and performance while mentoring others and embedding a culture of engineering excellence.
Key Responsibilities & Outputs
- Lead and mentor a team of SRE engineers, promoting knowledge sharing and growth.
- Act as the technical authority on SRE practices for GCP, ensuring system reliability and uptime across environments.
- Oversee team workload distribution and manage stakeholder expectations.
- Champion and implement DevOps and SRE best practices with emphasis on automation and scalability.
- Drive monitoring and observability initiatives, leveraging tools like Grafana, Prometheus, and Stackdriver.
- Design, maintain and optimise CI/CD pipelines using GCP-native tools and industry standards.
- Troubleshoot complex production incidents, ensuring root cause analysis is completed and long-term fixes are implemented.
- Collaborate with cross-functional teams to ensure consistent platform performance.
- Apply Infrastructure as Code (IaC) principles using tools such as Terraform or Deployment Manager.
- Stay abreast of emerging technologies to continually evolve our clients’ tooling and architecture.
- Foster a proactive and blameless incident management culture.
Education
Education required to perform this role
- Degree or Diploma in Information Technology, Computer Science, or equivalent experience.
- Google Cloud certifications (e.g., Professional Cloud DevOps Engineer, Professional Cloud Architect) are highly advantageous.
Competencies
Experience required
- Minimum 3 years in a management/leadership capacity within SRE/DevOps teams.
- Strong experience working on GCP infrastructure and services.
- Experience with Kubernetes, Docker, and container orchestration at scale.
- Familiarity with incident management, post-mortem processes, and production monitoring tools.
- Hands-on experience with IaC tools such as Terraform, Ansible, or Deployment Manager.
- Experience working with CI/CD pipelines and automation tools.
- UNIX/Linux administration expertise.
- Familiarity with security, compliance, and cost optimisation on GCP.
Technical skills/knowledge
- Strong scripting and automation skills (Python, Bash or other shell scripting).
- Familiarity with configuration management (Chef, Puppet, or Ansible).
- Use of observability platforms (Grafana, Prometheus, Stackdriver, etc.).
- Deep understanding of system performance and reliability engineering.
Abilities / Behaviours
- Strong leadership and team development capabilities.
- High level of professionalism and ownership.
- Excellent communication and stakeholder management skills.
- Ability to manage priorities in high-pressure environments.
- Passion for continuous improvement and driving engineering excellence.
- Innovative thinker willing to challenge the status quo.
Desired Skills:
- GCP – Google Cloud
- Google Cloud Platform
- CI/CD pipelines