About the position
An exciting opportunity exists for a Senior Compute Systems Engineer to provide hands-on technical leadership in the design, implementation, and long-term operation of secure, reliable, and high-performance compute and storage infrastructure. This role will guide infrastructure development, shape operational practices, and mentor team members, supporting the transition of large-scale telescope systems from construction to steady-state operations.
Key Responsibilities
- Lead the compute and storage systems team, contributing to strategic technical planning and infrastructure design
- Contribute to the global design and implementation of scalable, fault-tolerant infrastructure systems
- Deploy, configure, and maintain distributed storage and database systems
- Analyse system failures, performance issues, and misconfigurations across hardware, software, and network layers
- Drive long-term infrastructure planning to support reliable, sustainable operations
- Mentor engineers and collaborate across teams to align with Site Reliability Engineering (SRE) principles
Key Requirements
Qualification(s) required:
BTech, BEng/MTech, MEng or PhD in Computer Science, Software Engineering, Information Systems, Electronic Engineering, or equivalent
Experience (qualification-dependent):
- 13 years' relevant experience, coupled with a BTech, OR
- 9 years' relevant experience, coupled with a BEng/MTech, OR
- 7 years' relevant experience, coupled with a MEng, OR
- 5 years' relevant experience, coupled with a PhD
Proven hands-on experience required:
- At least 3 years in a technical leadership or architectural role overseeing distributed systems
- Infrastructure design and automation, observability, CI/CD, container orchestration (e.g., Kubernetes), and cloud-native technologies
- Leading teams or initiatives across data platforms, storage, networking, and systems engineering
Knowledge & Competencies required:
- Advanced Linux systems engineering, including troubleshooting, kernel tuning, and optimisation
- Expertise in containerisation (Docker, Podman), orchestration (Kubernetes, Helm), and microservices/cloud-native patterns
- Proficiency in infrastructure-as-code, CI/CD, and configuration management tools (e.g., GitLab CI, Terraform, Ansible, ArgoCD)
- Strong understanding of distributed storage systems (Ceph, S3, NFS, clustered filesystems)
- Operational fluency with relational and NoSQL databases (PostgreSQL, MySQL, MongoDB)
- Knowledge of observability stacks (Prometheus, Grafana, ELK/EFK) and networking fundamentals
- Exposure to HPC systems (e.g., SLURM, GPU/FPGA environments) is beneficial
- Demonstrated skills in technical delivery, Agile/DevOps practices, documentation, and cross-team collaboration
Skills & Attributes required:
- Technical leadership with the ability to influence design decisions and mentor team members
- Strong problem-solving and diagnostic ability, with a root-cause-first approach
- Excellent planning, backlog scoping, and Agile delivery capabilities
- Clear communication, knowledge sharing, and stakeholder engagement
- Commitment to continuous learning and staying current with emerging technologies
Desired Skills:
- Compute Systems
- Distributed Systems
- Kubernetes
- DevOps
- Linux Engineering
- Infrastructure Automation
- Technical Leadership
- Ceph
- Prometheus
- Grafana
- SAN
- Infrastructure Design
- Cloud Native