About the position
An opportunity exists for a Platform Engineer to contribute to the development, integration, and operation of shared platform services supporting large-scale scientific computing and complex software systems. Working within the Site Reliability Engineering (SRE) team, this role will focus on automation, observability, and operational readiness as the platform transitions from construction into steady-state operations.
Key Responsibilities
- Develop and enhance platform services to support engineering and operational teams
- Integrate platform services with application and infrastructure systems
- Contribute to automation, monitoring, and service reliability improvements
- Support operational readiness and continuous improvement efforts
- Collaborate with senior engineers and cross-functional teams to deliver resilient and scalable solutions
Minimum Requirements
Qualification(s) required:
- National Diploma, BTech, BEng/MTech, MEng, or PhD in Computer Science, Software Engineering, Information Systems, Electronic Engineering, or equivalent (qualification level aligned with experience requirements below)
Experience required (qualification-dependent):
Proven relevant experience in the field:
- 7 years' relevant experience, coupled with a National Diploma, OR
- 6 years' relevant experience, coupled with a BTech, OR
- 4 years' relevant experience, coupled with a BEng/MTech, OR
- 3 years' relevant experience, coupled with an MEng, OR
- 1 year's relevant experience, coupled with a PhD
Additional Experience required:
- Minimum 2 years' hands-on experience in infrastructure automation, distributed systems, observability, CI/CD, and container orchestration (e.g., Kubernetes)
- Experience working in teams across data platforms, storage, networking, and systems engineering
- Exposure to DevOps and SRE practices including monitoring, alerting, incident response, and resilience engineering
- Practical experience with infrastructure-as-code, deployment pipelines, and observability stacks
Knowledge & Competencies required:
- Solid understanding of distributed systems, service meshes, and microservices architectures
- Proficiency in containerisation and orchestration (Docker, Kubernetes, Helm)
- Strong Linux administration, troubleshooting, and scripting skills
- Familiarity with networking, security, and storage systems (object, block, distributed)
- Working knowledge of CI/CD tools (e.g., GitLab CI, Jenkins, ArgoCD)
- Exposure to cloud platforms (AWS, GCP, Azure, or OpenStack)
- Advantageous: knowledge of control systems, data acquisition, or scientific computing platforms
Skills & Attributes required:
- Problem-solving and root cause analysis
- Strong communication and collaboration skills across technical and non-technical stakeholders
- Agile delivery experience, including backlog grooming and sprint planning
- Ability to document technical solutions and share knowledge across teams
- Passion for continuous learning and engineering excellence
Desired Skills:
- Platform Engineering
- Kubernetes
- DevOps
- Linux Systems
- Infrastructure Automation
- Observability
- Cloud Computing
- CI/CD
- SRE
- Cloud Native
- Docker
- Helm
- Agile
- GPU
- FPGA
- AWS
- GCP
- Azure
- OpenStack