About the position
Job Description
Design, build and maintain scalable data pipelines and ETL workflows to ingest and transform data for analytics and reporting.
Implement and optimize data storage solutions including data lakes and data warehouses on cloud platforms.
Develop PySpark and Python applications for large-scale data processing and transformations (a brief illustrative sketch follows this list).
Ensure data quality, consistency and integrity through testing, validation and the use of data quality tools.
Collaborate with stakeholders to translate business requirements into technical specifications and data models.
Propose and review system and solution designs and evaluate technical alternatives.
Maintain and operate cloud infrastructure and CI/CD pipelines for data platform components.
Create and maintain technical documentation, runbooks and artefacts for developed solutions.
Support production troubleshooting, monitoring and incident management for data services.
Work closely with BI teams to prepare and optimize data for reporting tools such as SAP BusinessObjects or Tableau.
Coach and support fellow engineers, and help improve team capability through knowledge sharing and training.
Participate in Agile ceremonies and contribute to continuous improvement of delivery processes.
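To give a concrete flavour of the PySpark responsibilities above, here is a minimal batch ETL sketch: it reads raw JSON from a landing zone, cleans and types the data, and writes partitioned Parquet for downstream analytics. All bucket names, paths and column names are hypothetical; real pipelines in this role would follow the client's own conventions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-daily-etl").getOrCreate()

# Ingest: raw events landed by an upstream process (hypothetical path).
raw = spark.read.json("s3://example-landing-zone/orders/2024-01-01/")

# Transform: drop malformed rows, normalise types, derive a partition column.
curated = (
    raw.filter(F.col("order_id").isNotNull())
       .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
       .withColumn("order_date", F.to_date("created_at"))
)

# Load: columnar, partitioned output suited to Athena/Glue-style querying.
(curated.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3://example-curated-zone/orders/"))

spark.stop()
```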
Minimum Requirements:
Qualifications/Experience:
A minimum of 3-5 years’ experience as a data engineer, with demonstrated hands-on work in Python, PySpark and cloud data services (AWS and/or OCI).
Relevant IT/Computer Science/Engineering degree or equivalent proven experience; advanced degrees advantageous.
Certifications such as AWS Certified Cloud Practitioner, Oracle Cloud certifications or other relevant cloud/data engineering certifications preferred.
Essential Skills Requirements:
Strong experience with Python 3.x and PySpark for developing data processing jobs.
At least 3 years’ experience with AWS services commonly used by data engineers, such as Athena, Glue, Lambda, S3 and ECS.
Hands-on experience with NoSQL databases such as DynamoDB and relational databases (Oracle/PostgreSQL) including strong Oracle SQL skills.
Experience with Oracle Cloud Infrastructure (OCI) services and tooling for databases, storage, and data processing.
Expertise in data formats and schema design, including Parquet, AVRO, JSON, XML and CSV, and hands-on technical data modelling (not drag-and-drop tooling).
ETL and data pipeline development experience, including building pipelines with AWS Glue or similar platforms.
Experience with containerization and orchestration technologies such as Docker (Kubernetes/OpenShift advantageous).
Proficiency with scripting for automation (Bash, PowerShell) and familiarity with Linux/Unix environments.
Experience with data quality tooling (e.g., Great Expectations) and performing thorough data testing and validation (see the sketch after this list).
Familiarity with infrastructure-as-code and DevOps tooling such as Terraform, CloudFormation, Git, Jenkins and CI/CD pipelines.
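As context for the data quality requirement above, the following is a minimal hand-rolled quality gate in PySpark. Dedicated tools such as Great Expectations provide richer declarative expectation suites; this sketch only illustrates the kinds of checks involved, and the path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-quality-gate").getOrCreate()
df = spark.read.parquet("s3://example-curated-zone/orders/")

total = df.count()
failures = {
    # Completeness: key columns must be populated.
    "null_order_id": df.filter(F.col("order_id").isNull()).count(),
    # Uniqueness: order_id must not repeat.
    "duplicate_order_id": total - df.select("order_id").distinct().count(),
    # Validity: amounts must be non-negative.
    "negative_amount": df.filter(F.col("amount") < 0).count(),
}

failed = {name: count for name, count in failures.items() if count > 0}
if failed:
    # Fail fast so a bad batch never reaches downstream reporting.
    raise ValueError(f"Data quality gate failed: {failed}")
print(f"All checks passed on {total} rows.")
```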
Advantageous Skills Requirements:
Knowledge of streaming technologies such as Kafka or AWS Kinesis for real-time data ingestion (see the sketch after this list).
Experience with AWS Redshift, EMR and other analytics/warehouse technologies.
Familiarity with our client's Cloud Data Hub (CDH) or similar organizational cloud data blueprints.
Java / JEE experience and understanding of Java application servers.
Experience with monitoring and observability tools such as CloudWatch and Grafana.
AWS solution architecture experience and related certifications (e.g., AWS Certified Solutions Architect – Associate).
Familiarity with REST APIs and building integrations with external systems.
Experience with schema design for BI and data warehousing, and preparing specifications for development.
Experience with MongoDB or other NoSQL stores.
Familiarity with Agile/Scrum delivery models and working within cross-functional teams.
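As an illustration of the streaming skills listed above, here is a minimal producer sketch that pushes a JSON event onto an AWS Kinesis stream with boto3. The stream name, region and event shape are assumptions for illustration only.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # hypothetical region

event = {"order_id": "o-123", "amount": "19.99", "created_at": "2024-01-01T12:00:00Z"}

# PartitionKey controls shard placement; keying on order_id keeps all
# events for one order in sequence on the same shard.
kinesis.put_record(
    StreamName="example-orders-stream",  # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["order_id"],
)
```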
Desired Skills:
- Python 3.x and PySpark
- AWS services
- NoSQL databases