Senior HPC Infrastructure Engineer (Compute System)

Firmus Technologies · Australia Or Singapore, AU · 3 days ago

Role Summary
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.
You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Key Responsibilities
Own the end-to-end lifecycle of AI compute systems, including GPU compute, NVSwitch, and platform firmware (BIOS, GPU, NIC, and storage devices).
Define, maintain, and enforce supported firmware and driver compatibility matrices across hardware generations, operating systems, kernels, and AI software stacks.
Lead firmware qualification and regression testing to ensure updates do not introduce performance degradation, instability, or compatibility issues.
Investigate and remediate performance regressions caused by firmware, driver, or system-level changes, working closely with networking, storage, and HPC engineers.
Collaborate across the Software Defined Infrastructure (SDI) team to integrate firmware and performance checks into SDI tooling, enabling automated validation during provisioning, upgrades, and cluster bring-up.
Produce clear technical documentation, including firmware standards, validation reports, and benchmarking results, to support operational consistency and informed decision-making.
Collaborate with L2 SREs, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise the commissioning of large-scale AI GPU clusters.
Skills & Experience
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
Hands-on expertise with platform firmware and low-level system components, including BIOS, BMC, GPU firmware, NIC firmware, and storage devices.
Proven experience managing firmware and driver compatibility across operating systems, Linux kernels, and AI software stacks, with a disciplined approach to version control and validation.
Solid understanding of GPU architecture and interconnects, including PCIe, NVLink, and GPU-to-GPU communication patterns.
Demonstrated experience in performance benchmarking and validation using industry-standard and custom tools to measure GPU, compute, storage, and interconnect performance.
Strong Linux systems knowledge, including kernel behaviour, driver management, performance tuning, and troubleshooting at the OS and hardware boundary.
Experience diagnosing and resolving performance regressions related to firmware, drivers, or system-level changes in production or pre-production environments.
Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
Understanding of out-of-band management interfaces (BMC, IPMI, Redfish) and low-level system tuning.
Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
Excellent documentation skills with a high level of attention to detail.
Experience participating in an on-call rotation supporting production services.
Proactive self-starter with a drive for continuous technical improvement.
Key Competencies
Ability to understand AI compute platforms as end-to-end systems spanning hardware, firmware, operating systems, drivers, and workloads.
Ability to anticipate cross-layer impacts of changes and design solutions that optimise overall system performance and reliability.
Ability to proactively identify risks related to firmware upgrades and ensure compatibility through structured validation and rollback strategies.
Experience operating AI infrastructure at medium to large scale, with a focus on reliability, repeatability, and performance consistency.
Strong sense of ownership and accountability for system performance and reliability.
Comfortable operating in ambiguous, fast-evolving environments while driving continuous improvement.
Success Metrics
Reliable, automated firmware validation and upgrade systems and processes.
Performance validation and optimisation.
Improved operational efficiency.
High-quality documentation and effective knowledge transfer.
Location & Reporting
Sydney, NSW or Hobart/Launceston, TAS
Reporting to Senior Manager, Software Defined Infrastructure
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.

Headquarters

Australia Or Singapore

Work Location

on-site

Job Category

Engineering

Job Type

full-time

Experience Level

senior-level

Application Method

Apply via Website

