Trading, Research & ML

Machine Learning Performance Engineer

Remote | Full-time | Senior | Posted: April 22, 2026

About the Role

At Uncharted Network, our models retrain intraday and our inference must keep pace with the market. As a Machine Learning Performance Engineer, you will own the performance envelope of our entire ML stack, from batch training throughput to single-digit-millisecond inference latency on the critical execution path. This role demands a whole-systems mindset: you will profile GPU kernels down to the warp level, tune memory hierarchies, redesign storage access patterns, and optimise inter-node networking. If you enjoy closing the gap between theoretical FLOP throughput and actual goodput, you will find this environment uniquely satisfying.

Responsibilities

Profile and optimise training runs end to end: GPU utilisation, memory bandwidth, collective communication, and storage I/O
Develop custom CUDA kernels and Triton programs for performance-critical model components
Reduce inference latency on the live trading path through kernel fusion, quantisation, and computation graph optimisation
Investigate and tune the full hardware stack: NVLink, InfiniBand, PCIe topology, NUMA layout, and host-GPU transfer patterns
Work with ML Researchers and ML Engineers to co-design models with hardware performance constraints in mind
Benchmark and document performance gains with rigorous, reproducible methodology

Requirements

Deep practical knowledge of GPU architecture: warps, cooperative groups, memory hierarchy, and Tensor Core utilisation
Hands-on experience with CUDA, PTX/SASS, and profiling and debugging tools (Nsight Systems, Nsight Compute, CUDA-GDB)
Strong familiarity with ML frameworks at the C++/CUDA level (PyTorch internals, JAX/XLA)
Understanding of distributed training networking: NCCL, InfiniBand/RoCE, GPUDirect, and collective communication algorithms
Solid general programming skills in Python and C++
Ability to interrogate performance from first principles and communicate findings rigorously

Nice to Have

Experience with Triton or CUTLASS for custom kernel authoring
Knowledge of inference optimisation techniques: INT8/FP8 quantisation, speculative decoding, or batched attention
Background in low-latency systems engineering: networking, storage, and OS-level scheduling

What We Offer

Competitive UNT token allocation + fiat salary
Fully remote with async-first culture
Dedicated GPU cluster access — profile and optimise on real production workloads at scale
Top-tier hardware setup stipend
Annual performance-engineering conference and technical learning budget
