POSTING ACTIVE · REQ-2B290 · FY26.Q2

Deep Learning Performance Architect, CUTLASS DSL

NVIDIA
[ COMPANY ]
[ COMPENSATION RANGE · ANNUAL · BASE ]
Not Disclosed
§ 01 OVERVIEW

Are you passionate about programming languages, compiler technology, and GPU performance? Do you want to help shape the future of high-performance kernel development for AI? We are looking for outstanding engineers to build CUTLASS DSL, a Python-native language for GPU kernel development, along with the MLIR dialects and lowering passes behind it. In this role, you will also help accelerate kernel compilation while delivering performance comparable to CUTLASS C++, enabling efficient hardware-software co-design for NVIDIA's next generation of AI platforms.

What you'll be doing:

  • Design, develop, and optimize CUTLASS DSL, a Python-native language for high-performance GPU kernel development

  • Build and advance the MLIR dialects, lowering passes, and code generation flows that power the CUTLASS DSL stack

  • Drive innovations that improve kernel compilation speed while maintaining performance on par with CUTLASS C++

  • Collaborate closely with architecture, research, software product teams, and the open-source community to bring cutting-edge optimizations into real products

What we need to see:

  • MS, PhD, or equivalent experience in Computer Science, Software Engineering, or a related field

  • 2+ years of relevant work experience

  • Excellent programming skills in Python and strong proficiency in C++

  • Hands-on experience with DSLs, compilers, or code generation systems

  • Strong command of the MLIR/LLVM stack, including IR design and pass optimization

  • Strong communication skills and the ability to thrive in a highly collaborative environment

Ways to stand out from the crowd:

  • Deep understanding of the CUDA GPU programming model, GPU microarchitecture, and performance analysis and optimization techniques

  • Familiarity with key high-performance computing abstractions such as Layout, Tile, MMA, and TMA in the CuTe ecosystem

[ APPLICATION ROUTE ] WORKDAY · External ATS