Welcome!

My name is Mike Wilkins, and I am an expert in networking for high-performance distributed AI and scientific workloads. I combine my technical background with strong interpersonal communication and leadership skills to drive projects across team and organizational boundaries. At Cornelis Networks, I lead our GPU communication effort and mentor engineers on high-impact work across the stack; outside of my day job, I continue to advise graduate students, publish peer-reviewed research, and contribute to the open-source HPC ecosystem. I am open to collaboration opportunities, so please feel free to reach out with ideas or questions!

Experience

Senior Software Engineer

Sep 2025 – Present
Cornelis Networks
  • Project lead for GPU communication: architected and implemented a DMA-BUF-based NCCL/RCCL data path on the CN5000 fabric, doubling collective network bandwidth
  • Drive full-stack optimizations across the userspace and kernel layers of the CN5000 transport software, delivering order-of-magnitude performance gains across collective operations and state-of-the-art performance for HPC and AI applications
  • Created an AI-native engineering workflow with auto-deployable working directories, managed dependencies, and workflow-specific AI profiles, now adopted across the software organization to accelerate agentic development

Maria Goeppert Mayer Fellow

Oct 2024 – Sep 2025
Argonne National Laboratory
  • Directed an independent research program on autotuning and collective communication, supported by a $1M award from Argonne
  • Translated my MPI autotuning research into production, achieving speedups up to 35x for collective operations on Argonne’s exascale system, Aurora
  • Contributed upstream performance improvements to MPICH, the reference open-source MPI implementation, spanning collective algorithms and low-level networking paths

Software Engineer

Jan 2024 – Sep 2024
Cornelis Networks
  • Spearheaded performance optimizations for the OPX libfabric provider, delivering a 5x GPU communication bandwidth improvement alongside latency and scalability gains
  • Led the architecture and development of the reference libfabric provider for the Ultra Ethernet Consortium, achieving a key milestone in the standard’s development

AI Research Intern

May 2023 – Sep 2023
Meta
  • Prototyped an application-aware NCCL autotuner for large-scale AI workloads on NCCLX, Meta’s internal (now public) fork of NCCL
  • Built an AI workload emulation tool that reproduces production training communication patterns with generic compute kernels, improving observability and iteration speed without requiring full models or private user data
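The emulation idea above can be sketched at a high level: replay a recorded communication trace, substituting generic busy work for real model kernels. This is an illustrative toy, not Meta's tool; the trace format, operation names, and the `replay`/`collective_fn` interface are all hypothetical.

```python
import time

# Hypothetical trace of one training step: generic compute phases
# interleaved with collective communication of a given payload size.
trace = [
    ("compute_ms", 1),             # placeholder for a model layer
    ("allreduce_bytes", 4 << 20),  # ~4 MiB gradient allreduce
    ("compute_ms", 1),
    ("allgather_bytes", 1 << 20),
]

def replay(trace, collective_fn):
    """Walk the trace: burn wall-clock time for compute entries and hand
    communication entries to collective_fn (in a real run, a NCCL or MPI
    wrapper; here, any callable). Returns the communication calls issued."""
    issued = []
    for op, amount in trace:
        if op == "compute_ms":
            end = time.perf_counter() + amount / 1000.0
            while time.perf_counter() < end:
                pass  # spin loop standing in for a generic compute kernel
        else:
            collective_fn(op, amount)
            issued.append((op, amount))
    return issued

calls = replay(trace, lambda op, n: None)  # stub backend for a dry run
```

Because the compute is generic and the backend is pluggable, the same trace can be replayed without the original model or any production data.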

Research Aide / Visiting Student

Jun 2020 – Dec 2023
Argonne National Laboratory
  • Founded the research thread on machine learning for MPI collective algorithm selection as a contributor to the MPICH project
  • Earned continuous external funding from ANL for the remainder of my Ph.D.

Sample Research Projects

Here are high-level descriptions of some of my current and past research projects.

ML Autotuning for MPI

Ongoing
  • Invented a series of optimizations that make ML-based MPI autotuning feasible on large-scale systems
  • Developed the world’s first exascale-capable MPI collective algorithm autotuner and achieved up to 20% speedups for production applications
  • Exploring new “holistic” tuning methodologies to encompass performance-critical parameters across the software stack, targeting large scale AI workloads
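At its core, a collective autotuner can be viewed as a learned mapping from job features (message size, rank count, etc.) to the fastest algorithm. The sketch below uses a stdlib-only nearest-neighbor lookup with made-up benchmark data; it is an illustration of the idea, not the actual FACT/ACCLAiM models or MPICH tuning tables.

```python
import math

# Hypothetical benchmark results:
# (log2 message size, log2 ranks) -> fastest allreduce algorithm observed.
training = {
    (10, 4): "binomial_tree",
    (10, 10): "binomial_tree",
    (20, 4): "recursive_doubling",
    (20, 10): "ring",
    (24, 10): "ring",
}

def pick_algorithm(msg_bytes, ranks):
    """1-nearest-neighbor lookup in log space, standing in for a trained model."""
    query = (math.log2(msg_bytes), math.log2(ranks))
    def dist(point):
        return (point[0] - query[0]) ** 2 + (point[1] - query[1]) ** 2
    best = min(training, key=dist)
    return training[best]

print(pick_algorithm(512, 16))        # small message, few ranks
print(pick_algorithm(8 << 20, 1024))  # large message, many ranks
```

The practical challenges are everything around this lookup: collecting training data cheaply at scale, generalizing across systems, and keeping the model's inference off the critical path.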

Algorithms for Collective Communication

Ongoing
  • Created new generalized MPI collective algorithms that expose a tunable radix and outperform the previous best algorithms by up to 4.5x
  • Exploring new generalized algorithms for GPU-specific collective communication (e.g., NCCL) and new abstractions (e.g., circulant graphs)
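To illustrate what a "tunable radix" buys you, here is a minimal schedule generator for a radix-k dissemination-style allgather. It is a generic textbook pattern for demonstration, not the exact generalized or circulant-graph algorithms from my papers.

```python
def dissemination_schedule(p, k):
    """Peer offsets per round for a radix-k dissemination-style allgather
    over p ranks: in round r, each rank exchanges with peers at offsets
    j * k**r for j = 1 .. k-1. Raising k cuts the number of rounds
    (latency) at the cost of more messages per round (bandwidth), which
    is exactly the knob a tunable radix exposes."""
    schedule = []
    step = 1
    while step < p:
        schedule.append([j * step for j in range(1, k) if j * step < p])
        step *= k
    return schedule

print(dissemination_schedule(16, 2))  # radix 2: 4 rounds, 1 peer each
print(dissemination_schedule(16, 4))  # radix 4: 2 rounds, 3 peers each
```

An autotuner can then pick the radix per message size and scale, rather than committing to one point on the latency/bandwidth trade-off.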

High-Level Parallel Languages for HPC

2019 – 2023
  • Developed a new hardware/software co-design for the Standard ML language targeted at HPC systems and applications, including AI
  • Created a new version of the NAS benchmark suite using MPL (a parallel compiler for Standard ML) to enable direct comparison between HLPLs and lower-level languages for HPC

Cache Coherence for High-Level Parallel Languages

2019 – 2022
  • Identified a low-level memory property called WARD in high-level parallel programs
  • Implemented a custom cache coherence protocol in the Sniper architectural simulator and found an average speedup of 1.46x across the PBBS benchmark suite

Publications

  • Practical Machine Learning Autotuning For Large-Scale Collective Communication
    Michael Wilkins, Yanfei Guo, Rajeev Thakur, Peter Dinda, Nikos Hardavellas
    IEEE Transactions on Parallel and Distributed Systems, 2026
  • Generalized Collective Algorithms for the Exascale Era
    Michael Wilkins, Hanming Wang, Peizhi Liu, Bangyen Pham, Yanfei Guo, Rajeev Thakur, Nikos Hardavellas, Peter Dinda
    CLUSTER'23
  • Evaluating Functional Memory-Managed Parallel Languages for HPC using the NAS Parallel Benchmarks
    Michael Wilkins, Garrett Weil, Luke Arnold, Nikos Hardavellas, Peter Dinda
    HIPS'23 Workshop
  • WARDen: Specializing Cache Coherence for High-Level Parallel Languages
    Michael Wilkins, Sam Westrick, Vijay Kandiah, Alex Bernat, Brian Suchy, Enrico Armenio Deiana, Simone Campanoni, Umut Acar, Peter Dinda, Nikos Hardavellas
    CGO'23
  • Program State Element Characterization
    Enrico Deiana, Brian Suchy, Michael Wilkins, Brian Homerding, Tommy McMichen, Katarzyna Dunajewski, Nikos Hardavellas, Peter Dinda, Simone Campanoni
    CGO'23
  • ACCLAiM: Advancing the Practicality of MPI Collective Communication Autotuning Using Machine Learning
    Michael Wilkins, Yanfei Guo, Rajeev Thakur, Peter Dinda, Nikos Hardavellas
    CLUSTER'22
  • A FACT-Based Approach: Making Machine Learning Collective Autotuning Feasible on Exascale Systems
    Michael Wilkins, Yanfei Guo, Rajeev Thakur, Nikos Hardavellas, Peter Dinda, Min Si
    ExaMPI'21 Workshop

Skills

HPC & AI Networking

NCCL, RCCL, MPI, Libfabric, DMA-BUF, CUDA, PyTorch, Linux kernel networking

Languages

C, C++, Python, Bash, Standard/Parallel ML, SQL, Java

Architecture Simulators

Sniper, gem5, ZSim