Welcome!
My name is Mike Wilkins, and I am an expert in networking for high-performance distributed AI and scientific workloads. I combine my technical background with strong interpersonal communication and leadership skills to drive projects across team and organizational boundaries. At Cornelis Networks, I lead our GPU communication effort and mentor engineers on high-impact work across the stack; outside of my day job, I continue to advise graduate students, publish peer-reviewed research, and contribute to the open-source HPC ecosystem. I am open to collaboration opportunities, so please feel free to reach out with ideas or questions!
Experience
- Project lead for GPU communication: architected and implemented a DMA-BUF-based NCCL/RCCL data path on the CN5000 fabric, doubling collective network bandwidth
- Drive full-stack optimizations across the userspace and kernel layers of the CN5000 transport software, delivering order-of-magnitude gains for collective operations and state-of-the-art performance for HPC and AI applications
- Created an AI-native engineering workflow with auto-deployable working directories, managed dependencies, and workflow-specific AI profiles, now adopted across the software organization to accelerate agentic development
- Directed an independent research program on autotuning and collective communication, supported by a $1M award from Argonne
- Translated my MPI autotuning research into production, achieving speedups up to 35x for collective operations on Argonne’s exascale system, Aurora
- Contributed upstream performance improvements to MPICH, the reference open-source MPI implementation, spanning collective algorithms and low-level networking paths
- Spearheaded performance optimizations for the OPX libfabric provider, delivering a 5x bandwidth improvement for GPU communication alongside latency and scalability gains
- Led the architecture and development of the reference libfabric provider for the Ultra Ethernet Consortium, achieving a key milestone in the standard’s development
- Prototyped an application-aware NCCL autotuner for large-scale AI workloads on NCCLX, Meta’s internal (now public) fork of NCCL
- Built an AI workload emulation tool that reproduces production training communication patterns with generic compute kernels, improving observability and iteration speed without requiring full models or private user data
- Founded the MPI collective algorithm/machine learning research thread as a contributor to the MPICH project
- Earned continuous external funding from Argonne for the remainder of my Ph.D.
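To give a flavor of what algorithm autotuning for collectives involves, here is a toy sketch of the selection step: given offline benchmark data binned by message size and communicator size, pick the fastest known algorithm and fall back to a safe default for unmeasured configurations. This is purely illustrative — the algorithm names, binning scheme, and timing numbers are hypothetical, not taken from MPICH or any production tuner.

```python
# Illustrative sketch of a collective-algorithm selection step.
# All names and numbers below are hypothetical examples.

def msg_bin(n):
    # Bucket sizes by power of two so nearby sizes share tuning data.
    return n.bit_length()

def comm_bin(n):
    return n.bit_length()

def select_algorithm(timings, msg_size, comm_size):
    """Pick the fastest measured algorithm for a (message, communicator) bin.

    timings: {(msg_bin, comm_bin): {algorithm_name: seconds}}
    Falls back to a default when the configuration was never benchmarked.
    """
    candidates = timings.get((msg_bin(msg_size), comm_bin(comm_size)))
    if not candidates:
        return "binomial_tree"  # safe default for untuned configurations
    return min(candidates, key=candidates.get)

# Hypothetical offline benchmark data for a 64-rank communicator:
timings = {
    (msg_bin(1 << 10), comm_bin(64)): {"binomial_tree": 1.2e-5,
                                       "recursive_doubling": 0.8e-5},
    (msg_bin(1 << 20), comm_bin(64)): {"ring": 3.1e-4,
                                       "recursive_doubling": 5.0e-4},
}

print(select_algorithm(timings, 1 << 10, 64))  # small-message choice
print(select_algorithm(timings, 1 << 20, 64))  # large-message choice
```

The real engineering challenge, and the subject of my research, is making this tractable at scale: the search space of algorithms and parameters is far too large to benchmark exhaustively, which is where ML-based modeling comes in.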
Sample Research Projects
Here is a high-level overview of some of my current and past research projects.
- Invented many optimizations to make ML-based MPI autotuning feasible on large-scale systems
- Developed the world’s first exascale-capable MPI collective algorithm autotuner and achieved up to 20% speedups for production applications
- Exploring new “holistic” tuning methodologies that encompass performance-critical parameters across the software stack, targeting large-scale AI workloads
- Created new generalized MPI collective algorithms that expose a tunable radix and outperform the previous best algorithms by up to 4.5x
- Exploring new generalized algorithms for GPU-specific collective communication (e.g., NCCL) and new abstractions (e.g., circulant graphs)
- Developed a new hardware/software co-design for the Standard ML language targeted at HPC systems and applications, including AI
- Created a new version of the NAS benchmark suite using MPL (a parallel compiler for Standard ML) to enable direct comparison between high-level parallel languages (HLPLs) and lower-level languages for HPC
- Identified a low-level memory property called WARD in high-level parallel programs
- Implemented a custom cache coherence protocol in the Sniper architectural simulator and found an average speedup of 1.46x across the PBBS benchmark suite
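As a small illustration of the tunable-radix idea, here is a toy sketch of a radix-k dissemination schedule (in the style of Bruck's allgather): raising the radix trades fewer communication rounds for more peers contacted per round. This is a simplified teaching example, not the generalized algorithms from my papers.

```python
# Toy sketch of a tunable-radix dissemination schedule (illustrative only).

def radix_steps(p, k):
    """Number of communication rounds for a radix-k dissemination collective
    over p ranks: the smallest s with k**s >= p."""
    steps = 0
    dist = 1
    while dist < p:
        dist *= k
        steps += 1
    return steps

def dissemination_peers(rank, p, k):
    """Per-round send-peer lists for one rank in a radix-k Bruck-style
    allgather: in each round, contact up to k-1 peers at multiples of the
    current distance, then grow the distance by a factor of k."""
    rounds = []
    dist = 1
    while dist < p:
        peers = [(rank + j * dist) % p for j in range(1, k) if j * dist < p]
        rounds.append(peers)
        dist *= k
    return rounds

# Radix 2 vs radix 4 on 8 ranks: 3 rounds of 1 peer vs 2 rounds of up to 3.
print(dissemination_peers(0, 8, 2))
print(dissemination_peers(0, 8, 4))
```

The tunable radix is exactly the kind of knob the autotuning work above exploits: the best choice depends on message size, rank count, and the network, so exposing it lets the tuner pick per configuration.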
Publications
IEEE Transactions on Parallel and Distributed Systems, 2026
Skills
HPC & AI Networking
NCCL, RCCL, MPI, Libfabric, DMA-BUF, CUDA, PyTorch, Linux kernel networking
Languages
C, C++, Python, Bash, Standard/Parallel ML, SQL, Java
Architecture Simulators
Sniper, gem5, ZSim