Welcome!
My name is Mike Wilkins, and I research optimizations for AI workloads on high-performance computing systems. As a Maria Goeppert Mayer Fellow at Argonne National Laboratory, I’m currently leading the development of a new holistic online autotuner. I previously completed my Ph.D. in Computer Engineering at Northwestern University and have industry experience at Cornelis Networks and Meta. I am open to collaboration opportunities, so please feel free to reach out with ideas or questions!
Experience
- Directed an independent research program on autotuning and collective communication, supported by a 3-year, $1M award from Argonne
- Translated my MPI autotuning research into production, achieving speedups up to 35x for collective operations on Argonne’s exascale system, Aurora
- Contributed major enhancements to MPICH, the leading open-source MPI implementation, with a focus on optimizing collective communication for high-performance computing environments
- Spearheaded major performance optimizations for the OPX libfabric provider, delivering a 5x bandwidth improvement for GPU communication along with other critical gains
- Led the architecture and development of the reference libfabric provider for the Ultra Ethernet Consortium, achieving a key milestone in the standard’s development
- Created OPX developer tools, including a profiler and autotuner, boosting team velocity
- Designed and implemented an application-aware communication (NCCL) autotuner for large-scale AI workloads (the first sketch after this list illustrates the core selection idea)
- Developed an AI application emulation tool that mimics production models by overlapping communication with generic compute kernels (the second sketch after this list illustrates the overlap)
- Founded a research project on machine learning for MPI collective algorithms, initially under the supervision of Dr. Min Si and Dr. Pavan Balaji, and later Dr. Yanfei Guo and Dr. Rajeev Thakur
- Earned external funding from ANL that supported the remainder of my Ph.D.
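To give a flavor of the autotuning work above, here is a minimal sketch of the decision loop at the heart of any collective autotuner: time each candidate algorithm for a given collective and message size, then keep the winner. This is an illustrative assumption rather than the production design; `algorithms` and `run_collective` are hypothetical stand-ins for a real MPI/NCCL tuning hook.

```python
import time

def pick_best_algorithm(algorithms, run_collective, msg_size, trials=5):
    """Benchmark every candidate collective algorithm at one message
    size and return the fastest one. Both `algorithms` and
    `run_collective` are hypothetical placeholders for a real
    MPI/NCCL tuning interface."""
    best_algo, best_time = None, float("inf")
    for algo in algorithms:
        start = time.perf_counter()
        for _ in range(trials):
            run_collective(algo, msg_size)  # e.g., one allreduce call
        elapsed = (time.perf_counter() - start) / trials
        if elapsed < best_time:
            best_algo, best_time = algo, elapsed
    return best_algo
```

A real online tuner must amortize this search across a running application and cache decisions per (collective, message size, communicator) key, which is where most of the research difficulty lies.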
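Similarly, the emulation tool relies on the observation that, for performance purposes, a training step is roughly a compute phase overlapped with communication. The toy sketch below assumes placeholder timings and uses a thread to stand in for the network; it is not the tool itself.

```python
import threading
import time

def emulate_step(compute_s, comm_s):
    """Emulate one training step: a generic compute burn runs
    concurrently with a fake communication delay, so the step time
    approaches max(compute, comm) when overlap is perfect."""
    def compute():
        deadline = time.perf_counter() + compute_s
        while time.perf_counter() < deadline:  # generic compute stand-in
            pass
    comm = threading.Thread(target=time.sleep, args=(comm_s,))
    start = time.perf_counter()
    comm.start()
    compute()
    comm.join()
    return time.perf_counter() - start

# With overlap, 0.15 s of "communication" hides behind 0.2 s of "compute".
print(f"step took {emulate_step(0.20, 0.15):.2f} s")
```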
Sample Research Projects
Here are high-level descriptions of some of my active and past research projects.
- Invented many optimizations to make ML-based MPI autotuning feasible on large-scale systems
- Developed the world’s first exascale-capable MPI collective algorithm autotuner and achieved up to 20% speedups for production applications
- Exploring new “holistic” tuning methodologies that encompass performance-critical parameters across the software stack, targeting large-scale AI workloads
- Created new generalized MPI collective algorithms that expose a tunable radix and outperform the previous best algorithms by up to 4.5x (see the radix sketch after this list)
- Exploring new generalized algorithms for GPU-specific collective communication (e.g., NCCL) and new abstractions (e.g., circulant graphs)
- Developed a new hardware/software co-design for the Standard ML language targeted at HPC systems and applications, including AI
- Created a new version of the NAS benchmark suite using MPL (a compiler for Parallel ML, a parallel extension of Standard ML) to enable direct comparison between high-level parallel languages (HLPLs) and lower-level languages for HPC
- Identified a low-level memory property called WARD in high-level parallel programs
- Implemented a custom cache coherence protocol in the Sniper architectural simulator and found an average speedup of 1.46x across the PBBS benchmark suite
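To make “tunable radix” concrete: the sketch below computes each rank’s parent and children in a textbook k-nomial broadcast tree (my illustrative assumption, not the generalized algorithms from this project). Radix k = 2 recovers the familiar binomial tree, while larger radices trade fewer communication rounds for more sends per rank.

```python
def knomial_bcast_schedule(rank, size, k):
    """Return (parent, children) for `rank` in a k-nomial broadcast
    tree rooted at rank 0; the radix k is the tunable knob."""
    assert k >= 2 and 0 <= rank < size
    parent, mask = None, 1
    while mask < size:  # find the round in which this rank receives
        if rank % (k * mask) != 0:
            parent = rank - rank % (k * mask)
            break
        mask *= k
    children = []
    mask //= k  # then fan out to every lower level
    while mask > 0:
        for j in range(1, k):
            if rank + j * mask < size:
                children.append(rank + j * mask)
        mask //= k
    return parent, children

# Radix 2 vs. radix 4 on 16 ranks, viewed from the root:
print(knomial_bcast_schedule(0, 16, 2))  # (None, [8, 4, 2, 1])
print(knomial_bcast_schedule(0, 16, 4))  # (None, [4, 8, 12, 1, 2, 3])
```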
Publications
Skills
Software/Scripting Languages
C, C++, Python, Standard/Parallel ML, C#, LabVIEW, Java, SQL, Bash
Parallel Programming/Communication
MPI, Libfabric, NCCL, CUDA, PyTorch, Parallel ML
Simulators/Tools
Sniper, gem5, ZSim, Xilinx Vivado, Xilinx ISE, Quartus II
Hardware Description Languages
Chisel, VHDL, Verilog, SPICE