- GPU kernel optimization and low-level performance engineering
- Model quantization and precision-efficient inference
An automated neural network optimization framework that treats model compression as a sequential decision-making problem. Uses Monte Carlo Tree Search to navigate the combinatorial space of compression configurations, discovering optimal quantization and pruning strategies without requiring gradient-based fine-tuning.
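A minimal sketch of what such a search loop could look like, with UCT selection over per-layer bit-width choices. Every name here (`candidate_actions`, `proxy_reward`, the action space, the reward shaping) is an illustrative placeholder, not the actual ModelOpt API:

```python
import math
import random

class Node:
    def __init__(self, config, parent=None):
        self.config = config          # partial compression configuration
        self.parent = parent
        self.children = {}            # action -> Node
        self.visits = 0
        self.value = 0.0              # running sum of rollout rewards

    def uct_child(self, c=1.4):
        # Standard UCT: exploit mean reward, explore rarely visited children.
        return max(
            self.children.values(),
            key=lambda n: n.value / n.visits
            + c * math.sqrt(math.log(self.visits) / n.visits),
        )

def candidate_actions(config, num_layers=4, bit_widths=(8, 4, 2)):
    # Placeholder action space: one quantization decision per remaining layer.
    layer = len(config)
    return [(layer, b) for b in bit_widths] if layer < num_layers else []

def proxy_reward(config):
    # Stand-in for a zero-shot accuracy/efficiency score of the compressed model.
    return sum(b for _, b in config) / (8.0 * max(len(config), 1)) + random.uniform(0, 0.1)

def search(iterations=200, num_layers=4):
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: follow UCT while the current node is fully expanded.
        while node.children and len(node.children) == len(
            candidate_actions(node.config, num_layers)
        ):
            node = node.uct_child()
        # 2. Expansion: add one untried child, if any action remains.
        untried = [
            a for a in candidate_actions(node.config, num_layers)
            if a not in node.children
        ]
        if untried:
            action = random.choice(untried)
            node.children[action] = Node(node.config + (action,), parent=node)
            node = node.children[action]
        # 3. Rollout: finish the configuration with random decisions.
        config = node.config
        while candidate_actions(config, num_layers):
            config += (random.choice(candidate_actions(config, num_layers)),)
        reward = proxy_reward(config)
        # 4. Backpropagation: push the reward up the selected path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path as the chosen configuration.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda n: n.visits)
    return node.config

print(search())
```

Because the reward comes from evaluating the already-compressed model, the loop needs no gradients; the tree statistics alone steer the search toward promising regions of the configuration space.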
Empirical analysis of composing quantization with early exit strategies for efficient inference. Investigates how reduced numerical precision interacts with adaptive computation depth—whether aggressive quantization degrades the confidence estimates that early exit relies on, and how to jointly optimize both techniques. Explores the Pareto frontier of latency, memory, and accuracy trade-offs.
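A toy PyTorch sketch of the early-exit mechanism under study. The `EarlyExitMLP` model, the 0.9 threshold, and the batch-size-1 assumption are hypothetical stand-ins for illustration; the one deliberate detail is computing the exit criterion in fp32, since low-bit quantization can flatten logits enough to distort max-probability confidence, which is exactly the interaction being investigated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitMLP(nn.Module):
    # A stack of blocks with a classifier head after each one; inference
    # stops at the first head whose softmax confidence clears `threshold`.
    def __init__(self, dim=64, num_classes=10, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)
        )
        self.heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(depth))

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for i, (block, head) in enumerate(zip(self.blocks, self.heads)):
            x = block(x)
            probs = F.softmax(head(x).float(), dim=-1)  # fp32 exit criterion
            conf, pred = probs.max(dim=-1)
            last = i == len(self.blocks) - 1
            if last or conf.item() >= threshold:  # batch size 1 assumed
                return pred, i                    # prediction and exit depth

model = EarlyExitMLP()
pred, depth = model(torch.randn(1, 64))
print(f"exited after block {depth} with class {pred.item()}")
```

Sweeping the threshold (and the quantization bit-width of the blocks) then traces out the latency/memory/accuracy Pareto frontier described above.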
Custom kernel implementations achieving an 8.8x RMSNorm speedup and 79-83% memory bandwidth utilization on distributed 8-GPU A100 systems. Experience porting optimizations across CUDA, ROCm/HIP, and Metal backends, with a focus on memory coalescing, warp-level primitives, and minimizing kernel launch overhead.
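For illustration, a simplified RMSNorm kernel in the Triton DSL in the same spirit: one program per row, coalesced masked loads, and the reduction accumulated in fp32. This is a sketch assuming each row fits in a single block (`n_cols <= BLOCK_SIZE`), not the benchmarked implementation:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load one full row per program; masked lanes read 0 and are ignored.
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)  # fp32 reduction
    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    y = x / rms * w
    tl.store(out_ptr + row * n_cols + cols, y.to(out_ptr.dtype.element_ty), mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    assert x.is_cuda and x.ndim == 2 and x.is_contiguous()
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    # One program per row; the block size is the next power of two >= n_cols.
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps,
                              BLOCK_SIZE=triton.next_power_of_2(n_cols))
    return out
```

Loading and storing in the tensor's native dtype while reducing in fp32 is one of the precision-versus-bandwidth trade-offs such kernels balance; fusing the normalization into a single pass is what removes the extra kernel launches and round trips through global memory.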
M.S. Information Systems — Northeastern University (December 2025)
Thesis: ModelOpt: Research Framework for Zero-Shot Computer Vision Model Optimization with Tree Search and Federated Knowledge Sharing
Advisor: Professor Handan Liu
GPU Programming: CUDA, ROCm/HIP, Metal, Triton
ML Frameworks: PyTorch, DeepSpeed, FSDP, Hugging Face Transformers
Quantization Tools: bitsandbytes, GPTQ, AWQ
Languages: Python, C++, CUDA C
Infrastructure: Distributed training, SLURM, multi-node clusters



