As heterogeneous parallel architectures grow more complex, both achieving high performance and teaching parallel programming effectively have become more challenging. Benchmark suites are powerful tools for illustrating and evaluating optimization techniques in practical, performance-critical scenarios. However, commonly used parallel benchmark suites (e.g., SPEC OMP and Rodinia) are designed primarily for performance assessment; they are not intended for performance optimization training or for instruction in parallel programming. Furthermore, they are often complex to configure and deploy, which limits their accessibility and practical utility for researchers and students. To address these limitations, this dissertation presents two novel benchmark suites, NeoRodinia and CUDAMicroBench, that support not only performance evaluation but also the exploration of optimization strategies. The suites are further augmented with educational features, such as integration with large language models (LLMs) for optimization guidance and interactive, browser-accessible execution environments.
NeoRodinia features a structured three-level parallelization model (P1, P2, P3) across CPU worksharing, GPU offloading, SIMD, and tasking. It provides standardized execution workflows, automated performance evaluation scripts, and visualization tools. It also integrates AI-assisted analysis, allowing LLMs to offer optimization recommendations and debugging insights. CUDAMicroBench is a modular microbenchmark suite targeting key GPU optimization challenges such as memory hierarchy usage, warp divergence, and concurrent kernel execution; it serves as a practical reference for GPU performance tuning.
In addition to these benchmark-based contributions, this dissertation advances parallel programming education by introducing the Interactive OpenMP Programming book. The book uses deliberate prompt engineering strategies to draw on large language models (ChatGPT-4, Gemini Pro 1.5, and Claude 3), improving the quality, relevance, and pedagogical value of the generated content. Delivered through a Jupyter-based environment, it enables real-time experimentation with OpenMP constructs, promoting hands-on learning and deeper understanding.
Collectively, these contributions form a unified educational infrastructure for modern parallel computing. By combining benchmarking, structured optimization guidance, and LLM-driven interactive learning, this work bridges performance engineering and pedagogy, providing a scalable and adaptable solution for educators and learners in today's heterogeneous HPC landscape.