
Course Description

During the past decades, the field of High-Performance Computing (HPC) has been about building supercomputers to solve some of the biggest challenges in science. HPC is where cutting-edge technology (GPUs, low latency interconnects, etc.) is applied to solve scientific and data-driven problems.

One of the key ingredients to the current success of AI is the ability to perform computations on vast amounts of training data. Today, applying HPC techniques to AI algorithms is a fundamental driver for the progress of Artificial Intelligence.

In this course, you will learn HPC techniques typically applied to supercomputing software and how they are applied to obtain the maximum performance from AI algorithms.

You will also learn about techniques for building efficient AI systems. This is especially becoming more critical in the era of large foundation models such as GPT and LLAMA that require massive amounts of computational power and energy.

This course will introduce efficient AI computing techniques for both training and inference. Topics include model compression, pruning, quantization, knowledge distillation, neural architecture search, data/model parallelism, and distributed training.

The course is based on PyTorch and CUDA programming.

Objectives

  • Use HPC techniques to find and solve performance bottlenecks
  • Measure and profile the performance of ML software
  • Evaluate the performance of different ML software stacks and hardware systems
  • Develop high-performance distributed AI algorithms for efficient training
  • Use fast math libraries, CUDA, and C++ to accelerate high-performance ML algorithms
  • Apply model compression techniques such as quantization, pruning, and knowledge distillation
  • Apply essential HPC techniques to handle large foundation models such as Large Language Models (LLMs)
  • Build efficient LLM inference and finetuning systems and algorithms: vLLM, FlashAttention, speculative decoding, LoRA/QLoRA, prompt tuning

For details see the Syllabus.

Prerequisites

  • General knowledge of computer architecture and operating systems
  • C/C++: intermediate programming skills
  • Python: intermediate programming skills
  • Good understanding of neural network algorithms (see the note below)

The course focuses on model performance rather than on the algorithms themselves, and a high-level review of the algorithms is included. However, it is strongly recommended that you come to the course with a good understanding of the following algorithms: logistic regression, feed-forward (basic) neural networks, convolutional neural networks, recurrent neural networks, and transformer architectures.

Course Information

Instructor: Dr. Kaoutar El Maghraoui
Adjunct Professor of Computer Science and Principal Research Scientist, IBM T.J. Watson Research Center, NY

TAs: Arnold Caleb Asiimwe, William Das, and Wookje Han

Office Hours for Project Proposals and Discussions
Wednesday – Prof. Kaoutar El Maghraoui

TA Office Hours
Tuesday – Wookje Han
Friday – William Das
Saturday – Arnold Caleb Asiimwe

Click the button for online office hours on the Google calendar

Course materials

The course does not follow a specific textbook; however, some books can be used as learning support. Pointers to literature/web links will be provided in class.

Introduction to High-Performance Computing for Scientists and Engineers

Authors: Georg Hager, Gerhard Wellein; Publisher: CRC Press

ISBN: 9781439811924

Introduction to High-Performance Scientific Computing (ONLINE)

Authors: Victor Eijkhout with Edmond Chow, Robert van de Geijn

Computer Architecture: A Quantitative Approach, 5th Edition

Authors: John Hennessy, David Patterson; Publisher: Morgan Kaufmann

ISBN: 9780123838728

Efficient Processing of Deep Neural Networks

Authors: Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer; Publisher: Morgan & Claypool Publishers

ISBN: 978-1681738352

Homework

There will be five to six homework assignments, mostly programming and experiments on GPUs. Assignments will be based on C/C++, Python, and PyTorch.

Grading: Homework (50%) + Final Project (30%) + Quizzes (15%) + Attendance & Participation (5%)
Late Homework Policy: Quizzes and project submissions must be submitted on time; zero credit will be given for late submissions.

Throughout the semester, each student has an allocation of 6 'late days.' These can be used only for homework submissions, allowing flexibility without penalty. However, once your total allowance of 6 late days is depleted, the following late submission penalties will apply:

  • Original Due Time: Assignments must be submitted on time for full credit.
  • Counting Late Days: Late days are counted in whole days; a new late day begins at 11:59 pm ET.
  • Penalty Post-Late Days Allowance: After exceeding the 6 late days allowance, 20% of the total marks will be deducted per additional late day, up to 5 days. Beyond that, the assignment will receive zero credit.

Course Project: Project proposals are due by the midterm.

Final presentations of all projects take place toward the end of the course.

Syllabus

Week 1: Introduction to HPC and AI

Introduction to HPC and ML

  • Course introduction and organization
  • HPC and ML technology; ML/DL success drivers
  • HPC for ML; hardware overview: CPUs, accelerators, high-speed networks
  • Software overview: algorithms, math libraries, frameworks

Week 2: AI performance optimization

AI performance optimization

  • Factors affecting ML performance; software performance optimization for ML
  • Performance optimization methodology: measurement, analysis, optimization
  • Measurement: metrics, benchmarking workloads, time/resources, throughput, time to accuracy (TTA), profiling, tracing
  • Analysis: Amdahl’s law, critical path, bottlenecks, data movement, locality principle, Roofline model; optimization in relation to the Roofline model
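
As a small illustration of the Week 2 analysis topics, the sketch below computes an Amdahl's-law speedup and a Roofline performance ceiling in plain Python. The peak-FLOP/s and bandwidth figures are made-up placeholders, not measurements of any particular machine.

```python
# Minimal sketch of two Week 2 analysis tools: Amdahl's law and the Roofline model.
# The hardware numbers below are illustrative placeholders, not real measurements.

def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime scales with workers."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

def roofline_ceiling(arithmetic_intensity: float,
                     peak_flops: float,
                     peak_bandwidth: float) -> float:
    """Attainable FLOP/s = min(compute roof, memory bandwidth * arithmetic intensity)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)

if __name__ == "__main__":
    # If 95% of the work parallelizes, speedup saturates near 20x no matter how many workers.
    for n in (8, 64, 1024):
        print(f"Amdahl speedup with {n} workers: {amdahl_speedup(0.95, n):.1f}x")

    # Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
    peak_flops, peak_bw = 100e12, 1e12
    for ai in (1, 10, 1000):  # FLOPs per byte moved
        ceiling = roofline_ceiling(ai, peak_flops, peak_bw)
        print(f"AI={ai:>4} FLOP/byte -> ceiling {ceiling / 1e12:.1f} TFLOP/s")
```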
Homework 1 out; Quiz 1 out

Week 3: Gradient Descent Optimization Algorithms and PyTorch

Gradient Descent Optimization Algorithms and PyTorch

  • PyTorch optimizers: momentum, Nesterov momentum, Adagrad, Adadelta, Adam
  • PyTorch multiprocessing: concurrency vs. parallelism, forking, spawning, shared memory
  • PyTorch data loading: DataLoader class, data prefetching, disk I/O performance; PyTorch CUDA
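
The minimal PyTorch sketch below ties the Week 3 topics together: an SGD optimizer with Nesterov momentum and a DataLoader with worker processes, prefetching, and pinned memory. The model, dataset, and all hyperparameter values are illustrative placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model; both are placeholders for whatever the assignment uses.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# DataLoader knobs discussed in Week 3: worker processes, prefetching, pinned host memory.
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2,      # spawn/fork worker processes for loading
    prefetch_factor=2,  # batches prefetched per worker
    pin_memory=True,    # page-locked buffers speed up host-to-GPU copies
)

# SGD with Nesterov momentum; Adagrad/Adadelta/Adam are drop-in alternatives.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)   # non_blocking pairs with pin_memory
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```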
Quiz 1 due

Week 4: PyTorch Performance

PyTorch Performance

  • Python performance: interpreter inner workings, CPython, memory management, dynamic typing
  • PyTorch performance: computation graph evaluation approach, JIT compilation, profiling, benchmarking
  • Declarative vs. imperative approaches to the computation graph; JIT compilation optimization
  • PyTorch profiling: cProfile/profile, profiling a PyTorch neural network, visualization
  • PyTorch benchmarking using the timeit module
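
The sketch below shows the Week 4 benchmarking and profiling flow in PyTorch, using the timeit-style torch.utils.benchmark.Timer and the built-in profiler. The matrix sizes and the sort key are arbitrary placeholders.

```python
import torch
import torch.utils.benchmark as benchmark
from torch.profiler import profile, ProfilerActivity

# Illustrative matmul operands; shapes and device choice are placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

# torch.utils.benchmark.Timer is a timeit-style helper that also handles
# warmup and CUDA synchronization, which plain timeit does not.
timer = benchmark.Timer(
    stmt="torch.mm(a, b)",
    globals={"torch": torch, "a": a, "b": b},
)
print(timer.timeit(100))  # run the statement 100 times and report timing statistics

# A quick profile of the same operation with the built-in profiler.
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=activities) as prof:
    torch.mm(a, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```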
Homework 1 due; Homework 2 out; Quiz 2 out

Week 5: CUDA Basics

CUDA Basics

  • Motivations for heterogeneous architectures
  • NVIDIA GPUs and CUDA: compute capability
  • CUDA compilation and runtime: CUDA runtime, CUDA driver, AoT and JIT compilation
  • CUDA programming model: grid, block, thread
Quiz 2 due

Week 6: CUDA Advanced Topics

CUDA Advanced Topics

  • Unified Virtual Memory (UVM); CUDA block and warp scheduling; CUDA streams
  • Matrix multiplication: simple and tiled implementations
  • CUDA memory access: global memory, shared memory, caches
  • NVIDIA deep learning SDK; cuDNN: APIs and descriptors
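
The CUDA lectures themselves work in C/C++, but as a small PyTorch-side illustration of CUDA streams, the sketch below issues work on a non-default stream and synchronizes with it. It assumes a CUDA-capable GPU; the tensor sizes are placeholders.

```python
import torch

# Requires a CUDA-capable GPU; tensor sizes are placeholders.
assert torch.cuda.is_available(), "this sketch needs a GPU"

x = torch.randn(4096, 4096, device="cuda")
side_stream = torch.cuda.Stream()

# Kernels issued inside this context go onto side_stream instead of the default stream,
# so independent work can overlap.
with torch.cuda.stream(side_stream):
    y = x @ x

# Make the default stream wait for side_stream before consuming y.
torch.cuda.current_stream().wait_stream(side_stream)
z = y.sum()
torch.cuda.synchronize()
print(z.item())
```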
Homework 2 due; Homework 3 out (CUDA)

Week 7: Efficient Training

Distributed Deep Learning Algorithms and PyTorch

  • Model, data, and hybrid parallelism; synchronous and asynchronous DDL
  • Stragglers and stale gradients; centralized and decentralized DDL
  • PyTorch DDL: modules for single- and multi-node distributed training, available collectives; All-Reduce algorithm; NCCL; efficient transformers
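
A minimal sketch of synchronous data-parallel training with PyTorch DistributedDataParallel, assuming it is launched with torchrun so the rank and world-size environment variables are set. The model, dataset, hyperparameters, and backend choice (NCCL on GPUs, Gloo on CPUs) are placeholders.

```python
# Minimal synchronous data-parallel sketch; launch with e.g.
#   torchrun --nproc_per_node=2 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(32, 10).to(device)
    # DDP all-reduces gradients across ranks (NCCL collectives on GPUs).
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)          # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()        # gradients are all-reduced here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```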
Quiz 3

Week 8: Efficient Inference

Sparsity, Model Pruning/Compression

  • Activation sparsity, weight sparsity, compression, sparse dataflow
  • Low-rank approximation, knowledge distillation
  • Distilled architectures in convolutional and recurrent networks
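
The sketch below illustrates weight sparsity via magnitude pruning with torch.nn.utils.prune; the layer, pruning amount, and the choice to make the pruning permanent are placeholder decisions.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

# Placeholder layer; in practice this would be a layer of a trained network.
layer = nn.Linear(128, 64)

# Unstructured L1 (magnitude) pruning: zero out the 50% smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# While the reparametrization is active, layer.weight = weight_orig * weight_mask.
sparsity = float((layer.weight == 0).float().mean())
print(f"weight sparsity: {sparsity:.2%}")

# Fold the mask into the weight tensor to make the pruning permanent.
prune.remove(layer, "weight")
```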
Homework 3 due (CUDA); Homework 4 out (DDL)

Week 9: Efficient Inference

Reduced Precision and Quantization

  • Determining bit-width; mixed and varying precision
  • Quantization: post-training quantization, static vs. dynamic quantization, quantization-aware training, graph-mode quantization, hardware-aware quantization
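
A minimal post-training dynamic-quantization sketch using torch.quantization.quantize_dynamic (also exposed under torch.ao.quantization in newer releases); the model and the set of quantized module types are placeholders.

```python
import io
import torch
from torch import nn

# Placeholder float32 model; in practice a trained network.
model_fp32 = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-training dynamic quantization: weights stored in int8, activations
# quantized on the fly at inference time. Only nn.Linear modules are converted here.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(model_int8(x).shape)

# Rough size comparison via serialized state_dicts.
def size_bytes(m):
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

print(f"fp32: {size_bytes(model_fp32)/1e3:.0f} KB, int8: {size_bytes(model_int8)/1e3:.0f} KB")
```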
Quiz 4

Week 10: Efficient Inference

Knowledge Distillation

  • Knowledge distillation; distilled architectures in convolutional and recurrent networks
  • Knowledge distillation in vision transformers
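
A minimal sketch of the standard softened-softmax distillation loss, assuming teacher and student logits are available; the temperature and loss weighting are placeholder hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Weighted sum of a soft (teacher-matching) term and the usual hard-label term."""
    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits standing in for teacher/student forward passes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
print(loss.item())
```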
Homework 4 due; Homework 5 out (quantization)

Week 11: Efficient Transformers and LLMs

Efficient Transformers and LLMs

  • Transformer basics, Encoder/Decoder architecture, KV Cache optimizations
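
A minimal single-head sketch of the KV-cache idea behind efficient decoding: keys and values for past tokens are stored and reused, so each new token only computes attention against the cache. The shapes and dimensions are placeholders; real implementations are multi-head and batched.

```python
import torch
import torch.nn.functional as F

d_model = 64  # placeholder head dimension

def decode_step(x_new, k_cache, v_cache, Wq, Wk, Wv):
    """One autoregressive step: attend the new token's query to cached + new keys/values."""
    q = x_new @ Wq                              # (1, d)
    k = x_new @ Wk                              # (1, d)
    v = x_new @ Wv                              # (1, d)
    k_cache = torch.cat([k_cache, k], dim=0)    # grow the cache instead of recomputing
    v_cache = torch.cat([v_cache, v], dim=0)
    scores = (q @ k_cache.T) / d_model ** 0.5   # (1, t)
    out = F.softmax(scores, dim=-1) @ v_cache   # (1, d)
    return out, k_cache, v_cache

# Toy usage: decode 5 tokens, reusing the cache at every step.
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
for _ in range(5):
    x_new = torch.randn(1, d_model)             # stands in for the new token's embedding
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache, Wq, Wk, Wv)
print(out.shape, k_cache.shape)                 # torch.Size([1, 64]) torch.Size([5, 64])
```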
Quiz 5

Week 12: Efficient LLM Deployment Systems

Efficient inference algorithms and systems for LLMs

  • vLLM, FlashAttention, speculative decoding
  • LoRA/QLoRA, adapter and prompt tuning, attention sparsity, and mixture of experts
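
A minimal LoRA-style sketch: a frozen linear layer is augmented with a trainable low-rank update B·A scaled by alpha/r. The rank, scaling, and layer sizes are placeholder values, and real LoRA/QLoRA implementations add dropout, weight merging, and quantized base weights.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # base weights stay frozen
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)    # update starts at zero, so output is unchanged
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Toy usage: only the low-rank adapter parameters are trainable.
layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} / total {total} parameters")
```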
Homework 5 due

Week 13

Thanksgiving holiday

Week 14: Neural Architecture Search

Designing Efficient DNNs with Neural Architecture Search

  • Improving efficiency in manual network design
  • Neural architecture search (NAS), hardware-aware NAS
  • Near-memory and in-memory processing; analog AI
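
A toy hardware-aware random-search sketch over a tiny MLP search space: candidates are sampled, rejected if they exceed a parameter budget standing in for a hardware constraint, and scored by loss on random data. Everything here (search space, budget, scoring) is an illustrative placeholder for real NAS methods.

```python
import random
import torch
from torch import nn

# Toy search space: hidden width and depth of a small MLP.
SEARCH_SPACE = {"width": [32, 64, 128, 256], "depth": [1, 2, 3]}
PARAM_BUDGET = 50_000          # stand-in for a hardware constraint (latency, memory, energy)

def build(width, depth, d_in=32, d_out=10):
    layers, d = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

def score(model, x, y):
    """Proxy score: loss on random 'validation' data (real NAS trains each candidate)."""
    with torch.no_grad():
        return nn.functional.cross_entropy(model(x), y).item()

x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
best = None
for _ in range(20):            # random search over 20 sampled architectures
    cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    model = build(**cfg)
    n_params = sum(p.numel() for p in model.parameters())
    if n_params > PARAM_BUDGET:  # hardware-aware rejection
        continue
    s = score(model, x, y)
    if best is None or s < best[0]:
        best = (s, cfg, n_params)
print("best config:", best)
```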
Quiz 6

Week 15 and Week 16: Final Project Presentations

Final Project Presentations

  • Project presentations; final project presentation due.