Lecture: Nonconvex Optimization for Deep Learning
University of Tübingen, Wintersemester 2024/2025
Dr. Antonio Orvieto, ELLIS Principal Investigator at MPI.

TL;DR
Take this module if:
- You know what deep learning is, and that deep models are trained with variants of SGD;
- You are curious to understand what influences training speed and generalization in vision and language models;
- You like maths and want to know more about deep learning theory.
Time/Date (to be confirmed)
AI Center Building, TTR2, Tuesdays 10-12 (lecture) + 12-14 (tutorial). Tutorials start in the third week.
Basic Information
Note: This lecture does not overlap with "Convex and Nonconvex Optimization." While students are encouraged to take "Convex and Nonconvex Optimization" to solidify their understanding of SGD and basic optimization concepts (duality, interior point methods, constraints), here we will discuss optimization only in the context of training deep neural networks and will often drift into discussions of model design and initialization.
Successful training of deep learning models requires non-trivial optimization techniques. This course gives a formal introduction to the field of nonconvex optimization by discussing the training of large deep models. We will start with a recap of essential optimization concepts and then proceed to the convergence analysis of SGD in the general nonconvex smooth setting. Here, we will explain why a standard nonconvex optimization analysis cannot fully explain the training of neural networks. After discussing the properties of stationary points (e.g., saddle points and local minima), we will study the geometry of neural network landscapes; in particular, we will discuss the existence of "bad" local minima.
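To give a flavor of the kind of guarantee covered here (a standard result for smooth nonconvex optimization, stated informally and not specific to these lecture notes): if $f$ is $L$-smooth and bounded below by $f^\star$, gradient descent $x_{t+1} = x_t - \tfrac{1}{L}\nabla f(x_t)$ satisfies
\[
\min_{0 \le t < T} \|\nabla f(x_t)\|^2 \;\le\; \frac{2L\bigl(f(x_0) - f^\star\bigr)}{T},
\]
so some iterate is approximately stationary; the bound says nothing about whether that point is a good local (let alone global) minimum, which is one reason such analyses cannot fully explain neural network training.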
Next, to gain some insight into the training dynamics of SGD in deep networks, we will explore specific and insightful nonconvex toy problems, such as deep chains and matrix factorization/decomposition/sensing. These are to be considered warm-ups (primitives) for deep learning problems. We will then examine the training of standard deep neural networks and discuss the impact of initialization and (over)parametrization on optimization speed and generalization. We will also touch on the benefits of normalization and skip connections.
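As an illustration of the kind of toy problem meant here, the following is a minimal sketch (illustrative only, not taken from the lecture notes; sizes, seed, and step size are made up) of plain gradient descent on the matrix factorization objective ||U V^T - M||_F^2:

import numpy as np

# Toy nonconvex problem: recover a rank-r matrix M through the factorization U V^T.
# The loss ||U V^T - M||_F^2 is nonconvex in (U, V), yet gradient descent from a
# small random initialization typically drives it to (near) zero.
rng = np.random.default_rng(0)
d, r = 10, 2
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-r target

U = 0.1 * rng.standard_normal((d, r))   # small random initialization
V = 0.1 * rng.standard_normal((d, r))
eta = 0.01                              # step size (illustrative choice)

for step in range(3000):
    R = U @ V.T - M                     # residual
    grad_U = 2 * R @ V                  # gradient of the loss w.r.t. U
    grad_V = 2 * R.T @ U                # gradient of the loss w.r.t. V
    U -= eta * grad_U
    V -= eta * grad_V

print("final loss:", np.linalg.norm(U @ V.T - M) ** 2)

Despite the nonconvexity, this objective is known to have no spurious local minima when the factorization rank is at least rank(M) (only saddle points and global minima), which is part of what makes it a useful warm-up for deep learning landscapes.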
Finally, we will analyze adaptive methods like Adam and discuss their theoretical guarantees and performance on language models. If time permits, we will touch on advanced topics such as label noise, sharpness-aware minimization, neural tangent kernel (NTK), and maximal update parametrization (muP).
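For reference, the Adam update analyzed in this part is short enough to state in full. The sketch below follows the standard description of the algorithm (Kingma and Ba, 2015) with its usual default hyperparameters; the gradient oracle grad_fn, the starting point x0, and the small test function are placeholders chosen for illustration:

import numpy as np

def adam(grad_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Plain Adam on a deterministic gradient oracle (no minibatching)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # first-moment (mean) estimate
    v = np.zeros_like(x)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)       # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Tiny nonconvex test: f(x) = sum(x**4 - x**2), with minimizers near +-1/sqrt(2).
print(adam(lambda x: 4 * x ** 3 - 2 * x, x0=np.array([0.5, -1.5])))

The per-coordinate scaling by sqrt(v_hat) is what distinguishes Adam from plain SGD with momentum.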
Here are a few crucial papers discussed in the lecture (math will be greatly simplified): https://arxiv.org/abs/1605.07110, https://arxiv.org/pdf/1802.06509, https://arxiv.org/abs/1812.0795, https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf,
https://arxiv.org/abs/1502.01852, https://arxiv.org/pdf/2402.16788v1.
Requirements
The course requires some familiarity with deep learning and basic knowledge of gradient-based optimization. Students who have already attended "Convex and Nonconvex Optimization" or any machine learning lecture that discusses gradient descent will have no problem following the lecture. In general, the course requires good mathematical skills, roughly at the level of the lecture "Mathematics for Machine Learning"; in particular, multivariate calculus and linear algebra are needed.
Textbook
Lecture notes will be provided. A good reference that takes a similar approach is "Optimization for deep learning: theory and algorithms" by Ruoyu Sun, available at https://arxiv.org/pdf/1912.08957.
Link to all content (recordings, exercise sheets, etc.): https://drive.google.com/drive/folders/1ZasVKDEM02UiLAOviv9eZy8wROB9l0ck?usp=sharing
Exercises
Exercise sheets will be provided every week and discussed in the following session. Exercises are not graded, and no project is planned.
Exam
Written exam, 2h, closed book.
Schedule (work in progress)
Session | Date | Lecture | Tutorial
Session 0 | Oct 15 | Introduction and motivation | no tutorial
Session 1 | Oct 22 | Linear Regression; GD on quadratic models | no tutorial
Session 2 | Oct 29 | Neural Network Basics; Local Minima and Saddle Points; Noisy Gradients | Linear Algebra Recap
Session 3 | Nov 5 | Initialization of Deep MLPs | Batch size scaling laws
Session 4 | Nov 12 | Nonconvex Gradient Descent; Descent Lemma and Rates | Noisy Linear Models; Deep Chains
Session 5 | Nov 19 | Nonconvex SGD; Upper and Lower Bounds | Gradient Flow; PL Condition
Session 6 | Nov 26 | Overparametrization; Implicit Bias of SGD | Robbins–Monro Conditions
Session 7 | Dec 3 | Neural Tangent Kernel; Lazy Training | Matrix Sensing
NeurIPS - Canceled | Dec 10 | |
Session 8 | Dec 17 | Maximal Update Parametrization | Stochastic Matrices; Exam-like questions
Break - Canceled | Dec 24 | |
Break - Canceled | Dec 31 | |
Session 9 | Jan 7 | Optimization Challenges in CNNs; Batch Normalization and Skip Connections | Interpolated SGD Rates
Session 10 | Jan 14 | Optimization Challenges in Attention; Layer Normalization and Rank Collapse | Residual Connection Mechanics
Session 11 | Jan 21 | Adaptive Methods Theory 1 | Layer Normalization Mechanics
Session 12 | Jan 28 | Adaptive Methods Theory 2 | Polyak Stepsize
Final Session | Feb 4 | Exam tips | Adam non-convergence