Generalization in Deep Learning

2017, 논문링크 : https://arxiv.org/abs/1710.05468

저자 : Kenji Kawaguchi, Leslie Pack Kaelbling, Yoshua Bengio
초록 (Abstract)

This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve state-of-the-art performance on MNIST and CIFAR-10 benchmarks. Moreover, this paper presents both data-dependent and data-independent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research.

요약 (Summary)
- 딥러닝의 일반화(Generalization) 성능에 대해 Zhang et al. 등은
필요한 사전지식 (Preliminaries)
- Zhang et al. 논문 (ICLR, 2017)
- Rademacher complexity
- VC dimension
기호 (Notation)
- Input space : $x\in\mathcal{X}$
- Target space : $y\in\mathcal{Y}$
- Size of a training set : $m$
- Learning Algorithm : $\mathcal{A}$
- Training dataset :
  - $S_{m}$ 의 각 원소들은 i.i.d 샘플링 되었다고 가정
- Model learned by $\mathcal{A}$ with $S_{m}$ : $f_{\mathcal{A}(S_{m})} : x \mapsto f_{\mathcal{A}(S_{m})}(x)$
- Loss function : $\ell$
- Expected risk : $R[f] = \mathbb{E}_{x,y \sim P_{(x,y)} }[\ell(f(x),y)]$
- Empirical risk : $\hat{R}_{m}[f] = \frac{1}{m}\sum_{i=1}^{m}[\ell(f(x_{i}),y_{i})]$ where $(x_{i}, y_{i})\sim P_{(x,y)}$
- Hypothesis set (or model class) : $\mathcal{F}$
- Family of loss function : $\mathcal{L}_{\mathcal{F}} = \{ \ell(f(\cdot),\cdot) : f\in\mathcal{F} \}$
- Trainable parameters : $w\in\mathbb{R}^{n}$
- Pre-activation vector : $h^{(l)}(x)$
- Number of hidden layers : $H$
- Largest singular value of matrix $M$ : $\lambda_{\max }(M)$
- Spectral norm of matrix $M$ : $\Vert M \Vert$

Overview of Generalization theory

머신러닝 이론에서 학습의 목표는 expected risk $R[f_{\mathcal{A}(S_{m})}]$ 를 최소화하는 것이다. 그러나 expected risk 를 직접 계산하는 건 불가능하기 때문에 대신 empirical risk 인 $\hat{R}_{m}[f_{\mathcal{A}(S_{m})}]$ 을 최소화한다. 이를 empirical risk minimization (이하 ERM) 이라 부른다.

그런데 ERM 으로 원래 목표인 $R[f_{\mathcal{A}(S_{m})}]$ 최소화가 반드시 가능할까? 다시 말해 아래의 명제는 참인가?

$\text{minimize }\hat{R}_{m}[f_{\mathcal{A}(S_{m})}]\quad\overset{?}{\Longrightarrow}\quad \text{minimize }R[f_{\mathcal{A}(S_{m})}]$

사실 위 명제가 성립하지 않는 경우가 바로 오버피팅(overfitting) 이다. 그러므로 학습이두 risk 의 차이를 generalization gap 이라 하는데, 오버피팅을 논할 때 사용되는 generalization error 는 식 (1) 의 추정량(estimator)이 된다 :

$\text{generalization gap}:= R[f_{\mathcal{A}(S_{m})}] - \hat{R}_{m}[f_{\mathcal{A}(S_{m})}]\quad\cdots\quad(1)$

다시 말해 Generalization 이론의 목적은 식 (1) 의 generalization gap 을 분석함으로써 이를 규명하는게 목표다.

문제는 식 (1) 의 분석이 쉬운 일은 아니다. 왜냐하면 empirical risk 에서 학습된 모델과 $\hat{R}_{m}$ 둘 다 학습데이터 $S_{m}$ 에 의존하는 통계량이기 때문이다 :

$\hat{R}_{m}[f_{\mathcal{A}(S_{m})}] = \frac{1}{m}\sum_{i=1}^{m}\ell(f_{\mathcal{A}(S_{m})}(x_{i}),y_{i}),\quad (x_{i},y_{i})\sim P(x,y)$

이 부분을 돌파하기 위해 Generalization 이론에서는 아래와 같은 여러가지 분석 방법론을 개발해왔다

Model complexity approach
Stability approach
Robustness approach

Model complexity approach

Model complexity 를 사용해서 분석하는 방법론은 function space 로 정의되는 Model class $\mathcal{F}$ 를 상정함으로써 학습데이터 $S_{m}$ 과 $f_{\mathcal{A}(S_{m})}$ 을 분리시키는 것이다. 아래 부등식의 우변은 $S_{m}$ 과 무관하기 때문에 좌변과 달리 분석이 용이해진다 :

$R[f_{\mathcal{A}(S_{m})}]-\hat{R}_{m}(f_{\mathcal{A}(S_{m})}) \leq \sup_{f\in\mathcal{F}}\left(R[f]-\hat{R}_{m}[f]\right)$

물론 $\mathcal{F}$ 를 아무 제약 없는 임의의 함수공간으로 상정하는 것은 위 부등식의 우변을 무한대로 만들어버리기 때문에 쓸모가 없다. 따라서 $\mathcal{F}$ 를 model complexity 등으로 제한하는 게 필요한데 Rademacher complexity 나 VC dimension 을 도입한다. 가령 아래와 같은 결과는 위 부등식보다 좀 더 나은 결과이다

Theorem 1 (Mohri et al., 2012) For any $\delta > 0$

$\mathbb{P}\left(\sup_{f\in\mathcal{F}}\left(R[f]-\hat{R}_{m}[f]\right)\leq 2\mathfrak{R}_{m}(\mathcal{L}_{\mathcal{F}}) + \sqrt{\frac{|\ln \delta|}{2m}} \right) \geq 1-\delta$

where $\mathfrak{R}_{m}(\mathcal{L}_{\mathcal{F}})$ is the Rademacher complexity of $\mathcal{L}_{\mathcal{F}}$ .

Theorem 1 의 결과를 정리하면 $\mathfrak{R}_{m}(\mathcal{L}_{\mathcal{F}})$ 가 almost surely 상한(upper bound)이 된다는 의미이다

$\mathbb{P}\left(\sup_{f\in\mathcal{F}}\left(R[f]-\hat{R}_{m}[f]\right) \leq 2\mathfrak{R}_{m}(\mathcal{L}_{\mathcal{F}})\right) = 1$

상한이 $\mathfrak{R}_{m}(\mathcal{L}_{\mathcal{F}})$ 이란건, 우리가 적절한 model class $\mathcal{F}$ 를 선택할 수 있다면, 식 (1) generalization gap 을 항상 줄일 수 있다는 의미이다. 다시 말해 오버피팅이 일어날 일은 없다는 것이다. 문제는 딥러닝의 경우 $\mathfrak{R}_{m}(\mathcal{L}_{\mathcal{F}})$ 가 레이어의 수에 따라 지수적으로(exponentially) 증가한다는 점이다 (Sun et al., 2016; Neyshabur et al., 2015; Xie et al., 2015). 패러미터의 수에 대해선 선형적으로 증가한다는 결과도 있지만 (Shalev-Shwartz and Ben-David, 2014) 둘 다 의미가 약한 것은 마찬가지이다.

Stability approach

Rethinking Generalization in Machine Learning

2017년 Zhang et al. 은

figure1.1