#10. Optimization Methods

연구실 2019. 10. 10. 00:37

- 좋은 optimization을 사용하면 learning 속도도 빨라지고 cost function에 들어갈 결과 값도 좋은 값을 얻을 수 있다.

- ∂J/∂a=da

* Gradient Descent

- 가장 간단한 optimization 방법 중 하나이다.

- 모든 step에 모든 example들을 사용하게 되면 Batch Gradient Descent라고 한다.

- W[l] = W[l] - α*dW[l], b[l]=b[l] - α*db[l]

- Stochastic Gradient Descent: 한번에 하나의 training set만 사용해 gradient를 계산

- (Batch) Gradient Descent

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost += compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

- Stochastic Gradient Descent

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost += compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

- training set이 크면 SGD가 더 빠를수도 있지만, 부드럽게 수렴하지 않고 많이 진동하면서 수렴한다.

- Mini-batch는 몇몇개의 training set를 사용하는 방법이다.

* Mini-Batch Gradient descent

- training set(X, Y)에서 mini-batch를 만드는 방법

(1) Shuffle: examples will be split randomly into different mini-batches

(2) Partition: mini-batch들을 partition한다.

- 보통 2의 제곱수들을 mini-batch 사이즈로 고른다.

* Momentum

- mini-batch GD를 쓰면 example들의 일부만을 보고 계산을 하는 것이기 때문에 수렴 과정에서 많이 진동하게 된다. 이때 momentum을 쓰면 이러한 진동을 줄일 수 있다.

- 이전의 gradient를 고려해 update를 smooth시킨다. 이전의 gradient의 방향을 저장해놓고, 그것을 고려해 그 다음을 정한다.

- 빨간색 화살표가 mini-batch GD와 momentum을 함께 적용시킨 것이다. 파란색 점들은 gradient의 방향을 보여주고 있다.

- Momentum Update rule(L: number of layers, β: momentum, α: learning rate)

- β가 0이면 momentum이 없는 그냥 GD가 된다.

- 어떻게 β 값을 선택해야 하는가?

(1) β가 클수록 업데이트는 부드러워지는데 그 이유는 past gradient가 더 많이 고려되기 때문이다. 하지만 β가 너무 크게 되면 업데이트를 너무 부드럽게 만들 수도 있다.

(2) 보통 β는 0.8에서 0.999사이이다. 기본적으로 β=0.9를 사용한다.

(3) 자신의 모델에 맞추어 cost function J가 감소하는 방향으로 β 값을 조절해 나간다.

* Adam

- RMSProp과 Momentum을 함께 사용한 기법이다.

- 작동 방법:

(1) 이전 gradient의 exponentially weighted average를 계산해 저장해놓는다.

(2) 이전 gradient의 제곱의 exponentially weighted average를 계산해 저장해놓는다.

(3) (1)과 (2)를 함께 고려하여 parameter들을 업데이트시킨다.

- The update rule(t: num of steps taken of Adam, L: number of layers, β1 and β2: hyperparameters that control the two exponentially weighted averages, α: learning rate, ε: very small number)

- low memory requirement

- little tuning of hyperparameters로도 잘 작동한다.

* Model with different optimization algorithms

- Mini-batch GD

- Mini-batch GD with momentum

- Mini-batch with Adam

- Momentum usually helps, but 지금과 같은 작은 learning rate와 심플한 데이터셋으로는 임팩트가 크지 않다.

'연구실' 카테고리의 다른 글

#12. Convolutional Neural Networks: Step by Step (0)	2019.10.13
#11. Tensorflow Tutorial (0)	2019.10.10
#9. Gradient Checking (0)	2019.10.07
#7. Initialization (0)	2019.10.07
#6. Building your Deep Neural Network: Step by Step (0)	2019.10.04

ABOUT ME

ꉂꉂ(ᵔᗜᵔ) ꉂꉂ(ᵔᗜᵔ)

* Gradient Descent

* Mini-Batch Gradient descent

* Momentum

* Adam

* Model with different optimization algorithms

'연구실' 카테고리의 다른 글

티스토리툴바

ABOUT ME

* Gradient Descent

* Mini-Batch Gradient descent

* Momentum

* Adam

* Model with different optimization algorithms

'연구실' 카테고리의 다른 글

관련글 관련글 더보기

티스토리툴바