Backpropagation example

Contents

Slide 2


Error

[Figure: feed-forward network. Inputs 1.0 and 0.0 plus a bias of 1 feed two hidden nodes (weights 3.7, 3.7, -1.5 and 2.9, 2.9, -4.6), giving hidden values .90 and .17; these, plus a bias of 1, feed the output node (weights 4.5, -5.2, -2.0), giving output .76.]

Computed output: y = .76
Correct output: t = 1.0

How do we adjust the weights?
Slide 3


Key Concepts

Gradient descent (see the sketch after this list)
– error is a function of the weights
– we want to reduce the error
– gradient descent: move towards the error minimum
– compute gradient → get direction to the error minimum
– adjust weights towards direction of lower error
Back-propagation
– first adjust last set of weights
– propagate error back to each previous layer
– adjust their weights
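
A minimal sketch of gradient descent in code (the toy error function, starting point, and learning rate are my own illustrative choices, not from the slides):

# Gradient descent on a toy one-dimensional error function:
# error(w) = (w - 3)^2, whose minimum is at w = 3.
def error_gradient(w):
    return 2 * (w - 3)           # d/dw (w - 3)^2

w = 0.0                          # initial weight
mu = 0.1                         # learning rate
for step in range(50):
    w -= mu * error_gradient(w)  # step towards lower error
print(w)                         # close to the optimum 3.0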
Slide 4


Gradient Descent

[Figure: one-dimensional plot of error(λ) against λ; at the current λ the gradient is 1, pointing towards the optimal λ.]

Slide 5


Gradient Descent

[Figure: error surface over two weights; the gradient for w1 and the gradient for w2 combine into a single combined gradient leading from the current point towards the optimum.]

Slide 6


Gradient descent in weight space

From the book Machine Learning by Tom Mitchell.
http://profsite.um.ac.ir/~monsefi/machine-learning/pdf/Machine-Learning-Tom-Mitchell.pdf
Slide 7


Derivative of Sigmoid

Sigmoid

sigmoid(x) = 1 / (1 + e−x)

Reminder: quotient rule

(f/g)′ = (f′g − fg′) / g²

Derivative

d sigmoid(x)/dx = d/dx [1 / (1 + e−x)]
= [0 × (1 + e−x) − 1 × (−e−x)] / (1 + e−x)²
= [1 / (1 + e−x)] × [e−x / (1 + e−x)]
= [1 / (1 + e−x)] × [1 − 1 / (1 + e−x)]
= sigmoid(x) (1 − sigmoid(x))
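
A quick numerical check of this identity (a sketch; the test point x and the finite-difference step are arbitrary choices):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))                     # identity above
print(numeric, analytic)   # both approximately 0.2217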

Slide 8


Final Layer Update

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk: dE/dwk

Slide 9


Final Layer Update (1)

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk

Error E is defined with respect to y:

dE/dy = d/dy ½(t − y)² = −(t − y)

Slide 10


Final Layer Update (2)

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk

y with respect to s is sigmoid(s):

dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)

Slide 11


Final Layer Update (3)

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk

s is the weighted linear combination of hidden node values hk:

ds/dwk = d/dwk Σk′ wk′hk′ = hk

Slide 12


Putting it All Together

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk = −(t − y) · y(1 − y) · hk

– error: (t − y)
– derivative of sigmoid: y′ = y(1 − y)
Weight adjustment will be scaled by a fixed learning rate µ:

∆wk = µ (t − y) y(1 − y) hk

Slide 13


Multiple Output Nodes

Our example only had one output node
Typically neural networks have multiple output nodes
Error is computed over all j output nodes:

E = Σj ½(tj − yj)²

Weights k → j are adjusted according to the node they point to:

∆wj←k = µ (tj − yj) y′j hk

Slide 14


Hidden Layer Update

In a hidden layer, we do not have a target output value
But we can compute how much each node contributed to downstream error
Definition of error term of each node:

δj = (tj − yj) y′j

Back-propagate the error term:

δi = (Σj wj←i δj) y′i

(why this way? there is math to back it up...)

Universal update formula:

∆wj←k = µ δj hk

Slide 15


Our Example

[Figure: the example network with labelled nodes. Inputs A = 1.0, B = 0.0 and bias C = 1 feed hidden nodes D = .90 (weights 3.7, 3.7, -1.5) and E = .17 (weights 2.9, 2.9, -4.6); D, E and bias F = 1 feed output node G = .76 (weights 4.5, -5.2, -2.0).]

Computed output: y = .76
Correct output: t = 1.0
Final layer weight updates (learning rate µ = 10)
– δG = (t − y) y' = (1 − .76) 0.181 = .0434
– ∆wGD = µ δG hD = 10 × .0434 × .90 = .391
– ∆wGE = µ δG hE = 10 × .0434 × .17 = .074
– ∆wGF = µ δG hF = 10 × .0434 × 1 = .434
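
These updates can be reproduced in a few lines (a sketch; variable names are mine, and y′ = 0.181 is the rounded derivative used on the slide):

mu = 10                               # learning rate from the slide
t, y = 1.0, 0.76                      # correct and computed output
y_prime = 0.181                       # y(1 - y), from the unrounded y
h = {"D": 0.90, "E": 0.17, "F": 1.0}  # hidden node values

delta_G = (t - y) * y_prime           # error term, = .0434
for name, h_k in h.items():
    print(name, round(mu * delta_G * h_k, 3))   # .391, .074, .434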
Slide 16


Our Example

[Figure: the same network after the final-layer update: wGD 4.5 → 4.891, wGE -5.2 → -5.126, wGF -2.0 → -1.566; hidden-layer weights unchanged.]

Computed output: y = .76
Correct output: t = 1.0
Final layer weight updates (learning rate µ = 10)
– δG = (t − y) y' = (1 − .76) 0.181 = .0434
– ∆wGD = µ δG hD = 10 × .0434 × .90 = .391
– ∆wGE = µ δG hE = 10 × .0434 × .17 = .074
– ∆wGF = µ δG hF = 10 × .0434 × 1 = .434
Slide 17


Hidden Layer Updates

[Figure: the same network with updated final-layer weights (4.891, -5.126, -1.566), now highlighting the hidden nodes D and E.]

Hidden node D: δD = (Σj wj←D δj) y′D = 4.5 × .0434 × .90 × (1 − .90) ≈ .0176

Hidden node E: δE = (Σj wj←E δj) y′E = −5.2 × .0434 × .17 × (1 − .17) ≈ −.0318
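
A sketch of the same computation in code (names are mine; the error term is propagated through the original weights into G, before the final-layer update is applied):

delta_G = 0.0434              # output error term from the previous slide
w_GD, w_GE = 4.5, -5.2        # original weights D->G and E->G
y_D, y_E = 0.90, 0.17         # hidden node outputs

delta_D = w_GD * delta_G * y_D * (1 - y_D)   # ~  .0176
delta_E = w_GE * delta_G * y_E * (1 - y_E)   # ~ -.0318
# first-layer updates then follow the universal formula
# delta_w(j<-k) = mu * delta_j * h_k, with inputs h = (1.0, 0.0, 1)
print(round(delta_D, 4), round(delta_E, 4))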

Slide 18


some additional aspects

Slide 19


Initialization of Weights

Weights are initialized randomly, e.g., uniformly from the interval [−0.01, 0.01]
Glorot and Bengio (2010) suggest
– for shallow neural networks: the interval [−1/√n, 1/√n], where n is the size of the previous layer
– for deep neural networks: the interval [−√6/√(nj + nj+1), √6/√(nj + nj+1)], where nj is the size of the previous layer and nj+1 the size of the next layer
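
A sketch of these initializations with NumPy (the layer sizes are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)

# Simple initialization: uniform from [-0.01, 0.01]
W = rng.uniform(-0.01, 0.01, size=(200, 100))

# Glorot/Bengio, shallow networks: scale by the previous layer size n
n = 100
W_shallow = rng.uniform(-1 / np.sqrt(n), 1 / np.sqrt(n), size=(200, n))

# Glorot/Bengio, deep networks: scale by previous and next layer sizes
n_prev, n_next = 100, 200
bound = np.sqrt(6) / np.sqrt(n_prev + n_next)
W_deep = rng.uniform(-bound, bound, size=(n_next, n_prev))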

Slide 20


Neural Networks for Classification

Predict class: one output node per class
Training data output: "one-hot vector", e.g., t = (0, 0, 1, 0)
Prediction
– predicted class is the output node yi with the highest value
– obtain posterior probability distribution by softmax:

softmax(yi) = exp(yi) / Σj exp(yj)
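
A sketch of the prediction step (the raw output values are made up; subtracting the maximum before exponentiating is a standard numerical-stability trick, not from the slide):

import numpy as np

y = np.array([1.2, -0.3, 2.5])      # raw output node values
predicted_class = int(np.argmax(y)) # output node with the highest value

z = np.exp(y - y.max())             # subtract max for numerical stability
posterior = z / z.sum()             # softmax: probabilities summing to 1
print(predicted_class, posterior)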
Slide 21


Problems with Gradient Descent Training

[Figure: error(λ) curve illustrating updates that overshoot the minimum.]

Too high learning rate

Slide 22


Problems with Gradient Descent Training

[Figure: error(λ) curve illustrating a poor starting point.]

Bad initialization


Slide 23


Problems with Gradient Descent Training

[Figure: error(λ) curve with a local optimum next to the global optimum; gradient descent can get stuck in the local optimum.]

Local optimum

Slide 24


Speedup: Momentum Term

Updates may move a weight slowly in one direction
To speed this up, we can keep a memory of prior updates ∆wj←k(n − 1)
... and add these to any new updates (with decay factor ρ):

∆wj←k(n) = µ δj hk + ρ ∆wj←k(n − 1)
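
A sketch of the momentum update for a single weight (the µ, ρ, and gradient values are illustrative):

mu, rho = 0.01, 0.9      # learning rate and decay factor
prev_update = 0.0        # memory of the prior update, delta_w(n - 1)

for delta, h in [(0.04, 0.9), (0.04, 0.9), (0.04, 0.9)]:
    update = mu * delta * h + rho * prev_update  # new update plus decayed memory
    prev_update = update
    print(update)        # grows while the gradient keeps the same direction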


Slide 25


Adagrad

Typically reduce the learning rate µ over time
– at the beginning, things have to change a lot
– later, just fine-tuning
Adapting the learning rate per parameter
Adagrad update, based on error E with respect to the weight w at time t, with gt = dE/dw:

∆wt = µ / √(Σt τ=1 gτ²) × gt
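
A sketch of the Adagrad rule for a single weight (the eps guard against division by zero is my addition; the gradients are made up):

import math

mu = 0.1
accumulated = 0.0                  # running sum of squared gradients
eps = 1e-8                         # avoids division by zero at t = 0 (my addition)

for g_t in [0.5, 0.4, 0.3, 0.2]:   # example gradients dE/dw over time
    accumulated += g_t ** 2
    delta_w = mu / math.sqrt(accumulated + eps) * g_t
    print(delta_w)                 # the effective learning rate shrinks over time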

Slide 26


Dropout

A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test)
Solution: regularization, e.g., keeping weights from having extreme values
Dropout: randomly remove some hidden units during training
– mask: set of hidden units dropped
– randomly generate, say, 10–20 masks
– alternate between the masks during training
Why does that work?
→ bagging, ensemble, ...
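
A sketch of applying one dropout mask to a layer of hidden units (the drop rate is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
h = rng.random(10)                       # hidden unit values
drop_rate = 0.2                          # fraction of units to drop

mask = rng.random(h.shape) >= drop_rate  # True = unit is kept
h_dropped = h * mask                     # dropped units contribute 0 during training
print(h_dropped)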
Slide 27


Mini Batches

Each training example yields a set of weight updates ∆wi
Batch up several training examples
– sum up their updates
– apply the sum to the model
Mostly done for speed reasons
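
A sketch of the batching (shapes and values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
# per-example weight updates delta_w_i for a 3x3 weight matrix
updates = [0.01 * rng.normal(size=(3, 3)) for _ in range(8)]

W = np.zeros((3, 3))     # the model's weights
W += sum(updates)        # sum the per-example updates, apply the sum once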
Slide 28


computational aspects

Slide 29


Vector and Matrix Multiplications

Forward computation: s = W h
Activation function: y = sigmoid(s)
Error term: δ = (t − y) ⊙ y′
Propagation of error term: δi = (Wᵀ δi+1) ⊙ y′i
Weight updates: ∆W = µ δ hᵀ
(⊙ is the elementwise product; these are the matrix forms of the per-weight formulas above)
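
A sketch of these operations with NumPy for one hidden layer (shapes and values are illustrative; the formulas mirror the per-weight derivation earlier in this section):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])      # input vector (with bias unit)
W1 = rng.normal(size=(2, 3))       # input -> hidden weight matrix
W2 = rng.normal(size=(1, 2))       # hidden -> output weight matrix
t = np.array([1.0])                # target output
mu = 1.0                           # learning rate

h = sigmoid(W1 @ x)                # forward computation + activation
y = sigmoid(W2 @ h)

delta2 = (t - y) * y * (1 - y)            # output error term
delta1 = (W2.T @ delta2) * h * (1 - h)    # propagate error term back

W2 += mu * np.outer(delta2, h)     # weight updates as outer products
W1 += mu * np.outer(delta1, x)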
Slide 30


GPU

Neural network layers may have, say, 200 nodes
Computations such as s = W h then require 200 × 200 = 40,000 multiplications
Graphics Processing Units (GPUs) are designed for such computations
– image rendering requires such vector and matrix operations
– massively multi-core, but lean processing units
– example: NVIDIA Tesla K20c GPU provides 2496 thread processors
Extensions to C to support programming of GPUs, such as CUDA
Slide 31


Toolkits

Theano
Tensorflow (Google)
PyTorch (Facebook)
MXNet (Amazon)
DyNet
С (easy api)

Slide 32

lia@math.md