Backpropagation example

Contents

Slide 2


Error

[Figure: feed-forward network. Inputs 1.0 and 0.0 plus a bias of 1 feed two hidden nodes (weights 3.7, 3.7, -1.5 and 2.9, 2.9, -4.6), giving hidden values .90 and .17; these, plus a bias of 1, feed the output node (weights 4.5, -5.2, -2.0), giving output .76.]

Computed output: y = .76
Correct output: t = 1.0

How do we adjust the weights?
Slide 3


Key Concepts

Gradient descent (see the sketch after this list)
– error is a function of the weights
– we want to reduce the error
– gradient descent: move towards the error minimum
– compute gradient → get direction to the error minimum
– adjust weights towards direction of lower error
Back-propagation
– first adjust last set of weights
– propagate error back to each previous layer
– adjust their weights
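
A minimal sketch of gradient descent in code (the toy error function, starting point, and learning rate are my own illustrative choices, not from the slides):

# Gradient descent on a toy one-dimensional error function:
# error(w) = (w - 3)^2, whose minimum is at w = 3.
def error_gradient(w):
    return 2 * (w - 3)           # d/dw (w - 3)^2

w = 0.0                          # initial weight
mu = 0.1                         # learning rate
for step in range(50):
    w -= mu * error_gradient(w)  # step towards lower error
print(w)                         # close to the optimum 3.0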
Slide 4


Gradient Descent

[Figure: one-dimensional plot of error(λ) against λ; at the current λ the gradient is 1, pointing towards the optimal λ.]

Slide 5


Gradient Descent

[Figure: error surface over two weights; the gradient for w1 and the gradient for w2 combine into a single combined gradient leading from the current point towards the optimum.]

Slide 6


Gradient descent in weight space

From the book Machine Learning by Tom Mitchell.
http://profsite.um.ac.ir/~monsefi/machine-learning/pdf/Machine-Learning-Tom-Mitchell.pdf
Slide 7


Derivative of Sigmoid

Sigmoid

sigmoid(x) = 1 / (1 + e−x)

Reminder: quotient rule

(f/g)′ = (f′g − fg′) / g²

Derivative

d sigmoid(x)/dx = d/dx [1 / (1 + e−x)]
= [0 × (1 + e−x) − 1 × (−e−x)] / (1 + e−x)²
= [1 / (1 + e−x)] × [e−x / (1 + e−x)]
= [1 / (1 + e−x)] × [1 − 1 / (1 + e−x)]
= sigmoid(x) (1 − sigmoid(x))
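
A quick numerical check of this identity (a sketch; the test point x and the finite-difference step are arbitrary choices):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))                     # identity above
print(numeric, analytic)   # both approximately 0.2217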

Slide 8


Final Layer Update

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk: dE/dwk

Slide 9


Final Layer Update (1)

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk

Error E is defined with respect to y:

dE/dy = d/dy ½(t − y)² = −(t − y)

Slide 10


Final Layer Update (2)

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk

y with respect to s is sigmoid(s):

dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)

Slide 11


Final Layer Update (3)

Linear combination of weights: s = Σk wkhk
Activation function: y = sigmoid(s)
Error (L2 norm): E = ½(t − y)²

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk

s is the weighted linear combination of hidden node values hk:

ds/dwk = d/dwk Σk′ wk′hk′ = hk

Slide 12


Putting it All Together

Derivative of error with regard to one weight wk:

dE/dwk = dE/dy · dy/ds · ds/dwk = −(t − y) · y(1 − y) · hk

– error: (t − y)
– derivative of sigmoid: y′ = y(1 − y)
Weight adjustment will be scaled by a fixed learning rate µ:

∆wk = µ (t − y) y(1 − y) hk

Slide 13


Multiple Output Nodes

Our example only had one output node
Typically neural networks have multiple output nodes
Error is computed over all j output nodes:

E = Σj ½(tj − yj)²

Weights k → j are adjusted according to the node they point to:

∆wj←k = µ (tj − yj) y′j hk

Slide 14


Hidden Layer Update

In a hidden layer, we do not have a target output value
But we can compute how much each node contributed to downstream error
Definition of error term of each node:

δj = (tj − yj) y′j

Back-propagate the error term:

δi = (Σj wj←i δj) y′i

(why this way? there is math to back it up...)

Universal update formula:

∆wj←k = µ δj hk

Slide 15


Our Example

[Figure: the example network with labelled nodes. Inputs A = 1.0, B = 0.0 and bias C = 1 feed hidden nodes D = .90 (weights 3.7, 3.7, -1.5) and E = .17 (weights 2.9, 2.9, -4.6); D, E and bias F = 1 feed output node G = .76 (weights 4.5, -5.2, -2.0).]

Computed output: y = .76
Correct output: t = 1.0
Final layer weight updates (learning rate µ = 10)
– δG = (t − y) y' = (1 − .76) 0.181 = .0434
– ∆wGD = µ δG hD = 10 × .0434 × .90 = .391
– ∆wGE = µ δG hE = 10 × .0434 × .17 = .074
– ∆wGF = µ δG hF = 10 × .0434 × 1 = .434
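
These updates can be reproduced in a few lines (a sketch; variable names are mine, and y′ = 0.181 is the rounded derivative used on the slide):

mu = 10                               # learning rate from the slide
t, y = 1.0, 0.76                      # correct and computed output
y_prime = 0.181                       # y(1 - y), from the unrounded y
h = {"D": 0.90, "E": 0.17, "F": 1.0}  # hidden node values

delta_G = (t - y) * y_prime           # error term, = .0434
for name, h_k in h.items():
    print(name, round(mu * delta_G * h_k, 3))   # .391, .074, .434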
Slide 16


Our Example

[Figure: the same network after the final-layer update: wGD 4.5 → 4.891, wGE -5.2 → -5.126, wGF -2.0 → -1.566; hidden-layer weights unchanged.]

Computed output: y = .76
Correct output: t = 1.0
Final layer weight updates (learning rate µ = 10)
– δG = (t − y) y' = (1 − .76) 0.181 = .0434
– ∆wGD = µ δG hD = 10 × .0434 × .90 = .391
– ∆wGE = µ δG hE = 10 × .0434 × .17 = .074
– ∆wGF = µ δG hF = 10 × .0434 × 1 = .434
Slide 17


Hidden Layer Updates

[Figure: the same network with updated final-layer weights (4.891, -5.126, -1.566), now highlighting the hidden nodes D and E.]

Hidden node D: δD = (Σj wj←D δj) y′D = 4.5 × .0434 × .90 × (1 − .90) ≈ .0176

Hidden node E: δE = (Σj wj←E δj) y′E = −5.2 × .0434 × .17 × (1 − .17) ≈ −.0318
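
A sketch of the same computation in code (names are mine; the error term is propagated through the original weights into G, before the final-layer update is applied):

delta_G = 0.0434              # output error term from the previous slide
w_GD, w_GE = 4.5, -5.2        # original weights D->G and E->G
y_D, y_E = 0.90, 0.17         # hidden node outputs

delta_D = w_GD * delta_G * y_D * (1 - y_D)   # ~  .0176
delta_E = w_GE * delta_G * y_E * (1 - y_E)   # ~ -.0318
# first-layer updates then follow the universal formula
# delta_w(j<-k) = mu * delta_j * h_k, with inputs h = (1.0, 0.0, 1)
print(round(delta_D, 4), round(delta_E, 4))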

Slide 18


some additional aspects

Slide 19


Initialization of Weights

Weights are initialized randomly, e.g., uniformly from the interval [−0.01, 0.01]
Glorot and Bengio (2010) suggest
– for shallow neural networks: the interval [−1/√n, 1/√n], where n is the size of the previous layer
– for deep neural networks: the interval [−√6/√(nj + nj+1), √6/√(nj + nj+1)], where nj is the size of the previous layer and nj+1 the size of the next layer
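
A sketch of these initializations with NumPy (the layer sizes are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)

# Simple initialization: uniform from [-0.01, 0.01]
W = rng.uniform(-0.01, 0.01, size=(200, 100))

# Glorot/Bengio, shallow networks: scale by the previous layer size n
n = 100
W_shallow = rng.uniform(-1 / np.sqrt(n), 1 / np.sqrt(n), size=(200, n))

# Glorot/Bengio, deep networks: scale by previous and next layer sizes
n_prev, n_next = 100, 200
bound = np.sqrt(6) / np.sqrt(n_prev + n_next)
W_deep = rng.uniform(-bound, bound, size=(n_next, n_prev))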

Slide 20


Neural Networks for Classification

Predict class: one output node per class
Training data output: "one-hot vector", e.g., t = (0, 0, 1, 0)
Prediction
– predicted class is the output node yi with the highest value
– obtain posterior probability distribution by softmax:

softmax(yi) = exp(yi) / Σj exp(yj)
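
A sketch of the prediction step (the raw output values are made up; subtracting the maximum before exponentiating is a standard numerical-stability trick, not from the slide):

import numpy as np

y = np.array([1.2, -0.3, 2.5])      # raw output node values
predicted_class = int(np.argmax(y)) # output node with the highest value

z = np.exp(y - y.max())             # subtract max for numerical stability
posterior = z / z.sum()             # softmax: probabilities summing to 1
print(predicted_class, posterior)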
Slide 21


Problems with Gradient Descent Training

[Figure: error(λ) curve illustrating updates that overshoot the minimum.]

Too high learning rate

Slide 22


Problems with Gradient Descent Training

[Figure: error(λ) curve illustrating a poor starting point.]

Bad initialization


Slide 23


Problems with Gradient Descent Training

[Figure: error(λ) curve with a local optimum next to the global optimum; gradient descent can get stuck in the local optimum.]

Local optimum

Slide 24


Speedup: Momentum Term

Updates may move a weight slowly in one direction
To speed this up, we can keep a memory of prior updates ∆wj←k(n − 1)
... and add these to any new updates (with decay factor ρ):

∆wj←k(n) = µ δj hk + ρ ∆wj←k(n − 1)
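
A sketch of the momentum update for a single weight (the µ, ρ, and gradient values are illustrative):

mu, rho = 0.01, 0.9      # learning rate and decay factor
prev_update = 0.0        # memory of the prior update, delta_w(n - 1)

for delta, h in [(0.04, 0.9), (0.04, 0.9), (0.04, 0.9)]:
    update = mu * delta * h + rho * prev_update  # new update plus decayed memory
    prev_update = update
    print(update)        # grows while the gradient keeps the same direction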


Slide 25


Adagrad

Typically reduce the learning rate µ over time
– at the beginning, things have to change a lot
– later, just fine-tuning
Adapting the learning rate per parameter
Adagrad update, based on error E with respect to the weight w at time t, with gt = dE/dw:

∆wt = µ / √(Σt τ=1 gτ²) × gt
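
A sketch of the Adagrad rule for a single weight (the eps guard against division by zero is my addition; the gradients are made up):

import math

mu = 0.1
accumulated = 0.0                  # running sum of squared gradients
eps = 1e-8                         # avoids division by zero at t = 0 (my addition)

for g_t in [0.5, 0.4, 0.3, 0.2]:   # example gradients dE/dw over time
    accumulated += g_t ** 2
    delta_w = mu / math.sqrt(accumulated + eps) * g_t
    print(delta_w)                 # the effective learning rate shrinks over time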

Slide 26


Dropout

A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test)
Solution: regularization, e.g., keeping weights from having extreme values
Dropout: randomly remove some hidden units during training
– mask: set of hidden units dropped
– randomly generate, say, 10–20 masks
– alternate between the masks during training
Why does that work?
→ bagging, ensemble, ...
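
A sketch of applying one dropout mask to a layer of hidden units (the drop rate is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
h = rng.random(10)                       # hidden unit values
drop_rate = 0.2                          # fraction of units to drop

mask = rng.random(h.shape) >= drop_rate  # True = unit is kept
h_dropped = h * mask                     # dropped units contribute 0 during training
print(h_dropped)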
Slide 27


Mini Batches

Each training example yields a set of weight updates ∆wi
Batch up several training examples
– sum up their updates
– apply the sum to the model
Mostly done for speed reasons
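
A sketch of the batching (shapes and values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
# per-example weight updates delta_w_i for a 3x3 weight matrix
updates = [0.01 * rng.normal(size=(3, 3)) for _ in range(8)]

W = np.zeros((3, 3))     # the model's weights
W += sum(updates)        # sum the per-example updates, apply the sum once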
Slide 28


computational aspects

Slide 29


Vector and Matrix Multiplications

Forward computation: s = W h
Activation function: y = sigmoid(s)
Error term: δ = (t − y) ⊙ y′
Propagation of error term: δi = (Wᵀ δi+1) ⊙ y′i
Weight updates: ∆W = µ δ hᵀ
(⊙ is the elementwise product; these are the matrix forms of the per-weight formulas above)
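
A sketch of these operations with NumPy for one hidden layer (shapes and values are illustrative; the formulas mirror the per-weight derivation earlier in this section):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])      # input vector (with bias unit)
W1 = rng.normal(size=(2, 3))       # input -> hidden weight matrix
W2 = rng.normal(size=(1, 2))       # hidden -> output weight matrix
t = np.array([1.0])                # target output
mu = 1.0                           # learning rate

h = sigmoid(W1 @ x)                # forward computation + activation
y = sigmoid(W2 @ h)

delta2 = (t - y) * y * (1 - y)            # output error term
delta1 = (W2.T @ delta2) * h * (1 - h)    # propagate error term back

W2 += mu * np.outer(delta2, h)     # weight updates as outer products
W1 += mu * np.outer(delta1, x)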
Slide 30


GPU

Neural network layers may have, say, 200 nodes
Computations such as s = W h then require 200 × 200 = 40,000 multiplications
Graphics Processing Units (GPUs) are designed for such computations
– image rendering requires such vector and matrix operations
– massively multi-core, but lean processing units
– example: NVIDIA Tesla K20c GPU provides 2496 thread processors
Extensions to C to support programming of GPUs, such as CUDA
Slide 31


Toolkits

Theano
Tensorflow (Google)
PyTorch (Facebook)
MXNet (Amazon)
DyNet
С (easy api)

Slide 32

lia@math.md