<h2>Content</h2>
<ul>
<li><a href="#introduction">Introduction</a>
<ul>
<li><a href="#disclaimer">Disclaimer</a></li>
<li><a href="#setting-some-bounds">Setting some bounds</a></li>
</ul>
</li>
<li><a href="#the-basic-math-formula-for-the-adaline-perceptron">The basic math formula for the ADALINE Perceptron</a></li>
<li><a href="#why-do-we-need-another-perceptron">Why do we need another perceptron?</a></li>
<li><a href="#sum-of-squared-errors-convex-optimization-what-are-you-talking-about">Sum of Squared Errors, Convex Optimization: what are you talking about?</a>
<ul>
<li><a href="#sum-of-squared-errors">Sum of Squared Errors</a></li>
<li><a href="#the-calculation-of-squared-somethings">The calculation of squared somethings</a></li>
<li><a href="#convex-optimization">Convex Optimization</a></li>
</ul>
</li>
<li><a href="#finding-the-minimum-of-a-function">Finding the minimum of a function</a></li>
<li><a href="#differentiatable-continuous-any-other-new-words-youd-like-to-drop">Differentiatable, Continuous? Any other new words you’d like to drop?</a>
<ul>
<li><a href="#the-derivative-of-a-single-variable-function">The Derivative of a single variable function</a></li>
<li><a href="#understanding-the-limit-of-a-function">Understanding the Limit of a function</a></li>
<li><a href="#understanding-the-derivative">Understanding the derivative</a></li>
<li><a href="#the-derivative-of-a-multiple-variable-function-the-directional-derivative">The derivative of a multiple variable function: the Directional Derivative</a></li>
<li><a href="#the-derivative-of-a-multiple-variable-function-the-gradient-vector">The derivative of a multiple variable function: the Gradient vector</a></li>
<li><a href="#the-derivative-of-the-derivative-of-the-...-second-derivative-third-derivative-etc...">The derivative of the derivative of the …: Second Derivative, Third Derivative, etc…</a></li>
</ul>
</li>
<li><a href="#yes-here-is-another-word-matrices">Yes, here is another word: ‘Matrices’</a>
<ul>
<li><a href="#definition-of-a-matrix">Definition of a matrix</a></li>
<li><a href="#operations-with-matrices">Operations with matrices</a></li>
<li><a href="#so-why-do-we-need-them-for-the-directional-derivative">So, why do we need them for the directional derivative?</a></li>
</ul>
</li>
<li><a href="#and-some-more-words-critical-points-stationary-points-and-fermats-theorem">And some more words: Critical Points, Stationary Points and Fermats Theorem</a></li>
<li><a href="#optimizing-minimizing-the-mean-squared-errors">Optimizing (minimizing) the Mean Squared Errors</a>
<ul>
<li><a href="#getting-the-derivative-of-our-cost-function">Getting the derivative of our cost function</a></li>
<li><a href="#prooving-convexity">Prooving convexity</a></li>
<li><a href="#solving-the-problem">Solving the problem</a></li>
</ul>
</li>
<li><a href="#gradient-descent">Gradient Descent</a>
<ul>
<li><a href="#gradient-descend-in-words-and-pictures">Gradient descend in words (and pictures)</a></li>
<li><a href="#gradient-descend-in-formulas">Gradient descend in formulas</a></li>
<li><a href="#but-it-can-go-wrong">But it can go wrong</a></li>
</ul>
</li>
<li><a href="#the-adaline-learning-rule">The ADALINE learning rule</a></li>
<li><a href="#wrap-up">Wrap up</a>
<ul>
<li><a href="#basic-formula-of-the-adaline-perceptron">Basic formula of the ADALINE Perceptron</a></li>
<li><a href="#behaviour-of-the-adaline-perceptron">Behaviour of the ADALINE Perceptron</a></li>
<li><a href="#formalising-some-things-a-few-definitions">Formalising some things: a few definitions</a></li>
</ul>
</li>
<li><a href="#what-is-wrong-with-the-adaline-perceptron">What is wrong with the ADALINE perceptron?</a></li>
<li><a href="#references">References:</a>
<ul>
<li><a href="#sum-of-squared-errors-1">Sum of Squared Errors</a></li>
<li><a href="#maxima-minima-and-the-slope">Maxima, Minima and the Slope</a></li>
<li><a href="#limit-continuity-and-differentiability">Limit, Continuity and Differentiability</a></li>
<li><a href="#matrices">Matrices</a></li>
<li><a href="#critical-points-stationary-points-and-fermats-theorem">Critical Points, Stationary Points and Fermat’s Theorem</a></li>
<li><a href="#gradient-descent-1">Gradient Descent</a></li>
<li><a href="#adaline">ADALINE</a></li>
</ul>
</li>
</ul>
<h2 id="introduction">Introduction</h2>
<p>A lot of articles introduce the perceptron showing the basic mathematical formulas that define it, but without offering much of an explanation on what exactly makes it work.</p>
<p>And surely it is possible to use the perceptron without really understanding the basic math involved with it, but is it not also fun to see how all this math you learned in school can help you understand the perceptron, and in extension, neural networks?</p>
<p>I was also inspired to write this series of articles by a series of articles on <a href="https://www.svm-tutorial.com/svm-tutorial/math-svm-tutorial/">Support Vector Machines</a>, explaining the basic mathematical concepts involved, and slowly building up to the more complex mathematics. So that is my intention with this article and the accompanying code: show you the math involved in the perceptron. And, if time permits, I will write articles all the way up to convolutional neural networks.</p>
<p>Of course, when explaining the math, the question is: when do you stop explaining? There is some math involved that is rather basic, like for example <em>what is a vector?</em>, <em>what is a cosine?</em>, etc… I will assume some basic knowledge of mathematics: that you have some idea of what <em>a vector</em> is, that you know the basics of geometry, etc… My assumptions will be somewhat arbitrary, so if you think I’m going too fast in some explanations, just leave a comment.</p>
<p>So, let us get started.</p>
<h3 id="disclaimer">Disclaimer</h3>
<p>This article is about the math involved in the perceptron and NOT about the code used and written to illustrate these mathematical concepts. Although it is not my intention to write such an article, never say never…</p>
<h3 id="setting-some-bounds">Setting some bounds</h3>
<p>A perceptron basically takes some input values, called “features” and represented by the values <span class="math">\(x_1, x_2, ... x_n\)</span> in the following formula, multiplies them by some factors called “weights”, represented by <span class="math">\(w_1, w_2, ... w_n\)</span>, takes the sum of these multiplications and, depending on the value of this sum, outputs another value <span class="math">\(o\)</span>:</p>
<p><div class="math">$o = f(w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n)$</div></p>
<p>There are a few types of perceptrons, differing in the way the sum results in the output, thus the function <span class="math">\(f()\)</span> in the above formula.</p>
<p>In this article we will build on the Adaline Perceptron. Adaline stands for Adaptive Linear Neuron. During this article I will simply be using the name “Perceptron” when referring to the Adaline Perceptron. I assume you have read the article about the <a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/RosenblattPerceptronArticle.html">Rosenblatt Perceptron</a> and thus are already familiar with the basic vector math necessary to understand the basic formulas of a general perceptron.</p>
<p>We will investigate the math involved and discuss its limitations, thereby setting the ground for future articles.</p>
<h2 id="the-basic-math-formula-for-the-adaline-perceptron">The basic math formula for the ADALINE Perceptron</h2>
<p>There’s not a big difference between the math formula for the Rosenblatt perceptron and for the ADALINE perceptron:</p>
<p><div class="math">$f(x) = \begin{cases} 1 & \text{if } w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n > b\\ -1 & \text{otherwise} \end{cases} $</div></p>
<p>The main difference is that the <span class="math">\(otherwise\)</span> case returns <span class="math">\(-1\)</span> instead of <span class="math">\(0\)</span>, which is important here.</p>
<p>But we still calculate the dot-product and compare it with a value, so we still only have two possible outcomes and we still can only classify linearly separable data. So, basically, we can do nothing more than what we could do with the Rosenblatt perceptron.</p>
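<p>As a minimal sketch of this formula (in Python, with made-up feature values, weights and threshold, not the article’s accompanying code), the ADALINE output for a single sample could be computed like this:</p>
<pre><code># Hypothetical example values: two features, two weights and a threshold b.
x = [2.0, -1.0]   # features x1, x2
w = [0.5, 1.5]    # weights w1, w2
b = 0.25          # the threshold

# The weighted sum (the dot product of w and x).
net = sum(wi * xi for wi, xi in zip(w, x))

# ADALINE activation: +1 if the weighted sum exceeds b, otherwise -1.
o = 1 if net > b else -1
print(net, o)     # -0.5 -1
</code></pre>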
<p>So then why do we need a new perceptron?</p>
<h2 id="why-do-we-need-another-perceptron">Why do we need another perceptron?</h2>
<p>In the previous article about the Rosenblatt perceptron we were eventually able to divide “things” into two different classes through a linear combination of their features. What more could we want?</p>
<p>Well, a better estimate of the linear boundary would be nice. With the Rosenblatt perceptron we do not have any control over the boundary which the learning will eventually give us: as demonstrated in the previous article it could be any line which separates our two classes:</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/LinearSeperability_SeperableCandidateSolutions.PNG" alt="Candidate lines"></p>
<p>Remember the learning procedure of the Rosenblatt perceptron. We defined the error <span class="math">\(e\)</span> as being the difference between the desired output <span class="math">\(d\)</span> and the effective output <span class="math">\(o\)</span> for an input vector <span class="math">\(\mathbf{x}\)</span> and a weight vector <span class="math">\(\mathbf{w}\)</span>:</p>
<p><div class="math">$\begin{aligned} e &= d-o\\ &= d-H(\mathbf{w} \cdot \mathbf{x}) \end{aligned}$</div></p>
<p><em>(If the above formulas look like chinese to you, you should have a look at <a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/RosenblattPerceptronArticle.html">the article about the Rosenblatt perceptron</a>)</em></p>
<p>The effective output <span class="math">\(o\)</span> is determined by the Heaviside step function, represented by <span class="math">\(H()\)</span> in the formula above. We could define a cost function using this error and then try to minimize this cost function. A <em>cost function</em> is basically a function which calculates the total error of our model with respect to the desired values, so by minimising the result of the cost function we minimize the error.</p>
<p>The cost function for the ADALINE perceptron is based on the <em>Sum of Squared Errors / Mean Sum of Squared Errors</em>, and minimizing it is, in our case (linear regression), a problem in the category of optimization problems called <em>Convex Optimization</em>.</p>
<h2 id="sum-of-squared-errors-convex-optimization-what-are-you-talking-about">Sum of Squared Errors, Convex Optimization: what are you talking about?</h2>
<p>If you search the internet for information about the ADALINE perceptron you’re likely to find various definitions for the learning rule. The original learning rule was the <a href="https://en.wikipedia.org/wiki/Least_mean_squares_filter">Least Mean Squares filter</a>, also known as the Widrow-Hoff learning rule, named after its inventors. However, in our digital computer age you will also find references to the Sum of Squared Errors method. In the following section you’ll see they basically boil down to the same underlying idea.</p>
<h3 id="sum-of-squared-errors">Sum of Squared Errors</h3>
<p>We try to optimize (that is, minimize) the sum of the squared errors. The errors are defined in a similar fashion as for the Rosenblatt perceptron, but here we use the outcome of the dot-product directly:</p>
<p><div class="math">$\begin{aligned} e &= d-o\\ &= d-\mathbf{w} \cdot \mathbf{x} \end{aligned}$</div></p>
<p>Then we define the cost function as the sum of the squares of those errors:</p>
<p><div class="math">$E(w) = \frac{1}{2} \sum_{j=1}^{M} e_j^2$</div></p>
<p>in which <span class="math">\(M\)</span> is the number of samples. Thus, we take the sum over all our samples.</p>
<p>Never mind the division by <span class="math">\(2\)</span>: it is there to give us a more convenient function to work with later. This will become clear once we have discussed differentiation.</p>
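<p>To make the summation concrete, here is a small sketch that evaluates this cost function for a handful of made-up samples (the sample values and the weight vector are arbitrary, chosen only for illustration):</p>
<pre><code># Made-up training samples: each entry is (features x, desired output d).
samples = [([1.0, 2.0], 1.0),
           ([2.0, 1.0], 1.0),
           ([-1.0, -2.0], -1.0)]
w = [0.3, -0.2]   # an arbitrary weight vector

def sse(w, samples):
    """Sum of Squared Errors: E(w) = 1/2 * sum_j (d_j - w . x_j)^2"""
    total = 0.0
    for x, d in samples:
        e = d - sum(wi * xi for wi, xi in zip(w, x))   # error for this sample
        total += e * e
    return 0.5 * total

print(sse(w, samples))   # the value we want to minimize by choosing better weights
</code></pre>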
<p>But why <em>squared</em>?</p>
<p>If you search the internet for an explanation, then most probably you’ve read something along the lines of:</p>
<blockquote>
<p>Because we want all positive values</p>
</blockquote>
<p>This is true: we do indeed want positive values. Imagine we just took the sum of the errors. We would have positive and negative values which would cancel each other out, so the sum could become small (possibly even zero) although our solution would not be optimal. If we first make our errors positive and then take the sum of those, the only way to get zero as a result is if the outputs of our perceptron are equal to those of our samples. That is: if we have no errors.</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/SSE_vs_SE.PNG" alt="Sum of errors can result in zero if positive and negative errors cancel each other out"></p>
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/SumOfSuaredErrors.html#learn_sse_vs_se">Sum of Squared Errors versus Sum of Errors interactive</a></p>
<p>Ok, but then why not simply take the absolute value of our error? Asking the internet mostly returns:</p>
<blockquote>
<p>Because large errors take a bigger share in the result.<br>
Because taking the absolute value does not allow to take the derivative</p>
</blockquote>
<p>To understand the last reason we will need to understand <em>convex optimization</em>.</p>
<p>The above is not a mathematically rigorous explanation of why to use the Sum of Squared Errors. There is a rigorous explanation involving things like <em>likelihood</em>, <em>maximum likelihood</em>, etc… This however is a pretty big subject and we already have a lot to explain, which is why I will not explain it any further in this article. But I do intend to provide a separate article detailing what this is about.</p>
<p>As mentioned above you’ll also find definitions for the learning procedure which use the Mean Sum of Squared Error:</p>
<p><div class="math">$E(w) = \frac{1}{2M} \sum_{j=1}^{M} e_j^2$</div></p>
<p>The original definition for the ADALINE perceptron used this formulation. For the optimization it does not make any difference whether we use the Sum of Squared Errors or the Mean Sum of Squared Errors (as we’ll soon see). However, if we want to compare results between sample sets of different sizes, it does make a difference. See the next paragraph:</p>
<h3 id="the-calculation-of-squared-somethings">The <em>calculation</em> of squared <em>something</em>s</h3>
<p>When you start searching the internet for information on the sum of squared errors, you’ll find all sorts of things done with squared somethings. It’s easy to get lost: I sure did. So let us briefly examine some combinations and see why we end up using the sum of squared errors:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Least_squares"><strong>Least Squares Error</strong></a> This is a method for determining our model. We try to minimize (Least) the Sum of Squares of the Errors.</li>
<li><a href="https://en.wikipedia.org/wiki/Residual_sum_of_squares"><strong>Residual Sum of Squares</strong></a> This is basically the same as the Sum of Squared Errors. Instead of using the name Errors, we use the name Residuals</li>
<li><a href="https://en.wikipedia.org/wiki/Ordinary_least_squares"><strong>Ordinary Least Squares</strong></a> If our model is a linear model (which for the perceptron it is), then this is the same is the Least Squares Error. So, this also is a method for determining our model, but <em>specialized</em> for linear models.</li>
<li><a href="https://en.wikipedia.org/wiki/Mean_squared_error"><strong>Mean Squared Error</strong></a> The mean squared errors calculates the mean of the squared errors. The mean is calculated by taking the sum of all the errors of our samples and dividing it by the number of samples, thus, it basically is the Least Squares Error divided by the number of samples. It is a good measure for evaluating the quality after we have calculated our weights. By taking the mean of the squared errors we take the number of samples we want to check into account. Just imagine not dividing by the number of samples in the above formula: it would be clear that the higher the number of samples, the higher the sum of the squares would become and thus would hide the desired metric.</li>
</ul>
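<p>A small sketch with invented per-sample errors shows why the division by the number of samples matters when comparing sample sets of different sizes: the Sum of Squared Errors grows with the sample count, while the Mean Squared Error stays comparable.</p>
<pre><code># Invented per-sample errors for two sample sets of different sizes.
errors_small = [0.5, -0.4, 0.3]
errors_large = [0.5, -0.4, 0.3] * 10          # the same error pattern, ten times more samples

def sse(errors):
    return 0.5 * sum(e * e for e in errors)

def mse(errors):
    return sse(errors) / len(errors)

print(sse(errors_small), sse(errors_large))   # the SSE grows with the number of samples
print(mse(errors_small), mse(errors_large))   # the MSE stays comparable
</code></pre>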
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/SumOfSuaredErrors.html#learn_sse_vs_sme">Sum of Squared Errors versus Sum of Mean Errors interactive</a></p>
<h3 id="convex-optimization">Convex Optimization</h3>
<p>In convex optimization we try to <em>minimize</em> a <em>convex function</em>.</p>
<h4 id="what-is-a-convex-function">What is a Convex Function?</h4>
<p>From wikipedia:</p>
<blockquote>
<p>… is called <strong>convex</strong> if the line segment between any two points on the graph of the function lies above or on the graph. Equivalently, a function is convex if (…) the set of points on or above the graph of the function is a convex set.</p>
</blockquote>
<p>(If you are not familiar with the concepts of line segments and convex sets, you should check out the first article in this series where I discuss the two.)</p>
<p>Using math this definition becomes:</p>
<p><div class="math">$f( \lambda A + (1- \lambda)B) \leq \lambda f(A) + (1- \lambda)f(B)$</div></p>
<p>in which <span class="math">\(A\)</span> and <span class="math">\(B\)</span> are two points in <span class="math">\(\mathbb{R}^n\)</span> and <span class="math">\(\lambda\)</span> is in the interval <span class="math">\((0, 1)\)</span></p>
<p>In words, this formula states that for a convex function, every value of the function on the “hyper-line” between the input points <span class="math">\(A\)</span> and <span class="math">\(B\)</span> must be smaller than or equal to the corresponding point on the “hyper-line” connecting the two results of the function for the points <span class="math">\(A\)</span> and <span class="math">\(B\)</span>. If you are unfamiliar with the concept of line segments (from which this formula derives) you should read <a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/RosenblattPerceptronArticle.html#linear-separability-and-half-spaces">the section on line segments in the first article</a> of this series.</p>
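<p>As a quick numerical illustration (a sketch, not a proof), we can check this inequality for a function we expect to be convex, for example <span class="math">\(f(x) = x^2\)</span>, at a few arbitrary points:</p>
<pre><code># Check f(lambda*A + (1 - lambda)*B) is at most lambda*f(A) + (1 - lambda)*f(B)
# for f(x) = x**2 at a few arbitrary points A, B and values of lambda.
f = lambda x: x * x

for A, B in [(-2.0, 3.0), (0.5, 4.0)]:
    for lam in (0.25, 0.5, 0.75):
        left = f(lam * A + (1 - lam) * B)
        right = lam * f(A) + (1 - lam) * f(B)
        print(A, B, lam, right >= left)   # True: the inequality holds for a convex function
</code></pre>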
<p>The following figure illustrates this formula:</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/ConvexFunction_Convex.PNG" alt="A convex function"></p>
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/ConvexFunction.html#learn_convex">Sum of Squared Errors versus Sum of Mean Errors interactive</a></p>
<p>The following is an example where the formula does NOT hold:</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/ConvexFunction_NotConvex.PNG" alt="A convex function"></p>
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/ConvexFunction.html#learn_notconvex">Sum of Squared Errors versus Sum of Mean Errors interactive</a></p>
<p>You may ask yourself what happens if the line segment is completely underneath the graph of the function. Then we have a concave function. It will be clear that this is just the convex function negated, thus if <span class="math">\(f(x)\)</span> is a convex function, then <span class="math">\(-f(x)\)</span> is a concave function.</p>
<p>Our cost function, the sum of squared errors, is a convex function.</p>
<h4 id="what-is-optimization">What is optimization?</h4>
<p>What is interesting about a convex function is that it has a single minimum value. So, if our cost function of the perceptron is a convex function then we can find a single minimum value for this function and thus we have a single optimal solution to our problem.</p>
<p>This is the main difference between the ADALINE perceptron and the Rosenblatt perceptron: we can find a <strong>guaranteed</strong> optimal solution which will <strong>always be the same</strong> for a given set of samples. We will also find a solution when the learning data is not strictly separable, although this will not necessarily be a correct solution: some points may end up in the wrong half-space. But the error will be minimal.</p>
<p>So, the act of <em>optimization</em> in this case is to find for what arguments the cost function has a minimum value.</p>
<h2 id="finding-the-minimum-of-a-function">Finding the minimum of a function</h2>
<p>Let us first mathematically define what a minimum is:</p>
<p><div class="math">$a \in \mathbb{R}\text{ is a minimum for a function }f(x)\\ \text{ if for all points }x\text{ in a region around }a\text{ we have }f(x) >= f(a)$</div></p>
<p>We can extend this definition to functions of multiple variables:</p>
<p><div class="math">$A = (a_1, a_2, ..., a_n) \in \mathbb{R^n}\text{ is a minimum for a function }f(x_1, x_2, ..., x_n)\\ \text{ if for all vectors }X\text{ in a region around }A\text{ we have }f(X) >= f(A)$</div></p>
<p>In words: we have a minimum for a function if all the neighbouring values of the function are larger than or equal to this minimum value.</p>
<p>Finding this minimum is typically done by calculating the derivative of the function. As we will show later, the derivative of a function can be seen as the slope of that function. Or: the derivative of a function at a certain point is equal to the slope of the tangent to the function at that point.</p>
<p>This can easily be shown graphically for a function of a single variable:</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/Derivation_Slope.PNG" alt="The slope of a function"></p>
<p>When the slope (thus the derivative) of a function is zero, thus when the tangent is horizontal, we typically have a minimum, a maximum, or a <a href="https://en.wikipedia.org/wiki/Inflection_point">horizontal inflection point</a>:</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/Derivation_SlopeZero.PNG" alt="horizontal inflection point"></p>
<p>Because our cost function is convex, we know we will have a minimum.</p>
<p>Although the above intuitively feels OK, a more formal explanation can be found in <a href="https://en.wikipedia.org/wiki/Fermat%27s_theorem_(stationary_points)">Fermat’s theorem on stationary points</a>, but we’ll get back to this later.</p>
<p>So, to find the minimum we have to find where the slope is <span class="math">\(0\)</span> , and because the slope is the derivative (we’ll see why in the next section) we have to be able to find the derivative of the function.</p>
<p>But herein lies the problem with regard to the Rosenblatt perceptron: for differentiation (which is the act of calculating the derivative) a function has to be continuous, and the Heaviside step function is not continuous.</p>
<p>The ADALINE perceptron however defines the error as being the difference between the desired outcome <span class="math">\(d\)</span> and the value of the dot-product, thus the value before applying the activation function:</p>
<p><div class="math">$e = d-\mathbf{w} \cdot \mathbf{x}$</div></p>
<p>And this is continuous and thus differentiable.</p>
<p>But let us step back a bit and first explain what “differentiable” and “continuous” mean.</p>
<h2 id="differentiatable-continuous-any-other-new-words-youd-like-to-drop">Differentiatable, Continuous? Any other new words you’d like to drop?</h2>
<h3 id="the-derivative-of-a-single-variable-function">The Derivative of a single variable function</h3>
<p>Let’s say we have a function with a single variable:</p>
<p><div class="math">$y = f(x)$</div></p>
<p>The operation of taking the derivative is called differentiation. The mathematical definition of the derivative is:</p>
<p><div class="math">$f'(x) = \lim_{h\to0}\frac{f(x+h) - f(x)}{h}$</div></p>
<p>Notice the accent (<span class="math">\('\)</span>) on the <span class="math">\(f\)</span>: this means <em>the derivative</em>.<br>
Another notation is sometimes used:</p>
<p><div class="math">$\frac{df}{dx}$</div></p>
<p>in which we can interpret the <span class="math">\(d\)</span> as meaning <em>a small change</em>. So the derivative is a small change in the outcome of a function divided by a small change in the argument of that function. Hence, the derivative of a function at a certain point is <em>the slope</em> of that function at that point: it gives a measure for how much the output of the function changes for a certain change of the input at a certain point.<br>
To make this explicit in the above definition, we could write:</p>
<p><div class="math">$\begin{aligned} f'(x) &= \lim_{h\to0}\frac{f(x+h) - f(x)}{(x+h) - x}\\ &=\lim_{h\to0}\frac{f(x+h) - f(x)}{h} \end{aligned}$</div></p>
<p>What is this <span class="math">\(\lim_{h\to0}\)</span> thing?</p>
<h3 id="understanding-the-limit-of-a-function">Understanding the Limit of a function</h3>
<p>In general, when we write</p>
<p><div class="math">$\lim_{x\to{c}}f(x) = L$</div></p>
<p>we mean the value the function <span class="math">\(f(x)\)</span> takes as <span class="math">\(x\)</span> approaches <span class="math">\(c\)</span></p>
<p>From Wikipedia:</p>
<blockquote>
<p><span class="math">\(f(x)\)</span> can be made to be as close to <span class="math">\(L\)</span> as desired by making <span class="math">\(x\)</span> sufficiently close to <span class="math">\(c\)</span>. In that case, the above equation can be read as "the limit of <span class="math">\(f\)</span> of <span class="math">\(x\)</span>, as <span class="math">\(x\)</span> approaches <span class="math">\(c\)</span>, is <span class="math">\(L\)</span></p>
</blockquote>
<p>It does <em>NOT</em> mean that <span class="math">\(x\)</span> becomes equal to <span class="math">\(c\)</span> !!! So, it is approaching within a very small range without ever reaching it. This “approaching” is (for our simple function <span class="math">\(y = f(x)\)</span>) from both sides of the value <span class="math">\(c\)</span>.</p>
<p>A mathematically more rigorous definition is known as the (ε, δ)-definition of a limit.</p>
<p>Let <span class="math">\(I\)</span> be an open interval containing <span class="math">\(c\)</span> (an interval basically being two numbers and all the numbers in-between them. “Open” meaning not containing the bounding two numbers and “closed” containing the two bounding numbers), and let <span class="math">\(f\)</span> be a function defined on <span class="math">\(I\)</span> (thus <span class="math">\(f(x)\)</span> having an outcome for each number between the interval its bounding numbers), except possibly at <span class="math">\(c\)</span></p>
<p>Let <span class="math">\(\delta\)</span> and <span class="math">\(\epsilon\)</span> be two positive numbers:</p>
<p><div class="math">$\begin{aligned} \delta &\in \mathbb{R}_{>0}\\ \epsilon &\in \mathbb{R}_{>0} \end{aligned}$</div></p>
<p>Then we can write:</p>
<p><div class="math">$\lim_{x\to{c}}f(x) = L$</div></p>
<p>means that for all <span class="math">\(x \ne c\)</span>:</p>
<p><div class="math">$\lvert x−c \lvert < \delta \implies \lvert f(x)−L\lvert < \epsilon$</div></p>
<p>So what we are saying here is that for every <span class="math">\(\epsilon\)</span> there is a <span class="math">\(\delta\)</span> such that the above formula holds. The <span class="math">\(\epsilon\)</span> part in this mathematical definition takes care of the <em><span class="math">\(f(x)\)</span> can be made to be as close to <span class="math">\(L\)</span> as desired</em> part in the wikipedia definition, and <span class="math">\(\delta\)</span> takes care of the <em>by making <span class="math">\(x\)</span> sufficiently close to <span class="math">\(c\)</span></em> part.</p>
<p>There are however a few things to notice here:</p>
<p>First, we take the absolute value of the difference. This means that when we say <span class="math">\(x\to{c}\)</span> we mean we approach <span class="math">\(c\)</span> from every direction. On a linear scale this means from both the right side and the left side.</p>
<p>Second: we say <em>smaller than</em> and <strong>NOT</strong> <em>smaller than or equal to</em>, which corresponds with the statement in the definition of the function above:</p>
<blockquote>
<p>and let <span class="math">\(f\)</span> be a function defined on <span class="math">\(I\)</span>, <strong>except possibly at <span class="math">\(c\)</span></strong></p>
</blockquote>
<p>There is no need for the function to be defined <strong>at the value</strong> the limit is approaching, because we never actually reach this value! We just approach it, getting as close to it as we want, from every direction. Mind, however, that this only holds for the value itself: the function does have to be defined <em>around</em> the value.</p>
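<p>A small numeric sketch of this point (the function is a made-up example): <span class="math">\(f(x) = \frac{x^2 - 1}{x - 1}\)</span> is not defined at <span class="math">\(x = 1\)</span>, yet its limit as <span class="math">\(x\)</span> approaches <span class="math">\(1\)</span> exists and equals <span class="math">\(2\)</span>, as we can observe by approaching <span class="math">\(1\)</span> from both sides:</p>
<pre><code># f(x) = (x**2 - 1) / (x - 1) is undefined at x = 1, but its limit there is 2.
f = lambda x: (x * x - 1.0) / (x - 1.0)

for x in (0.9, 0.99, 0.999, 1.001, 1.01, 1.1):   # approach 1 from the left and from the right
    print(x, f(x))                                # the values approach 2
</code></pre>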
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/Limit.html">Limit of a function interactive</a></p>
<h3 id="understanding-the-derivative">Understanding the derivative</h3>
<p>Getting back to our derivative: simply put, it is the change in the result of a function divided by the change in the argument of that function, for a change in argument going to (<em>approaching</em>) <span class="math">\(0\)</span> but not equal to <span class="math">\(0\)</span>.</p>
<p>Remember how we said that for calculating the limit the function does not have to be defined at the value we approach? For calculating the derivative it does have to be. Not only that, it also has to be <em>continuous</em>.</p>
<p>For a function to be differentiable, the derivative must exist at all points of its domain. And for a function to be differentiable at a point, it must be <em>continuous</em> at that point.</p>
<p>Continuity of a function is defined by the mathematical expression</p>
<p><div class="math">$\lim_{x\to{c}}f(x) = f(c)$</div></p>
<p>Or in words: a function is continuous at a point <span class="math">\(c\)</span> if the limit of the function approaching that point is equal to the value of the function at that point. Notice how this is in contrast with the definition of the limit itself in which we said the function does not even have to be defined at <span class="math">\(c\)</span>.</p>
<p>Intuitively, you can see this as filling a gap: around <span class="math">\(c\)</span> we have a certain value and we say that at <span class="math">\(c\)</span> we want the same value.</p>
<p>So why is this a necessary condition for differentiability?</p>
<p>This has to do with being able to divide by <span class="math">\(0\)</span>.</p>
<p>Let us get back to the mathematical definition of differentiation:</p>
<p><div class="math">$f'(x) = \lim_{h\to0}\frac{f(x+h) - f(x)}{h}$</div></p>
<p>If we take the derivative at a certain point <span class="math">\(a\)</span>, we can also write this as:</p>
<p><div class="math">$f'(a) = \lim_{x\to{a}}\frac{f(x) - f(a)}{x-a}$</div></p>
<p><em>The following is NOT a rigorous mathematical explanation, but rather a “gut feeling” explanation.</em></p>
<p>If we take the value of the denominator for <span class="math">\(x\to{a}\)</span>, then this value is going to (<em>approaching</em>) <span class="math">\(0\)</span>. This means the result of the quotient is getting larger and larger, and for the value <span class="math">\(0\)</span> it would be infinite. The only way to avoid this quotient going to infinity is by making sure the numerator itself is also approaching <span class="math">\(0\)</span>. And thus, for <span class="math">\(x\to{a}\)</span>, the value of <span class="math">\(f(x)\)</span> must go to <span class="math">\(f(a)\)</span>, which is exactly the definition of continuity given above:</p>
<p><div class="math">$\lim_{x\to{a}}f(x) = f(a)$</div></p>
<p>A more mathematically rigorous proof can be found in the references section.</p>
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/Continuity.html">Continuity</a></p>
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/FunctionDerivation.html">Derivation</a></p>
<h3 id="the-derivative-of-a-multiple-variable-function-the-directional-derivative">The derivative of a multiple variable function: the Directional Derivative</h3>
<p>Of course, our function can depend on more than one argument while still having a single outcome <span class="math">\(y\)</span>:</p>
<p><div class="math">$y = f(x_1,x_2,...,x_n)$</div></p>
<p>In this case the slope becomes a <em>gradient</em> and is calculated in a certain direction. We speak of the <em>directional derivative</em>:</p>
<p>We define the directional derivative in the direction <span class="math">\(u\)</span>, with <span class="math">\(\hat{u}\)</span> the unit vector in the direction of <span class="math">\(u\)</span> or <span class="math">\(\hat{u} = \frac{u}{\lvert\lvert{u}\lvert\lvert}\)</span>, as:</p>
<p><div class="math">$\begin{aligned} {f'} _u &= \lim_{h\to0}\frac{f(x+h\hat{u}) - f(x)}{h}\\ &=\lim_{h\to0}\frac{f(x_1+h\hat{u_1}, x_2+h\hat{u_2}, ..., x_n+h\hat{u_n}) - f(x_1, u_2, ..., x_n)}{h} \end{aligned}$</div></p>
<p>In this formula the symbol <span class="math">\({f'}_u\)</span> stands for <em>directional derivative</em> and <span class="math">\(\hat{u_1}, \hat{u_2}, ..., \hat{u_n}\)</span> are the components of the unit vector <span class="math">\(\hat{u}\)</span> along the <span class="math">\(x_1, x_2, ...,x_n\)</span> axes.</p>
<p>So it is the slope of the function in the direction of <span class="math">\(u\)</span>. (If you are unfamiliar with vectors, now would be a good time to read the first article in this series.)</p>
<p>But how can we easily calculate this and, more interestingly (for our optimisation problem), in which direction is it largest? For this we need so-called <em>partial derivatives</em>.</p>
<p>A partial derivative of a multi-variable function with respect to one of its variables is the derivative of that function with respect to that variable, with all other variables held constant. You can think of it as reducing the multi-variable function to a single-variable function and then calculating the derivative of that.</p>
<p>Say we have a function:</p>
<p><div class="math">$f(x_1,x_2,...,x_i, ...,x_n)$</div></p>
<p>Then we can define partial derivatives for each variable:</p>
<p><div class="math">$\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_i} ..., \frac{\partial f}{\partial x_n}$</div></p>
<p>in which the partial derivative in <span class="math">\(x_i\)</span> at a certain point <span class="math">\((x_1,x_2,...,x_i, ...,x_n)\)</span> is:</p>
<p><div class="math">$\frac{\partial f}{\partial x_i}(x_1,x_2,...,x_i, ...,x_n) = \lim_{h\to0}\frac{f(x_1,x_2,...,x_i+h, ...,x_n) - f(x_1,x_2,...,x_i, ...,x_n)}{h}$</div></p>
<p>It is the rate of change / slope of the function in the direction of <span class="math">\(x_i\)</span>.</p>
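<p>As a sketch (with a made-up function and point), partial derivatives can be approximated numerically by perturbing one variable at a time while keeping the others fixed:</p>
<pre><code># Numerical partial derivatives of f(x1, x2) = x1**2 + 3*x1*x2 at the point (2, 1).
def f(x1, x2):
    return x1 * x1 + 3.0 * x1 * x2

x1, x2, h = 2.0, 1.0, 1e-6
df_dx1 = (f(x1 + h, x2) - f(x1, x2)) / h   # x2 held constant -> about 2*x1 + 3*x2 = 7
df_dx2 = (f(x1, x2 + h) - f(x1, x2)) / h   # x1 held constant -> about 3*x1 = 6
print(df_dx1, df_dx2)
</code></pre>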
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/Gradient.html">the Partial Derivative</a></p>
<h3 id="the-derivative-of-a-multiple-variable-function-the-gradient-vector">The derivative of a multiple variable function: the Gradient vector</h3>
<p>The gradient at a point is a vector with the direction of the greatest increase of a multi-variable function, and its magnitude is the rate of increase of that function. Or, we can say that the gradient is the directional derivative in the direction of the biggest change. It is composed of the partial derivatives of the function.</p>
<p><div class="math">$\nabla{f} = [\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_i} ..., \frac{\partial f}{\partial x_n}]$</div></p>
<p>So, why is the gradient in the direction of the biggest change? For this we need to investigate the relation between the gradient and the directional derivative. We can prove the following relation between the gradient and the directional derivative (see the references section for some proofs available on the internet):</p>
<p><div class="math">$\begin{aligned} {f'} _u &= \nabla{f} \cdot \hat{u}\\ &=[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_i} ..., \frac{\partial f}{\partial x_n}] \cdot [\hat{u_1}, \hat{u_1}, ..., \hat{u_i}, ..., \hat{u_n}]\\ &= \frac{\partial f}{\partial x_1}\hat{u_1} + \frac{\partial f}{\partial x_1}\hat{u_1} + ... + \frac{\partial f}{\partial x_i}\hat{u_i} + ... + \frac{\partial f}{\partial x_n}\hat{u_n}\\ &= \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\hat{u_i} \end{aligned}$</div></p>
<p>So, the directional derivative along <span class="math">\(u\)</span> is the dot product of the gradient with the unit vector in the direction of <span class="math">\(u\)</span>. Mind that this is the dot-product and thus the result is a number and not a vector!</p>
<p>As we know from the first article, the dot product of two vectors can be expressed through the size of the vectors and the angle between the vectors:</p>
<p><div class="math">$\begin{aligned} \mathbf{c} &= \mathbf{a} \cdot \mathbf{b}\\ &= {\lvert\lvert{\mathbf{a}}\lvert\lvert}\text{ }{\lvert\lvert{\mathbf{b}}\lvert\lvert}\text{ }cos(\alpha)\\ \end{aligned}$</div></p>
<p>Applying this to our definition of the directional derivative above gives:</p>
<p><div class="math">$\begin{aligned} {f'}_u &= \nabla{f} \cdot \hat{u}\\ &= {\lvert\lvert{\nabla{f}}\lvert\lvert}\text{ }{\lvert\lvert{\hat{u}}\lvert\lvert}\text{ }cos(\alpha)\\ \end{aligned}$</div></p>
<p>And because the maximum value for the cosine of an angle is 1 when the angle is 0, we know that the directional derivative will be maximized when the vector <span class="math">\(\hat{u}\)</span> is in the same direction as the gradient vector <span class="math">\(\nabla{f}\)</span>. Thus, the directional derivative is maximal in the direction of the gradient.</p>
<p>Also, because the size of the vector <span class="math">\(\hat{u}\)</span> is 1 (we defined it as being the unit vector), you can see that the size of the gradient vector <span class="math">\(\nabla{f}\)</span> is the slope of the function at that point in the direction of the gradient.</p>
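<p>A sketch tying these pieces together (again with the made-up function from the previous sketch): compute the gradient numerically, then evaluate the directional derivative <span class="math">\(\nabla{f} \cdot \hat{u}\)</span> for a few unit directions and observe that it is largest in the direction of the gradient itself:</p>
<pre><code>import math

# f(x1, x2) = x1**2 + 3*x1*x2; its gradient is approximated with finite differences.
def f(x1, x2):
    return x1 * x1 + 3.0 * x1 * x2

def gradient(x1, x2, h=1e-6):
    return [(f(x1 + h, x2) - f(x1, x2)) / h,
            (f(x1, x2 + h) - f(x1, x2)) / h]

g = gradient(2.0, 1.0)                       # roughly [7, 6]
g_norm = math.hypot(g[0], g[1])

directions = {
    "along the gradient": [g[0] / g_norm, g[1] / g_norm],
    "along x1":           [1.0, 0.0],
    "along x2":           [0.0, 1.0],
}
for name, u in directions.items():
    directional = g[0] * u[0] + g[1] * u[1]  # dot product of the gradient with the unit vector
    print(name, directional)                 # the value along the gradient is the largest
</code></pre>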
<h3 id="the-derivative-of-the-derivative-of-the-...-second-derivative-third-derivative-etc...">The derivative of the derivative of the …: Second Derivative, Third Derivative, etc…</h3>
<p>Of course, in the end the result of differentiation is also just a function which we can also (possibly) differentiate.</p>
<p>For functions of a single variable this results in the so-called <em>second derivative</em>, <em>third derivative</em>, etc… For these we use the notation:</p>
<p><div class="math">$f''(x), f'''(x), ...$</div></p>
<p>or</p>
<p><div class="math">$\frac{d^2f}{dx^2}, \frac{d^3f}{dx^3}$</div></p>
<p>We can do the same with functions of multiple variables.</p>
<p>However, because we have multiple variables we also again have partial derivatives, which results in partial derivatives of partial derivatives. Remember that for a function of several variables we had a partial derivative for each variable. But those partial derivatives are also (possibly) functions of several variables, so for each partial derivative we have multiple second partial derivatives.</p>
<p>Say we have a function:</p>
<p><div class="math">$f(x_1,x_2,...,x_i, ...,x_n)$</div></p>
<p>Then we can define partial derivatives for each variable:</p>
<p><div class="math">$\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_i} ..., \frac{\partial f}{\partial x_n}$</div></p>
<p>However, each of these partial derivatives is again possibly a function of those variables:</p>
<p><div class="math">$\begin{aligned} \frac{\partial f}{\partial x_1} &= g_1(x_1,x_2,...,x_i, ...,x_n), \\ \frac{\partial f}{\partial x_2} &= g_2(x_1,x_2,...,x_i, ...,x_n), \\ ..., \\ \frac{\partial f}{\partial x_i} &= g_i(x_1,x_2,...,x_i, ...,x_n), \\ ..., \\ \frac{\partial f}{\partial x_n} &= g_n(x_1,x_2,...,x_i, ...,x_n) \end{aligned}$</div></p>
<p>And thus for each function <span class="math">\(g()\)</span> we again have the partial derivatives:</p>
<p><div class="math">$\frac{\partial g_i}{\partial x_1}, \frac{\partial g_i}{\partial x_2}, ..., \frac{\partial g_i}{\partial x_i} ..., \frac{\partial g_i}{\partial x_n}$</div></p>
<p>And because <span class="math">\(g_i(x_1,x_2,...,x_i, ...,x_n) = \frac{\partial f}{\partial x_i}\)</span></p>
<p><div class="math">$\frac{\partial^2 f}{{\partial x_i \partial x_1}}, \frac{\partial^2 f}{{\partial x_i \partial x_2}}, ..., \frac{\partial^2 f}{{\partial x_i \partial x_i}} ..., \frac{\partial^2 f}{{\partial x_i \partial x_n}}$</div></p>
<p>Of course, these second partial derivatives are also functions of multiple variables, so the same line of thinking applies to them.</p>
<p>Just as we had a (first order) directional derivative for functions with multiple variables, we also have a second order directional derivative and by extension higher order directional derivatives.</p>
<p>Let us have a look at the second order directional derivative. It is the directional derivative in the direction <span class="math">\(u\)</span> of the directional derivative in the direction <span class="math">\(u\)</span> of the function <span class="math">\(f\)</span>.</p>
<p>Remember the definition of the directional derivative (I’ve changed the symbol for the function from <span class="math">\(f\)</span> to <span class="math">\(g\)</span>, you’ll understand why in a second):</p>
<p><div class="math">$\begin{aligned} {g '} _u &= \lim_{h\to0}\frac{g(x+h\hat{u}) - g(x)}{h}\\ &=\lim_{h\to0}\frac{g(x_1+h\hat{u_1}, x_2+h\hat{u_2}, ..., x_n+h\hat{u_n}) - g(x_1, u_2, ..., x_n)}{h}\\ &= \frac{\partial g}{\partial x_1}\hat{u_1} + \frac{\partial g}{\partial x_2}\hat{u_2} + ... + \frac{\partial g}{\partial x_i}\hat{u_i} + ... + \frac{\partial g}{\partial x_n}\hat{u_n}\\ \end{aligned}$</div></p>
<p>For the second directional derivative, the function <span class="math">\(g\)</span> is the first directional derivative, thus:</p>
<p><div class="math">$g = {f'}_u$</div></p>
<p>So the derivative of <span class="math">\(g\)</span> is the second order derivative of <span class="math">\(f\)</span>, thus:</p>
<p><div class="math">$\begin{aligned} \frac{\partial g}{\partial x_i} &= \frac{\partial {f'}_u}{\partial x_i}\\ &= \frac{\partial {f'}_u}{\partial x_i}\\ &= \frac{\partial {(\frac{\partial f}{\partial x_1}\hat{u_1} + \frac{\partial f}{\partial x_2}\hat{u_1} + ... + \frac{\partial f}{\partial x_i}\hat{u_i} + ... + \frac{\partial f}{\partial x_n}\hat{u_n}})}{\partial x_i}\\ &= \frac{\partial^2 f}{\partial x_1\partial x_i}\hat{u_1} + \frac{\partial^2 f}{\partial x_2\partial x_i}\hat{u_2} + ... + \frac{\partial^2 f}{\partial x_i\partial x_i}\hat{u_i} + ... + \frac{\partial^2 f}{\partial x_n\partial x_i}\hat{u_n}\\ &= \frac{\partial^2 f}{\partial x_1\partial x_i}\hat{u_1} + ... + \frac{\partial^2 f}{\partial x_n\partial x_i}\hat{u_n} \end{aligned}$</div></p>
<p>And if we plug this in the formula above:</p>
<p><div class="math">$\begin{aligned} {g '} _u &= \frac{\partial g}{\partial x_1}\hat{u_1} ... + \frac{\partial g}{\partial x_n}\hat{u_n}\\ &= ( \frac{\partial^2 f}{\partial x_1\partial x_1}\hat{u_1} + ... + \frac{\partial^2 f}{\partial x_n\partial x_1}\hat{u_n})\hat{u_1} + ... + (\frac{\partial^2 f}{\partial x_1\partial x_n}\hat{u_1} + ... + \frac{\partial^2 f}{\partial x_n\partial x_n}\hat{u_n})\hat{u_n}\\ &= \frac{\partial^2 f}{\partial x_1\partial x_1}\hat{u_1}\hat{u_1} + ... + \frac{\partial^2 f}{\partial x_n\partial x_1}\hat{u_n}\hat{u_1} + ... + \frac{\partial^2 f}{\partial x_1\partial x_n}\hat{u_1}\hat{u_n} + ... + \frac{\partial^2 f}{\partial x_n\partial x_n}\hat{u_n}\hat{u_n}\\ \end{aligned}$</div></p>
<p>As you can see this formula is starting to get huge. I already left off some indices but still, it is easy to get lost in all those indices. And we are only at the second directional derivative. Imagine what you would get with higher order (third, fourth, etc…) directional derivatives.</p>
<p>Luckily, there is a more condensed way of writing this formula, but for that we need matrices.</p>
<h2 id="yes-here-is-another-word-matrices">Yes, here is another word: ‘Matrices’</h2>
<h3 id="definition-of-a-matrix">Definition of a matrix</h3>
<p>From <a href="https://en.wikipedia.org/wiki/Matrix_(mathematics)">Wikipedia</a>:</p>
<blockquote>
<p>In mathematics, a matrix (plural: matrices) is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns.</p>
</blockquote>
<p>In the following example we have a matrix with 3 rows and 4 columns. We say we have a <span class="math">\(3 \times 4\)</span> matrix (read as ‘3 by 4’). The number of rows and the number of columns are the so called <em>dimensions</em> of the matrix. Notice that we first mention the number of rows and then the number of columns.</p>
<p><div class="math">$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34}\\ \end{bmatrix} $</div></p>
<h3 id="operations-with-matrices">Operations with matrices</h3>
<h4 id="sum-and-difference-of-two-matrices">Sum and difference of two Matrices</h4>
<p>The sum and difference of two matrices is defined by the sum or difference of each element at the same position in the matrix. Thus, say we have two matrices <span class="math">\(A\)</span> and <span class="math">\(B\)</span> and we want to calculate the sum <span class="math">\(C\)</span>, then:</p>
<p><div class="math">$c_{ij} = a_{ij} + b_{ij}$</div></p>
<p>In which <span class="math">\(i\)</span> and <span class="math">\(j\)</span> are subscripts for the rows and columns.</p>
<p>As an example</p>
<p><div class="math">$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34}\\ \end{bmatrix} $</div></p>
<p><div class="math">$B = \begin{bmatrix} b_{11} & b_{12} & b_{13} & b_{14}\\ b_{21} & b_{22} & b_{23} & b_{24}\\ b_{31} & b_{32} & b_{33} & b_{34}\\ \end{bmatrix} $</div></p>
<p>Then the sum is</p>
<p><div class="math">$\begin{aligned} C &= \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & a_{13} + b_{13} & a_{14} + b_{14}\\ a_{21} + b_{21} & a_{22} + b_{22} & a_{23} + b_{23} & a_{24} + b_{24}\\ a_{31} + b_{31} & a_{32} + b_{32} & a_{33} + b_{33} & a_{34} + b_{34}\\ \end{bmatrix}\\ &= \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14}\\ c_{21} & c_{22} & c_{23} & c_{24}\\ c_{31} & c_{32} & c_{33} & c_{34}\\ \end{bmatrix}\\ \end{aligned}$</div></p>
<p>A similar definition exists for subtraction.</p>
<p>This also means we can NOT add or subtract matrices with different dimensions, because some elements in one matrix would not have counterparts in the other matrix.</p>
<h4 id="scalar-multiplication">Scalar multiplication</h4>
<p>The scalar multiplication of a matrix with a scalar is the result of multiplying each entry in the matrix with the scalar. Thus, if we have a matrix <span class="math">\(A\)</span> and a scalar <span class="math">\(\lambda\)</span>, then:</p>
<p><div class="math">$c_{ij} = \lambda a_{ij}$</div></p>
<p>In which <span class="math">\(i\)</span> and <span class="math">\(j\)</span> are subscripts for the rows and columns.</p>
<p>As an example</p>
<p><div class="math">$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34}\\ \end{bmatrix} $</div></p>
<p>Then the scalar multiplication with <span class="math">\(\lambda\)</span> is</p>
<p><div class="math">$\begin{aligned} C &= \begin{bmatrix} \lambda a_{11} & \lambda a_{12} & \lambda a_{13} & \lambda a_{14}\\ \lambda a_{21} & \lambda a_{22} & \lambda a_{23} & \lambda a_{24}\\ \lambda a_{31} & \lambda a_{32} & \lambda a_{33} & \lambda a_{34}\\ \end{bmatrix}\\ &= \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14}\\ c_{21} & c_{22} & c_{23} & c_{24}\\ c_{31} & c_{32} & c_{33} & c_{34}\\ \end{bmatrix}\\ \end{aligned}$</div></p>
<h4 id="matrix-multiplication">Matrix multiplication</h4>
<p>Multiplication of two matrices is defined by taking the dot product of the rows of the first matrix with the columns of the second matrix. Thus, if we have a matrix <span class="math">\(A\)</span> with dimensions <span class="math">\(m \times n\)</span> and matrix <span class="math">\(B\)</span> with dimensions <span class="math">\(n \times p\)</span>, then:</p>
<p><div class="math">$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$</div></p>
<p>There are some important points to take away from this definition:</p>
<ul>
<li>multiplication of two matrices is only possible if the number of columns in the first matrix <span class="math">\(A\)</span> equals the number of rows in the second matrix <span class="math">\(B\)</span> (Notice the letters used for the dimensions of the matrices when I introduced the formula).</li>
<li>If we have a matrix <span class="math">\(A\)</span> with dimensions <span class="math">\(m \times n\)</span> and matrix <span class="math">\(B\)</span> with dimensions <span class="math">\(n \times p\)</span>, then the dimensions of the resulting matrix of the multiplication are <span class="math">\(m \times p\)</span></li>
<li>multiplication of matrices is <strong>NOT</strong> commutative. Thus <span class="math">\(AB \neq BA\)</span>. This can easily be seen from the definition of the multiplication below. As a matter of fact, it is not even certain that you can perform the multiplication in the reversed order at all, because of the requirement regarding the columns and the rows.</li>
<li>multiplication of matrices is associative. Thus <span class="math">\(A(BC) = (AB)C\)</span>. This can be easily proven by just writing out the formula. In the references section I also provide a link to a very understandable proof.</li>
</ul>
<p>As an example we have a matrix <span class="math">\(A\)</span> with dimensions <span class="math">\(4 \times 3\)</span></p>
<p><div class="math">$A = \begin{bmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\\ a_{41} & a_{42} & a_{43}\\ \end{bmatrix} $</div></p>
<p>and we have a matrix <span class="math">\(B\)</span> with dimensions <span class="math">\(3 \times 2\)</span></p>
<p><div class="math">$B = \begin{bmatrix} b_{11} & b_{12}\\ b_{21} & b_{22}\\ b_{31} & b_{32}\\ \end{bmatrix} $</div></p>
<p>Then the multiplication is</p>
<p><div class="math">$\begin{aligned} C &= \begin{bmatrix} a_{11} b_{11} + a_{12} b_{21} + a_{13} b_{31} & a_{11} b_{12} + a_{12} b_{22} + a_{13} b_{32} \\ a_{21} b_{11} + a_{22} b_{21} + a_{23} b_{31} & a_{21} b_{12} + a_{22} b_{22} + a_{23} b_{32} \\ a_{31} b_{11} + a_{32} b_{21} + a_{33} b_{31} & a_{31} b_{12} + a_{32} b_{22} + a_{33} b_{32} \\ a_{41} b_{11} + a_{42} b_{21} + a_{43} b_{31} & a_{41} b_{12} + a_{42} b_{22} + a_{43} b_{32} \\ \end{bmatrix}\\ &= \begin{bmatrix} c_{11} & c_{12}\\ c_{21} & c_{22}\\ c_{31} & c_{32}\\ c_{41} & c_{42}\\ \end{bmatrix}\\ \end{aligned}$</div></p>
<p>With dimensions <span class="math">\(4 \times 2\)</span></p>
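<p>As a small sketch of the definition <span class="math">\(c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}\)</span> in plain Python (the numeric values are arbitrary), multiplying a <span class="math">\(4 \times 3\)</span> matrix with a <span class="math">\(3 \times 2\)</span> matrix gives a <span class="math">\(4 \times 2\)</span> matrix:</p>
<pre><code># Multiply a 4x3 matrix A with a 3x2 matrix B; the result C is 4x2.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     [10, 11, 12]]
B = [[1, 2],
     [3, 4],
     [5, 6]]

rows, inner, cols = len(A), len(B), len(B[0])
C = [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
     for i in range(rows)]
print(C)   # [[22, 28], [49, 64], [76, 100], [103, 136]]
</code></pre>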
<h4 id="matrix-transpose">Matrix transpose</h4>
<p>The transpose of a matrix is the matrix obtained by converting the rows of the source matrix into columns and the columns into rows. Thus, if we have a matrix <span class="math">\(A\)</span> with dimensions <span class="math">\(m \times n\)</span>, then:</p>
<p><div class="math">$c_{ij} = a_{ji}$</div></p>
<p>As an example we have a matrix <span class="math">\(A\)</span> with dimensions <span class="math">\(4 \times 2\)</span></p>
<p><div class="math">$A = \begin{bmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\\ a_{31} & a_{32}\\ a_{41} & a_{42}\\ \end{bmatrix} $</div></p>
<p>Then the transpose is</p>
<p><div class="math">$\begin{aligned} C &= \begin{bmatrix} a_{11} & a_{21} & a_{31} & a_{41} \\ a_{12} & a_{22} & a_{32} & a_{42} \\ \end{bmatrix}\\ &= \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14}\\ c_{21} & c_{22} & c_{23} & c_{24}\\ \end{bmatrix}\\ \end{aligned}$</div></p>
<p>The transpose of a matrix <span class="math">\(A\)</span> is noted as <span class="math">\(A^\top\)</span></p>
<h4 id="a-side-note-on-the-transpose-and-the-notation-of-the-perceptron-integration-function">A side note on the transpose and the notation of the perceptron integration function</h4>
<p>If you’ve been searching the internet for information about the perceptron, chances are you’ve found formulas for the perceptron making use of this transpose:</p>
<p><div class="math">$w^\top x$</div></p>
<p>So why don’t we?</p>
<p>We’ve been using vector notation for the perceptron, which makes use of the dot-product defined on vectors. However, instead of considering <span class="math">\(w\)</span> and <span class="math">\(x\)</span> as being vectors, we could also consider them as being matrices with <span class="math">\(n\)</span> rows and a single column, thus dimensions <span class="math">\(n \times 1\)</span>.</p>
<p><div class="math">$\begin{aligned} w &= \begin{bmatrix} w_{11} \\ w_{12} \\ ... \\ w_n \\ \end{bmatrix}\\ x &= \begin{bmatrix} x_{11} \\ x_{12} \\ ... \\ x_n \\ \end{bmatrix}\\ \end{aligned}$</div></p>
<p>In this case the dot-product is replaced by matrix multiplication, and by definition this multiplication is not defined for two matrices which both have dimensions <span class="math">\(n \times 1\)</span>. We can however multiply a matrix with dimensions <span class="math">\(1 \times n\)</span> with a matrix with dimensions <span class="math">\(n \times 1\)</span>. To get this <span class="math">\(1 \times n\)</span> matrix we take the transpose of our <span class="math">\(n \times 1\)</span> matrix. Hence the formula with the transpose:</p>
<p><div class="math">$\begin{aligned} w^\top x &= \begin{bmatrix} w_{11} & w_{12} & ... & w_n \end{bmatrix} \begin{bmatrix} x_{11} \\ x_{12} \\ ... \\ x_n \\ \end{bmatrix}\\ &= \begin{bmatrix} w_11x_11 + w_22x_22 + ... + w_nx_n \end{bmatrix} \end{aligned}$</div></p>
<p>Notice however that mathematically the outcome of the dot product is a scalar, while the outcome of the matrix multiplication is another matrix (here a <span class="math">\(1 \times 1\)</span> matrix). So, although the calculated value is the same, the mathematical construct is not!</p>
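<p>A short sketch makes the difference between the two constructs visible (again NumPy, and the numbers are made up for the example):</p>
<pre><code>import numpy as np

w = np.array([0.2, -0.5, 0.8])   # weight vector, n = 3
x = np.array([1.0, 2.0, 3.0])    # feature vector

print(np.dot(w, x))              # dot product: a scalar, here 1.6

W = w.reshape(3, 1)              # the same numbers as n x 1 matrices
X = x.reshape(3, 1)
print(W.T @ X)                   # matrix multiplication: a 1 x 1 matrix, [[1.6]]
</code></pre>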
<h3 id="so-why-do-we-need-them-for-the-directional-derivative">So, why do we need them for the directional derivative?</h3>
<p>Now that we know about matrices and matrix multiplication, let us return to the second directional derivative of a function with multiple arguments:</p>
<p><div class="math">$\begin{aligned} {f''} _u &= \frac{\partial^2 f}{\partial x_1\partial x_1}\hat{u_1}\hat{u_1} + ... + \frac{\partial^2 f}{\partial x_n\partial x_1}\hat{u_n}\hat{u_1} + ... + \frac{\partial^2 f}{\partial x_1\partial x_n}\hat{u_1}\hat{u_n} + ... + \frac{\partial^2 f}{\partial x_n\partial x_n}\hat{u_n}\hat{u_n}\\ \end{aligned}$</div></p>
<p>Let us expand this for a function of two variables. <strong>WARNING:</strong> I’ll be using <span class="math">\(x\)</span> and <span class="math">\(y\)</span> instead of subscripts to <span class="math">\(x\)</span>, like <span class="math">\(x_1\)</span> and <span class="math">\(x_2\)</span> because I feel this makes things more clear: it is easier to distinguish <span class="math">\(x\)</span> and <span class="math">\(y\)</span> than <span class="math">\(x_1\)</span> and <span class="math">\(x_2\)</span>. As a consequence I’ll also use <span class="math">\(u_x\)</span> and <span class="math">\(u_y\)</span> instead of <span class="math">\(u_1\)</span> and <span class="math">\(u_2\)</span></p>
<p>Thus, we have a function <span class="math">\(f(x, y)\)</span>. Its second directional derivative equals:</p>
<p><div class="math">${f''}_u = \frac{\partial^2 f}{{\partial x}^2}\hat{u_x}^2 + \frac{\partial^2 f}{\partial x\partial y}\hat{u_x}\hat{u_y} + \frac{\partial^2 f}{\partial y\partial x}\hat{u_y}\hat{u_x} + \frac{\partial^2 f}{{\partial y}^2}\hat{u_y}^2 $</div></p>
<p>If we expand the following matrix multiplication:</p>
<p><div class="math">$\begin{bmatrix} \hat{u_x} & \hat{u_y} \\ \end{bmatrix} \begin{bmatrix} \frac{\partial^2 f}{{\partial x}^2} & \frac{\partial^2 f}{\partial x\partial y} \\ \frac{\partial^2 f}{\partial y\partial x} & \frac{\partial^2 f}{{\partial y}^2} \\ \end{bmatrix} \begin{bmatrix} \hat{u_x} \\ \hat{u_y} \end{bmatrix} $</div></p>
<p>Expansion of the first multiplication:</p>
<p><div class="math">$\begin{bmatrix} \hat{u_x} & \hat{u_y} \\ \end{bmatrix} \begin{bmatrix} \frac{\partial^2 f}{{\partial x}^2} & \frac{\partial^2 f}{\partial x\partial y} \\ \frac{\partial^2 f}{\partial y\partial x} & \frac{\partial^2 f}{{\partial y}^2} \end{bmatrix} $</div></p>
<p>gives:</p>
<p><div class="math">$\begin{bmatrix} \hat{u_x} \frac{\partial^2 f}{{\partial x}^2} + \hat{u_y} \frac{\partial^2 f}{\partial y\partial x} & \hat{u_x} \frac{\partial^2 f}{\partial x\partial y} + \hat{u_y} \frac{\partial^2 f}{{\partial y}^2} \\ \end{bmatrix} $</div></p>
<p>Then continue with the second matrix multiplication:</p>
<p><div class="math">$\begin{aligned} \begin{bmatrix} \hat{u_x} \frac{\partial^2 f}{{\partial x}^2} + \hat{u_y} \frac{\partial^2 f}{\partial y\partial x} & \hat{u_x} \frac{\partial^2 f}{\partial x\partial y} + \hat{u_y} \frac{\partial^2 f}{{\partial y}^2} \\ \end{bmatrix} \begin{bmatrix} \hat{u_x} \\ \hat{u_y} \end{bmatrix} &= \begin{bmatrix} (\hat{u_x} \frac{\partial^2 f}{{\partial x}^2} + \hat{u_y} \frac{\partial^2 f}{\partial y\partial x}) \hat{u_x} + (\hat{u_x} \frac{\partial^2 f}{\partial x\partial y} + \hat{u_y} \frac{\partial^2 f}{{\partial y}^2}) \hat{u_y} \end{bmatrix}\\ &= \begin{bmatrix} \hat{u_x} \frac{\partial^2 f}{{\partial x}^2} \hat{u_x} + \hat{u_y} \frac{\partial^2 f}{\partial y\partial x} \hat{u_x} + \hat{u_x} \frac{\partial^2 f}{\partial x\partial y} \hat{u_y} + \hat{u_y} \frac{\partial^2 f}{{\partial y}^2} \hat{u_y} \\ \end{bmatrix}\\ &= \begin{bmatrix} \frac{\partial^2 f}{{\partial x}^2} \hat{u_x}^2 + \frac{\partial^2 f}{\partial x\partial y} \hat{u_x} \hat{u_y} + \frac{\partial^2 f}{\partial y\partial x} \hat{u_y} \hat{u_x} + \frac{\partial^2 f}{{\partial y}^2} \hat{u_y}^2 \\ \end{bmatrix} \end{aligned}$</div></p>
<p>If we compare this last result with the expanded definition of the second directional derivative we see that they are the same. So, the matrix multiplication results in the second directional derivative:</p>
<p><div class="math">${f''}_u = \begin{bmatrix} \hat{u_x} & \hat{u_y} \\ \end{bmatrix} \begin{bmatrix} \frac{\partial^2 f}{{\partial x}^2} & \frac{\partial^2 f}{\partial x\partial y} \\ \frac{\partial^2 f}{\partial y\partial x} & \frac{\partial^2 f}{{\partial y}^2} \\ \end{bmatrix} \begin{bmatrix} \hat{u_x} \\ \hat{u_y} \end{bmatrix} $</div></p>
<p>We can of course expand this again and generalize for multiple variables (using subscripts for the variables):</p>
<p><div class="math">${f''}_u = \begin{bmatrix} \hat{u_1} & \hat{u_2} ... & \hat{u_i} & ... & \hat{u_n} \\ \end{bmatrix} \begin{bmatrix} \frac{\partial^2 f}{{\partial x_1}^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & ... & \frac{\partial^2 f}{\partial x_1\partial x_i} & ... & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{{\partial x_2}^2} & ... & \frac{\partial^2 f}{\partial x_2\partial x_i} & ... & \frac{\partial^2 f}{\partial x_2\partial x_n} \\ ... & ... & ... & ... & ... & ... \\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \frac{\partial^2 f}{{\partial x_n}{\partial x_2}} & ... & \frac{\partial^2 f}{\partial x_n\partial x_i} & ... & \frac{\partial^2 f}{{\partial x_n}^2} \\ \end{bmatrix} \begin{bmatrix} \hat{u_1} \\ \hat{u_2} \\ ... \\ \hat{u_i} \\ ... \\ \hat{u_n} \\ \end{bmatrix} $</div></p>
<p>The matrix with the partial derivatives is called the <strong>Hessian matrix</strong> H:</p>
<p><div class="math">$H = \begin{bmatrix} \frac{\partial^2 f}{{\partial x_1}^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & ... & \frac{\partial^2 f}{\partial x_1\partial x_i} & ... & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{{\partial x_2}^2} & ... & \frac{\partial^2 f}{\partial x_2\partial x_i} & ... & \frac{\partial^2 f}{\partial x_2\partial x_n} \\ ... & ... & ... & ... & ... & ... \\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \frac{\partial^2 f}{{\partial x_n}{\partial x_2}} & ... & \frac{\partial^2 f}{\partial x_n\partial x_i} & ... & \frac{\partial^2 f}{{\partial x_n}^2} \\ \end{bmatrix} $</div></p>
<p>If we now define another matrix U, having the coefficients of the directional unit vector, as:</p>
<p><div class="math">$U = \begin{bmatrix} \hat{u_1} \\ \hat{u_2} \\ ... \\ \hat{u_i} \\ ... \\ \hat{u_n} \\ \end{bmatrix} $</div></p>
<p>We can rewrite the second directional derivative to the terse matrix multiplication:</p>
<p><div class="math">${f''}_u = U^\top H U$</div></p>
<h2 id="and-some-more-words-critical-points-stationary-points-and-fermats-theorem">And some more words: Critical Points, Stationary Points and Fermats Theorem</h2>
<p>Let us not lose track of where we are going: we are looking for a method to find the minimum of a function.</p>
<p>We can find <em>local extremums</em>, that is, a local minimum or maximum, using Fermat’s Theorem on Stationary Points. From Wikipedia:</p>
<blockquote>
<p>Fermat’s theorem is a method to find local maxima and minima of differentiable functions on open sets by showing that every local extremum of the function is a stationary point</p>
</blockquote>
<p>Great, so what are these <em>stationary points</em>? Again from Wikipedia:</p>
<blockquote>
<p>A stationary point of a differentiable function of one variable is a point on the graph of the function where the function’s derivative is zero. (…) For a differentiable function of several real variables, a stationary point is a point on the surface of the graph where all its partial derivatives are zero (equivalently, the gradient is zero).</p>
</blockquote>
<p>If the derivative of a function <span class="math">\(f\)</span> at a certain point is <span class="math">\(0\)</span> <strong>or is not defined</strong>, then those points are called <em>critical points</em>. So basically you could say that all stationary points are critical points, but not all critical points are stationary points: those where the derivative is not defined are NOT stationary points.</p>
<p>In this, a <em>local extremum</em> is a point where the function has either a local maximum or a local minimum. <em>Local</em> as opposed to <em>global</em>: global means there is no other point where the function has a smaller/larger value than the value at the global minimum/maximum, while local means that although the function has a minimum/maximum value at this point, there may be other points where the function has an even smaller/larger result.</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/GlocalVsLocalMinMax.png" alt="Global vs Local Minimum and Maximum"></p>
<p>Mind however that Fermat’s theorem does not apply in the other direction: if the derivative of a function is zero at a certain point, this does not necessarily mean we have a minimum or maximum! We can also have a so called <a href="https://en.wikipedia.org/wiki/Inflection_point">horizontal inflection point</a> in the case of a function of a single variable or a <a href="https://en.wikipedia.org/wiki/Saddle_point">saddle point</a> for functions of multiple variables.</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/SaddlePoint.PNG" alt="Saddle point"></p>
<p>So, how do we know we do indeed have a minimum value? We can do this with the so called <a href="http://mathworld.wolfram.com/ExtremumTest.html">Extremum test</a> or <a href="https://en.wikipedia.org/wiki/Derivative_test">Second Derivative test</a>. For single variable functions it calculates the second derivative and depending on its value a critical point is classified as being a minimum, maximum or inflection point:</p>
<ul>
<li>If <span class="math">\(f'(x) = 0\)</span> and <span class="math">\(f''(x) > 0\)</span>, then the function <span class="math">\(f()\)</span> has a minimum at <span class="math">\(x\)</span></li>
<li>If <span class="math">\(f'(x) = 0\)</span> and <span class="math">\(f''(x) < 0\)</span>, then the function <span class="math">\(f()\)</span> has a maximum at <span class="math">\(x\)</span></li>
<li>If <span class="math">\(f'(x) = 0\)</span> and <span class="math">\(f''(x) = 0\)</span>, then we gain no extra information</li>
</ul>
<p>To see why this works let us get back to the meaning of derivative: the derivative of a function at a point is the slope of that function at that point. Or using another formulation: it is the rate of change of the function at that point.</p>
<p>This also means that if the derivative changes sign, the slope changes sign. It will be clear that a function can only change sign when its value goes through zero. If the slope changes sign, it means the function changes from going up to going down or vice versa. Which consequently means we have a maximum or minimum. So this intuitively explains why we possibly have a minimum or maximum (or an inflection point) if the derivative is zero.</p>
<p>To differentiate between a maximum or a minimum we need to check if the first derivative starts positive and turns negative or vice versa. We can do this using the second derivative which gives us the rate of change or the slope of the first derivative.</p>
<p>If the first derivative starts positive and turns negative, it means it is descending, which means the rate of change or the slope is negative. And thus the second derivative is negative.</p>
<p>Hence our rule:</p>
<ul>
<li>If <span class="math">\(f'(x) = 0\)</span> and <span class="math">\(f''(x) < 0\)</span>, then the function <span class="math">\(f()\)</span> has a maximum at <span class="math">\(x\)</span></li>
</ul>
<p>Likewise, if the first derivative starts negative and becomes positive, it is ascending, thus the second derivative is positive, hence our rule:</p>
<ul>
<li>If <span class="math">\(f'(x) = 0\)</span> and <span class="math">\(f''(x) > 0\)</span>, then the function <span class="math">\(f()\)</span> has a minimum at <span class="math">\(x\)</span></li>
</ul>
<p>However, if the first derivative, when going through zero (and thus our original function has a maximum or minimum), itself has a slope of zero (which means the first derivative has an inflection point there), then the second derivative will at this point also be zero. On the other hand, if the first derivative, when becoming zero, has a maximum or minimum itself (it merely touches zero, but does not change sign), then the second derivative is again zero. But because the first derivative does NOT change sign, the original function has no maximum or minimum there. In short: we have situations in which both the first and the second derivative become zero, but we still cannot conclude whether our original function has a maximum, a minimum or an inflection point at that value.</p>
<p>Hence the rule:</p>
<ul>
<li>If <span class="math">\(f'(x) = 0\)</span> and <span class="math">\(f''(x) = 0\)</span>, then we gain no extra information</li>
</ul>
<p>To be able to make a conclusion we’ll need the third derivative in this case. In fact, there is a general test for this, the higher order derivative test.</p>
<p>Let’s define <span class="math">\(n>1\)</span> and</p>
<ul>
<li>the <span class="math">\(n^{th}\)</span> derivative of a function <span class="math">\(f\)</span> at a point <span class="math">\(c\)</span> is not zero</li>
<li>all lower derivatives (thus <span class="math">\(\frac{d^{n-1}f}{dx^{n-1}}\)</span> until <span class="math">\(\frac{df}{dx}\)</span>) are zero<br>
Then, if</li>
<li>n is even and the <span class="math">\(n^{th}\)</span> derivative is negative; we have a local maximum at the point <span class="math">\(c\)</span></li>
<li>n is even and the <span class="math">\(n^{th}\)</span> derivative is positive; we have a local minimum at the point <span class="math">\(c\)</span></li>
<li>n is odd and the <span class="math">\(n^{th}\)</span> derivative is negative; we have an decreasing inflection point at the point <span class="math">\(c\)</span></li>
<li>n is odd and the <span class="math">\(n^{th}\)</span> derivative is positive; we have an increasing inflection point at the point <span class="math">\(c\)</span></li>
</ul>
<p><em>Remark: if you look for information on the internet about this test you’ll find definitions where the even and odd in the above summary are switched. The reason is that some definitions (like the one on wikipedia) start with <span class="math">\(n=1\)</span> for the <strong>second</strong> derivative. Thus, <span class="math">\(n\)</span> is odd for the second derivative, which complies with the first two rules in our summary above.</em></p>
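<p>If you do not want to take all these derivatives by hand, a symbolic sketch with SymPy (an assumption of mine; the helper function <code>classify</code> below is not from this article) applies the rules above literally:</p>
<pre><code>import sympy as sp

x = sp.symbols('x')

def classify(f, c, max_order=6):
    """Higher-order derivative test at a critical point c (assumes f'(c) = 0)."""
    for n in range(2, max_order + 1):
        value = sp.diff(f, x, n).subs(x, c)   # n-th derivative evaluated at c
        if value != 0:
            if n % 2 == 0:
                return 'local minimum' if value > 0 else 'local maximum'
            return 'increasing inflection point' if value > 0 else 'decreasing inflection point'
    return 'inconclusive'

print(classify(x**4, 0))   # fourth derivative is 24: local minimum
print(classify(x**3, 0))   # third derivative is 6: increasing inflection point
</code></pre>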
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/SecondDerivativeTest.html">Second Derivative test</a></p>
<p>For multivariable functions however, things are more complex. In that case it is called the <a href="https://en.wikipedia.org/wiki/Second_partial_derivative_test">second partial derivative test</a>.</p>
<p>Remember our above discussion of the second derivative test for functions of a single variable: in it we deduced that if the first derivative is zero for a function, then we need to evaluate the second derivative to differentiate between a minimum and maximum value. If the second derivative is negative we have a maximum, if positive we have a minimum.</p>
<p>We can make a similar assertion here.</p>
<p>We have shown that the maximum slope for a multivariable function is in the direction of the gradient. So, if the maximum slope is zero, then the directional derivative will be zero in all directions. Because the maximum slope is the gradient, for the maximum slope to be zero, the gradient must be zero and thus all partial derivatives must be zero.</p>
<p>How can we now differentiate between a maximum, a minimum and a saddle point? A similar line of thinking as for the single-variable function can be followed.</p>
<p>For the point where the gradient is zero to be a minimum or a maximum, the directional derivative must change sign in every direction, or, equivalently, the first directional derivative must keep ascending (or keep descending) through that point in all directions. Which means the second directional derivative must be either positive in all directions or negative in all directions.</p>
<p>Mathematically:</p>
<ul>
<li>If the gradient is zero then we have a minimum if <span class="math">\({f''}_u = U^\top H U \gt 0\)</span></li>
<li>If the gradient is zero then we have a maximum if <span class="math">\({f''}_u = U^\top H U \lt 0\)</span></li>
</ul>
<p>It just so happens that the inequalities above have a name:</p>
<ul>
<li>The definition of Positive Definite is <span class="math">\(A^\top H A \gt 0\)</span></li>
<li>The definition of Negative Definite is <span class="math">\(A^\top H A \lt 0\)</span></li>
</ul>
<p>In both inequalities the matrix <span class="math">\(A\)</span> can be any non-zero matrix with a single column. Our matrix <span class="math">\(U\)</span> however represents the unit vector, but we can see that any such <span class="math">\(A\)</span> is just <span class="math">\(U\)</span> with a scale factor, or <span class="math">\(U\)</span> multiplied by a scalar:</p>
<p><div class="math">$A = \lambda U$</div></p>
<p>And thus:</p>
<p><div class="math">$\begin{aligned} A^\top H A &= \lambda U^\top H \lambda U\\ &= {\lambda}^2 U^\top H U\\ \end{aligned}$</div></p>
<p>And because <span class="math">\({\lambda}^2 \gt 0\)</span>, the inequalities written with matrix A are the same as written with matrix U.<br>
And we can conclude that:</p>
<ul>
<li>If the Hessian of a function is positive definite we have a minimum</li>
<li>If the Hessian of a function is negative definite we have a maximum</li>
</ul>
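<p>As a quick numerical illustration (the matrix and the use of NumPy are my own example), you can probe the sign of <span class="math">\(U^\top H U\)</span> for a large number of directions. Of course this only spot-checks finitely many directions, so it illustrates the definition rather than proving definiteness:</p>
<pre><code>import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])     # a Hessian we want to classify

rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # make them unit vectors

quad_forms = np.einsum('ij,jk,ik->i', directions, H, directions)  # U^T H U for every direction
print(quad_forms.min() > 0)    # True: it curves upward in every sampled direction, so a minimum
</code></pre>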
<h2 id="optimizing-minimizing-the-mean-squared-errors">Optimizing (minimizing) the Mean Squared Errors</h2>
<p>So, let us see where we are now:</p>
<ol>
<li>We have defined a cost function to be able to determine a best fit (according to this cost function).</li>
<li>We know our chosen cost function is convex (we’ll prove it in a minute) and thus has a single minimal value.</li>
<li>We can determine where this single minimal value is by differentiating our cost function and determining where the derivative is 0.</li>
</ol>
<h3 id="getting-the-derivative-of-our-cost-function">Getting the derivative of our cost function</h3>
<p>The derivative of a function is always taken with respect to the variables of that function. If we take a look at our cost function:</p>
<p><div class="math">$E(w) = \frac{1}{2} \sum_{j=1}^{M} e_j^2$</div></p>
<p>In this:</p>
<ul>
<li>M is the number of samples</li>
<li>j is the index of the sample</li>
</ul>
<p>and with:</p>
<p><div class="math">$\begin{aligned} e &= d-\mathbf{w} \cdot \mathbf{x}\\ \end{aligned}$</div></p>
<p>we have</p>
<p><div class="math">$E(w) = \frac{1}{2} \sum_{j=1}^{M} (d-\mathbf{w} \cdot \mathbf{x})_j^2$</div></p>
<p>in which</p>
<ul>
<li><span class="math">\(d\)</span> is the desired outcome</li>
<li><span class="math">\(w\)</span> is the weights of the perceptron</li>
<li><span class="math">\(x\)</span> is the input (the features) to the perceptron.</li>
</ul>
<p>You might be tempted to think that we need to differentiate with respect to <span class="math">\(x\)</span>, but that is wrong. We want to find, given some data pairs (desired outcome, observed features), thus <span class="math">\((d, x)\)</span>, for which values <span class="math">\(w\)</span> the total error is minimal. This means <span class="math">\(w\)</span> is our variable, our unknown, and <span class="math">\(d\)</span> and <span class="math">\(x\)</span> are values which are given and which we plug into the error function. Thus, we differentiate with respect to <span class="math">\(w\)</span>.</p>
<p>But <span class="math">\(w\)</span> is a vector which means the error function is a multiple variable function. So we will need to take the partial derivatives with respect to each <span class="math">\(w_i\)</span> in our vector <span class="math">\(w\)</span>.</p>
<p>There are a few rules which govern the calculation of the derivative of a function, like the power rule, multiplication by a constant, the sum and difference rules, the chain rule, etc… We will not discuss all the rules, but only those we need for our cost function.</p>
<p>If you’re interested, you can have a look in the references section for some other rules.</p>
<p>If we take our cost function, we can see we will need the power rule for the quadratic term, the constant rule for the division by <span class="math">\(2\)</span>, the sum and difference rules for the <span class="math">\(\sum\)</span> terms and for the calculation of the error itself (<span class="math">\(e = d-\mathbf{w} \cdot \mathbf{x}\)</span>), and finally the chain rule for putting them all together.</p>
<p>Let us see what these rules tell and apply them to our cost function:</p>
<p><em>Pay attention to the indices, because things are going to get really messy</em></p>
<p>The <strong>multiplication by constant rule</strong>:</p>
<p><div class="math">$f(x) = n g(x)$</div><br>
Then it can be proven that:</p>
<p><div class="math">$\begin{aligned} \frac{df(x)}{dx} &= \frac{d(ng(x))}{dx} \\ &= n\frac{dg(x)}{dx} \end{aligned}$</div></p>
<p>Applied to our cost function:</p>
<p><div class="math">$E(w) = \frac{1}{2} \sum_{j=1}^{M} {(d-\mathbf{w} \cdot \mathbf{x})}_j^2$</div></p>
<p>If we take the partial derivative with respect to the dimension <span class="math">\(k\)</span>:</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{\partial w_k} &= \frac{\partial {\frac{1}{2} \sum_{j=1}^{M} {(d-\mathbf{w} \cdot \mathbf{x})}_j^2}}{\partial w_k} \\ &= \frac{1}{2} \frac{\partial {\sum_{j=1}^{M} {(d-\mathbf{w} \cdot \mathbf{x})}_j^2}}{\partial w_k} \end{aligned}$</div></p>
<p>Next is <strong>the sum rule</strong>:</p>
<p><div class="math">$f(x) = g_1(x) + g_2(x)$</div><br>
Then it can be proven that:</p>
<p><div class="math">$\begin{aligned} \frac{df(x)}{dx} &= \frac{d( g_1(x) + g_2(x))}{dx} \\ &= \frac{dg_1(x)}{dx} + \frac{dg_2(x)}{dx} \end{aligned}$</div></p>
<p>Further application to our cost function for the <span class="math">\(\sum_{j=1}^{M}\)</span> term :</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{\partial w_k} &= \frac{1}{2} \frac{\partial {\sum_{j=1}^{M} {(d-\mathbf{w} \cdot \mathbf{x})}_j^2}}{\partial w_k} \\ &= \frac{1}{2} \sum_{j=1}^{M} \frac{\partial { ({(d-\mathbf{w} \cdot \mathbf{x})}_j^2})}{\partial w_k} \end{aligned}$</div></p>
<p>Next is <strong>the chain rule</strong> which handles <em>functions whose arguments are also functions</em><br>
Let’s say we have two functions:</p>
<p><div class="math">$\begin{aligned} y &= g(z)\\ z &= h(x) \end{aligned}$</div></p>
<p>Then we can combine them into a single function as :</p>
<p><div class="math">$\begin{aligned} y &= g(h(x))\\ &= f(z) \end{aligned}$</div></p>
<p>Then it can be proven that:</p>
<p><div class="math">$\begin{aligned} \frac{df(x)}{dx} &= \frac{d(g(h(x)))}{dx} \\ &= \frac{dg(h(x))}{dx} \frac{dh(x)}{dx} \end{aligned}$</div></p>
<p>And <strong>the power rule</strong>:</p>
<p><div class="math">$f(x) = x^n$</div></p>
<p>Then it can be proven that:</p>
<p><div class="math">$\begin{aligned} \frac{df(x)}{dx} &= \frac{d(x^n)}{dx} \\ &= nx^{(n-1)} \end{aligned}$</div></p>
<p>If we combine the power rule applied to a function and the chain rule we get</p>
<p><div class="math">$f(x) = g(x)^n$</div></p>
<p>Then:</p>
<p><div class="math">$\begin{aligned} \frac{df(x)}{dx} &= \frac{d(g(x)^n)}{dx} \\ &= ng(x)^{(n-1)}\frac{dg(x)}{dx} \end{aligned}$</div></p>
<p>Further application to our cost function for the <span class="math">\(\frac{\partial { ({(d-\mathbf{w} \cdot \mathbf{x})}^2})}{\partial w_i}\)</span> term :</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{\partial w_k} &= \frac{1}{2} \sum_{j=1}^{M} \frac{\partial { ({(d-\mathbf{w} \cdot \mathbf{x})}_j^2})}{\partial w_k} \\ &= \frac{1}{2} \sum_{j=1}^{M} 2 (d-\mathbf{w} \cdot \mathbf{x})_j \frac{\partial { {(d-\mathbf{w} \cdot \mathbf{x})_j}}}{\partial w_k} \\ &= \sum_{j=1}^{M} (d-\mathbf{w} \cdot \mathbf{x})_j \frac{\partial { {(d-\mathbf{w} \cdot \mathbf{x})_j}}}{\partial w_k} \\ \end{aligned}$</div></p>
<p>Now you also know why we divided by <span class="math">\(2\)</span>: it makes the coefficient of the square disappear after differentiation.</p>
<p>Before we continue, remember that the term <span class="math">\((d-\mathbf{w} \cdot \mathbf{x})_j\)</span> is the error of sample <span class="math">\(j\)</span>:</p>
<p><div class="math">$(d-\mathbf{w} \cdot \mathbf{x})_j = e_j$</div></p>
<p>And <span class="math">\(\mathbf{w} \cdot \mathbf{x}\)</span> is the dot product and thus also a sumation:</p>
<p><div class="math">$\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{N} w_ix_i$</div></p>
<p>In this:</p>
<ul>
<li><span class="math">\(N\)</span> is the dimension of our feature vector.</li>
</ul>
<p>And thus, the partial derivative of the error function (which is a function of the weights) with respect to the dimension <span class="math">\(k\)</span> is:</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{\partial w_k} &= \sum_{j=1}^{M} e_j \frac{\partial { {(d-\sum_{i=1}^{N} w_ix_i)_j}}}{\partial w_k} \\ \end{aligned}$</div></p>
<p>In this:</p>
<ul>
<li><span class="math">\(k\)</span> is the dimension in which we take the partial derivative</li>
<li><span class="math">\(M\)</span> is the number of samples</li>
<li><span class="math">\(j\)</span> is the index of the sample</li>
<li><span class="math">\(i\)</span> is the index of the dimension of a sample</li>
</ul>
<p>In words:</p>
<ul>
<li>The partial derivative of the error function with respect to the dimension <span class="math">\(k\)</span> is</li>
<li>the sum over all samples <span class="math">\(j\)</span> of
<ul>
<li>the error of the sample</li>
<li>times the partial derivative of the error with respect to dimension <span class="math">\(k\)</span> of sample <span class="math">\(j\)</span></li>
</ul>
</li>
</ul>
<p>It will be clear that the derivative of a constant is zero: a constant does not change.</p>
<p>In our error function <span class="math">\(d\)</span> is a constant, so its derivative is <span class="math">\(0\)</span>. And because we take the partial derivative with respect to <span class="math">\(w_k\)</span>, each <span class="math">\(w_i\)</span> with <span class="math">\(i\neq k\)</span> is also a constant. Applying this knowledge, making use of the sum rule and expanding the summation resulting from the dot product, we get:</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{\partial w_k} &= \sum_{j=1}^{M} e_j \frac{\partial { {(d-\sum_{i=1}^{N} w_ix_i)_j}}}{\partial w_k} \\ &= - \sum_{j=1}^{M} e_j \frac{\partial {({\sum_{i=1}^{N} w_ix_i)_j}}}{\partial w_k} \\ &= - \sum_{j=1}^{M} e_j (\frac{\partial {(w_1x_1 + ... + w_ix_i + ... + w_nx_n)_j}}{\partial w_k} )\\ &= -( e_1 (\frac{\partial {(w_1x_1 + ... + w_ix_i + ... + w_nx_n)_1}}{\partial w_k}) + ... + e_j (\frac{\partial {(w_1x_1 + ... + w_ix_i + ... + w_nx_n)_j}}{\partial w_k}) + ... + e_m (\frac{\partial {(w_1x_1 + ... + w_ix_i + ... + w_nx_n)_m}}{\partial w_k}))\\ \end{aligned}$</div></p>
<p>Of the above terms, only the ones with <span class="math">\(i=k\)</span> remain: for the others the term <span class="math">\(w_ix_i\)</span> is constant with respect to <span class="math">\(w_k\)</span> and its derivative is thus <span class="math">\(0\)</span>.</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{\partial w_k} &= -( e_1 {x_k}_1 + .. + e_j {x_k}_j + ... + e_m {x_k}_m)\\ &= - \sum_{j=1}^M {e_j {x_k}_j} \end{aligned}$</div></p>
<p>Thus, it is the sum over all samples of the error of each sample times its feature value of the k-th dimension.</p>
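<p>This result is easy to sanity-check numerically: the sketch below (same made-up data as before) compares the formula with a finite-difference approximation of the partial derivatives.</p>
<pre><code>import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
d = np.array([1.0, -1.0, 1.0])
w = np.array([0.1, 0.2])

def cost(w):
    errors = d - X @ w
    return 0.5 * np.sum(errors ** 2)

# the analytic gradient: dE/dw_k = - sum over j of e_j times x_k of sample j
errors = d - X @ w
grad_analytic = -X.T @ errors

# a numerical gradient by central differences
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * np.eye(2)[k]) - cost(w - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])
print(np.allclose(grad_analytic, grad_numeric))   # True
</code></pre>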
<h3 id="prooving-convexity">Prooving convexity</h3>
<p>Remember from our discussion on the second partial derivative test that a function is convex if the Hessian matrix of the function is positive semidefinite. Also remember that the Hessian matrix was formed by arranging all the second partial derivatives of the function in a certain order in a matrix.</p>
<p>So, to prove the positive semidefiniteness of the Hessian, we’ll need the second partial derivatives.</p>
<p>Above, we calculated the first partial derivatives:</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{\partial w_k} &= - \sum_{j=1}^M {e_j {x_k}_j} \\ &= - \sum_{j=1}^M {(d-\mathbf{w} \cdot \mathbf{x})_j {x_k}_j} \end{aligned}$</div></p>
<p>From this we can calculate the second order partial derivatives:</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{{\partial w_k}{\partial w_l}} &= \frac{\partial{(- \sum_{j=1}^M {{(d-\mathbf{w} \cdot \mathbf{x})_j x_k}_j}})}{\partial w_l} \\ &= - \sum_{j=1}^M {{x_k}_j \frac{\partial{(d-\mathbf{w} \cdot \mathbf{x})_j }}{\partial w_l} } \end{aligned}$</div></p>
<p>Again, the term <span class="math">\(\mathbf{w} \cdot \mathbf{x}\)</span> is a dot product and thus also a sum:</p>
<p><div class="math">$\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{N} w_ix_i$</div></p>
<p>Also, <span class="math">\(d\)</span> is a constant: it’s derivative is <span class="math">\(0\)</span>, and thus:</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{{\partial w_k}{\partial w_l}} &= - \sum_{j=1}^M {{x_k}_j \frac{\partial{(-\sum_{i=1}^{N} w_ix_i)_j }}{\partial w_l} } \\ &= \sum_{j=1}^M {{x_k}_j \frac{\partial{(w_1x_1 + ... + w_ix_i + ... + w_Nx_N)_j }}{\partial w_l} } \end{aligned}$</div></p>
<p>As before, of the above only the <span class="math">\(i = l\)</span> terms remain as the others are constant with respect to the partial derivative. Thus:</p>
<p><div class="math">$\begin{aligned} \frac{\partial E}{{\partial w_k}{\partial w_l}} &= - \sum_{j=1}^M {{x_k}_j {x_l}_j } \\ &= \sum_{j=1}^M {({x_k} {x_l})_j} \end{aligned}$</div></p>
<p>Finally, the Hessian matrix is:</p>
<p><div class="math">$\begin{aligned} H &= \begin{bmatrix} \sum_{j=1}^M {({x_1} {x_1})_j} & ... & \sum_{j=1}^M {({x_1} {x_i})_j} & ... & \sum_{j=1}^M {({x_1} {x_N})_j} \\ ... & ... & ... & ... & ... \\ \sum_{j=1}^M {({x_i} {x_1})_j} & ... & \sum_{j=1}^M {({x_i} {x_i})_j} & ... & \sum_{j=1}^M {({x_i} {x_N})_j} \\ ... & ... & ... & ... & ... \\ \sum_{j=1}^M {({x_N} {x_1})_j} & ... & \sum_{j=1}^M {({x_N} {x_i})_j} & ... & \sum_{j=1}^M {({x_N} {x_N})_j} \\ \end{bmatrix} \\ &= \sum_{j=1}^M( \begin{bmatrix} {x_1} {x_1} & ... & {x_1} {x_i} & ... & {x_1} {x_N} \\ ... & ... & ... & ... & ... \\ {x_i} {x_1} & ... & {x_i} {x_i} & ... & {x_i} {x_N} \\ ... & ... & ... & ... & ... \\ {x_N} {x_1} & ... & {x_N} {x_i} & ... & {x_N} {x_N} \\ \end{bmatrix}) \end{aligned} $</div></p>
<p>So, we have <span class="math">\(M\)</span> times the matrix:</p>
<p><div class="math">$\begin{bmatrix} {x_1} {x_1} & ... & {x_1} {x_i} & ... & {x_1} {x_N} \\ ... & ... & ... & ... & ... \\ {x_i} {x_1} & ... & {x_i} {x_i} & ... & {x_i} {x_N} \\ ... & ... & ... & ... & ... \\ {x_N} {x_1} & ... & {x_N} {x_i} & ... & {x_N} {x_N} \\ \end{bmatrix} $</div></p>
<p>We can split this matrix as being the multiplication of the following column and row matrices:</p>
<p><div class="math">$\begin{bmatrix} {x_1} \\ ... \\ {x_i} \\ ... \\ {x_N} \\ \end{bmatrix} \begin{bmatrix} {x_1} & ... & {x_i} & ... & {x_N} \end{bmatrix} $</div></p>
<p>Now, recap the definition of positive semi-definiteness:</p>
<p><div class="math">$A^\top H A \geq 0$</div></p>
<p>In this, <span class="math">\(A\)</span> is a column matrix and thus we can rewrite this (by substituting with the abve expression for the Hessian):</p>
<p><div class="math">$\begin{bmatrix} {a_1} & ... & {a_i} & ... & {a_N} \end{bmatrix} \begin{bmatrix} {x_1} \\ ... \\ {x_i} \\ ... \\ {x_N} \\ \end{bmatrix} \begin{bmatrix} {x_1} & ... & {x_i} & ... & {x_N} \end{bmatrix} \begin{bmatrix} {a_1} \\ ... \\ {a_i} \\ ... \\ {a_N} \\ \end{bmatrix} \gt 0 $</div></p>
<p>Because of the associativity of the matrix multiplication we can write:</p>
<p><div class="math">$\begin{bmatrix} {a_1}{x_1} + ... + {a_i}{x_i} + ... + {a_N}{x_N} \end{bmatrix} \begin{bmatrix} {x_1}{a_1} + ... + {x_i}{a_i} + ... + {x_N}{a_N} \end{bmatrix} $</div></p>
<p>Basically, this is the multiplication of two scalars. And because of the commutativity of scalar multiplication:</p>
<p><div class="math">$\begin{bmatrix} {a_1}{x_1} + ... + {a_i}{x_i} + ... + {a_N}{x_N} \end{bmatrix} \begin{bmatrix} {a_1}{x_1} + ... + {a_i}{x_i} + ... + {a_N}{x_N} \end{bmatrix} = \begin{bmatrix} {a_1}{x_1} + ... + {a_i}{x_i} + ... + {a_N}{x_N} \end{bmatrix}^2 $</div></p>
<p>So, we have the square of a scalar, which is never negative. And because the full Hessian is a sum over the samples of such terms, the result must be larger than or equal to <span class="math">\(0\)</span>: the Hessian is positive semi-definite, and thus our cost function is convex.</p>
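<p>The whole argument can be mirrored in a few lines of code (NumPy again, with the same illustrative data): the Hessian of the cost function is the sum over the samples of the outer products <span class="math">\(x x^\top\)</span>, and every quadratic form with it comes out non-negative.</p>
<pre><code>import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])   # M = 3 samples, N = 2 features

# Hessian of the cost function: sum over the samples of the outer product x x^T
H = sum(np.outer(x, x) for x in X)                   # this equals X.T @ X

a = np.array([-0.7, 0.3])                            # any column matrix A, here picked arbitrarily
quad_form = a @ H @ a                                # A^T H A
print(quad_form >= 0)                                # True: the Hessian is positive semidefinite
</code></pre>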
<h3 id="solving-the-problem">Solving the problem</h3>
<p>So, now we know the error function is convex, and thus has a single global minimum. If we can find where the derivative is zero, we have found the optimal solution for our perceptron.</p>
<p>There are two ways to search the minimum of the cost function.</p>
<p>A first solution is to solve for the derivative being zero: we calculate the derivative, set it to zero and then solve this equation. This gives an exact solution. We will not discuss this solution any further here: it can only be used for our very simple perceptron and does not generalize to multi-perceptron neural networks. And even for a single perceptron, if we have a lot of properties defining our two target classes (thus <span class="math">\(N\)</span> is large), the complexity of the calculations makes it computationally not a viable solution.</p>
<p>A second solution is called gradient descent: we calculate the gradient at some (random) starting point to find the direction in which the function descends towards its minimum. Then we take a small step in that direction and start over again. We do this until the change of the outcome of the function is below a pre-defined threshold.</p>
<h2 id="gradient-descent">Gradient Descent</h2>
<h3 id="gradient-descend-in-words-and-pictures">Gradient descend in words (and pictures)</h3>
<p>The key idea behind gradient descent is that, starting from some random point on the curve of our cost function, we would like to move in the direction of the minimum until we reach this minimum.</p>
<p>Graphically, we would like to do the following:</p>
<p><img src="https://sergedesmedt.github.io/MathOfNeuralNetworks/Resources/GradientDescentActual.PNG" alt="Gradient Descend (actual)"></p>
<p>We would like to “step” to the minimum by descending on the curve. We can do this by moving in the direction of the slope if it is descending, or moving in the opposite direction of the slope if it is ascending.</p>
<p>So, in gradient descent, the gradient is used to determine the direction into which we want to move.</p>
<h3 id="gradient-descend-in-formulas">Gradient descend in formulas</h3>
<p>Let’s say we have a function of a single variable <span class="math">\(f(x)\)</span>.<br>
If we are on the descending side of the function we want to move (take a step) in the direction of increasing x. If we represent our step by the symbol <span class="math">\(\Delta x\)</span>, then:</p>
<p><div class="math">$x_t = x + \Delta x > x$</div></p>
<p>If we are on the ascending side we want to move in the direction of decreasing x:</p>
<p><div class="math">$x_t = x - \Delta x < x$</div></p>
<p>If we now choose <span class="math">\(\Delta x\)</span> to be proportional to the slope (remember the slope is equal to the derivative <span class="math">\(\frac{df(x)}{dx}\)</span> ), we have</p>
<p><div class="math">$\Delta x = \lambda \frac{df(x)}{dx} $</div></p>
<p>With the slope being negative on the descending side we get (remember that a negative number multiplied by a negative number is positive):</p>
<p><div class="math">$x_t = x - \lambda \frac{df(x)}{dx} > x$</div></p>
<p>With the slope being positive on the ascending side we get:</p>
<p><div class="math">$x_t = x - \lambda \frac{df(x)}{dx} < x$</div></p>
<p>In general we can take steps proportional to the negative value of the slope. Or, we take steps <em>against</em> the direction of the slope.</p>
<p>A similar line of thinking can be followed for functions of multiple variables, but then we have a directional derivative equal to the slope in a certain direction. As we have seen, the directional derivative has its largest value in the direction of the gradient, so this is the direction we step against, just as we stepped against the slope above. Thus we get:</p>
<p><div class="math">$x_t = x - \lambda \nabla{f}$</div></p>
<p>in which <span class="math">\(x_t\)</span>, <span class="math">\(x\)</span> and <span class="math">\(\nabla{f}\)</span> are vectors</p>
<h3 id="but-it-can-go-wrong">But it can go wrong</h3>
<p>It can be proven that, under the correct preconditions, the gradient descent algorithm always converges to a (local) minimum. However, if those conditions are not met, then convergence is not guaranteed.</p>
<p>One such condition has to do with the size of <span class="math">\(\lambda\)</span>. If <span class="math">\(\lambda\)</span> is too big then overshoot can happen and the algorithm can oscillate around the optimal solution.</p>
<p>Try it yourself:<br>
<a href="https://sergedesmedt.github.io/MathOfNeuralNetworks/GradientDescend.html">Gradient Descend</a></p>
<h2 id="the-adaline-learning-rule">The ADALINE learning rule</h2>
<p>The ADALINE learning rule can be written as:</p>
<p><div class="math">$\mathbf{w}_{i+1} = \mathbf{w}_{i} + \lambda \sum_{j=1}^M (d-o)_j\mathbf{x_j}$</div></p>
<p>in which <span class="math">\((d - o)\)</span> is the error <span class="math">\(e\)</span>, so</p>
<p><div class="math">$\mathbf{w}_{i+1} = \mathbf{w}_{i} + \lambda \sum_{j=1}^M e_j \mathbf{x_j}$</div></p>
<p>Remember that <span class="math">\(\mathbf{w}\)</span> is a vector representing the weights applied to each feature of our featurevector <span class="math">\(\mathbf{x}\)</span>, so what we really have here is (writing vectors in column notation):</p>
<p><div class="math">$\begin{bmatrix} {{w_1}_{i+1}} \\ ... \\ {{w_i}_{i+1}} \\ ... \\ {{w_N}_{i+1}} \\ \end{bmatrix} = \begin{bmatrix} {{w_1}_{i}} \\ ... \\ {{w_i}_{i}} \\ ... \\ {{w_N}_{i}} \\ \end{bmatrix} + \lambda \begin{bmatrix} \sum_{j=1}^M e_j {{x_1}_j} \\ ... \\ \sum_{j=1}^M e_j {{x_i}_j} \\ ... \\ \sum_{j=1}^M e_j {{x_N}_j} \\ \end{bmatrix} $</div></p>
<p>Let us look back at the gradient of the error function:</p>
<p><div class="math">$\frac{\partial E}{\partial w_k} = \sum_{j=1}^M {{e_j x_k}_j}$</div></p>
<p>Also, we found following gradient descent rule to find the minimum of a function <span class="math">\(f(w)\)</span>:</p>
<p><div class="math">$\mathbf{w_{i+1}} = \mathbf{w_i} - \lambda \nabla{f}$</div></p>
<p>In this, <span class="math">\(\nabla{f}\)</span> stands for the vector of all partial derivatives of <span class="math">\(f\)</span>:</p>
<p><div class="math">$\nabla{f} = \begin{bmatrix} \sum_{j=1}^M {{e_j x_1}_j} \\ ... \\ \sum_{j=1}^M {{e_j x_i}_j} \\ ... \\ \sum_{j=1}^M {{e_j x_N}_j} \\ \end{bmatrix}$</div></p>
<p>Substituting the gradient of the error function in the gradient descent rule (note how the minus sign in front of the gradient cancels against the minus sign of the update rule) and taking into account that <span class="math">\(\mathbf{w_{i+1}}\)</span> and <span class="math">\(\mathbf{w_i}\)</span> are actually vectors gives us:</p>
<p><div class="math">$\begin{bmatrix} {{w_1}_{i+1}} \\ ... \\ {{w_i}_{i+1}} \\ ... \\ {{w_N}_{i+1}} \\ \end{bmatrix} = \begin{bmatrix} {{w_1}_{i}} \\ ... \\ {{w_i}_{i}} \\ ... \\ {{w_N}_{i}} \\ \end{bmatrix} + \lambda \begin{bmatrix} \sum_{j=1}^M e_j {{x_1}_j} \\ ... \\ \sum_{j=1}^M e_j {{x_i}_j} \\ ... \\ \sum_{j=1}^M e_j {{x_N}_j} \\ \end{bmatrix} $</div></p>
<p>Tadaaa: we get our learning rule. So, the ADALINE learning rule is gradient descent on the least squares error function!</p>
<p>Notice how we use the sum of the errors of all the samples. This is in contrast with the learning rule for the Rosenblatt perceptron, which used the error of a single sample. We call this batch learning.</p>
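<p>Putting the pieces together, a compact sketch of batch ADALINE training could look as follows; the learning rate, the number of epochs and the toy data are all assumptions for the example, not values from the article (and the bias is left out for brevity):</p>
<pre><code>import numpy as np

def train_adaline(X, d, lam=0.01, epochs=100):
    """Batch gradient descent on the squared-error cost of an ADALINE unit."""
    w = np.zeros(X.shape[1])            # start with all weights at zero
    for _ in range(epochs):
        errors = d - X @ w              # e_j = d_j - w . x_j for every sample
        w = w + lam * (X.T @ errors)    # w_new = w + lambda * sum over j of e_j x_j
    return w

# made-up, linearly separable toy data
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
print(train_adaline(X, d))
</code></pre>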
<h2 id="wrap-up">Wrap up</h2>
<h3 id="basic-formula-of-the-adaline-perceptron">Basic formula of the ADALINE Perceptron</h3>
<p>The basic formula classifies the features by weighting them into two separate classes.<br>
We have seen that the way this is done is by comparing the dot product of the feature vector <span class="math">\(\mathbf{x}\)</span> and the weight vector <span class="math">\(\mathbf{w}\)</span> with some fixed value <span class="math">\(b\)</span>. If the dot product is larger than this fixed value, we classify the sample into one class by assigning it the label 1, otherwise we put it into the other class by assigning it the label -1.</p>
<p><div class="math">$f(x) = \begin{cases} 1 & \text{if } w_1x_1 + w_2x_2 + ... + w_ix_i + ... + w_nx_n > b\\ -1 & \text{otherwise} \end{cases} $</div></p>
<h3 id="behaviour-of-the-adaline-perceptron">Behaviour of the ADALINE Perceptron</h3>
<p>Because the formula of the perceptron is basically a hyperplane, we can only classify things into two classes which are linearly separable: a first class with things above the hyperplane and a second class with things below the hyperplane.</p>
<p>However, the difference with the Rosenblatt perceptron is that during learning we can get an <em>optimal</em> solution although our training samples may not be linearly separable, something that is not possible when training a Rosenblatt perceptron.</p>
<h3 id="formalising-some-things-a-few-definitions">Formalising some things: a few definitions</h3>
<p>We’ve again covered a lot of ground here, but without using a lot of the lingo surrounding perceptrons, neural networks and machine learning in general. There was already enough lingo with the mathematics that I didn’t want to bother you with even more definitions.</p>
<p>In the article on the Rosenblatt Perceptron we already saw a few definitions, and here are some more:</p>
<p><strong>Objective function</strong><br>
An <em>objective function</em> is a function used to evaluate the performance of a solution to an optimization problem. In the context of our perceptron, it is the function used to evaluate a candidate weight vector.</p>
<p><strong>Loss function</strong><br>
Where an objective function makes no assumption on how we optimize, whether we use maximization, minimization or anything else, a <em>loss function</em> makes the assumption that we minimize our objective function. Because we want to minimize, a good loss function generally gives small values for good results and large values for bad results.</p>
<p><strong>Cost function</strong><br>
The name <em>cost function</em> is in general synonymous with <em>loss function</em>.</p>
<p><strong>Error function</strong><br>
The <em>error function</em> can be seen as the cost function for a single training example.</p>
<p><strong>L1 and L2 Loss</strong></p>
<p>Remember from our discussion on the squared errors and the possibility of just using the absolute value of the difference? The absolute value of the error is called the <em>L1 loss</em> and the square value is called the <em>L2 loss</em></p>
<p><strong>Batch learning vs Online learning</strong></p>
<p>If we use all the samples at once to find the optimal solution, we speak of <em>batch learning</em>: we use a batch of samples to optimize our cost function. This is in contrast with the Rosenblatt perceptron for which we recalculated the weights after each new sample. In the latter case we speak of <em>online learning</em>.</p>
<h2 id="what-is-wrong-with-the-adaline-perceptron">What is wrong with the ADALINE perceptron?</h2>
<p>Although the ADALINE perceptron is an improvement on the Rosenblatt perceptron with respect to its learning procedure, with respect to its classification capabilities we did not gain anything: we still can only classify two classes of <em>things</em>.</p>
<p>In order to be able to classify more classes we need more perceptrons linked together: neural networks.</p>
<p>Which is the subject of the next article…</p>
<h2 id="references">References:</h2>
<h3 id="sum-of-squared-errors-1">Sum of Squared Errors</h3>
<p>About the various <em>calculations</em> of squares of <em>somethings</em>:</p>
<ul>
<li><a href="https://stats.stackexchange.com/questions/146092/mean-squared-error-versus-least-squared-error-which-one-to-compare-datasets">Mean squared error versus Least squared error, which one to compare datasets?</a></li>
<li><a href="https://datascience.stackexchange.com/questions/11467/what-is-the-difference-between-residual-sum-of-squares-and-ordinary-least-square">What is the difference between residual sum of squares and ordinary least squares?</a></li>
</ul>
<p>About why to use the Mean Squared Error:</p>
<ul>
<li><a href="https://stats.stackexchange.com/questions/155580/cost-function-in-ols-linear-regression">Cost function in OLS linear regression</a></li>
</ul>
<p>If you want a deeper dive into why we use the squared sum of errors instead of the summation of the absolute values, the following are good reads:</p>
<ul>
<li><a href="https://www.benkuhn.net/squared">Why squared error?</a></li>
<li><a href="https://www.quora.com/Is-there-a-mathematical-reason-to-use-sum-of-squared-errors-rather-than-the-absolute-value-of-the-errors-in-regression">Is there a mathematical reason to use sum of squared errors rather than the absolute value of the errors in regression?</a></li>
</ul>
<p>On convex optimization (which is a whole subfield on its own)</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Convex_optimization">Convex optimization</a></li>
<li><a href="https://en.wikipedia.org/wiki/Convex_function">Convex function</a></li>
<li><a href="https://en.wikipedia.org/wiki/Mathematical_optimization">Mathematical optimization</a></li>
</ul>
<h3 id="maxima-minima-and-the-slope">Maxima, Minima and the Slope</h3>
<p>Wikipedia on maxima and minima:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Maxima_and_minima">Maxima and minima</a></li>
</ul>
<p>Wikipedia on the slope:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Slope">https://en.wikipedia.org/wiki/Slope</a></li>
</ul>
<p>Another deep dive into the slope of a function</p>
<ul>
<li><a href="https://www.whitman.edu/mathematics/calculus_online/section02.01.html">The slope of a function</a></li>
</ul>
<h3 id="limit-continuity-and-differentiability">Limit, Continuity and Differentiability</h3>
<p>About Limits:</p>
<ul>
<li><a href="https://math.libretexts.org/Bookshelves/Calculus/Book%3A_Calculus_(Apex)/1%3A_Limits/1.2%3A_Epsilon-Delta_Definition_of_a_Limit">Epsilon-Delta Definition of a Limit</a></li>
<li><a href="https://brilliant.org/wiki/epsilon-delta-definition-of-a-limit/">Another take on the Epsilon-Delta Definition</a></li>
</ul>
<p>Proving the limit using the epsilon-delta definition (which is not the same as calculating the limit using the epsilon-delta definition):</p>
<ul>
<li><a href="https://socratic.org/questions/how-do-you-find-the-limit-using-the-epsilon-delta-definition">how do you find the limit using the epsilon-delta definition</a></li>
</ul>
<p>What does it mean if we say “becomes infinite”?</p>
<ul>
<li><a href="https://www.themathpage.com/aCalc/infinity.htm">infinity</a></li>
</ul>
<p>A mathematical reasoning as to why, if the denominator is zero, then the numerator must be zero to have a finite result:</p>
<ul>
<li><a href="https://math.stackexchange.com/questions/2585793/limit-of-a-quotient-when-the-denominator-tends-to-zero/2585829">Limit of a quotient when the denominator tends to zero</a></li>
<li><a href="https://www.quora.com/Is-zero-divided-by-zero-0-0-equal-to-zero-undefined-or-1">Is zero divided by zero (0/0) equal to zero, undefined or 1?</a></li>
</ul>
<p>About Differentiability and the Derivative</p>
<ul>
<li><a href="https://math.dartmouth.edu/opencalc2/cole/lecture21.pdf">Differentiability</a></li>
<li><a href="https://www.geeksforgeeks.org/mathematics-limits-continuity-differentiability"># Mathematics | Limits, Continuity and Differentiability</a></li>
</ul>
<p>The following contains a mathematically rigorous proof of why continuity is needed for differentiability:</p>
<ul>
<li><a href="https://www.quora.com/Why-is-a-differentiable-function-continuous">Why-is-a-differentiable-function-continuous</a></li>
<li><a href="http://www.math.jhu.edu/~brown/courses/f17/Documents/DiffImpliesCont.pdf">DIFFERENTIABILITY IMPLIES CONTINUITY</a></li>
</ul>
<p>Some clarification on the proof:</p>
<ul>
<li><a href="https://math.stackexchange.com/questions/1314630/differentiability-implies-continuity-a-question-about-the-proof">Differentiability implies continuity - A question about the proof</a></li>
</ul>
<p>A different take on the proof using the ϵ−δ definition:</p>
<ul>
<li><a href="https://math.stackexchange.com/questions/269666/how-to-prove-differentiability-implies-continuity-with-epsilon-delta-definit">How to prove differentiability implies continuity with ϵ−δ definition?</a></li>
</ul>
<p>About partial derivatives:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Partial_derivative">Partial derivative</a></li>
<li><a href="https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives">Introduction to partial derivatives</a></li>
</ul>
<p>More on Directional Derivatives:</p>
<ul>
<li><a href="http://mathworld.wolfram.com/DirectionalDerivative.html">Directional Derivative</a></li>
<li><a href="https://www.quora.com/What-is-the-second-directional-derivative">What is the second directional derivative</a></li>
</ul>
<p>About the gradient</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Gradient">Gradient</a></li>
<li><a href="https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/the-gradient">The gradient</a></li>
<li><a href="https://betterexplained.com/articles/vector-calculus-understanding-the-gradient/">Vector Calculus: Understanding the Gradient</a></li>
</ul>
<p>Following is a very understandable proof of the relation between the Directional Derivative and the Gradient: <a href="http://tutorial.math.lamar.edu/Classes/CalcIII/DirectionalDeriv.aspx">Directional Derivatives</a><br>
It does however make use of the so called chain rule which I have not handled yet, but you can read about it here: <a href="https://en.wikipedia.org/wiki/Chain_rule">Chain rule</a><br>
A similar proof but less detailed can be found in <a href="https://mathinsight.org/directional_derivative_gradient_derivation">Derivation of the directional derivative and the gradient</a> and <a href="https://users.math.msu.edu/users/gnagy/teaching/05-fall/Math20C/w6-C.pdf">Math 20C Multivariable Calculus - Lecture 16</a></p>
<p>More rules for calculating the derivative: <a href="https://www.mathsisfun.com/calculus/derivatives-rules.html">Derivatives rules</a></p>
<h3 id="matrices">Matrices</h3>
<p>Wikipedia on matrices: <a href="https://en.wikipedia.org/wiki/Matrix_(mathematics)">Matrix</a></p>
<p>Why the strange way of defining multiplication? It has to do with linear transformations:</p>
<ul>
<li><a href="https://medium.com/@ghenshaw.work/3-ways-to-understand-matrix-multiplication-fe8a007d7b26">3 Ways to Understand Matrix Multiplication</a></li>
<li><a href="https://www.quora.com/Why-is-matrix-multiplication-defined-the-way-it-is">Why is matrix multiplication defined the way it is?</a></li>
</ul>
<h3 id="critical-points-stationary-points-and-fermats-theorem">Critical Points, Stationary Points and Fermat’s Theorem</h3>
<p>More about stationary points: <a href="https://math.stackexchange.com/questions/1368188/what-is-the-difference-between-stationary-point-and-critical-point-in-calculus">what is the difference between stationary point and critical point in calculus?</a></p>
<p>If you want to dig deeper into what the Hessian matrix is:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Hessian_matrix">https://en.wikipedia.org/wiki/Hessian_matrix</a></li>
<li><a href="https://math.stackexchange.com/questions/481060/relation-bewteen-hessian-matrix-and-curvature">https://math.stackexchange.com/questions/481060/relation-bewteen-hessian-matrix-and-curvature</a></li>
<li><a href="https://math.stackexchange.com/questions/481334/how-does-hessian-matrix-describe-the-local-curvature">https://math.stackexchange.com/questions/481334/how-does-hessian-matrix-describe-the-local-curvature</a></li>
</ul>
<p>Proof of Fermat’s theorem:<br>
Two different ways of proving the statement:</p>
<ul>
<li>According to wikipedia: <a href="https://en.wikipedia.org/wiki/Fermat%27s_theorem_(stationary_points)#Proof">Fermat’s theorem on stationary points: Proof</a></li>
<li>Another (similar) proof: <a href="https://planetmath.org/proofoffermatstheoremstationarypoints">proof of Fermat’s Theorem (stationary points)</a></li>
</ul>
<p>More on the second derivative test and the higher order derivative test:</p>
<ul>
<li><a href="https://www.math.hmc.edu/calculus/tutorials/secondderiv/">Second derivative</a></li>
<li><a href="https://en.wikipedia.org/wiki/Derivative_test#Second_derivative_test_(single_variable)">Derivative test (single variable)</a></li>
<li><a href="https://www.wikihow.com/Find-Inflection-Points">Find Inflection Points</a></li>
<li><a href="https://math.stackexchange.com/questions/2591385/finding-extreme-values-where-second-derivative-is-zero">Finding extreme values where second derivative is zero</a></li>
<li><a href="https://en.wikipedia.org/wiki/Derivative_test#Higher-order_derivative_test">Derivative test (Higher order)</a></li>
<li><a href="https://calculus.subwiki.org/wiki/Higher_derivative_test">Higher derivative test</a></li>
</ul>
<p>Proving convexity of linear least squares:</p>
<ul>
<li><a href="https://math.stackexchange.com/questions/946156/proving-convexity-of-a-function-whose-hessian-is-positive-semidefinite-over-a-co">Proving convexity of a function whose Hessian is positive semidefinite over a convex set</a></li>
<li><a href="https://math.stackexchange.com/questions/483339/proof-of-convexity-of-linear-least-squares">Proof of convexity of linear least squares</a></li>
</ul>
<h3 id="gradient-descent-1">Gradient Descent</h3>
<p>An article with what I consider a correct illustration of gradient descent: <a href="https://mc.ai/a-way-to-improve-gradient-descent-stochastic-gradient-descent-with-restarts/">A way to improve gradient descent stochastic gradient descent with restarts</a></p>
<p>Another take on why subtraction is the correct operation for the update:<br>
<a href="http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent">Learning with gradient descent</a></p>
<p>It can be proven that gradient descent, under some conditions, is guaranteed to converge to the minimum. See following references:</p>
<ul>
<li><a href="https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf">Gradient Descent: Convergence Analysis</a></li>
<li><a href="https://datascience.stackexchange.com/questions/24534/does-gradient-descent-always-converge-to-an-optimum">Does gradient descent always converge to an optimum?</a></li>
<li><a href="https://cs.stackexchange.com/questions/96962/mathematical-proof-for-why-gradient-descent-algorithm-always-converges">Mathematical proof for why gradient descent algorithm always converges</a></li>
</ul>
<h3 id="adaline">ADALINE</h3>
<p>Wikipedia on ADALINE: <a href="https://en.wikipedia.org/wiki/ADALINE">ADALINE</a></p>
<p>ADALINE seen from the standpoint of signal processing: <a href="http://www.ahmetozkurt.net/MEC509/adaline%20and%20madaline.pdf">Adaline and Madaline</a></p>
</body>
</html>