2024-02-12
An optimization task is traditionally defined as the minimization of a loss function $\mathcal{L}(\beta)$ with respect to the parameters (betas) $\beta$.
Let's take the simplest of statistical models, a linear model. $$f(x) = \beta x $$
In the case of an OLS regression, the loss function $\mathcal{L}$ is $$\mathcal{L}(\beta) = \frac{1}{n}\sum_{i=1}^{n}(x_i\beta - y_i)^2$$
or in vector form (dropping the constant $\frac{1}{n}$, which doesn't change the minimizer) $$ \mathcal{L}(\beta) = (X\beta - y)^T(X\beta - y) $$
which is also commonly called the L2 (squared-error) loss.
And so our optimization problem is: minimize $\mathcal{L}$ with respect to $\beta$. To solve it, one must first find the function's derivative; vector calculus comes in handy.
expand $$\mathcal{L}(\beta) = \beta^T X^T X \beta - 2 y^T X \beta + y^T y$$
differentiate
$$\nabla \mathcal{L}(\beta) = 2 X^T X \beta - 2 X^T y$$ set the gradient to zero to find the minimum $$2 X^T X \beta - 2 X^T y = 0$$ $$ X^T X \beta = X^T y$$ $$ \beta = (X^TX)^{-1} X^T y$$ This shows that, given $X \in \mathbb{R}^{n \times m}$ and $y \in \mathbb{R}^{n}$, the loss-minimizing betas can be computed by multiplying the inverse of the Gram matrix $X^TX$ by $X^Ty$. The estimation formula has one main drawback: it fails when $X^TX$ is singular.
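A minimal NumPy sketch of this closed-form estimate, assuming a dense $X$ and solving the normal equations with `np.linalg.solve` rather than explicitly inverting $X^TX$ (the function name and toy data are just for illustration):

```python
import numpy as np

def ols_closed_form(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve the normal equations X^T X beta = X^T y for beta."""
    # Solving the linear system is cheaper and numerically safer than
    # forming (X^T X)^{-1} explicitly; it still fails if X^T X is singular.
    return np.linalg.solve(X.T @ X, X.T @ y)

# hypothetical toy data: n = 100 samples, m = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)

beta_hat = ols_closed_form(X, y)
print(beta_hat)  # should be close to true_beta
```

When $X^TX$ is singular (e.g. collinear columns), `np.linalg.solve` raises an error; a pseudoinverse (`np.linalg.pinv`) or a small ridge penalty is the usual workaround.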
A prevalent area of optimization is the training of deep neural networks. Early training was done using naive gradient descent. $$\theta_{t+1} = \theta_{t} - \eta \nabla f(\theta_{t})$$
NOTE: $\theta$ and $\beta$ will be used interchangeably.
The update rule is pretty straightforward; $\eta$ is traditionally called the learning rate: a greater learning rate results in larger update steps, and vice versa.
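As a sketch, here is the naive full-batch gradient descent loop applied to the vector-form OLS loss above (reusing the toy `X`, `y` from the previous snippet; the step count and $\eta$ are arbitrary):

```python
def grad(theta, X, y):
    """Gradient of the vector-form squared-error loss: 2 X^T (X theta - y)."""
    return 2.0 * X.T @ (X @ theta - y)

def gradient_descent(X, y, eta=1e-3, steps=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - eta * grad(theta, X, y)  # theta_{t+1} = theta_t - eta * grad
    return theta
```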
The successor of naive gradient descent was (glossing over some historical inaccuracies) Stochastic Gradient Descent, which again is frankly obvious and simple.
$$\theta_{t+1} = \theta_{t} - \eta \nabla f(\theta_{t}, \xi_{t})$$ Roughly the same form, still pretty small. The main advantage of SGD compared to GD is that it only uses a random sample of the data (size $k$ where $k < n$), which is less computationally heavy. The direct downside is the inherent greater uncertainty stemming from only calculating the gradients on a subset of the data.
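A minimal SGD variant of the same loop, assuming the `grad` helper from the sketch above; each step estimates the gradient on a random mini-batch of size $k$ (the hyperparameters here are illustrative):

```python
def sgd(X, y, eta=1e-3, steps=1000, k=16, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.choice(X.shape[0], size=k, replace=False)  # the random draw xi_t
        theta = theta - eta * grad(theta, X[idx], y[idx])
    return theta
```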
Before we get to Adam and its derivatives, which are the primary optimization algorithms used in modern training, there are a couple of intuitive extensions to SGD which can considerably improve performance.
To come up with new variants, it's helpful to form simple hypotheses about how the update could be improved.
An evident hypothesis is that if the gradient (parameter) updates keep sufficiently reducing the loss $\mathcal{L}$, keep pushing in that direction with greater force; a.k.a. Momentum (not the trading strategy).
$$ v_{t+1} = \beta v_t + (1 - \beta) \nabla f(\theta_t) $$ $$ \theta_{t+1} = \theta_t - \eta v_{t+1} $$ where $\beta$ is the momentum scale (here a coefficient, not a model parameter); a higher $\beta$ implies a stronger auto-correlation assumption and more "stubbornness".
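Sketching momentum on top of the same SGD loop (again reusing the `grad` helper; $\beta = 0.9$ is a common but arbitrary choice here):

```python
def sgd_momentum(X, y, eta=1e-3, beta=0.9, steps=1000, k=16, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)  # exponential moving average of past gradients
    for _ in range(steps):
        idx = rng.choice(X.shape[0], size=k, replace=False)
        g = grad(theta, X[idx], y[idx])
        v = beta * v + (1.0 - beta) * g   # v_{t+1} = beta v_t + (1 - beta) g
        theta = theta - eta * v           # theta_{t+1} = theta_t - eta v_{t+1}
    return theta
```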
Adam combines momentum with an adaptive, per-parameter step size: $$ \quad \theta_t = \theta_{t-1} - \eta \frac{m_t}{\sqrt{v_t} + \epsilon} $$ where $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ (the standard formulation also bias-corrects $m_t$ and $v_t$ early in training; that detail is omitted here.)
An initial observation is that there exist two gradient-derived variables, yet they serve different purposes. Both closely resemble an exponential moving average, and both affect the effective step size. Yet the former ($m_t$) tracks the raw gradients, while the latter ($v_t$) tracks the squared gradients, an uncentered estimate of gradient variance. Intuitively, one would want small steps in highly volatile terrain. $m_t$ incorporates both momentum and smoothing, as covered in the previous section. $\epsilon$ is just a small constant for greater floating-point stability (ugh, IEEE 754).
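A sketch of the Adam update exactly as written above, with hypothetical default hyperparameters and no bias correction (real implementations such as `torch.optim.Adam` do apply it):

```python
def adam(X, y, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000, k=16, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = np.zeros_like(theta)  # EMA of raw gradients (momentum / first moment)
    v = np.zeros_like(theta)  # EMA of squared gradients (second moment)
    for _ in range(steps):
        idx = rng.choice(X.shape[0], size=k, replace=False)
        g = grad(theta, X[idx], y[idx])
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        theta = theta - eta * m / (np.sqrt(v) + eps)  # smaller steps where v is large
    return theta
```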
AdamW makes a slight modification to the original algorithm, retaining the same structure but adding another constant $\lambda$ and a decoupled form of $\theta$ decay (weight decay applied directly to the parameters rather than through the gradient).
$$\quad \theta_t = \theta_{t-1} - \eta \left( \frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda \theta_{t-1} \right)$$ where $m_t$ and $v_t$ are the same as in Adam. $\lambda$, the weight decay coefficient, is generally around $\lambda \in [10^{-4}, 10^{-2}]$.
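The same sketch with AdamW's decoupled decay; the only line that changes from the `adam()` sketch above is the parameter update (the hypothetical `lam` stands for $\lambda$):

```python
def adamw(X, y, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=1e-2,
          steps=1000, k=16, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for _ in range(steps):
        idx = rng.choice(X.shape[0], size=k, replace=False)
        g = grad(theta, X[idx], y[idx])
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        # decay is applied directly to theta, not folded into the gradient
        theta = theta - eta * (m / (np.sqrt(v) + eps) + lam * theta)
    return theta
```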