So You Want Optimal Parameters

2024-02-12

An optimization task is traditionally defined as the minimization of a loss function $\mathcal{L}(\beta)$ with respect to the parameters $\beta$.

Let's take the simplest of statistical models, a linear model. $$f(x) = \beta x $$

In the case of an OLS regression, the loss function $\mathcal{L}$ is $$\mathcal{L}(\beta) = \frac{1}{n}\sum_{i=1}^{n}(x_i\beta-y_i)^2$$

or, in vector form (dropping the constant $\frac{1}{n}$, which doesn't affect the minimizer), $$ \mathcal{L}(\beta) = (X\beta-y)^T(X\beta-y) $$

This is also commonly called the L2 loss.

And so our optimization problem is: minimize $\mathcal{L}$ with respect to $\beta$. To solve it, one must first find the function's derivative. Vector calculus comes in handy.

Expand: $$\mathcal{L}(\beta) = \beta^T X^T X \beta - 2 y^T X \beta + y^T y$$

Differentiate with respect to $\beta$:

$$\nabla \mathcal{L}(\beta) = 2 X^T X \beta - 2 X^T y$$

Set the gradient to zero to find the minimum:

$$2 X^T X \beta - 2 X^T y = 0$$
$$ X^T X \beta = X^T y$$
$$ \beta = (X^TX)^{-1} X^T y$$

This shows that, given $X \in \mathbb{R}^{n \times m}$ and $y \in \mathbb{R}^{n}$, the optimal parameters can be computed by inverting the Gram matrix $X^TX$ and multiplying by $X^Ty$ (since $\mathcal{L}$ is convex, this stationary point is the global minimum). The estimation formula has one main drawback: it fails when $X^TX$ is singular.
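As a quick sanity check, here's a minimal NumPy sketch of the closed-form estimate (the synthetic data and variable names are my own, for illustration). Solving the normal equations as a linear system avoids forming an explicit inverse, and `np.linalg.lstsq` sidesteps the singular-Gram-matrix problem via the pseudo-inverse.

```python
import numpy as np

# Illustrative synthetic data: X in R^{n x m}, y in R^n
rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# beta = (X^T X)^{-1} X^T y, solved as a linear system rather than via an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# When X^T X is singular (or nearly so), lstsq still returns a minimum-norm solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```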

A prevalent area of optimization is the training of Deep Neural Networks. Initially, training was done using naive gradient descent. $$\theta_{t+1} = \theta_{t} - \eta \nabla f(\theta_{t})$$

NOTE: $\theta$ and $\beta$ will be used interchangeably for the parameters.

The update rule is pretty straightforward; $\eta$ is traditionally called the learning rate - a greater learning rate results in greater parameter updates, and vice versa.
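A minimal sketch of plain gradient descent on the OLS loss above (the data, learning rate, and step count are illustrative assumptions, not prescriptions):

```python
import numpy as np

# Same kind of synthetic data as before
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_L(theta, X, y):
    # Gradient of the (unnormalized) L2 loss: 2 X^T (X theta - y)
    return 2.0 * X.T @ (X @ theta - y)

# theta_{t+1} = theta_t - eta * grad f(theta_t)
eta = 1e-3                      # learning rate
theta = np.zeros(X.shape[1])
for _ in range(500):
    theta = theta - eta * grad_L(theta, X, y)

print(theta)  # approaches the closed-form solution
```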

The successor of naive Gradient Descent was (historical inaccuracies aside) Stochastic Gradient Descent (SGD), which, again, is frankly obvious and simple.

$$\theta_{t+1} = \theta_{t} - \eta \nabla f(\theta_{t}, \xi_{t})$$ Roughly the same form, still pretty small. The main advantage of SGD compared to GD is that it only uses a random sample of the data (size $k$, where $k < n$) at each step, making it less computationally heavy. The direct downside is the inherent extra uncertainty stemming from only computing the gradients on a subset of the data.
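A sketch of the same idea with mini-batches, assuming the `X, y` from the gradient-descent sketch above (the batch size and step count are arbitrary choices of mine):

```python
import numpy as np

def sgd(X, y, eta=1e-2, k=16, steps=2000, seed=0):
    # theta_{t+1} = theta_t - eta * grad f(theta_t, xi_t),
    # where xi_t is a random mini-batch of size k < n drawn each step.
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.choice(len(X), size=k, replace=False)
        Xb, yb = X[idx], y[idx]
        theta = theta - eta * 2.0 * Xb.T @ (Xb @ theta - yb)  # gradient on the subset only
    return theta

# e.g. sgd(X, y) with the X, y from the gradient-descent sketch above
```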

Before we get to Adam and its derivatives, which are the primary optimization algorithms used in modern training, there are a couple of intuitive extensions to SGD which can considerably improve performance.

To come up with new variants, it's helpful to

  1. Observe the underlying logic
  2. Observe what constants and variables are present in the current implementation
  3. Observe what changes from iteration to iteration

An evident hypothesis: if the gradient and parameter updates keep resulting in sufficient minimization of the loss $\mathcal{L}$, keep going in that direction with greater force; a.k.a. momentum (not the trading strategy).

$$ v_{t+1} = \beta v_t + (1 - \beta) \nabla f(\theta_t) $$ $$ \theta_{t+1} = \theta_t - \eta v_{t+1} $$ where $\beta$ is the momentum scale; a higher $\beta$ assumes stronger auto-correlation between successive gradients and results in more "stubbornness".
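A sketch of the momentum update above as a generic loop (the `grad` callable and the default values are my own assumptions):

```python
import numpy as np

def sgd_momentum(grad, theta0, eta=1e-3, beta=0.9, steps=500):
    # v_{t+1}     = beta * v_t + (1 - beta) * grad f(theta_t)
    # theta_{t+1} = theta_t - eta * v_{t+1}
    theta = theta0.astype(float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad(theta)
        theta = theta - eta * v
    return theta

# e.g. sgd_momentum(lambda t: 2.0 * X.T @ (X @ t - y), np.zeros(X.shape[1]))
```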

Adam & Eve AdamW

$$ \theta_t = \theta_{t-1} - \eta \frac{m_t}{\sqrt{v_t} + \epsilon} $$ where, with $g_t = \nabla f(\theta_{t-1})$, $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

An initial observation: there exist two gradient-derived variables! Yet they serve different purposes. Both closely resemble an exponential moving average, and both modulate the size of each update. $m_t$ tracks the raw gradients (the first moment), while $v_t$ tracks the squared gradients (the second moment, a proxy for gradient variance). Intuitively, one would want small steps in highly volatile terrain. $m_t$ incorporates both momentum and smoothing, which we rediscovered previously. $\epsilon$ is just a small constant for floating-point stability (ugh, IEEE 754).
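Here's a sketch of Adam in the same style; note that it includes the bias-correction terms $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$ from the original Adam paper, which the simplified formulas above omit. Defaults are the commonly cited ones.

```python
import numpy as np

def adam(grad, theta0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta = theta0.astype(float)
    m = np.zeros_like(theta)   # first moment: EMA of gradients (momentum / smoothing)
    v = np.zeros_like(theta)   # second moment: EMA of squared gradients ("volatility")
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)   # bias correction (omitted in the formulas above)
        v_hat = v / (1.0 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```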

AdamW makes a slight modification to the original algorithm, retaining the same structure yet adding another constant $\lambda$ and a different, decoupled kind of $\theta$ decay.

$$\theta_t = \theta_{t-1} - \eta \left( \frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda \theta_{t-1} \right)$$ where $m_t$ and $v_t$ are the same as in Adam. $\lambda$, the weight decay coefficient, is generally in the range $\lambda \in [10^{-4}, 10^{-2}]$.
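And a corresponding AdamW sketch; the only change from the Adam sketch above is that $\lambda \theta_{t-1}$ is added to the update directly (decoupled weight decay) rather than being folded into the gradient as plain L2 regularization would be. The default values here are illustrative assumptions.

```python
import numpy as np

def adamw(grad, theta0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
          weight_decay=1e-2, steps=1000):
    theta = theta0.astype(float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        # Decoupled weight decay: lambda * theta is applied in the update itself,
        # not added to the gradient that feeds m_t and v_t.
        theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta
```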