Classical ML Equations in LaTeX

A collection of classical ML equations in LaTeX. Some of them come with simple notes and paper links. Hopefully this helps with writing papers and blog posts.

Better viewed at https://blmoistawinde.github.io/ml_equations_latex/

Model

RNNs (LSTM, GRU)

Encoder hidden state $h_t$ at time step $t$, with input token embedding $x_t$:

h_t = RNN_{enc}(x_t, h_{t-1})

Decoder hidden state $s_t$ at time step $t$, with input token embedding $y_t$:

s_t = RNN_{dec}(y_t, s_{t-1})

$RNN_{enc}$ and $RNN_{dec}$ are usually either an LSTM or a GRU cell.
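As a rough illustration, here is a minimal NumPy sketch of the recurrence above, substituting a vanilla tanh RNN cell for the LSTM/GRU; all weight names and sizes are toy values made up for the example.

```python
import numpy as np

def rnn_cell(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla tanh RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
# toy sizes: embedding dimension 4, hidden dimension 3
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

# encoder: h_t = RNN_enc(x_t, h_{t-1}) over the source embeddings x_1..x_T
xs = [rng.normal(size=4) for _ in range(5)]
h = np.zeros(3)
encoder_states = []
for x_t in xs:
    h = rnn_cell(x_t, h, W_x, W_h, b)
    encoder_states.append(h)

# decoder: s_t = RNN_dec(y_t, s_{t-1}), shown for a single step
# (a real decoder would have its own weights; reused here for brevity)
y_t = rng.normal(size=4)
s_t = rnn_cell(y_t, encoder_states[-1], W_x, W_h, b)
```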

Attentional Seq2seq

The attention weight $\alpha_{ij}$ of the $i$th decoder step over the $j$th encoder step yields the context vector $c_i$:

c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}

e_{ij} = a(s_{i-1}, h_j)
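To make this concrete, here is a minimal NumPy sketch of how the weights $\alpha_{ij}$ and context vector $c_i$ could be computed for one decoder step, assuming the scores $e_{ij}$ are already given (the score function $a$ is discussed next); all sizes are toy values.

```python
import numpy as np

def attention_context(scores, encoder_states):
    """Given scores e_i1..e_iTx for one decoder step, return (alpha_i, c_i)."""
    alpha = np.exp(scores - scores.max())   # softmax over the encoder steps
    alpha /= alpha.sum()
    c_i = alpha @ encoder_states            # c_i = sum_j alpha_ij * h_j
    return alpha, c_i

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                 # T_x = 4 encoder states h_j of size 3
e_i = rng.normal(size=4)                    # scores e_ij from some score function a
alpha_i, c_i = attention_context(e_i, H)    # alpha_i sums to 1, c_i has size 3
```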

$a$ is a specific attention function, which can be one of the following:

Bahdanau Attention

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

e_{ij} = v^T tanh(W[s_{i-1}; h_j])
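A minimal NumPy sketch of this additive score under toy dimensions; `W` and `v` stand for the learned parameters in the equation.

```python
import numpy as np

def bahdanau_score(s_prev, h_j, W, v):
    """Additive score: e_ij = v^T tanh(W [s_{i-1}; h_j])."""
    return v @ np.tanh(W @ np.concatenate([s_prev, h_j]))

rng = np.random.default_rng(0)
s_prev, h_j = rng.normal(size=3), rng.normal(size=3)   # decoder / encoder states
W, v = rng.normal(size=(5, 6)), rng.normal(size=5)     # toy attention dimension 5
e_ij = bahdanau_score(s_prev, h_j, W, v)               # a single scalar score
```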

Luong (Dot-Product) Attention

Paper: Effective Approaches to Attention-based Neural Machine Translation

If $s_i$ and $h_j$ have the same dimensionality:

e_{ij} = s_{i-1}^T h_j

otherwise:

e_{ij} = s_{i-1}^T W h_j
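A minimal NumPy sketch of both multiplicative variants under toy dimensions; `W` stands for the learned matrix in the general form.

```python
import numpy as np

def luong_dot_score(s_prev, h_j):
    """e_ij = s_{i-1}^T h_j, usable when the two states have equal size."""
    return s_prev @ h_j

def luong_general_score(s_prev, h_j, W):
    """e_ij = s_{i-1}^T W h_j, where W maps h_j into the decoder state space."""
    return s_prev @ (W @ h_j)

rng = np.random.default_rng(0)
s_prev = rng.normal(size=3)
e_dot = luong_dot_score(s_prev, rng.normal(size=3))            # same sizes
e_general = luong_general_score(s_prev, rng.normal(size=4),
                                rng.normal(size=(3, 4)))       # different sizes
```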

Finally, the output $o_t$ is produced by:

s_t = tanh(W[s_{t-1};y_t;c_t])

o_t = softmax(Vs_t)
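A minimal NumPy sketch of this output step; the matrices `W`, `V` and the vocabulary size are toy values chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_output(s_prev, y_t, c_t, W, V):
    """s_t = tanh(W [s_{t-1}; y_t; c_t]),  o_t = softmax(V s_t)."""
    s_t = np.tanh(W @ np.concatenate([s_prev, y_t, c_t]))
    return s_t, softmax(V @ s_t)

rng = np.random.default_rng(0)
s_prev, y_t, c_t = rng.normal(size=3), rng.normal(size=4), rng.normal(size=3)
W = rng.normal(size=(3, 10))   # 10 = 3 + 4 + 3 concatenated inputs
V = rng.normal(size=(8, 3))    # 8 = toy vocabulary size
s_t, o_t = decoder_output(s_prev, y_t, c_t, W, V)   # o_t sums to 1 over the vocab
```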

Transformer

Paper: Attention Is All You Need

Scaled Dot-Product attention

Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V

where $d_k$ is the dimension of the key vector $k$ and query vector $q$.
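A minimal NumPy sketch of scaled dot-product attention with a row-wise softmax over the keys; all shapes are toy values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, softmax over the keys."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))    # 5 queries of dimension d_k = 8
K = rng.normal(size=(7, 8))    # 7 keys of dimension d_k = 8
V = rng.normal(size=(7, 8))    # 7 values
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 8)
```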

Multi-head attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

where

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
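A minimal NumPy sketch of multi-head attention, reusing a compact scaled dot-product attention; the projection matrices `W_q`, `W_k`, `W_v`, `W_o` are randomly initialized toy parameters.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Concat(head_1, ..., head_h) W_O with head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)."""
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, h, d_k = 8, 2, 4                      # toy sizes with d_k = d_model / h
Q = K = V = rng.normal(size=(5, d_model))      # self-attention over 5 positions
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o)   # shape (5, d_model)
```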

Generative Adversarial Networks (GAN)

Paper: Generative Adversarial Networks

Minimax game objective