A collection of classical ML equations in LaTeX. Some come with brief notes and paper links. Hopefully this helps with writing papers and blogs.
Better viewed at https://blmoistawinde.github.io/ml_equations_latex/
encoder hidden state $h_t$ at time step $t$, with input token embedding $x_t$
decoder hidden state $s_t$ at time step $t$, with input token embedding $y_t$
h_t = RNN_{enc}(x_t, h_{t-1})
s_t = RNN_{dec}(y_t, s_{t-1})
The $RNN_{enc}$, $RNN_{dec}$ are usually either
LSTM (paper: Long short-term memory)
GRU (paper: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation).
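Below is a minimal numpy sketch of the encoder recurrence (the decoder works the same way on $y_t$, $s_t$). A plain tanh RNN cell stands in for the LSTM/GRU cell, and all sizes and weights are made-up placeholders:

```python
import numpy as np

def rnn_cell(x, h_prev, W_x, W_h, b):
    # Vanilla RNN step, standing in for an LSTM/GRU cell.
    return np.tanh(W_x @ x + W_h @ h_prev + b)

d_in, d_hid = 4, 8                      # embedding and hidden sizes (placeholders)
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_hid, d_in))
W_h = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

# Encoder: h_t = RNN_enc(x_t, h_{t-1}) over a toy source sequence of 5 tokens.
xs = rng.normal(size=(5, d_in))
h = np.zeros(d_hid)
encoder_states = []
for x_t in xs:
    h = rnn_cell(x_t, h, W_x, W_h, b)
    encoder_states.append(h)
encoder_states = np.stack(encoder_states)   # shape (T_x, d_hid)
```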
The attention weight $\alpha_{ij}$ of the $i$th decoder step over the $j$th encoder step, resulting in context vector $c_i$
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
e_{ij} = a(s_{i-1}, h_j)
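As an illustration, here is a small numpy sketch that turns the scores $e_{ij}$ of one decoder step into attention weights and a context vector (the function name and shapes are assumptions, not from the paper):

```python
import numpy as np

def context_vector(scores, encoder_states):
    # scores: e_i = [e_i1, ..., e_iTx] for one decoder step i, shape (T_x,)
    # encoder_states: stacked h_j, shape (T_x, d_hid)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()          # alpha_ij: softmax over encoder steps
    c_i = alpha @ encoder_states         # c_i: weighted sum of encoder states
    return c_i, alpha
```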
$a$ is a specific attention function, which can be
Paper: Neural Machine Translation by Jointly Learning to Align and Translate
e_{ij} = v^T tanh(W[s_{i-1}; h_j])
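A one-function numpy sketch of this additive score; `W` and `v` are the learned parameters from the equation above, everything else is a placeholder:

```python
import numpy as np

def additive_score(s_prev, h_j, W, v):
    # e_ij = v^T tanh(W [s_{i-1}; h_j])
    return v @ np.tanh(W @ np.concatenate([s_prev, h_j]))
```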
Paper: Effective Approaches to Attention-based Neural Machine Translation
If $s_{i-1}$ and $h_j$ have the same dimensionality:
e_{ij} = s_{i-1}^T h_j
otherwise:
e_{ij} = s_{i-1}^T W h_j
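A sketch of the two multiplicative scores above, operating on numpy vectors (the function names are mine, not from the paper):

```python
def dot_score(s_prev, h_j):
    # e_ij = s_{i-1}^T h_j   (decoder and encoder states have the same size)
    return s_prev @ h_j

def general_score(s_prev, h_j, W):
    # e_ij = s_{i-1}^T W h_j (W maps h_j into the decoder state space)
    return s_prev @ (W @ h_j)
```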
Finally, the output is produced by:
s_t = tanh(W[s_{t-1};y_t;c_t])
o_t = softmax(Vs_t)
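A small numpy sketch of this output step, with placeholder weight matrices `W` and `V`:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(s_prev, y_t, c_t, W, V):
    # s_t = tanh(W [s_{t-1}; y_t; c_t])
    s_t = np.tanh(W @ np.concatenate([s_prev, y_t, c_t]))
    # o_t = softmax(V s_t): distribution over the output vocabulary
    o_t = softmax(V @ s_t)
    return s_t, o_t
```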
Paper: Attention Is All You Need
Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
where $d_k$ is the dimension of the key vector $k$ and query vector $q$.
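A minimal numpy sketch of scaled dot-product attention over row-stacked queries, keys, and values (shapes are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                               # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                            # (n_q, d_v)
```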
The multi-head variant is
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
where
head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
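A compact numpy sketch of multi-head attention; passing the per-head projections as lists is just one convenient way to write it, not how the paper parameterizes it:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: lists of per-head projection matrices; Wo projects the concatenation.
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ Wo
```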
Paper: Generative Adversarial Networks