
# [Note] Deep Learning: Practice and Trends (NIPS 2017)

## Practice

### Architectures

- Convolutional Nets
- Locality: objects tend to have a local spatial support
- locally-connected

- Translation invariance: object appearance is independent of location
- Tricks of the Trade
- Optimization
- SGD with momentum
- Batch Norm

- Initialization
- Weight init: start from the weights which lead to stable training
- Sample from zero-mean normal distribution w/ small variance 0.01
- Adaptively choose variance for each layer
- preserve gradient magnitude: 1/sqrt(fan_in)
- works fine for VGGNets (up to 20 layers), but not sufficient for deeper nets


- Model
- Stacking 3x3 convolutions
- Inception
- ResNet adds modules which ensure that the gradient doesn’t vanish


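
The initialization notes above can be checked numerically. The following is a minimal sketch, not the tutorial's code: it pushes one random input through a stack of purely linear layers of equal width (an assumption made here so that forward activation scale can stand in for gradient magnitude) and reports the activation standard deviation after each layer. The names `init_layer`, `forward_std` and `fixed_std` are hypothetical.

```python
import math
import random


def init_layer(fan_in, fan_out, rng, fixed_std=None):
    """Return a fan_out x fan_in weight matrix drawn from a zero-mean normal.

    The std defaults to 1/sqrt(fan_in); pass fixed_std to use one
    layer-independent value instead.
    """
    std = fixed_std if fixed_std is not None else 1.0 / math.sqrt(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def forward_std(widths, rng, fixed_std=None):
    """Propagate one random input through linear layers; return per-layer activation std."""
    x = [rng.gauss(0.0, 1.0) for _ in range(widths[0])]
    stds = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        w = init_layer(fan_in, fan_out, rng, fixed_std)
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
        mean = sum(x) / len(x)
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in x) / len(x)))
    return stds


rng = random.Random(0)
widths = [256] * 6  # toy setting: five linear layers of width 256

# Adaptive variance: std = 1/sqrt(fan_in), the scale stays close to 1.
print(forward_std(widths, rng))

# Fixed std 0.1 (variance 0.01): the scale drifts geometrically with depth.
# At width 256 it grows; at small widths it would shrink instead.
print(forward_std(widths, rng, fixed_std=0.1))
```

Real networks have nonlinearities and unequal layer widths, so this only illustrates why a per-layer variance is preferred over a single fixed one.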

### Recurrent Nets

- Two Key Ingredients: Neural Embeddings, Recurrent Language Models
- Dot product Attention
- Inputs: “I am a cat”
- Input RNN states: $e_1, e_2, e_3, e_4$
- Decoder RNN state at step i (query): $h_i$
- Compute scalars $h_i^Te_1, h_i^Te_2, …$ representing similarity / relevance between encoder steps and query
- Normalize $[h_i^Te_1, h_i^Te_2, …]$ with softmax to produce attention weights
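
The dot-product attention steps above can be sketched in plain Python. The encoder states and query below are made-up toy vectors, and the final weighted sum (the context vector) is the usual next step rather than something stated in these notes.

```python
import math


def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, encoder_states):
    """Return (attention weights, context vector) for one decoder query h_i."""
    # Similarity between the query and every encoder state: h_i^T e_j
    scores = [sum(q * e for q, e in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    # Assumption: the context is the attention-weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy encoder states e_1..e_4 (one per input token) and a decoder query h_i.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]
weights, context = dot_product_attention(query, encoder_states)
print(weights)
print(context)
```

Here the first and third encoder states have the largest dot product with the query, so they receive the largest (and equal) weights.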

- Tricks of the Trade
- Long sequences?
- Attention
- Bigger state
- Reverse inputs

- Can’t overfit?
- Bigger hidden state
- Deep LSTM + Skip Connections

- Overfit?
- Dropout + Ensembles

- Tuning
- Keep calm and decrease your learning rate
- Initialization of parameters is critical (in seq2seq we used U(-0.05, 0.05))
- **Clip the gradients!**
- E.g. if ||grad|| > 5: grad = grad / ||grad|| * 5
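
The clipping rule above, written out as a small function. This is a sketch over a flat list of gradient values; the name `clip_by_norm` is hypothetical, and treating all parameters as one vector (a global norm) is an assumption about how the rule is applied.

```python
import math


def clip_by_norm(grads, max_norm=5.0):
    """Rescale grads so their L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


print(clip_by_norm([30.0, 40.0]))  # norm 50 exceeds 5, so the vector is rescaled
print(clip_by_norm([0.3, 0.4]))    # norm 0.5 is below the threshold, left unchanged
```

Only the magnitude changes, so a clipped update still points along the original gradient.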

- Attention and Memory Toolbox
- Read/Write memories (neural turing machine)
- Sequence Prediction
- Seq2Seq
- Temporal Hierarchies
- Multimodality
- Attention/Pointers
- Recurrent Architectures
- Key, Value memories

## Trends

### Autoregressive Models

$$P(x;\theta) = \prod_{n=1}^N P(x_n \mid x_{<n};\theta)$$

- Each factor can be parametrized by $\theta$, which can be shared
- The variables can be arbitrarily ordered and grouped, as long as the ordering and grouping are consistent.
- Building Blocks
- Inputs and Outputs: these can also be conditioning variables
- Image pixels
- Text sequences
- Audio waveforms

- Architectures
- Recurrent, over space and time
- Causal convolutions
- Causal conv + attention
- Attention-only!

- Losses:
- Discrete case: softmax cross entropy
- Continuous: Gaussian (mixture) likelihood
- **Mixture of logistics loss** (PixelCNN++, ICLR 2017)
- $$ \nu \sim \sum_{i=1}^K \pi_i \, \mathrm{logistic}(\mu_i, s_i) $$
- $$ P(x \mid \pi,\mu,s) = \sum_{i=1}^K \pi_i \left[ \sigma((x+0.5-\mu_i)/s_i) - \sigma((x-0.5-\mu_i)/s_i) \right] $$
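
A direct transcription of the discretized mixture-of-logistics probability above, as a sketch. The mixture parameters are made-up values, and the function follows the formula exactly as written in these notes, so it does not give the end bins of the pixel range any special treatment.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def discretized_logistic_mixture_prob(x, pis, mus, scales):
    """P(x | pi, mu, s): mixture mass falling in the unit-width bin centered on x."""
    return sum(
        pi * (sigmoid((x + 0.5 - mu) / s) - sigmoid((x - 0.5 - mu) / s))
        for pi, mu, s in zip(pis, mus, scales)
    )


# Hypothetical two-component mixture over integer intensities.
pis = [0.6, 0.4]
mus = [100.0, 180.0]
scales = [5.0, 10.0]

print(discretized_logistic_mixture_prob(100, pis, mus, scales))
print(discretized_logistic_mixture_prob(140, pis, mus, scales))
# Training would minimize -log of this probability for each observed value.
```

Because each bin is a difference of CDF values, the bin probabilities telescope and sum to 1 over all integers.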

- Autoregressive scoring and sampling
- Fully sequential models:
- PixelCNN, PixelCNN++, WaveNet, …
- O(1) scoring, O(N) sampling

- Models with conditional independence assumptions:
- O(1) scoring, sampling can be O(1), O(log N), etc depending on cond. indep. assumptions

- Distilled models:
- Parallel WaveNet, Parallel NMT
- O(N) scoring, O(1) sampling

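
To make the scoring/sampling asymmetry of fully sequential models concrete, here is a toy autoregressive model over binary sequences. It is a sketch under assumptions: the conditional is a logistic function of a short causal filter over previous values, the weights are made up, and O(1) versus O(N) is read here as the number of steps that must run one after another.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conditional_logit(prefix, weights, bias):
    """Logit of P(x_n = 1 | x_<n); weights[0] multiplies the most recent value."""
    return bias + sum(w * v for w, v in zip(weights, reversed(prefix)))


def log_likelihood(x, weights, bias):
    """Score a full sequence: sum_n log P(x_n | x_<n)."""
    total = 0.0
    # All of x is known, so each term is independent of the others and
    # could be computed in parallel (a single pass on parallel hardware).
    for n in range(len(x)):
        p1 = sigmoid(conditional_logit(x[:n], weights, bias))
        total += math.log(p1 if x[n] == 1 else 1.0 - p1)
    return total


def sample(length, weights, bias, rng):
    """Draw a sequence: step n needs the sampled x_<n, so this loop is sequential."""
    x = []
    for _ in range(length):
        p1 = sigmoid(conditional_logit(x, weights, bias))
        x.append(1 if rng.random() < p1 else 0)
    return x


weights, bias = [1.5, -0.7], -0.2  # hypothetical parameters
seq = sample(8, weights, bias, random.Random(0))
print(seq, log_likelihood(seq, weights, bias))
```

Because the model is a product of normalized conditionals, the probabilities of all sequences of a given length sum to 1.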

### Domain Alignment (unsupervised)

- Building Blocks
- Inputs and Outputs
- Sets of images with shared structure, but weak or no alignment
- Text corpora in different languages, but not parallel

- Architectures
- Nothing fancy!
- For images: mostly convolutional nets
- For text: recurrent

- Losses
- Latent space: Domain confusion
- Pixel space: Cycle consistency
- Both adversarial loss and likelihoods work!

- Approach
- Cross-modal retrieval
- Unsupervised domain transfer for classification

- Cases
- DiscoGAN - Car2Face
- GraspGAN
- Unsupervised Neural Machine Translation
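
The cycle-consistency loss listed under Losses can be sketched as follows: map a sample to the other domain and back, then penalize the reconstruction error in both directions. The L1 distance and the toy mappings `g_double` and `f_halve` are assumptions for illustration; in practice both mappings are learned networks and this term is combined with an adversarial or likelihood loss.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def cycle_consistency_loss(xs, ys, g, f):
    """g maps domain X to Y, f maps Y to X; penalize x -> g -> f and y -> f -> g."""
    forward = sum(l1_distance(f(g(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(g(f(y)), y) for y in ys) / len(ys)
    return forward + backward


# Hypothetical stand-ins for learned mappings between two unpaired domains.
def g_double(x):
    return [2.0 * v for v in x]


def f_halve(y):
    return [v / 2.0 for v in y]


xs = [[1.0, 2.0], [3.0, 4.0]]  # samples from domain X
ys = [[2.0, 4.0], [6.0, 8.0]]  # samples from domain Y (no pairing required)

print(cycle_consistency_loss(xs, ys, g_double, f_halve))
print(cycle_consistency_loss(xs, ys, g_double, lambda y: [v / 3.0 for v in y]))
```

The loss is zero when the two mappings invert each other and positive otherwise, which is what lets it constrain the mappings without aligned pairs.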

### Learning to Learn / Meta Learning

- Building Blocks:
- Inputs and Outputs: text, images, actions
- Architectures: Recurrent, CNN (+attention)
- Losses (loss based on another loss): Model, Optimization, Initialization

- Learning to learn:
- What is Meta Learning?
- Go beyond train/test from same distribution
- Task between train/test changes, so model has to “learn to learn”

- Model Based, Metric Based, Optimization Based

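
As one illustration of the metric-based family mentioned above, a new task can be solved by comparing a query to a few labeled examples in an embedding space. The sketch below uses nearest class mean with Euclidean distance; this specific choice is an assumption, not something taken from the tutorial, and the embeddings are made-up vectors that a learned encoder would normally produce.

```python
import math


def class_prototypes(support):
    """support: dict mapping label -> list of embedding vectors. Returns label -> mean vector."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return prototypes


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    def distance(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    return min(prototypes, key=lambda label: distance(query, prototypes[label]))


# Hypothetical 2-way, 2-shot task: two labeled embeddings per class.
support = {
    "cat": [[0.0, 0.0], [0.0, 2.0]],
    "dog": [[10.0, 10.0], [12.0, 10.0]],
}
prototypes = class_prototypes(support)
print(classify([1.0, 1.0], prototypes))
print(classify([9.0, 9.0], prototypes))
```

No parameters are updated at test time here; adapting to a new task only requires new support examples, and the "learning to learn" part lies in training the encoder across many such tasks.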

## Conclusions and Expectations

- Deep autoregressive models and ConvNets are ubiquitous and already useful in consumer applications
- Inductive biases are useful
- spatial invariance for CNNs
- time recurrence for RNNs
- Permutation invariance for Graphs

- More simple tricks like ResNet will be discovered
- Adversarial networks and unsupervised domain adaptation have interesting market applications (e.g. phone apps like style transfer)
- Meta learning: more and more of the model lifecycle (train/val/test) will be learned in an end-to-end way
- Program synthesis + Graph networks may be very important and find more real-world applications (e.g. RobustFill)