[Note] Deep Learning: Practice and Trends (NIPS 2017)



Convolutional Nets

  • Locality: objects tend to have a local spatial support
    • locally-connected
  • Translation invariance: object appearance is independent of location
  • Tricks of the Trade
    • Optimization
      • SGD with momentum
      • Batch Norm
    • Initialization
      • Weight init: start from weights that lead to stable training
      • Sample from a zero-mean normal distribution with a small variance, e.g. 0.01
        • Adaptively choose the variance for each layer
          • preserve gradient magnitude: standard deviation 1/sqrt(fan_in)
          • works fine for VGGNets (up to 20 layers), but not sufficient for deeper nets
    • Model
      • Stacking 3x3 convolutions
      • Inception
      • ResNet adds modules which ensure that the gradient doesn’t vanish
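
The initialization bullets above can be illustrated with a small NumPy sketch. It is only a toy, not the tutorial's code: it pushes random inputs through a stack of purely linear square layers (an assumption made to keep the arithmetic visible) and compares a fixed variance of 0.01 with a per-layer standard deviation of 1/sqrt(fan_in). The names `init_weights` and `forward_std`, and the depth and width values, are hypothetical.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng, scheme="fan_in"):
    """Return a (fan_in, fan_out) zero-mean normal weight matrix.

    scheme="fixed":  fixed variance 0.01, i.e. std 0.1, for every layer
    scheme="fan_in": std = 1/sqrt(fan_in), chosen per layer
    """
    std = 0.1 if scheme == "fixed" else 1.0 / np.sqrt(fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def forward_std(scheme, depth=10, width=256, seed=0):
    """Std of the signal after `depth` linear layers (toy setting, no nonlinearity)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(1024, width))      # unit-variance input batch
    for _ in range(depth):
        x = x @ init_weights(width, width, rng, scheme)
    return x.std()

if __name__ == "__main__":
    print("fixed variance 0.01:", forward_std("fixed"))
    print("1/sqrt(fan_in) std :", forward_std("fan_in"))
```

With the fan-in scaling the signal scale stays close to 1 across layers, while the fixed variance makes it grow with depth at this width. Because the layers are square, the same argument applies to gradients flowing backwards in this toy setting.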

Recurrent Nets

  • Two Key Ingredients: Neural Embeddings, Recurrent Language Models
  • Dot product Attention
    • Inputs: “I am a cat”
    • Input RNN states: $e_1, e_2, e_3, e_4$
    • Decoder RNN state at step i (query): $h_i$
    • Compute scalars $h_i^Te_1, h_i^Te_2, …$ representing the similarity / relevance between each encoder step and the query
    • Normalize $[h_i^Te_1, h_i^Te_2, …]$ with a softmax to produce attention weights
  • Tricks of the Trade
    • Long sequences?
      • Attention
      • Bigger state
      • Reverse inputs
    • Can’t overfit?
      • Bigger hidden state
      • Deep LSTM + Skip Connections
    • Overfit?
      • Dropout + Ensembles
    • Tuning
      • Keep calm and decrease your learning rate
      • Initialization of parameters is critical (in seq2seq we used U(-0.05, 0.05))
      • Clip the gradients!
        • E.g. if ||grad|| > 5: grad = grad/||grad|| * 5
  • Attention and Memory Toolbox
    • Read/Write memories (neural turing machine)
    • Sequence Prediction
    • Seq2Seq
    • Temporal Hierarchies
    • Multimodality
    • Attention/Pointers
    • Recurrent Architectures
    • Key, Value memories
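
Two of the recipes in this section, dot-product attention and gradient clipping by norm, are short enough to sketch in NumPy. This is a minimal illustration rather than the tutorial's implementation: the function names are hypothetical, and the context vector (the attention-weighted sum of encoder states) is the usual next step but is not spelled out in the notes, so treat it as an assumption.

```python
import numpy as np

def dot_product_attention(query, encoder_states):
    """query h_i: shape (d,); encoder_states e_1..e_T: shape (T, d)."""
    scores = encoder_states @ query      # h_i^T e_t for every encoder step t
    scores = scores - scores.max()       # shift for a numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum()             # attention weights, sum to 1
    context = weights @ encoder_states   # assumed next step: weighted sum of states
    return weights, context

def clip_by_norm(grad, max_norm=5.0):
    """If ||grad|| > max_norm, rescale grad so that its norm equals max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad / norm * max_norm
    return grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e = rng.normal(size=(4, 8))          # four encoder states, e.g. for "I am a cat"
    h = rng.normal(size=8)               # decoder state at step i
    w, c = dot_product_attention(h, e)
    print("attention weights:", w)
    print("clipped norm:", np.linalg.norm(clip_by_norm(rng.normal(size=100) * 10)))
```

The clipping threshold of 5 follows the example in the notes; a gradient whose norm is already at or below the threshold is returned unchanged.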

Autoregressive Models

$$P(x;\theta) = \prod_{n=1}^N P(x_n|x_{<n};\theta)$$
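
As a small worked instance, assuming the usual chain-rule reading in which each factor conditions on all earlier variables, the case $N = 3$ expands to:

$$P(x_1,x_2,x_3;\theta) = P(x_1;\theta)\,P(x_2|x_1;\theta)\,P(x_3|x_1,x_2;\theta)$$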

  • Each factor can be parametrized by $\theta$, which can be shared
  • The variables can be arbitrarily ordered and grouped, as long as the ordering and grouping is consistent.
  • Building Blocks
    • Inputs and Outputs: these can also be conditioning variables
      • Image pixels
      • Text sequences
      • Audio waveforms
    • Architectures
      • Recurrent, over space and time
      • Causal convolutions
      • Causal conv + attention
      • Attention-only!
    • Losses:
      • Discrete case: softmax cross entropy
      • Continuous: Gaussian (mixture) likelihood
      • Mixture of logistics loss (PixelCNN++ in ICLR 2017)
        • $$ v \sim \sum_{i=1}^K \pi_i \,\mathrm{logistic}(\mu_i, s_i) $$
        • $$ P(x|\pi,\mu,s) = \sum_{i=1}^K \pi_i \left[\sigma\left((x+0.5-\mu_i)/s_i\right) - \sigma\left((x-0.5-\mu_i)/s_i\right)\right] $$
  • Autoregressive scoring and sampling
    • Fully sequential models:
      • PixelCNN, PixelCNN++, WaveNet, …
      • O(1) scoring, O(N) sampling
    • Models with conditional independence assumptions:
      • O(1) scoring; sampling can be O(1), O(log N), etc., depending on the conditional independence assumptions
    • Distilled models:
      • Parallel WaveNet, Parallel NMT
      • O(N) scoring, O(1) sampling
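
The discretized mixture-of-logistics likelihood listed under Losses can be checked numerically with a short NumPy sketch. It evaluates the bin probability $P(x|\pi,\mu,s)$ exactly as written in the formula; the function names and the example parameters are hypothetical. PixelCNN++ also treats the edge bins (0 and 255) specially, which this sketch leaves out.

```python
import numpy as np

def sigmoid(z):
    # Logistic CDF, written via tanh so large |z| does not overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def discretized_logistic_mixture_pmf(x, pi, mu, s):
    """P(x | pi, mu, s) for integer-valued x under a K-component mixture.

    x: array of integers, any shape; pi, mu, s: arrays of shape (K,).
    Each bin gets the logistic CDF mass between x - 0.5 and x + 0.5.
    """
    x = np.asarray(x, dtype=float)[..., None]   # add an axis for the K components
    upper = sigmoid((x + 0.5 - mu) / s)
    lower = sigmoid((x - 0.5 - mu) / s)
    return np.sum(pi * (upper - lower), axis=-1)

if __name__ == "__main__":
    pi = np.array([0.3, 0.7])       # mixture weights, sum to 1
    mu = np.array([60.0, 180.0])    # component locations
    s = np.array([5.0, 10.0])       # component scales
    p = discretized_logistic_mixture_pmf(np.arange(0, 256), pi, mu, s)
    print("most likely pixel value:", p.argmax())
    print("mass inside [0, 255]:", p.sum())
```

Summed over a wide enough range of integers the bin probabilities telescope to 1, which is a quick sanity check that the formula defines a proper distribution.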

Domain Alignment (unsupervised)

  • Building Blocks
    • Inputs and Outputs
      • Sets of images with shared structure, but weak or no alignment
      • Text corpora in different languages, but not parallel
    • Architectures
      • Nothing fancy!
      • For images: mostly convolutional nets
      • For text: recurrent
    • Losses
      • Latent space: Domain confusion
      • Pixel space: Cycle consistency
      • Both adversarial loss and likelihoods work!
  • Approach
    • Cross-modal retrieval
    • Unsupervised domain transfer for classification
  • Cases
    • DiscoGAN - Car2Face
    • GraspGAN
    • Unsupervised Neural Machine Translation
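
The cycle-consistency loss mentioned under Losses can be sketched as follows. This is a rough illustration under stated assumptions: the two mappings `g_xy` and `g_yx` are hypothetical stand-ins for the learned networks (plain linear maps here), and the L1 distance is one common choice rather than something the notes specify.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle loss: map X -> Y -> X and Y -> X -> Y, then compare with the inputs.

    g_xy and g_yx are callables standing in for the two domain-transfer networks.
    """
    x_cycle = g_yx(g_xy(x))    # x translated to domain Y and back
    y_cycle = g_xy(g_yx(y))    # y translated to domain X and back
    return np.mean(np.abs(x_cycle - x)) + np.mean(np.abs(y_cycle - y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 3))            # unpaired samples from domain X
    y = rng.normal(size=(8, 3))            # unpaired samples from domain Y
    A = np.array([[2.0, 1.0, 0.0],
                  [0.0, 2.0, 1.0],
                  [0.0, 0.0, 2.0]])        # toy invertible "translator"
    A_inv = np.linalg.inv(A)
    consistent = cycle_consistency_loss(x, y, lambda v: v @ A, lambda v: v @ A_inv)
    mismatched = cycle_consistency_loss(x, y, lambda v: v @ A, lambda v: v)
    print("consistent pair:", consistent)
    print("mismatched pair:", mismatched)
```

When the two maps invert each other the loss is essentially zero; a mismatched pair is penalized even though no aligned (x, y) pairs are ever used.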

Learning to Learn / Meta Learning

  • Building Blocks:
    • Inputs and Outputs: text, images, actions
    • Architectures: Recurrent, CNN (+attention)
    • Losses (loss based on another loss): Model, Optimization, Initialization
  • Learning to learn:
    • What is Meta Learning?
      • Go beyond train/test from same distribution
      • Task between train/test changes, so model has to “learn to learn”
    • Model Based, Metric Based, Optimization Based

Conclusions and Expectations

  • Deep autoregressive models and ConvNets are ubiquitous and already useful in consumer applications
  • Inductive biases are useful
    • spatial invariance for CNNs
    • time recurrence for RNNs
    • Permutation invariance for Graphs
  • More simple tricks like ResNet will be discovered
  • Adversarial networks and unsupervised domain adaptation have interesting market applications (e.g. phone apps like style transfer)
  • Meta learning: more and more of the model lifecycle (train/val/test) will be learned in an end-to-end way
  • Program synthesis + Graph networks may be very important and find more real-world applications (e.g. RobustFill)

