This repository was archived by the owner on Jul 11, 2021. It is now read-only.
Markham Anderson edited this page May 16, 2019 · 3 revisions

To-do

  • Fine tune Resnet-18 using KL Divergence
  • Train SAE on CIFAR10
  • Fine tune SAE using KL Divergence

What I might do

Bokun tried Xie's experiment on CIFAR10 and found that it does not work well on complex data like this: the loss function relies on the initial embedding already placing similar examples close together, and autoencoders do not produce such an embedding for image data.

I'll try to compare two results:

  1. repeat Bokun's work (i.e. Xie's work which uses SAE's but on CIFAR10)
  2. do the same experiment with a pretrained image classifier

I could still use the actual labels to compute accuracy, but I would not use them for the loss, just for comparison with the original model. For the loss, I would use the KL-divergence loss from Xie et al.
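One way to compute accuracy from the labels without letting them touch the loss is the unsupervised clustering accuracy used by Xie et al.: take the best one-to-one mapping from cluster ids to label ids and score that. The sketch below assumes cluster indices need not line up with label indices; `clustering_accuracy` is a hypothetical helper name, and the brute-force search over mappings is a stand-in for the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`), which is the usual choice once k reaches CIFAR10's 10 classes.

```python
from itertools import permutations


def clustering_accuracy(labels, clusters, k):
    """Best accuracy over all one-to-one mappings from cluster ids to label ids."""
    # counts[c][y] = number of points in cluster c whose true label is y
    counts = [[0] * k for _ in range(k)]
    for y, c in zip(labels, clusters):
        counts[c][y] += 1
    # Brute force over all k! mappings: fine for small k, slow by k = 10.
    best = max(
        sum(counts[c][perm[c]] for c in range(k))
        for perm in permutations(range(k))
    )
    return best / len(labels)


# Toy example (made-up data): clusters are a relabeling of the classes,
# except for the last point, which lands in the wrong cluster.
labels = [0, 0, 1, 1, 2, 2]
clusters = [2, 2, 0, 0, 1, 0]
acc = clustering_accuracy(labels, clusters, k=3)
```

The labels enter only through `counts`, so nothing here feeds back into training.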

Questions

  1. How do I do greedy layer-wise training with the constraints that most of the layers are ReLUs? (See below.)
  2. When would I stop layer-wise training? (i.e. how many iterations?)

Primary source material

Consider the paper Unsupervised Deep Embedding for Clustering Analysis by Xie, Girshick, and Farhadi from 2016. https://arxiv.org/abs/1511.06335v2

For one of their experiments, they use SAE's (stacked autoencoders) to create an embedding on which to cluster, then refine the embedding and the cluster centroids by backpropagating a KL-divergence loss. Their results with DEC (Deep Embedded Clustering) surpass previous clustering methods.
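A minimal sketch of that KL-divergence loss as the paper defines it: a Student's t kernel turns distances between embedded points and centroids into soft assignments Q, a sharpened target distribution P is derived from Q, and the loss is KL(P || Q). The function names are hypothetical, `alpha=1.0` is the paper's setting, and the code is plain Python for readability; a real implementation would be vectorized and differentiable in a framework such as PyTorch.

```python
import math


def soft_assignments(embeddings, centroids, alpha=1.0):
    """q[i][j]: Student's t similarity of point i to centroid j, normalized per point."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


def target_distribution(q):
    """p[i][j]: q squared, divided by the soft cluster frequency, renormalized per point."""
    k = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(k)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(k)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_loss(p, q):
    """KL(P || Q), summed over all points and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0.0
    )


# Toy example (made-up 2-D embeddings and centroids).
embeddings = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
q = soft_assignments(embeddings, centroids)
p = target_distribution(q)
loss = kl_loss(p, q)
```

Because P is built from Q, the loss only pushes already-confident assignments further apart. That is why the method depends on the initial embedding being reasonable.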

Implementation

  • Each SAE is two layers deep
  • SAE's get greedy layer-wise training
  • Use ReLUs except for decoder of first SAE and encoder of last SAE.
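The activation rule above can be written down as a per-SAE configuration. This is a sketch only: `sae_stack` is a hypothetical helper, "linear" assumes that "no ReLU" means an identity activation, and the d-500-500-2000-10 layer sizes are the ones the paper reports, with d = 2000 matching the tf-idf features.

```python
def sae_stack(dims):
    """Return one (in_dim, hidden_dim, encoder_act, decoder_act) tuple per SAE."""
    n = len(dims) - 1
    stack = []
    for i in range(n):
        # Last encoder is linear so the final embedding is not clipped at zero.
        encoder_act = "linear" if i == n - 1 else "relu"
        # First decoder is linear so it can reconstruct raw inputs of either sign.
        decoder_act = "linear" if i == 0 else "relu"
        stack.append((dims[i], dims[i + 1], encoder_act, decoder_act))
    return stack


stack = sae_stack([2000, 500, 500, 2000, 10])
for in_dim, hidden_dim, enc, dec in stack:
    print(f"{in_dim} -> {hidden_dim}: encoder={enc}, decoder={dec}")
```

In greedy layer-wise training, SAE i is trained on the output of the encoder of SAE i-1. For every SAE after the first, that input comes out of a ReLU and is non-negative, so a ReLU decoder can still reconstruct it.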

Dataset and evaluation

  • Work on subset of 10k examples from Reuters
  • Uses tf-idf features on the 2000 most frequently occurring word stems (is dataset available?)
  • Pretrain each SAE for 50k iterations with dropout of 20%.
  • Fine-tune whole network for 100k iterations.
  • Minibatch size of 256
  • SAE phase: lr 0.1, divided by 10 every 20k iterations; no weight decay
  • Initialize centroids: k-means with 20 restarts, keeping the best
  • KL-divergence phase:
    • lr 0.01
    • convergence threshold 0.1%
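Two of the settings above are easy to get subtly wrong, so here is a small sketch of each. The function names are hypothetical. The stopping rule follows the paper's description of the threshold (stop when fewer than 0.1% of points change cluster assignment between two consecutive checks), which is an interpretation the list above does not spell out.

```python
def pretrain_lr(iteration, base_lr=0.1, step=20_000, factor=10.0):
    """SAE-phase learning rate: base_lr divided by `factor` every `step` iterations."""
    return base_lr / factor ** (iteration // step)


def has_converged(prev_assignments, assignments, tol=0.001):
    """True when the fraction of points that switched cluster is below tol (0.1%)."""
    changed = sum(a != b for a, b in zip(prev_assignments, assignments))
    return changed / len(assignments) < tol


# Learning rate at a few points of the 50k-iteration pretraining run.
schedule = [pretrain_lr(i) for i in (0, 19_999, 20_000, 45_000)]

# 1 of 2000 points changed cluster: 0.05% < 0.1%, so training would stop.
prev = [0] * 2000
curr = [0] * 1999 + [1]
done = has_converged(prev, curr)
```

The KL-divergence phase uses a constant lr of 0.01, so no schedule is needed there.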

Their improvements

  • new SOTA
  • k-means and GMM are fast, but distance metrics are limited to the original data space
  • variant of k-means which iteratively projects data into lower dimensional space is limited to linear embeddings
  • spectral clustering uses lots of memory; this paper's method is parametric (linear in number of data points)
