Home

To-do

Fine-tune ResNet-18 using KL divergence
Train SAE on CIFAR10
Fine-tune SAE using KL divergence

What I might do

Bokun tried Xie's experiment on CIFAR10 and found that it does not work well on complex data like this. The loss function relies on the initial embedding already capturing similarity between examples, and autoencoders don't produce such an embedding for image data.

I'll try to compare two results:

repeat Bokun's work (i.e., Xie's experiment, which uses SAE's, but on CIFAR10)
do the same experiment with a pretrained image classifier

I could still use the actual labels to compute accuracy, but I would not use them for the loss, only for comparison with the original model. For the loss, I would use the KL-divergence loss from Xie et al.
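A minimal sketch of how I could score cluster assignments against the true labels without those labels touching the loss. It assumes the usual best one-to-one matching between cluster ids and class labels, found with the Hungarian algorithm. The function name `cluster_accuracy` and the toy arrays are my own, not from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cluster_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to labels.

    Labels are used for evaluation only, never for training.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = int(max(y_true.max(), y_pred.max())) + 1

    # counts[c, t] = number of points placed in cluster c whose true label is t
    counts = np.zeros((n, n), dtype=np.int64)
    for c, t in zip(y_pred, y_true):
        counts[c, t] += 1

    # The solver minimizes cost, so negate the counts to maximize matches.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / y_true.size


# Cluster ids are a permutation of the labels, so the match is perfect.
print(cluster_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0]))
```

Because the mapping is recomputed on every call, the same function should work for both the SAE run and the pretrained-classifier run.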

Questions

How do I do greedy layer-wise training given the constraint that most of the layers are ReLUs? (See below.)
When would I stop layer-wise training? (i.e., how many iterations?)

Primary source material

Consider the paper Unsupervised Deep Embedding for Clustering Analysis by Xie, Girshick, and Farhadi from 2016. https://arxiv.org/abs/1511.06335v2

For one of their experiments, they use SAE's (stacked autoencoders) to create an embedding on which to cluster, then backpropagate the error, based on KL divergence. Their results with DEC (Deep Embedded Clustering) surpass previous clustering methods.
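To make the KL-divergence objective concrete, here is a NumPy sketch of the forward computation as I understand it from the paper: a Student's t soft assignment of embedded points to centroids, a sharpened target distribution, and KL(P || Q) between them. The function names, the `eps` guard, and the random toy data are my own assumptions, and the gradient step is left to whatever framework does the backpropagation.

```python
import numpy as np


def soft_assign(z, mu, alpha=1.0):
    """q[i, j]: Student's t similarity of embedded point i to centroid j."""
    # Squared Euclidean distances, shape (n_points, n_clusters)
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """p[i, j]: q squared, divided by soft cluster frequency, renormalized."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)


def kl_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over all points and clusters (eps avoids log(0))."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 2))    # toy embedded points (hypothetical)
mu = rng.normal(size=(3, 2))   # toy centroids (hypothetical)

q = soft_assign(z, mu)
p = target_distribution(q)
print(kl_loss(p, q))
```

During fine-tuning, p would be treated as a fixed target while the gradient flows through q into both the encoder weights and the centroids.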

Implementation

Each SAE is two layers deep
SAE's get greedy layer-wise training
Use ReLUs except for decoder of first SAE and encoder of last SAE.
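The activation rule above can be written down as a small layout helper, which may also help with the greedy layer-wise question. This is only a sketch: `sae_pairs` is a hypothetical helper, it assumes the two exceptions are plain linear (identity) activations, and the layer sizes in the example are illustrative rather than taken from these notes.

```python
def sae_pairs(dims):
    """Describe one encoder/decoder pair per consecutive pair of sizes.

    dims = [input_dim, hidden_1, ..., embedding_dim]
    """
    pairs = []
    last = len(dims) - 2  # index of the final SAE
    for i in range(len(dims) - 1):
        pairs.append({
            "in_dim": dims[i],
            "hidden_dim": dims[i + 1],
            # Encoder of the last SAE is linear (assumed identity activation).
            "encoder_act": "linear" if i == last else "relu",
            # Decoder of the first SAE is linear (assumed identity activation).
            "decoder_act": "linear" if i == 0 else "relu",
        })
    return pairs


# Illustrative sizes only (assumption): 2000-d input down to a 10-d embedding.
layout = sae_pairs([2000, 500, 500, 2000, 10])
for pair in layout:
    print(pair)
```

For greedy training, each SAE in the list would be trained on the encoder output of the one before it, with the first SAE trained on the raw input.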

Dataset and evaluation

Work on subset of 10k examples from Reuters
Uses tf-idf features on the 2000 most frequently occurring word stems (is dataset available?)
Pretrain each SAE for 50k iterations with dropout of 20%.
Fine-tune whole network for 100k iterations.
minibatch size of 256
SAE phase: lr 0.1, dividing by 10 every 20k iterations. no weight decay
initialize centroids: k-means with 20 restarts, keeping the best
KL-divergence phase:
- lr 0.01
- convergence threshold 0.1%
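The two schedule-related settings above can be sketched as small helpers. I am assuming the 0.1% convergence threshold means "stop when fewer than 0.1% of points change cluster assignment between two consecutive checks". The function names and default arguments are my own.

```python
import numpy as np


def sae_learning_rate(iteration, base_lr=0.1, drop_every=20_000, factor=10.0):
    """Step decay for the SAE phase: divide the rate by `factor` every `drop_every` iterations."""
    return base_lr / factor ** (iteration // drop_every)


def should_stop(prev_assign, new_assign, tol=0.001):
    """True when the fraction of points that switched cluster is below tol (0.1%)."""
    changed = np.mean(np.asarray(prev_assign) != np.asarray(new_assign))
    return bool(changed < tol)


print(sae_learning_rate(0), sae_learning_rate(20_000), sae_learning_rate(45_000))
print(should_stop([0, 1, 2, 0], [0, 1, 2, 0]))
```

The KL-divergence phase uses a constant learning rate of 0.01, so only the stopping check would apply there.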

Their improvements

new SOTA
k-means and GMM are fast, but distance metrics are limited to the original data space
variant of k-means which iteratively projects data into lower dimensional space is limited to linear embeddings
spectral clustering uses lots of memory; this paper's method is parametric (linear in number of data points)
