Caffe Notes

Caffe is a deep learning framework developed by Yangqing Jia at BVLC (the Berkeley Vision and Learning Center).

Here are some notes from my exploration of Caffe, organized as questions.

Since Caffe does not have systematically organized documentation, it is hard to organize the following questions well. Maybe later.

 

Question 1: Does Caffe support multi-dimensional labels?

Someone has added code that extends the current layers to support multi-dimensional labels.

See issue 845, which extends the layers to produce a multi-channel pixel-wise output. The author used that extension to implement a semantic scene understanding task.

From the issue: "The ground-truth labelled image has 23 classes; therefore, I set the number of filters in the convolution layer to only 23. The input is a depth image of size 320x240 and the output of the network is a 23-channel 320x240 image. The softmax loss layer then picks the appropriate label to compare to the GT and computes the loss function, which is summed not just over batch images but also over the number of pixels in the image."

The proto file and C++ code are included in the issue.

However, Jim Xinbo and I have come up with a workaround: simply introduce another input source that acts as the label. A Euclidean loss is then taken between the predicted results and the newly introduced label, and gradient descent propagates downwards as usual (see the sketch below).
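
A minimal sketch of this workaround in pycaffe's NetSpec, assuming an HDF5 data source and hypothetical layer and file names (the actual nets were written directly as prototxt):

```python
from caffe import layers as L, NetSpec

n = NetSpec()
# HDF5Data can emit arbitrarily shaped blobs, so the second top can be a full label map
n.data, n.label = L.HDF5Data(source='train_h5_list.txt', batch_size=16, ntop=2)
n.conv1 = L.Convolution(n.data, num_output=23, kernel_size=3, pad=1)
n.pred = L.Convolution(n.conv1, num_output=23, kernel_size=3, pad=1)
# Euclidean loss between the multi-channel prediction and the introduced label blob
n.loss = L.EuclideanLoss(n.pred, n.label)

with open('train_val.prototxt', 'w') as f:
    f.write(str(n.to_proto()))
```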

Question 2: How to resume training?

See Imagenet Tutorial.

Use `resume_training.sh` together with the binary snapshot file `iter_xxxx.solverstate`.
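
A minimal pycaffe sketch of resuming, with hypothetical file names (the shell script does the equivalent via the `caffe train --snapshot=...` command):

```python
import caffe

# Re-create the solver from the same solver definition used for the original run
solver = caffe.SGDSolver('solver.prototxt')
# Reload the network weights and the solver state (iteration count, momentum history, ...)
solver.restore('caffenet_train_iter_10000.solverstate')
# Continue training from the restored iteration
solver.solve()
```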

Question 3: The training and testing phase

See Imagenet Tutorial.

The phase is used to define similar nets for training and testing: the major part of the net stays the same, while the input and output layers differ slightly.

Phase-variant layers are declared with `include { phase: TEST }` (or `include { phase: TRAIN }`).
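
A small pycaffe sketch, assuming a hypothetical `train_val.prototxt`, showing how the phase chosen at load time selects the phase-variant layers:

```python
import caffe

# Layers guarded by include { phase: TRAIN } are kept; TEST-only layers are dropped
train_net = caffe.Net('train_val.prototxt', caffe.TRAIN)
# And vice versa for the test-time variant of the same definition
test_net = caffe.Net('train_val.prototxt', caffe.TEST)
```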

Question 4: How is the Euclidean loss layer computed?

See the cpp file `euclidean_loss_layer.cpp`

The `EuclideanLossLayer` computes a (halved) mean squared error rather than the L2 norm: the sum of squared differences over all elements, divided by 2N, where N is the batch size.
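
A small numpy sketch of that computation (my reading of `euclidean_loss_layer.cpp`, not the actual Caffe code):

```python
import numpy as np

def euclidean_loss(pred, label):
    """pred, label: arrays of shape (N, C, H, W); N is the batch size."""
    diff = pred - label
    n = pred.shape[0]
    # Sum of squared differences over every element, divided by 2 * batch size
    return np.sum(diff ** 2) / (2.0 * n)
```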

Daily Research Log 2014-7-24

Collect recent papers on `super resolution` for video.

Why is there no "resolution" keyword in ICCV2011? Was the topic not prominent yet?

  • High Resolution 3D Shape Texture from Multiple Videos, Vagia Tsiminaki, Jean-Sebastien Franco, CVPR2014
    • Restore texture from video
  • Super-resolution via Transform-invariant Group-sparse Regularization, Carlos Fernandez-Granda, Emmanuel J. Candes, CVPR2014

Others

  • Nonparametric Blind Super-Resolution, Michal Irani, ICCV2013
    • Finds the blur kernel in the super-resolution setting
  • Megastereo: Constructing High-Resolution Stereo Panoramas, Christian Richardt, Yael Pritch, Henning Zimmer, CVPR2013
    • Uses consecutive frames, of course
    • But more related to reconstruction, flow, structure, and rigid scenes than to general video
  • Fast Direct Super-Resolution by Simple Functions, Chih-Yuan Yang and Ming-Hsuan Yang, ICCV2013
    • Decomposing the image patches into "features"

Daily Research Log 2014-7-17

Learning a Deep Convolutional Network for Image Super-Resolution, Chao Dong, Xiaoou Tang

  • Overview
    • Learns an end-to-end LR-to-HR mapping via deep learning, which differs fundamentally from traditional methods
      • Does not learn a dictionary or manifold explicitly
      • But learns it implicitly
      • Pipeline is fully learned, without hard-coded pre/post-processing
    • SRCNN's appealing properties
      • Intentionally designed with simplicity in mind
      • Moderate numbers of filters and layers make the method fast and practical for on-line use
      • Huge potential for better performance when a larger dataset and a larger model become available
    • Contributions
      • CNN for SR, end-to-end mapping between LR and HR
      • Relate SRCNN and sparse-coding-based SR methods, which guides the construction of the network
      • Demonstrates that deep learning is useful in SR
  • Model
    • Notations
      • Y: LR image, X: HR image, F: X = F(Y)
    • F incorporates three operations
      • Patch extraction and representation
        • Extracts overlapping patches from the LR image and represents each as a vector; these vectors comprise a set of feature maps
      • Non-linear mapping
        • Map the features into another high-dimensional vector
      • Reconstruction
    • Patch extraction
      • Traditionally: pre-trained bases like PCA, DCT, Haar
      • Ours: optimize the filters in the framework
      • W1 is c*f1*f1*n1 dimensional
        • c: number of channels, f1: kernel size, n1: number of filters
      • B1: biases
      • Perform convolution of W1 and LR image
    • Non-linear mapping
      • Nothing to explain
      • W2 is of a size n1*1*1*n2
    • Reconstruction
      • Takes the mean of the overlapping output patches
      • This 'mean' can be considered as a convolution kernel
      • W3 is n2*f3*f3*c dimensional
  • Experiment
    • Loss function: MSE
    • Parameters
      • f1 = 9, f3 = 5, n1 = 64, n2 = 32 (see the shape sketch after this list)
    • Implementation: uses `cuda-convnet`
    • Channels
      • Only the luminance channel in YCbCr color space is considered; the chrominance channels are upsampled with bicubic interpolation
    • Evaluation
      • No padding in the convolutional layers
      • Compute and evaluate only the inner area, with a 20*20 crop
      • Evaluate the PSNR value
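
A hedged numpy sketch of the three-layer structure described above, using the f1 = 9, f3 = 5, n1 = 64, n2 = 32 settings from the notes (with f2 = 1 for the 1x1 mapping layer); it only illustrates the filter shapes and the forward pass, not the trained model:

```python
import numpy as np
from scipy.signal import correlate

def conv_layer(x, w, b):
    """x: (c_in, H, W); w: (c_out, c_in, k, k); 'valid' cross-correlation, as in a conv layer."""
    return np.stack([
        sum(correlate(x[ci], w[co, ci], mode='valid') for ci in range(x.shape[0])) + b[co]
        for co in range(w.shape[0])
    ])

def srcnn_forward(y, W1, b1, W2, b2, W3, b3):
    """y: bicubic-upsampled LR luminance channel, shape (1, H, W)."""
    feat1 = np.maximum(conv_layer(y, W1, b1), 0)      # patch extraction, c*f1*f1*n1 filters + ReLU
    feat2 = np.maximum(conv_layer(feat1, W2, b2), 0)  # non-linear mapping, n1*1*1*n2 filters + ReLU
    return conv_layer(feat2, W3, b3)                  # reconstruction, n2*f3*f3*c filters

# Shapes consistent with the notes (single-channel luminance input, c = 1)
c, f1, n1, f2, n2, f3 = 1, 9, 64, 1, 32, 5
W1, b1 = np.random.randn(n1, c, f1, f1) * 1e-3, np.zeros(n1)
W2, b2 = np.random.randn(n2, n1, f2, f2) * 1e-3, np.zeros(n2)
W3, b3 = np.random.randn(c, n2, f3, f3) * 1e-3, np.zeros(c)
hr = srcnn_forward(np.random.rand(1, 33, 33), W1, b1, W2, b2, W3, b3)  # -> shape (1, 21, 21)
```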

Daily Research Log 2014-7-11

Restricted Boltzmann Machine Approach to Couple Dictionary Training for Image Super-Resolution, ICIP2013, Junbin Gao

  • Review of SR via sparse representation
    • Assumption: HR and LR patches share the same sparse representation
    • Sparse prior
    • Learn the coupled dictionaries D_h and D_l
  • Our model
    • Use an RBM to help train the coupled dictionaries
    • Min.
    • Given input y
      • Generate x0 by naïve methods
      • Use [x0 y0] as the input, train the RBM, and obtain the dictionary coefficients

Daily Research Log 2014-7-10

Space-Time Super-Resolution from a Single Video

  • Assumption
    • Small space-time patches recur many times within a single video
    • This recurrence is explored statistically
  • Introduction
    • Problem Investigation
      • Spatial resolution
        • CCD density and point spread function
      • Temporal resolution
        • Frame-rate and exposure time
  • Method
    • Find similar space-time patches at a coarser temporal scale
    • Use the corresponding finer-temporal-scale patches to fill in the super-resolved version of the current patch

Super-resolution & deep learning

Thoughts

  • Pros on using DL in video SR
    • Huge dataset, good for training
  • Cons on using DL in video SR
    • When converting a patch from LR to HR, the number of pixels changes, which results in a different number of neurons
  • Intrinsic problem
    • Information increasing?
      • Increasing resolution is itself an information-increasing problem, which is theoretically unsolvable
      • However, the example pool provides extra information
      • The core problem is to find the 'subspace' in which the current video sequence lies, and of which the LR image is a projection
  • Problems of current methods
    • I assume that current methods are based on exemplar approaches, where raw image patches are used
    • We could use features instead, or decompose the existing exemplars and reconstruct the high-resolution patch from them
  • Possible methods
    • Method I, RF decomposition
      • Decompose patch into resolution and feature dimensions
      • Features are resolution-invariant
      • Each patch is a linear combination of certain feature bases at a certain resolution
      • For each feature basis, abundant multi-scale exemplars are available in the database, compared with raw exemplars
    • Method II
      • Directly train a low-to-high NN
        • The difference between scale and resolution
        • Multi-resolution auto-encoder
          • Pure hack

Notes: Learning to Detect A Salient Object, Liu Tie, Xiaoou Tang

Learning to Detect A Salient Object, PAMI 2011, Xiaoou Tang

  • First quantitative evaluation dataset for visual attention algorithms
  • Most existing saliency algorithms are based on bottom-up computational framework
    • Steps
      • Feature extraction
      • Saliency computation
        • Center-surround operation, self-information, graph-based random walk
      • Find fixations, or sparse points by winner-take-all or inhibition-of-return
    • Problem
      • Finding fixations rather than where visual attention should be
      • Focus on low-level features rather than real attention
  • Our model
    • Incorporate top-down information about salient object
      • User label is considered to be top-down information
    • Local, regional, global features to define generic salient object
    • Use Conditional Random Field (CRF) learning
  • Our dataset
    • 20,000+ well labeled images
    • What is salient object? Multiple user labeling
    • Selected 60,000 out of 130,000 images containing a salient object or a distinctive foreground object,
      then further selected 20,000 images for labeling

Method Description

  • Our model
    • CRF model
    • Features
      • E = E_salient_object + S(x, x'), salient object feature & pairwise term for adjacent pixels
      • Salient object feature
        • For every pixel,
          • Salient object feature

            This is designed for finding the labeling A = {a} that minimizes the sum of the feature responses F.
            If a pixel is hypothesized to be salient, the feature response is negative; if not, the feature response is positive.
          • Pairwise term: the appearance difference multiplied by the label difference. Penalizes adjacent pixels with similar color that are labeled differently. Naïve stuff.
    • CRF learning
      • Learn a weight lambda for each feature under the maximum-likelihood criterion, in the following sense
      • Objective function being
    • Salient object features
      • Multi-scale contrast
        • Naïve contrast (center vs. neighbor) computed in Gaussian pyramids
          6-level pyramid, 9x9 window
          Highlights boundaries (see the sketch after this section)
      • Center-surround histogram
        • A salient object usually has a larger extent of contrast
          For each pixel, enumerate hypothesis rectangles R and surrounding rectangles R_S with different aspect ratios and sizes
          Test the chi-square distance between the RGB histograms of R and R_S

      • Color spatial distribution
        • Distinctiveness of color
          GMM color model
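
A hedged numpy/OpenCV sketch of the multi-scale contrast feature as I read it (6-level Gaussian pyramid, 9x9 window); this is my own toy reconstruction, not the authors' implementation:

```python
import numpy as np
import cv2

def multiscale_contrast(img_gray, levels=6, window=9):
    """img_gray: float grayscale image of shape (H, W); returns a contrast map scaled to [0, 1]."""
    h, w = img_gray.shape
    contrast = np.zeros((h, w), dtype=np.float64)
    pyramid = [img_gray.astype(np.float64)]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    r = window // 2
    for level in pyramid:
        lh, lw = level.shape
        local = np.zeros_like(level)
        padded = cv2.copyMakeBorder(level, r, r, r, r, cv2.BORDER_REFLECT)
        # squared difference between each pixel and every neighbour in the window
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                shifted = padded[r + dy:r + dy + lh, r + dx:r + dx + lw]
                local += (level - shifted) ** 2
        # accumulate over pyramid levels, upsampled back to the original size
        contrast += cv2.resize(local, (w, h))
    return contrast / (contrast.max() + 1e-12)
```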

Implementation

Daily Research Log 2014-7-6

Learning to Detect A Salient Object, Xiaoou Tang (see the consolidated CRF notes under Method Description above)

Daily Research Log 2014-7-5

The Secrets of Salient Object Segmentation, CVPR2014, Xiaodi Hou, Alan

  • Fixation v.s. salient object
    • People have long been neglecting the intrinsic difference between gaze data and salient objects, while evaluating the two on the same datasets.
  • Dataset
    • Our dataset
      • Segment, fixation, salient object
    • Analysis
      • Ground-truth consistency
        • Measure the F-score by comparing thresholded binary maps of the user-labeled salient objects
      • Bias
        • Design bias
          • FT dataset, choose images with predominant salient objects
        • Center bias
  • Fixation → salient object
    • Fixation-based representations are disadvantaged in a salient object segmentation task
      • [1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. CVPR2009
      • [4] A. Borji, D. N. Sihite, and L. Itti. Salient object detection: A benchmark. In ECCV 2012
    • Traditionally two steps
      • Design a suitable representation for salient object segmentation
      • Saliency principles
    • Our Model
      • Overview: segment, then rank by fixations
      • Step I
        • Object proposal by CPMC
      • Step II
        • Salient segment ranking
          • Density of fixation
          • Non-uniform spatial distribution of fixations, e.g. fixation at the center of a segment increases the probability
        • Learn a scoring function for each object candidate, w.r.t. its candidate mask and the fixation distribution map (see the sketch after this list)
          • Features: 33
            • Shape features
              • Major axis length, eccentricity, minor axis length, Euler number
            • Fixation distribution features
              • Align by major axis, extract 4*3 histogram
            • No appearance feature
      • Random forest
  • Model Validation
    • Test the upper-bound of the selector
      • Run algorithm on ground-truth segments using PASCAL-S
    • Test the performance of the CPMC segmentation algorithm
      • Use the first 200 segments to compare with the PASCAL-S segmentation ground truth
  • Questions & Answers
    • Q: Is fixation predicting algorithm good enough?
    • A: According to the authors,
      • "In addition to the fixation prediction results, we also tested the F-measure of the ground-truth human fixations on IS and PASCAL-S."
      • "When we remove the effect of center bias and dataset design bias, the performance of fixation algorithms becomes very competitive. We also notice that both fixation and salient object algorithms are on a par with human fixation data in F-measure."
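
A hedged sketch of the salient-segment ranking step with scikit-learn, using toy stand-in features (the paper's descriptor has 33 shape and fixation-distribution features; the helper below is my own simplification with hypothetical names):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def segment_features(mask, fixation_map, grid=(4, 3)):
    """mask: non-empty binary (H, W) candidate; fixation_map: non-negative (H, W) fixation density."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    shape_feats = [mask.sum(), (y1 - y0) / (x1 - x0)]  # toy shape features: area, rough aspect ratio
    box = (fixation_map * mask)[y0:y1, x0:x1]
    gh, gw = grid
    # coarse 4x3 spatial histogram of fixations inside the candidate's bounding box
    hist = [box[i * box.shape[0] // gh:(i + 1) * box.shape[0] // gh,
                j * box.shape[1] // gw:(j + 1) * box.shape[1] // gw].sum()
            for i in range(gh) for j in range(gw)]
    return np.array(shape_feats + hist)

# Training: X = features of candidate segments, y = overlap with the ground-truth object
# rf = RandomForestRegressor(n_estimators=100).fit(X, y)
# Ranking at test time: scores = rf.predict(candidate_features)
```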

  • 3rd party
    • CPMC
      • Overview: generates an over-complete pool of potential object segments and assesses each with an objectness score
      • Initializes foreground seeds uniformly and computes min-cuts