Welcome to ShortScience.org! 
[link]
# Object detection system overview. https://i.imgur.com/vd2YUy3.png 1. takes an input image, 2. extracts around 2000 bottomup region proposals, 3. computes features for each proposal using a large convolutional neural network (CNN), and then 4. classifies each region using classspecific linear SVMs. * RCNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. * On the 200class ILSVRC2013 detection dataset, RCNN’s mAP is 31.4%, a large improvement over OverFeat , which had the previous best result at 24.3%. ## There is a 2 challenges faced in object detection 1. localization problem 2. labeling the data 1 localization problem : * One approach frames localization as a regression problem. they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method. * An alternative is to build a slidingwindow detector. considered adopting a slidingwindow approach increases the number of convolutional layers to 5, have very large receptive fields (195 x 195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the slidingwindow paradigm. 2 labeling the data: * The conventional solution to this problem is to use unsupervised pretraining, followed by supervise finetuning * supervised pretraining on a large auxiliary dataset (ILSVRC), followed by domain specific finetuning on a small dataset (PASCAL), * finetuning for detection improves mAP performance by 8 percentage points. * Stochastic gradient descent via back propagation was used to effective for training convolutional neural networks (CNNs) ## Object detection with RCNN This system consists of three modules * The first generates categoryindependent region proposals. These proposals define the set of candidate detections available to our detector. * The second module is a large convolutional neural network that extracts a fixedlength feature vector from each region. * The third module is a set of class specific linear SVMs. Module design 1 Region proposals * which detect mitotic cells by applying a CNN to regularlyspaced square crops. * use selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute). * the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) 2 Feature extraction. * extract a 4096dimensional feature vector from each region proposal using the Caffe implementation of the CNN * Features are computed by forward propagating a meansubtracted 227x227 RGB image through five convolutional layers and two fully connected layers. * warp all pixels in a tight bounding box around it to the required size * The feature matrix is typically 2000x4096 3 Test time detection * At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). * warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. * Given all scored regions in an image, we apply a greedy nonmaximum suppression (for each class independently) that rejects a region if it has an intersectionover union (IoU) overlap with a higher scoring selected region larger than a learned threshold. ## Training 1 Supervised pretraining: * pretrained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using imagelevel annotations only (bounding box labels are not available for this data) 2 Domainspecific finetuning. * use the stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals with learning rate of 0.001. 3 Object category classifiers. * use intersectionover union (IoU) overlap threshold method to label a region with The overlap threshold of 0.3. * Once features are extracted and training labels are applied, we optimize one linear SVM per class. * adopt the standard hard negative mining method to fit large training data in memory. ### Results on PASCAL VOC 201012 1 VOC 2010 * compared against four strong baselines including SegDPM, DPM, UVA, Regionlets. * Achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster https://i.imgur.com/0dGX9b7.png 2 ILSVRC2013 detection. * ran RCNN on the 200class ILSVRC2013 detection dataset * RCNN achieves a mAP of 31.4% https://i.imgur.com/GFbULx3.png #### Performance layerbylayer, without finetuning 1 pool5 layer * which is the max pooled output of the network’s fifth and final convolutional layer. *The pool5 feature map is 6 x6 x 256 = 9216 dimensional * each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input 2 Layer fc6 * fully connected to pool5 * it multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216dimensional vector) and then adds a vector of biases 3 Layer fc7 * It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying halfwave rectification #### Performance layerbylayer, with finetuning * CNN’s parameters finetuned on PASCAL. * finetuning increases mAP by 8.0 % points to 54.2% ### Network architectures * 16layer deep network, consisting of 13 layers of 3 _ 3 convolution kernels, with five max pooling layers interspersed, and topped with three fullyconnected layers. We refer to this network as “ONet” for OxfordNet and the baseline as “TNet” for TorontoNet. * RCNN with ONet substantially outperforms RCNN with TNet, increasing mAP from 58.5% to 66.0% * drawback in terms of compute time, with in terms of compute time, with than TNet. 1 The ILSVRC2013 detection dataset * dataset is split into three sets: train (395,918), val (20,121), and test (40,152) #### CNN features for segmentation. * full RCNN: The first strategy (full) ignores the re region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap. * fg RCNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. * full+fg RCNN: The third strategy (full+fg) simply concatenates the full and fg features https://i.imgur.com/n1bhmKo.png
1 Comments

[link]
## Introduction * [Link to Paper](http://arxiv.org/pdf/1412.6071v4.pdf) * Spatial pooling layers are building blocks for Convolutional Neural Networks (CNNs). * Input to pooling operation is a $N_{in}$ x $N_{in}$ matrix and output is a smaller matrix $N_{out}$ x $N_{out}$. * Pooling operation divides $N_{in}$ x $N_{in}$ square into $N^2_{out}$ pooling regions $P_{i, j}$. * $P_{i, j}$ ⊂ $\{1, 2, . . . , N_{in}\}$ $\forall$ $(i, j) \in \{1, . . . , N_{out} \}^2$ ## MP2 * Refers to 2x2 maxpooling layer. * Popular choice for maxpooling operation. ### Advantages of MP2 * Fast. * Quickly reduces the size of the hidden layer. * Encodes a degree of invariance with respect to translations and elastic distortions. ### Issues with MP2 * Disjoint nature of pooling regions. * Since size decreases rapidly, stacks of backtoback CNNs are needed to build deep networks. ## FMP * Reduces the spatial size of the image by a factor of *α*, where *α ∈ (1, 2)*. * Introduces randomness in terms of choice of pooling region. * Pooling regions can be chosen in a *random* or *pseudorandom* manner. * Pooling regions can be *disjoint* or *overlapping*. ## Generating Pooling Regions * Let $a_i$ and $b_i$ be 2 increasing sequences of integers, starting at 1 and ending at $N_{in}$. * Increments are either 1 or 2. * For *disjoint regions, $P = [a_{i−1}, a_{i − 1}] × [b_{j−1}, b_{j − 1}]$ * For *overlapping regions, $P = [a_{i−1}, a_i] × [b_{j−1}, b_j 1]$ * Pooling regions can be generated *randomly* by choosing the increment randomly at each step. * To generate pooling regions in a *peusdorandom* manner, choose $a_i$ = ceil($\alpha  (i+u))$, where $\alpha \in (1, 2)$ with some $u \in (0, 1)$. * Each FMP layer uses a different pair of sequence. * An FMP network can be thought of as an ensemble of similar networks, with each different poolingregion configuration defining a different member of the ensemble. ## Observations * *Random* FMP is good on its own but may underfit when combined with dropout or training data augmentation. * *Pseudorandom* approach generates more stable pooling regions. * *Overlapping* FMP performs better than *disjoint* FMP. ## Weakness * No justification is provided for the observations mentioned above. * It needs to be seen how performance is affected if the pooling layer in architectures like GoogLeNet. 
[link]
This paper explores the use of socalled Monte Carlo objectives for training directed generative models with latent variables. Monte Carlo objectives take the form of the logarithm of a Monte Carlo estimate (i.e. an average over samples) of the marginal probability $P(x)$. One important motivation for using Monte Carlo objectives is that they can be shown (see the Importance Weighted Variational Autoencoder paper \cite{journals/corr/BurdaGS15} and my notes on it) to correspond to bounds on the true likelihood of the model, and one can tighten the bound simply by drawing more samples in the Monte Carlo objective. Currently, the most successful application of Monte Carlo objectives is based on an importance sampling estimate, which involves training a proposal distribution $Q(hx)$ in addition to the model $P(x,h)$. This paper considers the problem of training with gradient descent on such objectives, in the context of a model to which the reparametrization trick cannot be used (e.g. for discrete latent variables). They analyze the sources of variance in the estimation of the gradients (see Equation 5) and propose a very simple approach to reducing the variance of a samplingbased estimator of these gradients. First, they argue that gradients with respect to the $P(x,h)$ parameters are less susceptible to problems due to high variance gradients. Second, and most importantly, they derive a multisample estimate of the gradient that is meant to reduce the variance of gradients on the proposal distribution parameters $Q(hx)$. The end result is the gradient estimate of Equations 1011. It is based on the observation that the first term of the gradient of Equation 5 doesn't distinguish between the contribution of each sampled latent hi. The key contribution is this: they notice that one can incorporate a variance reducing baseline for each sample hi, corresponding to the Monte Carlo estimate of the loglikelihood when removing hi from the estimate (see Equation 10). The authors show that this is a proper baseline, in that using it doesn't introduce a bias in the estimation for the gradients. Experiments show that this approach yields better performance than training based on Reweighted Wake Sleep \cite{journals/corr/BornscheinB14} or the use of NVIL baselines \cite{conf/icml/MnihG14}, when training sigmoid belief networks as generative models or as structured output prediction (image completion) models on binarized MNIST. 
[link]
* They use an implementation of Qlearning (i.e. reinforcement learning) with CNNs to automatically play Atari games. * The algorithm receives the raw pixels as its input and has to choose buttons to press as its output. No handengineered features are used. So the model "sees" the game and "uses" the controller, just like a human player would. * The model achieves good results on various games, beating all previous techniques and sometimes even surpassing human players. ### How * Deep Q Learning * *This is yet another explanation of deep Q learning, see also [this blog post](http://www.nervanasys.com/demystifyingdeepreinforcementlearning/) for longer explanation.* * While playing, sequences of the form (`state1`, `action1`, `reward`, `state2`) are generated. * `state1` is the current game state. The agent only sees the pixels of that state. (Example: Screen shows enemy.) * `action1` is an action that the agent chooses. (Example: Shoot!) * `reward` is the direct reward received for picking `action1` in `state1`. (Example: +1 for a kill.) * `state2` is the next game state, after the action was chosen in `state1`. (Example: Screen shows dead enemy.) * One can pick actions at random for some time to generate lots of such tuples. That leads to a replay memory. * Direct reward * After playing randomly for some time, one can train a model to predict the direct reward given a screen (we don't want to use the whole state, just the pixels) and an action, i.e. `Q(screen, action) > direct reward`. * That function would need a forward pass for each possible action that we could take. So for e.g. 8 buttons that would be 8 forward passes. To make things more efficient, we can let the model directly predict the direct reward for each available action, e.g. for 3 buttons `Q(screen) > (direct reward of action1, direct reward of action2, direct reward of action3)`. * We can then sample examples from our replay memory. The input per example is the screen. The output is the reward as a tuple. E.g. if we picked button 1 of 3 in one example and received a reward of +1 then our output/label for that example would be `(1, 0, 0)`. * We can then train the model by playing completely randomly for some time, then sample some batches and train using a mean squared error. Then play a bit less randomly, i.e. start to use the action which the network thinks would generate the highest reward. Then train again, and so on. * Indirect reward * Doing the previous steps, the model will learn to anticipate the *direct* reward correctly. However, we also want it to predict indirect rewards. Otherwise, the model e.g. would never learn to shoot rockets at enemies, because the reward from killing an enemy would come many frames later. * To learn the indirect reward, one simply adds the reward value of highest reward action according to `Q(state2)` to the direct reward. * I.e. if we have a tuple (`state1`, `action1`, `reward`, `state2`), we would not add (`state1`, `action1`, `reward`) to the replay memory, but instead (`state1`, `action1`, `reward + highestReward(Q(screen2))`). (Where `highestReward()` returns the reward of the action with the highest reward according to Q().) * By training to predict `reward + highestReward(Q(screen2))` the network learns to anticipate the direct reward *and* the indirect reward. It takes a leap of faith to accept that this will ever converge to a good solution, but it does. * We then add `gamma` to the equation: `reward + gamma*highestReward(Q(screen2))`. `gamma` may be set to 0.9. It is a discount factor that devalues future states, e.g. because the world is not deterministic and therefore we can't exactly predict what's going to happen. Note that Q will automatically learn to stack it, e.g. `state3` will be discounted to `gamma^2` at `state1`. * This paper * They use the mentioned Deep Q Learning to train their model Q. * They use a kth frame technique, i.e. they let the model decide upon an action at (here) every 4th frame. * Q is implemented via a neural net. It receives 84x84x4 grayscale pixels that show the game and projects that onto the rewards of 4 to 18 actions. * The input is HxWx4 because they actually feed the last 4 frames into the network, instead of just 1 frame. So the network knows more about what things are moving how. * The network architecture is: * 84x84x4 (input) * 16 convs, 8x8, stride 4, ReLU * 32 convs, 4x4, stride 2, ReLU * 256 fully connected neurons, ReLU * <N_actions> fully connected neurons, linear * They use a replay memory of 1 million frames. ### Results * They ran experiments on the Atari games Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest and Space Invaders. * Same architecture and hyperparameters for all games. * Rewards were based on score changes in the games, i.e. they used +1 (score increases) and 1 (score decreased). * Optimizer: RMSProp, Batch Size: 32. * Trained for 10 million examples/frames per game. * They had no problems with instability and their average Q value per game increased smoothly. * Their method beats all other state of the art methods. * They managed to beat a human player in games that required not so much "long" term strategies (the less frames the better). * Video: starts at 46:05. https://youtu.be/dV80NAlEins?t=46m05s ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Playing_Atari_with_Deep_Reinforcement_Learning__algorithm.png?raw=true "Algorithm") *The original full algorithm, as shown in the paper.*  ### Rough chapterwise notes * (1) Introduction * Problems when using neural nets in reinforcement learning (RL): * Reward signal is often sparse, noise and delayed. * Often assumption that data samples are independent, while they are correlated in RL. * Data distribution can change when the algorithm learns new behaviours. * They use Qlearning with a CNN and stochastic gradient descent. * They use an experience replay mechanism (i.e. memory) from which they can sample previous transitions (for training). * They apply their method to Atari 2600 games in the Arcade Learning Environment (ALE). * They use only the visible pixels as input to the network, i.e. no manual feature extraction. * (2) Background * blablabla, standard deep q learning explanation * (3) Related Work * TDBackgammon: "Solved" backgammon. Worked similarly to Qlearning and used a multilayer perceptron. * Attempts to copy TDBackgammon to other games failed. * Research was focused on linear function approximators as there were problems with nonlinear ones diverging. * Recently again interest in using neural nets for reinforcement learning. Some attempts to fix divergence problems with gradient temporaldifference methods. * NFQ is a very similar method (to the one in this paper), but worked on the whole batch instead of minibatches, making it slow. It also first applied dimensionality reduction via autoencoders on the images instead of training on them endtoend. * HyperNEAT was applied to Atari games and evolved a neural net for each game. The networks learned to exploit design flaws. * (4) Deep Reinforcement Learning * They want to connect a reinforcement learning algorithm with a deep neural network, e.g. to get rid of handcrafted features. * The network is supposes to run on the raw RGB images. * They use experience replay, i.e. store tuples of (pixels, chosen action, received reward) in a memory and use that during training. * They use Qlearning. * They use an epsilongreedy policy. * Advantages from using experience replay instead of learning "live" during game playing: * Experiences can be reused many times (more efficient). * Samples are less correlated. * Learned parameters from one batch don't determine as much the distributions of the examples in the next batch. * They save the last N experiences and sample uniformly from them during training. * (4.1) Preprocessing and Model Architecture * Raw Atari images are 210x160 pixels with 128 possible colors. * They downsample them to 110x84 pixels and then crop the 84x84 playing area out of them. * They also convert the images to grayscale. * They use the last 4 frames as input and stack them. * So their network input has shape 84x84x4. * They use one output neuron per possible action. So they can compute the Qvalue (expected reward) of each action with one forward pass. * Architecture: 84x84x4 (input) => 16 8x8 convs, stride 4, ReLU => 32 4x4 convs stride 2 ReLU => fc 256, ReLU => fc N actions, linear * 4 to 18 actions/outputs (depends on the game). * Aside from the outputs, the architecture is the same for all games. * (5) Experiments * Games that they played: Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest, Space Invaders * They use the same architecture und hyperparameters for all games. * They give a reward of +1 whenever the ingame score increases and 1 whenever it decreases. * They use RMSProp. * Mini batch size was 32. * They train for 10 million frames/examples. * They initialize epsilon (in their epsilon greedy strategy) to 1.0 and decrease it linearly to 0.1 at one million frames. * They let the agent decide upon an action at every 4th ingame frame (3rd in space invaders). * (5.1) Training and stability * They plot the average reward und Qvalue per N games to evaluate the agent's training progress, * The average reward increases in a noisy way. * The average Q value increases smoothly. * They did not experience any divergence issues during their training. * (5.2) Visualizating the Value Function * The agent learns to predict the value function accurately, even for rather long sequences (here: ~25 frames). * (5.3) Main Evaluation * They compare to three other methods that use handengineered features and/or use the pixel data combined with significant prior knownledge. * They mostly outperform the other methods. * They managed to beat a human player in three games. The ones where the human won seemed to require strategies that stretched over longer time frames. 
[link]
### Main Idea: It basically tunes the hyperparameters of the neural network architecture using reinforcement learning. The reward signal is taken as evaluation on the validation set. The method is policy gradient as the cost function is nondifferentiable. ### Method: #### i. Actions: 1. There is controller RNN which predicts some hyperparameters of the layer conditioned on the previous predictions. This prediction is just a onehot vector based on the previous hyperparameter chosen. At the start for the first prediction  this vector is just all zeros. 2. Once the network is generated completely, it is trained for a fixed number of epochs, the reward signal is calculated based on the evaluation on the validation set. #### ii. Training: 1. Nothing fancy in the reinforcement learning approach simple policy gradients. 2. Baseline is added to reduce the variance. 3. It takes 23 weeks to train it over 800 GPUs!! #### iii. Results: 1. Use it to generate CNNs and LSTM cells. Close to stateofart results with generated architectures. ### Possible new directions: 1. Use better techniques RL techniques like TRPO, PPO etc. 2. Right now, they generate fixed length architecture. Their reason is for variable length architectures, it is difficult to determine how much time each architecture is trained. Smaller networks are easier to train. Thus, somehow determine training time as function of the learning capacity of the network. Code: They haven't released the code yet. I tried to simulate it in torch.(https://github.com/abhigenie92/nn_search) 