Light-Head R-CNN: In Defense of Two-Stage Object Detector

Machine posted an article • 1 comment • 33 views • 2017-12-20 16:06 • from related topics

In this paper, we first investigate why typical two-stage methods are not as fast as single-stage, fast detectors like YOLO [26, 27] and SSD [22]. We find that Faster R-CNN [28] and R-FCN [17] perform an intensive computation after or before RoI warping. Faster R-CNN involves two fully connected layers for RoI recognition, while R-FCN produces large score maps. Thus, the speed of these networks is slow due to the heavy-head design in the architecture. Even if we significantly reduce the base model, the computation cost cannot be decreased accordingly. We propose a new two-stage detector, Light-Head R-CNN, to address this shortcoming in current two-stage approaches. In our design, we make the head of the network as light as possible, by using a thin feature map and a cheap R-CNN subnet (pooling and a single fully-connected layer). Our ResNet-101 based Light-Head R-CNN outperforms state-of-the-art object detectors on COCO while keeping time efficiency. More importantly, simply replacing the backbone with a tiny network (e.g., Xception), our Light-Head R-CNN gets 30.7 mmAP at 102 FPS on COCO, significantly outperforming single-stage, fast detectors like YOLO [26, 27] and SSD [22] on both speed and accuracy.
Code will be made publicly available.
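
To make the "light head" idea concrete, here is a minimal PyTorch-style sketch of such a head: a large separable convolution produces a thin feature map, RoI features are pooled from it, and a single fully-connected layer feeds the classification and box-regression outputs. The channel counts (256 mid channels, 490 = 10 x 7 x 7 thin channels, a 2048-d FC), the 15x1/1x15 kernel factorization, and the use of torchvision's roi_align in place of the paper's position-sensitive RoI warping are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a "light" detection head: a thin feature map from a large
# separable convolution, RoI feature pooling, and a single FC layer.
# Channel sizes and roi_align (instead of position-sensitive RoI warping)
# are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class LightHead(nn.Module):
    def __init__(self, in_channels=2048, thin_channels=490, mid_channels=256,
                 pool_size=7, num_classes=81):
        super().__init__()
        k = 15  # large kernel, factored into k x 1 and 1 x k convolutions
        self.sep_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, (k, 1), padding=(k // 2, 0)),
            nn.Conv2d(mid_channels, thin_channels, (1, k), padding=(0, k // 2)),
        )
        self.pool_size = pool_size
        # Single fully-connected layer shared by classification and regression.
        self.fc = nn.Linear(thin_channels * pool_size * pool_size, 2048)
        self.cls_score = nn.Linear(2048, num_classes)
        self.bbox_pred = nn.Linear(2048, num_classes * 4)

    def forward(self, feature_map, rois, spatial_scale=1.0 / 16):
        # feature_map: (N, C, H, W); rois: (K, 5) rows of (batch_idx, x1, y1, x2, y2)
        thin = self.sep_conv(feature_map)                        # thin feature map
        pooled = roi_align(thin, rois, (self.pool_size, self.pool_size),
                           spatial_scale=spatial_scale)          # (K, 490, 7, 7)
        x = torch.relu(self.fc(pooled.flatten(1)))               # the single FC layer
        return self.cls_score(x), self.bbox_pred(x)
```

Because the per-RoI computation is just one pooling and one FC layer, shrinking the backbone actually translates into faster end-to-end inference, which is the point the abstract makes about heavy-head designs.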

【Facebook 2017】Focal Loss for Dense Object Detection

Paper posted an article • 2 comments • 180 views • 2017-08-11 00:47 • from related topics

 
Focal Loss for Dense Object Detection
 
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár
 
paper: https://arxiv.org/pdf/1708.02002.pdf
 
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
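
The reshaped loss is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t): the modulating factor (1 - p_t)^gamma shrinks the loss of well-classified examples. Below is a short, self-contained sketch of this loss for per-anchor binary classification; gamma = 2 and alpha = 0.25 are the defaults reported in the paper, while the function name and the plain sum reduction are our own simplifications.

```python
# Sketch of the focal loss for per-anchor, per-class binary classification.
# gamma down-weights well-classified examples; alpha balances positives/negatives.
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits, targets: tensors of the same shape; targets are 0/1 labels."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()
```

In the paper, the summed loss is normalized by the number of anchors assigned to ground-truth boxes rather than by the total number of anchors, which keeps easy negatives from dominating the normalization as well.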
 


【ICLR 2015】Multiple Object Recognition with Visual Attention

Paper posted an article • 0 comments • 96 views • 2017-08-09 14:37 • from related topics

Keywords: attention, reinforcement learning, multiple object recognition
 
Multiple Object Recognition with Visual Attention
 
Jimmy Ba, Volodymyr Mnih, Koray Kavukcuoglu
 
DeepMind
 
paper: https://arxiv.org/abs/1412.7755
 
We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.
 
Keywords: attention, reinforcement learning, multiple object recognition
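
As a rough illustration of how class labels alone can train the attention policy, the sketch below computes a REINFORCE-style loss: a correct final prediction yields reward 1, and the reward (minus a learned baseline) weights the log-probabilities of the sampled glimpse locations. This is a simplified stand-in for the paper's training objective, and all function and variable names are hypothetical.

```python
# Simplified REINFORCE-style objective for a glimpse-location policy trained
# from class labels only: reward 1 for a correct prediction, 0 otherwise.
import torch
import torch.nn.functional as F


def attention_policy_loss(location_log_probs, logits, labels, baseline):
    """location_log_probs: (T, B) log-probs of the T sampled glimpse locations,
    logits: (B, C) final class scores, labels: (B,) targets,
    baseline: (B,) learned reward baseline (e.g. from a small value head)."""
    classification_loss = F.cross_entropy(logits, labels)
    reward = (logits.argmax(dim=1) == labels).float()          # 0/1 reward per example
    advantage = (reward - baseline).detach()                   # no gradient through the reward
    reinforce_loss = -(location_log_probs.sum(dim=0) * advantage).mean()
    baseline_loss = F.mse_loss(baseline, reward)
    return classification_loss + reinforce_loss + baseline_loss
```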
 

【CVPR 2017】RON: Reverse Connection with Objectness Prior Networks for Object Detection

Paper posted an article • 0 comments • 88 views • 2017-08-08 01:09 • from related topics

 
RON: Reverse Connection with Objectness Prior Networks for Object Detection
 
Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, Yurong Chen
 
code: https://github.com/taokong/RON
 
We present RON, an efficient and effective framework for generic object detection. Our motivation is to smartly associate the best of the region-based (e.g., Faster R-CNN) and region-free (e.g., SSD) methodologies. Under a fully convolutional architecture, RON mainly focuses on two fundamental problems: (a) multi-scale object localization and (b) negative sample mining. To address (a), we design the reverse connection, which enables the network to detect objects on multiple levels of the CNN. To deal with (b), we propose the objectness prior to significantly reduce the search space for objects. We optimize the reverse connection, objectness prior and object detector jointly by a multi-task loss function, so RON can directly predict final detection results from all locations of the various feature maps. Extensive experiments on the challenging PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO benchmarks demonstrate the competitive performance of RON. Specifically, with VGG-16 and a low-resolution 384×384 input size, the network gets 81.3% mAP on PASCAL VOC 2007 and 80.7% mAP on PASCAL VOC 2012. Its superiority increases when datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. Using 1.5 GB of GPU memory at test time, the network runs at 15 FPS, 3× faster than the Faster R-CNN counterpart.
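
A minimal sketch of one reverse-connection block, as we read it from the abstract: the deeper reverse-fusion map is upsampled with a transposed convolution and added element-wise to a 3x3 projection of the current backbone feature map. The channel count (512) and layer choices are illustrative assumptions rather than the released code.

```python
# Sketch of a RON-style reverse-connection block: upsample the deeper fusion
# map and add it to a 3x3 projection of the current backbone feature map.
# Channel counts and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ReverseConnection(nn.Module):
    def __init__(self, backbone_channels, fused_channels=512):
        super().__init__()
        self.lateral = nn.Conv2d(backbone_channels, fused_channels, 3, padding=1)
        self.upsample = nn.ConvTranspose2d(fused_channels, fused_channels,
                                           kernel_size=2, stride=2)

    def forward(self, backbone_feature, deeper_fusion_map):
        # backbone_feature: current level; deeper_fusion_map: fusion map from the
        # next (coarser) level, at half the spatial resolution of this level.
        return self.lateral(backbone_feature) + self.upsample(deeper_fusion_map)
```

Stacking such blocks from the deepest level downward yields the multi-level fusion maps on which, per the abstract, the objectness prior and the detection heads operate.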
 