Error when running train_quick.sh

Image Classification · chenfu asked a question • 1 follower • 0 replies • 39 views • 2017-10-28 23:24 • From related topics

【Facebook 2017】Focal Loss for Dense Object Detection

Object Detection · Paper published an article • 1 comment • 136 views • 2017-08-11 00:47 • From related topics

 
Focal Loss for Dense Object Detection
 
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár
 
paper: https://arxiv.org/pdf/1708.02002.pdf
 
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
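For readers who want the concrete form: the focal loss rescales the per-example cross entropy by a factor (1 - p_t)^γ, where p_t is the predicted probability of the true class. Below is a minimal PyTorch-style sketch of the binary case with the paper's default settings (γ = 2, α = 0.25); the function name and tensor shapes are our own illustrative choices, not code from an official release.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights the loss of well-classified examples.

    logits:  raw scores, shape (N,)
    targets: 0./1. labels (float), shape (N,)
    """
    # Per-example binary cross entropy, kept unreduced.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t: the model's probability for the ground-truth class.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma is ~0 for easy examples, ~1 for hard ones.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

Setting γ = 0 recovers ordinary (α-weighted) cross entropy, which is how the paper isolates the effect of the focusing term.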
 

 

【Video】Caffe2 Bay Area Workshop

Computer Vision · caffe published an article • 2 comments • 228 views • 2017-08-30 12:26 • From related topics

 
Video: Caffe2 Bay Area Workshop
 
 1. Caffe2 Overview (Session 1)
    Facebook - 贾扬清



 
 
2. High Performance Training with Caffe2 and FP16 (Session 2)
 
   NVIDIA - Pooya Davoodi
 



 
3. On-device Mobile ML Deployment (Session 3)
 
   Facebook - Bram Wasti
 



 
 

Is the Local Response Normalization from the AlexNet paper still in use today?

Image Classification · Paper replied to a question • 2 followers • 1 reply • 80 views • 2017-09-08 09:44 • From related topics

【CVPR 2017】Tutorial: Learning Deep Representations for Visual Recognition

Computer Vision · Paper published an article • 0 comments • 153 views • 2017-08-28 11:18 • From related topics

 
The tutorial focuses on the detailed structure of ResNet, which he co-invented, together with a review of a series of important models (including LeNet, AlexNet, and GoogLeNet). The lecture is accessible yet thorough, and key techniques such as Batch Normalization are also covered. It is well suited to readers who want a deeper understanding of ResNet.

Learning Deep Representations for Visual Recognition
Kaiming He (Facebook AI Research)
Slides: http://deeplearning.csail.mit.edu/cvpr2017_tutorial_kaiminghe.pdf
 
Deep Learning for Object Detection, 
Ross Girshick (Facebook AI Research)
Video:


【Classic Course】Stanford CS231n (2017) Lecture 1: Introduction to Convolutional Neural Networks for Visual Recognition

Computer Vision · Paper published an article • 0 comments • 116 views • 2017-08-15 11:46 • From related topics

 
CS231n: Convolutional Neural Networks for Visual Recognition, Spring 2017
 
Slides: http://cs231n.stanford.edu/slides/2017/
 
Video: (a VPN may be required)
Bilibili: http://www.bilibili.com/video/av13260183/#page=1
 
Lecture 1 | Introduction to Convolutional Neural Networks for Visual Recognition



 

【CVPR 2017】Multi-Context Attention for Human Pose Estimation

Pose Estimation · Paper published an article • 0 comments • 107 views • 2017-08-13 15:38 • From related topics

Keywords: context, attention, pose estimation
 
Multi-Context Attention for Human Pose Estimation
 
Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L. Yuille, Xiaogang Wang
 
Paper: https://arxiv.org/abs/1702.07432
 
In this paper, we propose to incorporate convolutional neural networks with a multi-context attention mechanism into an end-to-end framework for human pose estimation. We adopt stacked hourglass networks to generate attention maps from features at multiple resolutions with various semantics. The Conditional Random Field (CRF) is utilized to model the correlations among neighboring regions in the attention map. We further combine the holistic attention model, which focuses on the global consistency of the full human body, and the body part attention model, which focuses on the detailed description for different body parts. Hence our model has the ability to focus on different granularity from local salient regions to global semantic-consistent spaces. Additionally, we design novel Hourglass Residual Units (HRUs) to increase the receptive field of the network. These units are extensions of residual units with a side branch incorporating filters with larger receptive fields, hence features with various scales are learned and combined within the HRUs. The effectiveness of the proposed multi-context attention mechanism and the hourglass residual units is evaluated on two widely used human pose estimation benchmarks. Our approach outperforms all existing methods on both benchmarks over all the body parts.
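As a rough illustration of the Hourglass Residual Unit idea (a residual unit plus a side branch whose filters see a larger receptive field), here is a PyTorch sketch; the pooling-based side branch and channel sizes are our assumptions, so refer to the paper for the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HourglassResidualUnit(nn.Module):
    """Sketch of an HRU: identity + bottleneck residual branch + a side
    branch whose pooled convolution covers a larger receptive field.
    (Illustrative only; not the authors' exact module.)
    """
    def __init__(self, channels):
        super().__init__()
        # Ordinary bottleneck-style residual branch.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels, kernel_size=1),
        )
        # Side branch: pooling enlarges the effective receptive field.
        self.side = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Bring the pooled branch back to the input resolution before fusing.
        up = F.interpolate(self.side(x), size=x.shape[-2:], mode="nearest")
        return x + self.residual(x) + up
```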

【ICCV 2017】Interleaved Group Convolutions for Deep Neural Networks

Image Classification · Paper published an article • 0 comments • 164 views • 2017-08-11 11:27 • From related topics

Keywords: model compression, no loss in performance, interleaved group convolution
 
Interleaved Group Convolutions for Deep Neural Networks
 
Ting Zhang, Guo-Jun Qi, Bin Xiao, Jingdong Wang
 
ICCV 2017
 
paper: https://arxiv.org/abs/1707.02725
Chinese translation: http://www.sohu.com/a/161110049_465975
 
In this paper, we present a simple and modularized neural network architecture, named interleaved group convolutional neural networks (IGCNets). The main point lies in a novel building block, a pair of two successive interleaved group convolutions: primary group convolution and secondary group convolution. The two group convolutions are complementary: (i) the convolution on each partition in primary group convolution is a spatial convolution, while on each partition in secondary group convolution, the convolution is a point-wise convolution; (ii) the channels in the same secondary partition come from different primary partitions. We discuss one representative advantage: Wider than a regular convolution with the number of parameters and the computation complexity preserved. We also show that regular convolutions, group convolution with summation fusion, and the Xception block are special cases of interleaved group convolutions. Empirical results over standard benchmarks, CIFAR-10 , CIFAR-100 , SVHN and ImageNet demonstrate that our networks are more efficient in using parameters and computation complexity with similar or higher accuracy.
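To make the primary/secondary pairing concrete, here is a small PyTorch sketch of one interleaved group convolution block: a grouped 3x3 (spatial) convolution, a channel rearrangement so each secondary partition draws one channel from every primary partition, then a grouped 1x1 (point-wise) convolution. The partition sizes and the shuffle implementation are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class InterleavedGroupConv(nn.Module):
    """Sketch of one IGC block: grouped 3x3 (primary) -> channel
    interleaving -> grouped 1x1 (secondary).
    """
    def __init__(self, channels, primary_groups):
        super().__init__()
        assert channels % primary_groups == 0
        self.g1 = primary_groups              # primary partitions, spatial conv
        self.g2 = channels // primary_groups  # secondary partitions, point-wise conv
        self.primary = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=self.g1, bias=False)
        self.secondary = nn.Conv2d(channels, channels, kernel_size=1,
                                   groups=self.g2, bias=False)

    def interleave(self, x):
        # Rearrange channels so every secondary partition takes one channel
        # from each primary partition (a channel "shuffle").
        n, c, h, w = x.shape
        x = x.view(n, self.g1, self.g2, h, w).transpose(1, 2).contiguous()
        return x.view(n, c, h, w)

    def forward(self, x):
        x = self.primary(x)
        x = self.interleave(x)
        return self.secondary(x)
```

Because both convolutions are grouped, the block can be wider than a regular convolution at the same parameter and computation budget, which is the advantage the abstract highlights.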
 

【ICLR 2015】Multiple Object Recognition with Visual Attention

Object Detection · Paper published an article • 0 comments • 78 views • 2017-08-09 14:37 • From related topics

Keywords: attention, reinforcement learning, multiple object recognition
 
Multiple Object Recognition with Visual Attention
 
Jimmy Ba, Volodymyr Mnih, Koray Kavukcuoglu
 
DeepMind
 
paper: https://arxiv.org/abs/1412.7755
 
We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.
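Recurrent attention models of this kind typically read the image through a small multi-resolution "glimpse" centered on the attended location. The numpy sketch below shows only that cropping idea, with sizes of our own choosing; the paper's full glimpse network and its reinforcement-learning training loop are not reproduced here.

```python
import numpy as np

def glimpse(image, center, size=8, scales=3):
    """Crude multi-resolution glimpse around `center` (row, col) of a 2-D
    grayscale image: crop `scales` patches, each twice as wide as the
    previous, then subsample all of them back to `size` x `size`.
    Purely illustrative.
    """
    r, c = center
    patches = []
    for s in range(scales):
        half = (size * 2 ** s) // 2
        # Pad so crops near the border stay in bounds.
        padded = np.pad(image, half, mode="constant")
        patch = padded[r: r + 2 * half, c: c + 2 * half]
        patches.append(patch[:: 2 ** s, :: 2 ** s])  # back to size x size
    return np.stack(patches)  # shape: (scales, size, size)

# Toy usage: three concentric views of a 64x64 image around (20, 40).
img = np.zeros((64, 64), dtype=np.float32)
print(glimpse(img, center=(20, 40)).shape)  # (3, 8, 8)
```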
 
Keywords: attention, reinforcement learning, multiple object recognition
 

【CVPR 2017】RON: Reverse Connection with Objectness Prior Networks for Object Detection

Object Detection · Paper published an article • 0 comments • 74 views • 2017-08-08 01:09 • From related topics

 
RON: Reverse Connection with Objectness Prior Networks for Object Detection
 
Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, Yurong Chen
 
github: https://github.com/taokong/RON
 
We present RON, an efficient and effective framework for generic object detection. Our motivation is to smartly associate the best of the region-based (e.g., Faster R-CNN) and region-free (e.g., SSD) methodologies. Under fully convolutional architecture, RON mainly focuses on two fundamental problems: (a) multi-scale object localization and (b) negative sample mining. To address (a), we design the reverse connection, which enables the network to detect objects on multi-levels of CNNs. To deal with (b), we propose the objectness prior to significantly reduce the searching space of objects. We optimize the reverse connection, objectness prior and object detector jointly by a multi-task loss function, thus RON can directly predict final detection results from all locations of various feature maps. Extensive experiments on the challenging PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO benchmarks demonstrate the competitive performance of RON. Specifically, with VGG-16 and low resolution 384X384 input size, the network gets 81.3% mAP on PASCAL VOC 2007, 80.7% mAP on PASCAL VOC 2012 datasets. Its superiority increases when datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. With 1.5G GPU memory at test phase, the speed of the network is 15 FPS, 3X faster than the Faster R-CNN counterpart.
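One plausible way to picture the reverse connection is as a top-down fusion step: upsample the deeper map, transform the shallower backbone map, and sum them so every detection level carries stronger semantics. The PyTorch sketch below is an illustration under those assumptions, not the authors' exact module (see the repository linked above for that).

```python
import torch
import torch.nn as nn

class ReverseConnection(nn.Module):
    """Illustrative reverse-connection block: fuse a deeper (coarser)
    feature map back into a shallower backbone feature map.
    Channel sizes and layer choices here are assumptions.
    """
    def __init__(self, backbone_channels, deeper_channels, out_channels):
        super().__init__()
        # Upsample the deeper map by 2x with a learned deconvolution.
        self.up = nn.ConvTranspose2d(deeper_channels, out_channels,
                                     kernel_size=2, stride=2)
        # Project the backbone map to the same number of channels.
        self.lateral = nn.Conv2d(backbone_channels, out_channels,
                                 kernel_size=3, padding=1)

    def forward(self, backbone_feat, deeper_feat):
        # Assumes deeper_feat has half the spatial resolution of backbone_feat.
        # The element-wise sum is the fused map used for detection.
        return torch.relu(self.lateral(backbone_feat) + self.up(deeper_feat))
```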
 
 

【CVPR 2017】Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Action Recognition · Paper published an article • 0 comments • 94 views • 2017-08-07 16:47 • From related topics

 
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
 
arXiv: https://arxiv.org/abs/1705.07750
 
CVPR 2017
 
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics.
We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.7% on HMDB-51 and 98.0% on UCF-101.
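The inflation recipe itself is simple: each pretrained 2-D kernel is repeated along a new temporal axis and rescaled so that a temporally constant ("boring") video initially produces the same activations as the source image network. Below is a small sketch of that weight bootstrapping, assuming PyTorch's (out, in, kH, kW) layout and a helper name of our own.

```python
import torch

def inflate_conv_weight(weight_2d, time_dim):
    """Inflate a 2-D conv kernel (out, in, kH, kW) into a 3-D kernel
    (out, in, T, kH, kW) by repeating it T times along time and dividing
    by T, so constant-in-time inputs reproduce the 2-D network's response.
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return weight_3d / time_dim

# Example: inflate an ImageNet-pretrained 3x3 kernel to a 3x3x3 kernel.
w2d = torch.randn(64, 3, 3, 3)             # (out, in, kH, kW)
w3d = inflate_conv_weight(w2d, time_dim=3)
print(w3d.shape)                            # torch.Size([64, 3, 3, 3, 3])
```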
 

【CVPR 2017 BEST PAPER】Densely Connected Convolutional Networks

Image Classification · Paper published an article • 0 comments • 79 views • 2017-08-07 16:57 • From related topics

 
Densely Connected Convolutional Networks
 
Gao Huang*, Zhuang Liu*, Laurens van der Maaten and Kilian Weinberger
 
CVPR 2017 BEST PAPER
 
arXiv: https://arxiv.org/abs/1608.06993
code: https://github.com/liuzhuang13/DenseNet
 
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at this https URL .
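The dense connectivity pattern reduces to feature-map concatenation inside a block. Here is a minimal PyTorch sketch assuming bare conv-ReLU layers (the released DenseNet additionally uses batch normalization and 1x1 bottlenecks; see the code link above).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: every layer takes the concatenation of all
    previous feature maps as input and adds `growth_rate` new maps.
    (Simplified relative to the official DenseNet layers.)
    """
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each new layer sees every earlier feature map.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```

With `num_layers` layers of growth rate k, the block outputs `in_channels + num_layers * k` channels, which is why the full model inserts transition layers between blocks.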
 
 
 

【CVPR 2017】Multi-Context Attention for Human Pose Estimation

Action Recognition · Paper published an article • 0 comments • 62 views • 2017-08-11 00:58 • From related topics

 
Multi-Context Attention for Human Pose Estimation
 
Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L. Yuille, Xiaogang Wang
 
CVPR 2017
 
code: https://github.com/bearpaw/pose-attention
paper: https://arxiv.org/abs/1702.07432
 
In this paper, we propose to incorporate convolutional neural networks with a multi-context attention mechanism into an end-to-end framework for human pose estimation. We adopt stacked hourglass networks to generate attention maps from features at multiple resolutions with various semantics. The Conditional Random Field (CRF) is utilized to model the correlations among neighboring regions in the attention map. We further combine the holistic attention model, which focuses on the global consistency of the full human body, and the body part attention model, which focuses on the detailed description for different body parts. Hence our model has the ability to focus on different granularity from local salient regions to global semantic-consistent spaces. Additionally, we design novel Hourglass Residual Units (HRUs) to increase the receptive field of the network. These units are extensions of residual units with a side branch incorporating filters with larger receptive fields, hence features with various scales are learned and combined within the HRUs. The effectiveness of the proposed multi-context attention mechanism and the hourglass residual units is evaluated on two widely used human pose estimation benchmarks. Our approach outperforms all existing methods on both benchmarks over all the body parts.