Title | Monocular 3D Object Detection for Autonomous Driving |
---|---|
Author (affiliation) | Xiaozhi Chen |
Venue/Year | CVPR 2016, paper |
Keywords | same author as MV3D, KITTI, single camera, no depth information available, focus on object detection |
Reference | Homepage |
Code | Download |
The research goal and the related work are mostly about candidate region (object proposal) generation.
Goal: perform 3D object detection from a single monocular image.
Our method: object proposals → object detections, with more emphasis on the object proposals.
LIDAR is widely used as a sensor for autonomous vehicles, but because it is expensive, recent research explores using inexpensive cameras instead.
Object detectors such as Faster R-CNN work by first selecting candidate regions (object proposals). Most of the recent object detection pipelines [19-Fast RCNN, 20-RCNN] typically proceed by generating a diverse set of object proposals that have a high recall and are relatively fast to compute [45, 2]. By doing this, computationally more intense classifiers such as CNNs [28, 42] can be devoted to a smaller subset of promising image regions, avoiding computation on a large set of futile candidates.
[45] K. Van de Sande, J. Uijlings, T. Gevers, and A. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011
[2] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[28] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556, 2014.
This paper also builds on the object proposal idea. Our paper follows this line of work. Different types of object proposal methods have been developed in the past few years.
The usual way to generate proposals is to split the image into pixel-level pieces and measure the similarity between them. A common approach is to over-segment the image into super pixels and group these using several similarity measures [45, 2].
There are also methods that search over windows using objectness and contour information. Approaches that efficiently explore an exhaustive set of windows using simple “objectness” features [1, 11], or contour information [55] have also been proposed.
[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI, 2012.
[11] M. Cheng, Z. Zhang, M. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014
[55] L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In ECCV. 2014
Methods that use segmentation models, parametric energies, and CNN features are also being studied. The most recent line of work aims to learn how to propose promising object candidates using either ensembles of binary segmentation models [27], parametric energies [29] or window classifiers based on CNN features [18].
[27] P. Krähenbühl and V. Koltun. Learning to propose objects. In CVPR, 2015.
[29] T. Lee, S. Fidler, and S. Dickinson. A learning framework for generating region proposals with mid-level cues. In ICCV, 2015
[18] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and L. V. Gool. Deepproposal: Hunting objects by cascading deep convolutional layers. In arXiv:1510.04445, 2015.
These approaches achieve good results on PASCAL VOC. Autonomous driving, however, calls for stricter criteria. Even well-known methods such as R-CNN perform poorly on the KITTI data. [10], which performs well on KITTI, uses stereo imagery (two cameras) to propose 3D candidate regions. The current leader on KITTI is Chen et al. [10], which exploits stereo imagery to create accurate 3D proposals.
[10] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In NIPS, 2015
Most vehicles, however, are equipped with only a single camera. Monocular object detection is therefore an important challenge.
Proposed approach: this paper proposes a method that learns to generate class-specific 3D object proposals with very high recall by exploiting contextual models as well as semantics.
These proposals are generated by exhaustively placing 3D bounding boxes on the ground-plane and scoring them via simple and efficiently computable image features.
In particular, we use semantic and object instance segmentation, context, as well as shape features and location priors to score our boxes.
We learn per-class weights for these features using S-SVM [24], adapting to each individual object class.
The top object candidates are then scored with a CNN, resulting in the final set of detections.
Our work is related to methods for object proposal generation, as well as monocular 3D object detection.
The review below mainly looks at object proposals in the autonomous driving domain.
Most of the existing work on proposal generation uses RGB [45, 55, 9, 2, 11, 29], RGB-D [4, 21, 31, 25], or video [35].
In RGB, most methods combine superpixels into larger regions via several similarity functions using e.g. color and texture [45, 2].
These approaches prune the exhaustive set of windows down to about 2K proposals per image, achieving almost perfect recall on PASCAL VOC [12].
[9] defines parametric affinities between pixels and finds the regions using parametric min-cut.
The resulting regions are then scored via simple features, and the top-ranked proposals are used in recognition tasks [8, 15, 53].
Exhaustively sampled boxes are scored using several “objectness” features in [1].
BING proposals [11] score boxes based on an object closure measure as a proxy for “objectness”.
Edgeboxes [55] score an exhaustive set of windows based on contour information inside and on the boundary of each window.
The methods closest to this paper are the learning-based ones.
The most related approaches to ours are recent methods that aim to learn how to propose objects.
[29] learns parametric energies in order to propose multiple diverse regions.
In [27], an ensemble of figure-ground segmentation models is learnt.
Joint learning of the ensemble of local and global binary CRFs enables the individual predictors to specialize in different ways.
[26] learned how to place promising object seeds and employ geodesic distance transform to obtain candidate regions.
Parallel to our work, [18] introduced a method that generates object proposals by cascading the layers of the convolutional neural network.
The method is efficient since it explores an exhaustive set of windows via integral images over the CNN responses.
Our approach also exploits integral images to score the candidates; however, in our work we exploit domain priors to place 3D bounding boxes and score them with semantic features.
We use pixel-level class scores from the output layer of the grid CNN, as well as contextual and shape features.
In RGB-D, [10] exploited stereo imagery to exhaustively score 3D bounding boxes using a conditional random field with several depth-informed potentials.
Our work also evaluates 3D bounding boxes, but uses semantic object and instance segmentation and 3D priors to place proposals on the ground plane.
Our RGB potentials are partly inspired by [15, 53], which exploit efficiently computed segmentation potentials for 2D object detection.
Our work is also related to detection approaches for autonomous driving.
[54] first detects a candidate set of objects via a poselet-like approach and then fits a deformable wireframe model within the box.
[38] extends DPM [13] to 3D by linking parts across different viewpoints, while [14] extends DPM to reason about deformable 3D cuboids.
[34] uses an ensemble of models derived from visual and geometrical clusters of object instances.
Regionlets [32] proposes boxes via Selective Search and re-localizes them using a top-down approach.
[46] introduced a holistic model that re-reasons about DPM object candidates via cartographic priors.
Recently proposed 3DVP [47] learns occlusion patterns in order to significantly improve performance of occluded cars on KITTI.
In this paper, we present an approach that performs accurate 3D object detection.
Since our input is a single monocular image, our ground-plane is assumed to be orthogonal to the image plane and a distance down from the camera, the value of which we assume to be known from calibration.
Since this ground-plane may not reflect perfect reality in each image, we do not force objects to lie on the ground, and only encourage them to be close.
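To make the ground-plane assumption concrete, the following is a minimal sketch of placing a 3D box on a ground plane at a known camera height and projecting it to a 2D image box. The camera frame (x right, y down, z forward), the box size, and the intrinsics `f`, `cx`, `cy` are illustrative assumptions, not values from the paper, and the box here sits exactly on the ground rather than merely being encouraged to be near it.

```python
import math


def box_corners_on_ground(x, z, yaw, size, cam_height):
    """Return the 8 corners of a 3D box resting on the ground plane.

    Assumed camera frame: x right, y down, z forward, so the ground is the
    plane y = cam_height. size is (length, width, height) in metres.
    """
    length, width, height = size
    corners = []
    for dx in (-length / 2, length / 2):
        for dz in (-width / 2, width / 2):
            # Rotate the footprint around the vertical (y) axis by yaw.
            rx = dx * math.cos(yaw) + dz * math.sin(yaw)
            rz = -dx * math.sin(yaw) + dz * math.cos(yaw)
            for y in (cam_height, cam_height - height):  # bottom face, top face
                corners.append((x + rx, y, z + rz))
    return corners


def project_to_2d_box(corners, f, cx, cy):
    """Pinhole projection of the corners; returns (u_min, v_min, u_max, v_max)."""
    us = [f * X / Z + cx for X, Y, Z in corners]
    vs = [f * Y / Z + cy for X, Y, Z in corners]
    return min(us), min(vs), max(us), max(vs)


# Hypothetical car-sized box 20 m ahead of a camera mounted 1.65 m above the ground.
corners = box_corners_on_ground(x=0.0, z=20.0, yaw=0.0,
                                size=(4.0, 1.6, 1.5), cam_height=1.65)
u1, v1, u2, v2 = project_to_2d_box(corners, f=700.0, cx=600.0, cy=180.0)
```

The resulting 2D box is the region over which image features for the candidate would be computed.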
The 3D candidates are sorted by score, and only the highest-scoring ones are then scored by the CNN. The resulting 3D candidates are then sorted according to their score, and only the most promising ones (after non-maxima suppression) are further scored via a Convolutional Neural Net (CNN).
This results in a fast and accurate approach to 3D detection.
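The sort-then-suppress step can be sketched with a greedy non-maxima suppression. The notes do not state the overlap threshold, the number of candidates kept, or whether overlap is measured in 2D or 3D, so the 2D IoU, `iou_thresh` and `top_k` below are assumptions chosen only for illustration.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top_candidates(boxes, scores, iou_thresh=0.5, top_k=2):
    """Greedy NMS over score-sorted boxes; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou_2d(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == top_k:
            break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = top_candidates(boxes, scores)
```

Here the second box overlaps the first by more than the threshold and is suppressed, so only the kept indices would be passed on to the CNN.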
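We represent each object with a 3D bounding box, $$y = (x, y, z, \theta, c, t)$$, where $$(x, y, z)$$ is the center of the 3D box, $$\theta$$ is the azimuth angle, $$c$$ is the object class and $$t$$ indexes the size template.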
Bounding box size: We represent the size of the bounding box with a set of representative 3D templates t, which are learnt from the training data.
We then define our scoring function by combining semantic cues (both class and instance level segmentation), location priors, context as well as shape:
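The scoring function itself did not survive in these notes. Given the cues listed above and the per-class weights learnt with S-SVM, a linear form such as the following is consistent with the text. The symbol names ($$\mathbf{x}$$ for the image, $$\phi$$ for each feature group, $$w$$ for its class-specific weights) are assumptions made for this sketch:

$$
E(\mathbf{x}, y) = w_{c,\text{sem}}^{\top}\phi_{c,\text{sem}}(\mathbf{x}, y) + w_{c,\text{inst}}^{\top}\phi_{c,\text{inst}}(\mathbf{x}, y) + w_{c,\text{cont}}^{\top}\phi_{c,\text{cont}}(\mathbf{x}, y) + w_{c,\text{loc}}^{\top}\phi_{c,\text{loc}}(\mathbf{x}, y) + w_{c,\text{shape}}^{\top}\phi_{c,\text{shape}}(\mathbf{x}, y)
$$

Each term corresponds to one cue: class-level semantic segmentation, instance segmentation, context, location prior, and shape.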
The semantic segmentation potential takes as input a pixel-wise semantic segmentation containing multiple semantic classes such as car, pedestrian, cyclist and road.
We incorporate two types of features encoding semantic segmentation.
The first feature encourages the presence of an object inside the bounding box by counting the percentage of pixels labeled as the relevant class:
with $$\Omega(y)$$ the set of pixels in the 2D box generated by projecting the 3D box y to the image plane, and $$S_c$$ the segmentation mask for class c.
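The formula for the first feature is missing from these notes. A form consistent with the description (the fraction of pixels in $$\Omega(y)$$ labeled as class c) would be the following, where the feature name $$\phi_{c,\text{seg}}$$ and the image symbol $$\mathbf{x}$$ are assumptions:

$$
\phi_{c,\text{seg}}(\mathbf{x}, y) = \frac{1}{|\Omega(y)|} \sum_{i \in \Omega(y)} S_c(i)
$$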
The second feature computes the fraction of pixels that belong to classes other than the object class.
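The two semantic features can be illustrated with a direct pixel count. This sketch assumes hard per-pixel labels and counts every non-object label toward the second feature, in which case the two fractions sum to one; the paper's exact definition (for example with soft class scores or with some classes excluded) may differ. The function name and the toy label map are hypothetical.

```python
def semantic_features(labels, box, cls):
    """Fractions of box pixels labeled as cls and as any other class.

    labels: 2D list of per-pixel class names.
    box: (top, left, bottom, right), end-exclusive, in pixel coordinates.
    """
    top, left, bottom, right = box
    area = (bottom - top) * (right - left)
    inside = sum(
        1
        for r in range(top, bottom)
        for c in range(left, right)
        if labels[r][c] == cls
    )
    frac_class = inside / area           # first feature: object-class pixels
    frac_other = (area - inside) / area  # second feature: all other pixels
    return frac_class, frac_other


labels = [
    ["road", "car", "car", "road"],
    ["road", "car", "car", "road"],
    ["road", "road", "road", "road"],
]
seg, non_seg = semantic_features(labels, (0, 1, 3, 3), "car")
```

A box that tightly covers the car would score higher on the first feature and lower on the second than this slightly oversized one.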
We use exhaustive search as inference to create our candidate proposals.
This can be done efficiently, as all the features can be computed with integral images (1.8 s on a single core).
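Exhaustive search over candidates can be sketched as enumerating ground-plane positions, orientations and size templates, then ranking by a linear score. The grid ranges, step, templates, weights and the toy feature function below are all placeholders invented for illustration; in the method described here the features come from segmentation, context, location and shape cues, and boxes are only encouraged (not forced) to lie on the ground.

```python
import math
from itertools import product


def enumerate_candidates(x_range, z_range, step, yaws, templates, cam_height):
    """Yield (x, y, z, yaw, template_index) for boxes resting on the ground."""
    n_x = int(round((x_range[1] - x_range[0]) / step)) + 1
    n_z = int(round((z_range[1] - z_range[0]) / step)) + 1
    xs = [x_range[0] + i * step for i in range(n_x)]
    zs = [z_range[0] + i * step for i in range(n_z)]
    for x, z, yaw, t in product(xs, zs, yaws, range(len(templates))):
        yield (x, cam_height, z, yaw, t)


def score(candidate, weights, feature_fn):
    """Linear score: dot product of weights and the candidate's features."""
    return sum(w * f for w, f in zip(weights, feature_fn(candidate)))


def toy_features(cand):
    """Placeholder features: prefer boxes near the optical axis and nearby."""
    x, _, z, _, _ = cand
    return (-abs(x), -z)


templates = [(3.9, 1.6, 1.5), (4.5, 1.8, 1.7)]  # hypothetical (l, w, h) in metres
candidates = list(enumerate_candidates((-2.0, 2.0), (5.0, 9.0), 2.0,
                                       (0.0, math.pi / 2), templates, 1.65))
best = max(candidates, key=lambda cand: score(cand, (1.0, 0.1), toy_features))
```

With 3 x-positions, 3 z-positions, 2 orientations and 2 templates this toy grid yields 36 candidates; a real grid is far denser, which is why cheap feature evaluation matters.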
We learn the weights of the model using structured SVM [44].
We use the parallel cutting plane implementation of [40].
We use 3D Intersection-over-Union (IoU) as our task loss.
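As a rough illustration of an IoU-based task loss, the sketch below computes IoU for axis-aligned 3D boxes and uses one minus IoU as the loss. Both choices are simplifying assumptions: the boxes in this method are oriented (they carry an angle), which requires a polygon intersection on the ground plane, and the notes do not spell out how the loss is derived from the IoU.

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for k in range(3):
        overlap = min(a[k + 3], b[k + 3]) - max(a[k], b[k])
        if overlap <= 0:
            return 0.0  # no overlap along this axis means no intersection
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def task_loss(pred, gt):
    """Assumed loss form: 0 for a perfect match, 1 for no overlap."""
    return 1.0 - iou_3d_axis_aligned(pred, gt)


gt = (0, 0, 0, 2, 2, 2)
shifted = (1, 0, 0, 3, 2, 2)   # same size, shifted by half its extent along x
loss_shifted = task_loss(shifted, gt)
```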
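An integral image is, simply put, an image in which each pixel stores the sum of all pixels up to that position (everything above and to the left of it). Its advantage is that the sum of pixel values over any rectangular region can then be obtained very cheaply.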
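A minimal sketch of the idea, using plain Python lists: the table is padded with a zero row and column so that any rectangle sum needs exactly four lookups. The function names are illustrative.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) summed-area table with a zero first row/column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            # Sum above (previous table row) plus the running sum of this row.
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 using four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)    # whole image
corner = box_sum(ii, 1, 1, 3, 3)   # 5 + 6 + 8 + 9
```

Building the table costs one pass over the image; after that every box sum is constant time regardless of box size, which is what makes scoring a very large number of candidate boxes affordable.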
How the top candidates selected after NMS are further scored with a CNN: In this section, we describe how the top candidates (after non-maxima suppression) are further scored via a CNN.
We employ the same network as in [10-3DOP], which for completeness we briefly describe here.
RoIs are obtained by projecting the proposals or context regions onto the conv5 feature maps.
We obtain the final feature vectors by concatenating the output features from the two branches.
[53] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. SegDeepM: Exploiting segmentation and context in deep neural networks for object detection. In CVPR, 2015.
The network architecture is illustrated in Fig. 2.
We use a multi-task loss to jointly predict category labels, bounding box offsets, and object orientation.
For background boxes, only the category label loss is employed.
We weight each loss equally, and define the category loss as cross entropy, the orientation loss as a smooth $$\ell_1$$ loss, and the bounding box offset loss as a smooth $$\ell_1$$ loss over the 4 coordinates that parameterize the 2D bounding box, as in [20-RCNN].
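The equal-weight multi-task loss can be sketched as follows. The smooth $$\ell_1$$ form used here (quadratic below 1, linear above) is the common Fast R-CNN definition and is an assumption about the exact variant; angle wrap-around is ignored, and the function names and the convention that label 0 means background are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class given softmax probabilities."""
    return -math.log(probs[label])


def smooth_l1(pred, target):
    """Smooth L1 summed over coordinates: quadratic below 1, linear above."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def multi_task_loss(probs, label, box_pred, box_gt, angle_pred, angle_gt,
                    background=0):
    """Equal-weight sum; background boxes only receive the category loss."""
    loss = cross_entropy(probs, label)
    if label != background:
        loss += smooth_l1(box_pred, box_gt)          # 4 box offsets
        loss += smooth_l1([angle_pred], [angle_gt])  # orientation
    return loss


fg = multi_task_loss([0.1, 0.9], 1,
                     [0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 2.0, 0.0], 0.2, 0.0)
bg = multi_task_loss([0.8, 0.2], 0, [0.0] * 4, [1.0] * 4, 0.0, 1.0)
```

For the background example the box and angle errors are large but contribute nothing, matching the statement that only the category label loss is employed for background boxes.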