Paper | Vehicle Detection from 3D Lidar Using Fully Convolutional Network |
---|---|
Authors (Affiliation) | Bo Li (Baidu) |
Venue/Year | 29 Aug 2016, paper |
Keywords | It is able to predict full 3D bounding boxes even using a 2D CNN |
Dataset (Sensor)/Model | KITTI (Velodyne 64E) |
Notes | |
Code | |
Tasks in robotics that use 3D point clouds: localization, mapping, object detection and scene parsing [16].
Goal: we design a fully convolutional network (FCN) to detect and localize objects as 3D boxes from range scan data.
Object detection algorithms = propose candidates + classify
- Rule-based: clustering the remaining points can generate a reasonable segmentation [10, 5], or forming graphs on the point cloud [32, 14, 21, 29, 30].
- [2] suggests segmenting the scene hierarchically and keeping segments of different scales.
- Other methods directly exhaust the range scan space to propose candidates, to avoid incorrect segmentation.
Recent machine-learning-based detection methods hand-craft some features and then use them for classification.
A comparison of features can be found in [1].
Besides the hand-crafted features, Deuge et al. [4] and Lai et al. [15] explore learning feature representations of point clouds via sparse coding.
Object detection on range scans is closely related to detection methods based on RGBD images [3, 17].
RGBD data: color information (RGB) and distance-to-object information (depth) measured together.
(Note: as far as I know, [3] uses a plain camera sensor, not RGBD.)
[3] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. Advances in Neural Information Processing Systems, pages 424–432, 2015.
[17] Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3D object detection with RGBD cameras. Proceedings of the IEEE International Conference on Computer Vision, pages 1417–1424, 2013.
R-CNN [8] proposes candidate regions and uses CNN to verify candidates as valid objects
OverFeat [25], DenseBox [11] and YOLO [23] use end-to-end unified FCN frameworks which predict the objectness confidence and the bounding boxes simultaneously over the whole image.
Some research has also been focused on applying CNN on 3D data.
In RGBD data, the depth (D) can be treated as an additional image channel, so 2D CNNs can be used for classification/detection [9, 24, 26].
[9] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. arXiv preprint arXiv:1407.5736, pages 1–16, 2014.
[24] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D Object Recognition and Pose Estimation based on Pretrained Convolutional Neural Network Features. IEEE International Conference on Robotics and Automation (ICRA), May 2015.
[26] Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, and Andrew Y. Ng. Convolutional-recursive deep learning for 3D object classification. Advances in Neural Information Processing Systems, pages 665–673, 2012.
Other works apply deep networks directly to volumetric (voxel) representations of 3D shapes [33, 19].
[33] Zhirong Wu and Shuran Song. 3D ShapeNets: A Deep Representation for Volumetric Shapes. CVPR 2015, pages 1–9, 2015.
[19] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. pages 922–928, 2015.
These classifiers can be integrated with region proposal methods like sliding window [27] for detection tasks.
In this paper, our approach projects range scans as 2D maps similar to the depth map of RGBD data.
The frameworks of Huang et al. [11], Sermanet et al. [25] are transplanted to predict the objectness and the 3D object bounding boxes in a unified end-to-end manner.
Sensor: Velodyne 64E lidar.
Points from a Velodyne scan can be roughly projected and discretized into a 2D point map, using the projection function (Eq. 1).
The projected point map is analogous to a cylindrical image.
We fill the element at $$(r, c)$$ in the 2D point map with 2-channel data $$(d, z)$$
An example of the $$d$$ channel of the 2D point map is shown in Figure 1a.
Occasionally, several points might be projected onto the same 2D position, in which case the point nearer to the observer is kept.
Elements at 2D positions onto which no 3D point is projected are filled with $$(d, z) = (0, 0)$$.
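Below is a minimal numpy sketch of this projection, assuming the usual cylindrical mapping $$\theta = \mathrm{atan2}(y, x)$$, $$\phi = \arcsin(z/\|p\|)$$ for (Eq. 1) and taking $$d$$ as the horizontal distance $$\sqrt{x^2 + y^2}$$; the angular resolutions, the vertical field-of-view offset and the map size are placeholder values here, not the paper's actual settings.

```python
import numpy as np

def project_to_point_map(points, d_theta=0.5, d_phi=0.4, phi_min=-25.0,
                         h=64, w=720):
    """Project an (N, 3) lidar point cloud into an (h, w, 2) map of (d, z).

    d_theta / d_phi (degrees) and phi_min are assumed placeholder values;
    Eq. 1 in the paper fixes them from the Velodyne 64E geometry.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    dist = np.sqrt(x**2 + y**2 + z**2)                        # range to the observer
    d = np.sqrt(x**2 + y**2)                                  # horizontal distance channel
    theta = np.degrees(np.arctan2(y, x))                      # azimuth
    phi = np.degrees(np.arcsin(z / np.maximum(dist, 1e-6)))   # elevation

    # Discretize (theta, phi) into (row, col) indices of the 2D point map.
    c = np.clip(((theta + 180.0) / d_theta).astype(int), 0, w - 1)
    r = np.clip(((phi - phi_min) / d_phi).astype(int), 0, h - 1)

    # Empty cells stay (d, z) = (0, 0); on collisions keep the nearer point.
    point_map = np.zeros((h, w, 2), dtype=np.float32)
    nearest = np.full((h, w), np.inf)
    for ri, ci, di, zi, ra in zip(r, c, d, z, dist):
        if ra < nearest[ri, ci]:
            nearest[ri, ci] = ra
            point_map[ri, ci] = (di, zi)
    return point_map
```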
The overall structure of the proposed approach is similar to [11], [18]: the trunk part of the proposed CNN architecture follows Huang et al. [11] and Long et al. [18].
[11] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying Landmark Localization with End to End Object Detection. pages 1–13, 2015.
[18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
[Fig. 2. The proposed FCN structure to predict vehicle objectness and bounding box simultaneously.]
- The output feature maps of conv1/deconv5a, conv1/deconv5b and conv2/deconv4 are first concatenated and then ported to their consecutive layers, respectively.
As illustrated in Figure 2, the input point map, the output objectness map and the bounding box map are of the same width and height, to provide point-wise prediction.
Each element of the objectness map predicts whether its corresponding point is on a vehicle.
If the corresponding point is on a vehicle, its corresponding element in the bounding box map predicts the 3D bounding box of the belonging vehicle.
Section III-C explains how the objectness and bounding box are encoded.
In conv1, the point map is down-sampled by 4 horizontally and 2 vertically.
Similarly, the feature map is up-sampled by this factor of (4, 2) in deconv6a and deconv6b, respectively.
The remaining conv/deconv layers all scale the horizontal and vertical resolution equally, using square strides of (2, 2) when up-sampling or down-sampling.
The output feature map pairs of conv3/deconv4, conv2/deconv5a, conv2/deconv5b are of the same sizes, respectively.
We concatenate these output feature map pairs before passing them to the subsequent layers.
This follows the idea of Long et al. [18].
Combining features from lower layers and higher layers improves the prediction of small objects and object edges.
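To make the stride bookkeeping concrete, here is a minimal PyTorch-style sketch of the trunk. The channel widths and kernel sizes are placeholders; only the down-/up-sampling factors follow the text, and the concatenation pairs are chosen so that spatial sizes match under those strides (conv2 with deconv4, conv1 with deconv5a/5b), as in the Fig. 2 caption above.

```python
import torch
import torch.nn as nn

class VehicleFCN(nn.Module):
    """Sketch of the encoder/decoder trunk; channel widths and kernel sizes
    are placeholders, only the stride bookkeeping follows the text."""
    def __init__(self):
        super().__init__()
        # conv1 down-samples by 4 horizontally and 2 vertically: stride=(2, 4).
        self.conv1 = nn.Conv2d(2, 32, kernel_size=5, stride=(2, 4), padding=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=(2, 2), padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=(2, 2), padding=1)
        # deconv4 brings the resolution back to that of conv2's output.
        self.deconv4 = nn.ConvTranspose2d(128, 64, 4, stride=(2, 2), padding=1)
        # Two sibling decoder branches (objectness / bounding box).
        self.deconv5a = nn.ConvTranspose2d(128, 32, 4, stride=(2, 2), padding=1)
        self.deconv5b = nn.ConvTranspose2d(128, 32, 4, stride=(2, 2), padding=1)
        # deconv6a/6b up-sample by (2, 4) back to the input point-map size.
        self.deconv6a = nn.ConvTranspose2d(64, 2, (4, 8), stride=(2, 4), padding=(1, 2))
        self.deconv6b = nn.ConvTranspose2d(64, 24, (4, 8), stride=(2, 4), padding=(1, 2))

    def forward(self, x):                       # x: (B, 2, H, W) point map
        c1 = torch.relu(self.conv1(x))          # (B, 32, H/2, W/4)
        c2 = torch.relu(self.conv2(c1))         # (B, 64, H/4, W/8)
        c3 = torch.relu(self.conv3(c2))         # (B, 128, H/8, W/16)
        d4 = torch.relu(self.deconv4(c3))       # (B, 64, H/4, W/8)
        d4 = torch.cat([d4, c2], dim=1)         # concat with same-size conv2 map
        d5a = torch.relu(self.deconv5a(d4))     # (B, 32, H/2, W/4)
        d5b = torch.relu(self.deconv5b(d4))
        d5a = torch.cat([d5a, c1], dim=1)       # concat with same-size conv1 map
        d5b = torch.cat([d5b, c1], dim=1)
        objectness = self.deconv6a(d5a)         # (B, 2, H, W): fg/bg scores
        bbox = self.deconv6b(d5b)               # (B, 24, H, W): encoded corners
        return objectness, bbox
```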
We now describe how the output feature maps are defined.
The objectness map deconv6a consists of 2 channels corresponding to foreground and background.
The 2 channels are normalized by softmax to denote the confidence.
For a vehicle point $$p$$, its observation angle $$(\theta, \phi)$$ is given by (Eq. 1).
We first denote a rotation matrix $$R$$ as $$R = R_z(\theta) R_y(\phi)$$ (Eq. 2).
[Fig. 3. (a) Illustration of (Eq. 3).]
- For each vehicle point p, we define a specific coordinate system centered at p.
- The x axis (rx) of the coordinate system is along the ray from the Velodyne origin to p (dashed line).
Figure 3a illustrates an example of how $$R$$ is formed.
A bounding box corner $$c_p = (x_c, y_c, z_c)$$ is thus transformed as $$c'_p = R^T (c_p - p)$$ (Eq. 3).
Our proposed approach uses $$c'_p$$ to encode the bounding box corner of the vehicle which $$p$$ belongs to.
The full bounding box is thus encoded by concatenating the 8 corners into a 24-d vector as $$b'_p = (c'^{T}_{p,1}, c'^{T}_{p,2}, \dots, c'^{T}_{p,8})^{T}$$ (Eq. 4).
Corresponding to this 24d vector, deconv6b outputs a 24-channel feature map accordingly.
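A small numpy sketch of this corner encoding (Eq. 2 to Eq. 4); treating $$\theta$$ and $$\phi$$ as the azimuth and elevation of $$p$$ is an assumption here, since the note does not reproduce Eq. 1.

```python
import numpy as np

def encode_box(p, corners):
    """Encode the 8 box corners relative to vehicle point p (Eq. 2-4).

    p:       (3,) point on the vehicle, in the lidar frame.
    corners: (8, 3) bounding-box corners c_p of the vehicle p belongs to.
    Returns the 24-d target vector b'_p = (c'_{p,1}, ..., c'_{p,8}).
    """
    x, y, z = p
    theta = np.arctan2(y, x)                        # azimuth of p (assumed from Eq. 1)
    phi = np.arcsin(z / np.linalg.norm(p))          # elevation of p (assumed from Eq. 1)
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
    Ry = np.array([[ np.cos(phi), 0.0, np.sin(phi)],
                   [ 0.0,         1.0, 0.0        ],
                   [-np.sin(phi), 0.0, np.cos(phi)]])
    R = Rz @ Ry                                     # Eq. 2
    c_prime = (R.T @ (np.asarray(corners) - p).T).T # Eq. 3, applied to all 8 corners
    return c_prime.reshape(-1)                      # Eq. 4: 24-d vector
```

Decoding at test time simply inverts Eq. 3: $$c_p = R c'_p + p$$.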
The transform (Eq. 3) is designed due to the following two reasons.
As with 2D images, data augmentation on 3D data also improves performance.
For the case of images, training data are usually augmented by randomly zooming or rotating the original images to synthesize more training samples.
For the case of range scans, simply applying these operations results in variable ∆θ and ∆φ in (Eq. 1), which violates the geometry property of the lidar device.
To synthesize geometrically correct 3D range scans, we randomly generate a 3D transform near identity.
Before projecting the point cloud by (Eq. 1), the random transform is applied to the point cloud.
The translation component of the transform results in a zooming effect of the synthesized range scan, while the rotation component results in a rotation effect of the range scan.
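A rough sketch of this augmentation, assuming a rotation about the vertical axis and a bounded random translation; the actual perturbation ranges used in training are not given in the note.

```python
import numpy as np

def random_transform_near_identity(max_rot_deg=5.0, max_trans=1.0, rng=None):
    """Sample a small rigid transform (R, t) near identity.

    max_rot_deg / max_trans bound the perturbation and are assumed values,
    not the paper's actual training settings.
    """
    rng = rng or np.random.default_rng()
    yaw = np.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0,          0.0,         1.0]])   # rotation: rotates the scan
    t = rng.uniform(-max_trans, max_trans, size=3)      # translation: zooming effect
    return R, t

def augment_point_cloud(points, rng=None):
    """Apply the random transform to the raw cloud before projection (Eq. 1)."""
    R, t = random_transform_near_identity(rng=rng)
    return points @ R.T + t
```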
The proposed network consists of two output branches: an objectness classification branch (deconv6a) and a bounding box regression branch (deconv6b).
We respectively denote the losses of the two branches in the training phase.
As notation, denote $$o^a_p$$ and $$o^b_p$$ as the feature map output of deconv6a and deconv6b corresponding to point p respectively.
Also denote $$P$$ as the point cloud and $$V \subset P$$ as the set of all points on all vehicles.
The loss of the objectness classification branch corresponding to a point $$p$$ is denoted as a softmax loss
The loss of the bounding box regression branch corresponding to a point p is denoted as an L2-norm loss.
Note that $$L_{box}$$ is only computed for those points on vehicles. For non-vehicle points, the bounding box loss is omitted.
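A hedged PyTorch-style sketch of the two training losses; the function and argument names and the relative weighting of the two terms are assumptions here, only the softmax loss over all points and the L2 loss restricted to vehicle points follow the text.

```python
import torch
import torch.nn.functional as F

def detection_loss(objectness, bbox, obj_labels, bbox_targets, box_weight=1.0):
    """objectness:   (B, 2, H, W) scores from deconv6a
       bbox:         (B, 24, H, W) encoded corners from deconv6b
       obj_labels:   (B, H, W) long tensor, 0 = background, 1 = vehicle
       bbox_targets: (B, 24, H, W) target b'_p for vehicle points
    """
    # Softmax (cross-entropy) objectness loss over all points.
    L_obj = F.cross_entropy(objectness, obj_labels)

    # L2-norm box loss, computed only where the point lies on a vehicle.
    vehicle_mask = (obj_labels == 1).unsqueeze(1).float()   # (B, 1, H, W)
    n_vehicle = vehicle_mask.sum().clamp(min=1.0)
    L_box = ((bbox - bbox_targets) ** 2 * vehicle_mask).sum() / n_vehicle

    return L_obj + box_weight * L_box
```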
In fact, image-based detection performs better than 3D-based detection. A related explanation can also be found in Wang and Posner [31].
How later works describe this method:
Li et al. (2016b) improve upon these results by exploiting a fully convolutional neural network for detecting vehicles from range data.
They represent the data in a 2D point map, and predict an objectness confidence and a bounding box simultaneously using a single 2D CNN.
The encoding used to represent the data allows them to predict the full 3D bounding box of the vehicles.
A CNN-based approach in [7] works by projecting the point cloud into a 2D depth map, with an additional channel for the height of a point from the ground.
Their model predicts detection scores and regresses to bounding boxes.
However, the projection to a specific viewpoint discards valuable information, which is particularly detrimental, for example, in crowded scenes.
It also requires the network filters to learn local dependencies with regards to depth, information that is readily available in a 3D representation and which can be efficiently extracted with sparse convolutions.
In addition to the 3D voxel representation, VeloFCN [17] projects point cloud to the front view, obtaining a 2D point map.
They apply a fully convolutional network on the 2D point map and predict 3D boxes densely from the convolutional feature maps.