| Title | 3D Fully Convolutional Network for Vehicle Detection in Point Cloud |
|---|---|
| Author (affiliation) | Bo Li (Baidu) |
| Venue/Year | Nov. 2016 ~ Jan. 2017, paper |
| Keyword | Bo2016b |
| Dataset (sensor)/Model | KITTI |
| Notes | Previous paper (Aug. 2016) |
| Code | TF |
- Vehicle detection algorithm: `For example Li (2017) presented a method for vehicle detection in ground-based LiDAR point clouds.`
- The input point cloud was voxelized, and then a fourth binary channel was appended, representing the binary occupancy, i.e. the presence or the absence of a point within each voxel.
- A 3D-FCN was then trained and evaluated to produce two maps representing the objectness and bounding box scores, using the KITTI dataset.
Goal: we design a 3D fully convolutional network (FCN) to detect and localize objects as 3D boxes from point cloud data.
Proposed method: extends the FCN to 3D and applies it to 3D vehicle detection for an autonomous driving system.
Pipeline: candidate proposal -> classification
- Candidate proposal: segmentation algorithms [2], [5], [11], [14], [21], [22], [28], [29], [31]; sliding window [30]; random sampling [13]; Region Proposal Network (RPN) [26]
- Classification features: shape models [6], [13]; geometry statistic features [22], [27], [29], [31]; sparse coding [4], [15]; deep learning [26]
3D-to-2D projection: besides directly operating in the 3D point cloud space, some other previous detection algorithms project the 3D point cloud onto a 2D surface as depth maps or range scans [3], [16], [17].
[3-3DOP] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. Advances in Neural Information Processing Systems, pages 424–432, 2015.
[9] S Gupta, R Girshick, P Arbelaez, and J Malik. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. arXiv preprint arXiv:1407.5736, pages 1–16, 2014.
[16-VeloFCN] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3d lidar using fully convolutional network. Proceedings of Robotics: Science and Systems, 2016.
[17] Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3D object detection with RGBD cameras. Proceedings of the IEEE International Conference on Computer Vision, pages 1417–1424, 2013.
[24] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D Object Recognition and Pose Estimation based on Pre-trained Convolutional Neural Network Features. IEEE International Conference on Robotics and Automation (ICRA), 2015.
[25] Richard Socher, Brody Huval, Bharath Bath, Christopher D Manning, and Andrew Y Ng. Convolutional-recursive deep learning for 3d object classification. Advances in Neural Information Processing Systems, pages 665–673, 2012.
3D object localization works even when a 2D CNN is applied directly to the point cloud data [16-VeloFCN].
[10] operates on 3D voxel data but regards one dimension as the channel dimension so that a 2D CNN can be applied.
[10] Michael Himmelsbach, Felix V Hundelshausen, and Hans-Joachim Wunsche. Fast segmentation of 3d point clouds for ground vehicles. Intelligent Vehicles Symposium (IV), 2010 IEEE, pages 560–565, 2010.
Voxel (volume + pixel): a point in 3D space that carries volume and color information.
[8] Ben Graham. Sparse 3D convolutional neural networks. BMVC, pages 1–11, 2015.
[20] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928, 2015.
[32] Zhirong Wu, Shuran Song, et al. 3D ShapeNets: A Deep Representation for Volumetric Shapes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
[26] Shuran Song and Jianxiong Xiao. Sliding shapes for 3d object detection in depth images. European Conference on Computer Vision (ECCV), pages 634–651, 2014.
This paper extends the FCN so that it can process 3D input. In this paper, we transplant the fully convolutional network (FCN) to 3D to detect and localize objects as 3D boxes in point cloud.
FCNs have recently shown strong results on benchmarks such as ImageNet and KITTI. The FCN is a recently popular framework for end-to-end object detection, with top performance in tasks including ImageNet, KITTI, ICDAR, etc.
Variations of FCN include DenseBox [12], YOLO [23] and SSD [18].
This work is inspired by DenseBox. The approach proposed in this paper is inspired by the basic idea of DenseBox.
The FCN-based detection procedure can be divided into two basic tasks. The procedure of FCN based detection frameworks can be summarized as two tasks, i.e. objectness prediction and bounding box prediction.
The FCN produces a different output map for each of the two tasks. As illustrated in Figure 1, an FCN is formed with two output maps corresponding to the two tasks respectively.
We follow the notation of [16].
Denote $$o^a_p$$ as the output at region $$p$$ of the objectness map, which can be encoded by softmax or hinge loss.
Denote $$o^b_p$$ as the output of the bounding box map, which is encoded by the coordinate offsets of the bounding box.
Denote the ground truth objectness label at region $$p$$ as $$l_p$$.
For simplicity each class corresponds to one label in this paper.
In some works (e.g. SSD or DenseBox), the network can have multiple objectness labels for one class, corresponding to multiple scales or aspect ratios.
The objectness loss at $$p$$ is denoted as:
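The formula itself is not reproduced in these notes. Assuming the softmax encoding mentioned above (an assumption; the paper may use a different form), a standard softmax cross-entropy over the labels would be:

$$\mathcal{L}^a_p = -\log\left(\frac{\exp(o^a_{p,l_p})}{\sum_{l}\exp(o^a_{p,l})}\right)$$

where $$o^a_{p,l}$$ is the objectness score of label $$l$$ at region $$p$$.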
Denote the ground truth bounding box coordinate offsets at region $$p$$ as $$b_p$$.
Similarly, in this paper we assume only one bounding box map is produced, though a more sophisticated network can have multiple bounding box offsets predicted for one class, corresponding to multiple scales or aspect ratios.
Each bounding box loss is denoted as:
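The formula is again missing here. Since $$o^b_p$$ and $$b_p$$ are both coordinate offsets, a plausible reconstruction (an assumption based on the definitions above) is a plain L2 regression loss:

$$\mathcal{L}^b_p = \left\|o^b_p - b_p\right\|_2^2$$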
In the deployment phase, bounding box predictions are collected from the regions predicted as objects and clustered into the final detections (detailed in the testing procedure below).
Although a variety of discretization embeddings have been introduced for high-dimensional convolution [1], [8], for simplicity we discretize the point cloud on square grids.
[1] Andrew Adams, Jongmin Baek, and Myers Abraham Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 29(2):753–762, 2010.
The discretized data can be represented by a 4D array with dimensions of length, width, height and channels.
In the simplest case, a single channel with values in {0, 1} is used to indicate whether any point is observed in the corresponding grid element.
Some more sophisticated features have also been introduced in previous works, e.g. [20-VoxNet].
The mechanism of 2D CNN naturally extends to 3D on the square grids.
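To make the single-channel {0, 1} encoding concrete, here is a minimal NumPy sketch of this voxelization; the grid bounds and voxel size are placeholder assumptions, not values from the paper.

```python
import numpy as np

def voxelize(points, grid_min, grid_max, voxel_size):
    """Discretize a point cloud (N, 3) into a binary occupancy grid.

    Returns a 4D array of shape (length, width, height, 1), where the
    single channel is 1 if any point falls inside the grid element.
    """
    shape = np.ceil((grid_max - grid_min) / voxel_size).astype(int)
    grid = np.zeros((*shape, 1), dtype=np.float32)
    idx = ((points - grid_min) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < shape), axis=1)  # drop out-of-range points
    grid[idx[valid, 0], idx[valid, 1], idx[valid, 2], 0] = 1.0
    return grid

# Example (hypothetical range/resolution): 80m x 80m x 4m at 0.2m voxels
# grid = voxelize(points, np.array([0., -40., -2.]), np.array([80., 40., 2.]), 0.2)
```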
[Fig. 2. A sample illustration of the 3D FCN structure used in this paper.]
- Feature maps are
- first down-sampled by three convolution operations with an overall stride of 1/2^3, and
- then up-sampled by deconvolution operations of the same stride.
- The output objectness map (o^a) and bounding box map (o^b) are collected from the deconv4a and deconv4b layers respectively.
Figure 2 shows an example of the network structure used in this paper.
The network follows and simplifies the hourglass shape from [19].
[19] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
Layers conv1, conv2 and conv3 downsample the input map by $$1/2^3$$ sequentially.
Layers deconv4a and deconv4b upsample the incoming map by $$2^3$$ respectively.
The ReLU activation is deployed after each layer.
The output objectness map ($$o^a$$) and bounding box map ($$o^b$$) are collected from the deconv4a and deconv4b layers respectively.
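Since the notes list a TF implementation, a minimal TensorFlow/Keras sketch of this hourglass structure might look as follows. The filter counts, kernel sizes, and input grid size are assumptions, and the 24-channel bounding box head assumes the 8-corner offset encoding of [16]:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_3d_fcn(input_shape=(200, 200, 40, 1), num_classes=2):
    """Hourglass 3D FCN as in Figure 2: three stride-2 convolutions
    (1/2^3 overall), then one stride-8 deconvolution per output head."""
    x_in = layers.Input(shape=input_shape)
    x = layers.Conv3D(32, 5, strides=2, padding='same', activation='relu', name='conv1')(x_in)
    x = layers.Conv3D(64, 5, strides=2, padding='same', activation='relu', name='conv2')(x)
    x = layers.Conv3D(128, 3, strides=2, padding='same', activation='relu', name='conv3')(x)
    # Objectness map o^a and bounding box map o^b.
    o_a = layers.Conv3DTranspose(num_classes, 8, strides=8, padding='same', name='deconv4a')(x)
    o_b = layers.Conv3DTranspose(24, 8, strides=8, padding='same', name='deconv4b')(x)
    return tf.keras.Model(inputs=x_in, outputs=[o_a, o_b])
```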
Similar to DenseBox, the objectness region V is denoted as the center region of the object.
For the proposed 3D case, a 3D sphere located at the object center is used.
Points inside the sphere are labeled as positive / foreground label.
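A minimal sketch of this labeling rule; the sphere radius is a hypothetical parameter (the actual value is not recorded in these notes):

```python
import numpy as np

def objectness_labels(region_centers, object_center, radius):
    # region_centers: (N, 3) coordinates of output-map regions.
    # Regions inside the 3D sphere at the object center are foreground.
    d = np.linalg.norm(region_centers - object_center, axis=1)
    return (d < radius).astype(np.int64)  # 1 = positive, 0 = background
```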
The bounding box prediction at point p is encoded by the coordinate offsets, defined as:
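The defining equation did not survive in these notes. Assuming the 8-corner offset encoding of [16] (without its rotation term, which the perspective-free 3D input should not require), a plausible form is:

$$b_p = \left(c_1^\top - p^\top,\; c_2^\top - p^\top,\; \dots,\; c_8^\top - p^\top\right)^\top$$

where $$c_1, \dots, c_8$$ are the eight corners of the ground truth bounding box.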
The training and testing processes of the 3D CNN follow [16].
For the testing phase, candidate bounding boxes are extracted from regions predicted as objects and scored by counting their neighbors among all candidate bounding boxes.
Bounding boxes are selected starting from the highest score, and candidates overlapping with the selected boxes are suppressed.
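A rough sketch of this scoring-and-suppression step; using the distance between box centers as the overlap test and a fixed neighborhood radius are simplifying assumptions:

```python
import numpy as np

def cluster_boxes(boxes, centers, radius=1.0):
    """Score each candidate by its neighbor count, then greedily keep
    the highest-scoring boxes while suppressing overlapping candidates."""
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    scores = (dist < radius).sum(axis=1)   # neighbors among all candidates
    keep, suppressed = [], np.zeros(len(boxes), dtype=bool)
    for i in np.argsort(-scores):          # highest score first
        if suppressed[i]:
            continue
        keep.append(boxes[i])
        suppressed |= dist[i] < radius     # suppress overlapping candidates
    return keep
```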
[Fig. 3. Intermediate results of the 3D FCN detection procedure.]
- (a) Bounding box predictions are collected from regions with high objectness confidence and are plotted as green boxes.
- (b) Bounding boxes after clustering plotted with the blue original point cloud.
- (c) Detection results in 3D; (a) and (b) are visualized in the bird's eye view.
Figure 3 shows an example of the detection intermediate results.
Bounding box predictions from objectness points are plotted as green boxes.
Note that for severely occluded vehicles, the bounding box shapes are distorted and not well clustered.
This is mainly due to the lack of similar samples in the training phase.
Increased computational cost: compared to a 2D CNN, the dimension increment of the 3D CNN inevitably consumes more computational resources, mainly due to the extra spatial dimension in both the input data and the convolution operations.
On the other hand, naturally embedding objects in 3D space avoids perspective distortion and scale variation in the 2D case.
This makes it possible to learn detection using a relatively simpler network structure.
Is there a Convolutional Neural Network implementation for 3D.... Available from: https://www.researchgate.net/post/Is_there_a_Convolutional_Neural_Network_implementation_for_3D_images [accessed Aug 14, 2017].