Paper | VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition |
---|---|
Authors (affiliation) | Daniel Maturana (CMU) |
Venue/Year | IROS 2015, paper |
Keywords | Daniel2015, volumetric occupancy grid + 3D CNN, comparison experiments on LiDAR data (results?) |
Datasets/Models | Sydney Urban Objects (vehicles considered), NYUv2, ModelNet40 |
Reference | Prior work: "3D convolutional neural networks for landing zone detection from lidar," in ICRA, 2015. |
Code | Theano + Lasagne |
Range sensors such as LiDAR and RGBD cameras are widely used, but the large amounts of point cloud data they produce are not being fully exploited.
The paper proposes VoxNet: an architecture integrating a volumetric occupancy grid representation with a supervised 3D Convolutional Neural Network (3D CNN).
The task is to classify objects given 3D point cloud segments, which brings difficulties such as background clutter because the segments come from a detection/segmentation pipeline; this pipeline is similar to the one typically used for 2D object classification.
Object recognition from 3D point clouds has been studied extensively.
Prior approaches: hand-crafted features (or descriptors) combined with a machine learning classifier ([10], [11], [12], [13]).
Proposed approach: extract features and classify objects directly from the raw volumetric data.
Our volumetric representation is also richer than point clouds, as it distinguishes free space from unknown space.
In addition, features based on point clouds often require spatial neighborhood queries, which can quickly become intractable (computationally infeasible) with large numbers of points.
[10] A. Frome, D. Huber, and R. Kolluri, “Recognizing objects in range data using regional point descriptors,” ECCV, vol. 1, pp. 1–14, 2004.
[11] J. Behley, V. Steinhage, and A. B. Cremers, “Performance of histogram descriptors for the classification of 3D laser range data in urban environments,” in ICRA, 2012, pp. 4391–4398.
[12] A. Teichman, J. Levinson, and S. Thrun, “Towards 3D object recognition via classification of arbitrary object tracks,” in ICRA, 2011, pp. 4034–4041.
[13] A. Golovinskiy, V. G. Kim, and T. Funkhouser, “Shape-based recognition of 3D point clouds in urban environments,” ICCV, 2009.
[14] D. Munoz, N. Vandapel, and M. Hebert, “Onboard contextual classification of 3-D point clouds with learned high-order markov random fields,” in ICRA, 2009.
[15] H. Koppula, “Semantic labeling of 3D point clouds for indoor scenes,” NIPS, 2011.
[16] X. Ren, L. Bo, and D. Fox, “RGB-(D) scene labeling: Features and algorithms,” in CVPR, 2012.
Research has extended CNN techniques used on (RGB) images to RGBD data [17], [18], [19], [20].
[17] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” in RSS, 2013.
[18] R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng, “Convolutional-Recursive Deep Learning for 3D Object Classification,” in NIPS, 2012.
[19] L. A. Alexandre, “3D object recognition using convolutional neural networks with transfer learning between input channels,” in IAS, vol. 301, 2014.
[20] N. Hoft, H. Schulz, and S. Behnke, “Fast semantic segmentation of RGBD scenes with gpu-accelerated deep neural networks,” in 37th Annual German Conference on AI, 2014, pp. 80–85.
[4] proposes a feature that locally describes scans with a 2.5D representation;
[21] studies this approach in combination with a form of unsupervised feature learning;
[22] proposes an encoding that makes better use of the 3D information in the depth channel, but is still 2D-centric.
[4] A. Quadros, J. Underwood, and B. Douillard, “An occlusion-aware feature for range images,” in ICRA, May 14-18 2012.
[21] M. De Deuge, A. Quadros, C. Hung, and B. Douillard, “Unsupervised feature learning for classification of outdoor 3d scans,” in ACRA, 2013.
[22] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV, 2014.
3D CNNs have been applied successfully in video analysis ([23], [24]), where time acts as the third dimension.
[25] uses an unsupervised volumetric feature learning approach as part of a pipeline to detect indoor objects; it is based on sparse coding, which is generally slower than convolutional models.
[26] proposes a generative 3D convolutional model of shape and applies it to RGBD object recognition, among other tasks.
[25] K. Lai, L. Bo, and D. Fox, “Unsupervised feature learning for 3D scene labeling,” in ICRA, 2014.
[26] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shape modeling,” in CVPR, 2015.
[27] is an early work that studies a 3D CNN for use with LiDAR data on a binary classification task.
Closest to VoxNet is the authors' own prior work [28], which introduced 3D CNNs for landing zone detection in UAVs.
Compared to [28], this paper also studies different representations of occupancy and proposes techniques to improve performance when the data varies significantly in scale and orientation.
[27] D. Prokhorov, “A convolutional learning system for object classification in 3-D lidar data,” IEEE TNN, vol. 21, no. 5, pp. 858–863, May 2010.
[28] D. Maturana and S. Scherer, “3D convolutional neural networks for landing zone detection from lidar,” in ICRA, 2015.
The segment is usually given by the intersection of a point cloud with a bounding box and may include background clutter.
Proposed task: predict an object class label for the segment.
Proposed system (two main components): "Our system for this task has two main components:"
First reason for using occupancy grids: they allow us to efficiently estimate free, occupied and unknown space from range measurements, even for measurements coming from different viewpoints and time instants.
This representation is richer than those which only consider occupied space versus free space, such as point clouds, as the distinction between free and unknown space can potentially be a valuable shape cue.
Second reason: (details omitted in this note).
In the volumetric representation, each point (x, y, z) is mapped to discrete voxel coordinates (i, j, k).
The mapping is a uniform discretization, but it depends on the origin, orientation and resolution of the voxel grid in space; the appearance of the voxelized objects depends heavily on these three parameters.
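As a concrete illustration, a minimal numpy sketch of this mapping (not the authors' code; the function name, argument names and the yaw sign convention are assumptions):

```python
import numpy as np

def voxelize(points, origin, yaw, resolution, grid_shape=(32, 32, 32)):
    """Map (x, y, z) points to discrete voxel coordinates (i, j, k).

    origin, yaw (rotation about z) and resolution are the three parameters
    the mapping depends on; all names here are illustrative.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    # Express the points in the grid frame, then discretize uniformly.
    local = (points - origin) @ R
    ijk = np.floor(local / resolution).astype(int)
    # Keep only points that fall inside the fixed-size grid.
    inside = np.all((ijk >= 0) & (ijk < np.array(grid_shape)), axis=1)
    return ijk[inside]
```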
We assume that the z axis of the grid frame is approximately aligned with the direction of gravity.
This can be achieved with an IMU or simply keeping the sensor upright.
This still leaves a degree of freedom, the rotation around the z axis (yaw).
If we defined a canonical orientation for each object and were capable of detecting this orientation automatically, it would be reasonable to always align the grid to this orientation.
However, it is often non-trivial in practice to detect this orientation from sparse and noisy point clouds.
In this paper we propose a simple alternative based on data augmentation, discussed in III-F.
We adopt two strategies, depending on the dataset.
For the LiDAR dataset: a fixed spatial resolution is used, e.g. voxels of (0.1 m)^3.
For the other datasets: the resolution is chosen so that the object of interest occupies a subvolume of 24 × 24 × 24 voxels.
In all experiments a fixed occupancy grid of 32 × 32 × 32 voxels is used.
The trade-off between these two strategies: a fixed physical resolution preserves the information carried by the relative scale of objects (e.g., cars and persons tend to have a consistent physical size), whereas an object-relative resolution avoids problems when the voxels are too small (so that the object is larger than the grid) or when the voxels are too large (so that details are lost by aliasing).
Let $$\{z^t\}_{t=1}^{T}$$ be a sequence of range measurements that either hit ($$z^t = 1$$) or pass through ($$z^t = 0$$) a given voxel with coordinates (i, j, k).
Assuming an ideal beam sensor model, we use 3D ray tracing [32] to calculate the number of hits and pass-throughs for each voxel.
Given this information, three different occupancy grid models are considered to estimate occupancy:
In this model, each voxel is assumed to have a binary state, occupied or unoccupied.
The probabilistic estimate of occupancy for each voxel is computed with log odds for numerical stability.
Using the formulation from [31], we update each voxel traversed by the beam as
$$ l^t_{ijk} = l^{t-1}_{ijk} + z^t l_{occ} + (1-z^t) l_{free} $$
Empirically we found that within reasonable ranges these parameters had little effect on the final outcome.
The initial probability of occupancy is set to 0.5, i.e. $$l^0_{ijk}=0$$.
In this case, the network acts on the log-odds values $$l_{ijk}$$.
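A minimal numpy sketch of this log-odds update, assuming per-voxel hit and pass-through counts from the ray-tracing step; the l_occ / l_free values and the clipping range are illustrative, since the text only notes that reasonable values had little effect:

```python
import numpy as np

def binary_occupancy_grid(hits, passes, l_occ=1.38, l_free=-1.38, l_clip=4.0):
    """Log-odds occupancy from per-voxel hit / pass-through counts.

    hits and passes are integer arrays of shape (32, 32, 32); starting from
    l^0 = 0 (occupancy probability 0.5), each hit adds l_occ and each
    pass-through adds l_free, as in the update equation above.
    """
    l = hits * l_occ + passes * l_free
    # Clipping the log odds to a bounded range is a common practice (illustrative here).
    return np.clip(l, -l_clip, l_clip)
```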
Density grid model: each voxel is assumed to have a continuous density, corresponding to the probability that the voxel would block a sensor beam. (Details omitted in this note.)
Hit grid model: this model only considers hits, and ignores the difference between unknown and free space. (Details omitted in this note.)
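For completeness, a hedged numpy sketch of the other two models plus the input scaling described later; since the density-model details are omitted above, the smoothed hit fraction below is only a plausible stand-in, not the paper's exact formula:

```python
import numpy as np

def density_grid(hits, passes):
    """Continuous per-voxel value: a smoothed fraction of beams that hit.
    (Stand-in formula; the exact update from the paper is omitted in this note.)"""
    return (hits + 1.0) / (hits + passes + 2.0)

def hit_grid(hits):
    """Hit grid: uses hits only, ignoring the free/unknown distinction."""
    return (hits > 0).astype(np.float32)

def to_network_input(grid):
    """Subtract 0.5 and multiply by 2 so cell values lie in the (-1, 1) range."""
    return 2.0 * (grid - 0.5)
```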
There are three main reasons CNNs are an attractive option for this task.
First, they can explicitly make use of the spatial structure of our problem.
In particular, they can learn local spatial filters useful to the classification task.
In our case, we expect the filters at the input level to encode spatial structures such as planes and corners at different orientations.
This layer accepts a fixed-size grid of I×J×K voxels. In this work, we use I = J = K = 32.
Depending on the occupancy model, each value for each grid cell is computed from Equation 1 (binary occupancy grid model), Equation 2 (density grid model) or Equation 3 (hit grid model).
In all three cases we subtract 0.5 and multiply by 2, so the input is in the (−1, 1) range; no further preprocessing is done.
While this work only considers scalar-valued inputs, our implementation can trivially accept additional values per cell, such as LiDAR intensity values or RGB information from cameras.
These layers accept four dimensional input volumes in which three of the dimensions are spatial, and the fourth contains the feature maps.
The layer creates $$f$$ feature maps by convolving the input with $$f$$ learned filters of shape $$d × d × d × f'$$, where $$d$$ are the spatial dimensions and $$f'$$ is the number of input feature maps.
Convolution can also be applied at a spatial stride $$s$$.
The output is passed through a leaky rectified linear unit (ReLU) [35] with parameter 0.1.
Fully connected layers have $$n$$ output neurons.
The output of each neuron is a learned linear combination of all the outputs from the previous layer, passed through a nonlinearity.
We use ReLUs save for the final output layer, where the number of outputs corresponds to the number of class labels and a softmax nonlinearity is used to provide a probabilistic output.
In our previous work [28] we performed extensive stochastic search over hundreds of 3D CNN architectures on a simple classification task with simulated LiDAR data.
[28] D. Maturana and S. Scherer, “3D convolutional neural networks for landing zone detection from lidar,” in ICRA, 2015.
Several of the best-performing networks had a small number of parameters in comparison to state-of-the-art networks used for image data; [7] has around 60 million parameters, while the majority of our best models used less than 2 million.
While it is difficult to compare these numbers meaningfully, given the vast differences in tasks and datasets, we speculate that volumetric classification for point clouds is in some sense a simpler task, as many of the factors of variation in image data (perspective, illumination, viewpoint effects) are diminished or not present.
Guided by this precedent, the base model, VoxNet, is C(32, 5, 2) − C(32, 3, 1) − P(2) − FC(128) − FC(K), where K is the number of classes.
VoxNet is essentially a simpler version of the two-stage model reported in [28].
The changes aimed to reduce the number of parameters and increase computational efficiency, making the network easier and faster to learn.
The model has 921736 parameters, most of them from inputs to the first dense layer.
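A minimal PyTorch sketch of this C(32,5,2)−C(32,3,1)−P(2)−FC(128)−FC(K) stack; the original implementation used Theano + Lasagne, so this re-implementation is purely illustrative (dropout is omitted, and the leaky-ReLU slope 0.1 follows the layer descriptions above):

```python
import torch.nn as nn

class VoxNet(nn.Module):
    """Illustrative re-implementation of the VoxNet layer stack."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2),   # C(32, 5, 2): 32^3 -> 14^3
            nn.LeakyReLU(0.1),
            nn.Conv3d(32, 32, kernel_size=3, stride=1),  # C(32, 3, 1): 14^3 -> 12^3
            nn.LeakyReLU(0.1),
            nn.MaxPool3d(2),                             # P(2): 12^3 -> 6^3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6 * 6, 128),              # FC(128): holds most of the parameters
            nn.LeakyReLU(0.1),
            nn.Linear(128, num_classes),                 # FC(K); softmax is applied in the loss here
        )

    def forward(self, x):  # x: (batch, 1, 32, 32, 32) occupancy grid
        return self.classifier(self.features(x))
```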
Maintaining a consistent orientation of objects around their z axis is important, but it is nontrivial to do so in practice.
To counter this problem, many features for point clouds are designed to be rotationally invariant (e.g. [36], [37]).
The representation used here has no built-in invariance to large rotations; instead, the problem is handled differently, as follows.
At training time, the dataset is augmented with $$n$$ copies of each input instance, each rotated by 360°/n intervals around the z axis; at test time, the outputs over the $$n$$ rotated copies are pooled.
The value of n is 12 or 18.
This can be seen as a voting approach, similar to how networks such as [7] average predictions over random crops and flips of the input image.
The approach is inspired by the interpretation of convolution as weight sharing across translations; implicitly, we are sharing weights across rotations.
Initially, this approach was implemented by max-pooling or mean-pooling the dense layers of the network during training in the same way as during test time.
However, the approach described above yielded comparable results while converging noticeably faster, so it was adopted instead.
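A numpy sketch of this augmentation/voting scheme, rotating the raw points before voxelization; the voxelize_fn and predict_fn arguments are placeholders for the real pipeline:

```python
import numpy as np

def rotate_z(points, angle):
    """Rotate an (N, 3) point array by `angle` radians around the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T

def rotated_copies(points, n=12):
    """Training-time augmentation: n copies rotated at 360/n degree intervals."""
    return [rotate_z(points, 2.0 * np.pi * k / n) for k in range(n)]

def vote_over_rotations(points, voxelize_fn, predict_fn, n=12):
    """Test-time voting: average the class probabilities over the n rotated copies."""
    probs = [predict_fn(voxelize_fn(p)) for p in rotated_copies(points, n)]
    return np.mean(probs, axis=0)
```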
Visual inspection of the LiDAR dataset suggested that a (0.2 m)^3 resolution preserves all the information necessary for classification, while allowing sufficient spatial context for most larger objects such as trucks and trees.
However, the authors hypothesized that a finer resolution would help in discriminating other classes such as traffic signs and traffic lights, especially for sparser data.
Therefore, a multiresolution VoxNet was implemented, inspired by the “foveal” architecture of [24] for video analysis.
In this model two networks with identical VoxNet architectures are used, each receiving occupancy grids at a different resolution: (0.1 m)^3 and (0.2 m)^3. Both inputs are centered on the same location, and the information from both networks is fused to produce the final class prediction.
[24] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
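A hedged PyTorch sketch of the two-resolution model; fusing the two 128-dimensional feature vectors by concatenation is an assumption of this sketch, since the exact fusion step is not recorded in this note:

```python
import torch
import torch.nn as nn

def voxnet_tower():
    """One VoxNet-style tower up to its 128-dimensional fully connected features."""
    return nn.Sequential(
        nn.Conv3d(1, 32, 5, stride=2), nn.LeakyReLU(0.1),
        nn.Conv3d(32, 32, 3), nn.LeakyReLU(0.1),
        nn.MaxPool3d(2), nn.Flatten(),
        nn.Linear(32 * 6 * 6 * 6, 128), nn.LeakyReLU(0.1),
    )

class MultiResVoxNet(nn.Module):
    """Two identical towers, one per resolution, fused before the class layer."""
    def __init__(self, num_classes):
        super().__init__()
        self.fine = voxnet_tower()    # (0.1 m)^3 occupancy grid
        self.coarse = voxnet_tower()  # (0.2 m)^3 occupancy grid, same center
        self.out = nn.Linear(2 * 128, num_classes)

    def forward(self, grid_fine, grid_coarse):
        feats = torch.cat([self.fine(grid_fine), self.coarse(grid_coarse)], dim=1)
        return self.out(feats)
```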
Training of the network parameters is performed by Stochastic Gradient Descent (SGD) with momentum.
The objective is the multinomial negative log-likelihood plus 0.001 times the L2 weight norm for regularization.
SGD is initialized with a learning rate (exact value not recorded in this note).
The momentum parameter was 0.9.
Batch size is 32.
The learning rate was decreased by a factor of 10 each 8000 batches for the LiDAR dataset and each 40000 batches in the other datasets.
Dropout regularization is added after the output of each layer.
Convolutional layers were initialized with the method proposed by [38], whereas dense layers were initialized from a zero-mean Gaussian with σ = 0.01.
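A hedged PyTorch sketch of this training configuration, reusing the VoxNet class from the sketch above; the learning-rate value is a placeholder because the note leaves it blank, dropout is omitted, and the original code was Theano + Lasagne:

```python
import torch
import torch.nn as nn

model = VoxNet(num_classes=14)                   # e.g. the 14 Sydney classes (see evaluation below)
initial_lr = 1e-3                                # placeholder: exact value not recorded in this note
optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr, momentum=0.9,
                            weight_decay=0.001)  # L2 regularization, coefficient from the text
criterion = nn.CrossEntropyLoss()                # multinomial negative log-likelihood
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8000, gamma=0.1)

def train_step(grids, labels):
    """One SGD step on a batch of 32 occupancy grids of shape (32, 1, 32, 32, 32)."""
    optimizer.zero_grad()
    loss = criterion(model(grids), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                             # decay LR by 10x every 8000 batches (LiDAR setting)
    return loss.item()
```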
Following common practices for CNN training, we augment the data by adding randomly perturbed copies of each instance.
Mirroring is done along the x and y axes; shifting is done by −2 to 2 voxels along all axes.
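A small numpy sketch of these random perturbations, applied here directly to the occupancy grid (an assumption; they could equally be applied to the points before voxelization):

```python
import numpy as np

def perturb(grid, rng=np.random):
    """Randomly mirror along x/y and shift by -2..2 voxels along each axis."""
    g = grid
    if rng.rand() < 0.5:
        g = g[::-1, :, :]                        # mirror along x
    if rng.rand() < 0.5:
        g = g[:, ::-1, :]                        # mirror along y
    shifts = tuple(int(s) for s in rng.randint(-2, 3, size=3))
    # np.roll wraps around at the borders; used here as a simple stand-in for a small shift.
    return np.roll(g, shifts, axis=(0, 1, 2)).copy()
```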
Implementation language: a combination of C++ and Python.
Library: the Lasagne library was used to compute gradients and accelerate computations on the GPU.
Training time: around 6 to 12 hours on a K40 GPU, depending on the complexity of the network.
To evaluate VoxNet we consider benchmarks with data from three different domains:
Sydney dataset: labeled Velodyne LiDAR scans of 631 urban objects in 26 categories.
Why this dataset: it provides labeled object instances and the LiDAR viewpoint, which is used to compute occupancy.
When voxelizing the point cloud we use all points in a bounding box around the object, including background clutter.
Evaluation metric: the average F1 score, weighted by class support, for a subset of 14 classes over four standard training/testing splits.
Augmentation and voting: 18 rotations per instance are used for this dataset.
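For reference, the support-weighted F1 can be computed with scikit-learn (a generic illustration with toy labels, not the authors' evaluation script):

```python
from sklearn.metrics import f1_score

# y_true / y_pred: ground-truth and predicted class labels for one train/test split (toy values).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(f1_score(y_true, y_pred, average="weighted"))  # F1 averaged with class-support weights
```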
(The evaluations on the other two benchmark domains are omitted from this note.)
The three domains: LiDAR + RGBD + CAD data.
Dense 3D occupancy grids obtained from point clouds are processed with CNNs in VoxNet.
- Other authors have also explored variations of **voxelization** methods: `Other authors have explored variations of voxelization methods including,`
- binary occupancy grid,
- density grid,
- hit grid.
- VoxNet tested each of these voxelization models: `In VoxNet, Maturana and Scherer (2015) tested each voxelization model individually, to train 3D-CNNs with 32x32x32 grid inputs.`
- To handle multi-resolution inputs, they trained two separate networks, each receiving an occupancy grid with a different resolution. Parallel development of both multiview and volumetric CNNs has resulted in an empirical performance gap.
Maturana and Scherer have also employed a volumetric (i.e., spatially 3D) representation of the 3D data to perform 3D object recognition [Maturana and Scherer 2015].
- In the proposed VoxNet architecture, a volumetric occupancy grid of size 32×32× 32 voxels was at first generated from a point cloud’s segment that was then given as input to a CNN.
- The employed network was constructed using two convolutional (with 3D filters), one pooling, and two FC layers, while it was trained using SGD with momentum.
- An object class label was finally predicted for each segment.
- Data from three different domains were used for evaluating VoxNet.
- LiDAR point clouds
- RGB-D point clouds
- CAD models
A similar approach is VoxNet [24], which also uses binary voxel grids and a corresponding 3D CNN architecture.
The advantage of these approaches is that they can process different sources of 3D data, including LiDAR point clouds, RGB-D point clouds, and CAD models; we likewise follow this direction. [Important] 3DShapeNets & VoxNet share the advantage that they can also be applied to LiDAR data.
Dense 3D occupancy grids obtained from point clouds are processed with CNNs in [8-Voxnet] and [9-Daniel2015a=Landingzone].
With a minimum cell size of 0.1m, [8-Voxnet] reports a speed of 6ms on a GPU to classify a single crop with a grid-size of 32×32×32 cells.
Similarly, a processing time of 5ms per $$m^3$$ for landing zone detection is reported in [9].
With 3D point clouds often being larger than 60m × 60m × 5m, this would result in a processing time of $$60×60×5×5×10^{−3} = 90s$$ per frame, which does not comply with speed requirements typically encountered in robotics applications.
[16] introduced VoxNet, a 3D CNN for 3D point cloud data and voxelized models, which performed significantly better than [27].
Recently CNNs have been applied to 3D shapes by representing them as 3D occupancy grids, and building generative [39-ShapeNet] or discriminative [26-VoxNet] networks.
VoxNet [13] introduces three different occupancy grids (32 × 32 × 32 voxels) that employ 3D ray tracing to compute the number of beams hitting or passing each voxel and then use that information to compute the value of each voxel depending on the chosen model:
- a binary occupancy grid using probabilistic estimates,
- a density grid in which each voxel holds a value corresponding to the probability that it will block a sensor beam,
- a hit grid that only considers hits thus ignoring empty or unknown space.
The binary and density grids proposed by Maturana et al. [13] differentiate unknown and empty space, whilst the hit grid and the binary tensor do not.
Volumetric CNNs: [28-ShapeNet, 17-VoxNet, 18-VMCNN] are the pioneers applying 3D convolutional neural networks on voxelized shapes. However, volumetric representation is constrained by its resolution due to data sparsity and computation cost of 3D convolution.
[VoxNet (extended version)] boosted VoxNet
Sedaghat et al. [2016] modified VoxNet’s architecture in such a way that the object’s orientation was taken into account.
In their final model, the class labels were extracted directly from the orientation activations.
[boosted VoxNet] N. Sedaghat, M. Zolfaghari, and Th. Brox. 2016. Orientation-boosted voxel nets for 3D object recognition. CoRR abs/1604.03351 (2016).
[VoxNet (prior work)] 3D convolutional neural networks for landing zone detection from lidar (Daniel2015a)
[Daniel2015a] introduced 3D CNNs for landing zone detection from LiDAR on UAVs, and is the direct precursor to VoxNet.