Paper title | VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection |
Authors (affiliation) | Yin Zhou (Apple) |
Venue / Year | arXiv Nov 2017, paper |
Citation ID / Keywords | Yin2017VoxelNet, LiDAR Only |
Dataset (sensor) / Model | KITTI |
Related work | |
References | post, Apple's self-driving car technology 'VoxelNet'… far surpasses the latest LiDAR capabilities |
Code | docker push adioshun/voxelnet, VoxelNet-ROS, TF / Docker, pytorch, TF |
To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird’s eye view projection.
In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet,
a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network.
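Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer.
In this way, the point cloud is encoded as a descriptive volumetric representation,
which is then connected to a RPN to generate detections.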
LiDAR is used in many fields, including applications such as autonomous driving where reliability must be guaranteed.
LiDAR point clouds are sparse and have highly variable point density, due to factors such as non-uniform sampling of the 3D space, effective range of the sensors, occlusion, and the relative pose.
To handle these challenges, many approaches manually crafted feature representations for point clouds that are tuned for 3D object detection.
Several methods project point clouds into a perspective view and apply image-based feature extraction techniques [28, 15, 22].
Other approaches rasterize point clouds into a 3D voxel grid and encode each voxel with handcrafted features [41, 9, 37, 38, 21, 5-MV3D].
However, these manual design choices introduce an information bottleneck that prevents these approaches from effectively exploiting 3D shape information and the required invariances for the detection task.
A major breakthrough in recognition [20] and detection [13] tasks on images was due to moving from hand-crafted features to machine-learned features.
Recently, Qi et al. [29] proposed PointNet, an end-to-end deep neural network that learns point-wise features directly from point clouds.
This approach demonstrated impressive results on 3D object recognition, 3D object part segmentation, and point-wise semantic segmentation tasks.
In [30], an improved version of PointNet was introduced which enabled the network to learn local structures at different scales.
To achieve satisfactory results, these two approaches trained feature transformer networks on all input points (∼1k points).
Since typical point clouds obtained using LiDARs contain ∼100k points, training the architectures as in [29, 30] results in high computational and memory requirements.
Scaling up 3D feature learning networks to orders of magnitude more points and to 3D detection tasks are the main challenges that we address in this paper.
Region proposal network (RPN) [32] is a highly optimized algorithm for efficient object detection [17, 5, 31, 24].
However, this approach requires data to be dense and organized in a tensor structure (e.g. image, video) which is not the case for typical LiDAR point clouds.
In this paper, we close the gap between point set feature learning and RPN for 3D detection task.
We present VoxelNet, a generic 3D detection framework that simultaneously learns a discriminative feature representation from point clouds and predicts accurate 3D bounding boxes, in an end-to-end fashion, as shown in Figure 2.
[Figure 2. VoxelNet architecture.]
- The feature learning network takes a raw point cloud as input,
- partitions the space into voxels, and transforms points within each voxel to a vector representation characterizing the shape information.
- The space is represented as a sparse 4D tensor.
- The convolutional middle layers process the 4D tensor to aggregate spatial context.
- Finally, a RPN generates the 3D detection.
We design a novel voxel feature encoding (VFE) layer, which enables inter-point interaction within a voxel, by combining point-wise features with a locally aggregated feature.
Stacking multiple VFE layers allows learning complex features for characterizing local 3D shape information.
Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, encodes each voxel via stacked VFE layers,
and then 3D convolution further aggregates local voxel features, transforming the point cloud into a high-dimensional volumetric representation.
Finally, a RPN consumes the volumetric representation and yields the detection result.
This efficient algorithm benefits both from the sparse point structure and efficient parallel processing on the voxel grid.
We evaluate VoxelNet on the bird’s eye view detection and the full 3D detection tasks, provided by the KITTI benchmark [11].
Experimental results show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin.
We also demonstrate that VoxelNet achieves highly encouraging results in detecting pedestrians and cyclists from LiDAR point cloud.
Rapid development of 3D sensor technology has motivated researchers to develop efficient representations to detect and localize objects in point clouds.
Some of the earlier methods for feature representation are [39, 8, 7, 19, 40, 33, 6, 25, 1, 34, 2].
These hand-crafted features yield satisfactory results when rich and detailed 3D shape information is available.
However, their inability to adapt to more complex shapes and scenes, and to learn required invariances from data, resulted in limited success for uncontrolled scenarios such as autonomous navigation.
Given that images provide detailed texture information, many algorithms inferred the 3D bounding boxes from 2D images [4, 3, 42, 43, 44, 36].
However, the accuracy of image-based 3D detection approaches is bounded by the accuracy of the depth estimation.
Several LiDAR based 3D object detection techniques utilize a voxel grid representation.
[41, 9-Vote3deep] encode each nonempty voxel with 6 statistical quantities that are derived from all the points contained within the voxel.
Several other studies project point clouds onto a perspective view and then use image-based feature encoding schemes [28, 15, 22].
There are also several multi-modal fusion methods that combine images and LiDAR to improve detection accuracy [10, 16, 5].
These methods provide improved performance compared to LiDAR-only 3D detection, particularly for small objects (pedestrians, cyclists) or when the objects are far, since cameras provide an order of magnitude more measurements than LiDAR.
However, the need for an additional camera that is time synchronized and calibrated with the LiDAR restricts their use and makes the solution more sensitive to sensor failure modes.
In this work we focus on LiDAR-only detection.
We propose a novel end-to-end trainable deep architecture for point-cloud-based 3D detection, VoxelNet, that directly operates on sparse 3D points and avoids information bottlenecks introduced by manual feature engineering.
We present an efficient method to implement VoxelNet which benefits both from the sparse point structure and
efficient parallel processing on the voxel grid.
We conduct experiments on KITTI benchmark and show that VoxelNet produces state-of-the-art results in LiDAR-based car, pedestrian, and cyclist detection benchmarks.
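The proposed VoxelNet consists of three functional blocks: (1) the feature learning network, (2) the convolutional middle layers, and (3) the region proposal network [32], as illustrated in Figure 2.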
Given a point cloud, we subdivide the 3D space into equally spaced voxels as shown in Figure 2.
Suppose the point cloud encompasses 3D space with range D, H, W along the Z, Y, X axes respectively.
We define each voxel of size $$v_D$$, $$v_H$$, and $$v_W$$ accordingly.
The resulting 3D voxel grid is of size $$D\prime = \frac {D}{v_D}, H\prime = \frac{H}{v_H}, W\prime= \frac{W}{v_W}$$ .
Here, for simplicity, we assume D, H, W are a multiple of $$v_D, v_H, v_W$$ .
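As a quick check of the grid-size formula, the following Python sketch computes $$(D\prime, H\prime, W\prime)$$ from axis ranges and voxel sizes. The helper name `voxel_grid_shape` and the numeric ranges and voxel sizes are assumptions chosen for illustration; they are not stated in this section.

```python
def voxel_grid_shape(ranges, voxel_size):
    """Return (D', H', W') for axis-aligned ranges and voxel sizes.

    ranges:     ((z_min, z_max), (y_min, y_max), (x_min, x_max)) in meters
    voxel_size: (v_D, v_H, v_W) in meters
    """
    shape = []
    for (lo, hi), v in zip(ranges, voxel_size):
        n = (hi - lo) / v
        # The text assumes each extent is a multiple of the voxel size;
        # round() only absorbs floating-point error.
        if abs(n - round(n)) > 1e-6:
            raise ValueError("extent is not a multiple of the voxel size")
        shape.append(int(round(n)))
    return tuple(shape)


# Assumed example values in (Z, Y, X) order, for illustration only.
ranges = ((-3.0, 1.0), (-40.0, 40.0), (0.0, 70.4))
voxel_size = (0.4, 0.2, 0.2)
grid_shape = voxel_grid_shape(ranges, voxel_size)
print(grid_shape)
```

With these assumed values the grid has 10 × 400 × 352 cells, which gives a sense of why most voxels end up empty for a ∼100k-point scan.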
We group the points according to the voxel they reside in.
Due to factors such as distance, occlusion, the object's relative pose, and non-uniform sampling, the LiDAR point cloud is sparse and has highly variable point density throughout the space.
Therefore, after grouping, a voxel will contain a variable number of points.
An illustration is shown in Figure 2, where Voxel-1 has significantly more points than Voxel-2 and Voxel-4, while Voxel-3 contains no point.
Typically a high-definition LiDAR point cloud is composed of ∼100k points.
Directly processing all the points not only imposes increased memory/efficiency burdens on the computing platform, but also highly variable point density throughout the space might bias the detection.
To this end, we randomly sample a fixed number, T, of points from those voxels containing more than T points.
This sampling strategy has two purposes:
(1) computational savings (see Section 2.3 for details); and
(2) decreasing the imbalance of points between the voxels, which reduces the sampling bias and adds more variation to training.
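The grouping and random sampling steps can be sketched with NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name `group_and_sample`, the dictionary output, and the value T = 35 are assumptions, points are assumed to lie inside the volume already, and the per-voxel loop favors clarity over speed.

```python
import numpy as np


def group_and_sample(points, origin, voxel_size, T, rng):
    """Group points by voxel and keep at most T points per voxel.

    points:     (N, 4) array of [x, y, z, reflectance]
    origin:     (3,) minimum corner of the volume in (x, y, z)
    voxel_size: (3,) voxel edge lengths in (x, y, z)
    Returns a dict mapping voxel index (ix, iy, iz) -> (t, 4) array, t <= T.
    Empty voxels never appear in the dict.
    """
    idx = np.floor((points[:, :3] - origin) / voxel_size).astype(np.int64)
    # Unique rows are the non-empty voxels; inverse maps each point to its voxel.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    voxels = {}
    for v, coord in enumerate(coords):
        members = np.flatnonzero(inverse == v)
        if len(members) > T:
            # Random sampling without replacement caps the voxel at T points.
            members = rng.choice(members, size=T, replace=False)
        voxels[tuple(int(c) for c in coord)] = points[members]
    return voxels


rng = np.random.default_rng(0)
pts = np.concatenate([
    rng.uniform([0, 0, 0, 0], [1, 1, 1, 1], size=(50, 4)),  # dense voxel (0, 0, 0)
    rng.uniform([2, 0, 0, 0], [3, 1, 1, 1], size=(3, 4)),   # sparse voxel (2, 0, 0)
])
voxels = group_and_sample(pts, origin=np.zeros(3), voxel_size=np.ones(3), T=35, rng=rng)
```

Here the dense voxel is cut from 50 points down to 35 while the sparse voxel keeps its 3 points, which narrows the gap in point counts between voxels.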
The key innovation is the chain of VFE layers.
For simplicity, Figure 2 illustrates the hierarchical feature encoding process for one voxel.
Without loss of generality, we use VFE Layer-1 to describe the details in the following paragraph.
Figure 3 shows the architecture for VFE Layer-1.
The FCN is composed of a linear layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) layer.
After obtaining point-wise feature representations,
we use element-wise MaxPooling across all $$f_i$$ associated with $$V$$
to get the locally aggregated feature for $$V$$. All non-empty voxels are encoded in the same way and they share the same set of parameters in FCN.
We use $$\text{VFE-}i(c_{in}, c_{out})$$ to denote the i-th VFE layer, which transforms input features of dimension $$c_{in}$$ into output features of dimension $$c_{out}$$.
Because the output feature combines both point-wise features and locally aggregated feature, stacking VFE layers encodes point interactions within a voxel and enables the final feature representation to learn descriptive shape information.
The voxel-wise feature
is obtained by transforming the output of $$\text{VFE-}n$$ into $$\Re^C$$ via FCN and applying element-wise Maxpool, where $$C$$ is the dimension of the voxel-wise feature.
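A single VFE layer for one voxel can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: batch normalization is omitted, the weights are random rather than learned, the 7-dimensional input and the layer widths are assumed values, and the final voxel-wise feature is taken by max-pooling directly (skipping the last FCN). The name `vfe_layer` is hypothetical.

```python
import numpy as np


def vfe_layer(points_feat, weight, bias):
    """One VFE layer for a single voxel (batch normalization omitted).

    points_feat: (t, c_in) features of the t points in the voxel
    weight:      (c_in, c_out // 2) linear-layer matrix
    bias:        (c_out // 2,)
    Returns (t, c_out): each point-wise feature concatenated with the
    voxel's locally aggregated (max-pooled) feature.
    """
    pointwise = np.maximum(points_feat @ weight + bias, 0.0)  # linear + ReLU
    aggregated = pointwise.max(axis=0, keepdims=True)         # element-wise max over points
    tiled = np.repeat(aggregated, pointwise.shape[0], axis=0)
    return np.concatenate([pointwise, tiled], axis=1)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 7))                        # 5 points, 7 input features (assumed)
w1, b1 = rng.normal(size=(7, 16)), np.zeros(16)    # acts like VFE-1(7, 32)
w2, b2 = rng.normal(size=(32, 64)), np.zeros(64)   # acts like VFE-2(32, 128)
out = vfe_layer(vfe_layer(x, w1, b1), w2, b2)      # shape (5, 128)
voxel_feature = out.max(axis=0)                    # simplified final pooling, shape (128,)
```

Because the second half of every output row is the same pooled vector, each point carries a summary of its neighbors into the next layer, which is how stacking encodes interactions within the voxel.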
By processing only the non-empty voxels, we obtain a list of voxel features,
each uniquely associated to the spatial coordinates of a particular non-empty voxel.
The obtained list of voxel-wise features can be represented as a sparse 4D tensor, of size $$C \times D\prime \times H\prime \times W\prime$$ as shown in Figure 2.
Although the point cloud contains ∼100k points, more than 90% of voxels typically are empty.
Representing non-empty voxel features as a sparse tensor greatly reduces the memory usage and computation cost during backpropagation, and it is a critical step in our efficient implementation.
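To make the relationship between the feature list and the 4D tensor concrete, the sketch below scatters a list of voxel features into a dense array using each voxel's coordinates. It is only an illustration: `scatter_to_dense` is a hypothetical helper, and an actual implementation would keep the data sparse for as long as possible rather than materialize the zeros.

```python
import numpy as np


def scatter_to_dense(voxel_features, voxel_coords, grid_shape):
    """Place K voxel features into a dense (C, D', H', W') array.

    voxel_features: (K, C) one feature vector per non-empty voxel
    voxel_coords:   (K, 3) integer (d, h, w) index of each voxel
    grid_shape:     (D', H', W')
    Empty voxels stay zero.
    """
    C = voxel_features.shape[1]
    dense = np.zeros((C, *grid_shape), dtype=voxel_features.dtype)
    d, h, w = voxel_coords.T
    dense[:, d, h, w] = voxel_features.T  # (C, K) written at the K voxel positions
    return dense


feats = np.array([[1.0, 2.0], [3.0, 4.0]])   # K = 2 non-empty voxels, C = 2
coords = np.array([[0, 1, 2], [1, 0, 0]])
dense = scatter_to_dense(feats, coords, (2, 2, 3))
occupied = np.count_nonzero(dense.any(axis=0))
```

Only 2 of the 12 cells are occupied in this toy grid; storing the `(feats, coords)` pair instead of `dense` is what saves memory when most voxels are empty.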
The convolutional middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description.
The detailed sizes of the filters in the convolutional middle layers are explained in Section 3.
Recently, region proposal networks [32-Faster R-CNN] have become an important building block of top-performing object detection frameworks [38, 5, 23].
In this work, we make several key modifications to the RPN architecture proposed in [32], and combine it with the feature learning network and convolutional middle layers to form an end-to-end trainable pipeline.
The input to our RPN is the feature map provided by the convolutional middle layers.
The network has three blocks of fully convolutional layers.
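Let $$\{a^{pos}_i\}_{i=1 \ldots N_{pos}}$$ be the set of $$N_{pos}$$ positive anchors and $$\{a^{neg}_j\}_{j=1 \ldots N_{neg}}$$ be the set of $$N_{neg}$$ negative anchors.
We parameterize a 3D ground truth box as $$(x^g_c, y^g_c, z^g_c, l^g, w^g, h^g, \theta^g)$$, where $$x^g_c, y^g_c, z^g_c$$ represent the center location, $$l^g, w^g, h^g$$ are the length, width, and height of the box, and
$$\theta^g$$ is the yaw rotation around the Z-axis.
To retrieve the ground truth box from a matching positive anchor parameterized as $$(x^a_c, y^a_c, z^a_c, l^a, w^a, h^a, \theta^a)$$, we define the residual vector $$\mathbf{u}^\star \in \Re^7$$ containing the 7 regression targets corresponding to center location ∆x, ∆y, ∆z, three dimensions ∆l, ∆w, ∆h, and the rotation ∆θ, which are computed as:
$$\Delta x = \frac{x^g_c - x^a_c}{d^a}, \quad \Delta y = \frac{y^g_c - y^a_c}{d^a}, \quad \Delta z = \frac{z^g_c - z^a_c}{h^a},$$
$$\Delta l = \log\left(\frac{l^g}{l^a}\right), \quad \Delta w = \log\left(\frac{w^g}{w^a}\right), \quad \Delta h = \log\left(\frac{h^g}{h^a}\right),$$
$$\Delta \theta = \theta^g - \theta^a,$$
where $$d^a = \sqrt{(l^a)^2 + (w^a)^2}$$ is the diagonal of the base of the anchor box.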
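The residual encoding can be sketched in plain Python. Center offsets in x and y are normalized by the diagonal of the anchor's base, the z offset by the anchor height, the three sizes use log ratios, and the yaw is a plain difference. The function names `encode_box` and `decode_box` and the numeric anchor and box values are assumptions for illustration.

```python
import math


def encode_box(gt, anchor):
    """Residual targets (dx, dy, dz, dl, dw, dh, dtheta) of gt w.r.t. anchor.

    Both boxes are tuples (x_c, y_c, z_c, l, w, h, theta).
    """
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = math.hypot(la, wa)  # diagonal of the anchor's base
    return (
        (xg - xa) / da,
        (yg - ya) / da,
        (zg - za) / ha,
        math.log(lg / la),
        math.log(wg / wa),
        math.log(hg / ha),
        tg - ta,
    )


def decode_box(residual, anchor):
    """Invert encode_box: recover the box from residual targets and anchor."""
    dx, dy, dz, dl, dw, dh, dt = residual
    xa, ya, za, la, wa, ha, ta = anchor
    da = math.hypot(la, wa)
    return (
        xa + dx * da,
        ya + dy * da,
        za + dz * ha,
        la * math.exp(dl),
        wa * math.exp(dw),
        ha * math.exp(dh),
        ta + dt,
    )


anchor = (10.0, 2.0, -1.0, 3.9, 1.6, 1.56, 0.0)  # assumed anchor box
gt = (10.5, 1.8, -0.9, 4.2, 1.7, 1.5, 0.1)       # assumed ground truth box
u = encode_box(gt, anchor)
recovered = decode_box(u, anchor)
```

Decoding the residuals against the same anchor returns the original box up to floating-point error, which is the property the network relies on at inference time.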
With less than 4000 training point clouds, training our network from scratch will inevitably suffer from overfitting.
To reduce this issue, we introduce three different forms of data augmentation.
The augmented training data are generated on-the-fly without the need to be stored on disk [20].
To do: survey 3D data augmentation methods and look for papers reporting the performance gains from applying them.
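These notes do not list the three augmentation forms, so the sketch below only illustrates the general idea of on-the-fly augmentation with two common global transforms: a random scaling and a random rotation about the Z-axis, applied consistently to the points and the boxes. The function name `augment_global` and the default ranges are assumptions and should be checked against the paper before use.

```python
import numpy as np


def augment_global(points, boxes, rng, scale_range=(0.95, 1.05), max_angle=np.pi / 4):
    """Apply one random global scaling and Z-axis rotation to a scene.

    points: (N, 4) [x, y, z, reflectance]
    boxes:  (M, 7) [x_c, y_c, z_c, l, w, h, theta]
    Returns new arrays; the inputs are left unchanged.
    """
    s = rng.uniform(*scale_range)
    a = rng.uniform(-max_angle, max_angle)
    rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

    pts = points.copy()
    pts[:, :2] = pts[:, :2] @ rot.T  # rotate x, y about the Z-axis
    pts[:, :3] *= s                  # scale coordinates, keep reflectance

    bxs = boxes.copy()
    bxs[:, :2] = bxs[:, :2] @ rot.T  # rotate box centers the same way
    bxs[:, :6] *= s                  # centers and dimensions scale together
    bxs[:, 6] += a                   # yaw follows the scene rotation
    return pts, bxs


rng = np.random.default_rng(0)
points = rng.normal(size=(20, 4))
boxes = np.array([[1.0, 2.0, 0.0, 3.9, 1.6, 1.56, 0.3]])
aug_points, aug_boxes = augment_global(points, boxes, rng)
```

Calling this once per training sample inside the data loader produces a fresh variant each epoch, so nothing augmented has to be written to disk.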
Most existing methods in LiDAR-based 3D detection rely on hand-crafted feature representations, for example, a bird’s eye view projection.
In this paper, we remove the bottleneck of manual feature engineering and propose VoxelNet,
a novel end-to-end trainable deep architecture for point cloud based 3D detection.
Our approach can operate directly on sparse 3D points and capture 3D shape information effectively.
We also present an efficient implementation of VoxelNet that benefits from point cloud sparsity and parallel processing on a voxel grid.
Our experiments on the KITTI car detection task show that VoxelNet outperforms state-of-the-art LiDAR based 3D detection methods by a large margin.
On more challenging tasks, such as 3D detection of pedestrians and cyclists, VoxelNet also demonstrates encouraging results showing that it provides a better 3D representation.
Future work includes extending VoxelNet for joint LiDAR and image based end-to-end 3D detection to further improve detection and localization accuracy.
This code still needs an addition (anchor boxes) before it can detect pedestrians [source]
How to obtain validation data [source]