2018 - SECOND: Sparsely Embedded Convolutional Detection

Paper: SECOND: Sparsely Embedded Convolutional Detection
Authors (affiliation): Yan Yan (=traveller59), Bo Li ()
Venue/Year: 2018, paper
Citation ID / Keywords:
Dataset (sensor) / Model: KITTI 3D Object Detection
Related work: builds on the Sparse ConvNet idea
Notes:
Code: GitHub
| Year | 1st author | Paper | Code |
|---|---|---|---|
| 2014 | Benjamin Graham | Spatially-sparse convolutional neural networks | 2013-2015 |
| 2017 | Benjamin Graham | Submanifold Sparse Convolutional Networks | SparseConvNet |
| 2018 | Benjamin Graham | 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks | SparseConvNet (deprecated), spconv |
| 2018 | Bo Li | SECOND: Sparsely Embedded Convolutional Detection | SECOND |
| 2019 | Alex H. Lang | PointPillars: Fast Encoders for Object Detection from Point Clouds | PointPillars |

SECOND: Sparsely Embedded Convolutional Detection (work in progress, ~30%)

LiDAR is important in many fields: LiDAR-based or RGB-D-based object detection is used in numerous applications, ranging from autonomous driving to robot vision.

Voxel-based 3D convolutional networks have been used for some time to enhance the retention of information when processing point cloud LiDAR data.

However, problems remain, including a slow inference speed and low orientation estimation performance.

We therefore investigate an improved sparse convolution method for such networks, which significantly increases the speed of both training and inference.

We also introduce a new form of angle loss regression to improve the orientation estimation performance and a new data augmentation approach that can enhance the convergence speed and performance.

The proposed network produces state-of-the-art results on the KITTI 3D object detection benchmarks while maintaining a fast inference speed.

1. Introduction

State-of-the-art methods can achieve

  • an average precision (AP) of 90% for 2D car detection
  • but only an AP of 15% [7] for 3D image-based car detection.

A recent trend in 3D detection research is to fuse images and point cloud data: many current 3D detectors use a fusion method that exploits both modalities.

  • Point cloud data are converted into a 2D bird’s eye view image [8] or are projected onto an image [9,10].
  • Features are then extracted using a convolutional network, and a fusion process is applied to map the features between the image and other views.
[8] Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 3.

[9] Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S. Joint 3D Proposal Generation and Object Detection from View Aggregation. arXiv 2017, arXiv:1712.02294.

[10] Du, X.; Ang Jr, M.H.; Karaman, S.; Rus, D. A general pipeline for 3D detection of vehicles. arXiv 2018, arXiv:1803.00387.

In [11],

  • the point cloud data are initially filtered using bounding boxes generated by a 2D detector,
  • and a convolutional network is then used to directly process the points.
[11] Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. arXiv 2017, arXiv:1711.08488.

In other methods, such as those of [12,13,14,15],

  • the point cloud data are assigned to volumetric grid cells via quantization,
  • and 3D CNNs are then applied.
[12] Wang, D.Z.; Posner, I. Voting for Voting in Online Point Cloud Object Detection. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; Volume 1.

[13] Engelcke, M.; Rao, D.; Wang, D.Z.; Tong, C.H.; Posner, I. Vote3deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1355–1361.

[14] Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017, arXiv:1711.06396.

[15] Li, B. 3D fully convolutional network for vehicle detection in point cloud. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 1513–1518.

Recently, a new approach called VoxelNet [14] has been developed.

  • This approach combines raw point cloud feature extraction and voxel-based feature extraction in a single-stage end-to-end network.
  • It first groups point cloud data into voxels and then applies linear networks voxel by voxel before converting the voxels into dense 3D tensors to be used in a region proposal network (RPN) [16]; a minimal voxel-grouping sketch follows this list.
  • At present, this is a state-of-the-art approach. However, its computational cost makes it difficult to use for real-time applications.
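
Below is a minimal sketch of that voxel-grouping step, using NumPy only; the voxel size, point cloud range, and max-points-per-voxel values are illustrative assumptions rather than the exact VoxelNet/SECOND settings.

```python
import numpy as np

def group_points_into_voxels(points, voxel_size=(0.2, 0.2, 0.4),
                             pc_range=(0, -40, -3, 70.4, 40, 1), max_points=35):
    """Assign each LiDAR point to a voxel and keep up to max_points points per voxel.

    points: (N, 4) array of [x, y, z, intensity].
    Returns a list of per-voxel point arrays and the (V, 3) integer voxel coordinates.
    """
    pc_range = np.asarray(pc_range, dtype=np.float32)
    voxel_size = np.asarray(voxel_size, dtype=np.float32)

    # Keep only points inside the detection range.
    mask = np.all((points[:, :3] >= pc_range[:3]) & (points[:, :3] < pc_range[3:]), axis=1)
    points = points[mask]

    # Integer voxel indices along x, y, z.
    coords = ((points[:, :3] - pc_range[:3]) / voxel_size).astype(np.int32)

    voxels = {}
    for point, coord in zip(points, map(tuple, coords)):
        buffer = voxels.setdefault(coord, [])
        if len(buffer) < max_points:  # cap the number of points stored per voxel
            buffer.append(point)

    voxel_coords = np.array(list(voxels.keys()), dtype=np.int32)
    voxel_points = [np.stack(v) for v in voxels.values()]
    return voxel_points, voxel_coords
```

Only the occupied voxels are kept, which is what makes the later sparse processing worthwhile.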

In this paper, we present a novel approach called SECOND (Sparsely Embedded CONvolutional Detection), which addresses these challenges in 3D convolution-based detection by maximizing the use of the rich 3D information present in point cloud data.

This method incorporates several improvements to the existing convolutional network architecture.

1.1 First advantage: Spatially sparse convolutional networks

Spatially sparse convolutional networks are introduced for LiDAR-based detection and are used to extract information from the z-axis before the 3D data are downsampled to something akin to 2D image data.

  • We also use a GPU-based rule generation algorithm for sparse convolution to increase the speed.

In comparison to a dense convolution network, our sparse-convolution-based detector achieves a factor-of-4 speed enhancement during training on the KITTI dataset and a factor-of-3 improvement in the speed of inference.

As a further test, we have designed a small model for real-time detection that has a run time of approximately 0.025 s on a GTX 1080 Ti GPU, with only a slight loss of performance.

At the time of submission, the large model runs KITTI 3D detection at about 20 fps and the small model at about 40 fps.
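
As a rough illustration of how a sparse middle feature extractor can be put together with the spconv v1.x API (the channel sizes, spatial shape, and random input below are assumptions for demonstration, not the exact SECOND architecture):

```python
import torch
import spconv  # spconv v1.x; in spconv v2.x the same classes live under spconv.pytorch

# Assumed voxel grid of shape (D=10, H=400, W=352) with 4 input channels per voxel.
middle_extractor = spconv.SparseSequential(
    spconv.SubMConv3d(4, 16, 3, indice_key="subm1"),      # submanifold conv: keeps the sparsity pattern
    torch.nn.ReLU(),
    spconv.SparseConv3d(16, 32, 3, stride=2, padding=1),  # regular sparse conv: downsamples the grid
    torch.nn.ReLU(),
).cuda()

# features: (N, C) per-voxel features; coords: (N, 4) int32 rows of [batch_idx, z, y, x].
features = torch.randn(1000, 4).cuda()
coords = torch.randint(0, 10, (1000, 4), dtype=torch.int32).cuda()
coords[:, 0] = 0  # single example in the batch

x = spconv.SparseConvTensor(features, coords, spatial_shape=[10, 400, 352], batch_size=1)
out = middle_extractor(x)
dense = out.dense()  # (B, C, D', H', W'); D' is then folded into the channels to form a 2D BEV feature map
```

The key point is that compute scales with the number of non-empty voxels rather than with the full dense grid, which is where the training and inference speedups come from.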

1.2 Second advantage: Data augmentation

Another advantage of using point cloud data is that it is very easy to scale, rotate and move objects by applying direct transformations to specified points on those objects.

SECOND incorporates a novel form of data augmentation based on this capability.

A ground-truth database is generated that contains the attributes of objects and the associated point cloud data.

Objects sampled from this database are then introduced into the point clouds during training.

This approach can greatly increase the convergence speed and the final performance of our network.
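
A simplified sketch of this ground-truth sampling augmentation (the database layout, the 2 m center-distance collision test, and the function name are assumptions for illustration; the actual implementation uses BEV box-overlap checks):

```python
import numpy as np

def sample_ground_truths(points, gt_boxes, gt_database, num_samples=15, rng=np.random):
    """Paste object point clouds sampled from a ground-truth database into the current scene.

    points:      (N, 4) LiDAR points of the scene [x, y, z, intensity]
    gt_boxes:    (M, 7) existing boxes [x, y, z, w, l, h, yaw]
    gt_database: list of dicts, each holding an object's 'points' and its 7-dim 'box'
    """
    num_samples = min(num_samples, len(gt_database))
    picks = rng.choice(len(gt_database), num_samples, replace=False)

    new_boxes, new_points = [], []
    for idx in picks:
        obj = gt_database[idx]
        box = obj["box"]
        # Crude collision test: skip samples whose center lands within 2 m of an existing box
        # (the real implementation performs a BEV box-overlap check instead).
        if len(gt_boxes) and np.min(np.linalg.norm(gt_boxes[:, :2] - box[:2], axis=1)) < 2.0:
            continue
        new_boxes.append(box)
        new_points.append(obj["points"])

    if new_boxes:
        gt_boxes = np.concatenate([gt_boxes, np.stack(new_boxes)], axis=0)
        points = np.concatenate([points] + new_points, axis=0)
    return points, gt_boxes
```

Because each training scene now contains many more positive objects, the detector sees far more ground-truth examples per iteration, which is what drives the faster convergence.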

1.3 Third advantage: Angle loss regression

In addition to the above, we also introduce a novel angle loss regression approach to solve the problem of the large loss generated when the difference in orientation between the ground truth and the prediction is equal to π, which yields a bounding box identical to the true bounding box.

The performance of this angle regression approach surpasses that of any current method we know about, including the orientation vector regression function available in AVOD [9].

We also introduce an auxiliary direction classifier to recognize the directions of objects.
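
The idea is to regress sin(θ_p − θ_t), so that two boxes whose yaw differs by π produce almost no regression loss, while the auxiliary direction classifier (e.g., yaw > 0 as the positive class) recovers which way the object faces. A minimal PyTorch sketch of this idea:

```python
import math
import torch
import torch.nn.functional as F

def sine_angle_loss(theta_pred, theta_gt):
    # sin(a - b) = sin(a)cos(b) - cos(a)sin(b); a prediction off by pi yields ~zero loss here.
    sin_diff = (torch.sin(theta_pred) * torch.cos(theta_gt)
                - torch.cos(theta_pred) * torch.sin(theta_gt))
    return F.smooth_l1_loss(sin_diff, torch.zeros_like(sin_diff))

def direction_targets(theta_gt):
    # Targets for the auxiliary direction classifier: 1 if the ground-truth yaw is positive.
    return (theta_gt > 0).long()

# Example: a prediction rotated by pi relative to the ground truth gives almost no angle loss,
# but is still separable by the direction classifier.
theta_gt = torch.tensor([0.3])
theta_pred = theta_gt + math.pi
print(sine_angle_loss(theta_pred, theta_gt))  # close to 0
print(direction_targets(theta_gt))            # tensor([1])
```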

1.4 Contributions of this paper

The key contributions of our work are as follows:

  • We apply sparse convolution in LiDAR-based object detection, thereby greatly increasing the speeds of training and inference.
  • We propose an improved method of sparse convolution that allows it to run faster.
  • We propose a novel angle loss regression approach that demonstrates better orientation regression performance than other methods do.
  • We introduce a novel data augmentation method for LiDAR-only learning problems that greatly increases the convergence speed and performance.

2. Related Work

Below, we briefly review existing works on 3D object detection based on point cloud data and images.

2.1. Front-View- and Image-Based Methods

Methods using 2D representations of RGB-D data can be divided into two classes:

  • those based on a bird’s eye view (BEV)
  • those based on a front view.

In typical image-based methods [5], 2D bounding boxes, class semantics and instance semantics are generated first, and then hand-crafted approaches are used to generate feature maps.

Another method [17] uses a CNN to estimate 3D bounding boxes from images and a specially designed discrete-continuous CNN to estimate the orientations of objects.

Methods using LiDAR [18] involve the conversion of point clouds into front-view 2D maps and the application of 2D detectors to localize objects in the front-view images.

These methods have been shown to perform poorly for both BEV detection and 3D detection compared to other methods.

2.2. Bird’s-Eye-View-Based Methods

MV3D [8] is the first method to convert point cloud data into a BEV representation.

  • In this method, point cloud data are converted into several slices to obtain height maps,
  • and these height maps are then concatenated with the intensity map and density map to obtain multichannel features.

ComplexYOLO [19]

  • uses a YOLO (You Only Look Once) [20] network and a complex angle encoding approach to increase speed and orientation performance,
  • but it uses fixed heights and z-locations in the predicted 3D bounding boxes.

In [21],

  • a fast single-stage proposal-free detector is designed that makes use of specific height-encoded BEV input.

A key problem with all of these approaches, however, is that many data points are dropped when generating a BEV map, resulting in a considerable loss of information on the vertical axis.

  • This information loss severely impacts the performance of these methods in 3D bounding box regression.

2.3. 3D-Based Methods

Most 3D-based methods

  • either use point cloud data directly
  • or require converting these data into 3D grids or voxels instead of generating BEV representations.

In [12],

  • point cloud data are converted into voxels containing feature vectors,
  • and then a novel convolution-like voting-based algorithm is used for detection.

Ref. [13]

  • exploits sparsity in point cloud data by leveraging a feature-centric voting scheme to implement novel convolutions,
  • thus increasing the computation speed.

These methods use hand-crafted features, and while they yield satisfactory results on specific datasets, they cannot adapt to the complex environments commonly encountered in autonomous driving.

In a distinct approach,

  • the authors of [22,23] develop a system that could learn pointwise features directly from point clouds by means of a novel CNN-based architecture,
  • whereas Ref. [24] uses a k-neighborhood method together with convolution to learn local spatial information from a point cloud.

These methods directly process point cloud data to perform 1D convolution on k-neighborhood points, but they cannot be applied to a large number of points; thus, image detection results are needed to filter the original data points.

Some CNN-based detectors convert point cloud data into voxels.

  • In the method presented in [15], point cloud data are discretized into two-valued voxels, and then 3D convolution is applied.
  • The method of [14] groups point cloud data into voxels, extracts voxelwise features, and then converts these features into a dense tensor to be processed using 3D and 2D convolutional networks.

The major problem with these methods is the high computational cost of 3D CNNs: the computational complexity of a 3D CNN grows cubically with the voxel resolution (for example, halving the voxel size increases the number of voxels, and hence the dense-convolution cost, by roughly a factor of eight).

Spatially sparse convolution has therefore been proposed: in [25,26], a spatially sparse convolution is designed that increases the 3D convolution speed,

  • whereas Ref. [27] proposes a new approach to 3D convolution in which the spatial structure of the output remains unchanged, which greatly increases the processing speed.
  • In [28], submanifold convolution is applied for the 3D semantic segmentation task;
    • However, [28], on which this work builds, does not address detection; more generally, there is no known method that uses sparse convolution for the detection task.

Similar to all of these approaches, our method makes use of a 3D convolutional architecture, but it incorporates several novel improvements.

2.4. Fusion-Based Methods

Some methods combine camera images with point clouds.

For instance, the authors of [29]

  • use a 3D RPN at two scales with different receptive fields to generate 3D proposals and then feed the 3D volume from the depth data of each 3D proposal into a 3D CNN and the corresponding 2D color patch into a 2D CNN to predict the final results.

In the method presented in [8],

  • point cloud data are converted into a front view and a BEV,
  • and then the feature maps extracted from both point cloud maps are fused with an image feature map.
  • The MV3D network with images performs better than the BEV-only network by a large margin,
    • but this architecture does not work well for small objects and runs slowly because it contains three CNNs.

The authors of [9]

  • combine images with a BEV and then use a novel architecture to generate high-resolution feature maps and 3D object proposals.

In [11],

  • 2D detection results are used to filter a point cloud such that PointNet [22] could then be applied to predict 3D bounding boxes.

However, fusion-based methods typically run slowly because they need to process a significant amount of image input.

The additional requirement of a time-synchronized and calibrated camera with LiDAR capabilities restricts the environments in which such methods can be used and reduces their robustness.

Our method, by contrast, can achieve state-of-the-art performance using only LiDAR data.


Implementation (PyTorch)

SECOND

  • SECOND system requirements: only Python 3.6+ and PyTorch 1.0.0+ are supported; tested on Ubuntu 16.04/18.04.
  • spconv system requirements: CUDA 9.0+, cmake >= 3.13.2 (used internally).

Running with Docker

# On the host PC
$ cd /workspace
$ git clone https://github.com/traveller59/second.pytorch.git

$ docker pull adioshun/second:latest


## KITTI dataset location: /media/adioshun/data/datasets
## code location: /workspace/second.pytorch
$ docker run --runtime=nvidia -it --privileged --network=host -v /tmp/.X11-unix:/tmp/.X11-unix --volume="$HOME/.Xauthority:/root/.Xauthority:rw" -e DISPLAY -v /media/adioshun/data/datasets:/datasets --volume /workspace:/workspace --name 'second'  adioshun/second:latest /bin/bash

$ docker start second
$ docker exec -it second bash

# Inside the Docker container

$ cd /workspace/second.pytorch/second

Create kitti infos: `python create_data.py create_kitti_info_file --data_path=/datasets`
Create reduced point cloud: `python create_data.py create_reduced_point_cloud --data_path=/datasets`
Create groundtruth-database infos: `python create_data.py create_groundtruth_database --data_path=/datasets`

TRAIN : `python ./pytorch/train.py train --config_path=./configs/car.fhd.config --model_dir=/path/to/model_dir`
Evaluate : `python ./pytorch/train.py evaluate --config_path=./configs/car.fhd.config --model_dir=/path/to/model_dir --measure_time=True --batch_size=1`
└── KITTI_DATASET_ROOT (e.g., /datasets/)
       ├── training    <-- 7481 train data
       |   ├── image_2 <-- for visualization
       |   ├── calib
       |   ├── label_2
       |   ├── velodyne
       |   └── velodyne_reduced <-- empty directory
       └── testing     <-- 7518 test data
           ├── image_2 <-- for visualization
           ├── calib
           ├── velodyne
           └── velodyne_reduced <-- empty directory

Installation from source

1. Clone code

git clone https://github.com/nutonomy/second.pytorch.git

2. Install Python packages

It is recommended to use the Anaconda package manager.

First, use Anaconda to configure as many packages as possible.

conda create -n pointpillars python=3.7 anaconda
source activate pointpillars
conda install shapely pybind11 protobuf scikit-image numba pillow
conda install pytorch torchvision -c pytorch
conda install google-sparsehash -c bioconda

Then use pip for the packages missing from Anaconda.

pip install --upgrade pip
pip install fire tensorboardX

Finally, install SparseConvNet. This is not required for PointPillars, but the general SECOND code base expects this to be correctly configured.

git clone git@github.com:facebookresearch/SparseConvNet.git
cd SparseConvNet/
bash build.sh
# NOTE: if bash build.sh fails, try bash develop.sh instead

Additionally, you may need to install Boost geometry:

sudo apt-get install libboost-all-dev

3. Setup cuda for numba

You need to add the following environment variables for numba to ~/.bashrc:

export NUMBAPRO_CUDA_DRIVER=/usr/lib/x86_64-linux-gnu/libcuda.so
export NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice

4. PYTHONPATH

Add second.pytorch/ to your PYTHONPATH.


Older manual installation

wget https://github.com/Kitware/CMake/releases/download/v3.13.3/cmake-3.13.3.tar.gz
tar xzf cmake-3.13.3.tar.gz
cd cmake-3.13.3
./bootstrap
make
make install

sudo apt-get install libboost-all-dev

cd ~
#pip3 install shapely fire pybind11 tensorboardX protobuf scikit-image numba pillow
conda install -c conda-forge shapely fire pybind11 tensorboardX protobuf scikit-image numba pillow

#pip3 install torch torchvision
conda install -c anaconda pytorch-gpu 

git clone https://github.com/traveller59/spconv.git --recursive
cd spconv
python3 setup.py bdist_wheel
cd ./dist
pip3 install spconv-1.0-cp36-cp36m-linux_x86_64.whl

# Setup cuda for numba
vi ~/.bashrc
export NUMBAPRO_CUDA_DRIVER=/usr/lib/x86_64-linux-gnu/libcuda.so
export NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice

export PATH="/usr/local/cuda/bin:$PATH"  
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda

export PYTHONPATH=$PYTHONPATH:/workspace/second.pytorch
source ~/.bashrc

cd /workspace
git clone https://github.com/traveller59/second.pytorch.git

--

Troubleshooting

/usr/lib/x86_64-linux-gnu/libcuda.so: file too short -> check the libcuda.so symbolic link and set it up again

pybind11/detail/common.h:112:20: fatal error: Python.h: No such file or directory -> apt install python3.6-dev
