| Title | Multi-View 3D Object Detection Network for Autonomous Driving |
| --- | --- |
| Authors (affiliation) | Xiaozhi Chen (Tsinghua University), Bo Li (Baidu) |
| Venue / Year | CVPR, 2017 |
| Keywords | Region-based feature fusion |
| References | Homepage, CVPR2017, YouTube, MobileNets |
| Code | Author code, MV3D, 3rd Reader |
!!!!! Started reading from Section 3 (MV3D Network) (2017.09.21)
Goal: a sensory-fusion framework
Components
3D object proposal generation: generates 3D candidate boxes
Multi-view feature fusion
Characteristics
Goal
Strengths of each sensor
Recent trends in LIDAR-based methods
Apply CNN to the front view point map in a dense box prediction scheme [17].
[26] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, 2015
[7] M. Engelcke, D. Rao, D. Zeng Wang, C. Hay Tong, and I. Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. arXiv:1609.06666, 2016
[17] B. Li, T. Zhang, and T. Xia. Vehicle detection from 3d lidar using fully convolutional network. In Robotics: Science and Systems, 2016
Recent trends in image-based methods
Image-based methods [4, 3] typically first generate 3D box proposals and then perform region-based recognition using the Fast RCNN [10] pipeline.
[4] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In NIPS, 2015
[3] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In CVPR, 2016
Recent trends in fusion-based (LiDAR + image) methods
[Note] This paper uses a deep fusion scheme.
[11] A. Gonzalez, D. Vazquez, A. Lopez, and J. Amores. Onboard object detection: Multicue, multimodal, and multiview random forest of local experts. In IEEE Transactions on Cybernetics, 2016
[8] M. Enzweiler and D. M. Gavrila. A multilevel mixture-of experts framework for pedestrian classification. IEEE Transactions on Image Processing, 20(10):2967–2979, 2011
The main idea for utilizing multimodal information is to perform region-based feature fusion.
MV3D = a 3D Proposal Network + a Region-based Fusion Network.
Proposal #1: multi-view encoding scheme
Purpose: "The 3D proposal network utilizes a bird's eye view representation of point cloud to generate highly accurate 3D candidate boxes."
Advantage: the benefit of 3D object proposals is that they can be projected to any view in 3D space.
Purpose: "The multi-view fusion network extracts region-wise features by projecting 3D proposals to the feature maps from multiple views."
Characteristic: "We design a deep fusion approach to enable interactions of intermediate layers from different views."
[15] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet:Ultra-deep neural networks without residuals.arXiv:1605.07648, 2016.
Most existing methods encode 3D point cloud with a voxel grid representation.
Some recently proposed methods [23, 7, 16] improve feature representation with 3D convolutions.
In addition to the 3D voxel representation, VeloFCN [17] projects point cloud to the front view, obtaining a 2D point map.
[24, 18, 12] investigate volumetric and multi-view representation of point cloud for 3D object classification.
In this work, we encode 3D point cloud with multi-view feature maps, enabling region-based representation for multimodal fusion.
[22] S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In ECCV. 2014.
[26] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, 2015.
[23] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, 2016.
[7] M. Engelcke, D. Rao, D. Zeng Wang, C. Hay Tong, and I. Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. arXiv:1609.06666, 2016.
[16] B. Li. 3d fully convolutional network for vehicle detection in point cloud. IROS, 2017.
[17] B. Li, T. Zhang, and T. Xia. Vehicle detection from 3d lidar using fully convolutional network. In Robotics: Science and Systems, 2016.
[24] H. Su, S.Maji, E.Kalogerakis, and E. Learned-Miller. Multiview convolutional neural networks for 3d shape recognition. In ICCV, 2015.
[18] C. R. Qi, M. N. H. Su, A. Dai, M. Yan, and L.Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, 2016.
[12] V. Hegde and R. Zadeh. Fusionnet: 3d object classification using multiple data representations. CoRR, abs/1607.05695, 2016.
3DVP [28] introduces 3D voxel patterns and employs a set of ACF detectors to do 2D detection and 3D pose estimation.
3DOP [4] reconstructs depth from stereo images and uses an energy minimization approach to generate 3D box proposals, which are fed to an R-CNN [10] pipeline for object recognition.
Mono3D [3] shares the same pipeline as 3DOP, but generates 3D proposals from monocular images.
[31, 32] introduce a detailed geometry representation of objects using 3D wire frame models.
To incorporate temporal information, some works [6, 21] combine structure from motion and ground estimation to lift 2D detection boxes to 3D bounding boxes.
Image-based methods usually rely on accurate depth estimation or landmark detection.
In this paper: "Our work shows how to incorporate LIDAR point cloud to improve 3D localization."
[28] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel patterns for object category recognition. In CVPR, 2015
[4] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In NIPS, 2015
[3] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In CVPR, 2016
[31] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3d representations for object recognition and modeling. PAMI, 2013.
[32] M. Z. Zia, M. Stark, and K. Schindler. Are cars just 3d boxes? jointly estimating the 3d shape of multiple objects. In CVPR, pages 3678–3685, 2014.
[6] V. Dhiman, Q. H. Tran, J. J. Corso, and M. Chandraker. A continuous occlusion model for road scene understanding. In CVPR, pages 4331–4339, 2016.
[21] S. Song and M. Chandraker. Joint sfm and detection cues for monocular 3d localization in road scenes. In Computer Vision and Pattern Recognition, pages 3734–3742, 2015
In the autonomous driving domain, multimodal fusion has not been studied extensively.
[11] combines images, depth and optical flow using a mixture-of-experts framework for 2D pedestrian detection.
[8] fuses RGB and depth images in the early stage and trains pose-based classifiers for 2D detection.
This work is inspired by FractalNet [15] and Deeply-Fused Net [27].
Difference from those methods: "Our network differs from them by using the same base network for each column and adding auxiliary paths and losses for regularization."
[11] A. Gonzalez, D. Vazquez, A. Lopez, and J. Amores. Onboard object detection: Multicue, multimodal, and multiview random forest of local experts. In IEEE Transactions on Cybernetics, 2016.
[8] M. Enzweiler and D. M. Gavrila. A multilevel mixture-of experts framework for pedestrian classification. IEEE Transactions on Image Processing, 20(10):2967–2979, 2011
[15] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv:1605.07648, 2016.
[27] J. Wang, Z. Wei, T. Zhang, and W. Zeng. Deeply-fused nets. arXiv:1605.07716, 2016.
Object proposals in 3D are generated in a manner similar to 2D:
generate a small set of 3D candidate boxes in order to cover most of the objects in 3D space.
3DOP [4] designs some depth features in stereo point cloud to score a large set of 3D candidate boxes.
Mono3D [3] exploits the ground plane prior and utilizes some segmentation features to generate 3D proposals from a single image.
Both 3DOP and Mono3D use hand-crafted features.
In this paper: "We propose a more efficient approach by introducing a bird's eye view representation of point cloud and employing 2D convolutions to generate accurate 3D proposals."
| Dim | Paper | Method | Weakness |
| --- | --- | --- | --- |
| 2D | Selective search [25], [33], [2] | | |
| 3D | 3DOP [4] | depth features (from stereo point cloud) | hand-crafted features |
| 3D | Mono3D [3] | segmentation features (from image) | hand-crafted features |
| 3D | Deep Sliding Shapes [23] | more powerful deep learning features | computationally expensive |
| 3D | Proposed (MV3D) | 2D convolutions (from bird's eye view) | |
Existing approaches
Proposed approach
The bird’s eye view representation is encoded by height, intensity and density.
We discretize the projected point cloud into a 2D grid with resolution of 0.1m.
For each cell, the height feature is computed as the maximum height of the points in the cell.
To encode more detailed height information, the point cloud is divided equally into M slices.
A height map is computed for each slice, thus we obtain M height maps.
The intensity feature is the reflectance value of the point with the maximum height in each cell, and the density feature is the number of points N in each cell; the density is normalized as $$\min(1.0, \frac{\log(N+1)}{\log(64)})$$.
Note that the intensity and density features are computed for the whole point cloud while the height feature is computed for M slices, thus in total the bird’s eye view map is encoded as (M +2)-channel features.
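A minimal NumPy sketch of this bird's eye view encoding, under assumptions the paper does not fix: the grid extent, the z range, the number of slices M, the shifted height values, and the use of max intensity per cell (the paper takes the reflectance of the highest point) are illustrative choices. The repository's own preprocessing (the `top` maps produced by `src/data.py`) is the authoritative version.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                 z_range=(-2.0, 1.25), res=0.1, M=5):
    """Encode a LIDAR cloud (N x 4: x, y, z, intensity) into an (M + 2)-channel
    bird's eye view map: M height maps + intensity + density. Only res = 0.1 m
    and the density normalization min(1, log(N+1)/log(64)) come from the text;
    the rest are illustrative assumptions."""
    x, y, z, intensity = points.T
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z, intensity = x[keep], y[keep], z[keep], intensity[keep]

    H = int((x_range[1] - x_range[0]) / res)   # 704 for the KITTI range above
    W = int((y_range[1] - y_range[0]) / res)   # 800
    r = ((x - x_range[0]) / res).astype(np.int32)
    c = ((y - y_range[0]) / res).astype(np.int32)

    height = np.zeros((M, H, W), dtype=np.float32)
    intens = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)

    # Which of the M vertical slices each point falls into.
    s = np.clip(((z - z_range[0]) / (z_range[1] - z_range[0]) * M).astype(np.int32), 0, M - 1)
    for i in range(len(x)):
        # Max height per cell and slice (shifted so heights are non-negative).
        height[s[i], r[i], c[i]] = max(height[s[i], r[i], c[i]], z[i] - z_range[0])
        # Simplification: max intensity per cell instead of "intensity of the highest point".
        intens[r[i], c[i]] = max(intens[r[i], c[i]], intensity[i])
        count[r[i], c[i]] += 1.0

    density = np.minimum(1.0, np.log(count + 1.0) / np.log(64.0))
    return np.concatenate([height, intens[None], density[None]], axis=0)
```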
The front view provides complementary information to the bird's eye view.
As LIDAR point cloud is very sparse, projecting it into the image plane results in a sparse 2D point map.
Instead, the proposed method projects it to a cylinder plane to generate a dense front view map, as in [17: Bo Li].
Given a 3D point $$p = (x, y, z)$$, its coordinates $$p_{fv} = (r, c)$$ in the front view map can be computed as $$r = \lfloor \mathrm{atan2}(z, \sqrt{x^2 + y^2}) / \Delta\phi \rfloor$$, $$c = \lfloor \mathrm{atan2}(y, x) / \Delta\theta \rfloor$$, where $$\Delta\theta$$ and $$\Delta\phi$$ are the horizontal and vertical resolution of the laser beams.
We encode the front view map with three-channel features: height, distance and intensity.
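A small sketch of this cylindrical front-view projection. The angular resolutions `d_theta` / `d_phi` are left as parameters with illustrative defaults; the paper only states that they are the horizontal and vertical resolution of the scanner (for KITTI the note later mentions a 64×512 front-view map).

```python
import numpy as np

def lidar_to_front_view(points, d_theta=np.radians(0.08), d_phi=np.radians(0.4)):
    """Project LIDAR points (N x 4: x, y, z, intensity) onto a cylinder plane,
    as in VeloFCN [17]. d_theta / d_phi are illustrative angular resolutions."""
    x, y, z, intensity = points.T
    r = np.floor(np.arctan2(z, np.hypot(x, y)) / d_phi).astype(np.int32)   # row index
    c = np.floor(np.arctan2(y, x) / d_theta).astype(np.int32)              # column index
    r -= r.min()   # shift indices so they start at 0
    c -= c.min()
    fv = np.zeros((r.max() + 1, c.max() + 1, 3), dtype=np.float32)
    fv[r, c, 0] = z                                   # height channel
    fv[r, c, 1] = np.sqrt(x ** 2 + y ** 2 + z ** 2)   # distance channel
    fv[r, c, 2] = intensity                           # intensity channel
    return fv
```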
Inspired by the RPN of Faster R-CNN.
Input: bird's eye view map
In 3D object detection, the bird's eye view map has three advantages over the front view / image plane:
Physical size is preserved: "objects preserve physical sizes when projected to the bird's eye view, thus having small size variance, which is not the case in the front view/image plane."
Occlusion is avoided: "objects in the bird's eye view occupy different space, thus avoiding the occlusion problem."
Since vehicles lie on the ground, the top-down location matters most: "in the road scene, since objects typically lie on the ground plane and have small variance in vertical location, the bird's eye view location is more crucial to obtaining accurate 3D bounding boxes."
Given a bird's eye view map, the network generates 3D box proposals from a set of 3D prior boxes.
Each 3D box is parameterized by $$(x, y, z, l, w, h)$$
For each 3D prior box, the corresponding bird's eye view anchor $$(x_{bv}, y_{bv}, l_{bv}, w_{bv})$$ can be obtained by discretizing $$(x, y, l, w)$$.
We design N 3D prior boxes by clustering ground truth object sizes in the training set; the prior boxes have (l, w) ∈ {(3.9, 1.6), (1.0, 0.6)} and h = 1.56 m.
By rotating the bird's eye view anchors 90 degrees, we obtain N = 4 prior boxes.
We do not do orientation regression in proposal generation; it is left to the next prediction stage.
The orientations of 3D boxes are restricted to {0°, 90°}, which are close to the actual orientations of most road scene objects.
This simplification makes training of proposal regression easier. See the anchor sketch below.
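A small sketch of how the N = 4 bird's eye view anchors could be laid out. The two clustered sizes, h = 1.56 m and the 90° rotation come from the text; the anchor stride of 0.4 m is inferred from the 0.1 m grid and the 4× downsampling mentioned further below, and the placement scheme is an assumption.

```python
import numpy as np

# The two clustered (l, w) sizes from the paper, plus their 90-degree rotations.
PRIOR_SIZES = [(3.9, 1.6), (1.0, 0.6)]
PRIOR_LW = PRIOR_SIZES + [(w, l) for (l, w) in PRIOR_SIZES]   # N = 4
PRIOR_H = 1.56

def bev_anchors(feat_h, feat_w, stride=0.4):
    """Place the N = 4 prior boxes at every position of the (assumed 4x
    downsampled) BEV feature map; returns (x_bv, y_bv, l_bv, w_bv) tuples."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            x, y = i * stride, j * stride       # anchor center in BEV metric coords
            for (l, w) in PRIOR_LW:
                anchors.append((x, y, l, w))
    return np.asarray(anchors, dtype=np.float32)
```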
Problem: with a discretization resolution of 0.1 m, object boxes in the bird's eye view occupy only 5–40 pixels.
Solution #1: use a higher-resolution input → requires more computation.
Solution #2: "We opt for feature map upsampling as in [1]."
In our implementation, the front-end convolutions only proceed three pooling operations, i.e., 8x downsampling.
Therefore, combined with the 2x deconvolution, the feature map fed to the proposal network is 4x downsampled with respect to the bird’s eye view input.
The same scheme as the RPN is used.
We do 3D box regression by regressing to $$t = (\Delta x, \Delta y, \Delta z, \Delta l, \Delta w, \Delta h)$$, similarly to RPN [19].
$$(\Delta l, \Delta w, \Delta h)$$ are computed as $$\Delta s = \log\frac{s_{GT}}{s_{anchor}}, \; s \in \{l, w, h\}$$.
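A minimal sketch of these regression targets for one anchor / ground-truth pair. The paper states that the center offsets are normalized by the anchor sizes; the exact pairing of each offset with a normalizing dimension below is an assumption following the usual RPN-style convention.

```python
import numpy as np

def box3d_regression_targets(anchor, gt):
    """Compute t = (dx, dy, dz, dl, dw, dh). Boxes are (x, y, z, l, w, h):
    center offsets normalized by the anchor size, log ratios for the sizes."""
    ax, ay, az, al, aw, ah = anchor
    gx, gy, gz, gl, gw, gh = gt
    return np.array([
        (gx - ax) / al, (gy - ay) / aw, (gz - az) / ah,    # normalized center offsets
        np.log(gl / al), np.log(gw / aw), np.log(gh / ah)  # log size ratios
    ], dtype=np.float32)
```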
Loss computation: "we use a multi-task loss to simultaneously classify object/background and do 3D box regression."
In particular, we use cross-entropy for the "objectness" loss and Smooth $$l_1$$ [10] for the 3D box regression loss.
Background anchors are ignored when computing the box regression loss.
During training, we compute the IoU overlap between anchors and ground truth bird’s eye view boxes.
An anchor is considered to be positive if its overlap is above 0.7, and negative if the overlap is below 0.5.
Anchors with overlap in between are ignored.
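A hedged NumPy sketch of this multi-task proposal loss: cross-entropy over object/background plus Smooth L1 on the regression vector, with background and ignored anchors masked out of the regression term. The equal weighting of the two terms and the label convention (1 = positive, 0 = negative, -1 = ignored) are assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 as in Fast R-CNN [10]."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def proposal_loss(cls_prob, labels, box_pred, box_targets):
    """cls_prob: (A, 2) softmax scores; labels: (A,) with 1 = object,
    0 = background, -1 = ignored; box_pred / box_targets: (A, 6) vectors t."""
    valid = labels >= 0
    cls_loss = -np.mean(np.log(cls_prob[valid, labels[valid]] + 1e-8))
    pos = labels == 1                      # background anchors ignored for regression
    reg_loss = np.sum(smooth_l1(box_pred[pos] - box_targets[pos])) / max(pos.sum(), 1)
    return cls_loss + reg_loss             # equal weighting is an assumption
```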
Because the LIDAR point cloud is sparse, many anchors end up empty; to reduce computation, empty anchors are removed during both training and testing.
Different from [23], we did not use 3D NMS because objects should occupy different space on the ground plane.
We use IoU threshold of 0.7 for NMS.
The top 2000 boxes are kept during training, while in testing, we only use 300 boxes.
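A small sketch of the bird's eye view NMS used to keep the top proposals. Axis-aligned IoU is used here as an approximation, which is reasonable at this stage since proposal orientations are restricted to {0°, 90°}; the thresholds (0.7, top 2000 / 300) come from the text.

```python
import numpy as np

def bev_iou(box, boxes):
    """Axis-aligned IoU between one box and many; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.7, top_k=300):
    """Greedy NMS: keep highest-scoring boxes, drop overlaps above iou_thresh."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        ious = bev_iou(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= iou_thresh]
    return keep
```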
Role
Purpose: bring the features coming from the different views to the same length.
Since features from different views/modalities usually have different resolutions, we employ ROI pooling [10] for each view to obtain feature vectors of the same length.
Given the generated 3D proposals,we can project them to any views in the 3D space.
Given a 3D proposal $$p_{3D}$$, we obtain ROIs on each view via:
$$ROI_v = T_{3D \rightarrow v}(p_{3D}), \quad v \in \{BV, FV, RGB\}$$
Given an input feature map x from the front-end network of each view, we obtain fixed-length features $$f_v$$ via ROI pooling:
$$f_v = R(x, ROI_v), \quad v \in \{BV, FV, RGB\}$$
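A hedged TensorFlow sketch of the per-view pooling step: each 3D proposal is assumed to have already been projected to a 2D ROI in the given view, and a fixed-size feature is then cropped from that view's feature map. `tf.image.crop_and_resize` is used here as a stand-in for the ROI pooling layer the repository builds (`roi_pooling.so`), so this is an approximation rather than the repo's actual op.

```python
import tensorflow as tf

def pool_view_features(feature_map, rois_norm, box_indices, pooled_size=7):
    """feature_map: (B, H, W, C) from one view's front-end network.
    rois_norm: (R, 4) boxes as (y1, x1, y2, x2) normalized to [0, 1], i.e.
    ROI_v = T_{3D->v}(p_3D) already projected into this view.
    box_indices: (R,) int32 batch index per ROI.
    Returns (R, pooled_size, pooled_size, C) fixed-size features f_v."""
    return tf.image.crop_and_resize(feature_map, rois_norm, box_indices,
                                    crop_size=[pooled_size, pooled_size])
```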
How the features from different views are combined:
For a network that has L layers, early fusion combines features $$\{f_v\}$$ from multiple views in the input stage:
In contrast, late fusion uses separate subnetworks to learn feature transformations independently and combines their outputs in the prediction stage:
To enable more interactions among features of the intermediate layers from different views, we design the following deep fusion process:
We use element-wise mean for the join operation for deep fusion since it is more flexible when combined with droppath training [15].
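A sketch contrasting the three fusion schemes described above. The element-wise mean join follows the text; the per-layer transforms are placeholder conv layers and the depth L is illustrative, not the paper's exact architecture, and the per-view inputs are assumed to have identical shapes (as the pooled ROI features do).

```python
import tensorflow as tf

def conv(x, name):
    # Placeholder per-view layer transform H_l (illustrative, not the real architecture).
    return tf.layers.conv2d(x, 256, 3, padding='same', activation=tf.nn.relu, name=name)

def join(xs):
    return tf.add_n(xs) / float(len(xs))      # element-wise mean join

def early_fusion(views, L=3):
    x = join(views)                           # combine once, at the input stage
    for l in range(L):
        x = conv(x, 'early_%d' % l)
    return x

def late_fusion(views, L=3):
    outs = []
    for v, x in enumerate(views):             # independent per-view subnetworks
        for l in range(L):
            x = conv(x, 'late_v%d_%d' % (v, l))
        outs.append(x)
    return join(outs)                         # combine only at the prediction stage

def deep_fusion(views, L=3):
    xs = list(views)
    for l in range(L):                        # interact at every intermediate layer
        fused = join(xs)
        xs = [conv(fused, 'deep_v%d_%d' % (v, l)) for v in range(len(xs))]
    return join(xs)
```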
Given the fusion features of the multi-view network, we regress to oriented 3D boxes from 3D proposals.
In particular, the regression targets are the 8 corners of 3D boxes:
They are encoded as the corner offsets normalized by the diagonal length of the proposal box.
Although such a 24-D vector representation is redundant for an oriented 3D box, we found that this encoding approach works better than the centers-and-sizes encoding approach.
Note that our 3D box regression differs from [23] which regresses to axis-aligned 3D boxes.
In our model, the object orientations can be computed from the predicted 3D box corners.
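A sketch of the 24-D corner-offset target encoding: the offsets between the ground-truth and proposal corners, normalized by the proposal's diagonal length. The corner ordering and taking the full 3D box diagonal as the normalizer are assumptions; the paper only states "corner offsets normalized by the diagonal length of the proposal box".

```python
import numpy as np

def corner_targets(proposal_corners, gt_corners):
    """proposal_corners, gt_corners: (8, 3) arrays of 3D box corners (same ordering).
    Returns a 24-D vector of corner offsets normalized by the proposal diagonal."""
    diag = np.linalg.norm(proposal_corners.max(axis=0) - proposal_corners.min(axis=0))
    return ((gt_corners - proposal_corners) / diag).reshape(-1)
```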
We use a multitask loss to jointly predict object categories and oriented 3D boxes.
As in the proposal network, the category loss uses cross-entropy and the 3D box loss uses smooth $$l_1$$.
The positive/negative ROIs are determined based on the IoU overlap of bird's eye view boxes.
A 3D proposal is considered to be positive if the bird’s eye view IoU overlap is above 0.5, and negative otherwise.
we apply NMS on the 3D boxes after 3D bounding box regression.
We project the 3D boxes to the bird’s eye view to compute their IoU overlap.
We use IoU threshold of 0.05 to remove redundant boxes, which ensures objects can not occupy the same space in bird’s eye view.
Two regularization techniques are used in the region-based fusion network.
For each iteration, we randomly choose to do global drop-path or local drop-path with a probability of 50%.
If global drop-path is chosen, we select a single view from the three views with equal probability.
If local drop-path is chosen, paths input to each join node are randomly dropped with 50% probability. We ensure that for each join node at least one input path is kept.
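A small sketch of the per-iteration drop-path choice. The 50/50 global/local split, the equal-probability view selection, the 50% per-path drop and the keep-at-least-one rule come from the text; how the resulting mask is applied to the join operation is an implementation detail not shown here.

```python
import numpy as np

def sample_droppath_mask(num_views=3, p_local_drop=0.5):
    """Return a boolean keep-mask over the view paths feeding a join node."""
    if np.random.rand() < 0.5:
        # Global drop-path: keep exactly one view, chosen with equal probability.
        mask = np.zeros(num_views, dtype=bool)
        mask[np.random.randint(num_views)] = True
    else:
        # Local drop-path: drop each input path with 50% probability ...
        mask = np.random.rand(num_views) >= p_local_drop
        if not mask.any():
            # ... but ensure at least one input path is kept per join node.
            mask[np.random.randint(num_views)] = True
    return mask
```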
To further strengthen the representation capability of each view, we add auxiliary paths and losses to the network.
As shown in Fig. 4, the auxiliary paths have the same number of layers as the main network.
Each layer in the auxiliary paths shares weights with the corresponding layer in the main network.
We use the same multi-task loss, i.e. classification loss plus 3D box regression loss, to back-propagate each auxiliary path.
We weight all the losses, including the auxiliary losses, equally.
Important: the auxiliary paths are not used at prediction time. "The auxiliary paths are removed during inference."
In our multi-view network, each view has the same architecture.
The base network is VGG-16, with several modifications.
The parameters are initialized from a VGG-16 network pretrained on ImageNet.
Although our network has three branches, the number of parameters is about 75% of the VGG-16 network.
Inference time: 0.36 s per image (GeForce Titan X GPU).
In the case of KITTI, which provides only annotations for objects in the front view (around 90◦ field of view), we use point cloud in the range of [0,70.4] × [-40, 40] meters.
We also remove points that are out of the image boundaries when projected to the image plane.
For bird's eye view, the discretization resolution is set to 0.1 m, therefore the bird's eye view input has a size of 704×800.
Since KITTI uses a 64-beam Velodyne laser scanner, we can obtain a 64×512 map for the front view points.
The RGB image is up-scaled so that the shortest size is 500.
The network is trained in an end-to-end fashion.
For each mini-batch we use 1 image and sample 128 ROIs, roughly keeping 25% of the ROIs as positive.
We train the network using SGD with a learning rate of 0.001 for 100K iterations.
Then we reduce the learning rate to 0.0001 and train another 20K iterations.
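A minimal TF 1.x sketch of this two-stage learning-rate schedule (0.001 for 100K iterations, then 0.0001 for 20K more). Plain SGD without momentum, the variable names, and `total_loss` are assumptions; the repository's actual training loop lives in `src/train.py`.

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# 0.001 for the first 100K iterations, then 0.0001 for the remaining 20K.
learning_rate = tf.train.piecewise_constant(global_step,
                                            boundaries=[100000],
                                            values=[0.001, 0.0001])
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# train_op = optimizer.minimize(total_loss, global_step=global_step)
```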
LIDAR point cloud + RGB images: Multi-View 3D Object Detection Network for Autonomous Driving
Chen et al. (2016c) combine LiDAR laser range data with RGB images for object detection. In their approach, the sparse point cloud is encoded using a compact multi-view representation and a proposal generation network utilizes the bird's eye view representation of the point cloud to generate 3D candidates.
Finally, they combine region-wise features from multiple views with a deep fusion scheme.
zxf8665905: recommended
lihua213 #2: old version
Install tensorflow-gpu and CUDA.
An Nvidia GPU card with compute capability
Ubuntu (* the version matching CUDA 7.5 is 14.04)
CUDA (* version in environment_gpu.yml = 7.5)
cuDNN
Python 3.5 for the MV3D-related code
Tensorflow-GPU (version > 1.0)
# https://github.com/adioshun/gitBook_DeepDrive/blob/master/papermultiview-3d-cnn/environment_gpu.yml
# conda env create -f environment_gpu.yml --name mv3d_p3_gpu
conda create -n python35 python=3.5
conda install tensorflow-gpu opencv3 shapely scikit-learn keras Cython matplotlib simplejson numba
pip install easydict
In src/net/lib/setup.py and src/lib/make.sh, set "arch=sm_37"  # Google Cloud GPU Tesla K80
# Which CUDA capabilities do we want to pre-build for?
# https://developer.nvidia.com/cuda-gpus
# Compute/shader model Cards
# 6.1 P4, P40, Titan X so CUDA_MODEL = 61
# 6.0 P100 so CUDA_MODEL = 60
# 5.2 M40
# 3.7 K80
# 3.5 K40, K20
# 3.0 K10, Grid K520 (AWS G2)
# Other Nvidia shader models should work, but they will require extra startup
# time as the code is pre-optimized for them.
CUDA_MODELS=30 35 37 52 60 61
test
import tensorflow as tf
sess = tf.Session()
print(tf.__version__)  # should print a version >= 1.0
The KITTI Vision Benchmark Suite Raw Data
Place the data under data/raw/kitti/.
The tracklet_labels.xml file goes inside the 2011_09_26_drive_0001_sync subfolder.
In src/kitti_data/pykitti/tracklet.py (L289), change the default to the name of the downloaded drive:
DEFAULT_DRIVE = '2011_09_26_drive_0001'
cd src
source activate didi
sudo chmod 755 ./make.sh
./make.sh
conda create -n python27 python=2.7
It is recommended to run [2.2] below directly.
#- `python ./net/lib/setup.py build_ext --inplace` : Fast R-CNN (MS)
#- `bash ./net/lib/make.sh` : building psroi_pooling layer
#- build required .so files
ln -s ./net/lib/roi_pooling_layer/roi_pooling.so ./net/roipooling_op/roi_pooling.so
ln -s ./net/lib/nms/gpu_nms.cpython-35m-x86_64-linux-gnu.so ./net/processing/gpu_nms.cpython-35m-x86_64-linux-gnu.so
ln -s ./net/lib/nms/cpu_nms.cpython-35m-x86_64-linux-gnu.so ./net/processing/cpu_nms.cpython-35m-x86_64-linux-gnu.so
ln -s ./net/lib/utils/cython_bbox.cpython-35m-x86_64-linux-gnu.so ./net/processing/cython_bbox.cpython-35m-x86_64-linux-gnu.so
warning: calling a constexpr __host__ function from a __host__ __device__ function is not allowed.
Add the --expt-relaxed-constexpr flag in make.sh:
if [ -d "$CUDA_PATH" ]; then
nvcc -std=c++11 -c -o roi_pooling_op.cu.o roi_pooling_op_gpu.cu.cc \
-I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC $CXXFLAGS \
-arch=sm_37 --expt-relaxed-constexpr
./src/data.py (based on KITTI; when using the didi data, utils/bag_to_kitti must be run first)
./data/preprocessing/kitti/
- gt_boxes3d :npy
- gt_box_plot : png
- gt_labels : npy
- rgb : png
- top : .npy.npz
- top_image : png
Working environment
#data.py
if config.cfg.USE_CLIDAR_TO_TOP:
SharedLib = ctypes.cdll.LoadLibrary('/workspace/mv3d/src/lidar_data_preprocess/'
'Python_to_C_Interface/ver3/LidarTopPreprocess.so')
#if config.cfg.USE_CLIDAR_TO_TOP:
# SharedLib = ctypes.cdll.LoadLibrary('/home/stu/MV3D/src/lidar_data_preprocess/'
# 'Python_to_C_Interface/ver3/LidarTopPreprocess.so')
# Add at the top of /MV3D/src/net/processing/boxes3d.py
# See ./src/config.py L126
#rgb camera
MATRIX_Mt = ([[ 2.34773698e-04, 1.04494074e-02, 9.99945389e-01, 0.00000000e+00],
[ -9.99944155e-01, 1.05653536e-02, 1.24365378e-04, 0.00000000e+00],
[ -1.05634778e-02, -9.99889574e-01, 1.04513030e-02, 0.00000000e+00],
[ 5.93721868e-02, -7.51087914e-02, -2.72132796e-01, 1.00000000e+00]])
MATRIX_Kt = ([[ 721.5377, 0. , 0. ],
[ 0. , 721.5377, 0. ],
[ 609.5593, 172.854 , 1. ]])
#if __C.DATA_SETS_TYPE=='test':
# __C.DATA_SETS_DIR = osp.abspath('/home/stu/round12_data_test')
if __C.DATA_SETS_TYPE=='test':
__C.DATA_SETS_DIR = osp.abspath('/workspace/mv3d')
Since the same problem occurs again later, method [C] is recommended.
cd ./src/net/roipooling_op
mv roi_pooling.so roi_pooling.so~
cp ../../net/lib/roi_pooling_layer/roi_pooling.so ./
chmod +x roi_pooling.so
"tensorflow.python.framework.errors_impl.NotFoundError: YOUR_FOLDER/roi_pooling.so: undefined symbol: ZN10tensorflow7strings6StrCatB5cxx11ERKNS0_8AlphaNumES3"
Or download and replace the .so with the following file: [Download] (CUDA 8.0, Python 3.5)
Use this version of train.py (for Python 2).
conda list | grep tensorflow
Then upgrade tensorflow (CPU & GPU) to version 1.0 or higher.
File Structure
├── data <-- all data is stored here. (Introduced in detail below)
│ ├── predicted <-- after prediction, results will be saved here.
│ ├── preprocessed <-- MV3D net will take inputs from here(after data.py)
│ └── raw <-- raw data
├── environment_cpu.yml <-- install cpu version.
├── README.md
├── saved_model <--- model and weights saved here.
├── src <-- MV3D net related source code
│ ├── config.py
│ ├── data.py
│ ├── didi_data
│ ├── kitti_data
│ ├── lidar_data_preprocess
│ ├── make.sh
│ ├── model.py
│ ├── mv3d_net.py
│ ├── net
│ ├── play_demo.ipynb
│ ├── __pycache__
│ ├── tracking.py <--- prediction after training.
│ ├── tracklets
│ └── train.py <--- training the whole network.
│── utils <-- all related tools put here, like ros bag data into kitti format
│ └── bag_to_kitti <--- Take lidar value from ROS bag and save it as bin files.
└── external_models <-- use as a submodule, basically code from other repos.
└── didi-competition <--- Code from Udacity's challenge repo with slight modifications, synced regularly with Udacity's new updates.
Related data are organized in this way. (Under /data directory)
├── predicted <-- after prediction, results will be saved here.
│ ├── didi <-- when didi dataset is used, the results will be put here
│ └── kitti <-- When kitti dataset used for prediction, put the results here
│ ├── iou_per_obj.csv <-- What will be evaluated for this competition, IoU score
│ ├── pr_per_iou.csv <--precision and recall rate per iou, currently not be evaluated by didi's rule
│ └── tracklet_labels_pred.xml <-- Tracklet generated from prediction pipeline.
├── preprocessed <-- Data will be fed into MV3D net (After processed by data.py)
│ ├── didi <-- When didi dataset is processed, save it here
│ └── kitti <-- When Kitti dataset is processed, save it here
│ ├── gt_boxes3d
│ └── 2011_09_26
│ └── 0005
| |___ 00000.npy
├ |── gt_labels
│ └── 2011_09_26
│ └── 0005
| |___ 00000.npy
| ├── rgb
│ └── 2011_09_26
│ └── 0005
| |___ 00000.png
| ├── top
│ └── 2011_09_26
│ └── 0005
| |___ 00000.npy
| └── top_image
| └── 2011_09_26
| └── 0005
| |___ 00000.png
└── raw <-- this strictly follows the KITTI raw data file format, while separated into didi and kitti datasets.
├── didi <-- will be something similar to kitti raw data format below.
└── kitti
└── 2011_09_26
├── 2011_09_26_drive_0005_sync
│ ├── image_02
│ │ ├── data
│ │ │ └── 0000000000.png
│ │ └── timestamps.txt
│ ├── tracklet_labels.xml
│ └── velodyne_points
│ ├── data
│ │ └── 0000000000.bin
│ ├── timestamps_end.txt
│ ├── timestamps_start.txt
│ └── timestamps.txt
├── calib_cam_to_cam.txt
├── calib_imu_to_velo.txt
└── calib_velo_to_cam.txt