Paper title | PointNet: A 3D Convolutional Neural Network for real-time object class recognition |
Author (affiliation) | A. Garcia-Garcia () |
Venue/Year | IJCNN 2016, paper |
Keywords | PointNet3D2016, classification |
Dataset (sensor)/Model | ModelNet (RGB-D) |
Notes | |
Code |
PointNet proposal: a new approach inspired by VoxNet and 3D ShapeNets
Improvements:
Most prior work relied on hand-crafted local features. The vast majority of 3D object recognition methods [2] are typically based on hand-crafted local feature descriptors [3].
Pipeline of existing methods: These kinds of approaches rely on specific pipelines [4] consisting of
How classification is solved: distance-based or machine learning algorithms. That classification is performed by using distance metrics or machine learning algorithms, e.g.,
Drawbacks: domain knowledge is required, and the result is still imperfect. Handcrafting feature descriptors requires domain expertise and remarkable engineering and theoretical skills, and even when both requirements are fulfilled they are still far from perfection.
Contribution of this paper: Its contribution is twofold:
Research began applying CNNs to 2.5D data by treating depth as one more channel. Due to the successful applications of CNNs to 2D image analysis, several researchers decided to take the same approach for 2.5D data, treating the depth channel as an additional one together with the RGB ones [10]–[12].
These methods simply extend the architecture to take four channels (matrices) as input instead of the three featured by RGB images.
This is still an image-based approach which, despite its straightforward implementation, does not fully exploit the geometric information of 3D shapes.
[10] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, “Convolutional-recursive deep learning for 3d object classification,” in Advances in Neural Information Processing Systems, 2012, pp. 665–673.
[11] L. A. Alexandre, “3d object recognition using convolutional neural networks with transfer learning between input channels,” in Proc. the 13th International Conference on Intelligent Autonomous Systems, 2014.
[12] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
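The four-channel input used by the 2.5D approaches above can be illustrated with a minimal sketch. The function name `stack_rgbd` and the nested-list image layout are assumptions made for illustration only; they are not taken from [10]–[12], which define their own preprocessing.

```python
def stack_rgbd(rgb, depth):
    """Append each pixel's depth value to its (r, g, b) triple.

    rgb:   H x W nested list of (r, g, b) tuples (hypothetical layout)
    depth: H x W nested list of depth values
    """
    if len(rgb) != len(depth):
        raise ValueError("rgb and depth must have the same height")
    rgbd = []
    for rgb_row, depth_row in zip(rgb, depth):
        if len(rgb_row) != len(depth_row):
            raise ValueError("rgb and depth must have the same width")
        # Depth becomes a fourth channel next to R, G and B.
        rgbd.append([(r, g, b, d) for (r, g, b), d in zip(rgb_row, depth_row)])
    return rgbd


# Hypothetical 2x2 image with one depth value (in metres) per pixel.
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
depth = [[0.5, 0.7],
         [1.2, 1.1]]

rgbd = stack_rgbd(rgb, depth)
print(rgbd[0][0])  # (255, 0, 0, 0.5)
```

The result is still a 2D grid of pixels: neighbouring entries are neighbours in the image plane, not necessarily in 3D space, which is the limitation the notes point out.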
More recently, research has also started to use full 3D information. Apart from 2.5D approaches, specific architectures to learn from volumetric data, which make use of pure 3D convolutions, have been recently developed.
Those architectures are commonly referred to as 3DCNNs and their foundations are the same as the 2D or 2.5D ones, but the nature of the input data is radically different.
Because volumetric data is expensive to process, it is converted to a more compact representation first. Since volumetric data is usually quite dense and hard to process, most of the successful 3DCNNs resort to a more compact representation of the 3D space:
3DCNN approaches are starting to gain popularity. Those 3DCNNs are slowly overtaking other approaches when applying object recognition to complete 3D scenes [15-DeepSlidingShape].
Two reasons: This progress has been mainly enabled by two factors:
The substantial growth in the number of 3D models available online through repositories,
The reduction of training times thanks to frameworks and libraries which exploit the power of massively parallel architectures for this kind of task.
However, although 3D data is plentiful, labeled data is scarce. On the one hand, there exist many collections of 3D models, but they tend to be small and usually lack annotations and other useful information for training this kind of deep architecture.
In contrast, good 2D datasets are abundant. In contrast, 2D approaches have taken advantage of the numerous and high-quality datasets that already exist such as ImageNet [9], LabelMe [16], and SUN [17].
Hence much recent effort has gone into growing 3D datasets: ModelNet and ShapeNets. During the last years, researchers have unified efforts to create large-scale annotated 3D datasets inspired by the success of the 2D counterparts. The most popular 3D datasets which have revamped data-driven solutions, for computer vision in general and object recognition in particular, are the Princeton ModelNet [14] and ShapeNets [18] datasets.
Hardware and deep learning frameworks have also advanced. On the other hand, the creation of deep learning frameworks such as Caffe [19], Theano [20], Torch [21], or TensorFlow [22], which allow researchers to easily express and launch their architectures and accelerate the training calculations with Graphics Processing Units (GPUs) by using CUDA or OpenCL, has enabled quick prototyping and testing.
Both facts have turned out to be crucial for the development of the field.
[15] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” arXiv preprint arXiv:1511.02300, 2015.
Proposed method: The proposed system takes a point cloud of an object as an input and predicts its class label.
Plain image classification alone is not meaningful: the object has to be detected in the full image first and then classified.
In this regard, the proposal is twofold:
– inspired by VoxNet [13] occupancy models based on probabilistic estimates – provides a compact representation of the object’s 3D information from the point cloud.
That grid is fed to the CNN architecture, which in turn computes a label for that sample, i.e., predicts the class of the object.
This architecture was implemented using the Point Cloud Library (PCL) [23] – which contains state-of-the-art algorithms for 3D point cloud processing – and Caffe [19], a deep learning framework developed and maintained by the Berkeley Vision and Learning Center (BVLC) and an active community of contributors on GitHub.
This BSD-licensed C++ library allows us to design, train, and deploy CNN architectures efficiently, mainly thanks to its drop-in integration of NVIDIA cuDNN [24] to take advantage of GPU acceleration.
At that midpoint, occupancy grids provide considerable shape cues to perform learning, while enabling an efficient processing of that information thanks to their array-like implementation.
Recent 3D deep learning architectures increasingly use occupancy grids. Recent 3D deep learning architectures make use of occupancy grids as a representation for the input data to be learned or classified.
[25] S. Thrun, “Learning occupancy grid maps with forward sensor models,” Autonomous Robots, vol. 15, no. 2, pp. 111–127, 2003.
3D ShapeNets [14] is a Convolutional Deep Belief Network (CDBN) which represents a 3D shape as a 30 × 30 × 30 binary tensor in which a one indicates that a voxel intersects the mesh surface, and a zero represents empty space.
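A binary occupancy tensor of this kind can be sketched as follows. This is only an illustration under stated assumptions: the surface is approximated by sample points already normalized to the unit cube, whereas 3D ShapeNets voxelizes the mesh itself, and the names `binary_occupancy` and `voxel_value` are hypothetical.

```python
def binary_occupancy(surface_points, size=30):
    """Return the set of (i, j, k) voxel indices that contain a surface sample.

    Assumes every coordinate lies in [0, 1]; the grid is size x size x size.
    """
    occupied = set()
    for x, y, z in surface_points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last voxel.
        idx = tuple(min(int(c * size), size - 1) for c in (x, y, z))
        occupied.add(idx)
    return occupied


def voxel_value(occupied, i, j, k):
    """1 if the voxel touches the sampled surface, 0 for empty space."""
    return 1 if (i, j, k) in occupied else 0


# Hypothetical surface samples; the last two fall into the same voxel.
samples = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (0.51, 0.5, 0.5)]
occupied = binary_occupancy(samples)
print(voxel_value(occupied, 15, 15, 15))  # 1
print(voxel_value(occupied, 1, 1, 1))     # 0
```

Storing only the occupied indices keeps the sketch short; a dense 30 × 30 × 30 array holding the same zeros and ones would be the form actually fed to a network.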
VoxNet [13] introduces three different occupancy grids (32 × 32 × 32 voxels) that employ 3D ray tracing to compute the number of beams hitting or passing each voxel and then use that information to compute the value of each voxel depending on the chosen model:
The binary and density grids proposed by Maturana et al. [13] differentiate unknown and empty space, whilst the hit grid and the binary tensor do not.
[13] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition.” IROS, 2015.
Until recently, VoxNet’s occupancy grid has shown the best performance. Currently, VoxNet’s occupancy grid holds the best accuracy in the ModelNet challenge for the 3D-centric approaches described above.
However, ray tracing grids considerably harmed performance in terms of execution time so that other approaches must be considered for a real-time implementation.
In that very same work, the authors show that hit grids performed comparably to other approaches while keeping a low complexity to achieve a reduced runtime.
In this regard, we propose an occupancy grid inspired by the aforementioned successes but aiming to maintain a reasonable accuracy while allowing a real-time implementation.
In our volumetric representation, each point of a cloud is mapped to a voxel of a fixed-size occupancy grid.
Before performing that mapping, the object cloud is scaled to fit the grid.
Each voxel will hold a value representing the number of points mapped to itself.
Finally, the values held by each cell are normalized. Figure 1 shows the proposed occupancy grid representation for a sample object.
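The steps just described (scale the cloud to the grid, count points per voxel, normalize) can be sketched in a few lines. This is a rough illustration rather than the authors' PCL implementation: the function name `density_grid`, the uniform scaling by the longest bounding-box side, and the normalization by the largest voxel count are all assumptions, since the notes do not specify them.

```python
def density_grid(points, size=8):
    """Map a point cloud to a size^3 grid of normalized per-voxel point counts."""
    if not points:
        raise ValueError("empty point cloud")
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # Assumption: one uniform scale, so the longest bounding-box side spans the grid.
    extent = max(maxs[i] - mins[i] for i in range(3)) or 1.0
    scale = size / extent
    grid = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for p in points:
        # Clamp so points on the upper boundary land in the last voxel.
        ix, iy, iz = (min(int((p[i] - mins[i]) * scale), size - 1) for i in range(3))
        grid[ix][iy][iz] += 1.0  # each voxel counts the points mapped to it
    # Assumption: normalize by the largest count so every value lies in [0, 1].
    peak = max(v for plane in grid for row in plane for v in row)
    for plane in grid:
        for row in plane:
            for k in range(size):
                row[k] /= peak
    return grid


# Hypothetical three-point cloud on a 2x2x2 grid.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (1.0, 1.0, 1.0)]
grid = density_grid(cloud, size=2)
print(grid[0][0][0], grid[1][1][1])  # 1.0 0.5
```

Unlike the ray-traced grids, this needs a single pass over the points and no beam computation, which is consistent with the real-time goal stated above.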