Chapter 4
Deep Learning Advances in Computer Vision with 3D Data: A Survey
A generic pipeline for processing a 3D scene
- Scans of real scenes can contain millions of points, so some form of preprocessing is commonly applied first. Methods proposed so far include:
- point reduction (i.e., removing redundant points in order to reduce the computational cost),
- data structuring (i.e., organizing the point cloud using data structures like kd-trees or octrees), and
- hardware exploitation (e.g., GPU calculations).
- Next, segmentation of the point cloud is typically performed in order to identify semantically meaningful regions.
- After the objects of the scene have been obtained from segmentation, keypoint detection and descriptor extraction are applied to every identified object or scene segment.
- The extracted representation is subsequently used to match the scene segments with known object models and, finally, to recognize or classify them into a category or even retrieve similar objects.
- 3D scene segmentation
- 3D keypoint detection / 3D descriptor extraction
- 3D shape retrieval / 3D object recognition
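The point-reduction step of the pipeline above can be illustrated with a voxel-grid downsampling sketch. This is only one common way to remove redundant points, not a method prescribed by the survey; the function name `voxel_downsample` and the `voxel_size` parameter are assumptions made for this example.

```python
from collections import defaultdict
import math


def voxel_downsample(points, voxel_size):
    """Replace all points that fall into the same cubic cell by their centroid."""
    cells = defaultdict(list)
    for x, y, z in points:
        # Integer cell index of the point along each axis.
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        cells[key].append((x, y, z))

    reduced = []
    for key in sorted(cells):
        members = cells[key]
        n = len(members)
        # Centroid of the points in this cell, computed per coordinate.
        reduced.append(tuple(sum(coord) / n for coord in zip(*members)))
    return reduced


# Four points forming two tight clusters (illustrative values).
cloud = [(0.01, 0.02, 0.00), (0.03, 0.01, 0.02),
         (0.52, 0.54, 0.51), (0.56, 0.53, 0.58)]
reduced = voxel_downsample(cloud, voxel_size=0.1)
print(len(cloud), "->", len(reduced), "points")
```

A larger `voxel_size` removes more points at the cost of geometric detail, which is the trade-off that the preprocessing step has to balance.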
These appear to be studies on the descriptors/detectors that were used before deep learning (the publication years are also before 2010).
2.1 3D Scene Segmentation
- semantic segmentation/labeling
- labeling each point of a scene as part of a foreground object of interest or of a background surface.
- 3D object classification
Nguyen and Le [2013] divide 3D point cloud segmentation methods into five categories:
- region-based
- edge-based
- attributes-based
- model-based
- graph-based methods
2.2 3D keypoint detection / 3D descriptor extraction
3D keypoint detection is an essential step for object recognition and retrieval.
Existing approaches (Tombari et al. [2013])
- Fixed-scale keypoint detectors : Identify distinctive keypoints at a constant scale given to the algorithm as an input argument
- Local Surface Patches (LSPs) [Chen and Bhanu 2007]
- Intrinsic Shape Signatures (ISSs) [Zhong 2009]
- the 3D detector proposed in Mian et al. [2010] termed as “KeyPoint Quality” (KPQ) in Tombari et al. [2013]
- Adaptive-scale keypoint detectors : Identify keypoints after creating a scale space defined on the surface or alternatively after computing an embedding of the data on a 2D plane.
- MeshDoG [Zaharescu et al. 2009]
2.3. 3D Object Retrieval and Recognition
Content-based image retrieval is a well-studied task in 2D computer vision.
3D shape retrieval is one of the first problems
Given a 3D object query, the goal is to retrieve semantically similar objects from a given database.
Two steps are included in a typical retrieval pipeline:
- (a) descriptors extraction from the 3D objects
- (b) matching of the queries’ descriptors with the stored descriptors of the database objects using an appropriate similarity measure.
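The two retrieval steps above can be sketched as follows, assuming the descriptors have already been extracted as fixed-length vectors. Cosine similarity is used here only as one possible similarity measure; the database contents and the names `cosine_similarity` and `retrieve` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, database, top_k=3):
    """Return the names of the top_k stored objects most similar to the query."""
    scored = [(cosine_similarity(query, desc), name)
              for name, desc in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical precomputed descriptors of the database objects.
database = {
    "chair_01": [0.9, 0.1, 0.0],
    "chair_02": [0.8, 0.2, 0.1],
    "table_01": [0.1, 0.9, 0.3],
}
ranking = retrieve([1.0, 0.1, 0.0], database, top_k=2)
print(ranking)
```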
Existing approaches on 3D object retrieval can be divided into [Gao and Dai 2014]
- (a) 3D model-based methods, which are based mostly on low-level descriptor extraction from the 3D models,
- (b) view-based methods, which utilize multiple 2D views of the 3D objects.
2.4 4D Modeling
3D data + Time
2.5. Current Trends
Basic explanation of DNNs -> omitted from these notes
- Descriptor extraction for use as input: methods that extract descriptors from the 3D data and give these as input to the DNN.
- Data acquisition: the approaches belonging to the second category exploit RGB-D data (i.e., separate color and depth channels) captured from popular low-cost cameras like Microsoft’s Kinect.
- Architecture design: deep architectures designed to have direct access to the 3D data form the third category.
- The fourth category includes methods utilizing one or more 2D projections/views of the 3D object/scene captured from different viewpoints and use them to feed the employed deep model.
- DL methods designed for data captured from hyperspectral cameras are included in the last category.
4.1. DL Architectures Exploiting Descriptors Extracted from 3D Data
- A common practice is to extract low-level descriptors and then, provide them as input to a DNN in order to get a more effective high-level representation for recognition, retrieval, or other tasks.
There are many hand-crafted features, but wouldn't feeding in the raw data be the best option?
4.2. DL Architectures Exploiting RGB-D Data
A study of data fusion methods for RGB-D visual recognition can be found in Sanchez-Riera et al. [2016].
The first 3D object recognition method using RGB-D data was proposed by Socher et al. [2012].
- The authors proposed a combination of convolutional and recursive neural networks where color and depth channels were processed separately.
- At first, two single-layer CNNs were employed in order to extract low-level descriptors from the RGB and depth images.
- Then, each CNN’s output was forwarded to a different set of RNNs initialized with random weights.
- The RNN descriptors extracted from each modality were finally merged and provided to a joint softmax classifier.
- The proposed method demonstrated accurate performance in classifying household objects.
B. Multiscale CNN
Couprie et al. [2013] used a multiscale CNN for semantic segmentation of indoor RGB-D scenes.
- The network processed the input depth and RGB images at three different scales, and the upsampled results were combined and forwarded to a classifier in order to get object class labels.
- The final labeling of the scene was obtained by merging the classifier’s predictions with a superpixel segmentation of the scene performed in parallel.
The possibility of using transfer learning between CNNs for object recognition was investigated in Alexandre [2014].
- The author proposed the employment of four independent CNNs for processing the four input channels of an RGB-D image.
- The four CNNs were trained sequentially, passing the weights of a trained CNN as input to the next.
- Experiments on 3D objects from 10 categories indicated that the proposed training strategy can boost the performance.
The task of RGB-D object recognition was also addressed by Eitel et al. [2015].
- A two-stream CNN architecture for RGB-D object recognition was designed in this work too.
- Each stream (one for color and the other for depth) contained five convolutional and two FC layers.
- The two streams were originally trained individually; afterward, they were fused together in an FC layer and a softmax classifier.
- The two CNNs employed for recognition were pretrained for the task of object classification on the ImageNet dataset; hence, preprocessing of the input data (especially depth) was required.
4.3. DL Architectures Exploiting Directly 3D Data
Recently, methods that use the full 3D geometric information directly have started to be developed.
A. 3D ShapeNets Wu et al. [2015]
- 3D shapes were provided as input (a 3D voxel grid where each voxel was a binary variable indicating whether it belonged to the 3D shape or was empty space), while the DBN model was employed.
- In order to diminish the huge number of parameters required from feeding a fully connected DBN with a 3D voxel volume of normal resolution, convolution with 3D filters was applied.
- More specifically, a Convolutional Deep Belief Network (CDBN) with five layers (three convolutional, one fully connected, and one output layer) was proposed.
- The model was initially pretrained layerwise and afterward fine-tuned by backpropagation.
- Standard contrastive divergence was used for training the first four layers, but the more sophisticated Fast Persistent Contrastive Divergence (FPCD) was employed for training the top layer.
- The proposed framework was tested on the tasks of 3D shape classification and retrieval, next-best view prediction, and view-based 2.5D recognition, outperforming other state-of-the-art methods.
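The parameter argument behind the use of 3D filters can be made concrete with a small count. The grid side, the number of hidden units, and the kernel settings below are illustrative assumptions chosen for this sketch, not figures taken from these notes.

```python
def fc_weights(grid_side, hidden_units):
    """Weights of a fully connected layer fed with a cubic voxel grid."""
    return grid_side ** 3 * hidden_units


def conv3d_weights(kernel_side, in_channels, out_channels):
    """Weights of a 3D convolutional layer (biases ignored)."""
    return kernel_side ** 3 * in_channels * out_channels


# Assumed sizes: a 30x30x30 binary grid, 1000 hidden units,
# and 48 filters of size 6x6x6 over a single input channel.
fc = fc_weights(grid_side=30, hidden_units=1000)
conv = conv3d_weights(kernel_side=6, in_channels=1, out_channels=48)
print(f"fully connected: {fc:,} weights")
print(f"3D convolution:  {conv:,} weights")
```

Because the convolutional weights are shared across all positions of the grid, their number depends on the kernel size rather than on the grid resolution.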
B. VoxNet
Maturana and Scherer have also employed a volumetric (i.e., spatially 3D) representation of the 3D data to perform 3D object recognition [Maturana and Scherer 2015].
- In the proposed VoxNet architecture, a volumetric occupancy grid of size 32×32×32 voxels was at first generated from a point cloud’s segment and then given as input to a CNN.
- The employed network was constructed using two convolutional (with 3D filters), one pooling, and two FC layers, and it was trained using SGD with momentum.
- An object class label was finally predicted for each segment.
- Data from three different domains were used for evaluating VoxNet.
- LIDAR data point clouds
- RGB-D point clouds
- CAD models
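A minimal sketch of how a point cloud segment could be turned into a 32×32×32 occupancy grid is given below. It assumes a plain binary occupancy model and a uniform scaling of the segment's bounding box; the occupancy model and normalization actually used by VoxNet may differ, and the function name `occupancy_grid` is hypothetical.

```python
def occupancy_grid(points, resolution=32):
    """Binary grid: a voxel is 1 if at least one point falls inside it."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # Use the largest extent for all axes so the aspect ratio is preserved.
    extent = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0

    grid = [[[0] * resolution for _ in range(resolution)]
            for _ in range(resolution)]
    for p in points:
        idx = [min(int((p[i] - mins[i]) / extent * resolution), resolution - 1)
               for i in range(3)]
        grid[idx[0]][idx[1]][idx[2]] = 1
    return grid


# Three points on the diagonal of a unit cube (illustrative values).
segment = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
grid = occupancy_grid(segment)
occupied = sum(v for plane in grid for row in plane for v in row)
print(occupied, "of", 32 ** 3, "voxels occupied")
```

The resulting nested list plays the role of the single-channel 3D input volume that a network with 3D filters would consume.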
C. boosted VoxNet
Sedaghat et al. [2016] modified VoxNet’s architecture in such a way that the object’s orientation was taken into account.
In their final model, the class labels were extracted directly from the orientation activations.
D. Convolutional AutoEncoder Extreme Learning Machine (CAE-ELM)
A new 3D descriptor learning method combining the strengths of CNNs, AEs, and ELMs
Components
- Convolutional feature map generation:
- in this part of the network, the 3D input data, that is, voxel and Signed Distance Field (SDF) data, were convolved with randomly generated 3D kernels and convolutional feature maps were computed.
- Next, average pooling was applied to the feature maps in order to maintain rotation invariance.
- AE descriptors extraction:
- after pooling, each feature map was provided as input to a separate AE.
- All AEs were originally initialized with random weights and their final (output) weights were learned via training.
- ELM classifier
- in the last part of the network, all descriptors extracted from the AEs were concatenated into a vector that was used for predicting the current 3D shape’s label.
In the performance evaluation, it outperformed 3D ShapeNets [Wu et al. 2015].
E. Mesh Convolutional Restricted Boltzmann Machines (MCRBMs)
Han et al. [2016] proposed MCRBMs for learning highly discriminative 3D features from 3D meshes.
- The learned features were designed to preserve the structure between local regions and can be used as local or global features.
- A novel raw representation of the local region, called Local Function Energy Distribution (LFED), was provided as input to the network.
- In addition, multiple MCRBMs were combined to form a deeper model, named Mesh Convolutional Deep Belief Network (MCDBN).
F. Proposals for improving performance
Qi et al. [2016] elaborated on two factors
- The first proposed CNN included a 3D extension of the mlpconv layers proposed in Lin et al. [2013].
- The second CNN initially took advantage of long anisotropic kernels to consider long-distance interactions and exploited an adapted NIN network [Lin et al. 2013].
G. Voxception-ResNet (VRN)
Improves performance by applying several DL techniques.
voxel-based (i.e., fully 3D) models for shape modeling and 3D object classification
The authors took advantage of recent advancements in the field of DNNs and designed an architecture that relied on
- (i) inception-style modules [Szegedy et al. 2016],
- (ii) batch normalization [Ioffe and Szegedy 2015],
- (iii) residual connections with preactivation [He et al. 2015a, 2016],
- (iv) stochastic network depth [Huang et al. 2016].
The proposed model, Voxception-ResNet (VRN), is 45 layers deep.
It should be noted that significant data augmentation was required for training such a deep model.
H. Deep Sliding Shapes
A pipeline for 3D object detection and recognition in RGB-D scenes was presented in Song and Xiao [2016].
Interestingly, instead of just working on the depth channel, Song and Xiao exploited the raw 3D information of the scenes by converting each depth image to a full 3D voxel grid using a directional Truncated Signed Distance Function (TSDF).
A fully 3D convolutional network, called 3D Region Proposal Network (RPN), was then utilized in order to generate 3D object bounding boxes from the 3D voxel grid at two different scales so that it could handle different object sizes.
Objectness scores were also provided for each generated object proposal.
Moreover, each detected 3D proposal box and its corresponding 2D color patch (i.e., 2D projection of the 3D proposal) were fed to a 3D ConvNet and a 2D ConvNet, respectively, for jointly learning the object’s category and 3D box regression.
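To give an intuition for the TSDF encoding mentioned above, the sketch below computes a plain projective truncated signed distance for voxels sampled along a single camera ray. This is a deliberately simplified stand-in: Song and Xiao use a directional variant, which is not reproduced here, and the function name and parameters are assumptions made for this example.

```python
def truncated_sdf(surface_depth, voxel_depth, truncation):
    """Signed distance from a voxel to the observed surface along a ray,
    normalized by the truncation band and clamped to [-1, 1].

    Positive values lie in front of the surface, negative values behind it.
    """
    sdf = surface_depth - voxel_depth
    return max(-1.0, min(1.0, sdf / truncation))


# The depth image reports a surface 2.0 m away along this ray (assumed values).
surface = 2.0
for voxel in (1.0, 1.95, 2.0, 2.05, 3.0):
    print(voxel, round(truncated_sdf(surface, voxel, truncation=0.1), 3))
```

Compared with a binary occupancy grid, every voxel near the surface carries a graded distance value rather than a 0/1 flag.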
4.4. DL Architectures Exploiting 2D Projections/Views of 3D Objects
Collecting multiple 2D projections rendered from different directions in order to represent a 3D shape/object is a “trick” commonly adopted for 3D shape analysis and understanding.
A. First paper to introduce the approach (2014)
In Zhu et al. [2014], an AE was used in order to generate a global deep representation of a 3D shape for the application scenario of 3D shape retrieval.
Pose normalization for differences in translation and scale was initially applied to each 3D model, while a set of 2D projections was subsequently collected for each of them.
After pretraining the stacked RBMs with the projections, the AE was fine-tuned using back-propagation in order to minimize the reconstruction error.
Finally, the hidden (code) layer was used for representing the corresponding projection/view of the 3D shape in the retrieval process.
Since more than one code was generated for each model (one per projection), a variant of the Hausdorff distance was used to compute the distance between the final representations of two different 3D shapes.
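The set-to-set comparison described above can be sketched with the classic symmetric Hausdorff distance. The notes only say that a variant of this distance was used, so the exact formula in the cited work may differ; the per-view codes below are made-up 2D vectors for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two code vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def directed_hausdorff(set_a, set_b):
    """Largest distance from a code in set_a to its nearest code in set_b."""
    return max(min(euclidean(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance between two sets of per-view codes."""
    return max(directed_hausdorff(set_a, set_b),
               directed_hausdorff(set_b, set_a))


# One code per projection; each shape is represented by a set of codes.
shape_a = [(0.0, 0.0), (1.0, 0.0)]
shape_b = [(0.0, 0.0), (4.0, 0.0)]
distance = hausdorff(shape_a, shape_b)
print(distance)
```

The distance is driven by the worst-matched view, so two shapes count as close only if every view of one has a similar view in the other.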
Z. Zhu, X. Wang, S. Bai, C. Yao, and X. Bai. 2014. Deep learning representation using autoencoder for 3D shape retrieval. CoRR abs/1409.7164 (2014).
B. Stacked Local Convolutional AutoEncoder (SLCAE) (2015)
An AE was also adopted for 3D object retrieval in Leng et al. [2015a].
In this method, an extension of the standard AE inspired by CNNs, called Stacked Local Convolutional AutoEncoder (SLCAE), was used.
A Local Convolutional Autoencoder (LCAE) is constructed by substituting the FC layers of a standard AE with locally connected layers using the convolution operation.
In the stacked version of LCAE, many encoders were placed on top of each other and the output of the last one was used as the representation of a 3D object.
The input provided to the proposed AE was multiple depth images of several views of the 3D object, while each layer of the architecture was trained using the gradient descent method.
C. 3D Convolutional Neural Network (3DCNN) (2016)
The 3DCNN deals with multiple 2D views of a 3D object at the same time.
Each object’s views were sorted into three reasonable sequences before being fed to the network, so that the views were listed in a fixed order.
The 3DCNN was comprised of four convolutional layers, three subsampling layers, and two FC layers.
Convolutional layers were initially pretrained in the same way as an AE is trained.
Afterward, the whole network was fine-tuned using backpropagation.
The output of the first FC layer was used as the representation of the input data for the retrieval.
Performance was good but weaker than that of SLCAE, which indicates that the latter representation is probably a better choice for this task.
D. Multi-View CNN (MVCNN) (2015)
Multiple views of a 3D object were also exploited in the work of Su et al. [2015] in order to build a compact shape descriptor for the tasks of 3D object classification and retrieval.
In order to obtain different views of the models, two setups were tested.
- The first setup included 12 rendered views of the 3D objects by placing an equal number of virtual cameras around them, while the second involved 80 views.
- All the available views of an object passed through the first part of the network separately, and then elementwise max pooling was performed across all views in the view pooling layer.
- Finally, the aggregated result passed through the remaining network.
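The view pooling step above amounts to an elementwise maximum over the per-view feature vectors. A minimal sketch follows, with made-up three-dimensional features standing in for the activations produced by the first part of the network; the function name `view_pool` is hypothetical.

```python
def view_pool(view_features):
    """Elementwise max across views: one aggregated vector per object."""
    return [max(values) for values in zip(*view_features)]


# Features of three views of the same object (illustrative numbers).
views = [
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.4],
    [0.5, 0.5, 0.8],
]
pooled = view_pool(views)
print(pooled)
```

Since the maximum does not depend on the order of its arguments, the aggregated descriptor is unchanged if the views are shuffled.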
For retrieval, the penultimate seventh layer of the network (which is fully connected) was used as shape descriptor.
The employed network was pretrained using the ImageNet1K dataset and then, fine-tuned using the 3D dataset ModelNet40 [Wu et al. 2015] that was used in the experimental evaluation of the MVCNN architecture.
The proposed shape descriptors (built from multiple 2D views) outperformed the 3D ShapeNets of Wu et al. [2015].
E. Pairwise Multi-View CNN (2016)
A different approach for exploiting the multiple views of a 3D object was followed by Johns et al. [2016] for the application scenario of multiview object recognition under unconstrained camera trajectories.
In this work, the collection of views was organized in pairs that were provided to a CNN together with their relative pose.
The VGG-M network [Chatfield et al. 2014] was employed in this case consisting of five convolutional and three FC layers.
Input: grayscale images + depth images
The outputs of the convolutional layers from the two images were concatenated before being provided to the first FC layer.
The proposed approach outperformed both the voxel-based 3D ShapeNets [Wu et al. 2015] and MVCNN.
A real-time 3D shape search engine based on 2D views of 3D objects was presented in Bai et al. [2016].
The proposed system exploited GPU for CNN-based feature extraction and utilized two inverted files,
- one for accelerating the multiview matching process
- the other for re-ranking the initial results.
The retrieval process for a query shape was reported to be completed within a second.
G. 2D view + 2D sketch (2015)
Requires sketch data, so it is hard to use as a reference here - skipped.
A different approach, in which 3D models were retrieved based on 2D sketches and 2D views, was presented in Wang et al. [2015]. More specifically, Wang et al. proposed an architecture that takes as input a {2D view + sketch} pair of an object. The model consisted of two Siamese CNNs (i.e., two identical subconvolutional networks), one for dealing with the 2D sketch of the 3D object to be retrieved and the other with the 2D view. The two subnetworks were trained separately using SGD and backpropagation. Each subnetwork contained three convolutional layers, each followed by a max-pooling layer, and one FC plus one output layer on top. Every 3D model was characterized by two randomly generated views, provided that their angles differed by more than 45°. The proposed network was tested on three datasets and achieved the best performance.
Xie et al. [2015b] presented the Multi-View Deep Extreme Learning Machine (MVD-ELM) and tested it on the tasks of 3D shape classification and segmentation.
Each 3D shape was represented by a collection of 20 2.5D depth images/projections captured uniformly using a sphere centered at each object.
The MVD-ELM model contained convolutional and pooling layers.
The weights in each convolutional layer were shared across all views. The output weights were optimized based on the extracted feature maps.
[Extended version] FC-MVD-ELM
A Fully Convolutional extension of the proposed model (FC-MVD-ELM) was also presented for the task of 3D shape segmentation.
This network contained only two convolutional layers without any pooling layer.
FC-MVD-ELM was trained using the multiview depth images of the training examples.
Then, all the predicted labels were projected back into the original 3D mesh.
Finally, the segmentation result was smoothed using graph cuts optimization.
I. sphere rendering
Includes a comparative analysis: volumetric vs. multiview CNNs
For the case of multiview CNNs, Qi et al. proposed sphere rendering, that is, multiresolution 3D filtering in order to exploit information from multiple scales, used in combination with training data augmentation.
4.5. DL Architectures Exploiting HyperSpectral Data
Hyperspectral data (satellites, etc.) -> skipped
4.6. DL Architectures Fusing Different 3D Data Modalities
Studies that fuse different data modalities:
- Doulamis and Doulamis [2012]
- Martínez and Yannakakis [2014]
- Xu et al. [2015a]
- Zhang et al. [2016b]
A. FusionNet
3D volumetric + 2D pixel representations, using the AlexNet network
In the work of Hegde and Zadeh [2016], a fusion of volumetric (i.e., 3D) and pixel (i.e., 2D views) representations was attempted for 3D object classification.
More specifically, the authors used the AlexNet network [Krizhevsky et al. 2012] for the 2D views of each 3D object, while they proposed two 3D CNNs for the volumetric data.
The multiview network performed better on ModelNet40 than the volumetric ones, but the highest performance was achieved by the combination of the three different networks, named FusionNet.
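One simple way to combine the outputs of several networks is late fusion of their per-class scores. The notes do not state how FusionNet merges its three networks, so the weighted average below is purely an assumption for illustration, as are the score values and the names `fuse_scores` and `predict`.

```python
def fuse_scores(score_lists, weights=None):
    """Weighted average of per-class scores coming from several networks."""
    if weights is None:
        weights = [1.0] * len(score_lists)
    total = sum(weights)
    num_classes = len(score_lists[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, score_lists)) / total
        for c in range(num_classes)
    ]


def predict(scores):
    """Index of the class with the highest fused score."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Hypothetical class probabilities from one multiview and two volumetric nets.
multiview = [0.20, 0.70, 0.10]
volumetric_a = [0.50, 0.40, 0.10]
volumetric_b = [0.30, 0.60, 0.10]
fused = fuse_scores([multiview, volumetric_a, volumetric_b])
label = predict(fused)
print(label, [round(s, 3) for s in fused])
```

In this toy case one volumetric network alone would pick class 0, while the fused scores favor class 1, which illustrates how a combination can correct an individual network.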
B. Convolutional hypercube pyramid
In a similar vein, RGB, Depth, and Point Cloud data were combined in Zaki et al. [2016].
Embedding of the depth maps and point clouds was initially performed, while a CNN pretrained on RGB images was employed for feature extraction.
A Hypercube Pyramid descriptor was proposed for representing multiscale, spatially relevant information for object and instance classification using ELMs.
The extracted descriptor was fused with the activations of the pretrained network’s FC layers creating an even more compact representation.
4.7. Overview of the DL Architectures Designed for 3D Data
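Because deep learning techniques were developed for 1D and 2D data, they are hard to apply directly to 3D data. This chapter described five categories of approaches that address this problem (classified by how the input data are handled).
- High-level descriptors obtained with DL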
- 2.
- Replacing 2D CNNs with 3D CNNs
- 4.
- Multiview
In order to deal with the 3D data, many researchers took advantage of the developments in feature engineering.
Low-level feature extraction has been used in several computer vision tasks with great success and a large variety of local or global descriptors has been proposed for 3D data so far.
Since low-level descriptors are usually not sufficient to characterize the high-level semantics of the 3D objects, the works in this category exploited them in combination with a deep model in order to extract high-level descriptors.
However, considering the complexity of 3D data, this representation may be lacking discriminative power since the shallow representations may omit significant information from the 3D representation.
RGB-D sensors provide the extra depth modality (in addition to the standard RGB channels) that contains important information about a 3D object’s shape. Most researchers dealt with color and depth channels (i.e., images) separately, while others used only the depth information in order to design their systems. The big advantage of these sensors is that they are inexpensive for an average user and at the same time many open-source software solutions exist facilitating their usage. However, their low cost is often combined with noisy and incomplete captured data that probably makes them unsuitable for complex scenarios.
Exploiting directly the 3D information by replacing the 2D convolutional layers with 3D ones has been attempted by some recent works.
3D volumetric models provide a rich and powerful representation of 3D shapes including all the important details.
Despite the huge advancements made in computational hardware, their processing is still demanding both in terms of memory and computation time. As a result, only low resolutions have been utilized so far.
Other researchers approached the problem from a different angle and utilized one or more 2D views of a scene captured from different viewpoints (multiview). By doing so, the problem is indirectly transformed to the image domain; therefore, multiview-based methods can take advantage of the latest advancements in image processing and are straightforward to employ.
However, several concerns arise from their exploitation:
- (1) the full 3D geometry information of a 3D shape is lost in 2D views
- (2) the number of views that should be acquired and the way in which they should be linked for representing a 3D shape is a critical step that could influence both the efficiency and the effectiveness of a proposed method.
[ Comparison of methods for 3D object retrieval ]
- Regarding 3D shape retrieval, AEs are a typical choice.
[ Comparison of methods for 3D object classification/recognition ]
- DBNs and CNNs have been used extensively in 3D object classification and recognition.
[ Accuracy comparison ]
DL-based methods outperform conventional CV-based methods on 3D data as well.
In this work, we classified methods applying DL techniques on 3D data into five categories based on how they treated the input data before feeding them to the employed DNN.
Experimental results indicate in general a slight advantage of methods exploiting multiple 2D views for representing the 3D scene in comparison to those taking advantage of the full 3D geometry.
A recent work managed to achieve superior performance utilizing volumetric (i.e., 3D) models; however, a more complex architecture was proposed and notable data augmentation was required.
Future research directions
Combination with RNNs: Another current research trend, not applied yet in 3D, is the use of spatial RNNs, especially LSTM-based models. LSTMs have several desired properties, for example, they can be fine-tuned end-to-end and they allow variable lengths in input and output.
Development of visualization techniques: By visualizing how a network responds to a specific input, the opportunity to guide and improve its training process or its design arises.