Paper title | DeepPano: Deep Panoramic Representation for 3-D Shape Recognition |
---|---|
Authors (affiliation) | Baoguang Shi () |
Venue/year | 2015, paper |
Keywords | Shi2015, |
Datasets/models | ModelNet-10, ModelNet-40 |
Notes | |
Prior/extended/related work | PANORAMA: A 3d shape descriptor based on panoramic views for unsupervised 3d object retrieval (2010) |
Code | Matlab |
Proposal: A robust representation of 3-D shapes learned with a deep CNN
Procedure
Firstly, each 3-D shape is converted into a panoramic view
Then, a variant of CNN is specifically designed for learning the deep representations directly from such views.
Difference from a typical CNN: a row-wise max-pooling layer is inserted between the convolution and fully-connected layers, making the learned representations invariant to rotation around a principal axis.
The performance of many tasks, including shape classification and shape retrieval, heavily depends on the quality of the representation.
This paper proposes DeepPano,
together with a Row-Wise Max-Pooling (RWMP) layer.
Classification of prior work: The previous methods on 3-D shape analysis can be coarsely categorized into model-based and view-based methods.
Model based methods calculate a set of features directly from the 3-D shape mesh or its rendered voxels.
Examples: Such methods include the Shape Histogram descriptor [2] and the Spin Images [3].
View based methods represent 3-D shapes by a set of views [4]–[10].
The views can be 2-D projections of the shape or the panoramic view.
However, different from most of the methods mentioned above that use hand-crafted features, we learn the representation from data with a variant of CNN.
[14] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proc. CVPR, 2015, pp. 1912–1920.
Our method is related to the previously introduced PANORAMA [6].
[6] P. Papadakis, I. Pratikakis, T. Theoharis, and S. J. Perantonis, “PANORAMA: A 3d shape descriptor based on panoramic views for unsupervised 3d object retrieval,” Int. J. Comput. Vis., vol. 89, no. 2–3, pp. 177–192, 2010.
Prior work also used panoramic views. In [6], Papadakis et al. proposed to represent a 3-D shape by the Discrete Fourier Transform and Discrete Wavelet Transform descriptors calculated from a set of panoramic views.
Major problem with the panoramic view: However, the panoramic view shifts when the 3-D shape rotates along its principal axis.
Prior work addresses this with normalization: In [6], this problem is alleviated by pose normalization.
[Fig. 1. Rotation invariance of DeepPano]
- (a) 3-D shapes of the same model, but rotated to different angles;
- (b) Convolutional feature map for each 3-D shape;
- (c) Output vectors of the RWMP layer;
- (d) Comparisons among within model distances, within class distances and between class distances
As illustrated in Fig. 1, the convolutional feature maps extracted from panoramic views shift when the 3-D shape rotates.
We pool the responses of each row so that the resulting representation is not affected by this kind of shift.
As a result, the representation is invariant to the 3-D shape rotation.
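The shift-invariance argument above can be checked on a toy feature map. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names `row_wise_max_pool` and `circular_shift` and the numbers in `feature_map` are made up for illustration, and the rotation is modelled as an exact circular column shift, which the real padded convolutional maps only approximate.

```python
def row_wise_max_pool(feature_map):
    """Collapse each row of a 2-D feature map to its maximum value."""
    return [max(row) for row in feature_map]


def circular_shift(feature_map, k):
    """Shift every row k columns to the right, wrapping around the edge."""
    shifted = []
    for row in feature_map:
        k_mod = k % len(row)
        shifted.append(row[-k_mod:] + row[:-k_mod] if k_mod else list(row))
    return shifted


# Hypothetical 3x4 feature map (rows = heights, columns = angles).
feature_map = [
    [0.1, 0.9, 0.3, 0.2],
    [0.5, 0.4, 0.8, 0.0],
    [0.7, 0.2, 0.1, 0.6],
]

pooled = row_wise_max_pool(feature_map)
# A rotation about the upright axis is modelled as a circular column shift.
pooled_rotated = row_wise_max_pool(circular_shift(feature_map, 3))
print(pooled == pooled_rotated)
```

Because the maximum of a row does not depend on the order of its entries, any circular shift leaves the pooled vector unchanged.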
Our method consists of two main steps:
Assumption of this paper: objects are upright oriented, as in most datasets (e.g. 3-D Warehouse). Throughout this letter, we assume that 3-D models are upright oriented, so that the rotation is along an axis that is also upright oriented. This assumption is satisfied in many real-world model repositories, such as the 3-D Warehouse [15].
The projection process is illustrated in Fig. 2.
[Fig. 2. Panoramic view construction]
- (a) Illustration of the panoramic view construction process.
- p, q and d are respectively the grid point, the corresponding point on the axis, and the value assigned to that grid point;
- (b) 3-D shapes and their corresponding panoramic views (with some padding)
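The cylinder projection in Fig. 2 can be approximated on a point cloud. The sketch below is an assumption-laden simplification, not the paper's procedure: it takes points instead of a mesh, uses the z-axis as the upright axis, and stores in each grid cell the largest distance from the axis among the points that fall into it. The function name `panoramic_view` and all sample points are hypothetical.

```python
import math


def panoramic_view(points, n_rows, n_cols, z_min, z_max):
    """Project 3-D points onto a cylinder grid around the z-axis.

    Rows index height bins, columns index angle bins. Each cell keeps the
    largest distance from the axis among its points (an assumed definition
    of the cell value, chosen for this sketch).
    """
    view = [[0.0] * n_cols for _ in range(n_rows)]
    for x, y, z in points:
        if not (z_min <= z <= z_max):
            continue  # ignore points outside the cylinder's height range
        angle = math.atan2(y, x) % (2 * math.pi)
        col = min(int(angle / (2 * math.pi) * n_cols), n_cols - 1)
        row = min(int((z - z_min) / (z_max - z_min) * n_rows), n_rows - 1)
        view[row][col] = max(view[row][col], math.hypot(x, y))
    return view


# Hypothetical point cloud with heights in [0, 1].
points = [
    (1.0, 1.0, 0.25), (0.5, 0.5, 0.25), (-1.0, -1.0, 0.25),
    (-2.0, 2.0, 0.75), (3.0, -3.0, 0.75),
]
view = panoramic_view(points, n_rows=2, n_cols=4, z_min=0.0, z_max=1.0)

# Rotating the shape by 90 degrees about the z-axis: (x, y) -> (-y, x).
rotated = [(-y, x, z) for x, y, z in points]
view_rotated = panoramic_view(rotated, n_rows=2, n_cols=4, z_min=0.0, z_max=1.0)
```

In this toy case the 90 degree rotation moves every column of `view` one position to the right with wrap-around, which is the shift behaviour the notes describe.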
A straightforward method is to train a CNN on the panoramic views of all training data, and extract the representation from it.
However, the view shifts when the 3-D shape rotates. This shift will greatly affect the representation produced by the CNN, although the CNN provides some form of translation invariance.
Moreover, unfolding the lateral surface creates two boundaries on the left and right sides of the panoramic view. The boundaries cause artifacts in the convolutional feature maps, thus affecting the representation extracted.
[Fig. 3. The network for learning and extracting shape representation]
- The network takes the padded panoramic view as the input.
- On the top it outputs a probability vector representing class probabilities.
- The 3-D shape representation can be extracted from the highlighted layers, namely RWMP, fc1 or fc2.
To avoid boundary artifacts, the panoramic view is padded on one side. The padded area is cloned from the other side of the map.
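As a minimal sketch of this padding step: the notes only say that one side is padded with content cloned from the other side, so padding the right edge with the leftmost columns is an assumption here, as are the name `pad_panorama` and the pad width.

```python
def pad_panorama(view, pad_width):
    """Append the first pad_width columns of each row to its right side."""
    return [row + row[:pad_width] for row in view]


# Hypothetical 2x4 panoramic view.
view = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
padded = pad_panorama(view, pad_width=2)
print(padded)
```

With this padding, a filter sliding over the right edge sees the same neighbours it would see if the cylinder had not been cut open.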
To obtain rotation-invariance, the representation has to be shift-invariant to the input panoramic view.
Applying RWMP: row-wise max-pooling layer (RWMP),
Training input: The network is trained on a dataset consisting of pairs of panoramic views and class labels, using the back propagation algorithm [16].
Training output: Finally, the representation can be extracted from the RWMP layer, or any fully-connected layer after it.
Since the network for learning the representation is itself a classifier, we directly adopt it for classification tasks.
The softmax layer on the top of the network outputs class probabilities, and the class with the highest probability is taken as the prediction.
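The prediction rule can be illustrated in a few lines. This is only a sketch of softmax followed by an argmax; the class names and logit values are invented for the example and do not come from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, class_names):
    """Return the most probable class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical network outputs for three hypothetical classes.
class_names = ["chair", "table", "sofa"]
label, prob = predict([1.0, 3.0, 0.5], class_names)
print(label)
```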
Since each 3-D shape is represented by a fixed-length vector and Euclidean distance is used for retrieval, we can perform fast retrieval on large-scale datasets, particularly when adopting approximate nearest neighbor search schemes, e.g. [17].
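Retrieval with fixed-length vectors and Euclidean distance can be sketched as an exhaustive ranking. The shape identifiers and vectors below are hypothetical, and the brute-force sort stands in for the approximate nearest neighbor schemes mentioned above, which would replace it on large datasets.

```python
import math


def retrieve(query, database, top_k):
    """Rank database shapes by Euclidean distance to the query vector."""
    ranked = sorted(database.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical representations (e.g. taken from the RWMP or an fc layer).
database = {
    "chair_01": [0.9, 0.1, 0.0],
    "table_01": [0.1, 0.9, 0.3],
    "chair_02": [0.8, 0.2, 0.1],
}
results = retrieve([1.0, 0.0, 0.0], database, top_k=2)
print(results)
```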
Panoramic views are constructed from 3-D shapes and representations are learned and extracted from them.
Limitations: The limitation of our method is similar to many previous view-based approaches,
In the future, some sequence prediction techniques [21], [22] might be used for exploring more contextual information,
in order to further improve the performance of shape recognition, as a panoramic view can be considered as a map of feature sequence.
In addition, establishing robust alignments/correspondences [23], [24] between different panoramic views is another direction worth studying.
DeepPano [28] converts 3D shapes into panoramic views; i.e., a cylinder projection around its principal axis.
Recently, CNN architectures have been extended to allow for recognition from image sequences using a single network,
by unwrapping an object shape into a panorama and max pooling across each row [33-DeepPano].
Drawback: However, both these methods assume that a fixed-length image sequence is provided during both training and testing, and hence are unsuitable for generalised multi-view recognition.
In [21], the authors suggest a new robust representation of 3D data by way of a cylindrical panoramic projection that is learned using a CNN.
The authors tested their panoramic representation on ModelNet datasets and outperformed typical methods when they published their work.
[Prior work] PANORAMA: A 3d shape descriptor based on panoramic views for unsupervised 3d object retrieval (2010)
Also, they presented a 3D descriptor (PANORAMA) [15] that captures the panoramic view of a 3D shape by projecting it to a lateral surface of a cylinder parallel to one of its three principal axes.
By aligning its principal axes to capture the global information and combining the 2D Discrete Fourier Transform and 2D Discrete Wavelet Transform, PANORAMA outperforms all the other 3D shape retrieval methods on several standard 3D benchmarks.