Title | Depth Estimation from Single Image Using CNN-Residual Network |
---|---|
Author (Affiliation) | Xiaobai Ma (Stanford) |
Venue/Year | cs231n report, 2017 |
Keywords | 1 camera, CNN+FC, Pure CNN, CNN+Residual, NYU Depth |
Notes | |
Code | similar project |
Limitations of prior work: classic methods such as [6] do not predict depth explicitly, but instead categorize image regions into geometric structures and then compose a simple 3D model of the scene.
A second type of related work performs feature-based matching between a given RGB image and the images of an RGB-D repository in order to find the nearest neighbors; the retrieved depth counterparts are then warped and combined to produce the final depth map.
Karsch et al. [7] perform warping using SIFT Flow [16], followed by a global optimization scheme, whereas Konrad et al. [8] compute a median over the retrieved depth maps followed by cross bilateral filtering for smoothing.
Instead of warping the candidates, Liu et al. [19] formulate the optimization problem as a Conditional Random Field (CRF) with continuous and discrete variable potentials.
Notably, these approaches rely on the assumption that similarities between regions of the RGB images also imply similar depth cues.
Recently, CNN-based methods have been proposed. Since the task is closely related to semantic labeling, most works have built upon the most successful architectures of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [25], often initializing their networks with AlexNet [9] or the deeper VGG [30]. Eigen et al. [3] are the first to use a CNN for single-image depth estimation.
The authors addressed the task by employing two deep network stacks.
[3] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
This idea is later extended in [2], where three stacks of CNN are used to additionally predict surface normals and labels together with depth.
[2] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
Roy et al. [24] combined CNN with regression forests [14], using very shallow architectures at each tree node, thus limiting the need for big data.
[24] A. Roy and S. Todorovic. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5506–5514, 2016.
Liu et al. [17] propose to learn the unary and pairwise potentials during CNN training in the form of a conditional random field (CRF) loss, and achieved state-of-the-art results without using geometric priors.
This idea makes sense because the depth values are continuous [18].
[17] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5162–5170, 2015.
Li et al. [13] and Wang et al. [32] use hierarchical CRFs to refine their patch-wise CNN predictions from superpixel down to pixel level.
[13] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015.
[32] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2809, 2015.
Researchers are trying to improve the accuracy of CNN-based methods further.
Recent work has shown that fully convolutional networks (FCNs) [20] are a desirable choice for dense prediction problems due to their ability to take arbitrarily sized inputs and return spatial outputs.
[1] uses an FCN and adopts a CRF as post-processing.
[1] Y. Cao, Z. Wu, and C. Shen. Estimating depth from monocular images as classification using deep fully convolutional residual networks. arXiv preprint arXiv:1605.02305, 2016.
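The shape-flexibility of FCNs comes from their use of sliding-window operations only. A minimal NumPy sketch (a toy single-channel convolution for illustration, not the actual network from [1]) shows that the same filter weights apply to any input size, with the output size simply tracking the input size, whereas a fully connected layer's weight matrix would fix the input size:

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive single-channel 2-D convolution with 'valid' padding."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

w = np.random.randn(3, 3)                       # the SAME 9 weights serve any input size
small = conv2d_valid(np.random.randn(32, 32), w)
large = conv2d_valid(np.random.randn(64, 48), w)
print(small.shape)  # (30, 30)
print(large.shape)  # (62, 46) -- spatial output scales with the input
```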
Besides classical convolutional layers, [12] uses dilated convolutions as an efficient way to expand the receptive field of a neuron without increasing the number of parameters for depth estimation;
[12] B. Li, Y. Dai, H. Chen, and M. He. Single image depth estimation by dilated deep residual convolutional neural network and soft-weight-sum inference. arXiv preprint arXiv:1705.00534, 2017.
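A dilated convolution spaces the kernel taps apart, so the receptive field grows with the dilation rate while the parameter count stays fixed. A 1-D NumPy sketch (purely illustrative, not the 2-D network from [12]):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """1-D dilated convolution: kernel taps spaced `dilation` apart."""
    k = len(w)
    span = (k - 1) * dilation + 1          # receptive field of this layer
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        for j in range(k):
            out[i] += w[j] * x[i + j * dilation]
    return out

x = np.arange(20, dtype=float)
w = np.array([1.0, 1.0, 1.0])              # 3 parameters in both cases
dense = dilated_conv1d(x, w, dilation=1)   # receptive field 3
wide = dilated_conv1d(x, w, dilation=4)    # receptive field 9, still 3 params
print(dense[0])  # 0+1+2 = 3.0
print(wide[0])   # 0+4+8 = 12.0
```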
[23] uses transpose convolutions for up-sampling the feature map and output for image segmentation.
[23] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
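A transpose convolution runs a strided convolution in reverse: each input element scatters a scaled copy of the kernel into a larger output, giving a learnable up-sampling. A 1-D NumPy sketch (illustrative only, not U-Net's actual 2-D up-convolution):

```python
import numpy as np

def transpose_conv1d(x, w, stride):
    """1-D transposed convolution: each input scatters a scaled copy of w."""
    k = len(w)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * w
    return out

feat = np.array([1.0, 2.0, 3.0])
up = transpose_conv1d(feat, np.array([1.0, 1.0]), stride=2)
# length grows from 3 to (3-1)*2 + 2 = 6
print(up)  # [1. 1. 2. 2. 3. 3.]
```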
Laina et al. [11] proposed a fully convolutional network, which removes the fully connected layers and replaces them with efficient residual up-sampling blocks.
[11] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
Method adopted in this paper: we closely follow this work in this project.
The first architecture follows the work in [3], where the authors used coarse and fine CNN networks to do depth estimation.
Its many parameters make it prone to overfitting, which is hard to fix even with dropout layers.
To address the overfitting, the FC layers are replaced with convolutional layers: this performs better than CNN+FC.
Other CNN-based methods for depth estimation generally …
So we changed the last two layers of this network.
Our third and most promising architecture follows the work in [11].
Transfer learning is performed; the input is an RGB image.
A CNN with fully connected layers like the one used in [3] is powerful, but can easily overfit the dataset because of the large number of parameters in the fully connected layers.
Motivation: this motivates us to use only convolutional layers and to stack more of them to increase the receptive field.
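The parameter imbalance is easy to quantify. With AlexNet-like layer sizes (assumed here purely for illustration, not taken from the report), a single fully connected layer dwarfs a 3×3 convolution:

```python
# Illustrative AlexNet-like sizes (assumed, not from the paper):
# a 13x13x256 feature map feeding either an FC layer or a 3x3 convolution.
h, w, c = 13, 13, 256
fc_units = 4096
fc_params = h * w * c * fc_units   # every output unit sees every input: ~177M weights
conv_params = 3 * 3 * c * c        # 3x3 kernel, 256 -> 256 channels: ~0.6M weights
print(fc_params)    # 177209344
print(conv_params)  # 589824
```

Roughly 300x fewer parameters per layer, which is why stacking convolutions overfits far less than the CNN+FC design.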
We also try [18], a CNN architecture using transfer learning on ResNet [5], and are able to get reasonable results on the validation set.
[18] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE transactions on pattern analysis and machine intelligence, 38(10):2024–2039, 2016.