Title | SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving |
---|---|
Authors (affiliation) | Bichen Wu (UC Berkeley) |
Venue/Year | Bichen2016, CVPR 2017, paper |
Keywords | KITTI, YOLO + SqueezeNet |
Reference | summary |
Code | TF, TF w/ KITTI, Docker |
Requirements for object detection in autonomous driving
Key difference: In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities.
Why it is fast: The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast.
Image data is one of the most important inputs.
Autonomous driving poses some basic requirements for image object detectors.
The detection pipeline of SqueezeDet is inspired by [21].
The backbone architecture is adopted from SqueezeNet [16].
[21] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[16] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size. arXiv:1602.07360, 2016
From 2005 to 2013, various techniques were applied to advance the accuracy of object detection on datasets such as PASCAL [7]. In most of these years, versions of HOG+SVM [5] or DPM [8] led the state-of-the-art accuracy on these datasets.
However, in 2013, Girshick et al. proposed Region-based Convolutional Neural Networks (RCNN)[11], which led to substantial gains in object detection accuracy.
The R-CNN approach begins by identifying region proposals (regions of the image that are likely to contain objects) and then classifying each of these regions with a CNN.
Disadvantage: One disadvantage of R-CNN is that it computes the CNN independently on each region proposal, leading to time-consuming (≤ 1 fps) and energy-inefficient (≥ 200 J/frame) computation.
Remedy: To remedy this, Girshick et al. experimented with a number of strategies to amortize computation across the region proposals [13, 17, 10], culminating in Faster R-CNN [22].
Another model, R-FCN, delivers accuracy that is competitive with R-CNN; because R-FCN is fully convolutional, it can amortize even more computation across the region proposals.
[2] modified the CNN architecture to use shallower networks to improve accuracy.
[3, 26] on the other hand focused on generating better region proposals.
[2] K. Ashraf, B. Wu, F. N. Iandola, M. W. Moskewicz, and K. Keutzer. Shallow networks for high-accuracy road object detection. arXiv:1606.01561, 2016.
[3] Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
[26] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Subcategoryaware convolutional neural networks for object proposals and detection. arXiv:1604.04693, 2016.
Region proposals are a cornerstone in all of the object detection methods that we have discussed so far.
However, in YOLO (You Only Look Once) [21], region proposition and classification are integrated into one single stage.
Given the same level of accuracy, it is often beneficial to develop smaller CNNs (i.e. CNNs with fewer model parameters), as discussed in [16].
[20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
FCN defines a broad class of CNNs where the output of the final parameterized layer is a grid rather than a vector.
This grid output can be used for per-pixel prediction, which makes it useful in semantic segmentation, where each location in the grid corresponds to the predicted class of a pixel.
FCN models have been applied in other areas as well.
For image classification, a CNN needs to output a 1-dimensional vector of class probabilities. One common approach is to have one or more fully connected layers, which by definition output a 1D vector of size 1×1×Channels (e.g. [18, 23]).
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
However, an alternative approach is to have the final parameterized layer be a convolutional layer that outputs a grid (H×W×Channels), and then use average pooling to downsample the grid to 1×1×Channels, producing a vector of class probabilities (e.g. [16, 19]).
[19] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
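To make the grid-then-average-pooling idea concrete, here is a minimal numpy sketch (my own illustration, not code from any of the cited papers; the shapes, the `fully_convolutional_classifier` name, and the random weights are made up):

```python
import numpy as np

def fully_convolutional_classifier(feature_map, class_conv_weights):
    """Toy illustration: a final 1x1 convolution outputs an H x W x C grid of
    class scores, which is average-pooled down to a single C-vector."""
    # 1x1 convolution: every spatial location gets its own C class scores.
    score_grid = feature_map @ class_conv_weights   # (H, W, Ch_f) -> (H, W, C)
    # Global average pooling collapses the grid to one score per class.
    scores = score_grid.mean(axis=(0, 1))           # (C,)
    # Softmax turns the scores into class probabilities.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Example: a 13x13 feature map with 512 channels, 20 classes.
probs = fully_convolutional_classifier(np.random.rand(13, 13, 512),
                                        np.random.rand(512, 20))
print(probs.shape, probs.sum())                     # (20,) ~1.0
```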
Finally, the R-FCN method that we mentioned earlier in this section is a fully-convolutional network.
(single-stage detection)
Inspired by YOLO [21], we adopt a single-stage detection pipeline: region proposition and classification are performed by one single network simultaneously.
The feature map is fed into the ConvDet layer.
It computes bounding boxes centered around W × H uniformly distributed spatial grids, where W and H are the numbers of grid centers along the horizontal and vertical axes.
[Figure 1. SqueezeDet detection pipeline.]
- A convolutional neural network extracts a feature map from the input image and feeds it into the ConvDet layer.
- The ConvDet layer then computes bounding boxes centered around W × H uniformly distributed grid centers.
- Each bounding box is associated with 1 confidence score and C conditional class probabilities.
- Then, we keep the top N bounding boxes with the highest confidence and use NMS to filter them to get the final detections.
Each bounding box is associated with C + 1 values: C is the number of classes to distinguish, and the extra 1 is for the confidence score. A high confidence score means a high probability that an object of interest does exist and that the overlap between the predicted bounding box and the ground truth is high.
As in YOLO, the confidence score is defined as $$Pr(Object) \times IOU^{pred}_{truth}$$
The other C scalars represent the conditional class probability distribution given that the object exists within the bounding box.
More formally, we denote the conditional probabilities as $$Pr(class_c \mid Object), c \in [1,C]$$
We assign the label with the highest conditional probability to this bounding box, and we use
$$\max_c Pr(class_c \mid Object) \times Pr(Object) \times IOU^{pred}_{truth}$$
as the metric to estimate the confidence of the bounding box prediction.
Finally, we keep the top N bounding boxes with the highest confidence and use Non-Maximum Suppression (NMS) to filter redundant bounding boxes to obtain the final detections.
During inference, the entire detection pipeline consists of only one forward pass of one neural network with minimal post-processing.
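As a hedged sketch of this post-processing (my own illustration, not the authors' released code; `top_n_then_nms`, the corner box format, and the defaults top_n=64 and nms_thresh=0.4 are assumptions), the filtering can be written as a per-class greedy NMS over the top-N most confident boxes:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def top_n_then_nms(boxes, conf, class_probs, top_n=64, nms_thresh=0.4):
    """boxes: (M, 4), conf: (M,) confidence scores, class_probs: (M, C).
    Score each box by max_c Pr(class_c|Object) * conf, keep the top N,
    then greedily suppress boxes that overlap a higher-scoring kept box
    of the same class."""
    scores = class_probs.max(axis=1) * conf
    labels = class_probs.argmax(axis=1)
    order = np.argsort(-scores)[:top_n]
    keep = []
    for i in order:
        if all(labels[i] != labels[j] or iou(boxes[i], boxes[j]) < nms_thresh
               for j in keep):
            keep.append(i)
    return [(boxes[i], labels[i], scores[i]) for i in keep]
```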
SqueezeDet's basic approach is taken from YOLO, but by using the ConvDet layer it can generate tens of thousands of region proposals with far fewer model parameters than YOLO.
ConvDet is essentially a convolutional layer that is trained to output bounding box coordinates and class probabilities.
It works as a sliding window that moves through each spatial position on the feature map. At each position it computes K × (4 + 1 + C) values that encode the predictions for K reference bounding boxes with pre-selected shapes.
Using the notation of [22], we call these reference bounding boxes anchors.
Each position on the feature map corresponds to a grid center in the original image, so each anchor can be described by 4 scalars as $$(\hat{x}_i, \hat{y}_j, \hat{w}_k, \hat{h}_k), \quad i \in [1,W],\; j \in [1,H],\; k \in [1,K].$$
Here $(\hat{x}_i, \hat{y}_j)$ are the coordinates of grid center (i, j), and $(\hat{w}_k, \hat{h}_k)$ are the width and height of the k-th reference bounding box. We used the method described by [2] to select reference bounding box shapes that match the data distribution.
For each anchor (i, j, k), we compute 4 relative coordinates $(\delta x_{ijk}, \delta y_{ijk}, \delta w_{ijk}, \delta h_{ijk})$ to transform the anchor into a predicted bounding box, as shown in Fig. 2.
[Figure 2. Bounding box transformation.]
- Each grid center has K anchors with pre-selected shapes.
- Each anchor is transformed to its new position and shape using the relative coordinates computed by the ConvDet layer.
- Each anchor is associated with a confidence score and class probabilities to predict the category of the object within the bounding box
Following [12], the transformation is described by
$$x^P_i = \hat{x}_i + \hat{w}_k\,\delta x_{ijk}, \quad y^P_j = \hat{y}_j + \hat{h}_k\,\delta y_{ijk}, \quad w^P_k = \hat{w}_k \exp(\delta w_{ijk}), \quad h^P_k = \hat{h}_k \exp(\delta h_{ijk}), \quad \text{(Equation 1)}$$
where $(x^P_i, y^P_j, w^P_k, h^P_k)$ are the predicted bounding box coordinates.
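A small sketch of this anchor-to-box transformation (my own illustration of Equation 1; the anchor values in the example are made up):

```python
import numpy as np

def transform_anchor(anchor, deltas):
    """Equation 1: turn an anchor (cx, cy, w, h) plus ConvDet deltas
    (dx, dy, dw, dh) into a predicted box (cx, cy, w, h)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    return np.array([cx + w * dx,        # shift center by a fraction of w
                     cy + h * dy,        # shift center by a fraction of h
                     w * np.exp(dw),     # scale width (exp keeps it positive)
                     h * np.exp(dh)])    # scale height

# Example: an anchor at grid center (100, 80) with a 36x36 reference shape.
print(transform_anchor((100.0, 80.0, 36.0, 36.0), (0.1, -0.2, 0.3, 0.0)))
```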
The other C + 1 outputs for each anchor encode the confidence score and the C conditional class probabilities.
ConvDet is similar to the last layer of the RPN in Faster R-CNN [22].
The major difference is that the RPN is regarded as a "weak" detector that only decides whether an object exists in a bounding box, whereas ConvDet also classifies the object within it.
For simplicity, we denote the detection layers of YOLO [21] as FcDet (only counting the last two fully connected layers).
The comparison between ConvDet and FcDet is illustrated in Fig. 3.
Last layer of Region Proposal Network (RPN) | The ConvDet layer | The detection layer of YOLO [21] (FcDet) |
---|---|---|
A 1×1 convolution with K × (4 + 1) outputs: 4 is the number of relative coordinates, and 1 is the confidence score. It is only responsible for generating region proposals. | An $F_w$ × $F_h$ convolution with output size K × (5 + C). It is responsible for both computing bounding boxes and classifying the object within. | Two fully connected layers. |
The parameter size for this layer is $Ch_f \times K \times 5$. | The parameter size for this layer is $F_w F_h Ch_f K(5 + C)$. | The first layer is of size $W_f H_f Ch_f F_{fc1}$; the second is of size $F_{fc1} W_o H_o (5K + C)$. |
[Figure 3. Comparing RPN, ConvDet and the detection layer of YOLO [21].]
- Activations are represented as blue cubes and layers (and their parameters) are represented as orange ones.
- Activation and parameter dimensions are also annotated.
Table 1. Comparison between RPN, ConvDet and FcDet.
- RP stands for region proposition. cls stands for classification.
Assume that the input feature map is of size $(W_f, H_f, Ch_f)$, and denote ConvDet's filter width as $F_w$ and its filter height as $F_h$.
With a proper padding/striding strategy, the output of ConvDet keeps the **same spatial dimensions** as the feature map.
To compute K × (4 + 1 + C) outputs for each reference grid, the number of parameters required by the ConvDet layer is $F_w F_h Ch_f K(5 + C)$.
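As a rough sketch (my own illustration, not the released SqueezeDet code; all sizes below are made-up examples), ConvDet is just a convolution whose number of output channels is K × (4 + 1 + C), so its parameter count follows directly from the weight tensor shape:

```python
import numpy as np

# Illustrative sizes, not the paper's exact configuration.
W_f, H_f, Ch_f = 76, 22, 768       # feature map width, height, channels
F_w, F_h = 3, 3                    # ConvDet filter size
K, C = 9, 3                        # anchors per grid center, object classes

out_channels = K * (4 + 1 + C)     # every position predicts K boxes
convdet_weights = np.zeros((F_h, F_w, Ch_f, out_channels))
print(convdet_weights.size)        # F_w*F_h*Ch_f*K*(5+C) = 497,664 here

# With 'same' padding and stride 1, the output keeps the feature map's spatial
# size, so the layer emits a (W_f, H_f, K*(5+C)) grid of predictions.
```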
The FcDet layer described in [21] is comprised of two fully connected layers.
Using the same notation for the input feature map and assuming the number of outputs of the fc1 layer is $F_{fc1}$, the number of parameters in the fc1 layer is $W_f H_f Ch_f F_{fc1}$.
The second fully connected layer in [21] generates C class probabilities as well as K×(4+1) bounding box coordinates and confidence scores for each of the $W_o × H_o$ grids.
Thus, the number of parameters in the fc2 layer is $F_{fc1}W_oH_o(5K + C)$.
The total number of parameters in these two fully connected layers is $F_{fc1}(W_fH_fCh_f + W_oH_o(5K + C))$.
In [21], the input feature map is of size 7×7×1024.
$F_{fc1}$ = 4096, K = 2, C = 20, $W_o$ = $H_o$ = 7, thus the total number of parameters required by the two fully connected layers is approximately $212 \times 10^6$.
If we keep the feature map sizes, number of output grid centers, classes, and anchors the same, and use a 3×3 ConvDet, it would only require $3 \times 3 \times 1024 \times 2 \times 25 \approx 0.46 \times 10^6$ parameters, which is 460× smaller than FcDet.
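A quick arithmetic check of these two numbers (plain arithmetic on the sizes quoted above, not the authors' code):

```python
# FcDet: F_fc1 * (W_f*H_f*Ch_f + W_o*H_o*(5K + C))
fc_params = 4096 * (7 * 7 * 1024 + 7 * 7 * (5 * 2 + 20))
# ConvDet with a 3x3 filter on the same feature map: F_w*F_h*Ch_f*K*(5 + C)
conv_params = 3 * 3 * 1024 * 2 * (5 + 20)
print(fc_params)                       # 211,542,016  (~212 x 10^6)
print(conv_params)                     # 460,800      (~0.46 x 10^6)
print(round(fc_params / conv_params))  # ~459, i.e. roughly 460x fewer parameters
```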
The comparison of RPN, ConvDet and FcDet is illustrated in Fig. 3 and summarized in Table 1.
Unlike Faster R-CNN [22], which deploys a 4-step alternating training strategy to train the RPN and the detector network, our SqueezeDet detection network can be trained end-to-end, similarly to YOLO [21].
To train the ConvDet layer to learn detection, localization and classification, we define a multi-task loss function (Equation 2):
The first part of the loss function is the bounding box regression. The relative coordinates $(\delta x_{ijk}, \delta y_{ijk}, \delta w_{ijk}, \delta h_{ijk})$ are outputs of the ConvDet layer.
The ground truth bounding box deltas $(\delta x^G_{ijk}, \delta y^G_{ijk}, \delta w^G_{ijk}, \delta h^G_{ijk})$ are computed as
$$\delta x^G_{ijk} = (x^G - \hat{x}_i)/\hat{w}_k, \quad \delta y^G_{ijk} = (y^G - \hat{y}_j)/\hat{h}_k, \quad \delta w^G_{ijk} = \log(w^G/\hat{w}_k), \quad \delta h^G_{ijk} = \log(h^G/\hat{h}_k). \quad \text{(Equation 3)}$$
Note that Equation 3 is essentially the inverse transformation of Equation 1.
$(x^G, y^G, w^G, h^G)$ are the coordinates of a ground truth bounding box.
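A small sketch of this inverse transformation (my own illustration; it simply inverts the anchor transformation in the sketch above):

```python
import numpy as np

def encode_ground_truth(anchor, gt_box):
    """Equation 3 (inverse of Equation 1): given an anchor (cx, cy, w, h) and
    a ground-truth box (cx, cy, w, h), compute the target deltas."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt_box
    return np.array([(gx - ax) / aw,     # center offset, in anchor widths
                     (gy - ay) / ah,     # center offset, in anchor heights
                     np.log(gw / aw),    # log width ratio
                     np.log(gh / ah)])   # log height ratio

# Encoding a ground-truth box and decoding it with Equation 1 recovers the box.
print(encode_ground_truth((100.0, 80.0, 36.0, 36.0), (104.0, 72.0, 48.0, 36.0)))
```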
During training, we compare ground truth bounding boxes with all anchors and assign each of them to the anchor that has the largest overlap (IOU) with it. We set $I_{ijk}$ to 1 if the k-th anchor at position (i, j) has the largest overlap with a ground truth box, and to 0 if no ground truth is assigned to it.
The second part of the loss function is the confidence score regression.
$\gamma_{ijk}$ is the output from the ConvDet layer,
$\gamma^G_{ijk}$ is obtained by computing the IOU of the predicted bounding box with the ground truth bounding box.
Same as above, we only include the loss generated by the anchor box with the largest overlap with the ground truth.
For anchors that are not "responsible" for the detection, we penalize their confidence scores with the $\bar{I}_{ijk}\,\gamma^2_{ijk}$ term, where $\bar{I}_{ijk} = 1 - I_{ijk}$.
Usually, there are many more anchors that are not assigned to any object.
In order to balance their influence, we use $\lambda^{+}_{conf}$ and $\lambda^{-}_{conf}$ to adjust the weight of these two loss components.
By definition, the confidence score’s range is [0, 1].
To guarantee that $\gamma_{ijk}$ falls into that range, we feed the corresponding ConvDet output into a sigmoid function to normalize it.
The last part of the loss function is just cross-entropy loss for classification.
We use softmax to normalize the corresponding ConvDet output to make sure that $p_c$ falls within [0, 1].
The hyper-parameters in Equation 2 are selected empirically.
In our experiments, we set $\lambda_{bbox} = 5$, $\lambda^{+}_{conf} = 75$, $\lambda^{-}_{conf} = 100$.
This loss function can be optimized directly using back-propagation.
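To tie the three terms together, here is a hedged numpy sketch of the multi-task loss as I read it from the description above (not the released training code; the masking and normalization details of the original implementation may differ):

```python
import numpy as np

def squeezedet_style_loss(deltas, deltas_gt, gamma, gamma_gt, class_probs,
                          labels_gt, I, lam_bbox=5.0, lam_conf_pos=75.0,
                          lam_conf_neg=100.0):
    """deltas, deltas_gt: (A, 4) predicted / ground-truth box deltas
    gamma, gamma_gt:      (A,)   sigmoid'ed confidence / IOU with ground truth
    class_probs:          (A, C) softmax'ed class probabilities
    labels_gt:            (A,)   ground-truth class index per anchor (int)
    I:                    (A,)   1 for the anchor responsible for a box, else 0."""
    A = len(I)
    n_obj = max(I.sum(), 1.0)

    # 1) Bounding box regression, only for responsible anchors.
    bbox_loss = lam_bbox / n_obj * np.sum(I[:, None] * (deltas - deltas_gt) ** 2)

    # 2) Confidence regression: pull responsible anchors toward their IOU,
    #    push every other anchor's confidence toward zero.
    conf_loss = (lam_conf_pos / n_obj * np.sum(I * (gamma - gamma_gt) ** 2)
                 + lam_conf_neg / max(A - n_obj, 1.0) * np.sum((1 - I) * gamma ** 2))

    # 3) Cross-entropy classification loss for responsible anchors.
    picked = class_probs[np.arange(A), labels_gt]
    cls_loss = np.sum(I * -np.log(picked + 1e-12)) / n_obj

    return bbox_loss + conf_loss + cls_loss
```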