논문명/저자/소속	You Only Look Once:Unified, Real-Time Object Detection
저자(소속)	Joseph Redmon(워싱톤대)
학회/년도	CVPR 2016, 논문
키워드	Joseph2016a
데이터셋/모델	Pascal VOC 2007, 2012 (Detect 20 Class)/GoogLeNet
참고	홈페이지v1, PR100, 정리#1, 정리#2, 정리#3
코드	Darknet caffe, TF(Train+Test), TF(Test)

YOLO

1. 개요

1.1 R-CNN과의 2비교

기존 R-CNN계열의 문제점 : 복잡한 처리과정으로 인해 느린 속도, 최적화의 어려움 문제점 있음

제안 : object detection을 regression 문제로 접근하며, 별도의 region proposal을 위한 구조 없이 한 번에 분류/위치예측 가능

이미지 내의 bounding box와 class probability를 single regression problem으로 간주

single convolutional network를 통해 multiple bounding box에 대한 class probablility를 계산하는 방식

저자들이 자신들이 직접 만든 Darknet이라는 framework를 사용

1.2 장점

간단한 처리과정으로 속도가 매우 빠르다. 또한 기존의 다른 real-time detection system들과 비교할 때, 2배 정도 높은 mAP를 보인다.
- YOLO는 초당 45 프레임, 단순화한 Fast YOLO는 초당 155 프레임
Image 전체를 한 번에 바라보는 방식으로 class에 대한 맥락적 이해도가 높다. 이로인해 낮은 backgound error(False-Positive)를 보인다.
Object에 대한 좀 더 일반화된 특징을 학습한다. 가령 natural image로 학습하고 이를 artwork에 테스트 했을때, 다른 Detection System들에 비해 훨씬 높은 성능을 보여준다.
global reasoning가능 (??): 전체를 한번에 보기 때문에.
General Representation
- Robust on various back ground
- Other domain에 적용 가능

1.2 단점

상대적으로 낮은 정확도 (특히, 작은 object에 대해)

YOLO2가 나오면서 78.6를 만족 하고 있다. 속도 40FPS(544x544이미지)

2. 특징

2.1 Unified Detection

Input image를 S X S grid로 나눈다.
각각의 grid cell은 B개의 bounding box와 각 bounding box에 대한 confidence score를 갖는다.

$Confidence\ Score : Pr(Object) \times IOU_{truthpred}$

만약 cell에 object가 존재하지 않는다면 confidence score는 0이 된다.

각각의 grid cell은 C개의 conditional class probability를 갖는다.

$Conditional\ Class\ Probability : Pr(Class_i \mid Object)$

각각의 bounding box는 x, y, w, h, confidence로 구성된다.
- (x,y): Bounding box의 중심점을 의미하며, grid cell의 범위에 대한 상대값이 입력된다.
  - - 예1: 만약 x가 grid cell의 가장 왼쪽에 있다면 x=0, y가 grid cell의 중간에 있다면 y=0.5
- (w,h): 전체 이미지의 width, height에 대한 상대값이 입력된다.
  - 예2: bbox의 width가 이미지 width의 절반이라면 w=0.5
Test time에는 class-specific confidence score를 얻는다.

class-specific confidence score = (conditional class probability) x (bounding box의 confidence score)

$Class\ Specific\ Confidence\ Score= Conditional\ Class\ Probability \times Confidence\ Score =$

$Pr(Class_i \mid Object) \times Pr(Object) \times IOU_{truthpred} = Pr(Class_i) \times IOU_{truthpred}$

논문에서는 S는 7, B는 2이다. 즉, 7x7그리드에 각 그리드 마다 2개의 박스를 그린다.

2.2 Non-maximal suppression

불필요한 중복 BBox들을 제거하기 위해

2.3 단점

각각의 grid cell2개까지의 BBox와 1개의 class만을 예측할 수 있으므로, 작은 object 여러개가 다닥다닥 붙으면 제대로 예측하지 못한다.
- eg. 새 떼와 같이 작은 object들의 그룹이나 특이한 종횡비의 BBox
bounding box의 형태가 training data를 통해서만 학습되므로, 새로운/독특한 형태의 bouding box의 경우 정확히 예측하지 못한다.
몇 단계의 layer를 거쳐서 나온 feature map을 대상으로 bouding box를 예측하므로 localization이 다소 부정확해지는 경우가 있다.

Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively
coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

3. 구조

GoogLeNet for image classification 모델을 기반 : 24 Convolutional layers & 2 Fully Connected layers

1×1 reduction layer를 여러 번 적용한 것이 특징, [상세구조설명]

Fast YOLO는 위 디자인의 24의 convolutional layer를 9개의 convolutional layer로 대체

입력: 448×448의 해상도를 가집니다.
출력 : S x S x (B x 5 + C)개의 값
- S 는 grid size (default: 7)입니다. 각 이미지는 S×S개의 cell로 나눠집니다.
- B 는 각 cell 마다 테스트되는 Bounding Box (BB)의 개수이며 기본 값은 2입니다
  - 모델은 각 BB마다 위치 (x, y)와 크기 (height, width), 그리고 confidence value를 예측합니다.
  - confidence value는 grid cell에 물체가 없으면 0, 있으면 예측된 BB와 실제 BB 사이의 IoU 값이 됩니다.
- C 는 dataset의 class 개수로, Pascal VOC에서는 20이 됩니다.

Step 1. 7X7은 49개의 Grid Cell을 의미한다. 그리고 각각의 Grid cell은 B개(B=2개)의 bounding Box를 가지고 있음

첫 5개 값 : 1st BBox의 값이 채워져있다.
이후 5개 값 : 2nd BBox의 값이 채워져있다.
마지막 20개 값 : 20개 Class에 대한 Conditional class probability

Step 2. 1st bbox의 confidence score(c)값과 각 conditional class probability를 곱하여 1st bbox의 class specific confidence score도출

Step 3. bb98(7x7x2)까지 반복 동작 하며 생성

bb1, bb2가 쌍이 되어 첫 grid cell을 나타내고 있음

Step 4. Class 및 bounding box Location를 결정

이 98개의 class specific confidence score에 대해 각 20개의 클래스를 기준으로 non-maximum suppression을 하여, Object에 대한 Class 및 bounding box Location를 결정한다. [상세동작설명]

NMS Algorithm : non - maximum suppression은 중심 픽셀을 기준으로 8방향의 픽셀 값들을 비교하여 중심픽셀이 가장 클 경우 그대로 두고, 아닐 경우 제거해주는 과정 [출처]

4. 학습/테스트

Linear activation function으로는 leaky ReLU (α=0.1)를 사용

4.1 Loss Function

기본적으로 sum-of-squared-error 개념
- object가 존재하는 grid cell과 object가 존재하지 않는 grid cell 각각에 대해 아래의 loss를 계산
  - coordinates (x,y,w,h)
  - confidence score
  - conditional class probability

전제조건

Grid cell의 여러 bounding box들 중, ground-truth box와의 IOU가 가장높은 bounding box를 predictor로 설정한다.
1의 기준에 따라 아래 기호들이 사용된다.
- $\upharpoonleft^{obj}_{ij}$ : Object가 존재하는 grid cell i의 predictor bounding box j
- $\upharpoonleft^{noobj}_{ij}$ : Object가 존재하지 않는 grid cell i의 bounding box j
- $\upharpoonleft^{obj}_{i}$ : Object가 존재하는 grid cell i

Ground-truth box의 중심점이 어떤 grid cell 내부에 위치하면, 그 grid cell에는 Object가 존재한다고 여긴다.

$\lambda_{coord}$ : coordinates(x,y,w,h)에 대한 loss와 다른 loss들과의 균형을 위한 balancing parameter.
: obj가 있는 box와 없는 box간에 균형을 위한 balancing parameter.
- (일반적으로 image내에는 obj가 있는 cell보다는 obj가 없는 cell이 훨씬 많으므로)

(1) Object가 존재하는 grid cell i의 predictor bounding box j에 대해, x와 y의 loss를 계산.
(2) Object가 존재하는 grid cell i의 predictor bounding box j에 대해, w와 h의 loss를 계산.
- 큰 box에 대해서는 small deviation을 반영하기 위해 제곱근을 취한 후, sum-squared error를 한다.
- (같은 error라도 larger box의 경우 상대적으로 IOU에 영향을 적게 준다.)
(3) Object가 존재하는 grid cell i의 predictor bounding box j에 대해, confidence score의 loss를 계산. (Ci = 1)
(4) Object가 존재하지 않는 grid cell i의 bounding box j에 대해, confidence score의 loss를 계산. (Ci = 0)
(5) Object가 존재하는 grid cell i에 대해, conditional class probability의 loss 계산. (Correct class c: pi(c)=1, otherwise: pi(c)=0)

4.2 학습(Training)

먼저 ImageNet으로 pretraining을 하고 다음에 Pascal VOC에서 finetuning하는 2단계 방식으로 진행
- 학습에는 앞의 20개의 convolutional layer만을 사용했고 (feature extractor),
- 마지막 4개의 convolutional layer와 2개의 fully connected layer를 추가해서 (object classifier) 실제 object detection에 사용할 system을 구성합니다.
Pascal VOC를 사용한 training에는 135번의 epoch, batch size 64, momentum 0.9, decay 0.0005, dropout rate 0.5를 각각 적용했습니다. Learning rate를 epoch에 따라 바꿔가며 사용했습니다.

출처: curt-park

ImageNet 1000-class dataset으로 20개의 convolutioanl layer를 pre-training
Pre-training 이후 4 convolutioanl layers와 2 fully connected layers를 추가
Bounding Box의 width와 height는 이미지의 width와 height로 nomalize (Range: 0~1)
Bounding Box의 x와 y는 특정 grid cell 위치의 offset값을 사용한다 (Range: 0~1)
λcoord: 5, λnoobj: 0.5
Batch size: 64
Momentum: 0.9 and a decay of 0.0005
Learning Rate: 0.001에서 0.01로 epoch마다 천천히 상승시킴.
- 이후 75 epoch동안 0.01, 30 epoch동안 0.001, 마지막 30 epoch동안 0.0001
Dropout Rate: 0.5
Data augmentation: random scailing and translations of up to 20% of the original image size
Activation function: leaky rectified linear activation, leaky ReLU (α=0.1)를 사용

5. 성능 평가

5.1 속도

Fast YOLO(155FPS) > YOLO(45FPS) > Faster R-CNN VGG-16(7FPS)

5.2 정확도

Faster R-CNN VGG-16(73.2) > YOLO(63.4) > Fast YOLO(52.7)

Fast R-CNN의 error는 background 영역에서 대부분 발생하는 한편, YOLO의 error는 localization에서 발생

2016_YOLO