https://arxiv.org/pdf/1703.07402.pdf
https://github.com/nwojke/deep_sort
https://github.com/abhyantrika/nanonets_object_tracking/
blog, by Shishira R Maiya, 2019.07
Meanshift, or mode seeking, is a popular algorithm mainly used in clustering and other unsupervised problems.
It is similar to K-Means, but replaces the simple centroid computation of the cluster centers with a weighted average that gives more importance to points closer to the mean.
The goal of the algorithm is to find all the modes in the given data distribution. Unlike K-Means, it does not require choosing an optimal "K" in advance. More info on this can be found here.
Suppose we have a detection for an object in the frame and we extract certain features from the detection (colour, texture, histogram etc).
By applying the meanshift algorithm, we have a general idea of where the mode of the distribution of features lies in the current state.
Now, in the next frame, where this distribution has changed due to the movement of the object, the meanshift algorithm looks for the new largest mode and hence tracks the object.
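The mode-seeking step itself can be sketched in a few lines of NumPy. This is a minimal illustration, not the tracking pipeline: it assumes a Gaussian kernel, a hand-picked bandwidth, and synthetic 2-D points standing in for the feature distribution.

```python
import numpy as np

def mean_shift_mode(points, start, bandwidth=1.0, n_iters=50, tol=1e-5):
    """Iteratively shift `start` to the weighted mean of nearby points
    (Gaussian kernel), converging on the nearest mode of the distribution."""
    x = np.asarray(start, dtype=float)
    for _ in range(n_iters):
        d2 = np.sum((points - x) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))   # closer points weigh more
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Toy example: a dense cluster around (5, 5) and a sparse one around (0, 0).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(5, 0.3, (200, 2)), rng.normal(0, 0.3, (20, 2))])
mode = mean_shift_mode(points, start=[4.0, 4.0], bandwidth=1.0)
```

Starting the search from the previous frame's estimate is exactly what lets the tracker follow the mode as it drifts between frames.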
This method differs from the above two methods, as we do not necessarily use features extracted from the detected object. Instead, the object is tracked using the spatio-temporal image brightness variations at a pixel level.
Here we focus on obtaining a displacement vector for the object to be tracked across the frames.
Tracking with optical flow rests on three important assumptions:
- Brightness constancy: the brightness of a point stays the same across consecutive frames.
- Temporal persistence: points move only a small amount between frames.
- Spatial coherence: neighbouring points belong to the same surface and move in a similar way.
Once these criteria are satisfied, we use something called the Lucas-Kanade method to obtain an equation for the velocity of certain points to be tracked (usually these are easily detected features).
Using the equation and some prediction techniques, a given object can be tracked throughout the video.
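The Lucas-Kanade step reduces to a tiny least-squares problem: stack the brightness-constancy equation Ix·u + Iy·v + It = 0 over a window and solve for (u, v). Below is a single-window NumPy sketch on synthetic frames (a Gaussian blob shifted half a pixel); real implementations such as OpenCV's `calcOpticalFlowPyrLK` add pyramids and per-feature windows.

```python
import numpy as np

def lucas_kanade(I1, I2):
    """Estimate one (u, v) displacement over the whole window by solving
    the least-squares system from  Ix*u + Iy*v + It = 0."""
    Iy, Ix = np.gradient(I1)          # spatial gradients
    It = I2 - I1                      # temporal gradient
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)      # (u, v)

# Synthetic frames: a Gaussian blob that moves 0.5 px to the right.
ys, xs = np.mgrid[0:64, 0:64]
def blob(cx, cy):
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * 5.0 ** 2))

u, v = lucas_kanade(blob(30.0, 32.0), blob(30.5, 32.0))   # u ≈ 0.5, v ≈ 0
```

The small-motion assumption is visible here: the linearisation only holds because the shift is a fraction of the blob's scale.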
For more info on Optical flow, refer here.
In almost any engineering problem that involves prediction in a temporal or time-series sense, be it computer vision, guidance, navigation or even economics, the Kalman filter is the go-to algorithm.
The core idea of a Kalman filter is to use the available detections and previous predictions to arrive at a best guess of the current state, while keeping the possibility of errors in the process.
- Tutorial: Kalman Filter with MATLAB example, part 1 (YouTube)
- Tutorial: The Kalman Filter (PDF, 8 pages)
The Kalman filter works best for linear systems with Gaussian noise.
In our case, the tracks hardly leave the linear realm, and most of the processes, and even the noise, fall into the Gaussian realm. So the problem is well suited to Kalman filters.
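A full predict/update cycle fits in a few lines. This is a minimal 1-D constant-velocity sketch, with noise covariances `Q` and `R` chosen by hand and noisy synthetic detections standing in for a detector: the filter smooths the measurements and recovers the underlying velocity.

```python
import numpy as np

# Constant-velocity Kalman filter for a 1-D position measurement.
# State x = [position, velocity]; only position is observed.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (dt = 1)
H = np.array([[1.0, 0.0]])               # measurement model
Q = 1e-4 * np.eye(2)                     # process noise covariance
R = np.array([[0.5]])                    # measurement noise covariance

x = np.zeros(2)          # initial state estimate
P = np.eye(2)            # initial state covariance

rng = np.random.default_rng(1)
truth = 0.8 * np.arange(50)               # object moving at 0.8 px/frame
zs = truth + rng.normal(0, 0.5, size=50)  # noisy detections

for z in zs:
    # Predict step: propagate the state and grow the uncertainty.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update step: blend the prediction with the new detection.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x + (K @ (np.array([z]) - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
```

After a few dozen frames the velocity estimate settles near the true 0.8 px/frame, even though velocity is never observed directly.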
One of the early methods that used deep learning for single object tracking.
A model is trained on a dataset consisting of videos with labelled target frames.
The objective of the model is simply to track a given object from the given image crop.
To achieve this, they use a two-frame CNN architecture which uses both the current and the previous frame to accurately regress onto the object.
As shown in the figure, we take the crop from the previous frame based on the predictions and define a "search region" in the current frame based on that crop.
The network is then trained to regress for the object within this search region.
The network architecture is simple: CNNs followed by fully connected layers that directly give us the bounding box coordinates.
An elegant method to track objects using deep learning.
Slight modifications to the YOLO detector, with a recurrent LSTM unit attached at the end, help track objects by capturing spatio-temporal features.
As shown above, the architecture is quite simple.
This simple trick of using CNNs for feature extraction and LSTMs for bounding box prediction brought significant improvements on tracking challenges.
The most popular and one of the most widely used, elegant object tracking frameworks is Deep SORT, an extension of SORT (Simple Online and Realtime Tracking).
We shall go through the concepts introduced in brief and delve into the implementation.
Let us take a close look at the moving parts in this paper.
Our friend from above, the Kalman filter, is a crucial component of Deep SORT.
Our state contains 8 variables, (u, v, a, h, u', v', a', h'), where (u, v) is the centre of the bounding box, a is the aspect ratio, h is the height of the box, and the primed variables are the respective velocities.
As we discussed previously, the variables have only absolute position and velocity factors, since we are assuming a simple linear velocity model.
The Kalman filter helps us factor in the noise in detection and uses prior state in predicting a good fit for bounding boxes.
For each detection, we create a “Track”, that has all the necessary state information.
It also has a parameter to flag and delete tracks whose last successful detection was long ago, since those objects are likely to have left the scene.
Also, to eliminate duplicate tracks, a track must accumulate a minimum number of detections in its first few frames before it is confirmed.
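This lifecycle logic can be sketched as a small class. The names and thresholds below (`n_init`, `max_age`, `hits`) mirror the spirit of the reference implementation but are a simplified stand-in, not the actual `deep_sort` Track class.

```python
class Track:
    """Minimal track lifecycle sketch: a track must be matched in its
    first `n_init` frames to be confirmed, and is deleted after going
    `max_age` consecutive frames without a matched detection."""

    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.n_init = n_init
        self.max_age = max_age
        self.hits = 1                 # matched detections so far
        self.time_since_update = 0    # frames since last match
        self.confirmed = False

    def mark_matched(self):
        self.hits += 1
        self.time_since_update = 0
        if self.hits >= self.n_init:
            self.confirmed = True

    def mark_missed(self):
        self.time_since_update += 1

    def is_deleted(self):
        # Tentative tracks die on any miss; confirmed ones only after max_age.
        if not self.confirmed and self.time_since_update > 0:
            return True
        return self.time_since_update > self.max_age
```

The tentative/confirmed split is what suppresses duplicate tracks spawned by spurious detections.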
Now that we have the new bounding boxes tracked from the Kalman filter, the next problem lies in associating new detections with the new predictions.
Since they are processed independently, we have no idea how to associate track_i with incoming detection_k.
[Key components] To solve this, we need 2 things:
The authors decided to use the squared Mahalanobis distance (effective metric when dealing with distributions) to incorporate the uncertainties from the Kalman filter.
Mahalanobis distance: a value expressing how many standard deviations away a point lies from the mean. [Reference]
Thresholding this distance can give us a very good idea on the actual associations.
This metric is more appropriate than, say, Euclidean distance, as we are effectively measuring the distance between 2 distributions (remember that everything is a distribution under Kalman!).
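Concretely, the squared Mahalanobis distance scales each deviation by the covariance. A minimal NumPy sketch, with a hand-picked diagonal covariance for illustration:

```python
import numpy as np

def squared_mahalanobis(x, mean, cov):
    """Squared Mahalanobis distance: how far x lies from `mean`,
    measured in (squared) standard deviations under covariance `cov`."""
    d = np.asarray(x, float) - np.asarray(mean, float)
    return float(d @ np.linalg.inv(cov) @ d)

# With independent variances 4 and 1, the point (2, 1) is exactly one
# standard deviation away along each axis: distance^2 = 1 + 1 = 2.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
d2 = squared_mahalanobis([2.0, 1.0], [0.0, 0.0], cov)   # → 2.0
```

In Deep SORT the covariance comes straight from the Kalman filter's state uncertainty, so confident tracks tolerate less deviation than uncertain ones.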
In this case, we use the standard Hungarian algorithm, which is very effective for such a simple data association problem. I won't delve into its details. More on this can be found here.
Lucky for us, it is a single-line import (`linear_sum_assignment` in SciPy, or the older `linear_assignment` in scikit-learn)!
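The assignment step can be sketched with SciPy's implementation. The cost matrix here is a made-up example standing in for the thresholded distances between tracks and detections:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = existing tracks, columns = incoming detections; each entry is
# an association cost (e.g. a gated Mahalanobis distance).
cost = np.array([[4.0, 1.0, 3.0],
                 [2.0, 0.0, 5.0],
                 [3.0, 2.0, 2.0]])

track_idx, det_idx = linear_sum_assignment(cost)   # minimises total cost
total = cost[track_idx, det_idx].sum()             # → 5.0
```

Here track 0 is matched to detection 1, track 1 to detection 0, and track 2 to detection 2, the unique assignment of minimum total cost.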
Well, we have an object detector giving us detections, Kalman filter tracking it and giving us missing tracks, the Hungarian algorithm solving the association problem. So, is deep learning really needed here ?
The answer is yes. Despite the effectiveness of the Kalman filter, it fails in many of the real-world scenarios we mentioned above, like occlusions and different viewpoints.
So, to improve this, the authors of Deep SORT introduced another distance metric based on the appearance of the object.
The idea to obtain a vector that can describe all the features of a given image is quite simple.
Assuming a classical classification architecture, we strip off the final classification layer and are left with a dense layer that produces a single feature vector.
That feature vector becomes our “appearance descriptor” of the object.
The "Dense 10" layer shown in the picture above will be our appearance feature vector for the given crop.
Once trained, we just need to pass every crop of the detected bounding boxes from the image through this network to obtain the "128 × 1" dimensional feature vector.
Now, the updated distance metric will be:

$$ D = \lambda \cdot D_k + (1 - \lambda) \cdot D_a $$
The importance of D_a is so high that the authors claim they were able to achieve state-of-the-art results even with λ = 0, i.e. using only D_a!
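Putting the two metrics together is a one-liner. The sketch below uses cosine distance between L2-normalised descriptors for D_a, with random 128-d vectors standing in for real network outputs, and an arbitrary D_k value for illustration:

```python
import numpy as np

def cosine_distance(a, b):
    """Appearance distance between two L2-normalised descriptors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def combined_distance(d_k, d_a, lam=0.5):
    """Weighted combination of the motion (Mahalanobis) distance d_k
    and the appearance (cosine) distance d_a."""
    return lam * d_k + (1.0 - lam) * d_a

rng = np.random.default_rng(2)
desc_track = rng.normal(size=128)                    # stand-in descriptors
desc_det = desc_track + 0.05 * rng.normal(size=128)  # slightly perturbed copy
d_a = cosine_distance(desc_track, desc_det)          # small: same appearance
D = combined_distance(d_k=1.2, d_a=d_a, lam=0.3)
```

Because the perturbed detection looks almost identical to the track, d_a stays near zero and dominates the combined score when λ is small, which is exactly the regime the authors found to work best.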
A simple distance metric, combined with a powerful deep learning technique, is all it took for Deep SORT to become one of the most elegant and widespread object trackers.
https://nanonets.com/blog/object-tracking-deepsort/
I hope this blog has helped you gain a comprehensive understanding of the core ideas and components in object tracking, and has acquainted you with the tools required to build your own custom object tracker.
Get coding and make sure there are no strays in your front yard anymore ever!