For the past 2 weeks, I have been working on further improving the structure and API of PMR, as well as adding and benchmarking different Face Detection models. In this post, I will discuss the technical details of what I have done during this time.
In my last post, I introduced Kernel – a class that encapsulates all the ML-related code and runs it in an isolated environment as a separate process. With FaceDetectorKernel, ML logic can be encapsulated even better. Kernel now acts purely as a base class for different types of Kernels, each suited to a particular task, and FaceDetectorKernel is the first of this type. As its name suggests, it is used for Face Detection.
Previously, the user had to subclass Kernel and implement the predict function. It turns out that this approach leads to a lot of boilerplate code if we want to add several models of the same category. Instead, if your model does the same thing as models already available in PMR, you simply choose the corresponding subclass (such as FaceDetectorKernel) and implement only 2 abstract functions: load_model, as the name suggests, initializes the model by deploying it into memory; inference is called by FaceDetectorKernel.predict in a loop that iterates over all the frames provided by the VideoFrames frames reader.

With FaceDetectorKernel, adding a new model for Face Detection is very easy. I have already added 2 more Face Detection models, and the whole process was really straightforward: since the model loading and inference procedures were already given, all I had to do was put the model's code into a proper directory and import it in a newly created subclass of FaceDetectorKernel that implements the required functions. This approach will be used for other tasks such as Face Recognition, Speech Recognition, and many other types of algorithms that will eventually come to PMR!
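To make this concrete, here is a minimal sketch of how the new scheme fits together. The base-class body shown here is a simplified stand-in for PMR's actual FaceDetectorKernel, and MyDetectorKernel is a hypothetical subclass – only the two abstract functions and the predict loop reflect the design described above:

```python
from abc import ABC, abstractmethod

class FaceDetectorKernel(ABC):
    """Simplified stand-in for PMR's FaceDetectorKernel base class."""

    @abstractmethod
    def load_model(self):
        """Initialize the model by deploying it into memory."""

    @abstractmethod
    def inference(self, frame):
        """Detect faces in a single frame; return a list of boxes."""

    def predict(self, frames):
        # The real predict() iterates over frames from a VideoFrames reader.
        self.load_model()
        return [self.inference(frame) for frame in frames]

class MyDetectorKernel(FaceDetectorKernel):
    """Hypothetical subclass: only load_model and inference are needed."""

    def load_model(self):
        # A dummy "model" standing in for a real face detector.
        self.model = lambda frame: [(0, 0, 10, 10)]

    def inference(self, frame):
        return self.model(frame)

kernel = MyDetectorKernel()
boxes = kernel.predict(["frame1", "frame2"])  # one list of boxes per frame
```

The base class owns the frame loop, so a new model never has to repeat it – exactly the boilerplate reduction described above.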
Other Changes to API
Preprocessors in VideoFrames
VideoFrames now supports preprocessors, which, as the name suggests, are used for preprocessing images and video frames. To create your own preprocessor, subclass Preprocessor and implement your preprocessing logic in process. Every time a new frame is read via VideoFrames, it is first passed through all the preprocessors that you supplied to VideoHandlerElem (in the same order) before the frame is returned.
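A minimal sketch of the idea – the Preprocessor base class here is a simplified stand-in, and Doubler/Clipper are hypothetical preprocessors; the point is that the frame flows through them in the order they were passed:

```python
class Preprocessor:
    """Simplified stand-in for PMR's Preprocessor base class."""

    def process(self, frame):
        raise NotImplementedError

class Doubler(Preprocessor):
    """Hypothetical example: doubles every pixel value."""

    def process(self, frame):
        return [v * 2 for v in frame]

class Clipper(Preprocessor):
    """Hypothetical example: clips pixel values to the 8-bit range."""

    def process(self, frame):
        return [min(v, 255) for v in frame]

def read_frame(raw_frame, preprocessors):
    # Like VideoFrames: each preprocessor handles the frame in order.
    for p in preprocessors:
        raw_frame = p.process(raw_frame)
    return raw_frame

# Doubler runs first, then Clipper: [100, 150, 200] -> [200, 255, 255]
frame = read_frame([100, 150, 200], [Doubler(), Clipper()])
```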
Passing parameters to Pipeline
I have already said several times that one of the main goals of this project is a coherent API, which is heavily influenced by TensorFlow and other great libraries.
Previously, you had to define a new Pipeline for every new file that you wanted to process. Now, you declare a Pipeline once and pass a dictionary with parameters in the call to run. It follows the same logic as Keras and TensorFlow – declare once, use multiple times. You can also pass the parameters as a list instead of a dictionary, further decreasing the amount of code needed to create and run a Pipeline.
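A toy sketch of the declare-once, run-many-times pattern (the Pipeline class, the step functions, and the parameter names here are illustrative, not PMR's actual API):

```python
class Pipeline:
    """Toy pipeline: declared once, run many times with different params."""

    def __init__(self, steps):
        self.steps = steps  # callables taking (data, params)

    def run(self, params):
        data = params["input_path"]
        for step in self.steps:
            data = step(data, params)
        return data

# Hypothetical steps standing in for real pipeline elements.
def load(path, params):
    return f"frames of {path}"

def detect(frames, params):
    return f"faces in {frames} (threshold={params.get('threshold', 0.6)})"

pipeline = Pipeline([load, detect])
# The same pipeline object processes different files with different params.
result_a = pipeline.run({"input_path": "a.mp4"})
result_b = pipeline.run({"input_path": "b.mp4", "threshold": 0.8})
```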
Growing Model Zoo
As I stated in my proposal, PMR has 2 different categories of models – “Average performance, best speed”, which is supposed to run on a CPU, and “State-of-the-Art” – real beasts that require a GPU for reasonable inference time. The Model Zoo of PMR continues to grow, and each category already has 2 models – MobileNetSSD and MTCNN for fast inference, and beasts like YOLOv3 and DSFD for inference with a GPU.
Each of these models has its own advantages that depend on your hardware and the type of video (an action movie versus an interview, where faces don’t change quickly). Let’s discuss each of them in more detail and compare them based on the benchmark results.
Both of these models have existed since the PoC days, and this week I updated them to work as FaceDetectorKernels. YOLOv3 was ported from OpenCV’s dnn module (which doesn’t support CUDA, so inference is quite slow) to Keras and can now be used with a GPU.
MobileNetSSD stands for MobileNet Single Shot Detector. MobileNet is a neural network architecture developed at Google for fast inference on mobile and other low-performance devices, while SSD is a network used for fast object detection and classification. Combined, they produce very good face detection results with reasonable inference time.
YOLOv3 is one of the best object detectors but, unfortunately, should only be run on a GPU. As the official YOLOv3 website suggests, running it on a CPU takes roughly 6-12 seconds per image (sic!), which puts this model in the “State-of-the-Art” category.
DSFD is another new “State-of-the-Art” model, which, according to PapersWithCode.com, is the current SOTA on the Face Detection Data Set Benchmark. DSFD stands for “Dual Shot Face Detector”, and it was developed by scientists from several organizations, including Tencent’s YouTu Lab.
DSFD is an extremely accurate algorithm – it is capable of detecting faces at various scales, with or without makeup, and even blurry ones. However, this comes at a price – inference is extremely slow on average GPUs.
MTCNN is by far one of the most popular models for Face Detection and has become a classical solution for this task since its release in 2016. MTCNN can be very fast yet accurate, given a low-resolution image without many hard-to-detect faces.
This model falls into the “Average performance, best speed” category. The current implementation of MTCNN is based on TensorFlow, but I am considering moving it to an MXNet-based backend, since the latter offers optimizations for Intel CPUs.
With the growing model zoo, it has become hard to compare algorithms simply by inspecting a video with their predictions drawn as boxes. Thus, I came to the conclusion that a set of quick tests measuring the performance of Face Detection models in a quantitative way would greatly help in understanding the strong points of each model.
Labeling the Mini Test Set
I started looking for video face detection benchmarks and came across several datasets, but they were either really huge, or the owners simply didn’t approve my request for their dataset (I would still like to get my hands on it!). After thinking for a bit, I decided to put together my own set of benchmarking videos covering different cases (action movies, interview shows, TV series, etc.).
One could say that this is reinventing the wheel and that it is better to use a readily available dataset. In the business world, this would be a 100% fair statement, but one of the main purposes of this project is to learn new things. Thus, I consider labeling a few videos myself a great opportunity to understand what is actually happening at the lowest level of a machine learning algorithm – in the data that powers these systems.
To do that, I used a great tool called Computer Vision Annotation Tool (CVAT), which was recently open-sourced by Intel and is now a part of OpenCV. CVAT offers an intuitive web interface and has a bunch of features that greatly simplify the process of labeling videos. Using its interpolation mode, one only needs to specify the position of a box in the n-th and (n+10)-th frames, and the box is seamlessly interpolated in the frames between them.
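Under the hood, this kind of interpolation boils down to linearly blending the box coordinates between two keyframes. A minimal sketch of the idea (not CVAT’s actual code; the box format (x1, y1, x2, y2) is an assumption):

```python
def interpolate_boxes(box_start, box_end, n_between):
    """Linearly interpolate a bounding box between two labeled keyframes.

    box_start/box_end are (x1, y1, x2, y2) tuples; returns the boxes for
    the n_between frames that lie strictly between the two keyframes.
    """
    boxes = []
    for i in range(1, n_between + 1):
        t = i / (n_between + 1)  # fraction of the way to the end keyframe
        boxes.append(tuple(
            round(a + t * (b - a)) for a, b in zip(box_start, box_end)
        ))
    return boxes

# A box labeled at frame n and frame n+10 yields 9 in-between boxes;
# halfway through, the box has moved halfway across.
mid = interpolate_boxes((0, 0, 10, 10), (100, 0, 110, 10), 9)
```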
My mini face detection test consists of 5 videos, each representing a particular type of video:
- “Friends” TV Show (first 2 minutes) – contains multiple faces at the same time
- Russian interview show “Pozner” – only 2 faces that are easy to detect
- Faces from around the world – a lot of faces of different types, genders, and ages
- Frozen II Trailer – lots of human faces, though it is a cartoon
- Angel Has Fallen Trailer – a trailer for an action movie with a fast-changing environment
So far, I have finished labeling only 3 videos from the mini test set (videos 1, 2, and 3), so only those were used in the tests. The test procedure is straightforward:
- Only predictions with a confidence score above 0.6 are taken.
- In our case, precision indicates how many of the predicted faces turned out to be true faces, while recall shows how many of all the faces that exist in a video our model managed to predict. Thus, we want to maximize both metrics to get a precise model that can detect all the faces in a video.
- Every video is resized to 640×480 resolution.
- MobileNetSSD and MTCNN are run on the CPU, while YOLOv3 uses the GPU.
- DSFD was not tested due to hardware-related time constraints at the time of the test.
- The test was performed on a Dell XPS 15 9570 laptop: CPU – i7-8750H, GPU – GTX 1050 Ti 4 GB, RAM – 16 GB.
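The precision and recall above can be computed by matching predicted boxes to ground-truth boxes. Here is a minimal sketch using greedy IoU matching; the IoU threshold of 0.5 and the matching strategy are assumptions for illustration – the post doesn’t specify the exact matching criterion:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_recall(predicted, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes."""
    unmatched = list(ground_truth)
    tp = 0  # true positives: predictions that match a real face
    for p in predicted:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For example, a model that finds one of two labeled faces plus one spurious box scores 0.5 on both metrics.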
For future tests, I will definitely switch to more elaborate metrics such as AUC-ROC, but for the time being, I find this test sufficient to show the differences between the models.
Russian interview show “Pozner”
Let’s start with something easy. This is an example of a very simple face detection problem, as most of the time there is only one face displayed in a close-up manner.
As you can see, YOLOv3 and MTCNN have roughly the same precision, while MobileNetSSD is a clear outsider, with recall significantly lower than the others. YOLOv3 shows the best results overall, and its inference was several minutes faster thanks to the GPU.
One thing to note is that MobileNetSSD is much faster than MTCNN if both algorithms are run on a GPU, especially for high-resolution images.
“Friends” TV Show
This scene from “Friends” is a great example of multi-face detection, as it features multiple actors with different genders and facial attributes. Also, Rachel is wearing a crown, though that didn’t stop YOLOv3 from delivering good results.
As you can see, both MobileNetSSD and MTCNN are clear outsiders here, with MobileNetSSD having the worst recall. Again, YOLOv3 did its job really well, recognizing almost all the faces present in the video.
Faces from around the world
This was by far the hardest challenge for our models due to the huge diversity of faces of different genders, ages, etc.
Given the difference in recall, I can say that YOLOv3 is the leader again, as high precision without a meaningful recall value is not that useful. Interestingly, MobileNetSSD has a higher recall than MTCNN, meaning it was able to capture the diversity of faces slightly better than its counterpart.
Clearly, more tests are needed to better understand which algorithm performs best in particular circumstances. Still, it is obvious from the results that YOLOv3 is the winner, and if you want to do serious video analysis, you should consider using a GPU. Moreover, in my next post, I will provide test results for DSFD, which showed even better performance in my intermediate tests.
After labeling a few videos and benchmarking the models, I finally confirmed it to myself – we need some kind of in-between-frames optimization, as a lot of the frames are quite similar. Treating each frame independently of its surrounding frames is just plain dumb.
Different optimization techniques come to mind – ranging from a simple t-test measuring the difference between the RGB values of 2 frames to more elaborate feature extraction. However, one has to keep in mind the cost of the similarity measurement itself: if it takes too long, it might be cheaper to just run inference on every frame.
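As an illustration of the cheapest end of that spectrum, here is a sketch that groups near-identical consecutive frames by mean absolute pixel difference. The threshold value and the flattened-grayscale frame format are assumptions made for the example:

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference of two flattened grayscale frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def mark_similar_frames(frames, threshold=2.0):
    """Map each frame to the index of the first frame of its group.

    Consecutive frames whose difference from the group's anchor frame stays
    below the threshold share one anchor, so inference can run once per group.
    """
    groups = []
    anchor = 0
    for i, frame in enumerate(frames):
        if i > 0 and mean_abs_diff(frames[anchor], frame) > threshold:
            anchor = i  # scene changed enough: start a new group here
        groups.append(anchor)
    return groups

# Three nearly identical frames followed by a very different one:
frames = [[10, 10, 10], [10, 11, 10], [10, 10, 12], [200, 200, 200]]
```

The check itself is O(pixels) per frame pair, which is exactly the trade-off mentioned above: it only pays off if it is much cheaper than running the detector.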
Plans for the Next 2 Weeks
In roughly 2 weeks we have our first GSoC evaluations and this is a list of things I hope to get done in this time frame:
- Add optimization techniques to improve inference speed. One way to do that is to mark similar frames before doing face detection/recognition so that other elements of the pipeline simply process a group of frames as the same frame.
- Implement an efficient search engine over a large dataset with celebrities faces for facial recognition. This paper suggests a method for efficient search in MS-Celeb-1M dataset and my plan is to try the same strategy.
- Finish labeling the remaining videos and add more videos to test Facial Recognition. At the same time, I will move to more elaborate techniques for measuring the accuracy of Face Detection models and do the benchmark of DSFD.
- Test another Face Recognition algorithm called ArcFace. If it performs better than FaceNet, I will add it to PMR.
Finally, I would like to offer the other folks working on PMR a challenge that we can arrange around the start of August. Based on the labeled videos from my mini test set (and any other videos you want to bring in), we can create a small Kaggle-like challenge to check whose solution works best in terms of accuracy, time, memory consumption, etc.! Let me know what you think.
This is another long post, and I am sorry for making you read so much! At the same time, I am very thankful that you made it this far! As usual, I will be more than happy to hear your feedback. Stay tuned! 😎
This post is also available on my Medium blog.