GSoC 2019 – Facial Recognition and First Evaluations

There are billions of faces in the world…

Time flies: First Evaluations are already here! This stage is an important milestone for my project, as I am finishing work on the core ML-related part and moving on to the Web App, REST API, and Native Bindings. However, more ML-related work is coming, as I am going to add facial emotion, age, and gender recognition right before the second evaluations.

Optimize or Die

In my last post, I promised to try various optimization techniques to decrease video processing time. My main hypothesis is that a video contains many similar (if not almost identical) frames that we can process as one frame, thus saving a significant number of face detection/recognition cycles.

Introducing SimilarFramesFinder

The basic idea is to preprocess the video and find similar frames that can be processed as one frame. At the same time, the method for computing the similarity of two frames must be extremely fast and take much less time than naively processing every frame. Put briefly: it should be simple and fast.
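The grouping idea can be sketched in a few lines. This is only an illustration, not PMR's actual code: the function name, the `distance` callable, and the threshold value are assumptions, and the toy "frames" are single brightness values rather than images.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def group_similar_frames(
    frames: Sequence[Frame],
    distance: Callable[[Frame, Frame], float],
    threshold: float,
) -> List[List[int]]:
    """Group consecutive frame indices while each frame stays close to its predecessor."""
    groups: List[List[int]] = []
    for index, frame in enumerate(frames):
        # Start a new group on the first frame, or when this frame differs
        # too much from the previous one.
        if index == 0 or distance(frames[index - 1], frame) > threshold:
            groups.append([index])
        else:
            groups[-1].append(index)
    return groups


# Toy "frames": one brightness value per frame (hypothetical data).
toy_frames = [10, 11, 12, 80, 81, 200]
groups = group_similar_frames(toy_frames, lambda a, b: abs(a - b), threshold=5)
print(groups)  # [[0, 1, 2], [3, 4], [5]]
```

Only one frame per group then needs to go through face detection and recognition, and its results can be reused for the rest of the group.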

For SimilarFramesKernel I had to change Kernel a bit. Now you can create Kernels that run inside the main process rather than as a separate process, while spawning processes of their own as needed. Thus, SimilarFramesKernel divides the video into n chunks and processes them in parallel using multiprocessing.

If you want to add a new similarity metric, you just need to subclass SimilarFramesKernel and define compare, and you are ready to compare frames in parallel! More similarity metrics are coming, as we can clearly save a lot of processing time with these simple yet powerful techniques.
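To make the subclassing pattern concrete, here is a rough sketch. The base class below is a hypothetical stand-in written for this example, not PMR's real SimilarFramesKernel; only the idea of overriding `compare` and splitting the video into n chunks comes from the text above. The chunks are processed sequentially here, whereas each one could be handed to a separate `multiprocessing` worker.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class SimilarFramesKernelSketch(ABC):
    """Hypothetical stand-in for PMR's SimilarFramesKernel (not the real class)."""

    def __init__(self, threshold: float, n_chunks: int = 4) -> None:
        self.threshold = threshold
        self.n_chunks = n_chunks

    @abstractmethod
    def compare(self, frame_a, frame_b) -> float:
        """Return a distance between two frames; smaller means more similar."""

    def chunk_bounds(self, n_frames: int) -> List[Tuple[int, int]]:
        """Split [0, n_frames) into at most n_chunks contiguous (start, end) ranges."""
        size, extra = divmod(n_frames, self.n_chunks)
        bounds, start = [], 0
        for i in range(self.n_chunks):
            end = start + size + (1 if i < extra else 0)
            if end > start:
                bounds.append((start, end))
            start = end
        return bounds

    def similar_to_previous(self, frames: Sequence, start: int, end: int) -> List[bool]:
        """Within one chunk, flag the frames that are similar to their predecessor."""
        return [
            i > start and self.compare(frames[i - 1], frames[i]) <= self.threshold
            for i in range(start, end)
        ]


class BrightnessKernel(SimilarFramesKernelSketch):
    """Toy metric: absolute difference of mean brightness."""

    def compare(self, frame_a, frame_b) -> float:
        mean_a = sum(frame_a) / len(frame_a)
        mean_b = sum(frame_b) / len(frame_b)
        return abs(mean_a - mean_b)


kernel = BrightnessKernel(threshold=2.0, n_chunks=2)
frames = [[10, 10], [11, 11], [50, 50], [51, 51], [52, 52]]
bounds = kernel.chunk_bounds(len(frames))
flags = [kernel.similar_to_previous(frames, s, e) for s, e in bounds]
print(bounds)  # [(0, 3), (3, 5)]
print(flags)   # [[False, True, False], [False, True]]
```

One simplification to note: the first frame of every chunk always starts a new group, even if it resembles the last frame of the previous chunk. That costs at most one extra processed frame per chunk.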

Measuring Similarity of Frames

There are various ways to measure the similarity between two frames, but so far I have chosen only two simple techniques:

Color Histogram Comparison using OpenCV

A color histogram represents the distribution of colors in an image and takes the form of a vector. By comparing the histograms of two consecutive frames, we can measure how similar the colors in the two images are. In my tests this technique proved to be relatively robust and, most importantly, very fast.

Color histogram comparison can significantly reduce the number of frames to process in videos without much action, since they contain long runs of frames with very similar color distributions. Interview videos are one example: one or two people talk to each other and the frames don’t differ drastically.
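Here is a small sketch of histogram comparison. To keep it dependency-light it uses only NumPy and reimplements two common histogram distances (Chebyshev and Bhattacharyya) by hand, so it is not the code PMR runs. The bin count, the synthetic frames, and all function names are assumptions made for this example.

```python
import numpy as np


def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenate per-channel histograms of an HxWx3 uint8 frame, normalized to sum to 1."""
    channels = [
        np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
        for c in range(frame.shape[-1])
    ]
    hist = np.concatenate(channels).astype(np.float64)
    return hist / hist.sum()


def chebyshev(p: np.ndarray, q: np.ndarray) -> float:
    """Largest absolute difference between corresponding bins."""
    return float(np.max(np.abs(p - q)))


def bhattacharyya(p: np.ndarray, q: np.ndarray) -> float:
    """Bhattacharyya-style distance for histograms that each sum to 1."""
    coefficient = np.sum(np.sqrt(p * q))
    return float(np.sqrt(max(0.0, 1.0 - coefficient)))


# Synthetic frames (hypothetical data): random noise, a slightly edited copy,
# and a completely different all-black frame.
rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)
frame_b = frame_a.copy()
frame_b[:4] = 255                 # small change: the top rows turn white
frame_c = np.zeros_like(frame_a)  # very different: all black

hist_a, hist_b, hist_c = map(color_histogram, (frame_a, frame_b, frame_c))
print("Chebyshev     a-b, a-c:", chebyshev(hist_a, hist_b), chebyshev(hist_a, hist_c))
print("Bhattacharyya a-b, a-c:", bhattacharyya(hist_a, hist_b), bhattacharyya(hist_a, hist_c))
```

Both distances are small for the slightly edited frame and large for the black one, which is what makes a simple threshold usable for deciding whether two frames count as similar.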

There are various ways to compare histograms; in PMR I use OpenCV’s compareHist with the HISTCMP_BHATTACHARYYA method and SciPy’s Chebyshev distance metric. Personally, I find that Chebyshev distance works best, but you can choose another method by adding it to


SSIM stands for Structural Similarity. It was initially proposed to measure image quality by comparing the original, undistorted image with a transmitted or compressed one. SSIM performs better than plain MSE (Mean Squared Error) because it accounts for structural differences between the images, as can be seen from the formula below.

\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
SSIM. Source: Wikipedia

However, this comes at a price: SSIM is much slower than histogram comparison, so I haven’t tested it extensively, though I have added it as a kernel for SimilarFramesFinder.
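For reference, the SSIM formula can be evaluated directly with NumPy. The sketch below computes it once over the whole grayscale image, whereas practical implementations (for example scikit-image's `structural_similarity`) usually average it over local windows. The constants use the commonly cited k1 = 0.01 and k2 = 0.03; the function name and test images are assumptions for this example.

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray, dynamic_range: float = 255.0) -> float:
    """SSIM evaluated once over two whole grayscale images (no sliding window)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    # Stabilizing constants c1 and c2, using the commonly cited k1=0.01, k2=0.03.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


# Hypothetical test images: random noise, a mildly noisy copy, and an inverted copy.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
noisy = np.clip(image + rng.normal(0, 10, image.shape), 0, 255).astype(np.uint8)
inverted = 255 - image

print("same image:", global_ssim(image, image))
print("noisy copy:", global_ssim(image, noisy))
print("inverted:  ", global_ssim(image, inverted))
```

The covariance term is what separates SSIM from MSE: the inverted image has the same contrast as the original but the opposite structure, so its score drops below zero.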

ArcFace and FaceRecognizerKernel

As stated in my proposal, my last weeks before First Evaluations were devoted to Face Recognition. I’ve finally cleaned up the mess that had existed in that part of PMR since the PoC and added a new state-of-the-art algorithm called ArcFace.

Changes to FaceRecognizerKernel

After creating FaceDetector and SimilarFramesFinder, I realized that almost all ML models and algorithms of the same category work the same way. Thus, a lot of otherwise repetitive code can be shared between different algorithms.

FaceRecognizerKernel now handles both classifier training on batches of images and face recognition, so you don’t need to implement these functions on your own. The procedure for adding a new face recognition algorithm is straightforward:

  1. Subclass FaceRecognizerKernel
  2. Implement load_model
  3. Implement preprocess_face (most algorithms require preprocessing of the face image)
  4. Implement calculate_embeddings for computing embeddings of the face
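The steps above can be sketched as follows. The base class here is a hypothetical stand-in, not PMR's real FaceRecognizerKernel: only the three method names come from the list, while the signatures, the `embed` helper, and the toy "model" are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class FaceRecognizerKernelSketch(ABC):
    """Hypothetical stand-in for PMR's FaceRecognizerKernel (not the real class)."""

    def __init__(self) -> None:
        self.model = self.load_model()

    @abstractmethod
    def load_model(self):
        """Load and return the underlying model."""

    @abstractmethod
    def preprocess_face(self, face: Sequence[float]) -> List[float]:
        """Prepare a cropped face for the model."""

    @abstractmethod
    def calculate_embeddings(self, face: List[float]) -> List[float]:
        """Turn a preprocessed face into an embedding vector."""

    def embed(self, faces: Sequence[Sequence[float]]) -> List[List[float]]:
        """Shared pipeline: preprocess each face, then compute its embedding."""
        return [self.calculate_embeddings(self.preprocess_face(f)) for f in faces]


class ToyRecognizer(FaceRecognizerKernelSketch):
    def load_model(self):
        # A real kernel would load network weights; this toy "model" is a scale factor.
        return {"scale": 2.0}

    def preprocess_face(self, face):
        # Map pixel values from [0, 255] to [0, 1].
        return [value / 255.0 for value in face]

    def calculate_embeddings(self, face):
        return [self.model["scale"] * value for value in face]


embeddings = ToyRecognizer().embed([[0, 255], [51, 102]])
print(embeddings)  # [[0.0, 2.0], [0.4, 0.8]]
```

The point of the pattern is that the shared parts (here just `embed`, in PMR also classifier training and recognition) live in the base class, so a new model only supplies the three model-specific methods.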

More face recognition models are definitely coming, and it will be very easy to include new SOTA models in existing pipelines.

New SOTA Face Recognizer – ArcFace

While working on the PoC, I didn’t have much time to add a face recognizer other than FaceNet. Just recently I came across ArcFace, which holds SOTA for Face Recognition on the LFW dataset. It is much newer than FaceNet (2018 vs. 2015), and I also plan to add RetinaFace, a SOTA Face Detection model by the same researchers. The authors of ArcFace kindly provided an MXNet implementation of this great algorithm, which I used in PMR along with one of their pretrained models.

Unfortunately, I didn’t have enough time to annotate all the test videos with names, so there is no quantitative comparison between FaceNet and ArcFace, but from what I have observed myself, ArcFace is more stable. From now on, ArcFace will be used as the default model for Face Recognition.

Search for Celebrities

My proposal states that I will create a mix of different celebrity face datasets. Last week I started doing that with IMDB-Faces, Wiki-Faces, MS-Celeb-1M, and my handcrafted datasets, but unfortunately I didn’t have enough time to finish this task, and it will be done later.

However, I have already started to prepare for massive face datasets. Searching through 1 million 512-dimensional face embedding vectors will obviously take a lot of time, as the cost of an exhaustive KNN search grows with the number of vectors and can become prohibitive.

Introducing FAISS

FAISS stands for Facebook AI Similarity Search and, as the name suggests, it was developed and open-sourced by Facebook. FAISS allows you to index your pile of vectors by clustering them. The next time you want to find the nearest vector to yours, your vector is first compared to the cluster centers and then only to the vectors from the cluster whose center was nearest. Of course, there can be errors, as clustering is not ideal, but as far as I understand the chance of error is negligible.
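The cluster-then-search idea can be illustrated with plain NumPy. This is neither FAISS's API nor its implementation, just a toy version of the scheme described above; the function names, the number of clusters, and the random data are assumptions. FAISS's IVF indexes follow the same principle and additionally expose an `nprobe` setting to scan more than one cluster.

```python
import numpy as np


def nearest_center(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Index of the closest center for every row of points."""
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def build_index(vectors: np.ndarray, n_clusters: int, n_iter: int = 10, seed: int = 0):
    """Run a few k-means iterations and return (centers, cluster assignment per vector)."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assignments = nearest_center(vectors, centers)
        for c in range(n_clusters):
            members = vectors[assignments == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    # Recompute assignments so that they match the final centers.
    return centers, nearest_center(vectors, centers)


def search(query: np.ndarray, vectors: np.ndarray, centers: np.ndarray,
           assignments: np.ndarray) -> int:
    """Approximate nearest neighbour: scan only the cluster closest to the query."""
    cluster = nearest_center(query[None, :], centers)[0]
    candidates = np.flatnonzero(assignments == cluster)
    dists = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return int(candidates[dists.argmin()])


# Hypothetical data: 1000 random 16-dimensional "embeddings".
rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 16))
centers, assignments = build_index(vectors, n_clusters=10)

found = search(vectors[123], vectors, centers, assignments)
print("index found for vector 123:", found)
print("vectors scanned:", int((assignments == assignments[123]).sum()), "of", len(vectors))
```

With 10 clusters a query touches roughly a tenth of the vectors. The price is the error mentioned above: a true nearest neighbour that sits just across a cluster boundary is missed.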

Currently, the biggest dataset that I use with PMR is LFW, which has ~13 000 images, so there is no difference in processing time between FAISS and scikit-learn for KNN. However, with the introduction of a huge celebrity dataset, I am pretty sure that FAISS will be of great help. It is already available in PMR, and the user can choose between the two KNN backends in FaceRecognizer.

Benchmarking New Stuff

Benchmarks are very important for measuring the impact of new things. Unfortunately, I didn’t have enough time to finish labeling all the videos in the test set, and I have added only one new video to it. Nevertheless, there are interesting results to look at!

Saving time with SimilarFramesFinder

SimilarFramesFinder was introduced to reduce processing time, and Histogram Comparison makes it possible to find similar frames quickly. As you will see, SimilarFramesFinder can greatly reduce processing time while maintaining a high quality of face detection and recognition.

Benchmark results for “Pozner”

I introduced the interview show Pozner in my last post, and it is a perfect candidate for testing the similar-frames grouping technique. As you can see, we were able to get a 4x decrease in inference time with a relatively small decline in precision and recall. Note that introducing a higher threshold for similar frames will most probably increase precision and recall while also increasing inference time.

Benchmark results for “Friends”

The famous TV show “Friends” is a perfect candidate for a video with a lot of celebrities. As you can see, we were able to get a 1.5x decrease in processing time by sacrificing a few percent of precision and recall.

Benchmark results for “Faces from around the world”

With “Faces from around the world” things get complicated. The decrease in inference time is not as large as for the other videos, though it is still a decrease. As you can see, there is not much we can do if the video contains a lot of different people doing various things.

Looking at ArcFace

As the benchmark videos are not labeled yet, I decided to simply upload videos with bounding boxes and names so that you can see the current state of things for yourself.

The password for the next video is pmr (lower case)

As you can see, in the majority of cases ArcFace was able to predict the correct name of a person. Note how well YOLOv3 works in the majority of cases. Still, there are some misclassifications, especially for small faces and faces in unusual poses, and I have an idea of how to mitigate this, which I will discuss in the Future Plans section.

First Evaluations

With that, I conclude the first stage of the work on the core ML part of PMR. During this stage, I built the foundation of PMR, and the rest of the work will be about scaling it and enriching it with new types of algorithms.

I have done most of the things that I promised in my proposal. I also faced some problems that I didn’t foresee initially, such as Python’s problematic memory consumption, creating my own small benchmarking dataset, adding new face detection models, and ultimately devising a technique for cutting down processing time.

All the code is available in PMR’s GitHub repository, and all my progress during the first stage can be tracked in the GSoC 2019 category.

Future Plans

Face Recognition

As you can see in the aforementioned videos, there are some situations in which our model is not sure about its prediction because the face is not clearly visible. Also, note that most of the time only a tiny subset of celebrities appears in a video. Thus, instead of comparing every face from the video to a huge celebrity dataset, I propose a simple technique:

  1. Calculate embeddings for all the faces in the video (SimilarFramesFinder will also decrease the running time for this stage)
  2. Cluster embeddings with K-Means to group embeddings that represent the same person.
  3. Search for only a few (5-10) embeddings from each cluster in our celebrities dataset.
  4. Assign a celebrity name based on the majority of the names found.
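Steps 3 and 4 can be sketched as below, assuming the cluster labels from step 2 are already available (for instance from K-Means). The function name, the `lookup_name` callable standing in for the celebrity dataset search, and the toy one-dimensional "embeddings" are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence


def name_clusters(
    embeddings: Sequence[Sequence[float]],
    cluster_labels: Sequence[int],
    lookup_name: Callable[[Sequence[float]], str],
    samples_per_cluster: int = 5,
) -> Dict[int, str]:
    """Look up a few embeddings per cluster and keep the most common name."""
    by_cluster: Dict[int, List[Sequence[float]]] = defaultdict(list)
    for embedding, label in zip(embeddings, cluster_labels):
        by_cluster[label].append(embedding)

    names: Dict[int, str] = {}
    for label, members in by_cluster.items():
        # Take the first few members; a fuller version might sample randomly.
        sample = members[:samples_per_cluster]
        votes = Counter(lookup_name(e) for e in sample)
        names[label] = votes.most_common(1)[0][0]
    return names


# Toy data: 1-D "embeddings" and a fake lookup that is wrong for some faces.
toy_embeddings = [[0.1], [0.2], [0.9], [0.15], [0.8], [0.85], [0.3]]
toy_labels = [0, 0, 0, 0, 1, 1, 1]


def fake_lookup(embedding: Sequence[float]) -> str:
    return "Alice" if embedding[0] < 0.5 else "Bob"


cluster_names = name_clusters(toy_embeddings, toy_labels, fake_lookup)
print(cluster_names)  # {0: 'Alice', 1: 'Bob'}
```

Each cluster contains one face that the fake lookup labels incorrectly, yet the majority vote still assigns the right name to the whole cluster, which is the stabilizing effect the proposal is after.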

This makes predicting a person’s name more stable and saves search time. In the coming days, I will try this hypothesis out and test it on my benchmark set.

Web App, REST API, and Native Binding

As the core ML part is ready, I am going to continue working on the Web App and the rest of the things that I started in the PoC. Namely, in the next 2 weeks, I will polish the web interface, making it possible for the user to upload a video, process it, and get the results in a nice format. I will also continue to work on the REST API and start implementing Native Bindings for other programming languages.

Thank you!

As usual, thank you for reading yet another lengthy post! Stay tuned!