GSoC 2019 – Choosing the Right Structure

One of the major problems of my PoC was excessive memory consumption, and I thought that Python's poor memory management was causing it (I was terribly wrong, more on this later). I remember leaving my machine to run a web server with the PoC demo overnight and waking up to find it had eaten all 16 GB of RAM.

Clearly, there was a problem. In the C++ world it could be called a simple memory leak, but with Python it is a bit different, as everything is managed for you (and you hope that it is managed well).

Using Generators

To address this problem, I changed every PipelineElement to use generators. Instead of reading all the frames, detecting faces on each of them, and then recognizing every face, I first read one frame, then passed it to the face detector, which in turn sent it to the face recognizer.
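
To make the idea concrete, here is a minimal sketch of the generator-based flow. The stubs below only stand in for the real frame reader and models; they are not the actual PMR API:

def read_frames(n_frames=5):
    # Yield frames one at a time instead of loading the whole video.
    for i in range(n_frames):
        yield f"frame-{i}"  # stand-in for a decoded video frame

def detect_faces(frames):
    # Consume frames lazily, yielding (frame, faces) pairs.
    for frame in frames:
        yield frame, [f"face-in-{frame}"]  # stand-in for detector output

def recognize_faces(detections):
    for frame, faces in detections:
        yield frame, [f"identity-of-{face}" for face in faces]

# Only one frame is alive in the chain at any point in time.
for frame, identities in recognize_faces(detect_faces(read_frames())):
    print(frame, identities)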

At first the problem seemed solved, but soon I ran into another one: TensorFlow and (most probably) other DL frameworks that utilize the GPU don't release GPU memory even after you close a session, which makes it problematic to keep several models in memory at the same time. While struggling with this, I came across Python's multiprocessing module. To summarize, I faced two problems:

  • Whenever I read video frames, memory consumption kept growing and never went down.
  • TensorFlow didn't release GPU memory (example issue).

Python’s Multiprocessing

After reading about the multiprocessing module, I thought it was exactly what I needed. I changed every PipelineElement to run as a separate process, and at first glance everything worked much better: I was able to run a TensorFlow model in a separate process, do what I needed, and then close the process, effectively freeing the memory used by the model.

However, data passed to a new process is essentially copied, not shared between processes. There are several ways to share data across multiple processes, but the problem was that I had to share custom class objects, each holding a frame, and there was no clean solution for that.
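
For illustration, here is a minimal sketch of the pattern, with a placeholder run_model() instead of a real TensorFlow model. The two points it shows are that the worker's exit frees its (GPU) memory and that the input data is pickled and copied on the way in:

import multiprocessing as mp

def run_model(frames, results):
    # In the real code this would load a TF model and run inference;
    # when the process exits, the GPU memory it grabbed is released.
    results.put([f"prediction-for-{f}" for f in frames])

if __name__ == "__main__":
    frames = ["frame-0", "frame-1"]  # in PMR these are custom objects
    results = mp.Queue()
    worker = mp.Process(target=run_model, args=(frames, results))
    worker.start()
    predictions = results.get()  # frames were pickled and *copied*, not shared
    worker.join()                # process exit frees all of its memory
    print(predictions)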

Ray and the Shift to C++

Still struggling, I came across Ray, a framework for building distributed ML applications, which looked like a possible solution. But Ray is mainly used for running several processes over the same data, while all I needed was better memory management. That said, I can see myself using Ray later to scale the project to AWS or other cloud platforms, so learning about this library was useful.

All of this led me to the idea of writing the data pipeline routines in C++ rather than Python, while still providing Python extensions, as Python is the most popular programming language for DL/ML.

I intended to use C++ for data processing (namely, to use FFmpeg without any Python wrappers) to efficiently extract and handle raw image data from video files, because in Python it is hard to manage memory: you don't have the explicit control you get in C++. Moreover, memory that your Python app once requested from the OS can be freed internally but never returned to the OS during the lifetime of your process. Indeed, after struggling for more than a week, I was kinda mad at Python 😬

To do that, I was going to use the pybind11 library, which allows deep integration of C++ and Python. As I started thinking about how to migrate everything to C++, a lot of questions arose: is it really a good idea to embed Python into a C++ app and not vice versa, or is it better to enhance the existing Python app with C++ modules only in the stages where we need critical performance (e.g. reading frames from a video file)? All these questions led me to the following solution.

Solving the Problems

Memory Consumption During Video Frame Extraction

My first idea was to write my own C++ module for extracting frames from video, with a Python wrapper. Currently, I am using PyAV – Python bindings for FFmpeg – which is pretty convenient, but given the memory problems I was ready to give it up in favor of my own implementation when I came across the following issue. As it turned out, the problem was solved simply by updating to the newest version of PyAV 😬

Now, if I read one frame, process it, and then discard it, memory consumption stays very low. However, it is also possible to read frames in batches, in case you want to do batch inference or analyze batches as time series (e.g. with RNNs). In that case memory usage will be higher, as every frame of the batch has to be held in memory at once. Think of the batch size as a hyperparameter: a tradeoff between inference speed (where batch inference is applicable) and memory consumption.
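
As a rough sketch (the file path and batch size are just example values), the two reading modes with PyAV look like this:

import av  # PyAV

container = av.open("input.mp4")

# Mode 1: one frame at a time – memory stays low, since each frame
# can be garbage-collected right after it is processed.
for frame in container.decode(video=0):
    image = frame.to_ndarray(format="rgb24")
    ...  # run inference on `image`, then drop it

# Mode 2: batched – faster if batch inference is available, but every
# frame of the batch is kept in memory until the batch is full.
container = av.open("input.mp4")
batch, batch_size = [], 32  # batch_size is the speed/memory tradeoff knob
for frame in container.decode(video=0):
    batch.append(frame.to_ndarray(format="rgb24"))
    if len(batch) == batch_size:
        ...  # run batch inference
        batch.clear()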

I am still considering writing my own Python bindings for FFmpeg using C++ and pybind11, but having already spent a lot of time on this issue, that is a plan for the future.

TensorFlow Not Releasing GPU Memory

Even after solving the memory consumption issue, I was still debating with myself whether I should go for C++ rather than Python. Imagine my surprise when I found out that the C++ version of TensorFlow has the same problem (issue 1, issue 2): you also need the workaround of spawning separate processes to release GPU memory in C++. These findings raised a lot of doubts about using C++ as the main language for the project.

Introducing Kernels

As you may remember from my proposal, everything in the project is based on two things – Pipeline and PipelineElement, where each PipelineElement is governed by a Pipeline. One of my main goals is a very easy-to-use API, and a big inspiration for that is Keras, which is now a part of TensorFlow. Consider this snippet:

p = Pipeline([VideoHandlerElem(input_path),
              MobileNetsSSDFaceDetector(min_score_thresh=.5),
              FacenetRecognizer(),
              VideoOutputHandler("test")])

p.run()

In a few lines of code, we created a fully fledged pipeline for heavy video analysis. Of course, this is not the final version of the API, but you can grasp the general idea.

But to solve the problem of TensorFlow not releasing memory, we have to run every PipelineElement as a separate process, so that the memory used by, say, MobileNetsSSDFaceDetector can be reclaimed once we no longer need that PipelineElement. And that is where everything starts to fall apart: what data should we pass to the process, should we copy it or not, how should we return results – a lot of questions to answer while keeping a clean decomposition of the modules.

Changing PipelineElements

One of the beliefs behind this project is that the structure of PMR should encourage software engineers to work on the technical things they love, while making it easy for deep learning enthusiasts to contribute. To achieve that, I decided to change the design of PipelineElement.

Previously, you could add your own PipelineElement by subclassing either FaceDetectorElem or FaceRecognizerElem, depending on your needs, and then override run() in your new class with everything your model does.
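
For context, the old design looked roughly like this (the base class here is a bare stub standing in for the real one, and the subclass body is a placeholder):

class FaceDetectorElem:  # stub for the real PMR base class
    def run(self):
        raise NotImplementedError

class MyDetectorElem(FaceDetectorElem):
    def run(self):
        # model loading, inference, and data bookkeeping all lived
        # together in this one method
        ...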

Now FaceDetectorElem and FaceRecognizerElem are merely proxy classes that handle all the data-related work (adding the models' output to the respective Data object, passing the frame reader to a Kernel), while the heart of these classes is a Kernel object.

Users define Kernel objects and override predict(), which does all the predictions. The PipelineElement then calls the Kernel's run() function, which creates a new process that runs predict(). Upon successful execution of predict(), the PipelineElement-based class gets the model's results and stores them in the Data object. To summarize, this is what you need to do to add a new DL model to PMR:

  1. Decide which PipelineElement the nature of your model belongs to (FaceDetectorElem, FaceRecognizerElem, or any other PipelineElement-based class that is coming soon).
  2. Check what your PipelineElement-based class passes as input and expects in return.
  3. Subclass Kernel and implement inference with your model in an override of predict() (see the sketch below).
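
Here is a hedged sketch of how these pieces could fit together. The Kernel base class below is simplified from the description above, not copied from the real PMR source:

import multiprocessing as mp

class Kernel:
    # Simplified stand-in for PMR's Kernel: run() spawns a process
    # that executes the user-defined predict() and returns its result.
    def run(self, data):
        results = mp.Queue()
        worker = mp.Process(target=self._worker, args=(data, results))
        worker.start()
        output = results.get()
        worker.join()  # the process exit releases the model's GPU memory
        return output

    def _worker(self, data, results):
        results.put(self.predict(data))

    def predict(self, data):
        raise NotImplementedError

class MyFaceDetectorKernel(Kernel):
    def predict(self, data):
        # load your model and run inference here; this is a placeholder
        return [f"detection-for-{item}" for item in data]

if __name__ == "__main__":
    print(MyFaceDetectorKernel().run(["frame-0", "frame-1"]))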

Many more options for Pipeline, PipelineElement, and Kernel are coming, but for now I will leave the structure like this. This decomposition lets us use any kind of model and any programming language inside a Kernel. Later on, to gain even more speed, I will create C++-based Kernels for the critical parts of PMR and write a tutorial on how to create your own Kernels in both Python and C++.

Plans for the Next Week

I have almost finished working on the structure of my project, and now I want to focus more on the face detection and recognition models, namely:

  • Port YOLOv3 to use TensorFlow and the GPU.
  • Work on parameter tuning for both MobileNetsSSD and YOLOv3.
  • Research other model architectures that can handle these tasks even better.
  • Finally, update README.MD, as it hasn't been updated since the time of the PoC…

If you have made it this far through this lengthy post, I would like to say thank you! Every comment, be it a question or just a “thumbs up”, is welcome! Follow me on Medium and fedoskin.org and stay tuned!

This post is also available on my Medium blog.