Seeing Around Corners with a Laser

In this post I would like to show you some results of another interesting paper I came across recently that was published last year in the prestigious Nature journal. It’s on the topic of non-line-of-sight (NLOS) imaging or, in other words, it’s about research that helps you see around corners. NLOS could be something particularly useful for use cases such as autonomous cars in the future.

I’ll break this post up into the following sections:

  • The LIDAR laser-mapping technology
  • LIDAR and NLOS
  • Current Research into NLOS

Let’s get cracking, then.


You may have heard of LIDAR (a term which combines “light” and “radar”). It is used very frequently as a tool to scan surroundings in 3D. It works similarly to radar but instead of emitting radio waves, it sends out pulses of infrared light and then calculates the time it takes for this light to return to the emitter. Closer objects will reflect this laser light back sooner than distant objects. In this way, a 3D representation of the scene can be acquired, like this one which shows a home damaged by the 2011 Christchurch Earthquake:

(image obtained from here)

LIDAR has been around for decades and I came across it very frequently in my past research work in computer vision, especially in the field of robotics. More recently, LIDAR has been experimented with in autonomous vehicles for obstacle detection and avoidance. It really is a great tool to acquire depth information of the scene.
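
The time-of-flight idea described above boils down to one line of arithmetic: the pulse travels out and back, so the distance is half the round-trip path. Here is a minimal sketch of that calculation. The function name and the sample timings are made up for illustration, and a real LIDAR unit does a lot more (noise filtering, beam steering, calibration) than this.

```python
# Speed of light in a vacuum, in metres per second (close enough for air).
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_from_round_trip(round_trip_seconds: float) -> float:
    """Return the one-way distance to a surface given the pulse's round-trip time.

    The pulse travels to the surface and back, so the path is halved.
    """
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


# Hypothetical round-trip times (in nanoseconds) for three pulses.
for nanoseconds in (20.0, 66.7, 200.0):
    metres = distance_from_round_trip(nanoseconds * 1e-9)
    print(f"{nanoseconds:6.1f} ns -> {metres:6.2f} m")
```

Notice how tiny the timings are: a few metres of distance corresponds to only tens of nanoseconds, which is why the timing hardware has to be so precise.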

NLOS Imaging

But what if the area you want to see is obscured by an object? What if you want to see what’s behind a wall or what’s in front of the car in front of you? LIDAR does not, by default, allow you to do this:

The rabbit object is not reachable by the LIDAR system (image adapted from this video)

This is where the field of NLOS comes in.

The idea behind NLOS is to use sensors like LIDAR to bounce laser light off walls and then read back any reflected light.

The laser is bounced off the wall to reach the object hidden behind the occluder (image adapted from this video)

This process is repeated around a particular point (p in the image above) to obtain as much reflected light as possible. The reflected light is then analysed in an attempt to reconstruct any objects on the other side of the occlusion.

This is still an open area of research with many assumptions (e.g. that light is not reflected multiple times by the occluded object but bounces straight back to the wall and then the sensors) but the work on this done so far is quite intriguing.

Current Research into NLOS

The paper that I came across is entitled “Confocal non-line-of-sight imaging based on the light-cone transform”. It was published in March of last year in the Nature journal (555, no. 7696, p. 338). Nature is one of the world’s top and most famous academic journals, so anything published there is more than just world-class – it’s unique and exceptional.

The experiment setup from this paper was as shown here:

The setup of the experiment for NLOS. The laser light is bounced off the white wall to hit and reflect off the hidden object (image taken from original publication)

The idea, then, was to try and reconstruct anything placed behind the occluder by bouncing laser light off the white wall. In the paper, two objects were scrutinised: an “S” (as shown in the image above) and a road sign. With a novel method of reconstruction, the authors were able to obtain the following reconstructed 3D images of the two objects:


(image adapted from original publication)

Remember, these results are obtained by bouncing light off a wall. Very interesting, isn’t it? What’s even more interesting is that the text on the street sign has been detected as well. Talk about precision! You can clearly see how one day, this could come in handy with autonomous cars, which could use information such as this to increase safety on the roads.

A computer simulation was also created to measure precisely the error rates involved in the reconstruction process. The simulated setup was as shown in the above images with the bunny rabbit. The results of the simulation were as follows:

(image adapted from original publication)

The green in the image shows the reconstructed parts of the bunny superimposed on the original object. You can clearly see how the 3D shape and structure of the object are extremely well-preserved. Obviously, the parts of the bunny not visible to the laser could not be reconstructed.


This post introduced the field of non-line-of-sight imaging, which is, in a nutshell, research that helps you see around corners. The idea behind NLOS is to use sensors like LIDAR to bounce laser light off walls and then read back any reflected light. An attempt is then made to reconstruct the scene behind an occlusion.

Recent results from state-of-the-art research in NLOS published in the Nature journal were also presented in this post. Although much more work is needed in this field, the results are quite impressive and show that NLOS could one day be very useful with, for example, autonomous cars, which could use information such as this to increase safety on the roads.


To be informed when new content like this is posted, subscribe to the mailing list:

Please share what you just read:

Image Colourisation – Converting B&W Photos to Colour

I’ve got another great academic publication to present to you – and this publication also comes with an online interactive website for you to use to your heart’s content. The paper is from the field of image colourisation.

Image colourisation (or ‘colorization’ for our US readers :P) is the act of taking a black and white photo and converting it to colour. Currently, this is a tedious, manual process usually performed in Photoshop, that can typically take up to a month for a single black and white photo. But the results can be astounding. Just take a look at the following video illustrating this process to give you an idea of how laborious but amazing image colourisation can be:

Up to a month to do that for each image!? That’s a long time, right?

But then along came some researchers from the University of California, Berkeley who decided to throw some deep learning and computer vision at the task. Their work, published at the European Conference on Computer Vision in 2016, has produced a fully automatic image colourisation algorithm that creates vibrant and realistic colourisations in seconds.

Their results truly are astounding. Here are some examples:



Not bad, hey? Remember, this is a fully automatic solution that is only given a black and white photo as input.

How about really old monochrome photographs? Here is one from 1936:


And here’s an old one of Marilyn Monroe:



Quite remarkable. For more example images, see the official project page (where you can also download the code).

How did the authors manage to get such good results? It’s obvious that deep learning (DL) was used as part of the solution. Why is it obvious? Because DL is ubiquitous nowadays – and considering the difficulty of the task, no other solution is going to come near. Indeed, the authors report that their results are significantly better than previous solutions.

What is clever is how they implemented their solution. One might choose to go down the standard route of designing a neural network that maps a black and white image directly to a colour image (see my previous post for an example of this). But this idea will not work here. The reason is that similar objects can have very different colours.

Let’s take apples as an example to explain this. Consider an image dataset that has four pictures of an apple: 2 pictures showing a yellow apple and 2 showing a red one. A standard neural network solution that just maps black and white apples to colour apples will calculate the average colour of apples in the dataset and colour the black and white photo this way. So, 2 yellow + 2 red apples will give you an average colour of orange. Hence, all apples will be coloured orange because this is the way the dataset is being interpreted. The authors report that going down this path will produce very desaturated (bland) results.

So, their idea was to instead calculate what the probability is of each pixel being a particular colour. In other words, each pixel in a black and white image has a list of percentages calculated that represent the probability of that particular pixel being each specific colour. That’s a long list of colour percentages for every pixel! The final colour of the pixel is then chosen from the top candidates on this list.

Going back to our apples example, the neural network would tell us that pixels belonging to the apple in the image would have a 50% probability of being yellow and 50% probability of being red (because our dataset consists of only red and yellow apples). We would then choose either of these two colours – orange would never make an appearance.
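
To make the apples example concrete, here is a toy sketch that contrasts the two approaches. It is not the authors’ network: the RGB values are made up, there is no learning involved, and the tie-breaking rule in `pick_colour` is my own simplification (the paper’s actual rule for choosing the final colour is more involved).

```python
from collections import Counter

# Hypothetical RGB values standing in for the four apples in the toy dataset.
RED = (200, 30, 30)
YELLOW = (230, 210, 40)
training_colours = [RED, RED, YELLOW, YELLOW]


def average_colour(colours):
    """Regression-style answer: the per-channel mean of everything seen."""
    n = len(colours)
    return tuple(round(sum(c[i] for c in colours) / n) for i in range(3))


def colour_probabilities(colours):
    """Classification-style answer: a probability for each colour seen."""
    counts = Counter(colours)
    return {colour: count / len(colours) for colour, count in counts.items()}


def pick_colour(probabilities):
    """Pick the most probable colour; ties go to the colour seen first."""
    return max(probabilities, key=probabilities.get)


blended = average_colour(training_colours)
probs = colour_probabilities(training_colours)
chosen = pick_colour(probs)

print("average:", blended)       # a blend that matches no apple in the dataset
print("probabilities:", probs)
print("chosen:", chosen)         # always one of the colours actually seen
```

The averaged colour is an in-between shade that no apple in the dataset actually has, whereas the probability-based choice can only ever land on a colour that was really observed.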

As is usually the case, ImageNet with its 1.3 million images (cf. this previous blog post that describes ImageNet) is used to train the neural network. Because of the large array of objects in ImageNet, the neural network can hence learn to colour many, many scenes in the amazing way that it does.

What is quite neat is that the authors have also set up a website where you can upload your own black and white photos to be converted by their algorithm into colour. Try it out yourself – especially if you have old photos that you have always wanted to colourise.

Ah, computer vision wins again. What a great area in which to be working and researching.


Image Completion from SIGGRAPH 2017

Oh, I love stumbling upon fascinating publications from the academic world! This post will present to you yet another one of those little gems that has recently fallen into my lap. It’s on the topic of image completion and comes from a paper published in SIGGRAPH 2017 entitled “Globally and Locally Consistent Image Completion” (project page can be found here).

(Note: SIGGRAPH, which stands for “Special Interest Group on Computer GRAPHics and Interactive Techniques”, is a world renowned annual conference held for computer graphics researchers. But you do sometimes get papers from the world of computer vision being published there as is the case with the one I’m presenting here.)

This post will be divided into the following sections:

  1. What is image completion and some of its prior weaknesses
  2. An outline of the solution proposed by the above mentioned SIGGRAPH publication
  3. A presentation of results

If anything, please scroll down to the results section and take a look at the video published by the authors of the paper. There’s some amazing stuff to be seen there!

1. What is image completion?

Image completion is a technique for filling-in target regions with alternative content. A major use for image completion is in the task of object removal where an object from a photo is erased and the remaining hole is automatically substituted with content that hopefully maintains the contextual integrity of the image.

Image completion has been around for a while. Perhaps the most famous algorithm in this area is called PatchMatch which is used by Photoshop in its Content Aware Fill feature. Take a look at this example image generated by PatchMatch after the flowers in the bottom right corner were removed from the left image:

An image completion example on a natural scene generated by PatchMatch

Not bad, hey? But the problem with existing solutions such as PatchMatch is that images can only be completed with textures that solely come from the input image. That is, calculations for what should be plugged into the hole are done using information obtained just from the input image. So, for images like the flower picture above, PatchMatch works great because it can work out that green leaves is the dominant texture and make do with that.

But what about more complex images… and faces as well? You can’t work out what should go into a gap in an image of a face just from its input image. This is an image completion example done on a face by PatchMatch:

An image completion example on a face generated by PatchMatch

Yeah, not so good now, is it? You can see how trying to work out what should go into a gap from other areas of the input image is not going to work for a lot of cases like this.

2. Proposed solution

This is where the paper “Globally and Locally Consistent Image Completion” comes in. The idea behind it, in a nutshell, is to use a massive database of images of natural scenes to train a single deep learning network for image completion. The Places2 dataset is used for this, which contains over 8 million images of diverse natural scenes – a massive database from which the network basically learns the consistency inherent in natural scenes. This means that information to fill in missing gaps in images is obtained from these 8 million images rather than just one single image!

Once this deep neural network is trained for image completion, a GAN (Generative Adversarial Network) approach is utilised to further improve this network.

GAN is an unsupervised neural network training technique where two or more neural networks are used to mutually improve each other in the training phase. One neural network tries to fool another and all neural networks are updated according to results obtained from this step. You can leave these neural networks running for a long time and watch them improve each other.

The GAN technique is very common in computer vision nowadays in scenarios where one needs to artificially produce images that appear realistic. 

Two additional networks are used in order to improve the image completion network: a global and a local context discriminator network. The former looks at the entire image to assess if it is coherent as a whole. The latter looks only at a small area centred on the completed region to ensure local consistency of the generated patch. In other words, you get two additional networks assisting in the training: one for global consistency and one for local consistency. These two auxiliary networks return a result stating whether the generated image is realistic-looking or artificial. The image completion network then tries to generate completed images to fool the auxiliary networks into thinking that they’re real.
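
The difference between the two discriminators is really just a difference in what part of the image each one gets to see. The sketch below shows one way the local discriminator’s view could be computed: a square window centred on the filled-in hole and shifted back inside the image when it would otherwise fall off the edge. The function name, the box format, and the sizes in the usage line are all my own assumptions for illustration, not details taken from the paper’s code.

```python
def local_window(image_w, image_h, hole_box, patch_size):
    """Return (left, top, right, bottom) of a square patch centred on the hole.

    hole_box is (left, top, right, bottom) of the completed region. The window
    is shifted, not shrunk, when it would fall off the edge of the image.
    """
    if patch_size > image_w or patch_size > image_h:
        raise ValueError("patch must fit inside the image")
    hole_left, hole_top, hole_right, hole_bottom = hole_box
    centre_x = (hole_left + hole_right) // 2
    centre_y = (hole_top + hole_bottom) // 2
    # Clamp the top-left corner so the whole patch stays inside the image.
    left = min(max(centre_x - patch_size // 2, 0), image_w - patch_size)
    top = min(max(centre_y - patch_size // 2, 0), image_h - patch_size)
    return (left, top, left + patch_size, top + patch_size)


# Hypothetical 256 x 256 image with a hole near the middle and a 128 x 128 patch.
print(local_window(256, 256, (100, 100, 140, 140), 128))
```

The global discriminator, by contrast, would simply be handed the whole image, so it needs no window calculation at all.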

In total, it took 2 months for the entire training stage to complete on a machine with four high-end GPUs. Crazy!

The following image shows the solution’s training architecture:

Overview of architecture for training for image completion (image taken from original publication)

Typically, completing an image of 1024 x 1024 resolution that has one gap takes about 8 seconds on a machine with a single CPU or 0.5 seconds on one with a decent GPU. That’s not bad at all considering how good the generated results are – see the next section for this.

3. Results

The first thing you need to do is view the results video released by the authors of the publication. Visit their project page for this and scroll down a little. I can provide a shorter version of this video from YouTube here:

As for concrete examples, let’s take a look at some faces first. One of these faces is the same one as in the PatchMatch example above.

Examples of image completion on faces (image adapted from original publication)

How impressive is this?

My favourite examples are of object removal. Check this out:

Examples of image completion (image taken from original publication)

Look how the consistency of the image is maintained with the new patch in the image. It’s quite incredible!

My all-time favourite example is this one:

Another example of image completion (taken from original publication)

Absolutely amazing. More results can be viewed in supplementary material released by the authors of the paper. It’s well worth a look!


In this post I presented a paper on image completion from SIGGRAPH 2017 entitled “Globally and Locally Consistent Image Completion”. I first introduced the topic of image completion, which is a technique for filling-in target regions with alternative content, and described some weaknesses of previous solutions – mainly that calculations for what should be generated for a target region are done using information obtained just from the input image. I then presented the more technical aspect of the proposed solution as presented in the paper. I showed that the image completion deep learning network learnt about global and local consistency of natural scenes from a database of over 8 million images. Then, a GAN approach was used to further train this network. In the final section of the post I showed some examples of image completion as generated by the presented solution.


The Baidu and ImageNet Controversy

Two months ago I wrote a post about some recent controversies in the computer vision industry. In this post I turn to the world of academia/research and write about something controversial that occurred there.

But since the world of research isn’t as aggressive as that of the industry, I had to go back three years to find anything worth presenting. However, this event really is interesting, despite its age, and people in research circles talk about it to this day.

The controversy in question pertains to the ImageNet challenge and the Baidu research group. Baidu is one of the largest AI and internet companies in the world. Based in Beijing, it has the 2nd largest search engine in the world and is hence commonly referred to as China’s Google. So, when it is involved in a controversy, you know it’s no small matter!

I will divide the post into the following sections:

  1. ImageNet and the Deep Learning Arms Race
  2. What Baidu did and ImageNet’s response
  3. Ren Wu’s (Ex-Baidu Researcher’s) later response (here is where things get really interesting!)

Let’s get into it.

ImageNet and the Deep Learning Arms Race

(Note: I wrote about what ImageNet is in my last post, so please read that post for a more detailed explanation.) 

ImageNet is the most famous image dataset by a country mile. Currently there are over 14 million images in ImageNet for nearly 22,000 synsets (WordNet has ~100,000 synsets). Over 1 million images also have hand-annotated bounding boxes around the dominant object in the image.

However, when the term “ImageNet” is used in CV literature, it usually refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) which is an annual competition for object detection and image classification organised by computer scientists at Stanford University, the University of North Carolina at Chapel Hill and the University of Michigan.

This competition is very famous. In fact, the deep learning revolution of the 2010s is widely considered to have originated from this challenge after a deep convolutional neural network blitzed the competition in 2012. Since then, deep learning has revolutionised our world and the industry has been forming research groups like crazy to push the boundary of artificial intelligence. Facebook, Amazon, Google, IBM, Microsoft – all the major players in IT are now in the research game, which is phenomenal to think about for people like me who remember the days of the 2000s when research was laughed at by people in the industry.

With such large names in the deep learning world, a certain “computing arms race” has ensued. Big bucks are being pumped into these research groups to obtain (and trumpet far and wide) results better than other rivals. Who can prove to be the master of the AI world? Who is the smartest company going around? Well, competitions such as ImageNet are a perfect benchmark for questions like this, which makes the ImageNet scandal quite significant.

Baidu and ImageNet

To have your object classification algorithm scored on the ImageNet Challenge, you first get it trained on 1.5 million images from the ImageNet dataset. Then, you submit your code to the ImageNet server where this code is tested against a collection of 100,000 images that are not known to anybody. What is key, though, is that to avoid people fine-tuning the parameters in their algorithms to this specific testing set of 100,000 images, ImageNet only allows 2 evaluations/submissions on the test set per week (otherwise you could keep resubmitting until you’ve hit that “sweet spot” specific to this test set).

Before the deep learning revolution, a good ILSVRC classification error rate was 25% (that’s 1 out of 4 images being classified incorrectly). After 2014, error rates dropped to below 5%!

In 2015, Baidu announced that with its new supercomputer called Minwa it had obtained a record low error rate of 4.58%, which was an improvement on Google’s error rate of 4.82% as well as Microsoft’s of 4.9%. Massive news in the computing arms race, even though the error rate differences appear to be minimal (and some would argue, therefore, that they’re insignificant – but that’s another story).

However, a few days after this declaration, an initial announcement was made by ImageNet:

It was recently brought to our attention that one group has circumvented our policy of allowing only 2 evaluations on the test set per week.

Three weeks later, a follow-up announcement was made stating that the perpetrator of this act was Baidu. ImageNet had conducted an analysis and found that 30 accounts connected to Baidu had been used in the period of November 28th, 2014 to May 13th, 2015 to make on average four times the permitted number of submissions.
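
To get a feel for the scale involved, here is a rough back-of-the-envelope calculation using only the dates and the limit quoted above. It assumes that “four times the permitted” limit means four times the 2-per-week allowance for a single team over the whole period. That reading is my own interpretation, so treat the final figure as indicative only.

```python
from datetime import date

PERMITTED_PER_WEEK = 2

start = date(2014, 11, 28)
end = date(2015, 5, 13)

# Length of the period in weeks, then the allowance over that period.
weeks = (end - start).days / 7
permitted_total = PERMITTED_PER_WEEK * weeks
implied_total = 4 * permitted_total

print(f"period: {weeks:.1f} weeks")
print(f"permitted for one team: about {permitted_total:.0f} submissions")
print(f"four times that: about {implied_total:.0f} submissions")
```

Under that assumption, we are talking about well over a hundred test-set evaluations beyond what a single team was meant to have.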

As a result, ImageNet disqualified Baidu from that year’s competition and banned them from re-entering for a further 12 months.

Ren Wu, a distinguished AI scientist and head of the research group at the time, apologised for this mistake. A week later he was dismissed from the company. But that’s not the end of the saga.

Ren Wu’s Response

Here is where things get really interesting. 

A few days after being fired from Baidu, Ren Wu sent an email to Enterprise Technology in which he denied any wrongdoing:

We didn’t break any rules, and the allegation of cheating is completely baseless

Whoa! Talk about opening a can of worms!

Ren stated that there is “no official rule specify [sic] how many times one can submit results to ImageNet servers for evaluation” and that this regulation only appears once a submission is made from one account. From this he came to understand that 2 submissions per week can be made from each account/individual rather than a whole team. Since Baidu had 5 authors working on the project, he argues that he was allowed to make 10 submissions per week.

I’m not convinced though because he still used 30 accounts (purportedly owned by junior students assisting in the research) to make these submissions. Moreover, he still admits that on two occasions the 10 submission threshold was breached – so, he definitely did break the rules.

Things get even more interesting, however, when he states that he officially apologised just for those two occasions as requested by his management:

A mistake in our part, and it was the reason I made a public apology, requested by my management. Of course, this was my biggest mistake. And things have been gone crazy since. [emphasis mine]

Whoa! Another can of worms. He apologised as a result of a request by his management and he states that this was a mistake. It looks like he’s accusing Baidu of using him as a scapegoat in this whole affair. Two months later he confirms this to the EE Times, by stating that

I think I was set up

Well, if that isn’t big news, I don’t know what is! I personally am not convinced by Ren’s arguments. But it at least shows that the academic/research world can be exciting at times, too 🙂


Football/Soccer on Your Tabletop

Well, it’s World Cup season now, isn’t it? Australia got eliminated this week so I’m feeling a bit depressed at the moment. However, seeing teams like Germany also not make it past the group stage makes me feel a little better (sorry German people reading this post :P).

But since the World Cup is on, it is only fitting that I write about something from the field of Computer Vision that is related to football. So, in this post I’m going to present to you quite an amazing paper I stumbled upon entitled “Soccer on Your Tabletop” (Rematas et al., CVPR 2018, pp. 4738-4747).

(Recall that CVPR is a world-class academic conference on computer vision. Anything published there is always worth reading.)

The goal of the paper is to present an algorithm that reconstructs a 3D representation of a football game from a single 2D video – just like you would find on YouTube. The 3D video could then be projected onto a tabletop (like a hologram) and viewed by everyone in the room from multiple angles. An interesting concept!

Usually something like this is obtained by having multiple cameras set up that can then work together to provide 3D information of the football pitch. But the idea here is to get all information from a single 2D video.

Here’s a clip of the entire project that has been released by the authors. Watch it!

Ah, you just have to love computer vision!

Let’s take a look at the (slightly simplified here) steps involved in the 2D -> 3D reconstruction process:

  1. Input frame: obtained from any 2D video of a football game (captured from stationary cameras).
  2. Camera calibration: this is performed using the football pitch line markings as guidance. The line markings provide excellent reference points to obtain the 2D plane of the football pitch from which players’ measurements can be deduced.
  3. Player detection, pose estimation, and tracking: this is done using already existing techniques. Specifically, this paper from CVPR 2015 is used for detecting bounding boxes around players (top left image below), this paper from CVPR 2016 for estimating poses (top right image below), and a simple player tracking algorithm where you compare bounding boxes from adjacent frames and match them according to closest 2D Euclidean distance (bottom left image).

    (image taken from original publication)
  4. Player segmentation: the idea here is to highlight the entire contour of the player after performing the above steps (see bottom right image above). This is performed by taking each pixel and analysing its neighbouring pixels for similarities in colour and edge information until each player is extracted. (Several more steps are performed to fine-tune this process but I’ll skip over these).
  5. Player depth estimation and mesh generation. This is the tricky part. What the authors did is quite intuitive. To constrain the solution space to just football related poses, body shapes, and clothing, the authors created a training dataset from FIFA video games. Lol! What they found was that it was possible to intercept calls between the game engine and the GPU while playing the video game and then to extract depth maps from these intercepted calls. In doing so, they were able to train a deep neural network to extract depth maps from 2D videos. This trained network was then used on 2D YouTube videos. Absolutely brilliant!

    (image obtained from project’s video)
  6. Scene reconstruction. Once player depth estimation and mesh information (which is 3D information) is obtained, the scene can then be reconstructed. What the authors ended up doing is to use Microsoft HoloLens (a mixed reality lens that enables you to see and interact with holograms in real life). So the football pitch on the tabletop you see in the image below isn’t real! Can you imagine watching a match like this around a table with your mates!? There is a catch with the project, however. It’s not good enough yet to reconstruct the ball, which means that at the moment all you can view in 3D are players running around chasing an invisible object 🙂 But that’s work in progress and the job and essence of research.

    (image obtained from project’s video)
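
The simple tracking idea from step 3 (matching bounding boxes between adjacent frames by closest 2D Euclidean distance) is easy to sketch. What follows is a minimal greedy version that compares box centres. The box format, the function names, and the greedy matching order are my own assumptions, so don’t take this as the authors’ exact procedure.

```python
import math


def centre(box):
    """Centre point of a bounding box given as (left, top, right, bottom)."""
    left, top, right, bottom = box
    return ((left + right) / 2.0, (top + bottom) / 2.0)


def match_boxes(previous_boxes, current_boxes):
    """Greedily pair each previous box with the closest unused current box.

    Returns a dict mapping an index in previous_boxes to an index in
    current_boxes. Boxes left without a partner are simply omitted.
    """
    # All candidate pairs, sorted so the closest pairs are claimed first.
    candidates = sorted(
        (math.dist(centre(p), centre(c)), i, j)
        for i, p in enumerate(previous_boxes)
        for j, c in enumerate(current_boxes)
    )
    matches = {}
    used_current = set()
    for _, i, j in candidates:
        if i not in matches and j not in used_current:
            matches[i] = j
            used_current.add(j)
    return matches


# Two hypothetical players; the detector lists them in a different order
# in the next frame.
frame_a = [(100, 200, 140, 300), (400, 220, 440, 320)]
frame_b = [(405, 222, 445, 322), (104, 198, 144, 298)]
print(match_boxes(frame_a, frame_b))
```

A scheme like this works as long as players don’t move far between frames. It would get confused when two players cross paths, which is presumably one reason the paper describes its tracker as a simple one.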

Amazing, if you ask me! I can’t wait to see what the future holds for computer vision.

And believe it or not, code for this project is available online for you to play around with as much as you like. So, enjoy!


Computer Vision in the Fashion Industry – Part 3

In my last two posts I’ve looked at computer vision and the fashion industry. I introduced the lucrative fashion industry and showed what Microsoft recently did in this field with computer vision. I also presented two papers from last year’s International Conference on Computer Vision (ICCV).

In this post, the final of the series, I would like to present to you two papers from last year’s ICCV workshop that was entirely devoted to fashion:

  1. “Dress like a Star: Retrieving Fashion Products from Videos” (N. Garcia and G. Vogiatzis, ICCV Workshop, 2017, pp. 2293-2299) [source code]
  2. “Multi-Modal Embedding for Main Product Detection in Fashion” (Rubio, et al., ICCV Workshop, 2017, pp. 2236-2242) [source code]

Once again, I’ve provided links to the source code so that you can play around with the algorithms as you wish. Also, as in previous posts, I am going to provide you with just an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

Dress Like a Star

This paper is impressive because it was written by a PhD student from Birmingham in the UK. By publishing at the ICCV Workshop (I discussed in my previous post how important this conference is), Noa Garcia has pretty much guaranteed her PhD and quite possibly any future research positions. Congratulations to her! However, I do think they cheated a bit to get into this ICCV workshop, as I explain further down.

The idea behind the paper is to provide a way to retrieve clothing and fashion products from video content. Sometimes you may be watching a TV show, film or YouTube clip and think to yourself: “Oh, that shirt looks good on him/her. I wish I knew where to buy it.”

To use the proposed algorithm, you provide it with a photo of a screen that is playing the video content. It then queries a database and returns matching clothing content in the frame, as shown in this example image:

(image source: original publication)

Quite a neat idea, wouldn’t you say?

The algorithm has three main modules: product indexing, training phase, and query phase.

The first two modules are performed offline (i.e. before the system is released for use). They require a database to be set up with video clips and another one with clothing articles. Then, the clothing items and video frames are matched to each other with some heavy computing (this is why it has to be performed offline – there’s a lot of computation here that cannot be done in real time).

You may be thinking: but heck, how can you possibly store and analyse all video content with this algorithm!? Well, to save storage and computation space, each video is processed (offline) and divided into shots/scenes that are then summarised into a single vector containing features (features are small “interesting” or “stand-out” patches in images).

Hence, in the query phase, all you need to do is detect features in the provided photo, search for these features in the database (rather than the raw frames), locate the scene depicted in the photo in the video database, and then extract the clothing articles in the scene.
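
The query phase can be pictured as a nearest-neighbour lookup over the per-scene summary vectors. Here is a toy sketch of that lookup. Everything in it is a stand-in: the scene names, feature values, and clothing lists are invented, and cosine similarity over plain floating-point vectors is my own choice of comparison. The paper’s actual feature type and index structure are not reproduced here.

```python
import math

# Hypothetical index: one summary feature vector per scene, plus the
# clothing items annotated for that scene. All names and numbers are made up.
SCENE_INDEX = {
    "film_a/scene_012": ([0.9, 0.1, 0.3], ["leather jacket"]),
    "film_a/scene_047": ([0.2, 0.8, 0.5], ["red dress", "scarf"]),
    "film_b/scene_003": ([0.1, 0.2, 0.9], ["denim shirt"]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query(photo_features):
    """Return (scene id, clothing items) for the best-matching scene."""
    best_scene = max(
        SCENE_INDEX,
        key=lambda scene: cosine_similarity(photo_features, SCENE_INDEX[scene][0]),
    )
    return best_scene, SCENE_INDEX[best_scene][1]


# Features that would be extracted from the user's photo (made up here).
print(query([0.25, 0.75, 0.45]))
```

The point to take away is that the search runs over one compact vector per scene rather than over every raw frame, which is what keeps the storage and lookup cost manageable.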

To evaluate this algorithm, the authors set up a system with 40 movies (80+ hours of video). They were able to retrieve the scene from a video depicted in a photo with an accuracy of 87%.

Unfortunately, in their experiments, they did not set up a fashion item database but left this part out as “future work”. That’s a little bit of a let down and I would call that “twisting the truth” in order to get into a fashion-dedicated workshop. But, as they state in the conclusion: “the encouraging experimental results shown here indicate that our method has the potential to index fashion products from thousands of movies with high accuracy”.

I’m still calling this cheating 🙂

Main Product Detection in Fashion

This paper discusses an algorithm to extract the main clothing product in an image according to any textual information associated with it – like in a fashion magazine, for example. The purpose of this algorithm is to extract these single articles of clothing to then be able to enhance other datasets that need to solely work with “clean” images. Such datasets would include ones used in fashion catalogue searches (e.g. as discussed in the first post in this series) or systems of “virtual fitting rooms” (e.g. as discussed in the second post in this series).

The algorithm works by utilising deep neural networks (DNNs). (Is that a surprise? There’s just no escaping deep learning nowadays, is there?) To cut a long story short, neural networks are trained to extract bounding boxes of fashion products that are then used to train other DNNs to match products with textual information.

Example results from the algorithm are shown below.

(image source: original publication)

You can see above how the algorithm nicely finds all the articles of clothing (sunglasses, shirt, necklace, shoes, handbag) but only highlights the pants as the main product in the image according to the textual information associated with the picture.


In this post, the final one of the series, I presented two papers from last year’s ICCV workshop that was entirely devoted to fashion. The first paper describes a way to retrieve clothing and fashion products from video content by providing it with a photo of a computer/TV screen. The second paper discusses an algorithm to extract the main clothing product in an image according to any textual information associated with it.

I always say that it’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Some of the ideas from the academic world I’ve looked at in this series leave a lot to be desired but that’s the way research is: one small step at a time.



Computer Vision in the Fashion Industry – Part 2

In my last post I introduced the fashion industry and I gave an example of what Microsoft recently did in this field with computer vision. In today’s post, I would like to show you what the academic world has recently been doing in this respect.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Artificial intelligence, with deep learning at the forefront, is a prime example of this. Hence why I’m always keen to keep up-to-date with the goings-on of computer vision in academia.

A good way to keep abreast of computer vision in the academic world is to follow two of the top conferences in the field: the International Conference on Computer Vision (ICCV) and the Conference on Computer Vision and Pattern Recognition (CVPR). These annual conferences are huge. This is where the best of the best come together; where the titans of computer vision pit their wits against each other. Believe me, publishing in either one of these conferences is a lifetime achievement. I have 10 publications in total there (that’s a lie… I have none :P).

Interestingly, the academic world has been eyeing the fashion industry for the last 5 years it seems. An analysis was performed by Fashwell recently that counted the number of fashion-related papers at these two conferences. There appears to be a steady increase in these since 2013, as the graph below depicts:

Notice the huge spike at the end? The reason for it is that ICCV last year held an entire workshop specifically devoted to fashion.

(Note: a workshop is a, let’s say, less-formal format of a conference usually held as a side event to a major conference. Despite this, publishing at an ICCV or CVPR workshop is still a major achievement.)

As a result, there is plenty of material for me to present to you on the topic of computer vision in the fashion industry. Let’s get cracking!

ICCV 2017

In this post I will present two closely-related papers to you from the 2017 ICCV conference (in my next post I’ll present a few from the workshop):

  1. “Be Your Own Prada: Fashion Synthesis With Structural Coherence” (Zhu, et al., ICCV, 2017, pp. 1680-1688) [source code]
  2. “A Generative Model of People in Clothing” (Lassner, et al., ICCV, 2017, pp. 853-862) [source code]

Just like with all my posts, I will give you an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

But I have provided links to the source code of these papers, so please feel free to download, install and play around with these beauties at home.

1. Be Your Own Prada

This is a paper that presents an algorithm that can generate new clothing on an existing photo of a person without changing the person’s shape or pose. The desired new outfit is provided as a sentence description, e.g.: “a white blouse with long sleeves but without a collar and blue jeans”.

This is an interesting idea! You can provide the algorithm with a photo of yourself and then virtually try on a seemingly endless combination of styles of shirts, dresses, etc.

“Do I look good in blue and red? I dunno, but let’s find out!”

Neat, hey?

The algorithm has a two-step process:

  1. Image segmentation. This step semantically breaks the image up into human body parts such as face, arms, legs, hips, torso, etc. The result basically captures the shape of the person’s body and parts, but not their appearance, as shown in the example image below. Also, along with image segmentation, other attributes are extracted such as skin colour, long/short hair, and gender to provide constraints and boundaries for how far the image rendering step can go (you don’t want to change the person’s skin or hair colour, for example). The segmentation step is performed using a trained generative adversarial network (GAN – see this post for a description of these).

    (image adapted from original publication)
  2. Image rendering. This is the part that places new outfits onto the person using the results (segmentation and constraints/boundaries) from the first step as a guide. GANs are used here again. Example clothing articles were taken from 80,000 annotated images selected from the DeepFashion dataset.

Let’s take a look at some results (taken from the original publication). Remember, all that is provided is one picture of a person and then a description of what that person’s outfit should look like:


Pretty cool! You could really see yourself using this, couldn’t you? We might be using something like this on our phones soon, I would say. Take a look at the authors’ page for this paper for more example result images. Some amazing stuff there.

2. A Generative Model of People in Clothing

This paper is still a work in progress, meaning that more research is needed before anything from it gets rolled out for everyday use. The intended goal of the algorithm is similar to the one presented above but instead of being able to generate images of the same person wearing a different outfit, this algorithm can generate random images of different people wearing different attires. Usually, generating such images is achieved after following a complex 3D graphics rendering pipeline.

It is a very complex algorithm but, in a nutshell, it first creates a dataset containing human pose, shape, and face information along with clothing articles. This information is then used to learn the relationships between body parts and respective clothes and how these clothes fit nicely to their appropriate body parts, depending on the person’s pose and shape.

The dataset is created using the SMPLify 3D pose and shape estimation algorithm on the Chictopia10K fashion dataset (that was collected from the Chictopia fashion website) as well as dlib’s implementation of the fast facial shape matcher to enhance each image with facial information.

Let’s take a look at some results.

The image below shows a randomly generated person wearing different coloured clothes (provided manually). Notice that, for example, with the skirts, the model learnt to put different wrinkles on the skirt depending on its colour. Interesting, isn’t it? The face on the person seems out of place – one reason why the algorithm is still a work in progress.


The authors of the paper also attempted to create a random fashion magazine photo dataset from their algorithm. The idea behind this was to show that fashion magazines could perhaps one day generate photos automatically without going through the costly process of setting up photo sessions with real people. Once again, the results leave a lot to be desired but it’s interesting to see where research is heading.



This post extended my last post on computer vision in the fashion industry. I first examined how fashion is increasingly being looked at in computer vision academic circles. I then presented two papers from ICCV 2017. The first paper describes an algorithm to generate a new attire on an existing photo of a person without changing the person’s shape or pose. The desired new outfit is provided as a sentence description. The second paper shows a work-in-progress algorithm to randomly generate people wearing different clothing attires.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives.



Computer Vision on Mars

I was doing my daily trawl of the internet a few days ago looking at the latest news in artificial intelligence (especially computer vision) and an image caught my eye. The image was of one of the Mars Exploration Rovers (MER) that landed on the Red Planet in 2004. Upon seeing the image I thought to myself: “Heck, those rovers must surely have used computer vision up there!?” So, I spent the day looking into this and, sure as can be, not only was computer vision used by these rovers, it in fact played an integral part in their missions.

In this post, then, I’m going to present to you how, where, and when computer vision was used by those MERs. It’s been a fascinating few days for me researching into this and I’m certain you’ll find this an interesting read also. I won’t go into too much detail here but I’ll give you enough to come to appreciate just how neat and important computer vision can be.

If you would like to read more about this topic, a good place to start is “Computer Vision on Mars” (Matthies, Larry, et al. International Journal of Computer Vision 75.1, 2007: 67-92.), which is an academic paper published by NASA in 2007. You can also follow any additional referenced publications there. All images in this post, unless otherwise stated, were taken from this paper.

Background Information

In 2003, NASA launched two rovers into space with the intention of landing them on Mars to study rocks and soils for traces of past water activity. MER followed three earlier surface missions: the two Viking missions (both launched in 1975 and landed in 1976), which used stationary landers rather than rovers, and the Mars Pathfinder mission of 1997.

Due to constraints in processing power and memory capacity, no image processing was performed by the Viking landers. They only took pictures with their on-board cameras to be sent back to Earth.

The Sojourner (the name of the Mars Pathfinder rover), on the other hand, performed computer vision in one way only. It used stereoscopic vision to provide scientists with detailed maps of the terrain around the rover for operators on Earth to use in planning movement trajectories. Stereoscopic vision provides visual information from two viewing angles a short distance apart just like our eyes do. This kind of vision is important because two views of the same scene allow for the extraction of 3D data (i.e. depth data). See this OpenCV tutorial on extracting depth maps from stereo images for more information on this.

The MER Rovers

The MER rovers, Spirit and Opportunity as they were named, were identical. Both had a 20 MHz processor, 128 MB of RAM, and 256 MB of flash memory. Not much to work with there, as you can see! Phones nowadays are about 1000 times more powerful.

The rovers also had a monocular descent camera facing directly down and three sets of stereo camera pairs: one pair each at the front and back of the rovers (called hazard cameras, or “hazcams” for short) and a pair of cameras (called navigation cameras, or “navcams” for short) on a mast 1.3 m (4.3 feet) above the ground. All these cameras took 1024 x 1024 greyscale photos.

But wait, those colour photos we’ve seen so many times from these missions were fake, then? Nope! Cleverly, each of the stereoscopic camera lenses also had a wheel of 8 filters that could be rotated. Consecutive images could be taken with a different filter (e.g. infrared, ultra-violet, etc.) and colour extracted from a combination of these. Colour extraction was only done on Earth, however. All computer vision processing on Mars was therefore performed in greyscale. Fascinating, isn’t it?

Components of an MER rover (image source)

The Importance of Computer Vision in Space

If you’ve been around computer vision for a while you’ll know that for things such as autonomous vehicles, vision solutions are not necessarily the most efficient. For example, lidar (Light Detection And Ranging – a technique similar to sonar for constructing 3D representations of scenes by emitting pulsating laser light and then measuring reflections of it) can give you 3D obstacle avoidance/detection information much more easily and quickly. So, why did NASA choose to use computer vision (and so much of it, as I’ll be presenting to you below) instead of other solutions? Because laser equipment is fragile and it may not have withstood the harsh conditions of Mars. So, digital cameras were chosen instead.

Computer Vision on Mars

We now have information on the background of the mission and the technical hardware relevant to us so let’s move to the business side of things: computer vision.

The first thing I will talk about is the importance of autonomy in space exploration. Due to communication latency and bandwidth limitations, it is advantageous to minimise human intervention by allowing vehicles or spacecraft to make decisions on their own. The Sojourner had minimal autonomy and only ended up travelling approximately 100 metres (328 feet) during its entire mission (which lasted a good few months). NASA wanted the MER rovers to travel on average that much every day, so they put a lot of time and research into autonomy to help them reach this target.

In this respect, the result was that they used computer vision for autonomy on Mars in 3 ways:

  1. Descent motion estimation
  2. Obstacle detection for navigation
  3. Visual odometry

I will talk about each of these below. As mentioned in the introduction, I won’t go into great detail here but I’ll give you enough to satisfy that inner nerd in you 😛

1. Descent Image Motion Estimation System

Two years before the launch of the rocket that was to take the rovers to Mars, scientists realised that their estimates of near-surface wind velocities of the planet were too low. This could have proven catastrophic because severe horizontal winds could have caused irreparable damage upon an ill-judged landing of the rover. Spirit and Opportunity had horizontal impulse rockets that could be used to reduce horizontal velocity upon descent but no system to detect actual horizontal speed of the rovers.

Since a regular horizontal velocity sensor could not be installed due to cost and time constraints, it was decided to turn to computer vision for assistance! A monocular camera was attached to the base of the rover that would take pictures of the surface of the planet as the rovers were descending onto it. These pictures would be analysed in-flight to provide estimates of horizontal speeds in order to trigger the impulse rockets, if necessary.

The computer vision system for motion estimation worked by tracking a single feature (features are small “interesting” or “stand-out” patches in images). The feature was located in photos taken by the rovers and then its position was tracked between consecutive images.

By combining this feature tracking information with measurements from the angular velocity and vertical velocity sensors (that were already installed for the purpose of on-surface navigation), the entire velocity vector (i.e. information about the magnitude and direction of the rover’s speed) could be calculated.
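To give a feel for how a feature’s movement in the image can be turned into a horizontal speed, here is a heavily simplified sketch in Python. It assumes a camera pointing straight down over flat ground, a constant altitude between the two photos and no rotation of the lander. The real system had to compensate for all of these using its other sensors. The function name and the numbers are hypothetical and are not mission values.

```python
def horizontal_velocity(shift_px, altitude_m, focal_px, dt_s):
    """Estimate horizontal ground velocity (m/s) from a feature's image shift.

    shift_px   -- (dx, dy) movement of the feature between two images, in pixels
    altitude_m -- height of the camera above the ground, in metres
    focal_px   -- focal length of the camera expressed in pixels
    dt_s       -- time elapsed between the two images, in seconds
    """
    dx, dy = shift_px
    # Pinhole camera looking straight down: one pixel covers this many metres
    metres_per_pixel = altitude_m / focal_px
    return (dx * metres_per_pixel / dt_s, dy * metres_per_pixel / dt_s)


# Hypothetical numbers: a 10 pixel shift seen from 1000 m, photos 4 s apart
print(horizontal_velocity((10, 0), 1000.0, 500.0, 4.0))
```

With these made-up numbers each pixel covers 2 m of ground, so a 10 pixel shift over 4 seconds works out to 5 m/s.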

The feature tracking algorithm, called the Descent Image Motion Estimation System (DIMES), consisted of 7 steps as summarised by the following image:


The first step reduced the image size to 256 x 256 resolution. The smaller the resolution, the faster image processing calculations can be performed – but at the possible expense of accuracy. The second step was responsible for estimating the maximum possible area of overlap in consecutive images to minimise the search area for features (there’s no point in detecting features in regions of the first image that you know are not going to be present in the second). This was done by taking into consideration knowledge from sensors of things such as the rover’s altitude and orientation. The third step picked out two features from an image using the Harris corner detector (discussed here in this OpenCV tutorial). Only one feature was needed for the algorithm to work but two were detected in case one feature could not be located in the following image. A few noise “clean-up” operations on images were performed in step 4 to reduce effects of things such as blurring.
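For the curious, here is a bare-bones sketch of the idea behind the Harris corner detector, written in plain Python so that it runs without any libraries. It is only an illustration of the general technique and not the code that flew on the rovers. The window radius, the constant `k = 0.04` and the helper name `strongest_corner` are my own choices. In practice you would reach for an optimised library implementation such as the one in OpenCV.

```python
def harris_response(img, k=0.04, radius=1):
    """Return a grid of Harris corner scores for a 2D list of grey levels."""
    h, w = len(img), len(img[0])
    # Image gradients by central differences (left at zero on the border)
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0

    response = [[0.0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            # Sum gradient products over a small window around (y, x)
            sxx = syy = sxy = 0.0
            for v in range(y - radius, y + radius + 1):
                for u in range(x - radius, x + radius + 1):
                    sxx += ix[v][u] ** 2
                    syy += iy[v][u] ** 2
                    sxy += ix[v][u] * iy[v][u]
            # Harris score: determinant minus k times the squared trace
            response[y][x] = sxx * syy - sxy ** 2 - k * (sxx + syy) ** 2
    return response


def strongest_corner(response):
    """Return the (row, column) of the highest Harris score."""
    best = max((value, y, x) for y, row in enumerate(response) for x, value in enumerate(row))
    return best[1], best[2]


# Toy image: a bright square in the bottom-right quadrant of a dark 8 x 8 image
image = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(8)] for y in range(8)]
print(strongest_corner(harris_response(image)))  # prints (4, 4)
```

The score is high where the image changes strongly in two directions at once (a corner), negative along a straight edge and zero in flat regions. This is why corners make good features to track.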

Step 5 is interesting. The feature patches (aka feature templates) and search windows in consecutive images were rectified (rotated, twisted, etc.) to remove orientation and scale differences in order to make searching for features easier. In other words, the images were rotated, twisted and enlarged/diminished to be placed on the same plane. An example of this from the actual mission (from the Spirit rover’s descent) is shown in the image below. The red squares in the first image are the detected feature patches that are shown in green in the second image with the search windows shown in blue. You can see how the first and second images have been twisted and rotated such that the feature size, for example, is the same in both images.


Step 6 was responsible for locating in the second image the two features found in the first image. Moravec’s correlator (an algorithm developed by Hans Moravec and published in his PhD thesis way back in 1980) was used for this. The general idea in this algorithm is to minimise the search area first instead of searching over every possible location in an image for a match. This is done by first selecting potential regions in an image for matches and only then performing a more exhaustive search within those regions.
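The “narrow down first, then search exhaustively” idea can be sketched in a few lines of Python. To be clear, this is not Moravec’s correlator itself. It is a generic two-stage template search that I am using as a stand-in, with a sum of squared differences as the matching score and a made-up smooth test image. The function names and the `step` parameter are assumptions of mine.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and an image patch."""
    total = 0.0
    for r, row in enumerate(template):
        for c, value in enumerate(row):
            diff = image[top + r][left + c] - value
            total += diff * diff
    return total


def coarse_to_fine_match(image, template, step=2):
    """Find the (row, column) of the patch that best matches the template."""
    th, tw = len(template), len(template[0])
    max_top = len(image) - th
    max_left = len(image[0]) - tw
    # Stage 1: score only every `step`-th position to find a promising region
    _, ct, cl = min(
        (ssd(image, template, t, l), t, l)
        for t in range(0, max_top + 1, step)
        for l in range(0, max_left + 1, step)
    )
    # Stage 2: exhaustive search, but only around the stage 1 winner
    _, best_top, best_left = min(
        (ssd(image, template, t, l), t, l)
        for t in range(max(0, ct - step), min(max_top, ct + step) + 1)
        for l in range(max(0, cl - step), min(max_left, cl + step) + 1)
    )
    return best_top, best_left


# Made-up smooth 10 x 10 image; the template is the 3 x 3 patch at row 3, column 5
image = [[float(x * x + y * y) for x in range(10)] for y in range(10)]
template = [row[5:8] for row in image[3:6]]
print(coarse_to_fine_match(image, template))  # prints (3, 5)
```

The saving comes from stage 1 visiting only a fraction of the positions. The catch is that a coarse pass can miss the true match on very busy images, so a scheme like this only works when the matching score varies reasonably smoothly.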

The final step is combining all this information to calculate the velocity vector. In total, the DIMES algorithm took 14 seconds to run up there in the atmosphere of Mars. It was run by both rovers during their descent. The Spirit rover was the only one that fired its impulse rockets as a result of calculations from DIMES. Its horizontal velocity was at one stage reduced from 23.5 m/s (deemed to be slightly over a safe limit) to 11 m/s, which ensured a safe landing. Computer vision to the rescue! Opportunity’s horizontal speed was never calculated to be too fast so firing its stabilising rockets was considered to be unnecessary. It also had a successful landing.

All the above steps were performed autonomously on Mars without any human intervention. 

2. Stereo Vision for Navigation

To give the MER rovers as much autonomy as possible, NASA scientists developed a stereo-vision-based obstacle detection and navigation system. The idea behind it was to give the scientists the ability to simply provide the rovers each day with a destination and for the vehicles to work things out on their own with respect to navigation to this target (e.g. to avoid large rocks).

And their system performed beautifully.

The algorithm worked by extracting disparity (depth) maps from stereo images – as I’ve already mentioned, see this OpenCV tutorial for more information on this technique. What was done, however, by the rovers was slightly different to that tutorial (for example a simpler feature matching algorithm was employed), but the gist of it was the same: feature point detection and matching were performed to find the relationship between images, and knowledge of camera properties such as focal lengths and baseline distances allowed for the derivation of depth for all pixels in an image. An example of depth maps calculated in this way by the Spirit rover is shown below:

The middle picture was taken by Spirit and shows a rock on Mars approximately 0.5 m (1.6 feet) in height. The left image shows corresponding range information (red is closest, blue furthest). The right image shows corresponding height information.
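The relationship between disparity and depth in an idealised, rectified stereo pair is simple enough to show in code. In the sketch below the function name and the camera numbers are hypothetical and are not the rovers’ actual camera parameters.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth (in metres) of a point seen by a rectified stereo camera pair.

    disparity_px -- horizontal shift of the point between left and right image
    focal_px     -- focal length of the cameras expressed in pixels
    baseline_m   -- distance between the two cameras, in metres
    """
    if disparity_px <= 0:
        # No measurable shift: the point is effectively infinitely far away
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 800 pixel focal length, cameras 0.25 m apart
for disparity in (40, 20, 10):
    print(disparity, depth_from_disparity(disparity, 800.0, 0.25))
```

Notice that halving the disparity doubles the depth. Nearby rocks shift a lot between the two views while distant ones barely move, which also means depth estimates get less precise the further away you look.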

Interestingly, the Opportunity rover, because it landed on a smoothly-surfaced plain, was forced to use its navcams (that were mounted on a mast) for its navigation. Looking down from a higher angle meant that detailed texture from the sand could be used for feature detection and matching. Its hazcams returned only the smooth surface of the sand. Smooth surfaces are not amenable to feature detection (because, for example, they don’t have corners or edges). The Spirit rover, on the other hand, because it landed in a crater full of rocks, could use its hazcams for stereoscopic navigation.

3. Visual Odometry

Finally, computer vision on Mars was used at certain times to estimate the rovers’ position and travelling distance. No GPS is available on Mars (yet) and standard means of estimating distance travelled, such as counting the number of wheel rotations, were deemed during desert testing on Earth to be vulnerable to significant error due to one thing: wheel slippage. So, NASA scientists decided to employ motion estimation via computer vision instead.

Motion estimation was performed using feature tracking in 3D across successive shots taken by the navcams. To obtain 3D information, once again depth maps were extracted from stereoscopic images. Distances to features could easily be calculated from these and then the rovers’ poses were estimated. On average, 80 features were tracked per frame and a photo was taken for visual odometry calculations every 75 cm (30 inches) of travel.
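Here is a small Python sketch of the core idea: if you know where the same static features were relative to the vehicle before and after a move, you can solve for the rotation and translation that links the two sets of positions. I have simplified things to 2D (a top-down view) and to perfectly matched, noise-free features. The rovers worked in 3D and had to cope with bad matches. The function name is my own.

```python
import math


def estimate_rigid_motion_2d(before, after):
    """Least-squares rotation and translation mapping `before` onto `after`.

    Both arguments are lists of (x, y) positions of the same tracked features.
    Returns (theta, (tx, ty)) such that after ~= rotate(before, theta) + t.
    """
    n = len(before)
    bcx = sum(p[0] for p in before) / n
    bcy = sum(p[1] for p in before) / n
    acx = sum(q[0] for q in after) / n
    acy = sum(q[1] for q in after) / n

    # Accumulate dot and cross products of the centred point pairs
    s_cos = s_sin = 0.0
    for (px, py), (qx, qy) in zip(before, after):
        px, py, qx, qy = px - bcx, py - bcy, qx - acx, qy - acy
        s_cos += px * qx + py * qy
        s_sin += px * qy - py * qx

    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # Translation is whatever is left after rotating the first centroid
    tx = acx - (c * bcx - s * bcy)
    ty = acy - (s * bcx + c * bcy)
    return theta, (tx, ty)


# Made-up feature positions, moved by a known rotation and translation
true_theta, true_t = 0.1, (0.5, -0.2)
before = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (2.0, 3.0)]
after = [
    (math.cos(true_theta) * x - math.sin(true_theta) * y + true_t[0],
     math.sin(true_theta) * x + math.cos(true_theta) * y + true_t[1])
    for x, y in before
]
print(estimate_rigid_motion_2d(before, after))
```

The transform recovered here describes how the features appear to move. The vehicle’s own motion is the inverse of it. Because the estimate comes from what the cameras see rather than from how far the wheels turned, spinning wheels do not fool it.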

Using computer vision to assist in motion estimation proved to be a wise decision because wheel slippage was quite severe on Mars. In fact, at one time the rover got stuck in sand and the wheels rotated in place for the equivalent of 50 m (164 feet) of driving distance. Without computer vision the rovers’ estimated positions would have been severely inaccurate.

There was another instance where this was strikingly the case. At one time the Opportunity rover was operating on a 17-20 degree slope in a crater and was attempting to maneuver around a large rock. It had been trying to escape the rock for several days and had slid down the crater many times in the process. The image below shows the rover’s estimated trajectory (from a top-down view) using just wheel odometry (left), and the rover’s corrected trajectory (right) as assisted by computer vision calculations. The large rock is represented by the black ellipse. The corrected trajectory proved to be the more accurate estimation.



In this post I presented the three ways computer vision was used by the Spirit and Opportunity rovers during their MER missions on Mars. These three ways were:

  1. Estimating horizontal speeds during their descent onto the Red Planet to ensure the rovers had a smooth landing.
  2. Extracting 3D information of their surroundings using stereoscopic imagery to assist in navigation and obstacle detection.
  3. Using stereoscopic imagery once again but this time to provide motion and pose estimation on difficult terrain.

In this way, computer vision gave the rovers a significant amount of autonomy (much, much more autonomy than their predecessor, the Sojourner rover) that ultimately gave the rovers a safe landing and allowed the robots to traverse up to 370 m (1213 feet) per day. In fact, the Opportunity rover is still active on Mars now. This means that the computer vision techniques described in this post are churning away as we speak. If that isn’t neat, I don’t know what is!



Machines That Can Read Our Minds

Who here watches Black Mirror? Oh, I love that show (maybe except for Season 4 – only one episode there was any good, IMHO). For those that haven’t seen it yet, Black Mirror basically tries to extrapolate what society would look like if technological advances were to continue on their current trajectory.

A few of the technologies presented in that show are science fiction on steroids. But I think that many are not and that it’s only a matter of time before we reach that (dark?) high-tech reality.

A reader of this blog approached me recently with a (preprint) academic publication that sounded pretty much like it had been taken out of that series. It was a paper purporting that scientists can now reconstruct our visual thoughts by analysing our brain scans.

“No way. Too good to be true”, was my first reaction. There has to be a catch. So, I went off to investigate further.

And boy was my mind blown. Allow me in this post to present to you that publication out of Kyoto, Japan. Try telling me at the end that you’re not impressed either!

(Note: Although this post is technically not directly related to computer vision, I’m still including it in my blog because it deals with trying to understand how our brains work when it comes to visual stimuli. This is important because it has been shown many times in the history of computer vision, and of AI in general, that biologically inspired solutions (i.e. solutions that mimic the way we behave) can obtain really good if not superior results to other solutions. The prime example of this, of course, is neural networks.)

Machines that can read our minds

The paper in question is entitled “Deep image reconstruction from human brain activity” (Shen et al., bioRxiv, 2017). Although it has been published in preprint, meaning that it hasn’t been peer-reviewed and technically doesn’t hold too much academic weight, the site that it was published on is of repute, as are the institutes that co-authored the work (ATR Computational Neuroscience Laboratories and Kyoto University, the latter being one of Asia’s highest ranked universities). Hence, I’m confident in the credibility of their research.

In a nutshell, the paper is claiming to have developed a way to reconstruct images that people are looking at or thinking about by solely analysing functional magnetic resonance imaging (fMRI). fMRI machines measure brain activity by detecting changes in blood flow through regions of the brain – generally speaking, the more blood flow, the more brain activity. Here’s an example fMRI image showing increased brain activity in orange/yellow:

(image source)

Let’s look at some details of their paper.

The first step was to utilise machine learning to examine what the relationship is between brain activity as detected by fMRI scans and the images that were being shown to test subjects. What’s interesting is that there was a focus on mapping hierarchical features between the fMRI images and real images.

Hierarchical features are a way to deconstruct images into ascending levels of detail. For example, to describe a car you could start by describing edges, then working up to curves, and then particular shapes like circles, and then objects like wheels; and then finally you’d get four wheels, a chassis and some windows.

Deconstructing images by looking at hierarchical features (edges, curves, shapes, and so on) is also the way neural networks and our brains work with respect to visual stimuli. So, this plan of action to focus on hierarchical features in fMRI images seems intuitive.

The hierarchical features extracted from fMRI images were then used to replace the features of a deep neural network (DNN). Next, an iterative process was started where base images (produced by a deep generator network) were fed into the DNN and at each iteration individual pixels of the image were transformed until the pixel values matched the features of the DNN (which were extracted from fMRI images) to a certain threshold (error) level.
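The iterative part of this process can be illustrated with a toy example in Python. Please treat it as a cartoon only: a tiny fixed linear map stands in for the deep neural network, an all-zero image stands in for the base image from the generator network, and plain gradient descent nudges the pixels until the features match the target to within a threshold. None of the names, sizes or numbers come from the paper.

```python
def extract_features(weights, pixels):
    """Toy 'network': each feature is a weighted sum of the pixel values."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def reconstruct(weights, target_features, n_pixels, lr=0.1, tol=1e-6, max_iter=10000):
    """Adjust pixel values until their features match the target features."""
    pixels = [0.0] * n_pixels  # stand-in for the base image
    for _ in range(max_iter):
        features = extract_features(weights, pixels)
        errors = [f - t for f, t in zip(features, target_features)]
        loss = sum(e * e for e in errors)
        if loss < tol:
            break  # features match to within the chosen threshold
        # Gradient of the squared error with respect to each pixel
        for j in range(n_pixels):
            grad = 2.0 * sum(e * row[j] for e, row in zip(errors, weights))
            pixels[j] -= lr * grad
    return pixels


# Hypothetical setup: 3 'pixels' and 2 features
weights = [[1.0, 0.0, 1.0],
           [0.0, 1.0, 1.0]]
target = [0.5, 0.8]  # pretend these features were decoded from a brain scan
image = reconstruct(weights, target, n_pixels=3)
print(image, extract_features(weights, image))
```

One thing this toy makes obvious is that matching the features does not pin down a unique image, since many different pixel combinations can produce the same features. In the study, as described above, the starting images came from a deep generator network rather than from a blank canvas.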

It’s a complicated process, I know. I’ve tried to summarise it as best as I could here! To understand it more, you’ll need to go back to the original publication and then follow a trail of referenced articles. The image below shows this process – not that it helps much 🙂 But I think the most important part of this study is the results – those are coming up in the next section.

Deep image reconstruction (source: original publication)

Test subjects were shown various images from three different classes: natural colour images (sampled from ImageNet), artificial geometrical shapes (in 8 colours), and black and white alphabetical letters. Scans of brain activity were conducted during the viewing sessions.

Interestingly, test subjects were also asked to recollect images they had been previously shown and brain scans during this task were also taken. The idea was to try to see if a machine could truly discern what a person was actually thinking rather than just viewing.

Results from the Study

This is where the fun begins!

But before we dive into the results, I need to mention that this is not the first time fMRI imaging has been used to attempt to reconstruct images. Similar projects (referenced in the publication we’re discussing) go back to 2011 and 2013. The novelty of this study, however, is to work with hierarchical features rather than working with fMRI images directly.

Take a look at this video showing past results from previous studies. The images are monochromatic and of minimal resolution:

Now let’s take a look at some results from this particular study. (All images are taken from the original publication.)

How about a reconstructed image of a swan? Remember, this reconstructed image (right) is generated from someone’s brain scan (taken while the test subject was viewing the original image):

Example result #1: the test subject was shown the image on the left; the image on the right is the reconstructed image from brain scans.

You must be impressed with that!

Here’s a reconstructed picture of a duck:

Example result #2: the test subject was shown the image on the left; the image on the right is the reconstructed image from brain scans.

For a presentation of results, take a look at this video released by the authors of the paper:

What about those letters I mentioned that were also shown to test subjects? Check this out:

Example results #3: the test subject was shown the letters in the top row; the bottom row shows the reconstructed images from brain scans.

Scary how good the reconstructions are, isn’t it? Imagine being able to tell what a person is reading at a given time!? I’ve already written a post about non-invasive lie detection from thermal imaging. If we can just work out how to get fMRI images like that (i.e. without engaging the subject), we’ll be able to tell what a person is looking at without them even knowing about it. Black Mirror stuff!

Results from the artificial geometric shapes images can be viewed here – also impressive, of course.

And how about reconstructed images from scans performed when people were thinking about the images they had been previously shown? Well, results here aren’t as impressive (see below). The authors concede this also. But hey! One step at a time, right?

Example results #4: the test subject was told to think about the images in the top row; the bottom row shows the attempted reconstructions of these images from brain scans.


In this post I presented a preprint publication in which it is purported that scientists were able to reconstruct visual thoughts by analysing brain scans. fMRI images were analysed for hierarchical features that were then used in a deep neural network. Base images were fed through this network and iteratively transformed until the features of each image matched the features in the neural network. Results from this study were presented in this post and I think we can all agree that they were quite impressive. This is the future, folks! Black Mirror stuff perhaps.

The uses for good of this technology are numerous – communicating with people in comas is one such use case. Of course, the abuse that could come from this is frightening, too. Issues with privacy (ah, that pesky perennial question in computer science!) spring to mind immediately.

But as you may have gathered from all my posts here, I look forward to what the future holds for us, especially with respect to artificial intelligence. And likewise with this technology. I can’t wait to see it progress.



Gait Recognition – Another Form of Biometric Identification

I was watching The Punisher on Netflix last week and there was a scene (no spoilers, promise) in which someone was recognised from CCTV footage by the way they were walking. “Surely, that’s another example of Hollywood BS”, I thought to myself – “there’s no way that’s even remotely possible”. So, I spent the last week researching this – and to my surprise it turns out that this is not a load of garbage after all! Gait recognition is another legitimate form of biometric identification/verification.

In this post I’m going to present to you my past week’s research into gait recognition: what it is, what it typically entails, and what the current state-of-the-art is in this field. Let me just say that what scientists are able to do now in this respect surprised me immensely – I’m sure it’ll surprise you too!

Gait Recognition

In a nutshell, gait recognition aims to identify individuals by the way they walk. It turns out that our walking movements are quite unique, a little like our fingerprints and irises. Who knew, right!? Hence, there has been a lot of research in this field in the past two decades.

There are significant advantages to this form of identity verification. It can be performed from a distance (e.g. using CCTV footage), it is non-invasive (i.e. the person may not even know that they are being analysed), and it does not necessarily require high-resolution images to obtain good results.

The Framework for Automatic Gait Recognition

Trawling through the literature on the subject, I found that scientists have used various ways to capture people’s movements for analysis, e.g. using 3D depth sensors or even using pressure sensors on the floor. I want to focus on the use case shown in The Punisher where recognition was performed from a single, stationary security camera. I want to do this simply because CCTV footage is so ubiquitous today and because pure and neat Computer Vision techniques can be used on such footage.

In this context, gait recognition algorithms are typically composed of three steps:

  1. Pre-processing to extract silhouettes
  2. Feature extraction
  3. Classification

Let’s take a look at these steps individually.

1. Silhouette extraction

Silhouette extraction of subjects is generally performed by subtracting the background image from each frame. Once the background is subtracted, you’re left with foreground objects. The pixels associated with these objects can be coloured white and then extracted.

Background subtraction is a heavily studied field and is by no means a solved problem in Computer Vision. OpenCV provides a few interesting implementations of background subtraction. For example, a background can be learned over time (i.e. you don’t have to manually provide it). Some implementations also allow for things like illumination changes (especially useful for outdoor scenes) and some can also deal with shadows. Which technique is used to subtract the background from frames is irrelevant as long as reasonable accuracy is obtained.

Example of silhouette extraction
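To make the idea concrete, here is a minimal sketch of the simplest possible background subtraction: take the absolute difference between a frame and a known background image and threshold it. The tiny greyscale "images", the function name and the threshold of 30 are all made up for illustration. In practice you would more likely reach for one of OpenCV's learned background models (e.g. `cv2.createBackgroundSubtractorMOG2`) rather than a fixed background like this.

```python
def extract_silhouette(frame, background, threshold=30):
    """Return a binary mask: 255 where the frame differs from the background, else 0."""
    return [
        [255 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 4x5 greyscale images (values 0-255); real frames would come from video.
background = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]
frame = [
    [10, 12, 200, 11, 10],
    [10, 180, 210, 190, 10],
    [10, 11, 220, 12, 10],
    [10, 170, 10, 175, 10],
]

silhouette = extract_silhouette(frame, background)

# Print the mask as a crude picture: '#' is foreground, '.' is background.
for row in silhouette:
    print("".join("#" if value else "." for value in row))
```

Small differences (like the 12 versus 10 in the first row) fall under the threshold and are treated as noise, which is why the threshold matters so much in real footage.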

2. Feature extraction

Various features can be extracted once we have the silhouettes of our subjects. Typically, a single gait period (a gait cycle) is first detected, i.e. the stretch of video in which you take one step with each foot. This is useful because your gait pattern repeats itself, so there’s no need to analyse anything more than one cycle.
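How the cycle is detected varies from paper to paper. One common trick, which I am assuming here purely for illustration, is to track the width of the silhouette in a side-on view: it peaks once per step (when the legs are furthest apart), so a full gait cycle spans two peaks. The sketch below estimates the step period from a synthetic width signal using autocorrelation. The signal and the function names are hypothetical.

```python
import math


def autocorrelation(signal, lag):
    """Unnormalised autocorrelation of the mean-centred signal at the given lag."""
    mean = sum(signal) / len(signal)
    centred = [s - mean for s in signal]
    return sum(centred[i] * centred[i + lag] for i in range(len(centred) - lag))


def estimate_step_period(widths, min_lag=2):
    """Return the lag (in frames) at which the width signal best matches itself."""
    max_lag = len(widths) // 2
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(widths, lag))


# Synthetic silhouette widths: the legs spread and close once every 10 frames.
widths = [30 + 10 * math.cos(2 * math.pi * t / 10) for t in range(60)]

step_period = estimate_step_period(widths)
gait_cycle = 2 * step_period  # one step with each foot

print(step_period, gait_cycle)
```

With real footage the signal is much noisier, so some smoothing before the autocorrelation would probably be needed.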

Features from this gait cycle are then extracted. In this respect, algorithms can be divided into two groups: model-based and model-free.

Model-based methods of gait recognition take your gait period and attempt to build a model of your movements. These models, for example, can be constructed by representing the person as a stick-figure skeleton with joints or as being composed of cylinders. Then, numerous parameters are calculated to describe the model. For example, the method proposed in this publication from 2001 calculates distance between the head and feet, the head and pelvis, the feet and pelvis, and the step length of a subject to describe a simple model. Another model is depicted in the image below:

An example of a biped model with 5 different parameters as proposed in this solution from 2012
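As a rough sketch of what "calculating parameters" can look like, the snippet below computes the four distances mentioned above from a handful of joint positions. The pixel coordinates are invented, and measuring to the midpoint between the feet is my own simplification, not necessarily how the 2001 publication defines its distances.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def body_parameters(head, pelvis, left_foot, right_foot):
    """Describe one frame of a walking person with four simple distances."""
    feet_mid = (
        (left_foot[0] + right_foot[0]) / 2,
        (left_foot[1] + right_foot[1]) / 2,
    )
    return {
        "head_to_feet": distance(head, feet_mid),
        "head_to_pelvis": distance(head, pelvis),
        "pelvis_to_feet": distance(pelvis, feet_mid),
        "step_length": distance(left_foot, right_foot),
    }


# Hypothetical pixel coordinates (x, y) read off a single silhouette frame.
params = body_parameters(
    head=(50, 10), pelvis=(50, 90), left_foot=(35, 170), right_foot=(65, 170)
)
print(params)
```

Collecting these values over a whole gait cycle gives a small feature vector per person, which is what gets passed on to the classification step.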

Model-free methods work on extracted features directly. Here, undoubtedly the most interesting and most widely used feature extracted from silhouettes is that of the Gait Energy Image (GEI). It was first proposed in 2006 in a paper entitled “Individual Recognition Using Gait Energy Image” (IEEE Transactions on Pattern Analysis and Machine Intelligence 28, no. 2 (2006): 316-322).

Note: the Pattern Analysis and Machine Intelligence (PAMI) journal is one of the best in the world in the field. Publishing there is a feat worthy of praise. 

The GEI is used in almost all of the top gait recognition algorithms because it is (perhaps surprisingly) intuitive, not too prone to noise, and simple to grasp and implement. To calculate it, frames from one gait cycle are superimposed on top of each other to give an “average” image of your gait. This calculation is depicted in the image below where the GEI for two people is shown in the last column.

The GEI can be regarded as a unique signature of your gait. And although it was first proposed way back in 2006, it is still widely used in state-of-the-art solutions today.

Examples of two calculated GEIs for two different people shown in the far right column. (image taken from the original publication)
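The averaging itself is only a few lines of code. Here is a minimal sketch that assumes the silhouettes from one gait cycle have already been cropped, centred and scaled to the same size (that normalisation is not shown). The 3x3 "silhouettes" are toy data.

```python
def gait_energy_image(silhouettes):
    """Average equally sized binary (0/1) silhouettes pixel by pixel."""
    n = len(silhouettes)
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    return [
        [sum(s[y][x] for s in silhouettes) / n for x in range(width)]
        for y in range(height)
    ]


# Three toy 3x3 silhouettes from one gait cycle (1 = person, 0 = background).
cycle = [
    [[0, 1, 0], [0, 1, 0], [1, 0, 1]],
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 0], [1, 0, 1]],
]

gei = gait_energy_image(cycle)
for row in gei:
    print([round(value, 2) for value in row])
```

Pixels that are always part of the body (the torso column here) end up at 1.0, while pixels that are only sometimes covered (the swinging legs in the bottom row) get in-between values. That mix of static shape and motion is what makes the GEI a useful signature.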

3. Classification

Once step 2 is complete, identification of subjects can take place. Standard classification techniques can be used here, such as k-nearest neighbour (KNN) and the support vector machine (SVM). These are common techniques that are used whenever one is dealing with features. They are not constrained to the use case of computer vision. Indeed, any other field that uses features to describe its data will also utilise these techniques to classify/identify that data. Hence, I will not dwell on this step any longer. I will, however, refer you to a state-of-the-art review of gait recognition from 2010 that lists some more of these common classification techniques.
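For completeness, here is a bare-bones sketch of KNN over feature vectors. The three-number vectors stand in for whatever features step 2 produced (a flattened GEI, say), and the names and values in the gallery are invented for the example.

```python
import math
from collections import Counter


def knn_predict(query, gallery, k=3):
    """Label a query vector by majority vote among its k nearest gallery vectors.

    gallery is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(gallery, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical gallery: three previously recorded walks for each of two people.
gallery = [
    ([0.10, 0.80, 0.30], "alice"),
    ([0.12, 0.78, 0.33], "alice"),
    ([0.11, 0.82, 0.29], "alice"),
    ([0.70, 0.20, 0.90], "bob"),
    ([0.68, 0.22, 0.88], "bob"),
    ([0.72, 0.19, 0.91], "bob"),
]

predicted = knn_predict([0.69, 0.21, 0.90], gallery)
print(predicted)
```

A real system would use a library implementation (and far longer feature vectors), but the principle is the same: a new walk is assigned to whoever's stored walks it sits closest to.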

So, how good is gait recognition then?

We’ve briefly taken a look at how gait recognition algorithms work. Let’s now take a peek at how good they are at recognising people.

We’ll first turn to some recent news. Only 2 months ago (October, 2017) Chinese researchers announced that they have developed the best gait recognition algorithm to date. They claim that their system works with the subject being up to 50 metres away and that detection times have been reduced to just 200 milliseconds. If you read the article, you will notice that no data/results are presented so we can’t really investigate their claims. We have to turn to academia for hard evidence of what we’re seeking.

“GaitGAN: Invariant Gait Feature Extraction Using Generative Adversarial Networks” (Yu et al., IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 30-37. 2017) is the latest top publication on this topic. I won’t go through their proposed algorithm (it is model-free and uses the GEI). I will just present their results, which are in fact quite impressive.

To test their algorithm, the authors used the CASIA-B dataset. This is one of the largest publicly available datasets for gait recognition. It contains video footage of 124 subjects walking across a room, captured at various angles ranging from front-on and side-on to top-down. Not only this, but the walks are repeated by the same people while wearing a coat and then while carrying a backpack, which adds further difficulty to gait recognition. And the low resolution of the videos (320×240 – a decent resolution in 2005 when the dataset was released) makes them ideal for testing gait recognition algorithms, given that CCTV footage is generally of low quality too.

Three example screenshots from the dataset are shown below. The frames are of the same person with a side-on view. The second and third images show the subject wearing a coat and carrying a bag, respectively.

Example screenshots from the CASIA-B dataset of the same person walking.

Recognition rates with front-on views with no bag or coat linger around 20%-40% (depending on the height of the camera). Rates then gradually increase as the angle nears the side-on view (that gives a clear silhouette). At the side-on view with no bag or coat, recognition rates reach an astounding 98.75%! Impressive and surprising.

When it comes to analysing the clips with the people carrying a bag and wearing a coat, results are summarised in one small table that shows only a few indicative averages. Here, recognition rates obviously drop but the top rates (obtained with side-on views) persist at around the 60% mark.

What can be deduced from these results is that if the camera distance and angle and other parameters are ideal (e.g. the subject is not wearing/carrying anything concealing), gait recognition works amazingly well for a reasonably sized subset of people. But once ideal conditions start to change, accuracy gradually decreases to (probably) inadequate levels.

And I will also mention (perhaps something you may have already gathered) that these algorithms only work if the subject is acting normally. That is, the algorithms work if the subject is not changing the way they usually walk, for example by walking faster (maybe as a result of stress) or by consciously trying to forestall gait recognition algorithms (like we saw in The Punisher!).

However, an accuracy rate of 98.75% with side-on views shows great potential for this form of identification and because of this, I am certain that more and more research will be devoted to this field. In this respect, I will keep you posted if I find anything new and interesting on this topic in the future!


Gait recognition is another form of biometric identification – a little like iris scanning and fingerprints. Interesting computer vision techniques are utilised on single-camera footage to obtain recognition rates that sometimes approach 99%. These results depend on such things as camera angles and whether subjects are wearing concealing clothes or not. But much like other recognition techniques (e.g. face recognition), this is undoubtedly a field that will be further researched and improved in the future. Watch this space.

