
The Top Image Datasets and Their Challenges

In previous posts of mine I have discussed how image datasets have become crucial to the deep learning (DL) boom in computer vision over the past few years. In deep learning, neural networks are left to (more or less) autonomously discover the underlying patterns in classes of images (e.g. that bicycles are composed of two wheels, a handlebar, and a seat). Since images are visual representations of our reality, they contain the inherent complex intricacies of our world. Hence, to train good DL models that are capable of extracting the underlying patterns in classes of images, deep learning needs lots of data, i.e. big data. And it’s crucial that the big data that feeds the deep learning machine be of top quality.

In light of Google’s recent announcement of an update to its image dataset, as well as its new challenge, in this post I would like to present to you the top 3 image datasets currently being used by the computer vision community, along with their associated challenges:

  1. ImageNet and ILSVRC
  2. Open Images and the Open Images Challenge
  3. COCO Dataset and the four COCO challenges of 2018

I wish to talk about the challenges associated with these datasets because challenges are a great way for researchers to compete against each other and, in the process, push the boundaries of computer vision further each year!

ImageNet

This is the most famous image dataset by a country mile. But confusion often accompanies what ImageNet actually is because the name is frequently used to describe two things: the ImageNet project itself and its visual recognition challenge.

The former is a project whose aim is to label and categorise images according to the WordNet hierarchy. WordNet is an open-source lexical database in which words are organised hierarchically into sets of synonyms. For example, words like “dog” and “cat” can be found in the following knowledge structure:

WordNet-synset-graph
An example of a WordNet synset graph (image taken from here)

Each node in the hierarchy is called a “synonym set” or “synset”. This is a great way to categorise words because whatever noun you may have, you can easily extract its context (e.g. that a dog is a carnivore) – something very useful for artificial intelligence.
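To make this idea a bit more concrete, here is a minimal sketch of how you could walk up the synset hierarchy for “dog” yourself, assuming you have the NLTK package installed and its WordNet corpus downloaded (via nltk.download('wordnet')):

# A minimal sketch using NLTK's WordNet interface.
# Assumes: pip install nltk, plus nltk.download('wordnet') has been run.
from nltk.corpus import wordnet as wn

# take the first noun synset for "dog"
dog = wn.synsets('dog')[0]          # Synset('dog.n.01')

# walk up the hypernym ("is a") hierarchy towards the root
current = dog
while current.hypernyms():
    current = current.hypernyms()[0]
    print(current.name())           # e.g. canine.n.02, carnivore.n.01, ..., entity.n.01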

The idea with the ImageNet project, then, is to have 1000+ images for each and every synset in order to also have a visual hierarchy to accompany WordNet. Currently there are over 14 million images in ImageNet for nearly 22,000 synsets (WordNet has ~100,000 synsets). Over 1 million images also have hand-annotated bounding boxes around the dominant object in the image.

ImageNet-kit-fox
Example image of a kit fox from ImageNet showing hand-annotated bounding boxes

You can explore the ImageNet and WordNet dataset interactively here. I highly recommend you do this!

Note: by default only URLs to images in ImageNet are provided because ImageNet does not own the copyright to them. However, a download link can be obtained to the entire dataset if certain terms and conditions are accepted (e.g. that the images will be used for non-commercial research). 

Having said this, when the term “ImageNet” is used in CV literature, it usually refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which is an annual competition for object detection and image classification. This competition is very famous. In fact, the DL revolution of the 2010s is widely considered to have originated from this challenge, after a deep convolutional neural network blitzed the competition in 2012.

The motivation behind ILSVRC, as the website says, is:

… to allow researchers to compare progress in detection across a wider variety of objects — taking advantage of the quite expensive labeling effort. Another motivation is to measure the progress of computer vision for large scale image indexing for retrieval and annotation.

The ILSVRC competition has its own image dataset that is actually a subset of the ImageNet dataset. This meticulously hand-annotated dataset has 1,000 object categories (the full list of these synsets can be found here) spread over ~1.2 million images. Half of these images also have bounding boxes around the class category object.

The ILSVRC dataset is most frequently used to train object classification neural network architectures such as VGG16, InceptionV3, ResNet, etc., that are publicly available for use. If you ever download one of these pre-trained networks (e.g. InceptionV3) and it says that it can detect 1000 different classes of objects, then it most certainly was trained on this dataset.
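As a quick illustration, here is a minimal sketch of loading one of these ImageNet pre-trained networks and classifying a photo into one of those 1,000 classes. It assumes TensorFlow/Keras is installed and uses a placeholder image path:

# Minimal sketch: classify an image with an ImageNet pre-trained network.
# Assumes TensorFlow 2.x is installed; 'my_photo.jpg' is a placeholder path.
import numpy as np
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = InceptionV3(weights='imagenet')        # weights trained on ILSVRC data

img = image.load_img('my_photo.jpg', target_size=(299, 299))  # InceptionV3 input size
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])     # top 3 of the 1,000 ILSVRC classes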

Google’s Open Images

Google is a new player in the field of datasets but you know that when Google does something it will do it with a bang. And it has not disappointed here either.

Open Images is a new dataset first released in 2016 that contains ~9 million images – which is fewer than ImageNet. What makes it stand out is that these images are mostly of complex scenes that span thousands of classes of objects. Moreover, ~2 million of these images are hand-annotated with bounding boxes making Open Images by far the largest existing dataset with object location annotations. In this subset of images, there are ~15.4 million bounding boxes of 600 classes of object. These objects are also part of a hierarchy (see here for a nice image of this hierarchy) but one that is nowhere near as complex as WordNet.

open-images-eg
Open Images example image with bounding box annotation

As of a few months ago, there is also a challenge associated with Open Images called the “Open Images Challenge”. It is an object detection challenge and, what’s more interesting, there is also a visual relationship detection challenge (e.g. “woman playing a guitar” rather than just “guitar” and “woman”). The inaugural challenge will be held at this year’s European Conference on Computer Vision. It looks like this will be a super interesting event considering the complexity of the images in the dataset and, as a result, I foresee this challenge becoming the de facto object detection challenge in the near future. I am certainly looking forward to seeing the results of the challenge, which will be posted around the time of the conference (September 2018).

Microsoft’s COCO Dataset

Microsoft is also in this game with its Common Objects in Context (COCO) dataset. Containing ~200K images, it’s relatively small, but what makes it stand out are the challenges associated with the additional features it provides for each image, for example:

  • object segmentation information rather than just bounding boxes of objects (see image below)
  • five textual captions per image such as “the a380 air bus ascends into the clouds” and “a plane flying through a cloudy blue sky”.

The first of these points is worth providing an example image of:

coco-eg

Notice how each object is segmented rather than outlined by a bounding box, as is the case with the ImageNet and Open Images examples? This object segmentation feature of the dataset makes for very interesting challenges because segmenting an object like this is many times more difficult than just drawing a rectangular box around it.

COCO challenges are also held annually. But each year’s challenge is slightly different. This year the challenge has four tracks:

  1. Object segmentation (as in the example image above)
  2. Panoptic segmentation task, which requires object and background scene segmentation, i.e. a task to segment the entire image rather than just the dominant objects in an image:
    panoptic-example
  3. Keypoint detection task, which involves simultaneously detecting people and localising their keypoints:
    keypoints-task-example
  4. DensePose task, which involves simultaneously detecting people and localising their dense keypoints (i.e. mapping all human pixels to a 3D surface of the human body):
    densepose-task-example

Very interesting, isn’t it? There is always something engaging taking place in the world of computer vision!


The Early History of Computer Vision

As I’ve discussed in previous posts of mine, computer vision is growing faster than ever. In 2016, investments into US-based computer vision companies more than tripled compared with 2014 – from $100 million to $300 million. And it looks like this upward trend is going to continue worldwide, especially if you consider all the major acquisitions that were made in 2017 in the field, the biggest one being Intel’s purchase in March of Mobileye (a company specialising in computer vision-based collision prevention systems for autonomous cars) for a whopping $15 billion.

But today I want to take a look back rather than forward. I want to devote some time to present the early historical milestones that have led us to where we are now in computer vision.

This post, therefore, will focus on seminal developments in computer vision between the 60s and early 80s.

Larry Roberts – The Father of Computer Vision

Let’s start with Lawrence Roberts. Ever heard of him? He calls himself the founder of the Internet. The case for giving him this title is strong considering that he was instrumental in the design and development of the ARPANET, which was the technical foundation of what you are surfing on now.

What is not well known is that he is also dubbed the father of computer vision in our community. In 1963 he published “Machine Perception Of Three-Dimensional Solids”, which started it all for us. There he discusses extracting 3D information about solid objects from 2D photographs of line drawings. He mentions things such as camera transformations, perspective effects, and “the rules and assumptions of depth perception” – things that we discuss to this very day.

Just take a look at this diagram from his original publication (which can be found here – and a more readable form can be found here):

Camera-transformation-roberts-PhD

Open up any book on image processing and you will see a similar diagram in there discussing the relationship between a camera and an object’s projection on a 2D plane.

Funnily enough, Lawrence’s Wikipedia page makes no mention of his work in computer vision, which is all the more surprising considering that the publication I mentioned above was his PhD thesis! Crazy, isn’t it? If I find the time, I’ll go over and edit that article to give him the additional credit that he deserves.

The Summer Vision Project

Lawrence Roberts’ thesis was about analysing line drawings rather than images taken of the real world. Work on line drawings was to continue for a long time, especially after the following important incident of 1966.

You’ve probably all heard the stories from the 50s and 60s of scientists predicting a bright future within a generation for artificial intelligence. AI became an academic discipline and millions were pumped into research with the intention of developing a machine as intelligent as a human being within 25 years. But it didn’t take long before people realised just how hard creating a “thinking” machine was going to be.

Well, computer vision has its own place in this ambitious time of AI as well. People then thought that constructing a machine to mimic the human visual system was going to be an easy task on the road to finally building a robot with human-like intelligent behaviour.

In 1966, Seymour Papert organised “The Summer Vision Project” at MIT. He assigned this project to Gerald Sussman, who was to co-ordinate a small group of students to work on background/foreground segmentation of real-world images with a final goal of extracting non-overlapping objects from them.

Only in the past decade or so have we been able to obtain good results in a task such as this. So, those poor students really did not know what they were in for. (Note: I write about why image processing is such a hard task in another post of mine.) I couldn’t find any information on exactly how much they were able to achieve over that summer but this will definitely be something I will try to find out if I ever get to visit MIT in the United States.

Luckily enough, the original memo of this project is still available to us – which is quite neat. It’s definitely a piece of history for us computer vision scientists. Take a look at the abstract (summary) of the project from the 1966 memo (the full version can be found here):

summer-vision-project-abstract

Continued Work with Line Drawings

In the 1970s work continued with line drawings because real-world images were just too hard to handle at the time. To this end, people were regularly looking into extracting 3D information about blocks from 2D images.

Line labelling was an interesting concept being looked at in this respect. The idea was to try to discern a shape in a line drawing by first attempting to annotate all the lines it was composed of accordingly. Line labels would include convex, concave, and occluded (boundary) lines. An example of a result from a line labelling algorithm can be seen below:

line-labelling-eg
Line labelling example with convex (+), concave (-), and occluded (<–) labels. (Image taken from here)

Two important people in the field of line labelling were David Huffman (“Impossible objects as nonsense sentences”. Machine Intelligence, 8:475-492, 1971) and Max Clowes (“On seeing things”. Artificial Intelligence, 2:79-116, 1971) who both published their line labelling algorithms independently in 1971.

In the area of line labelling, interesting problems such as the one below were also looked at:

necker-illusion-marr

The image above was taken from a seminal book written by David Marr at MIT entitled “Vision: A computational investigation into the human representation and processing of visual information”. It was finished around 1979 but posthumously published in 1982. In this book Marr proposes an important framework for image understanding that is used to this very day: the bottom-up approach. The bottom-up approach, as Marr suggests, uses low-level image processing algorithms as stepping-stones towards attaining high-level information.

(Now, for clarification, when we say “low-level” image processing, we mean tasks such as edge detection, corner detection, and motion detection (i.e. optical flow) that don’t directly give us any high-level information such as scene understanding and object detection/recognition.)

From that moment on, “low-level” image processing was given a prominent place in computer vision. It’s important to also note that Marr’s bottom-up framework is central to today’s deep learning systems (more on this in a future post).

Computer Vision Gathers Speed

So, with the bottom-up approach to image understanding, important advances in low-level image processing began to be made. For example, the famous Lucas-Kanade optical flow algorithm, first published in 1981 (original paper available here), was developed. It is still so prominent today that it is a standard optical flow algorithm included in the OpenCV library. Likewise, the Canny edge detector, first published in 1986, is again widely used today and is also available in the OpenCV library.
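Both algorithms are essentially one function call away in OpenCV. Here is a minimal sketch (the video path is a placeholder) that runs the Canny edge detector on a frame and tracks corner features between two frames with OpenCV’s pyramidal Lucas-Kanade implementation:

import cv2
import numpy as np

cap = cv2.VideoCapture('video.mp4')     # placeholder path to any video file
ok, prev = cap.read()
ok, curr = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)

# Canny edge detection (1986) on a single frame
edges = cv2.Canny(prev_gray, 100, 200)

# Lucas-Kanade optical flow (1981): track corner features between frames
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)
p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)

print(f'{int(status.sum())} of {len(p0)} features tracked to the next frame')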

The bottom line is, computer vision started to really gather speed in the late 80s. Mathematics and statistics began playing a more and more significant role and the increase in speed and memory capacity of machines helped things immensely also. Many more seminal algorithms followed this upward trend, including some famous face detection algorithms. But I won’t go into this here because I would like to talk about breakthrough CV algorithms in a future post.

Summary

In this post I looked at the early history of computer vision. I mentioned Lawrence Roberts’ PhD on things such as camera transformations, perspective effects, and “the rules and assumptions of depth perception”. He got the ball rolling for us all and is commonly regarded as the father of computer vision. The Summer Vision Project of 1966 was also an important event that taught us that computer vision, along with AI in general, is not an easy task at all. People, therefore, focused on line drawings until the 80s when Marr published his idea for a bottom-up framework for image understanding. Low-level image processing took off, spurred on by advancements in the speed and memory capacity of machines and a stronger mathematical and statistical rigour. The late 80s and onwards saw tremendous developments in CV algorithms but I will talk more about this in a future post.


Watch Football/Soccer on Your Tabletop using Virtual Reality

Well, it’s World Cup season now, isn’t it? Australia got eliminated this week so I’m feeling a bit depressed at the moment. However, seeing teams like Germany also not make it past the group stage makes me feel a little better (sorry German people reading this post :P).

But since the World Cup is on, it is only fitting that I write about something from the field of Computer Vision that is related to football. So, in this post I’m going to present to you quite an amazing paper I stumbled upon entitled “Soccer on Your Tabletop” (Rematas et al., CVPR 2018, pp. 4738-4747).

(Recall that CVPR is a world-class academic conference on computer vision. Anything published there is always worth reading.)

The goal of the paper is to present an algorithm that reconstructs a 3D representation of a football game from a single 2D video – just like you would find on YouTube. The 3D video could then be projected onto a tabletop (like a hologram) and viewed by everyone in the room from multiple angles. An interesting concept!

Usually something like this is obtained by having multiple cameras set up that can then work together to provide 3D information of the football pitch. But the idea here is to get all information from a single 2D video.

Here’s a clip of the entire project that has been released by the authors. Watch it!

Ah, you just have to love computer vision!

Let’s take a look at the (slightly simplified here) steps involved in the 2D -> 3D reconstruction process:

  1. Input frame: obtained from any 2D video of a football game (captured from stationary cameras).
  2. Camera calibration: this is performed using the football pitch line markings as guidance. The line markings provide excellent reference points to obtain the 2D plane of the football pitch from which players’ measurements can be deduced.
  3. Player detection, pose estimation, and tracking: this is done using already existing techniques. Specifically, this paper from CVPR 2015 is referenced for detecting bounding boxes around players (top left image below), this paper from CVPR 2016 for estimating poses (top right image below), and a simple player tracking algorithm is used in which bounding boxes from adjacent frames are compared and matched according to the closest 2D Euclidean distance (bottom left image; a rough sketch of this matching idea is given after this list).

    football-on-tabletop-processing-steps
    (image taken from original publication)
  4. Player segmentation: the idea here is to highlight the entire contour of the player after performing the above steps (see bottom right image above). This is performed by taking each pixel and analysing its neighbouring pixels for similarities in colour and edge information until each player is extracted. (Several more steps are performed to fine-tune this process but I’ll skip over these).
  5. Player depth estimation and mesh generation. This is the tricky part. What the authors did is quite intuitive. To constrain the solution space to just football related poses, body shapes, and clothing, the authors created a training dataset from FIFA video games. Lol! What they found was that it was possible to intercept calls between the game engine and the GPU while playing the video game and then to extract depth maps from these intercepted calls. In doing so, they were able to train a deep neural network to extract depth maps from 2D videos. This trained network was then used on 2D YouTube videos. Absolutely brilliant!

    mesh-generation-football-tabletop
    (image obtained from project’s video)
  6. Scene reconstruction. Once player depth estimation and mesh information (which is 3D information) is obtained, the scene can then be reconstructed. What the authors ended up doing was to use Microsoft HoloLens (a mixed reality headset that enables you to see and interact with holograms in real life). So the football pitch on the tabletop you see in the image below isn’t real! Can you imagine watching a match like this around a table with your mates!? There is a catch with the project, however. It’s not good enough yet to reconstruct the ball, which means that at the moment all you can view in 3D are players running around chasing an invisible object 🙂 But that’s a work in progress, and that’s the job and essence of research.

    hologram-football-tabletop
    (image obtained from project’s video)
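By the way, the tracking step mentioned in point 3 above is simple enough to sketch out. The following is a rough, hypothetical illustration (not the authors’ code) of matching bounding boxes between adjacent frames by the closest 2D Euclidean distance between their centres:

import numpy as np

def centre(box):
    # box = (x, y, w, h) in pixels
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def match_boxes(prev_boxes, curr_boxes):
    """Greedily match each box in the previous frame to the closest
    unused box in the current frame (Euclidean distance between centres)."""
    matches = {}
    used = set()
    for i, pb in enumerate(prev_boxes):
        dists = [np.linalg.norm(centre(pb) - centre(cb)) if j not in used else np.inf
                 for j, cb in enumerate(curr_boxes)]
        j = int(np.argmin(dists))
        if np.isfinite(dists[j]):
            matches[i] = j
            used.add(j)
    return matches

# toy example: two players moving slightly to the right between frames
prev_boxes = [(100, 200, 30, 80), (400, 210, 30, 80)]
curr_boxes = [(405, 212, 30, 80), (103, 201, 30, 80)]
print(match_boxes(prev_boxes, curr_boxes))   # {0: 1, 1: 0}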

Amazing, if you ask me! I can’t wait to see what the future holds for computer vision.

And believe it or not, code for this project is available online for you to play around with as much as you like. So, enjoy!


Recent Controversies in Computer Vision – From Facebook to Uber

Computer vision is a fascinating area in which to work and perform research. And, as I’ve mentioned a few times already, it’s been a pleasure to witness its phenomenal growth, especially in the last few years. However, as with pretty much anything in the world, contention also plays a part in its existence.

In this post I would like to present 2 recent events from the world of computer vision that have caused controversy:

  1. A judge’s ruling that Facebook must stand trial for its facial recognition software
  2. Uber’s autonomous car death of a pedestrian

Facebook and Facial Recognition

This is an event that has seemingly passed under the radar – at least for me it did. Probably because of the Facebook-Cambridge Analytica scandal that has been recently flooding the news and social discussions. But I think this is also an important event to mull over because it touches upon underlying issues associated with an important topic: facial recognition and privacy.

So, what has happened?

In 2015, Facebook was hit with a class action lawsuit (the original can be found here) by three residents of Chicago, Illinois. They accuse Facebook of violating the state’s biometric privacy laws by collecting and storing biometric data of each user’s face. This data is being stored without written notification. Moreover, it is not clear exactly what the data is to be used for, nor how long it will reside in storage, nor was there ever an opt-out option provided.

Facebook began to collect this data, as the lawsuit states, in a “purported attempt to make the process of tagging friends easier”.

cartoon-face-recognition

In other words, what Facebook is doing (yes, even now) is summarising the geometry of your face with certain parameters (e.g. distance between eyes, shape of chin, etc.). This data is then used to try to locate your face elsewhere to provide tag suggestions. But for this to be possible, the biometric data needs to be stored somewhere for it to be recalled when needed.

The Illinois residents are not happy that a firm is doing this without their knowledge or consent. Considering the Cambridge Analytica scandal, they kind of have a point, you would think? Who knows where this data could end up. They are suing for $75,000 and have requested a jury trial.

Anyway, Facebook protested over this lawsuit and asked that it be thrown out of court stating that the law in question does not cover its tagging suggestion feature. A year ago, a District Judge rejected Facebook’s appeal.

Facebook appealed again stating that proof of actual injury needs to be shown. Wow! As if violating privacy isn’t injurious enough!?

But on the 14th May, the same judge dismissed (official ruling here) Facebook’s appeal:

[It’s up to a jury] to resolve the genuine factual disputes surrounding facial scanning and the recognition technology.

So, it looks like Facebook will be facing the jury on July 9th this year! Huge news, in my opinion. Even if any verdict will only pertain to the United States. There is still so much that needs to be done to protect our data but at least things seem to be finally moving in the right direction.

Uber’s Autonomous Car Death of Pedestrian

You probably heard on the news that on March 18th this year a woman was hit by an autonomous car owned by Uber in Arizona as she was crossing the road. She died in hospital shortly after the collision. This is believed to be the first ever fatality of a pedestrian in which an autonomous car was involved. There have been other deaths in the past (3 in total) but in all of those cases it was the driver of the vehicle who died.

uber-fatality-crash
(image taken from the US NTSB report)

3 weeks ago the US National Transportation Safety Board (NTSB) released its first report (short read) into this crash. It was only a preliminary report but it provides enough information to state that the self-driving system was at least partially at fault.

uber-fatality-car-image
(image taken from the US NTSB report)

The report gives the timeline of events: the pedestrian was detected about 6 seconds before impact but the system had trouble identifying it. It was first classified as an unknown object, then a vehicle, then a bicycle – but even then it couldn’t work out the object’s direction of travel. At 1.3 seconds before impact, the system realised that it needed to engage an emergency braking maneuver but this maneuver had been earlier disabled to prevent erratic vehicle behaviour on the roads. Moreover, the system was not designed to alert the driver in such situations. The driver began braking less than 1 second before impact but it was tragically too late.

Bottom line is, if the self-driving system had immediately recognised the object as a pedestrian walking directly into its path, it would have known that avoidance measures needed to be taken – well before the emergency braking maneuver was called to be engaged. This is a deficiency of the artificial intelligence implemented in the car’s system.

No statement has been made with respect to who is legally at fault. I’m no expert but it seems like Uber will be given the all-clear: the pedestrian had hard drugs detected in her blood and was crossing in an area of the road not designated for crossing.

Nonetheless, this is a significant event for AI and computer vision (which plays a pivotal role in self-driving cars) because if these systems had performed better, the crash would have been avoided (as researchers have shown).

Big ethical questions are being taken seriously. For example, who will be held accountable if a fatal crash is deemed to be the fault of the autonomous car? The car manufacturer? The people behind the algorithms? One sole programmer who messed up a for-loop? Stanford scholars have been openly discussing the ethics behind autonomous cars for a long time (it’s an interesting read, if you have the time).

And what will be the future for autonomous cars in the aftermath of this event? Will their inevitable delivery into everyday use be pushed back?

Testing of autonomous cars has been halted by Uber in North America. Toyota has followed suit. And Chris Jones, who leads the Autonomous Vehicle Analysis service at the technology analyst company Canalys, says that these events will set the industry back considerably:

It has put the industry back. It’s one step forward, two steps back when something like this happens… and it seriously undermines trust in the technology.

Furthermore, a former US Secretary of Transportation has deemed the crash a “wake up call to the entire [autonomous vehicle] industry and government to put a high priority on safety.”

But other news reports seem to indicate a different story.

Volvo, the make of car that Uber was using in the fatal crash, stated only last week that it expects a third of its cars sold to be autonomous by 2025. Other car manufacturers are making similar announcements. Two weeks ago General Motors and Fiat Chrysler unveiled self-driving deals with companies like Google to push for a lead in the self-driving car market.

And Baidu (China’s Google, so to speak) is heavily invested in the game, too. Even Chris Jones is admitting that for them this is a race:

The Chinese companies involved in this are treating it as a race. And that’s worrying. Because a company like Baidu – the Google of China – has a very aggressive plan and will try to do things as fast as it can.

And when you have a race among large corporations, there isn’t much that is going to even slightly postpone anything. That’s been my experience in the industry anyway.

Summary

In this post I looked at 2 recent events from the world of computer vision that have caused controversy.

The first was a judge’s ruling in the United States that Facebook must stand trial for its facial recognition software. Facebook is being accused of violating Illinois’ biometric privacy laws by collecting and storing biometric data of each user’s face. This data is being stored without written notification. Moreover, it is not clear exactly what the data is being used for, nor how long it is going to reside in storage, nor was there ever an opt-out option provided.

The second event was the first recorded death of a pedestrian by an autonomous car in March of this year. A preliminary report was released by the US National Transportation Safety Board 3 weeks ago that states that AI is at least partially at fault for the crash. Debate over the ethical issues inherent to autonomous cars has heated up as a result but it seems as though the incident has not held up the race to bring self-driving cars onto our streets.


Computer Vision in the Fashion Industry – Part 3

In my last two posts (part 1 can be found here, part 2 can be found here) I’ve looked at computer vision and the fashion industry. I introduced the lucrative fashion industry and showed what Microsoft recently did in this field with computer vision. I also presented two papers from last year’s International Conference on Computer Vision (ICCV).

In this post, the final of the series, I would like to present to you two papers from last year’s ICCV workshop that was entirely devoted to fashion:

  1. “Dress like a Star: Retrieving Fashion Products from Videos” (N. Garcia and G. Vogiatzis, ICCV Workshop, 2017, pp. 2293-2299) [source code]
  2. “Multi-Modal Embedding for Main Product Detection in Fashion” (Rubio, et al., ICCV Workshop, 2017, pp. 2236-2242) [source code]

Once again, I’ve provided links to the source code so that you can play around with the algorithms as you wish. Also, as in previous posts, I am going to provide you with just an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

Dress Like a Star

This paper is impressive because it was written by a PhD student from Birmingham in the UK. By publishing at the ICCV Workshop (I discussed in my previous post how important this conference is), Noa Garcia has pretty much guaranteed her PhD and quite possibly any future research positions. Congratulations to her! However, I do think they cheated a bit to get into this ICCV workshop, as I explain further down.

The idea behind the paper is to provide a way to retrieve clothing and fashion products from video content. Sometimes you may be watching a TV show, film or YouTube clip and think to yourself: “Oh, that shirt looks good on him/her. I wish I knew where to buy it.”

The proposed algorithm works by providing it with a photo of a screen that is playing the video content, querying a database, and then returning the matching clothing content in the frame, as shown in this example image:

dress-like-star-example-image
(image source: original publication)

Quite a neat idea, wouldn’t you say?

The algorithm has three main modules: product indexing, training phase, and query phase.

The first two modules are performed offline (i.e. before the system is released for use). They require a database to be set up with video clips and another one with clothing articles. Then, the clothing items and video frames are matched to each other with some heavy computing (this is why it has to be performed offline – there’s a lot of computation here that cannot be done in real time).

You may be thinking: but heck, how can you possibly store and analyse all video content with this algorithm!? Well, to save on storage and computation, each video is processed (offline) and divided into shots/scenes that are then summarised into a single vector containing features (features are small “interesting” or “stand-out” patches in images).

Hence, in the query phase, all you need to do is detect features in the provided photo, search for these features in the database (rather than the raw frames), locate the scene depicted in the photo in the video database, and then extract the clothing articles in the scene.
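The paper uses its own feature extraction and indexing pipeline, but the general idea of detecting features in a query photo and matching them against stored features can be sketched with off-the-shelf OpenCV tools. The snippet below (with placeholder image paths) uses ORB features and a brute-force matcher purely as an illustration:

import cv2

query = cv2.imread('photo_of_screen.jpg', cv2.IMREAD_GRAYSCALE)    # placeholder path
frame = cv2.imread('stored_movie_frame.jpg', cv2.IMREAD_GRAYSCALE) # placeholder path

orb = cv2.ORB_create(nfeatures=1000)
kp_q, des_q = orb.detectAndCompute(query, None)
kp_f, des_f = orb.detectAndCompute(frame, None)

# brute-force matching with Hamming distance (suitable for ORB's binary descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_q, des_f), key=lambda m: m.distance)

# the number (and quality) of matches indicates how likely this frame is the queried scene
print(f'{len(matches)} feature matches found')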

To evaluate this algorithm, the authors set up a system with 40 movies (80+ hours of video). They were able to retrieve the scene from a video depicted in a photo with an accuracy of 87%.

Unfortunately, in their experiments, they did not set up a fashion item database but left this part out as “future work”. That’s a little bit of a let down and I would call that “twisting the truth” in order to get into a fashion-dedicated workshop. But, as they state in the conclusion: “the encouraging experimental results shown here indicate that our method has the potential to index fashion products from thousands of movies with high accuracy”.

I’m still calling this cheating 🙂

Main Product Detection in Fashion

This paper discusses an algorithm to extract the main clothing product in an image according to any textual information associated with it – like in a fashion magazine, for example. The purpose of this algorithm is to extract these single articles of clothing so as to enhance other datasets that need to work solely with “clean” images. Such datasets would include ones used in fashion catalogue searches (e.g. as discussed in the first post in this series) or systems of “virtual fitting rooms” (e.g. as discussed in the second post in this series).

The algorithm works by utilising deep neural networks (DNNs). (Is that a surprise? There’s just no escaping deep learning nowadays, is there?) To cut a long story short, neural networks are trained to extract bounding boxes of fashion products that are then used to train other DNNs to match products with textual information.

Example results from the algorithm are shown below.

main-product-extraction
(image source: original publication)

You can see above how the algorithm nicely finds all the articles of clothing (sunglasses, shirt, necklace, shoes, handbag) but only highlights the pants as the main product in the image according to the textual information associated with the picture.

Summary

In this post, the final of the series, I presented two papers from last year’s ICCV workshop that was entirely devoted to fashion. The first paper describes a way to retrieve clothing and fashion products from video content by providing it with a photo of a computer/TV screen. The second paper discusses an algorithm to extract the main clothing product in an image according to any textual information associated with it.

I always say that it’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Some of the ideas from the academic world I’ve looked at in this series leave a lot to be desired but that’s the way research is: one small step at a time.

(Part 1 of this series of posts can be found here, part 2 can be found here.)


Computer Vision in the Fashion Industry – Part 2

(Update: this post is part 2 of a 3 part series on CV and fashion. Part 1 can be found here, part 3 can be found here.)

In my last post I introduced the fashion industry and I gave an example of what Microsoft recently did in this field with computer vision. In today’s post, I would like to show you what the academic world has recently been doing in this respect.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Artificial intelligence, with deep learning at the forefront, is a prime example of this. Hence why I’m always keen to keep up-to-date with the goings-on of computer vision in academia.

A good way to keep abreast of computer vision in the academic world is to follow two of the top conferences in the field: the International Conference on Computer Vision (ICCV) and the Conference on Computer Vision and Pattern Recognition (CVPR). These annual conferences are huge. This is where the best of the best come together; where the titans of computer vision pit their wits against each other. Believe me, publishing in either one of these conferences is a lifetime achievement. I have 10 publications in total there (that’s a lie… I have none :P).

Interestingly, the academic world has been eyeing the fashion industry for the last 5 years it seems. An analysis was performed by Fashwell recently that counted the number of fashion-related papers at these two conferences. There appears to be a steady increase in these since 2013, as the graph below depicts:

Notice the huge spike at the end? The reason for it is that ICCV last year held an entire workshop specifically devoted to fashion.

(Note: a workshop is a, let’s say, less-formal format of a conference usually held as a side event to a major conference. Despite this, publishing at an ICCV or CVPR workshop is still a major achievement.)

As a result, there is plenty of material for me to present to you on the topic of computer vision in the fashion industry. Let’s get cracking!

ICCV 2017

In this post I will present two closely-related papers to you from the 2017 ICCV conference (in my next post I’ll present a few from the workshop):

  1. “Be Your Own Prada: Fashion Synthesis With Structural Coherence” (Zhu, et al., ICCV, 2017, pp. 1680-1688) [source code]
  2. “A Generative Model of People in Clothing” (Lassner, et al., ICCV, 2017, pp. 853-862) [source code]

Just like with all my posts, I will give you an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

But I have provided links to the source code of these papers, so please feel free to download, install and play around with these beauties at home.

1. Be Your Own Prada

This is a paper that presents an algorithm that can generate new clothing on an existing photo of a person without changing the person’s shape or pose. The desired new outfit is provided as a sentence description, e.g.: “a white blouse with long sleeves but without a collar and blue jeans”.

This is an interesting idea! You can provide the algorithm with a photo of yourself and then virtually try on a seemingly endless combination of styles of shirts, dresses, etc.

“Do I look good in blue and red? I dunno, but let’s find out!”

Neat, hey?

The algorithm has a two-step process:

  1. Image segmentation. This step semantically breaks the image up into human body parts such as face, arms, legs, hips, torso, etc. The result basically captures the shape of the person’s body and parts, but not their appearance, as shown in the example image below. Also, along with image segmentation, other attributes are extracted such as skin colour, long/short hair, and gender to provide constraints and boundaries for how far the image rendering step can go (you don’t want to change the person’s skin or hair colour, for example). The segmentation step is performed using a trained generative adversarial network (GAN – see this post for a description of these).

    input-image-segmentation
    (image adapted from original publication)
  2. Image rendering. This is the part that places new outfits onto the person using the results (segmentation and constraints/boundaries) from the first step as a guide. GANs are used here again. Example clothing articles were taken from 80,000 annotated images selected from the DeepFashion dataset.

Let’s take a look at some results (taken from the original publication). Remember, all that is provided is one picture of a person and then a description of how that person’s outfit should look like:

prada-results

Pretty cool! You could really see yourself using this, couldn’t you? We might be using something like this on our phones soon, I would say. Take a look at the authors’ page for this paper for more example result images. Some amazing stuff there.

2. A Generative Model of People in Clothing

This paper is still a work in progress, meaning that more research is needed before anything from it gets rolled out for everyday use. The intended goal of the algorithm is similar to the one presented above, but instead of generating images of the same person wearing a different outfit, this algorithm can generate random images of different people wearing different attire. Usually, generating such images is achieved by following a complex 3D graphics rendering pipeline.

It is a very complex algorithm but, in a nutshell, it first creates a dataset containing human pose, shape, and face information along with clothing articles. This information is then used to learn the relationships between body parts and respective clothes and how these clothes fit nicely to its appropriate body part, depending on the person’s pose and shape.

The dataset is created using the SMPLify 3D pose and shape estimation algorithm on the Chictopia10K fashion dataset (that was collected from the Chictopia fashion website) as well as dlib‘s implementation of the fast facial shape matcher to enhance each image with facial information.

Let’s take a look at some results.

The image below shows a randomly generated person wearing different coloured clothes (provided manually). Notice that, for example, with the skirts, the model learnt to put different wrinkles on the skirt depending on its colour. Interesting, isn’t it? The face on the person seems out of place – one reason why the algorithm is still a work in progress.

clothnet-results

The authors of the paper also attempted to create a random fashion magazine photo dataset from their algorithm. The idea behind this was to show that fashion magazines could perhaps one day generate photos automatically without going through the costly process of setting up photo sessions with real people. Once again, the results leave a lot to be desired but it’s interesting to see where research is heading.

clothnet-results2

Summary

This post extended my last post on computer vision in the fashion industry. I first examined how fashion is increasingly being looked at in computer vision academic circles. I then presented two papers from ICCV 2017. The first paper describes an algorithm to generate a new attire on an existing photo of a person without changing the person’s shape or pose. The desired new outfit is provided as a sentence description. The second paper shows a work-in-progress algorithm to randomly generate people wearing different clothing attires.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives.

(Update: this post is part 2 of a 3 part series on CV and fashion. Part 1 can be found here, part 3 can be found here.)


Computer Vision in the Fashion Industry – Part 1


(Update: this post is the first of a 3-part series. Part 2 can be found here, part 3 can be found here)

Computer vision has a plethora of applications in the industry: cashier-less stores, autonomous vehicles (including those loitering on Mars), security (e.g. face recognition) – the list goes on endlessly. I’ve already written about the incredible growth of this field in the industry and, in a separate post, the reasons behind it.

In today’s post I would like to discuss computer vision in a field that I haven’t touched upon yet: the fashion industry. In fact, I would like to devote my next few posts to this topic because of how ingeniously computer vision is being utilised in it.

In this post I will introduce the fashion industry and then present something that Microsoft recently did in the field with computer vision.

In my next few posts (part 2 here; part 3 here) I would like to present what the academic world (read: cutting-edge research) is doing in this respect. You will see quite amazing things there, so stay tuned for that!

The Fashion Industry

The fashion industry is huge. And that’s probably an understatement. At present it is estimated to be worth US$2.4 trillion. How big is that? If the fashion industry were a country, it would be ranked as the 7th largest economy in the world – above my beloved Australia and other countries like Russia and Spain. Utterly huge.

Moreover, it is reported to be growing at a steady rate of 5.5% each year.

In the e-commerce market, the clothing and fashion sectors dominate. In the EU, for example, the majority of the 530 billion euro e-commerce market is made up of this industry. Moreover, The Economic Times predicts that the online fashion market will grow three-fold in the next few years. The industry appears to be in agreement with this forecast considering some of the major takeovers currently being discussed. The largest one on the table at the moment is that of Flipkart, India’s biggest online store, which attributes 50% of its transactions to fashion. Walmart is expected to win the bidding war by purchasing 73% of the company, which it has valued at US$22 billion. Google is expected to invest a “measly” US$3 billion also. Ridiculously large amounts of money!

So, if the industry is so huge, especially online, then it only makes sense to bring artificial intelligence into play. And since fashion is a visual thing, this is a perfect application for computer vision!

(I’ve always said it: now is a great time to get into computer vision)



Microsoft and the Fashion Industry

3 weeks ago, Microsoft published on their Developer Blog an interesting article detailing how they used deep learning to build an e-commerce catalogue visual search system for “a successful international online fashion retailer” (which one it was has not been disclosed). I would like to present a summary of this article here because I think it is a perfect introduction to what computer vision can do in the fashion industry. (In my next post you will see how what Microsoft did is just a drop in the ocean compared to what researchers are currently able to do.)

The motivation behind this search system was to save the retailer time in finding whether each newly arriving item matches a merchandise item already in stock. Currently, employees have to manually look through catalogues and perform search and retrieval tasks themselves. For a large retailer, sifting through a sizable catalogue can be a time-consuming and tedious process.

So, the idea was to be able to take a photo from a mobile phone of a piece of clothing or footwear and search for it in a database for matches.

You may know that Google already has image search functionalities. Microsoft realised, however, that for their application in fashion to work, it was necessary to construct their own algorithm that would include some initial pre-processing of images. The reason for this is that the images in the database had a clean background whereas if you take a photo on your phone in a warehouse setting, you will capture a noisy background. The images below (taken from the original blog post) show this well. The first column shows a query image (taken by a mobile phone), the second column the matching image in the database.

swater-bgd

shirt-bgd

Microsoft, hence, worked on a background subtraction algorithm that would remove the background of an image and leave only the foreground (i.e. the salient fashion item) behind.

Background subtraction is a well-known technique in the computer vision field and is still very much an open area of research. OpenCV in fact has a few very interesting implementations of background subtraction available. See this OpenCV tutorial for more information on these.
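As a quick taster, here is a minimal sketch (with a placeholder video path) of one of those OpenCV implementations, the MOG2 background subtractor, applied to a video stream:

import cv2

cap = cv2.VideoCapture('warehouse_footage.mp4')   # placeholder path to a video
back_sub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                              detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = back_sub.apply(frame)   # white = foreground, grey = shadow, black = background
    cv2.imshow('foreground mask', fg_mask)
    if cv2.waitKey(30) & 0xFF == 27:  # press Esc to quit
        break

cap.release()
cv2.destroyAllWindows()

Note that these subtractors build up a model of a static background over many video frames, which is a somewhat different setting to segmenting the foreground of a single photo.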

GrabCut

Microsoft decided not to use these but instead to try out other methods for this task. It first tried GrabCut, a very popular background segmentation algorithm first introduced in 2004. In fact, this algorithm was developed by Microsoft researchers and Microsoft still owns the patent rights to it (hence why you won’t find it in the main repository of OpenCV any more).

I won’t go into too much detail on how GrabCut works but basically, for each image, you first need to manually provide a bounding box around the salient object in the foreground. After that, GrabCut builds a model (i.e. a mathematical description) of the background (the area outside the bounding box) and the foreground (the area inside the bounding box) and, using these models, iteratively trims the region inside the rectangle until it deduces where the foreground object lies. This process can be repeated by then manually indicating where the algorithm went wrong inside the bounding box.

The image below (from the original publication of 2004) illustrates this process. Note that the red rectangle was manually provided as were also the white and red strokes in the bottom left image.

grabcut-example

The images below show some examples provided by Microsoft from their application. The first column shows raw images from a mobile phone taken inside a warehouse, the second column shows initial results using GrabCut, and the third column shows images using GrabCut after additional human interaction. These results are pretty good.

foreground-background-ms
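To make the GrabCut workflow described above a little more concrete, here is a minimal sketch, assuming an OpenCV build that exposes cv2.grabCut and using a placeholder image path and bounding box:

import cv2
import numpy as np

img = cv2.imread('warehouse_photo.jpg')          # placeholder input image
mask = np.zeros(img.shape[:2], np.uint8)

# manually supplied bounding box around the salient foreground object (x, y, w, h)
rect = (50, 50, 300, 400)

# internal models of background/foreground colour statistics
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# keep definite and probable foreground pixels only
fg_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype('uint8')
foreground = img * fg_mask[:, :, np.newaxis]
cv2.imwrite('foreground.png', foreground)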

Tiramisu

But Microsoft wasn’t happy with GrabCut for the simple reason that it requires human interaction. It wanted a solution that would work simply by providing a photo of a product. So, it decided to move to a deep learning solution: Tiramisu. (Yum, I love that cake…)

Tiramisu is a type of DenseNet, which in turn is a specific type of Convolutional Neural Network (CNN). Once again, I’m not going to go into detail on how this network works. For more information see this publication that introduced DenseNets and this paper that introduced Tiramisu. But basically, in a DenseNet each layer is connected to every other layer (taking the feature maps of all preceding layers as input), whereas in a standard CNN each layer is connected only to its neighbouring layers.
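To illustrate the “connect each layer to every other layer” idea, here is a minimal, hypothetical sketch of a single dense block in Keras (not the actual Tiramisu architecture): each new convolution receives the concatenation of all feature maps produced before it.

from tensorflow.keras import layers, Input, Model

def dense_block(x, num_layers=4, growth_rate=12):
    """Each layer takes the concatenation of ALL previous feature maps as input."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, kernel_size=3, padding='same')(y)
        x = layers.Concatenate()([x, y])   # dense connectivity: keep everything
    return x

inputs = Input(shape=(224, 224, 3))
x = layers.Conv2D(24, kernel_size=3, padding='same')(inputs)
outputs = dense_block(x)
Model(inputs, outputs).summary()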

DenseNets work (surprisingly?) well on relatively small datasets. For specific tasks using deep neural networks, you usually need a few thousand example images for each class you are trying to classify. DenseNets can get remarkable results with around 600 images (which is still a lot but it’s at least a bit more manageable).

So, Microsoft trained a Tiramisu model from scratch with two classes: foreground and background. Only 249 images were provided for each class! The foreground and background training images were segmented using GrabCut with human interaction. The model achieved an accuracy rate of 93.7% at the training stage. The example image below shows an original image, the corresponding labelled training image (white is foreground and black is background), and the predicted Tiramisu result. Pretty good!

foreground-tiramisu

How did it fare in the real world? Apparently quite well. Here are some example images. The top row shows the automatically segmented image (i.e. with the background subtracted out) and the bottom row shows the original input images. Very neat 🙂

tiramisu-good-examples

The segmented images (e.g. top row in the above image) were then used to query a database. How this querying took place and what algorithm was used to detect potential matches is, however, not described in the blog post.

Microsoft has released all their code from this project so feel free to take a look yourselves.

Summary

In this post I introduced the topic of computer vision in the fashion industry. I described how the fashion industry is a huge business currently worth approximately US$2.4 trillion and how it is dominating on the online market. Since fashion is a visual trade, this is a perfect application for computer vision.

In the second part of this post I looked at what Microsoft did recently to develop a catalogue visual search system. They performed background subtraction on photos of fashion items using a DenseNet solution and these segmented images were used to query an already-existing catalogue.

Stay tuned for my next post which will look at what academia has been doing with respect to computer vision and the fashion industry.

(Update: this post is the first of a 3-part series. Part 2 can be found here, part 3 can be found here)


Eye-tracking-heatmap

Generating Heatmaps from Coordinates with Kernel Density Estimation

In last week’s post I talked about plotting tracked customers or staff from video footage onto a 2D floor plan. This is an example of video analytics and data mining that can be performed on standard CCTV footage that can give you insightful information such as common movement patterns or common places of congestion at particular times of the day.

There is, however, another thing that can be done with these extracted 2D coordinates of tracked people: generation of heatmaps.

A heatmap is a visual representation or summary of data that uses colour to represent data values. Generally speaking, the more congested the data is at a particular location, the hotter the colour used to represent it.

The diagram at the top of this post shows an example heatmap for eye-tracking data (I did my PhD in eye-tracking, so this brings back memories :P) on a Wikipedia page. There, the hotter regions denote where more time was spent gazing by viewers.

There are many ways to create these heatmaps. In this post I will present you one way with some supporting code at the end.

I’m going to assume that you have a list of coordinates in a file denoting the location of people on a 2D floor plan (see my previous post for how to obtain such a file from CCTV footage). Each line in the file is a coordinate at a specific point in time. For example, you might have something like this in a file called “coords.txt”:
x_coords,y_coords
200,301
205,300
208,300
210,300
210,300
210,300

Update: Note the ‘301’ in the first row of coordinates in the y-axis column. After an update to one of the libraries I use below, the variance in a column cannot be 0 (otherwise the kernel density estimate below cannot be computed).

In this example we have somebody moving horizontally a few pixels at a time and then standing still for the last 3 time intervals. If we were to generate a heatmap here, you would expect there to be hot colours around (210, 300) and cooler colours from (200, 301) through to (208, 300).

But how do we get these hot and cold colours around our points and make the heatmap look smooth and beautiful? Well, some of you may have heard of a thing called a Gaussian kernel. That’s just a fancy name for a particular type of curve. Let me show you a 2D image of one:

sm-gaussian

That curve can also be drawn in 3D, like so (notice the hot and cold colours here!):

gaussian-kernel-3D

Now, I’m not going to go into too much detail on Gaussian kernels because it would involve venturing into university mathematics. If you would like to read up on them, this pdf goes into a lot of explanatory detail and this page explains nicely why it is so commonly used in the trade. For this post, all you need to know is that it’s a specific type of curve.

With respect to our heatmaps, then, the idea is to place one of these Gaussian kernels at each coordinate location that we have in our “coords.txt” file. The trick here is to notice that when Gaussian kernels overlap, their values are added together (where the overlapping occurs). So, for example, with the 2D kernel image above, if we were to put another kernel at the exact same location, the peak of the kernel would reach 0.8 (0.4 + 0.4 = 0.8).

If you have clusters of points at a similar location, the Gaussian kernels at these locations would all push each other up.

The following image shows this idea well. There are 6 coordinates (the black marks on the x-axis) and a kernel placed at each of these (red dashed lines). The resulting curve is depicted in blue. The three congested kernels on the left push (by addition) the resulting curve up the highest.

pde
Gaussian kernels stacked on top of each other (image source)

This final plot of summed Gaussian kernels is actually called a kernel density estimation (KDE). It’s just a fancy name for a concept that, at its core, really isn’t too hard to understand!
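To make the idea concrete, here is a minimal sketch in plain NumPy of a 1D kernel density estimate built by placing a Gaussian kernel at each data point and summing them up. This is purely for illustration (the points and bandwidth are made-up values); the seaborn code towards the end of this post does all of this for you:

import numpy as np
from matplotlib import pyplot as plt

points = np.array([1.0, 1.3, 1.6, 4.0, 6.5, 7.0])  # 1D data points
bandwidth = 0.5  # controls the width of each Gaussian kernel

x = np.linspace(-1, 9, 500)
# one Gaussian kernel centred on each data point
kernels = np.exp(-0.5 * ((x[:, None] - points[None, :]) / bandwidth) ** 2)
kernels /= bandwidth * np.sqrt(2 * np.pi) * len(points)  # normalise so the total area is 1
kde = kernels.sum(axis=1)  # summing the kernels gives the density estimate

plt.plot(x, kde)
plt.show()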

A kernel density estimation can also be performed on 2D coordinates, producing a 3D surface, and this is exactly what can be done with the coordinates in your “coords.txt” file. Take a look at the 3D picture of a single Gaussian kernel above and picture looking down at that curve from above. You would be looking at a heatmap!

Here’s a top-down view example but with more kernels (at the locations of the white points). Notice the hot colours at the more congested locations. This is where the kernels have pushed the resulting KDE up the highest:

density-heat-map

And that, ladies and gentlemen, is how you create a heatmap from a file containing coordinate locations.

And what about some accompanying code? For the project that I worked on, I used the seaborn Python visualisation library. That library has a kernel density estimator function called kdeplot:

# import the required packages
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt


# Library versions used in this code:
# Python: 3.7.3
# Pandas: 1.3.5
# Numpy: 1.21.6
# Seaborn: 0.12.1
# Matplotlib: 3.5.3


# load the coordinates file into a dataframe
coords = pd.read_csv('coords.txt')
# call the kernel density estimator function
ax = sns.kdeplot(data = coords, x="x_coords", y="y_coords", fill=True, thresh=0, levels=100, cmap="mako")
# the function has additional parameters, e.g. to change the colour palette,
# so if you need things customised, there are plenty of options

# plot your KDE
# once again, there are plenty of customisations available to you in pyplot
plt.show()


# save your KDE to disk
fig = ax.get_figure()
fig.savefig('kde.png', transparent=True, bbox_inches='tight', pad_inches=0)

It’s amazing what you can do with basic CCTV footage, computer vision, and a little bit of mathematical knowledge, isn’t it?


data-mining-image

Mapping Camera Coordinates to a 2D Floor Plan

Data mining is a big business. Everyone is analysing mouse clicks, mouse movements, customer purchase patterns. Such analysis has proven to give profitable insights that are driving businesses further than ever before.

But not many people have considered data mining videos. What about all that security footage that has stacked up over the years? Can we mine those for profitable insights also? Of course!

In this blog post I’m going to present a task that video analytics can do for you: the plotting of tracked customers or staff from video footage onto a 2D floor plan. 

Why would you want to do this? Well, plotting on a 2D plane allows you to more easily data mine for things such as common movement patterns or common places of congestion at particular times of the day. This is powerful information to possess. For example, if you can deduce which products customers reached for and in what order, you can make important decisions with respect to the layout of your shelves and the placement of advertising.

Another benefit of this technique is that it is much easier to visualise movement patterns presented on a 2D plane than when they are shown on distorted CCTV footage. (In fact, in my next post I extend what I present here by showing you how to generate heatmaps from your tracking data – check it out).

However, if you have ever tried to undertake this task you may have come to the understanding that it is not as straightforward as you initially thought. A major dilemma is that your security camera images are distorted. For example, a one pixel movement at the top of your image corresponds to a much larger movement in the real world than a one pixel movement at the bottom of your image.

Where to begin? In tackling this problem, the first thing to realise is that we are dealing with two planes in Euclidean space. One plane (the floor in your camera footage) is “stretched out” by perspective, while the other (your floor plan) is “laid flat”. We therefore need a transformation function to map points from one plane to the other.

The following image shows what we are trying to achieve (assume the chessboard is the floor in your shop/business):

chessboard-person-transformation
The task: map the plane from your camera to a perspective view

The next step, then, is to deduce what kind of transformation is necessary. Once we know this, we can start to look at the mathematics behind it and use this maths accordingly in our application. Here are some possible transformations:

different-transformations
Different types of transformations (image source)

Translations (the first transformation in the image above) are shifts in the x and y directions that preserve orientation. Euclidean transformations change the orientation of the plane but preserve the distances between points – definitely not our case, as mentioned earlier. Affine transformations are a combination of translation, rotation, scale, and shear; they can change the distances between points, but parallel lines remain parallel after transformation – also not our case. Lastly, we have homographic (projective) transformations, which can map a square onto any quadrilateral. This is what we are after.

Mathematically, homographic transformations are represented as such:

homographic-transformation

where (x,y) represent pixel coordinates in one plane, (x’, y’) represent pixel coordinates in another plane and H is the homography matrix represented as this 3×3 matrix:

homography-matrix
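In case the equation images above don't display for you, the relationship can be written out in homogeneous coordinates roughly as follows (a standard textbook form, not necessarily the exact notation used in the original figures):

s \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = H \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}, \qquad H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}

where s is a scale factor that comes from working in homogeneous coordinates.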

Basically, the equation states this: given a point (x’, y’) in one plane, if I multiply it by the homography matrix H I will get the corresponding point (x, y) in the other plane. So, if we calculate H, we can map the coordinates of any pixel from our camera image to the flat floor-plan image.

But how do you calculate this magic matrix H? Glossing over some intricate mathematics: we need at least 4 point pairs (4 corresponding points) across the two images to obtain a minimal solution for H, and the more point pairs we provide, the better the estimate of H will be.

Getting the corresponding point pairs from our images is easy, too. You can use an image editing application like GIMP: if you move your mouse over an image, the pixel coordinates of your mouse position are shown at the bottom of the window. Jot down the pixel coordinates from one image and the corresponding pixel coordinates in the matching image. Get at least four such point pairs and you can then estimate H and use it to calculate any other corresponding point pair.

chessboard-point-pairs
Example of 3 corresponding points in two images

Now you can take the tracking information from your security camera and plot the position of people on your perspective 2D floor plan. You can now analyse their walking paths, where they spent most of their time, where congestion frequently occurs, etc. Nice! But what’s even nicer is the simple code needed to do everything discussed here. The OpenCV library (the best image/video processing library around) provides all necessary methods that you’ll need:

import cv2 # import the OpenCV library
import numpy as np # import the numpy library

# provide points from image 1
pts_src = np.array([[154, 174], [702, 349], [702, 572],[1, 572], [1, 191]])
# corresponding points from image 2 (i.e. (154, 174) matches (212, 80))
pts_dst = np.array([[212, 80],[489, 80],[505, 180],[367, 235], [144,153]])

# calculate matrix H
h, status = cv2.findHomography(pts_src, pts_dst)

# provide a point you wish to map from image 1 to image 2
a = np.array([[154, 174]], dtype='float32')
a = np.array([a]) # reshape to (1, 1, 2), the shape perspectiveTransform expects

# finally, get the mapping
pointsOut = cv2.perspectiveTransform(a, h)
print(pointsOut) # approximately [[[212, 80]]], i.e. the matching point in image 2
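As a usage note, the same perspectiveTransform call will happily map an entire list of tracked positions in one go, which is handy when converting a whole day's worth of tracking data (the coordinates below are just made-up example points):

# map a whole trajectory of tracked positions from the camera image to the floor plan
tracked_points = np.array([[[200, 301]], [[205, 300]], [[210, 300]]], dtype='float32')
floor_plan_points = cv2.perspectiveTransform(tracked_points, h)
print(floor_plan_points.reshape(-1, 2)) # one (x, y) floor-plan coordinate per tracked frame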

Piece of cake. Here is a short animation showing what you can do:

animation-after-transformation

Be sure to check out my next post where I show you how to generate heatmaps from the tracking data you just obtained from your security cameras.


NASA-Mars-Rover

Computer Vision on Mars from the 2004 Exploration Rovers

I was doing my daily trawl of the internet a few days ago looking at the latest news in artificial intelligence (especially computer vision) and an image caught my eye. The image was of one of the Mars Exploration Rovers (MER) that landed on the Red Planet in 2004. Upon seeing the image I thought to myself: “Heck, those rovers must surely have used computer vision up there!?” So, I spent the day looking into this and, sure as can be, not only was computer vision used by these rovers, it in fact played an integral part in their missions.

In this post, then, I’m going to present to you how, where, and when computer vision was used by those MERs. It’s been a fascinating few days for me researching this topic and I’m certain you’ll find it an interesting read too. I won’t go into too much detail here but I’ll give you enough to come to appreciate just how neat and important computer vision can be.

If you would like to read more about this topic, a good place to start is “Computer Vision on Mars” (Matthies, Larry, et al. International Journal of Computer Vision 75.1, 2007: 67-92.), which is an academic paper published by NASA in 2007. You can also follow any additional referenced publications there. All images in this post, unless otherwise stated, were taken from this paper.

Background Information

In 2003, NASA launched two rovers into space with the intention of landing them on Mars to study rocks and soils for traces of past water activity. MER followed upon three earlier missions to the Martian surface: the two Viking lander missions of 1975 and 1976 and the Mars Pathfinder mission of 1997.

Due to constraints in processing power and memory capacity, no image processing was performed on board by the Viking landers. They only took pictures with their on-board cameras to be sent back to Earth.

The Sojourner (the name of the Mars Pathfinder rover), on the other hand, performed computer vision in one way only: it used stereoscopic vision to provide scientists with detailed maps of the terrain around the rover, which operators on Earth used to plan movement trajectories. Stereoscopic vision provides visual information from two viewing angles a short distance apart, just like our eyes do. This kind of vision is important because two views of the same scene allow for the extraction of 3D (i.e. depth) data. See this OpenCV tutorial on extracting depth maps from stereo images for more information on this.

The MER Rovers

The MER rovers, Spirit and Opportunity as they were named, were identical. Both had a 20 MHz processor, 128 MB of RAM, and 256 MB of flash memory. Not much to work with there, as you can see! Phones nowadays are about 1000 times more powerful.

The rovers also had a monocular descent camera facing directly down and three sets of stereo camera pairs: one pair each at the front and back of the rovers (called hazard cameras, or “hazcams” for short) and a pair of cameras (called navigation cameras, or “navcams” for short) on a mast 1.3m (4.3 feet) above the ground. All these cameras took 1024 x 1024 greyscale photos.

But wait, those colour photos we’ve seen so many times from these missions were fake, then? Nope! Cleverly, each of the stereoscopic camera lenses also had a wheel of 8 filters that could be rotated. Consecutive images could be taken with a different filter (e.g. infrared, ultra-violet, etc.) and colour extracted from a combination of these. Colour extraction was only done on Earth, however. All computer vision processing on Mars was therefore performed in greyscale. Fascinating, isn’t it?

MER-rover-components
Components of an MER rover (image source)

The Importance of Computer Vision in Space

If you’ve been around computer vision for a while you’ll know that for things such as autonomous vehicles, vision solutions are not necessarily the most efficient. For example, lidar (Light Detection And Ranging – a technique similar to sonar for constructing 3D representations of scenes by emitting pulsating laser light and then measuring reflections of it) can give you 3D obstacle avoidance/detection information much more easily and quickly. So, why did NASA choose to use computer vision (and so much of it, as I’ll be presenting to you below) instead of other solutions? Because laser equipment is fragile and it may not have withstood the harsh conditions of Mars. So, digital cameras were chosen instead.

Computer Vision on Mars

We now have information on the background of the mission and the technical hardware relevant to us so let’s move to the business side of things: computer vision.

The first thing I will talk about is the importance of autonomy in space exploration. Due to communication latency and bandwidth limitations, it is advantageous to minimise human intervention by allowing vehicles or spacecraft to make decisions on their own. The Sojourner had minimal autonomy and only ended up travelling approximately 100 metres (328 feet) during its entire mission (which lasted a good few months). NASA wanted the MER rovers to travel on average that much every day, so they put a lot of time and research into autonomy to help them reach this target.

In this respect, the result was that they used computer vision for autonomy on Mars in 3 ways:

  1. Descent motion estimation
  2. Obstacle detection for navigation
  3. Visual odometry

I will talk about each of these below. As mentioned in the introduction, I won’t go into great detail here but I’ll give you enough to satisfy that inner nerd in you 😛

1. Descent Image Motion Estimation System

Two years before the launch of the rocket that was to take the rovers to Mars, scientists realised that their estimates of near-surface wind velocities of the planet were too low. This could have proven catastrophic because severe horizontal winds could have caused irreparable damage upon an ill-judged landing of the rover. Spirit and Opportunity had horizontal impulse rockets that could be used to reduce horizontal velocity upon descent but no system to detect actual horizontal speed of the rovers.

Since a regular horizontal velocity sensor could not be installed due to cost and time constraints, it was decided to turn to computer vision for assistance! A monocular camera was attached to the base of the rover that would take pictures of the surface of the planet as the rovers were descending onto it. These pictures would be analysed in-flight to provide estimates of horizontal speeds in order to trigger the impulse rockets, if necessary.

The computer vision system for motion estimation worked by tracking a single feature (features are small “interesting” or “stand-out” patches in images). The feature was located in a photo taken by the rover during descent and then its position was tracked across consecutive images.

This feature tracking information, coupled with measurements from the angular velocity and vertical velocity sensors (which were already installed for the purpose of on-surface navigation), allowed the entire velocity vector (i.e. the magnitude and direction of the rover’s velocity) to be calculated.

The feature tracking algorithm, called the Descent Image Motion Estimation System (DIMES), consisted of 7 steps, as summarised by the following image:

DIMES-algorithm

The first step reduces the image size to 256 x 256 resolution. The smaller the resolution, the faster that image processing calculations can be performed – but at the possible expense of accuracy. The second step was responsible for estimating the maximum possible area of overlap in consecutive images to minimise the search area for features (there’s no point in detecting features in regions of an image that you know are not going to be present in the second). This was done by taking into consideration knowledge from sensors of things such as the rover’s altitude and orientation. The third step picked out two features from an image using the Harris corner detector (discussed here in this OpenCV tutorial). Only one feature is needed for the algorithm to work but two were detected in case one feature could not be located in the following image. A few noise “clean-up” operations on images were performed in step 4 to reduce effects of things such as blurring.

Step 5 is interesting. The feature patches (aka feature templates) and search windows in consecutive images were rectified (rotated, twisted, etc.) to remove orientation and scale differences in order to make searching for features easier. In other words, the images were rotated, twisted and enlarged/diminished to be placed on the same plane. An example of this from the actual mission (from the Spirit rover’s descent) is shown in the image below. The red squares in the first image are the detected feature patches that are shown in green in the second image with the search windows shown in blue. You can see how the first and second images have been twisted and rotated such that the feature size, for example, is the same in both images.

DIMES-rectification-example

Step 6 was responsible for locating in the second image the two features found in the first image. Moravec’s correlator (an algorithm developed by Hans Moravec and published in his PhD thesis way back in 1980) was used for this. The general idea of this algorithm is to minimise the search area rather than searching over every possible location in an image for a match: candidate regions for a match are selected first, and only within those regions is a more exhaustive search performed.

The final step is combining all this information to calculate the velocity vector. In total, the DIMES algorithm took 14 seconds to run up there in the atmosphere of Mars. It was run by both rovers during their descent. The Spirit rover was the only one that fired its impulse rockets as a result of calculations from DIMES. Its horizontal velocity was at one stage reduced from 23.5 m/s (deemed to be slightly over a safe limit) to 11 m/s, which ensured a safe landing. Computer vision to the rescue! Opportunity’s horizontal speed was never calculated to be too fast so firing its stabilising rockets was considered to be unnecessary. It also had a successful landing.

All the above steps were performed autonomously on Mars without any human intervention. 
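To give a flavour of what the corner detection of step 3 and the template search of step 6 look like in everyday code, here is a rough OpenCV sketch. To be clear, this is a generic corner-detection-plus-template-matching illustration, not NASA's actual DIMES or Moravec implementation, and the two descent image file names are made up:

import cv2
import numpy as np

# two consecutive (hypothetical) descent images, loaded as greyscale
img1 = cv2.imread('descent_frame_1.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('descent_frame_2.png', cv2.IMREAD_GRAYSCALE)

# step 3 analogue: pick out two strong Harris corner features in the first image
corners = cv2.goodFeaturesToTrack(img1, maxCorners=2, qualityLevel=0.01,
                                  minDistance=50, useHarrisDetector=True)
x, y = corners[0].ravel().astype(int)

# cut out a small template (feature patch) around the strongest corner
half = 15
template = img1[y - half:y + half, x - half:x + half]

# step 6 analogue: search for that patch in the second image
result = cv2.matchTemplate(img2, template, cv2.TM_CCOEFF_NORMED)
_, _, _, max_loc = cv2.minMaxLoc(result)
print('feature moved from', (x, y), 'to approximately',
      (max_loc[0] + half, max_loc[1] + half))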

2. Stereo Vision for Navigation

To give the MER rovers as much autonomy as possible, NASA scientists developed a stereo-vision-based obstacle detection and navigation system. The idea behind it was to give the scientists the ability to simply provide the rovers each day with a destination and for the vehicles to work things out on their own with respect to navigation to this target (e.g. to avoid large rocks).

And their system performed beautifully.

The algorithm worked by extracting disparity maps (from which depth can be derived) from stereo images – as I’ve already mentioned, see this OpenCV tutorial for more information on this technique. What the rovers did, however, was slightly different to that tutorial (for example, a simpler feature matching algorithm was employed), but the gist of it was the same: feature point detection and matching was performed to find the relationship between the two images, and knowledge of camera properties such as focal lengths and baseline distances allowed depth to be derived for all pixels in an image. An example of depth maps calculated in this way by the Spirit rover is shown below:

depth-maps-spirit-rover
The middle picture was taken by Spirit and shows a rock on Mars approximately 0.5 m (1.6 feet) in height. The left image shows corresponding range information (red is closest, blue furthest). The right image shows corresponding height information.

Interestingly, the Opportunity rover, because it landed on a smoothly-surfaced plain, was forced to use its navcams (that were mounted on a mast) for its navigation. Looking down from a higher angle meant that detailed texture from the sand could be used for feature detection and matching. Its hazcams returned only the smooth surface of the sand. Smooth surfaces are not agreeable to feature detection (because, for example, they don’t have corners or edges). The Spirit rover, on the other hand, because it landed in a crater full of rocks, could use its hazcams for stereoscopic navigation.
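For the curious, the basic block-matching approach to extracting a disparity map from a stereo pair looks something like this in OpenCV. Again, this is only a sketch of the general technique (not the rovers' actual algorithm), and the two rectified stereo image file names are made up:

import cv2

# hypothetical rectified greyscale stereo pair
img_left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# block-matching stereo correspondence
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(img_left, img_right)

# depth is inversely proportional to disparity:
# depth = (focal_length * baseline) / disparity
cv2.imwrite('disparity.png',
            cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype('uint8'))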

3. Visual Odometry

Finally, computer vision on Mars was used at certain times to estimate the rovers’ position and travelling distance. No GPS is available on Mars (yet), and standard means of estimating distance travelled, such as counting the number of wheel rotations, were deemed during desert testing on Earth to be vulnerable to significant error due to one thing: wheel slippage. So, NASA scientists decided to employ motion estimation via computer vision instead.

Motion estimation was performed using feature tracking in 3D across successive shots taken by the navcams. To obtain 3D information, once again depth maps were extracted from stereoscopic images. Distances to features could easily be calculated from these and then the rovers’ poses were estimated. On average, 80 features were tracked per frame and a photo was taken for visual odometry calculations every 75 cm (30 inches) of travel.
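As a small taste of the feature tracking half of this process – and only the 2D tracking step, not the rovers' full stereo-based 3D pose estimation – features can be detected in one frame and followed into the next with a modern OpenCV routine such as Lucas-Kanade optical flow. The navcam file names below are made up:

import cv2
import numpy as np

# two consecutive (hypothetical) navcam frames, loaded as greyscale
frame1 = cv2.imread('navcam_t0.png', cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread('navcam_t1.png', cv2.IMREAD_GRAYSCALE)

# detect around 80 good features to track in the first frame
pts1 = cv2.goodFeaturesToTrack(frame1, maxCorners=80, qualityLevel=0.01, minDistance=10)

# track those features into the second frame with Lucas-Kanade optical flow
pts2, status, err = cv2.calcOpticalFlowPyrLK(frame1, frame2, pts1.astype(np.float32), None)

# keep only the features that were successfully tracked
tracked_from = pts1[status.ravel() == 1]
tracked_to = pts2[status.ravel() == 1]
print(len(tracked_to), 'features tracked between frames')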

Using computer vision to assist in motion estimation proved to be a wise decision because wheel slippage was quite severe on Mars. In fact, at one time the rover got stuck in sand and the wheels rotated in place for the equivalent of 50m (164 feet) of driving distance. Without computer vision the rovers’ estimated positions would have been severely inaccurate. 

There was another instance where this was strikingly the case. At one time the Opportunity rover was operating on a 17-20 degree slope in a crater and was attempting to maneuver around a large rock. It had been trying to escape the rock for several days and had slid down the crater many times in the process. The image below shows the rover’s estimated trajectory (from a top-down view) using just wheel odometry (left), and the rover’s corrected trajectory (right) as assisted by computer vision calculations. The large rock is represented by the black ellipse. The corrected trajectory proved to be the more accurate estimation.

visual-odometry-example

Summary

In this post I presented the three ways computer vision was used by the Spirit and Opportunity rovers during their MER missions on Mars. These three ways were:

  1. Estimating horizontal speeds during their descent onto the Red Planet to ensure the rovers had a smooth landing.
  2. Extracting 3D information of their surroundings using stereoscopic imagery to assist in navigation and obstacle detection.
  3. Using stereoscopic imagery once again but this time to provide motion and pose estimation on difficult terrain.

In this way, computer vision gave the rovers a significant amount of autonomy (much, much more than their predecessor, the Sojourner rover, had), which ultimately ensured a safe landing and allowed the robots to traverse up to 370 m (1213 feet) per day. In fact, the Opportunity rover is still active on Mars now. This means that the computer vision techniques described in this post are churning away as we speak. If that isn’t neat, I don’t know what is!
