fashion-computer-vision

Computer Vision in the Fashion Industry – Part 3

In my last two posts (part 1 can be found here, part 2 can be found here) I’ve looked at computer vision and the fashion industry. I introduced the lucrative fashion industry and showed what Microsoft recently did in this field with computer vision. I also presented two papers from last year’s International Conference on Computer Vision (ICCV).

In this post, the final of the series, I would like to present to you two papers from last year’s ICCV workshop that was entirely devoted to fashion:

  1. “Dress like a Star: Retrieving Fashion Products from Videos” (N. Garcia and G. Vogiatzis, ICCV Workshop, 2017, pp. 2293-2299) [source code]
  2. “Multi-Modal Embedding for Main Product Detection in Fashion” (Rubio et al., ICCV Workshop, 2017, pp. 2236-2242) [source code]

Once again, I’ve provided links to the source code so that you can play around with the algorithms as you wish. Also, as in previous posts, I am going to provide you with just an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

Dress Like a Star

This paper is impressive because it was written by a PhD student from Birmingham in the UK. By publishing at the ICCV Workshop (I discussed in my previous post how important this conference is), Noa Garcia has pretty much guaranteed her PhD and quite possibly a future research position. Congratulations to her! However, I do think the authors cheated a bit to get into this ICCV workshop, as I explain further down.

The idea behind the paper is to provide a way to retrieve clothing and fashion products from video content. Sometimes you may be watching a TV show, film or YouTube clip and think to yourself: “Oh, that shirt looks good on him/her. I wish I knew where to buy it.”

The proposed algorithm works like this: you provide it with a photo of a screen that is playing the video content, it queries a database, and it then returns the clothing items that appear in the matched frame, as shown in this example image:

dress-like-star-example-image
(image source: original publication)

Quite a neat idea, wouldn’t you say?

The algorithm has three main modules: product indexing, training phase, and query phase.

The first two modules are performed offline (i.e. before the system is released for use). They require one database to be set up with video clips and another with clothing articles. Then, the clothing items and video frames are matched to each other with some heavy computing (this is why it has to be performed offline – there’s a lot of computation here that cannot be done in real time).

You may be thinking: but heck, how can you possibly store and analyse all video content with this algorithm!? Well, to save storage and computation space, each video is processed (offline) and divided into shots/scenes that are then summarised into a single vector containing features (features are small “interesting” or “stand-out” patches in images).

Hence, in the query phase, all you need to do is detect features in the provided photo, search for these features in the database (rather than the raw frames), locate the scene depicted in the photo in the video database, and then extract the clothing articles in the scene.
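To make the indexing-and-query idea concrete, here is a minimal sketch in Python using OpenCV’s ORB features. ORB is my illustrative choice – the paper uses its own feature pipeline and indexing scheme – and the file names are placeholders:

import cv2

# --- offline: index feature descriptors for one representative frame per scene ---
orb = cv2.ORB_create(nfeatures=500)

def describe(image_path):
    # detect keypoints and compute their binary descriptors
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = orb.detectAndCompute(img, None)
    return descriptors

scene_index = {scene_id: describe(path)
               for scene_id, path in [('s1', 'scene1.png'), ('s2', 'scene2.png')]}

# --- online: match a query photo against the index, not against raw frames ---
def best_scene(query_path):
    query = describe(query_path)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    scores = {sid: len(matcher.match(query, desc))
              for sid, desc in scene_index.items()}
    return max(scores, key=scores.get)  # scene with the most matched features

print(best_scene('photo_of_tv.png'))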

To evaluate this algorithm, the authors set up a system with 40 movies (80+ hours of video). They were able to retrieve the scene from a video depicted in a photo with an accuracy of 87%.

Unfortunately, in their experiments, they did not set up a fashion item database but left this part as “future work”. That’s a bit of a letdown, and I would call it “twisting the truth” in order to get into a fashion-dedicated workshop. But, as they state in the conclusion: “the encouraging experimental results shown here indicate that our method has the potential to index fashion products from thousands of movies with high accuracy”.

I’m still calling this cheating 🙂

Main Product Detection in Fashion

This paper discusses an algorithm to extract the main clothing product in an image according to the textual information associated with it – like in a fashion magazine, for example. The purpose of this algorithm is to extract single articles of clothing that can then be used to enhance other datasets that need to work solely with “clean” images. Such datasets include the ones used in fashion catalogue searches (e.g. as discussed in the first post in this series) or in “virtual fitting room” systems (e.g. as discussed in the second post in this series).

The algorithm works by utilising deep neural networks (DNNs). (Is that a surprise? There’s just no escaping deep learning nowadays, is there?) To cut a long story short, neural networks are trained to extract bounding boxes of fashion products that are then used to train other DNNs to match products with textual information.
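To give a feel for the matching step, here is a minimal joint-embedding sketch in PyTorch (my own simplification with made-up feature dimensions, not the authors’ exact architecture): image crops and text are projected into a shared vector space, and the candidate box most similar to the description is chosen as the main product.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    # project image features and text features into one shared space
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feat, txt_feat):
        # L2-normalise so that a dot product equals cosine similarity
        return (F.normalize(self.img_proj(img_feat), dim=-1),
                F.normalize(self.txt_proj(txt_feat), dim=-1))

model = JointEmbedding()
boxes = torch.randn(5, 2048)   # stand-in CNN features of 5 detected products
text = torch.randn(1, 300)     # stand-in embedding of the product description
img_emb, txt_emb = model(boxes, text)
scores = img_emb @ txt_emb.T   # similarity of each box to the text
print(scores.argmax().item())  # index of the predicted main product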

Example results from the algorithm are shown below.

main-product-extraction
(image source: original publication)

You can see above how the algorithm nicely finds all the articles of clothing (sunglasses, shirt, necklace, shoes, handbag) but only highlights the pants as the main product in the image according to the textual information associated with the picture.

Summary

In this post, the final of the series, I presented two papers from last year’s ICCV workshop that was entirely devoted to fashion. The first paper describes a way to retrieve clothing and fashion products from video content by providing it with a photo of a computer/TV screen. The second paper discusses an algorithm to extract the main clothing product in an image according to any textual information associated with it.

I always say that it’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Some of the ideas from the academic world I’ve looked at in this series leave a lot to be desired but that’s the way research is: one small step at a time.

(Part 1 of this series of posts can be found here, part 2 can be found here.)

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

fashion-computer-vision

Computer Vision in the Fashion Industry – Part 2

(Update: this post is part 2 of a 3 part series on CV and fashion. Part 1 can be found here, part 3 can be found here.)

In my last post I introduced the fashion industry and I gave an example of what Microsoft recently did in this field with computer vision. In today’s post, I would like to show you what the academic world has recently been doing in this respect.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Artificial intelligence, with deep learning at the forefront, is a prime example of this. Hence why I’m always keen to keep up-to-date with the goings-on of computer vision in academia.

A good way to keep abreast of computer vision in the academic world is to follow two of the top conferences in the field: the International Conference on Computer Vision (ICCV) and the Conference on Computer Vision and Pattern Recognition (CVPR). These annual conferences are huge. This is where the best of the best come together; where the titans of computer vision pit their wits against each other. Believe me, publishing in either one of these conferences is a lifetime achievement. I have 10 publications in total there (that’s a lie… I have none :P).

Interestingly, it seems the academic world has been eyeing the fashion industry for the last five years. Fashwell recently performed an analysis counting the number of fashion-related papers at these two conferences. There has been a steady increase since 2013, as the graph below depicts:

Notice the huge spike at the end? The reason for it is that last year ICCV held an entire workshop specifically devoted to fashion.

(Note: a workshop is a, let’s say, less-formal format of a conference usually held as a side event to a major conference. Despite this, publishing at an ICCV or CVPR workshop is still a major achievement.)

As a result, there is plenty of material for me to present to you on the topic of computer vision in the fashion industry. Let’s get cracking!

ICCV 2017

In this post I will present two closely related papers from the 2017 ICCV conference (in my next post I’ll present a few from the workshop):

  1. “Be Your Own Prada: Fashion Synthesis With Structural Coherence” (Zhu et al., ICCV, 2017, pp. 1680-1688) [source code]
  2. “A Generative Model of People in Clothing” (Lassner et al., ICCV, 2017, pp. 853-862) [source code]

Just like with all my posts, I will give you an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

But I have provided links to the source code of these papers, so please feel free to download, install and play around with these beauties at home.

1. Be Your Own Prada

This is a paper that presents an algorithm that can generate new clothing on an existing photo of a person without changing the person’s shape or pose. The desired new outfit is provided as a sentence description, e.g.: “a white blouse with long sleeves but without a collar and blue jeans”.

This is an interesting idea! You can provide the algorithm with a photo of yourself and then virtually try on a seemingly endless combination of styles of shirts, dresses, etc.

“Do I look good in blue and red? I dunno, but let’s find out!”

Neat, hey?

The algorithm has a two-step process:

  1. Image segmentation. This step semantically breaks the image up into human body parts such as face, arms, legs, hips, torso, etc. The result basically captures the shape of the person’s body and parts, but not their appearance, as shown in the example image below. Also, along with image segmentation, other attributes are extracted such as skin colour, long/short hair, and gender to provide constraints and boundaries for how far the image rendering step can go (you don’t want to change the person’s skin or hair colour, for example). The segmentation step is performed using a trained generative adversarial network (GAN – see this post for a description of these).

    input-image-segmentation
    (image adapted from original publication)
  2. Image rendering. This is the part that places new outfits onto the person using the results (segmentation and constraints/boundaries) from the first step as a guide. GANs are used here again. Example clothing articles were taken from 80,000 annotated images selected from the DeepFashion dataset.

Let’s take a look at some results (taken from the original publication). Remember, all that is provided is one picture of a person and a description of how that person’s outfit should look:

prada-results

Pretty cool! You could really see yourself using this, couldn’t you? We might be using something like this on our phones soon, I would say. Take a look at the authors’ page for this paper for more example result images. Some amazing stuff there.

2. A Generative Model of People in Clothing

This paper is still a work in progress, meaning that more research is needed before anything from it gets rolled out for everyday use. The intended goal of the algorithm is similar to the one presented above, but instead of generating images of the same person wearing a different outfit, this algorithm generates random images of different people wearing different outfits. Usually, generating such images requires a complex 3D graphics rendering pipeline.

It is a very complex algorithm but, in a nutshell, it first creates a dataset containing human pose, shape, and face information along with clothing articles. This information is then used to learn the relationships between body parts and their respective clothes, and how these clothes fit the appropriate body parts depending on the person’s pose and shape.

The dataset is created using the SMPLify 3D pose and shape estimation algorithm on the Chictopia10K fashion dataset (that was collected from the Chictopia fashion website) as well as dlib‘s implementation of the fast facial shape matcher to enhance each image with facial information.

Let’s take a look at some results.

The image below shows a randomly generated person wearing different coloured clothes (provided manually). Notice that, for example, with the skirts, the model learnt to put different wrinkles on the skirt depending on its colour. Interesting, isn’t it? The face on the person seems out of place – one reason why the algorithm is still a work in progress.

clothnet-results

The authors of the paper also attempted to create a random fashion magazine photo dataset from their algorithm. The idea behind this was to show that fashion magazines could perhaps one day generate photos automatically without going through the costly process of setting up photo sessions with real people. Once again, the results leave a lot to be desired but it’s interesting to see where research is heading.

clothnet-results2

Summary

This post extended my last post on computer vision in the fashion industry. I first examined how fashion is increasingly being looked at in computer vision academic circles. I then presented two papers from ICCV 2017. The first paper describes an algorithm to generate new attire on an existing photo of a person without changing the person’s shape or pose; the desired new outfit is provided as a sentence description. The second paper shows a work-in-progress algorithm to randomly generate images of different people wearing different outfits.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives.

(Update: this post is part 2 of a 3 part series on CV and fashion. Part 1 can be found here, part 3 can be found here.)

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

fashion-computer-vision

Computer Vision in the Fashion Industry – Part 1

(image source)

(Update: this post is the first of a 3-part series. Part 2 can be found here, part 3 can be found here)

Computer vision has a plethora of applications in the industry: cashier-less stores, autonomous vehicles (including those loitering on Mars), security (e.g. face recognition) – the list goes on endlessly. I’ve already written about the incredible growth of this field in the industry and, in a separate post, the reasons behind it.

In today’s post I would like to discuss computer vision in a field that I haven’t touched upon yet: the fashion industry. In fact, I would like to devote my next few posts to this topic because of how ingeniously computer vision is being utilised in it.

In this post I will introduce the fashion industry and then present something that Microsoft recently did in the field with computer vision.

In my next few posts (part 2 here; part 3 here) I would like to present what the academic world (read: cutting-edge research) is doing in this respect. You will see quite amazing things there, so stay tuned for that!

The Fashion Industry

The fashion industry is huge. And that’s probably an understatement. At present it is estimated to be worth US$2.4 trillion. How big is that? If the fashion industry were a country, it would be ranked as the 7th largest economy in the world – above my beloved Australia and other countries like Russia and Spain. Utterly huge.

Moreover, it is reported to be growing at a steady rate of 5.5% each year.

On the e-commerce market, the clothing and fashion sectors dominate. In the EU, for example, the majority of the 530 billion euro e-commerce market is made up of this industry. Moreover, The Economic Times predicts that the online fashion market will grow three-fold in the next few years. The industry appears to be in agreement with this forecast considering some of the major takeovers being currently discussed. The largest one on the table at the moment is of Flipkart, India’s biggest online store that attributes 50% of its transactions to fashion. Walmart is expected to win the bidding war by purchasing 73% of the company that it has valued at US$22 billion. Google is expected to invest a “measly” US$3 billion also. Ridiculously large amounts of money!

So, if the industry is so huge, especially online, then it only makes sense to bring artificial intelligence into play. And since fashion is a visual thing, this is a perfect application for computer vision!

(I’ve always said it: now is a great time to get into computer vision)



Microsoft and the Fashion Industry

Three weeks ago, Microsoft published on their Developer Blog an interesting article detailing how they used deep learning to build an e-commerce catalogue visual search system for “a successful international online fashion retailer” (which one has not been disclosed). I would like to present a summary of that article here because I think it is a perfect introduction to what computer vision can do in the fashion industry. (In my next post you will see how what Microsoft did is just a drop in the ocean compared to what researchers are currently able to do.)

The motivation behind this search system was to save this retailer’s time in finding whether each new arriving item matches a merchandise item already in stock. Currently, employees have to manually look through catalogues and perform search and retrieval tasks themselves. For a large retailer, sifting through a sizable catalogue can be a time consuming and tedious process.

So, the idea was to be able to take a photo from a mobile phone of a piece of clothing or footwear and search for it in a database for matches.

You may know that Google already has image search functionalities. Microsoft realised, however, that for their application in fashion to work, it was necessary to construct their own algorithm that would include some initial pre-processing of images. The reason for this is that the images in the database had a clean background whereas if you take a photo on your phone in a warehouse setting, you will capture a noisy background. The images below (taken from the original blog post) show this well. The first column shows a query image (taken by a mobile phone), the second column the matching image in the database.

swater-bgd

shirt-bgd

Microsoft, hence, worked on a background subtraction algorithm that would remove the background of an image and only leave the foreground (i.e. salient fashion item) behind.

Background subtraction is a well-known technique in the computer vision field and is still very much an open area of research. In fact, OpenCV has a few very interesting implementations of background subtraction available. See this OpenCV tutorial for more information on these.

GrabCut

Microsoft decided not to use these but instead to try out other methods for this task. It first tried GrabCut, a very popular background segmentation algorithm first introduced in 2004. In fact, this algorithm was developed by Microsoft researchers, and Microsoft holds the patent rights to it (an implementation is nevertheless available in OpenCV as cv2.grabCut).

I won’t go into too much detail on how GrabCut works but basically, for each image, you first need to manually provide a bounding box of the salient object in the foreground. After that, GrabCut builds a model (i.e. a mathematical description) of the background (area outside of the bounding box) and foreground (area inside the bounding box) and using these models iteratively trims inside the rectangle until it deduces where the foreground object lies. This process can be repeated by then manually indicating where the algorithm went wrong inside the bounding box.
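Since OpenCV ships an implementation, a minimal sketch of this bounding-box workflow looks something like the following (the file name and rectangle are placeholders you would replace with your own):

import cv2
import numpy as np

img = cv2.imread('product_photo.jpg')        # placeholder file name
mask = np.zeros(img.shape[:2], np.uint8)     # per-pixel labels, filled in by GrabCut

# arrays OpenCV uses internally for the background/foreground colour models
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# the manually provided bounding box around the salient object: (x, y, width, height)
rect = (50, 50, 400, 500)

# run 5 iterations of the iterative trimming described above
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# definite/probable background pixels become 0, everything else 1
mask2 = np.where((mask == cv2.GC_BGD) | (mask == cv2.GC_PR_BGD), 0, 1).astype('uint8')
cv2.imwrite('foreground.png', img * mask2[:, :, np.newaxis])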

The image below (from the original publication of 2004) illustrates this process. Note that the red rectangle was manually provided as were also the white and red strokes in the bottom left image.

grabcut-example

The images below show some examples provided by Microsoft from their application. The first column shows raw images from a mobile phone taken inside a warehouse, the second column shows initial results using GrabCut, and the third column shows images using GrabCut after additional human interaction. These results are pretty good.

foreground-background-ms

Tiramisu

But Microsoft wasn’t happy with GrabCut for one important reason: it requires human interaction. It wanted a solution that would work from just a photo of a product. So, it decided to move to a deep learning solution: Tiramisu. (Yum, I love that cake…)

Tiramisu is a type of DenseNet, which in turn is a specific type of Convolutional Neural Network (CNN). Once again, I’m not going to go into detail on how this network works. For more information see this publication that introduced DenseNets and this paper that introduced Tiramisu. But basically, in a DenseNet each layer is connected to every other layer, whereas in a standard CNN each layer is connected only to its neighbouring layers.
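To illustrate the dense connectivity idea, here is a toy dense block in PyTorch (a sketch of the general concept only, not the Tiramisu architecture itself): every layer receives the concatenation of all previous feature maps.

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # each conv layer sees the concatenated outputs of all layers before it
    def __init__(self, in_channels=3, growth=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth, kernel_size=3, padding=1)))
            channels += growth  # every later layer also sees this layer's output

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # the dense connections
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock()
print(block(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 51, 64, 64])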

DenseNets work (surprisingly?) well on relatively small datasets. For specific tasks using deep neural networks, you usually need a few thousand example images for each class you are trying to classify. DenseNets can get remarkable results with around 600 images (which is still a lot, but at least a bit more manageable).

So, Microsoft trained a Tiramisu model from scratch with two classes: foreground and background. Only 249 images were provided for each class! The foreground and background training images were segmented using GrabCut with human interaction. The model achieved an accuracy rate of 93.7% at the training stage. The example image below shows an original image, the corresponding labelled training image (white is foreground and black is background), and the predicted Tiramisu result. Pretty good!

foreground-tiramisu

How did it fare in the real world? Apparently quite well. Here are some example images. The top row shows the automatically segmented image (i.e. with the background subtracted out) and the bottom row shows the original input images. Very neat 🙂

tiramisu-good-examples

The segmented images (e.g. top row in the above image) were then used to query a database. How this querying took place and what algorithm was used to detect potential matches is, however, not described in the blog post.

Microsoft has released all their code from this project so feel free to take a look yourselves.

Summary

In this post I introduced the topic of computer vision in the fashion industry. I described how the fashion industry is a huge business currently worth approximately US$2.4 trillion and how it is dominating on the online market. Since fashion is a visual trade, this is a perfect application for computer vision.

In the second part of this post I looked at what Microsoft did recently to develop a catalogue visual search system. They performed background subtraction on photos of fashion items using a DenseNet solution and these segmented images were used to query an already-existing catalogue.

Stay tuned for my next post which will look at what academia has been doing with respect to computer vision and the fashion industry.

(Update: this post is the first of a 3-part series. Part 2 can be found here, part 3 can be found here)

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

Eye-tracking-heatmap

Generating Heatmaps from Coordinates with Kernel Density Estimation

In last week’s post I talked about plotting tracked customers or staff from video footage onto a 2D floor plan. This is an example of the video analytics and data mining that can be performed on standard CCTV footage, giving you insightful information such as common movement patterns or common places of congestion at particular times of the day.

There is, however, another thing that can be done with these extracted 2D coordinates of tracked people: generation of heatmaps.

A heatmap is a visual representation or summary of data that uses colour to represent data values. Generally speaking, the more congested the data is at a particular location, the hotter the colour used to represent it.

The diagram at the top of this post shows an example heatmap for eye-tracking data (I did my PhD in eye-tracking, so this brings back memories :P) on a Wikipedia page. There, the hotter regions denote where more time was spent gazing by viewers.

There are many ways to create these heatmaps. In this post I will present one of them, with some supporting code at the end.

I’m going to assume that you have a list of coordinates in a file denoting the location of people on a 2D floor plan (see my previous post for how to obtain such a file from CCTV footage). Each line in the file is a coordinate at a specific point in time. For example, you might have something like this in a file called “coords.txt”:
x_coords,y_coords
200,301
205,300
208,300
210,300
210,300
210,300

Update: note the ‘301’ in the first row of coordinates in the y-axis column. After an update to one of the libraries I use below, the variance in a column can no longer be 0.

In this example we have somebody moving horizontally over the first few time intervals and then standing still for the last three. If we were to generate a heatmap here, you would expect hot colours around (210, 300) and cooler colours along the path from (200, 301) through to (208, 300).

But how do we get these hot and cold colours around our points and make the heatmap look smooth and beautiful? Well, some of you may have heard of a thing called a Gaussian kernel. That’s just a fancy name for a particular type of curve. Let me show you a 2D image of one:

sm-gaussian

That curve can also be drawn in 3D, like so (notice the hot and cold colours here!):

gaussian-kernel-3D

Now, I’m not going to go into too much detail on Gaussian kernels because it would involve venturing into university mathematics. If you would like to read up on them, this pdf goes into a lot of explanatory detail and this page explains nicely why it is so commonly used in the trade. For this post, all you need to know is that it’s a specific type of curve.

With respect to our heatmaps, then, the idea is to place one of these Gaussian kernels at each coordinate location that we have in our “coords.txt” file. The trick here is to notice that when Gaussian kernels overlap, their values are added together (where the overlapping occurs). So, for example, with the 2D kernel image above, if we were to put another kernel at the exact same location, the peak of the kernel would reach 0.8 (0.4 + 0.4 = 0.8).
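You can verify this additivity with a few lines of numpy (the peak of a standard Gaussian is 1/√(2π) ≈ 0.4):

import numpy as np

def gaussian(x, mu=0.0, sigma=1.0):
    # standard Gaussian kernel; its peak value is ~0.4 when sigma = 1
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 801)
one_kernel = gaussian(x)
two_kernels = gaussian(x) + gaussian(x)  # two kernels at the same location

print(one_kernel.max())   # ~0.3989
print(two_kernels.max())  # ~0.7979 - the overlapping kernels add up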

If you have clusters of points at a similar location, the Gaussian kernels at these locations would all push each other up.

The following image shows this idea well. There are 6 coordinates (the black marks on the x-axis) and a kernel placed at each of these (red dashed lines). The resulting curve is depicted in blue. The three congested kernels on the left push (by addition) the resulting curve up the highest.

pde
Gaussian kernels stacked on top of each other (image source)

This final plot of Gaussian kernels is actually called a kernel density estimation (KDE). It’s just a fancy name for a concept that really, at its core, isn’t too hard to understand!

A kernel density estimation can be performed over 2D points as well, and this is exactly what can be done with the coordinates in your “coords.txt” file. Take a look at the 3D picture of a single Gaussian kernel above and picture looking down at that curve from above. You would be looking at a heatmap!

Here’s a top-down view example but with more kernels (at the locations of the white points). Notice the hot colours at the more congested locations. This is where the kernels have pushed the resulting KDE up the highest:

density-heat-map

And that, ladies and gentlemen, is how you create a heatmap from a file containing coordinate locations.

And what about some accompanying code? For the project that I worked on, I used the seaborn Python visualisation library. That library has a kernel density estimator function called kdeplot:

# import the required packages
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt


# Library versions used in this code:
# Python: 3.7.3
# Pandas: 1.3.5
# Numpy: 1.21.6
# Seaborn: 0.12.1
# Matplotlib: 3.5.3


# load the coordinates file into a dataframe
coords = pd.read_csv('coords.txt')
# call the kernel density estimator function
ax = sns.kdeplot(data = coords, x="x_coords", y="y_coords", fill=True, thresh=0, levels=100, cmap="mako")
# the function has additional parameters, e.g. to change the colour palette,
# so if you need things customised, there are plenty of options

# plot your KDE
# once again, there are plenty of customisations available to you in pyplot
plt.show()


# save your KDE to disk
fig = ax.get_figure()
fig.savefig('kde.png', transparent=True, bbox_inches='tight', pad_inches=0)

It’s amazing what you can do with basic CCTV footage, computer vision, and a little bit of mathematical knowledge, isn’t it?

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

data-mining-image

Mapping Camera Coordinates to a 2D Floor Plan

Data mining is big business. Everyone is analysing mouse clicks, mouse movements, and customer purchase patterns. Such analysis has proven to give profitable insights that are driving businesses further than ever before.

But not many people have considered data mining videos. What about all that security footage that has stacked up over the years? Can we mine those for profitable insights also? Of course!

In this blog post I’m going to present a task that video analytics can do for you: the plotting of tracked customers or staff from video footage onto a 2D floor plan. 

Why would you want to do this? Well, plotting on a 2D plane will allow you to more easily data mine for things such as common movement patterns or common places of congestion at particular times of the day. This is powerful information to possess. For example, if you can deduce what products customers reached for in what order you can make important decisions with respect to the layout of your shelves and placement of advertising. 

Another benefit of this technique is that it is much easier to visualise movement patterns presented on a 2D plane than when they are shown on distorted CCTV footage. (In fact, in my next post I extend what I present here by showing you how to generate heatmaps from your tracking data – check it out).

However, if you have ever tried to undertake this task you may have come to the understanding that it is not as straightforward as you initially thought. A major dilemma is that your security camera images are distorted. For example, a one pixel movement at the top of your image corresponds to a much larger movement in the real world than a one pixel movement at the bottom of your image.

Where to begin? In tackling this problem, the first thing to realise is that we are dealing with two planes in Euclidean space. One plane (the floor in your camera footage) is “stretched out”, while the other is “laid flat”. We therefore need a transformation function to map points from one plane to the other.

The following image shows what we are trying to achieve (assume the chessboard is the floor in your shop/business):

chessboard-person-transformation
The task: map the plane from your camera to a perspective view

The next step, then, is to deduce what kind of transformation is necessary. Once we know this, we can start to look at the mathematics behind it and use this maths accordingly in our application. Here are some possible transformations:

different-transformations
Different types of transformations (image source)

Translations (the first transformation in the image above) are shifts in the x and y plane that preserve orientation. Euclidean transformations change the orientation of the plane but preserve the distances between points – definitely not our case, as mentioned earlier. Affine transformations are a combination of translation, rotation, scale, and shear. They can change the distances between points but parallel lines remain parallel after transformation – also not our case. Lastly, we have homographic transformations that can change a square into any form of a quadrilateral. This is what we are after.

Mathematically, homographic transformations are represented as such:

homographic-transformation

where (x,y) represent pixel coordinates in one plane, (x’, y’) represent pixel coordinates in another plane and H is the homography matrix represented as this 3×3 matrix:

homography-matrix
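For reference, the same relationship written out explicitly in homogeneous coordinates (with s an arbitrary scale factor, since H is only defined up to scale) is:

s \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} =
\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}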

Basically, the equation states this: given a point (x’, y’) in one plane, if I multiply it by the homography matrix H I will get the corresponding point (x, y) in the other plane. So, if we calculate H, we can map any pixel from our camera image to the flat image.

But how do you calculate this magic matrix H? To gloss over some intricate mathematics: we need at least 4 point pairs (4 corresponding points) from the two images to solve for H. The more point pairs we provide, the better the estimate of H will be.

Getting the corresponding point pairs from our images is easy, too. You can use an image editing application like GIMP. If you move your mouse over an image, the pixel coordinates of your mouse position are given at the bottom of the window. Jot down the pixel coordinates from one image and the corresponding pixel coordinates in the matching image. Get at least four such point pairs and you can then estimate H and use it to map any other point between the two images.

chessboard-point-pairs
Example of 3 corresponding points in two images

Now you can take the tracking information from your security camera and plot the positions of people on your perspective 2D floor plan. You can then analyse their walking paths, where they spent most of their time, where congestion frequently occurs, etc. Nice! But what’s even nicer is the simple code needed to do everything discussed here. The OpenCV library (the best image/video processing library around) provides all the methods you’ll need:

import cv2 # import the OpenCV library
import numpy as np # import the numpy library

# provide points from image 1
pts_src = np.array([[154, 174], [702, 349], [702, 572],[1, 572], [1, 191]])
# corresponding points from image 2 (i.e. (154, 174) matches (212, 80))
pts_dst = np.array([[212, 80],[489, 80],[505, 180],[367, 235], [144,153]])

# calculate matrix H
h, status = cv2.findHomography(pts_src, pts_dst)

# provide a point you wish to map from image 1 to image 2
a = np.array([[154, 174]], dtype='float32')
a = np.array([a])

# finally, get the mapping and print the corresponding point in image 2
pointsOut = cv2.perspectiveTransform(a, h)
print(pointsOut)

Piece of cake. Here is a short animation showing what you can do:

animation-after-transformation

Be sure to check out my next post where I show you how to generate heatmaps from the tracking data you just obtained from your security cameras.

To be informed when new content like this is posted, subscribe to the mailing list:

NASA-Mars-Rover

Computer Vision on Mars from the 2004 Exploration Rovers

I was doing my daily trawl of the internet a few days ago looking at the latest news in artificial intelligence (especially computer vision) and an image caught my eye. The image was of one of the Mars Exploration Rovers (MER) that landed on the Red Planet in 2004. Upon seeing the image I thought to myself: “Heck, those rovers must surely have used computer vision up there!?” So, I spent the day looking into this and, sure as can be, not only was computer vision used by these rovers, it in fact played an integral part in their missions.

In this post, then, I’m going to present to you how, where, and when computer vision was used by those MERs. It’s been a fascinating few days for me researching into this and I’m certain you’ll find this an interesting read also. I won’t go into too much detail here but I’ll give you enough to come to appreciate just how neat and important computer vision can be.

If you would like to read more about this topic, a good place to start is “Computer Vision on Mars” (Matthies, Larry, et al. International Journal of Computer Vision 75.1, 2007: 67-92.), which is an academic paper published by NASA in 2007. You can also follow any additional referenced publications there. All images in this post, unless otherwise stated, were taken from this paper.

Background Information

In 2003, NASA launched two rovers into space with the intention of landing them on Mars to study rocks and soils for traces of past water activity. MER followed upon three other missions to the surface of Mars: the two Viking lander missions of 1975 and 1976 and the Mars Pathfinder rover mission of 1997.

Due to constraints in processing power and memory capacity, no image processing was performed by the Viking landers. They only took pictures with their on-board cameras to be sent back to Earth.

The Sojourner (the name of the Mars Pathfinder rover), on the other hand, performed computer vision in one way only. It used stereoscopic vision to provide scientists with detailed maps of the terrain around the rover, which operators on Earth used to plan movement trajectories. Stereoscopic vision provides visual information from two viewing angles a short distance apart, just like our eyes do. This kind of vision is important because two views of the same scene allow for the extraction of 3D data (i.e. depth data). See this OpenCV tutorial on extracting depth maps from stereo images for more information on this.

The MER Rovers

The MER rovers, Spirit and Opportunity as they were named, were identical. Both had a 20 MHz processor, 128 MB of RAM, and 256 MB of flash memory. Not much to work with there, as you can see! Phones nowadays are about 1000 times more powerful.

The rovers also had a monocular descent camera facing directly down and three sets of stereo camera pairs: one pair each at the front and back of the rovers (called hazard cameras, or “hazcams” for short) and a pair of cameras (called navigation cameras, or “navcams” for short) on a mast 1.3m (4.3 feet) above the ground. All these cameras took 1024 x 1024 greyscale photos.

But wait, those colour photos we’ve seen so many times from these missions were fake, then? Nope! Cleverly, each of the stereoscopic camera lenses also had a wheel of 8 filters that could be rotated. Consecutive images could be taken with a different filter (e.g. infrared, ultra-violet, etc.) and colour extracted from a combination of these. Colour extraction was only done on Earth, however. All computer vision processing on Mars was therefore performed in greyscale. Fascinating, isn’t it?

MER-rover-components
Components of an MER rover (image source)

The Importance of Computer Vision in Space

If you’ve been around computer vision for a while you’ll know that for things such as autonomous vehicles, vision solutions are not necessarily the most efficient. For example, lidar (Light Detection And Ranging – a technique similar to sonar for constructing 3D representations of scenes by emitting pulsating laser light and then measuring reflections of it) can give you 3D obstacle avoidance/detection information much more easily and quickly. So, why did NASA choose to use computer vision (and so much of it, as I’ll be presenting to you below) instead of other solutions? Because laser equipment is fragile and it may not have withstood the harsh conditions of Mars. So, digital cameras were chosen instead.

Computer Vision on Mars

We now have information on the background of the mission and the technical hardware relevant to us so let’s move to the business side of things: computer vision.

The first thing I will talk about is the importance of autonomy in space exploration. Due to communication latency and bandwidth limitations, it is advantageous to minimise human intervention by allowing vehicles or spacecraft to make decisions on their own. The Sojourner had minimal autonomy and only ended up travelling approximately 100 metres (328 feet) during its entire mission (which lasted a good few months). NASA wanted the MER rovers to travel on average that much every day, so they put a lot of time and research into autonomy to help them reach this target.

In this respect, the result was that they used computer vision for autonomy on Mars in 3 ways:

  1. Descent motion estimation
  2. Obstacle detection for navigation
  3. Visual odometry

I will talk about each of these below. As mentioned in the introduction, I won’t go into great detail here but I’ll give you enough to satisfy that inner nerd in you 😛

1. Descent Image Motion Estimation System

Two years before the launch of the rocket that was to take the rovers to Mars, scientists realised that their estimates of near-surface wind velocities of the planet were too low. This could have proven catastrophic because severe horizontal winds could have caused irreparable damage upon an ill-judged landing of the rover. Spirit and Opportunity had horizontal impulse rockets that could be used to reduce horizontal velocity upon descent but no system to detect actual horizontal speed of the rovers.

Since a regular horizontal velocity sensor could not be installed due to cost and time constraints, it was decided to turn to computer vision for assistance! A monocular camera was attached to the base of the rover that would take pictures of the surface of the planet as the rovers were descending onto it. These pictures would be analysed in-flight to provide estimates of horizontal speeds in order to trigger the impulse rockets, if necessary.

The computer vision system for motion estimation worked by tracking a single feature (features are small “interesting” or “stand-out” patches in images). The feature was located in photos taken by the rovers and then the position of these patches was tracked between consecutive images.

Coupling this feature tracking information with measurements from the angular velocity and vertical velocity sensors (which were already installed for the purpose of on-surface navigation), the entire velocity vector (i.e. the magnitude and direction of the rover’s speed) could be calculated.

The feature tracking algorithm, called the Descent Image Motion Estimation System (DIMES), consisted of 7 steps, as summarised by the following image:

DIMES-algorithm

The first step reduces the image size to 256 x 256 resolution. The smaller the resolution, the faster that image processing calculations can be performed – but at the possible expense of accuracy. The second step was responsible for estimating the maximum possible area of overlap in consecutive images to minimise the search area for features (there’s no point in detecting features in regions of an image that you know are not going to be present in the second). This was done by taking into consideration knowledge from sensors of things such as the rover’s altitude and orientation. The third step picked out two features from an image using the Harris corner detector (discussed here in this OpenCV tutorial). Only one feature is needed for the algorithm to work but two were detected in case one feature could not be located in the following image. A few noise “clean-up” operations on images were performed in step 4 to reduce effects of things such as blurring.

Step 5 is interesting. The feature patches (aka feature templates) and search windows in consecutive images were rectified (rotated, twisted, etc.) to remove orientation and scale differences in order to make searching for features easier. In other words, the images were rotated, twisted and enlarged/diminished to be placed on the same plane. An example of this from the actual mission (from the Spirit rover’s descent) is shown in the image below. The red squares in the first image are the detected feature patches that are shown in green in the second image with the search windows shown in blue. You can see how the first and second images have been twisted and rotated such that the feature size, for example, is the same in both images.

DIMES-rectification-example

Step 6 was responsible for locating in the second image the two features found in the first image. Moravec’s correlator (an algorithm developed by Hans Moravec and published in his PhD thesis way back in 1980) was used for this. The general idea in this algorithm is to minimise the search area first instead of searching over every possible location in an image for a match. This is done by first selecting potential regions in an image for matches and only there is a more exhaustive search performed.
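As a loose modern analogue (this is not Moravec’s correlator, just an illustration of correlation-based patch search), OpenCV’s matchTemplate slides a feature patch over a search window and reports where it correlates best:

import cv2

# two consecutive frames (placeholder file names)
frame1 = cv2.imread('descent_frame1.png', cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread('descent_frame2.png', cv2.IMREAD_GRAYSCALE)

# take a feature patch (template) around a detected corner in the first frame
x, y, size = 120, 80, 32
template = frame1[y:y + size, x:x + size]

# correlate the template against the second frame and keep the best response
response = cv2.matchTemplate(frame2, template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(response)
print(best_loc, best_score)  # where the feature reappears, and how confidently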

The final step is combining all this information to calculate the velocity vector. In total, the DIMES algorithm took 14 seconds to run up there in the atmosphere of Mars. It was run by both rovers during their descent. The Spirit rover was the only one that fired its impulse rockets as a result of calculations from DIMES. Its horizontal velocity was at one stage reduced from 23.5 m/s (deemed to be slightly over a safe limit) to 11 m/s, which ensured a safe landing. Computer vision to the rescue! Opportunity’s horizontal speed was never calculated to be too fast so firing its stabilising rockets was considered to be unnecessary. It also had a successful landing.

All the above steps were performed autonomously on Mars without any human intervention. 

2. Stereo Vision for Navigation

To give the MER rovers as much autonomy as possible, NASA scientists developed a stereo-vision-based obstacle detection and navigation system. The idea behind it was to give the scientists the ability to simply provide the rovers each day with a destination and for the vehicles to work things out on their own with respect to navigation to this target (e.g. to avoid large rocks).

And their system performed beautifully.

The algorithm worked by extracting disparity (depth) maps from stereo images – as I’ve already mentioned, see this OpenCV tutorial for more information on this technique. What was done, however, by the rovers was slightly different to that tutorial (for example a simpler feature matching algorithm was employed), but the gist of it was the same: feature point detection and matching was performed to find the relationship between images and knowledge of camera properties such as focal lengths and baseline distances allowed for the derivation of depth for all pixels in an image. An example of depth maps calculated in this way by the Spirit rover is shown below:
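If you want to experiment with the general technique, OpenCV’s block-matching stereo correspondence produces a disparity map from a rectified stereo pair (a simple stand-in for the rovers’ pipeline; file names are placeholders):

import cv2

# rectified left/right images from a stereo camera pair
left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# block matching: compare small windows along corresponding scan lines
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# given the focal length f and camera baseline b, depth = (f * b) / disparity
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype('uint8')
cv2.imwrite('disparity.png', vis)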

depth-maps-spirit-rover
The middle picture was taken by Spirit and shows a rock on Mars approximately 0.5 m (1.6 feet) in height. The left image shows corresponding range information (red is closest, blue furthest). The right image shows corresponding height information.

Interestingly, the Opportunity rover, because it landed on a smoothly-surfaced plain, was forced to use its navcams (mounted on a mast) for navigation. Looking down from a higher angle meant that detailed texture from the sand could be used for feature detection and matching. Its hazcams returned only the smooth surface of the sand, and smooth surfaces are not amenable to feature detection (because, for example, they lack corners and edges). The Spirit rover, on the other hand, because it landed in a crater full of rocks, could use its hazcams for stereoscopic navigation.

3. Visual Odometry

Finally, computer vision on Mars was used at certain times to estimate the rovers’ position and travelling distance. No GPS is available on Mars (yet), and standard means of estimating distance travelled, such as counting the number of wheel rotations, were deemed during desert testing on Earth to be vulnerable to significant error due to one thing: wheel slippage. So, NASA scientists decided to employ motion estimation via computer vision instead.

Motion estimation was performed using feature tracking in 3D across successive shots taken by the navcams. To obtain 3D information, once again depth maps were extracted from stereoscopic images. Distances to features could easily be calculated from these and then the rovers’ poses were estimated. On average, 80 features were tracked per frame and a photo was taken for visual odometry calculations every 75 cm (30 inches) of travel.

Using computer vision to assist in motion estimation proved to be a wise decision because wheel slippage was quite severe on Mars. In fact, at one time the rover got stuck in sand and the wheels rotated in place for the equivalent of 50m (164 feet) of driving distance. Without computer vision the rovers’ estimated positions would have been severely inaccurate. 

There was another instance where this was strikingly the case. At one point the Opportunity rover was operating on a 17-20 degree slope in a crater, attempting to manoeuvre around a large rock. It had been trying to escape the rock for several days and had slid down the crater many times in the process. The image below shows the rover’s estimated trajectory (from a top-down view) using just wheel odometry (left), and the rover’s corrected trajectory (right) as assisted by computer vision calculations. The large rock is represented by the black ellipse. The corrected trajectory proved to be the more accurate estimation.

visual-odometry-example

Summary

In this post I presented the three ways computer vision was used by the Spirit and Opportunity rovers during their MER missions on Mars. These three ways were:

  1. Estimating horizontal speeds during their descent onto the Red Planet to ensure the rovers had a smooth landing.
  2. Extracting 3D information of its surroundings using stereoscopic imagery to assist in navigation and obstacle detection.
  3. Using stereoscopic imagery once again but this time to provide motion and pose estimation on difficult terrain.

In this way, computer vision gave the rovers a significant amount of autonomy (much, much more autonomy than its predecessor, the Sojourner rover) that ultimately gave the rovers a safe landing and allowed the robots to traverse up to 370 m (1213 feet) per day. In fact, the Opportunity rover is still active on Mars now. This means that the computer vision techniques described in this post are churning away as we speak. If that isn’t neat, I don’t know what is!

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

scale-cv-dl

Why Deep Learning Has Not Superseded Traditional Computer Vision

This is another post that’s been inspired by a question that has been regularly popping up in forums:

Has deep learning superseded traditional computer vision?

Or in a similar vein:

Is there still a need to study traditional computer vision techniques when deep learning seems to be so effective? 

These are good questions. Deep learning (DL) has certainly revolutionised computer vision (CV) and artificial intelligence in general. So many problems that once seemed intractable are now solved to a point where machines obtain better results than humans. Image classification is probably the prime example of this. Indeed, deep learning is responsible for placing CV on the map in the industry, as I’ve discussed in previous posts of mine.

But deep learning is still only a tool of computer vision. And it certainly is not the panacea for all problems. So, in this post I would like to elaborate on this. That is, I would like to lay down my arguments for why traditional computer vision techniques are still very much useful and therefore should be learnt and taught.

I will break the post up into the following sections/arguments:

  • Deep learning needs big data
  • Deep learning is sometimes overkill
  • Traditional CV will help you with deep learning

But before I jump into these arguments, I think it’s necessary to first explain in detail what I mean by “traditional computer vision”, what deep learning is, and also why it has been so revolutionary.

Background Knowledge

Before the emergence of deep learning if you had a task such as image classification, you would perform a step called feature extraction. Features are small “interesting”, descriptive or informative patches in images. You would look for these by employing a combination of what I am calling in this post traditional computer vision techniques, which include things like edge detection, corner detection, object detection, and the like.

In using these techniques – for example, with respect to feature extraction and image classification – the idea is to extract as many features as possible from images of one class of object (e.g. chairs, horses, etc.) and treat these features as a sort of “definition” (known as a bag-of-words) of the object. You would then search for these “definitions” in other images. If a significant number of features from one bag-of-words is located in another image, the image is classified as containing that specific object (i.e. a chair, a horse, etc.).
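A bare-bones sketch of such a bag-of-words pipeline might look like this (ORB and k-means are my illustrative choices; the file names, vocabulary size, and similarity measure are all placeholders you would tune):

import cv2
import numpy as np
from sklearn.cluster import KMeans

orb = cv2.ORB_create()

def descriptors(path):
    # extract local feature descriptors from an image
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(img, None)
    return desc.astype(np.float32)

# 1. pool descriptors from training images of one class into a 'vocabulary'
train_desc = np.vstack([descriptors(p) for p in ['chair1.png', 'chair2.png']])
vocab = KMeans(n_clusters=50, n_init=10).fit(train_desc)

def bag_of_words(path):
    # histogram of vocabulary 'words' found in an image - its 'definition'
    words = vocab.predict(descriptors(path))
    return np.bincount(words, minlength=50) / len(words)

# 2. classify a new image by comparing its histogram to the class definition
class_hist = bag_of_words('chair1.png')
query_hist = bag_of_words('query.png')
print(np.dot(class_hist, query_hist))  # crude similarity score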

The difficulty with this approach of feature extraction in image classification is that you have to choose which features to look for in each given image. This becomes cumbersome and pretty much impossible when the number of classes you are trying to classify for starts to grow past, say, 10 or 20. Do you look for corners? edges? texture information? Different classes of objects are better described with different types of features. If you choose to use many features, you have to deal with a plethora of parameters, all of which have to be fine-tuned by you.

Well, deep learning introduced the concept of end-to-end learning where (in a nutshell) the machine is told to learn what to look for with respect to each specific class of object. It works out the most descriptive and salient features for each object. In other words, neural networks are told to discover the underlying patterns in classes of images.

So, with end-to-end learning you no longer have to manually decide which traditional computer vision techniques to use to describe your features. The machine works this all out for you. Wired magazine puts it this way:

If you want to teach a [deep] neural network to recognize a cat, for instance, you don’t tell it to look for whiskers, ears, fur, and eyes. You simply show it thousands and thousands of photos of cats, and eventually it works things out. If it keeps misclassifying foxes as cats, you don’t rewrite the code. You just keep coaching it.

The image below portrays this difference between feature extraction (using traditional CV) and end-to-end learning:

traditional-cv-and-dl

So, that’s the background. Let’s jump into the arguments as to why traditional computer vision is still necessary and beneficial to learn.

Deep Learning Needs Big Data

First of all, deep learning needs data. Lots and lots of data. Those famous image classification models mentioned above are trained on huge datasets – ImageNet, with its million-plus labelled images, being the best-known example.

Easier tasks than general image classification will not require this much data but you will still need a lot of it. What happens if you can’t get that much data? You’ll have to train on what you have (yes, some techniques exist to boost your training data but these are artificial methods).

But chances are a poorly trained model will perform badly outside of your training data because a machine doesn’t have insight into a problem – it can’t generalise for a task without seeing data.

And it’s too difficult for you to look inside the trained model and tweak things around manually because a deep learning model has millions of parameters inside of it – each of which is tuned during training. In a way, a deep learning model is a black box.

Traditional computer vision gives you full transparency and allows you to better gauge and judge whether your solution will work outside of a training environment. You have insight into a problem that you can transfer into your algorithm. And if anything fails, you can much more easily work out what needs to be tweaked and where.

Deep Learning is Sometimes Overkill

This is probably my favourite reason for supporting the study of traditional computer vision techniques.

Training a deep neural network takes a very long time. You need dedicated hardware (high-powered GPUs, for example) to train the latest state-of-the-art image classification models in under a day. Want to train it on your standard laptop? Go on a holiday for a week and chances are the training won’t even be done when you return.

Moreover, what happens if your trained model isn’t performing well? You have to go back and redo the whole thing again with different training parameters. And this process can be repeated sometimes hundreds of times.

But there are times when all this is totally unnecessary, because sometimes traditional CV techniques can solve a problem much more efficiently and in fewer lines of code than DL. For example, I once worked on a project to detect whether each tin passing along a conveyor belt had a red spoon in it. Now, you can train a deep neural network to detect spoons and go through the time-consuming process outlined above, or you can write a simple colour thresholding algorithm for the colour red (any pixel within a certain range of red is coloured white, every other pixel is coloured black) and then count how many white pixels you have. Simple. You’re done in an hour!
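
For the curious, here’s a rough sketch of what such a colour thresholding check might look like in OpenCV; the HSV bounds and pixel-count threshold are assumptions you would tune for your own camera and lighting:

```python
import cv2
import numpy as np

# A minimal sketch of the red-spoon check described above (parameter
# values are assumptions to be tuned for your camera and lighting).
def tin_has_red_spoon(image_path, min_red_pixels=500):
    img = cv2.imread(image_path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Red wraps around the hue axis in HSV, so threshold two ranges
    # and combine them: white = red pixels, black = everything else.
    mask1 = cv2.inRange(hsv, np.array([0, 120, 70]), np.array([10, 255, 255]))
    mask2 = cv2.inRange(hsv, np.array([170, 120, 70]), np.array([180, 255, 255]))
    mask = cv2.bitwise_or(mask1, mask2)

    # Count the white pixels; enough of them means a spoon is present.
    return cv2.countNonZero(mask) > min_red_pixels
```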

Knowing traditional computer vision can potentially save you a lot of time and unnecessary headaches.

Traditional Computer Vision will Improve your Deep Learning Skills

Understanding traditional computer vision can actually help you be better at deep learning.

For example, the most common neural network used in computer vision is the Convolutional Neural Network. But what is a convolution? It’s in fact a widely used image processing technique (e.g. see Sobel edge detection). Knowing this can help you understand what your neural network is doing under the hood and hence design and fine-tune it better for the task you’re trying to solve.
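
Here’s a small illustration of a convolution applied by hand with a Sobel kernel; the input file name is a placeholder:

```python
import cv2
import numpy as np

# A convolution slides a small kernel across the image. Here we apply
# the classic Sobel kernel for vertical edges by hand -- exactly the
# kind of operation a CNN's early layers end up learning on their own.
img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

edges = cv2.filter2D(img, cv2.CV_32F, sobel_x)  # convolve kernel with image
```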

Then there is also pre-processing: steps frequently performed on the data you’re feeding into your model to prepare it for training. These pre-processing steps are predominantly performed with traditional computer vision techniques. For example, if you don’t have enough training data, you can perform a task called data augmentation. Data augmentation involves performing random rotations, shifts, shears, etc. on the images in your training set to create “new” images. By performing these computer vision operations you can greatly increase the amount of training data that you have.
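
As an example, here’s one possible augmentation setup using Keras’ ImageDataGenerator; the parameter values are assumptions you would tune for your own dataset:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One possible augmentation setup (all parameter values are assumptions):
# random rotations, shifts and shears create "new" training images
# from the ones you already have.
datagen = ImageDataGenerator(
    rotation_range=20,       # rotate up to +/- 20 degrees
    width_shift_range=0.1,   # shift horizontally by up to 10%
    height_shift_range=0.1,  # shift vertically by up to 10%
    shear_range=0.15,        # apply shear transformations
    horizontal_flip=True,    # mirror images left-right
)

# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches.
```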

Conclusion

In this post I explained why deep learning has not superseded traditional computer vision techniques and hence why the latter should still be studied and taught. Firstly, I looked at the problem of DL frequently requiring lots of data to perform well. Sometimes gathering that much data is simply not possible, and in these situations traditional computer vision can serve as an alternative. Secondly, deep learning can occasionally be overkill for a specific task; standard computer vision can solve some problems much more efficiently and in fewer lines of code than DL. Thirdly, knowing traditional computer vision can actually make you better at deep learning, because you can better understand what is happening under the hood of DL and you can perform the pre-processing steps that improve DL results.

In a nutshell, deep learning is just one tool in the computer vision toolbox and certainly not a panacea. Don’t use it only because it’s trendy right now. Traditional computer vision techniques are still very much useful, and knowing them can save you time and many headaches.

brain-reading

Machines That Can Read Our Minds from MRI Scans

Who here watches Black Mirror? Oh, I love that show (except maybe for Season 4 – only one episode there was any good, IMHO). For those that haven’t seen it yet, Black Mirror basically tries to extrapolate what society will look like if technological advances were to continue on their current trajectory.

A few of the technologies presented in that show are science fiction on steroids. But I think that many are not and that it’s only a matter of time before we reach that (dark?) high-tech reality.

A reader of this blog approached me recently with a (preprint) academic publication that sounded pretty much like it had been taken out of that series. It was a paper purporting that scientists can now reconstruct our visual thoughts by analysing our brain scans.

“No way. Too good to be true”, was my first reaction. There has to be a catch. So, I went off to investigate further.

And boy was my mind blown. Allow me in this post to present to you that publication out of Kyoto, Japan. Try telling me at the end that you’re not impressed!

(Note: although this post is not directly related to computer vision, I’m still including it on my blog because it deals with trying to understand how our brains work when it comes to visual stimuli. This is important because, many times in the history of computer vision and AI, it has been shown that biologically inspired solutions (i.e. solutions that mimic the way we behave) can obtain results as good as, if not superior to, other solutions. The prime example of this, of course, is neural networks.)

Machines that can read our minds

The paper in question is entitled “Deep image reconstruction from human brain activity” (Shen et al., bioRxiv, 2017). Although it is a preprint, meaning that it hasn’t been peer-reviewed and technically doesn’t hold much academic weight, the site it was published on is of repute (www.biorxiv.org), as are the institutes that co-authored the work (ATR Computational Neuroscience Laboratories and Kyoto University, the latter being one of Asia’s highest-ranked universities). Hence, I’m confident in the credibility of their research.

In a nutshell, the paper is claiming to have developed a way to reconstruct images that people are looking at or thinking about by solely analysing functional magnetic resonance imaging (fMRI). fMRI machines measure brain activity by detecting changes in blood flow through regions of the brain – generally speaking, the more blood flow, the more brain activity. Here’s an example fMRI image showing increased brain activity in orange/yellow:

(image source)

Let’s look at some details of their paper.

The first step was to use machine learning to examine the relationship between brain activity as detected by fMRI scans and the images that were being shown to test subjects. What’s interesting is that there was a focus on mapping hierarchical features between the fMRI scans and the real images.

Hierarchical features are a way to deconstruct images into ascending levels of detail. For example, to describe a car you could start with edges, work up to curves, then to particular shapes like circles, then to objects like wheels; finally you’d get four wheels, a chassis and some windows.

Deconstructing images by looking at hierarchical features (edges, curves, shapes, and so on) is also the way neural networks and our brains work with respect to visual stimuli. So, this plan of action to focus on hierarchical features in fMRI images seems intuitive.
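
As a small aside, here’s a sketch of hierarchical features in practice: pulling activations from an early, middle, and late layer of a pretrained VGG16 network. This is just an illustration of the general idea, not the authors’ setup:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Grab activations from three depths of a pretrained VGG16.
# Early layers respond to edges and colours; later layers respond
# to increasingly complex shapes and object parts.
base = VGG16(weights="imagenet", include_top=False)
layer_names = ["block1_conv1", "block3_conv1", "block5_conv1"]  # low to high level
extractor = Model(
    inputs=base.input,
    outputs=[base.get_layer(name).output for name in layer_names],
)
# extractor.predict(batch_of_images) returns one feature map per level.
```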

The hierarchical features extracted from fMRI scans were then used to replace the features of a deep neural network (DNN). Next, an iterative process was started in which base images (produced by a deep generator network) were fed into the DNN and, at each iteration, individual pixels of the image were adjusted until the image’s features matched those decoded from the fMRI scans to within a certain error threshold.
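
To give a rough feel for this, here’s an illustrative sketch of such an iterative feature-matching loop in TensorFlow. This is emphatically not the authors’ code: `dnn_features` and `target_features` are hypothetical stand-ins for the network’s activations and the features decoded from the fMRI scans:

```python
import tensorflow as tf

# Hypothetical: dnn_features(image) computes the DNN activations of an
# image; target_features holds the features decoded from fMRI scans.
image = tf.Variable(tf.random.normal([1, 224, 224, 3]))  # base image
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)

for step in range(500):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(dnn_features(image) - target_features))
    grads = tape.gradient(loss, [image])
    optimizer.apply_gradients(zip(grads, [image]))  # nudge the pixels themselves
    if loss < 1e-3:  # stop once the features match closely enough
        break
```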

It’s a complicated process, I know. I’ve tried to summarise it as best as I can here! To understand it more, you’ll need to go back to the original publication and then follow a trail of referenced articles. The image below shows this process – not that it helps much 🙂 But I think the most important part of this study is the results – and they’re coming up in the next section.

deep-image-reconstruction
Deep image reconstruction (source: original publication)

Test subjects were shown various images from three different classes: natural colour images (sampled from ImageNet), artificial geometrical shapes (in 8 colours), and black and white alphabetical letters. Scans of brain activity were conducted during the viewing sessions.

Interestingly, test subjects were also asked to recollect images they had been previously shown and brain scans during this task were also taken. The idea was to try to see if a machine could truly discern what a person was actually thinking rather than just viewing.

Results from the Study

This is where the fun begins!

But before we dive into the results, I need to mention that this is not the first time fMRI imaging has been used to attempt to reconstruct images. Similar projects (referenced in the publication we’re discussing) go back to 2011 and 2013. The novelty of this study, however, lies in working with hierarchical features rather than with the fMRI images directly.

Take a look at this video showing past results from previous studies. The images are monochromatic and of minimal resolution:

Now let’s take a look at some results from this particular study. (All images are taken from the original publication.)

How about a reconstructed image of a swan? Remember, this reconstructed image (right) is generated from someone’s brain scan (taken while the test subject was viewing the original image):

Example-Result-Swan
Example result #1: the test subject was shown the image on the left; the image on the right is the reconstructed image from brain scans.

You must be impressed with that!

Here’s a reconstructed picture of a duck:

experiment-result-duck
Example result #2: the test subject was shown the image on the left; the image on the right is the reconstructed image from brain scans.

For a presentation of results, take a look at this video released by the authors of the paper:

What about those letters I mentioned that were also shown to test subjects? Check this out:

results-from-text-experiment
Example results #3: the test subject was shown the letters in the top row; the bottom row shows the reconstructed images from brain scans.

Scary how good the reconstructions are, isn’t it? Imagine being able to tell what a person is reading at any given time!? I’ve already written a post about non-invasive lie detection from thermal imaging. If we can just work out how to capture fMRI data as unobtrusively as that (i.e. without the subject’s active participation), we’ll be able to tell what a person is looking at without them even knowing about it. Black Mirror stuff!

Results from the artificial geometric shapes images can be viewed here – also impressive, of course.

And how about reconstructed images from scans performed when people were thinking about the images they had been previously shown? Well, results here aren’t as impressive (see below). The authors concede this also. But hey! One step at a time, right?

results-imagined-images
Example results #4: the test subject was told to think about the images in the top row; the bottom row shows the attempted reconstructions of these images from brain scans.

Conclusion

In this post I presented a preprint publication in which it is purported that scientists were able to reconstruct visual thoughts by analysing brain scans. fMRI images were analysed for hierarchical features that were then used in a deep neural network. Base images were fed through this network and iteratively transformed until the features of each image matched the features in the neural network. Results from this study were presented in this post and I think we can all agree that they were quite impressive. This is the future, folks! Black Mirror stuff perhaps.

The potential uses for good of this technology are numerous – communicating with people in comas is one such use case. Of course, the abuse that could come from it is frightening, too. Issues with privacy (ah, that pesky perennial question in computer science!) spring to mind immediately.

But as you may have gathered from all my posts here, I look forward to what the future holds for us, especially with respect to artificial intelligence. And likewise with this technology. I can’t wait to see it progress.

Amazon Go – Computer Vision at the Forefront of Innovation

Where would a computer vision blog be without a post about the new cashier-less store recently opened to the public by Amazon? Absolutely nowhere.

But I don’t need additional motivation to write about Amazon Go (as the store is called) because I am, to put it simply, thrilled and excited about this new venture. This is innovation at its finest where computer vision is playing a central role.

How can you not get enthusiastic about it, then? I always love it when computer vision makes the news and this is no exception.

In this post, I wish to talk about Amazon Go under four headings:

  1. How it all works from a technical (as much as is possible) and non-technical perspective,
  2. Some of the reported issues prior to public opening,
  3. Some reported in-store issues post public opening, and
  4. Some potential unfavourable implications cashier-less stores may have in the future (just to dampen the mood a little)

So, without further ado…

How it Works – Non-Technically & Technically

The store has a capacity of around 90 people – so it’s fairly small in size, like a convenience store. To enter it you first need to download the official Amazon app and connect it to your Amazon Prime account. You then walk up to a gate like you would at a metro/subway and scan a QR code from the app. The gate opens and your shopping experience begins.

Inside the store, if you wish to purchase something, you simply pick it up off the shelf and put it in your bag or pocket. Once you’re done, you walk out of the shop and a few minutes later you get emailed a receipt listing all your purchases. No cashiers. No digging around for money or cards. Easy as pie!

What happens on the technical side of things behind the scenes? Unfortunately, Amazon hasn’t disclosed much at all, which is a bit of a shame for nerds like me. But I shouldn’t complain too much, I guess.

What we do know is that sensor fusion is employed (sensor fusion is when data is combined from multiple sensors/sources to provide a higher degree of accuracy) along with deep learning.

Hundreds of cameras and depth sensors are attached to the ceiling around the store:

amazon-go-cameras
The cameras and depth sensors located on the ceiling of the store (image source)

These track you and your movements (using computer vision!) throughout your expedition. Weight sensors can also be found on the shelves to assist the cameras in discerning which products you have chosen to put into your shopping basket.

(Note: sensor fusion is also being employed in autonomous cars. Hopefully I’ll be writing about this as well soon.)
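
To illustrate the general idea of sensor fusion (and only the general idea – Amazon hasn’t told us how its system actually works), here’s a toy sketch in Python in which a camera’s guess is re-weighted by the shelf’s weight sensor; every name and number in it is made up:

```python
# A toy sketch of sensor fusion (definitely not Amazon's actual system).
# The camera produces a probability for each candidate item; the shelf's
# weight sensor reports how many grams disappeared. We re-weight the
# camera's guess by how well each item's known weight explains that change.
def fuse_estimates(vision_probs, weight_delta_g, item_weights_g):
    fused = {}
    for item, p_vision in vision_probs.items():
        # Likelihood shrinks as an item's expected weight diverges
        # from the observed weight change on the shelf.
        error = abs(item_weights_g[item] - weight_delta_g)
        fused[item] = p_vision / (1.0 + error)
    total = sum(fused.values())
    return {item: p / total for item, p in fused.items()}

# The camera alone is unsure, but a 165 g weight change points
# strongly at the yoghurt:
vision = {"yoghurt": 0.55, "ice cream": 0.45}
weights_g = {"yoghurt": 170, "ice cream": 500}
print(fuse_estimates(vision, 165, weights_g))
```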

In 2015, Amazon filed a patent application for its cashier-less store in which it stated the use of RGB cameras (i.e. colour cameras) along with facial recognition. TechCrunch, however, has reported that the Vice President of Technology at Amazon Go told them that no facial recognition algorithms are currently being used.

In-Store Issues Prior to Public Opening

Although the store opened its doors to the public a few weeks ago, it has been open to employees since December 2016. Initially, Amazon expected the store to be ready for public use a few months after that, but public opening was delayed by nearly a year due to “technical problems”.

We know what some of the dilemmas behind these “technical problems” were.

Firstly, Amazon had problems tracking more than 20 people in the store. If you’ve ever worked on person-tracking software, you’ll know how hard it is to track a crowd of people with similar body types and wearing similar clothes. But it looks like this has been resolved (to at least a satisfactory level for them). It’s a shame for us, though, to not be given more information on how Amazon managed to get this to work.

Funnily enough, some employees of Amazon knew about this problem and in November last year tried to see if a solution had been developed. Three employees dressed up in Pikachu costumes (as reported by Bloomberg here) while doing their round of shopping to attempt to fool the system. Amazon Go passed this thorough, systematic, and very scientific test. Too bad I couldn’t find any images or videos of this escapade!

We also know that initially engineers were assisting the computer vision system behind the scenes. The system would let these people know when it was losing confidence with its tracking results and would ask them to intervene, at least minimally. Nobody is supposedly doing this task any more.

Lastly, I also found information stating that the system would run into trouble when products were taken off the shelf and placed back on a different shelf. This was reported to have occurred when employees brought their children into the store and they ran wild a little (as children do).

This also appears to have been taken care of, because someone from the public attempted to do this on purpose last week (see this video) with no adverse effects, it would seem.

It’s interesting to see the growing pains that Amazon Go had to go through, isn’t it? How they needed an extra year to try to iron out all these creases. This is such a huge innovation. Makes you wonder what “creases” autonomous cars will have when they become more prominent!

In-Store Issues Post Public Opening

But, alas. It appears as though not all creases were ironed out to perfection. Since Amazon Go’s opening a few weeks ago, two issues have been written about.

The first is of Deirdre Bosa of CNBC not being charged for a small tub of yoghurt:


The Vice President of Amazon Go responded in the following way:

First and foremost, enjoy the yogurt on us. It happens so rarely that we didn’t even bother building in a feature for customers to tell us it happened. So thanks for being honest and telling us. I’ve been doing this a year and I have yet to get an error.

The yoghurt manufacturer replied to that tweet also:

To which Deirdre responded: “Thanks Siggi’s! But I think it’s on Amazon :)”

LOL! 🙂

But as Amazon Go stated, it’s a rarity for these mistakes to happen. Or is that only the case until someone works out a flaw in the system?

Well, it seems as though someone has!

In this video, Tim Pool states that he managed to walk out of the Amazon Go store with a bag full of products and was only charged for one item. According to him it is “absurdly easy to take a bag full of things and not get charged”. That’s a little disconcerting. It’s one thing when the system makes a mistake every now and then. It’s another thing when someone has worked out how to break it entirely.

Tim Pool says he has contacted Amazon Go to let them know of the major flaw. Amazon confirmed with him that he did not commit a crime but “if used in practice what we did would in fact be shoplifting”.

Ouch. I bet engineers are working on this frantically as we speak.

One more issue worth mentioning – not really a flaw, but something that could also be abused – is that at the moment you can request a refund on any item without returning it. No questions asked. Linus Tech Tips shows in this video how easily this can be done. Of course, since your Amazon Go account needs to be linked to your Amazon Prime account, if you do this too many times Amazon will catch on and will probably take some form of preventative action against you, or will even verify your claims by looking back at past footage of you.

Cons of Amazon Go

Like I said earlier, I am really excited about Amazon Go. I always love it when computer vision spearheads innovation. But I also think it’s important to talk in this post about the potential unfavourable implications of a cashier-less store.

Potential Job Losses

The first most obvious potential con of Amazon Go is the job losses that might ensue if this innovation catches on. Considering that 3.5 million people in the US are employed as cashiers (it’s the second-most common job in that country), this issue needs to be raised and discussed. Heck, there have already been protests in this respect outside of Amazon Go:

amazon-go-protest
Protests in front of the Amazon Go store (image source)

Bill Ingram, the organiser of the protest shown above, asks: “What will all the cashiers do once their jobs are automated?”

Amazon, not surprisingly, has issued statements on this topic. It has said that although some jobs may be taken by automation, people can be relocated to improve other areas of the store by, for example:

Working in the kitchen and the store, prepping ingredients, making breakfast, lunch and dinner items, greeting customers at the door, stocking shelves and helping customers

Let’s not forget that new jobs have also been created. For example, additional people need to be hired to manage the technological infrastructure behind this huge endeavour.

Personally, I’m not a pessimist about automation either. The industrial revolution brought automation to so many walks of life; the transition was hard at first, but society found ways to re-educate workers into other areas. The same will happen, I believe, if cashier-less stores become prominent (and autonomous cars too, for that matter).

An Increase in Unhealthy Impulse Purchases

Manoj Thomas, a professor of marketing at Cornell University, has stated that our shopping behaviour will change around cashier-less stores:

[W]e know that when people use any abstract form of payment, they spend more. And the type of products they choose changes too.

What he’s saying is that psychological research has shown that the more distance we put between ourselves and the “pain of paying”, the more discipline we need to avoid those pesky impulse purchases. Having cash physically in your hand means you can keep track of what you’re doing with your money more easily. And that extra bit of time waiting in line at the cashier could be time enough to reconsider purchasing that tub of chocolate and vanilla ice cream :/

Even More Surveillance

And then we have the perennial question of surveillance. When is too much, too much? How much more data about us can be collected?

With such sophisticated surveillance in-store, companies are going to have access to even more behavioural data about us: which products I looked at for a long time; which products I picked up but put back on the shelf; my usual path around a store; which advertisements made me smile – the list goes on. Targeted advertising will become even more effective.

Indeed, Bill Ingram’s protest pictured above was also about this (hence the masks worn at it). According to him, we’re heading in the wrong direction:

If people like that future, I guess they can jump into it. But to me, it seems pretty bleak.

Harsh, but there might be something to it.

Less Human Interaction

Albert Borgmann, a great philosopher on technology, coined the term device paradigm in his book “Technology and the Character of Contemporary Life” (1984). In a nutshell, the term is used to explain the hidden, detrimental nature and power of technology in our world (for a more in-depth explanation of the device paradigm, I highly recommend you read his philosophical works).

One of the things he laments is how we are increasingly losing daily human interactions due to the proliferation of technology. The sense of a community with the people around us is diminishing. Cashier-less stores are pushing this agenda further, it would seem. And considering, according to Aristotle anyway, that we are social creatures, the more we move away from human interaction, the more we act against our nature.

The Chicago Tribune wrote a little about this at the bottom of this article.

Is this something worth considering? Yes, definitely. But only in the bigger picture of things, I would say. At the moment, I don’t think accusing Amazon Go of trying to damage our human nature is the way to go.

Personally, I think this initiative is something to celebrate – albeit, perhaps, with just the faintest touch of reservation. 

Summary

In this post I discussed the cashier-less store “Amazon Go” recently opened to the public. I looked at how the store works from a technical and non-technical point of view. Unfortunately, I couldn’t say much from a technical angle because of how little information Amazon has disclosed. I also discussed some of the issues that the store has dealt with and is dealing with now. I mentioned, for example, that initially there were problems in trying to track more than 20 people in the store, but this appears to have been solved to a satisfactory level (for Amazon, at least). Finally, I dampened the mood a little by discussing the potential unfavourable implications that a proliferation of cashier-less stores may have on our societies. Some of the issues raised here are important but ultimately, in my humble opinion, this endeavour is something to celebrate – especially since computer vision is playing such a prominent role in it.

graph-upward-trend-why

The Reasons Behind the Recent Growth of Computer Vision

In my previous post I looked at the unprecedented growth of computer vision in the industry. 10 years ago computer vision was nowhere to be seen outside of academia. But things have since changed significantly. A telling sign of this is the consistent tripling each year of venture capital funding in computer vision. And Intel’s acquisition of Mobileye in March 2017 for a whopping US$15.3 billion just sums up the field’s achievements.

In that post, however, I only briefly touched upon the reasons behind this incredible growth. The purpose of this article, then, is to fill that gap.

In this respect, I will discuss the top 4 reasons behind the growth of computer vision in the industry. I’ll do so in the following order:

  1. Advancements in hardware
  2. The emergence of deep learning
  3. The advent of large datasets
  4. The increase in computer vision applications

Better and More Dedicated Hardware

I mentioned in another post of mine that one of the main reasons why image processing is such a difficult problem is that it deals with an immense amount of data. To process this data you need memory and processing power. These have been increasing in size and power regularly for over 50 years (cf. Moore’s Law).

Such increases have allowed for algorithms to run faster to the point where more and more things are now capable of being run in real-time (e.g. face recognition).

We have also seen the emergence and proliferation of dedicated hardware for graphics and image processing calculations. GPUs are the prime example of this. A GPU’s clock speed may generally be slower than a regular CPU’s, but with its thousands of cores working in parallel it can vastly outperform a CPU on these specific tasks.

Dedicated hardware is becoming so highly prized in computer vision that numerous companies have started designing and producing their own. Just two weeks ago, for example, Ambarella announced two new chips designed for computer vision processing, chiefly with autonomous cars, drones, and security cameras in mind. And last year, Mythic, a startup based in California, raised over US$10 million to commercialise its own deep-learning-focused hardware.

The Emergence of Deep Learning

Deep learning, a subfield of machine learning, has been revolutionary in computer vision. Because of it machines are now getting better results than humans in important tasks such as image classification (i.e. detecting what object is in an image).

Previously, if you had a task such as image classification, you would perform a step called feature extraction. Features are small “interesting”, descriptive or informative patches in images. The idea is to extract as many of these as possible from images of one class of object (e.g. chairs, horses, etc.) and treat these features as a sort of “definition” (known as a bag-of-words) of the object. You would then search for these “definitions” in other images. If a significant number of features from one bag-of-words are located in another image, the image is classified as containing that specific object (i.e. a chair, a horse, etc.).

The difficulty with this approach is that you have to choose which features to look for in each given image. This becomes cumbersome and pretty much impossible when the number of classes you are trying to classify for starts to grow past, say, 10 or 20. Do you look for corners? edges? texture information? Different classes of objects are better described with different types of features. If you choose to use many features, you have to deal with a plethora of parameters, all of which have to be fine-tuned.

Well, deep learning introduced the concept of end-to-end learning where (in a nutshell) the machine is told to learn what to look for with respect to each specific class of object. It works out the most descriptive and salient features for each object. In other words, neural networks are told to discover the underlying patterns in classes of images.

The image below portrays this difference between feature extraction and end-to-end learning:

traditional-cv-and-dl

Deep learning has proven to be extremely successful for computer vision. If you look below at the graph I used in my previous post showing capital investments into US-based computer vision companies since 2011, you can see that when deep learning became mainstream in around 2014/2015, investments suddenly doubled and have been growing at a regular rate since.

cv-growth-graph
(image source)

You can safely say that deep learning put computer vision on the map in the industry. Without it, chances are computer vision would still be stuck in academia (not that there’s anything wrong with academia, of course).

Large Datasets

To allow a machine to learn the underlying patterns of classes of objects it needs A LOT of data. That is, it needs large datasets. More and more of these have been emerging and have been instrumental in the success of deep learning and therefore computer vision.

Before around 2012, a dataset was considered relatively large if it contained 100+ images or videos. Now, datasets exist with numbers ranging in the millions.

Here are some of the best-known image classification datasets currently being used to test and train the latest state-of-the-art object classification/recognition models. They have all been meticulously hand annotated.

  • ImageNet – 15 million images, 22,000 object categories. It’s HUGE! (I hope to write more about this dataset in the near future, so stay tuned for that).
  • Open Images – 9 million images, 5,000 object categories.
  • Microsoft Common Objects in Context (COCO) – 330K images, 80 object categories.
  • PASCAL VOC Dataset – a few versions exist, 20 object categories.
  • CALTECH-101 – 9,000 images with 101 object categories.

I also need to mention Kaggle and the University of California, Irvine (UCI) Machine Learning Repository. Kaggle hosts 351 image datasets ranging from flowers and wines to forest fires. It is also the home of the famous Facial Expression Recognition Challenge (FER), the aim of which is to correctly classify the emotion shown on faces in nearly 35,000 images into one of seven categories.

All these datasets and many more have raised computer vision to its current position in the industry. Certainly, deep learning would not be where it is now without them.

More Applications

Faster machines, larger memories, and other advances in technology have increased the number of useful things machines have been able to do for us in our lives. We now have autonomous cars (well, we’re close to having them), drones, factory robots, cleaning robots – the list goes on. With an increase in such vehicles, devices, tools, appliances, etc. has come an increase in the need for computer vision.

Let’s take a look at some examples of recent new ways computer vision is being used today.

Walmart, for example, released shelf-scanning robots into 50 of its stores a few months ago. The purpose of these robots is to detect out-of-stock and misplaced items and other problems such as incorrect labelling. Here’s a picture of one of these robots at work:

(image source)

A British online supermarket is using computer vision to determine the best ways to grasp goods for its packaging robots. This video shows their robot in action:

Agriculture, too, is capitalising on the growth of computer vision. iUNU, for example, is developing a network of cameras on rails to help greenhouse owners keep track of how their plants are growing.

The famous Roomba autonomous vacuum cleaner got an upgrade a few years ago with a new computer vision system to more smartly manoeuvre around your home.

And our phones? Plenty of computer vision being used in them! Ever noticed your phone camera tracking and focusing on your face when you’re trying to take a picture? That’s computer vision. And how about the face recognition services to unlock your phones? Well, Samsung’s version of it can be classified as computer vision (I write about it in this post).

There’s no need to mention autonomous cars here. We are constantly hearing about them on the news. It’s only a matter of time before we’ll be jumping into one.

Computer vision is definitely here to stay. In fact, it’s only going to get bigger with time.

Summary

In this post I looked at the four main reasons behind the recent growth of computer vision: 1) the advancements in hardware such as faster CPUs and availability of GPUs; 2) the emergence of deep learning, which has changed our way of performing tasks such as image classification; 3) the advent of large datasets that have allowed us to more meticulously study the underlying patterns in images; and 4) the increase in computer vision applications.

All these factors (not always mutually exclusive) have contributed to the unprecedented position of computer vision in the industry.

As I mentioned in my previous post, it’s been an absolute pleasure to have witnessed this growth and to have seen these factors in action. I truly look forward to what the future holds for computer vision.
