fashion-computer-vision

Computer Vision in the Fashion Industry – Part 3

In my last two posts (part 1 can be found here, part 2 can be found here) I’ve looked at computer vision and the fashion industry. I introduced the lucrative fashion industry and showed what Microsoft recently did in this field with computer vision. I also presented two papers from last year’s International Conference on Computer Vision (ICCV).

In this post, the final of the series, I would like to present to you two papers from last year’s ICCV workshop that was entirely devoted to fashion:

  1. “Dress like a Star: Retrieving Fashion Products from Videos” (N. Garcia and G. Vogiatzis, ICCV Workshop, 2017, pp. 2293-2299) [source code]
  2. “Multi-Modal Embedding for Main Product Detection in Fashion” (Rubio et al., ICCV Workshop, 2017, pp. 2236-2242) [source code]

Once again, I’ve provided links to the source code so that you can play around with the algorithms as you wish. Also, as in previous posts, I am going to provide you with just an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

Dress Like a Star

This paper is impressive because it was written by a PhD student from Birmingham in the UK. By publishing at the ICCV Workshop (I discussed in my previous post how important this conference is), Noa Garcia has pretty much guaranteed her PhD and quite possibly a future research position. Congratulations to her! However, I do think the authors cheated a bit to get into this ICCV workshop, as I explain further down.

The idea behind the paper is to provide a way to retrieve clothing and fashion products from video content. Sometimes you may be watching a TV show, film or YouTube clip and think to yourself: “Oh, that shirt looks good on him/her. I wish I knew where to buy it.”

The proposed algorithm works like this: you provide it with a photo of a screen that is playing the video content, it queries a database, and it then returns the clothing items that appear in the matched frame, as shown in this example image:

dress-like-star-example-image
(image source: original publication)

Quite a neat idea, wouldn’t you say?

The algorithm has three main modules: product indexing, training phase, and query phase.

The first two modules are performed offline (i.e. before the system is released for use). They require one database to be set up with video clips and another with clothing articles. Then, the clothing items and video frames are matched to each other with some heavy computing (this is why it has to be performed offline – there’s a lot of computation here that cannot be done in real time).

You may be thinking: but heck, how can you possibly store and analyse all video content with this algorithm!? Well, to save storage and computation space, each video is processed (offline) and divided into shots/scenes that are then summarised into a single vector containing features (features are small “interesting” or “stand-out” patches in images).

Hence, in the query phase, all you need to do is detect features in the provided photo, search for these features in the database (rather than the raw frames), locate the scene depicted in the photo in the video database, and then extract the clothing articles in the scene.
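To make the indexing-and-query idea concrete, here is a minimal sketch in Python using OpenCV’s ORB features. ORB is my illustrative choice – the paper uses its own feature pipeline and indexing scheme – and the file names are placeholders:

import cv2

# --- offline: index feature descriptors for one representative frame per scene ---
orb = cv2.ORB_create(nfeatures=500)

def describe(image_path):
    # detect keypoints and compute their binary descriptors
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = orb.detectAndCompute(img, None)
    return descriptors

scene_index = {scene_id: describe(path)
               for scene_id, path in [('s1', 'scene1.png'), ('s2', 'scene2.png')]}

# --- online: match a query photo against the index, not against raw frames ---
def best_scene(query_path):
    query = describe(query_path)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    scores = {sid: len(matcher.match(query, desc))
              for sid, desc in scene_index.items()}
    return max(scores, key=scores.get)  # scene with the most matched features

print(best_scene('photo_of_tv.png'))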

To evaluate this algorithm, the authors set up a system with 40 movies (80+ hours of video). They were able to retrieve the scene from a video depicted in a photo with an accuracy of 87%.

Unfortunately, in their experiments, they did not set up a fashion item database but left this part as “future work”. That’s a bit of a letdown, and I would call it “twisting the truth” in order to get into a fashion-dedicated workshop. But, as they state in the conclusion: “the encouraging experimental results shown here indicate that our method has the potential to index fashion products from thousands of movies with high accuracy”.

I’m still calling this cheating 🙂

Main Product Detection in Fashion

This paper discusses an algorithm to extract the main clothing product in an image according to the textual information associated with it – like in a fashion magazine, for example. The purpose of this algorithm is to extract single articles of clothing that can then be used to enhance other datasets that need to work solely with “clean” images. Such datasets include the ones used in fashion catalogue searches (e.g. as discussed in the first post in this series) or in “virtual fitting room” systems (e.g. as discussed in the second post in this series).

The algorithm works by utilising deep neural networks (DNNs). (Is that a surprise? There’s just no escaping deep learning nowadays, is there?) To cut a long story short, neural networks are trained to extract bounding boxes of fashion products that are then used to train other DNNs to match products with textual information.
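To give a feel for the matching step, here is a minimal joint-embedding sketch in PyTorch (my own simplification with made-up feature dimensions, not the authors’ exact architecture): image crops and text are projected into a shared vector space, and the candidate box most similar to the description is chosen as the main product.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    # project image features and text features into one shared space
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feat, txt_feat):
        # L2-normalise so that a dot product equals cosine similarity
        return (F.normalize(self.img_proj(img_feat), dim=-1),
                F.normalize(self.txt_proj(txt_feat), dim=-1))

model = JointEmbedding()
boxes = torch.randn(5, 2048)   # stand-in CNN features of 5 detected products
text = torch.randn(1, 300)     # stand-in embedding of the product description
img_emb, txt_emb = model(boxes, text)
scores = img_emb @ txt_emb.T   # similarity of each box to the text
print(scores.argmax().item())  # index of the predicted main product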

Example results from the algorithm are shown below.

main-product-extraction
(image source: original publication)

You can see above how the algorithm nicely finds all the articles of clothing (sunglasses, shirt, necklace, shoes, handbag) but only highlights the pants as the main product in the image according to the textual information associated with the picture.

Summary

In this post, the final of the series, I presented two papers from last year’s ICCV workshop that was entirely devoted to fashion. The first paper describes a way to retrieve clothing and fashion products from video content by providing it with a photo of a computer/TV screen. The second paper discusses an algorithm to extract the main clothing product in an image according to any textual information associated with it.

I always say that it’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Some of the ideas from the academic world I’ve looked at in this series leave a lot to be desired but that’s the way research is: one small step at a time.

(Part 1 of this series of posts can be found here, part 2 can be found here.)

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

fashion-computer-vision

Computer Vision in the Fashion Industry – Part 2

(Update: this post is part 2 of a 3 part series on CV and fashion. Part 1 can be found here, part 3 can be found here.)

In my last post I introduced the fashion industry and I gave an example of what Microsoft recently did in this field with computer vision. In today’s post, I would like to show you what the academic world has recently been doing in this respect.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives. Artificial intelligence, with deep learning at the forefront, is a prime example of this. Hence why I’m always keen to keep up-to-date with the goings-on of computer vision in academia.

A good way to keep abreast of computer vision in the academic world is to follow two of the top conferences in the field: the International Conference on Computer Vision (ICCV) and the Conference on Computer Vision and Pattern Recognition (CVPR). These annual conferences are huge. This is where the best of the best come together; where the titans of computer vision pit their wits against each other. Believe me, publishing in either one of these conferences is a lifetime achievement. I have 10 publications in total there (that’s a lie… I have none :P).

Interestingly, it seems the academic world has been eyeing the fashion industry for the last five years. Fashwell recently performed an analysis counting the number of fashion-related papers at these two conferences. There has been a steady increase since 2013, as the graph below depicts:

Notice the huge spike at the end? The reason for it is that last year ICCV held an entire workshop specifically devoted to fashion.

(Note: a workshop is a, let’s say, less-formal format of a conference usually held as a side event to a major conference. Despite this, publishing at an ICCV or CVPR workshop is still a major achievement.)

As a result, there is plenty of material for me to present to you on the topic of computer vision in the fashion industry. Let’s get cracking!

ICCV 2017

In this post I will present two closely related papers from the 2017 ICCV conference (in my next post I’ll present a few from the workshop):

  1. “Be Your Own Prada: Fashion Synthesis With Structural Coherence” (Zhu et al., ICCV, 2017, pp. 1680-1688) [source code]
  2. “A Generative Model of People in Clothing” (Lassner et al., ICCV, 2017, pp. 853-862) [source code]

Just like with all my posts, I will give you an overview of these publications. Most papers published at this level require a (very) strong academic background to fully grasp, so I don’t want to go into that much detail here.

But I have provided links to the source code of these papers, so please feel free to download, install and play around with these beauties at home.

1. Be Your Own Prada

This is a paper that presents an algorithm that can generate new clothing on an existing photo of a person without changing the person’s shape or pose. The desired new outfit is provided as a sentence description, e.g.: “a white blouse with long sleeves but without a collar and blue jeans”.

This is an interesting idea! You can provide the algorithm with a photo of yourself and then virtually try on a seemingly endless combination of styles of shirts, dresses, etc.

“Do I look good in blue and red? I dunno, but let’s find out!”

Neat, hey?

The algorithm has a two-step process:

  1. Image segmentation. This step semantically breaks the image up into human body parts such as face, arms, legs, hips, torso, etc. The result basically captures the shape of the person’s body and parts, but not their appearance, as shown in the example image below. Also, along with image segmentation, other attributes are extracted such as skin colour, long/short hair, and gender to provide constraints and boundaries for how far the image rendering step can go (you don’t want to change the person’s skin or hair colour, for example). The segmentation step is performed using a trained generative adversarial network (GAN – see this post for a description of these).

    input-image-segmentation
    (image adapted from original publication)
  2. Image rendering. This is the part that places new outfits onto the person using the results (segmentation and constraints/boundaries) from the first step as a guide. GANs are used here again. Example clothing articles were taken from 80,000 annotated images selected from the DeepFashion dataset.

Let’s take a look at some results (taken from the original publication). Remember, all that is provided is one picture of a person and a description of how that person’s outfit should look:

prada-results

Pretty cool! You could really see yourself using this, couldn’t you? We might be using something like this on our phones soon, I would say. Take a look at the authors’ page for this paper for more example result images. Some amazing stuff there.

2. A Generative Model of People in Clothing

This paper is still a work in progress, meaning that more research is needed before anything from it gets rolled out for everyday use. The intended goal of the algorithm is similar to the one presented above, but instead of generating images of the same person wearing a different outfit, this algorithm generates random images of different people wearing different outfits. Usually, generating such images requires a complex 3D graphics rendering pipeline.

It is a very complex algorithm but, in a nutshell, it first creates a dataset containing human pose, shape, and face information along with clothing articles. This information is then used to learn the relationships between body parts and their respective clothes, and how these clothes fit the appropriate body parts depending on the person’s pose and shape.

The dataset is created using the SMPLify 3D pose and shape estimation algorithm on the Chictopia10K fashion dataset (that was collected from the Chictopia fashion website) as well as dlib‘s implementation of the fast facial shape matcher to enhance each image with facial information.

Let’s take a look at some results.

The image below shows a randomly generated person wearing different coloured clothes (provided manually). Notice that, for example, with the skirts, the model learnt to put different wrinkles on the skirt depending on its colour. Interesting, isn’t it? The face on the person seems out of place – one reason why the algorithm is still a work in progress.

clothnet-results

The authors of the paper also attempted to create a random fashion magazine photo dataset from their algorithm. The idea behind this was to show that fashion magazines could perhaps one day generate photos automatically without going through the costly process of setting up photo sessions with real people. Once again, the results leave a lot to be desired but it’s interesting to see where research is heading.

clothnet-results2

Summary

This post extended my last post on computer vision in the fashion industry. I first examined how fashion is increasingly being looked at in computer vision academic circles. I then presented two papers from ICCV 2017. The first paper describes an algorithm to generate new attire on an existing photo of a person without changing the person’s shape or pose; the desired new outfit is provided as a sentence description. The second paper shows a work-in-progress algorithm to randomly generate images of different people wearing different outfits.

It’s interesting to follow the academic world because every so often what you see happening there ends up being brought into our everyday lives.

(Update: this post is part 2 of a 3 part series on CV and fashion. Part 1 can be found here, part 3 can be found here.)

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

fashion-computer-vision

Computer Vision in the Fashion Industry – Part 1

(image source)

(Update: this post is the first of a 3-part series. Part 2 can be found here, part 3 can be found here)

Computer vision has a plethora of applications in the industry: cashier-less stores, autonomous vehicles (including those loitering on Mars), security (e.g. face recognition) – the list goes on endlessly. I’ve already written about the incredible growth of this field in the industry and, in a separate post, the reasons behind it.

In today’s post I would like to discuss computer vision in a field that I haven’t touched upon yet: the fashion industry. In fact, I would like to devote my next few posts to this topic because of how ingeniously computer vision is being utilised in it.

In this post I will introduce the fashion industry and then present something that Microsoft recently did in the field with computer vision.

In my next few posts (part 2 here; part 3 here) I would like to present what the academic world (read: cutting-edge research) is doing in this respect. You will see quite amazing things there, so stay tuned for that!

The Fashion Industry

The fashion industry is huge. And that’s probably an understatement. At present it is estimated to be worth US$2.4 trillion. How big is that? If the fashion industry were a country, it would be ranked as the 7th largest economy in the world – above my beloved Australia and other countries like Russia and Spain. Utterly huge.

Moreover, it is reported to be growing at a steady rate of 5.5% each year.

On the e-commerce market, the clothing and fashion sectors dominate. In the EU, for example, the majority of the 530 billion euro e-commerce market is made up of this industry. Moreover, The Economic Times predicts that the online fashion market will grow three-fold in the next few years. The industry appears to be in agreement with this forecast considering some of the major takeovers being currently discussed. The largest one on the table at the moment is of Flipkart, India’s biggest online store that attributes 50% of its transactions to fashion. Walmart is expected to win the bidding war by purchasing 73% of the company that it has valued at US$22 billion. Google is expected to invest a “measly” US$3 billion also. Ridiculously large amounts of money!

So, if the industry is so huge, especially online, then it only makes sense to bring artificial intelligence into play. And since fashion is a visual thing, this is a perfect application for computer vision!

(I’ve always said it: now is a great time to get into computer vision)



Microsoft and the Fashion Industry

Three weeks ago, Microsoft published on their Developer Blog an interesting article detailing how they used deep learning to build an e-commerce catalogue visual search system for “a successful international online fashion retailer” (which one has not been disclosed). I would like to present a summary of that article here because I think it is a perfect introduction to what computer vision can do in the fashion industry. (In my next post you will see how what Microsoft did is just a drop in the ocean compared to what researchers are currently able to do.)

The motivation behind this search system was to save this retailer’s time in finding whether each new arriving item matches a merchandise item already in stock. Currently, employees have to manually look through catalogues and perform search and retrieval tasks themselves. For a large retailer, sifting through a sizable catalogue can be a time consuming and tedious process.

So, the idea was to be able to take a photo from a mobile phone of a piece of clothing or footwear and search for it in a database for matches.

You may know that Google already has image search functionalities. Microsoft realised, however, that for their application in fashion to work, it was necessary to construct their own algorithm that would include some initial pre-processing of images. The reason for this is that the images in the database had a clean background whereas if you take a photo on your phone in a warehouse setting, you will capture a noisy background. The images below (taken from the original blog post) show this well. The first column shows a query image (taken by a mobile phone), the second column the matching image in the database.

swater-bgd

shirt-bgd

Microsoft, hence, worked on a background subtraction algorithm that would remove the background of an image and only leave the foreground (i.e. salient fashion item) behind.

Background subtraction is a well-known technique in the computer vision field and is still very much an open area of research. In fact, OpenCV has a few very interesting implementations of background subtraction available. See this OpenCV tutorial for more information on these.

GrabCut

Microsoft decided not to use these but instead to try out other methods for this task. It first tried GrabCut, a very popular background segmentation algorithm first introduced in 2004. In fact, this algorithm was developed by Microsoft researchers, and Microsoft holds the patent rights to it (an implementation is nevertheless available in OpenCV as cv2.grabCut).

I won’t go into too much detail on how GrabCut works but basically, for each image, you first need to manually provide a bounding box of the salient object in the foreground. After that, GrabCut builds a model (i.e. a mathematical description) of the background (area outside of the bounding box) and foreground (area inside the bounding box) and using these models iteratively trims inside the rectangle until it deduces where the foreground object lies. This process can be repeated by then manually indicating where the algorithm went wrong inside the bounding box.
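Since OpenCV ships an implementation, a minimal sketch of this bounding-box workflow looks something like the following (the file name and rectangle are placeholders you would replace with your own):

import cv2
import numpy as np

img = cv2.imread('product_photo.jpg')        # placeholder file name
mask = np.zeros(img.shape[:2], np.uint8)     # per-pixel labels, filled in by GrabCut

# arrays OpenCV uses internally for the background/foreground colour models
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# the manually provided bounding box around the salient object: (x, y, width, height)
rect = (50, 50, 400, 500)

# run 5 iterations of the iterative trimming described above
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# definite/probable background pixels become 0, everything else 1
mask2 = np.where((mask == cv2.GC_BGD) | (mask == cv2.GC_PR_BGD), 0, 1).astype('uint8')
cv2.imwrite('foreground.png', img * mask2[:, :, np.newaxis])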

The image below (from the original publication of 2004) illustrates this process. Note that the red rectangle was manually provided as were also the white and red strokes in the bottom left image.

grabcut-example

The images below show some examples provided by Microsoft from their application. The first column shows raw images from a mobile phone taken inside a warehouse, the second column shows initial results using GrabCut, and the third column shows images using GrabCut after additional human interaction. These results are pretty good.

foreground-background-ms

Tiramisu

But Microsoft wasn’t happy with GrabCut for one important reason: it requires human interaction. It wanted a solution that would work from just a photo of a product. So, it decided to move to a deep learning solution: Tiramisu. (Yum, I love that cake…)

Tiramisu is a type of DenseNet, which in turn is a specific type of Convolutional Neural Network (CNN). Once again, I’m not going to go into detail on how this network works. For more information see this publication that introduced DenseNets and this paper that introduced Tiramisu. But basically, in a DenseNet each layer is connected to every other layer, whereas in a standard CNN each layer is connected only to its neighbouring layers.
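To illustrate the dense connectivity idea, here is a toy dense block in PyTorch (a sketch of the general concept only, not the Tiramisu architecture itself): every layer receives the concatenation of all previous feature maps.

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # each conv layer sees the concatenated outputs of all layers before it
    def __init__(self, in_channels=3, growth=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth, kernel_size=3, padding=1)))
            channels += growth  # every later layer also sees this layer's output

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # the dense connections
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock()
print(block(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 51, 64, 64])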

DenseNets work (surprisingly?) well on relatively small datasets. For specific tasks using deep neural networks, you usually need a few thousand example images for each class you are trying to classify. DenseNets can get remarkable results with around 600 images (which is still a lot, but at least a bit more manageable).

So, Microsoft trained a Tiramisu model from scratch with two classes: foreground and background. Only 249 images were provided for each class! The foreground and background training images were segmented using GrabCut with human interaction. The model achieved an accuracy rate of 93.7% at the training stage. The example image below shows an original image, the corresponding labelled training image (white is foreground and black is background), and the predicted Tiramisu result. Pretty good!

foreground-tiramisu

How did it fare in the real world? Apparently quite well. Here are some example images. The top row shows the automatically segmented image (i.e. with the background subtracted out) and the bottom row shows the original input images. Very neat 🙂

tiramisu-good-examples

The segmented images (e.g. top row in the above image) were then used to query a database. How this querying took place and what algorithm was used to detect potential matches is, however, not described in the blog post.

Microsoft has released all their code from this project so feel free to take a look yourselves.

Summary

In this post I introduced the topic of computer vision in the fashion industry. I described how the fashion industry is a huge business currently worth approximately US$2.4 trillion and how it is dominating on the online market. Since fashion is a visual trade, this is a perfect application for computer vision.

In the second part of this post I looked at what Microsoft did recently to develop a catalogue visual search system. They performed background subtraction on photos of fashion items using a DenseNet solution and these segmented images were used to query an already-existing catalogue.

Stay tuned for my next post which will look at what academia has been doing with respect to computer vision and the fashion industry.

(Update: this post is the first of a 3-part series. Part 2 can be found here, part 3 can be found here)

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

Eye-tracking-heatmap

Generating Heatmaps from Coordinates with Kernel Density Estimation

In last week’s post I talked about plotting tracked customers or staff from video footage onto a 2D floor plan. This is an example of the video analytics and data mining that can be performed on standard CCTV footage, giving you insightful information such as common movement patterns or common places of congestion at particular times of the day.

There is, however, another thing that can be done with these extracted 2D coordinates of tracked people: generation of heatmaps.

A heatmap is a visual representation or summary of data that uses colour to represent data values. Generally speaking, the more congested the data is at a particular location, the hotter the colour used to represent it.

The diagram at the top of this post shows an example heatmap for eye-tracking data (I did my PhD in eye-tracking, so this brings back memories :P) on a Wikipedia page. There, the hotter regions denote where more time was spent gazing by viewers.

There are many ways to create these heatmaps. In this post I will present one of them, with some supporting code at the end.

I’m going to assume that you have a list of coordinates in a file denoting the location of people on a 2D floor plan (see my previous post for how to obtain such a file from CCTV footage). Each line in the file is a coordinate at a specific point in time. For example, you might have something like this in a file called “coords.txt”:
x_coords,y_coords
200,301
205,300
208,300
210,300
210,300
210,300

Update: note the ‘301’ in the first row of coordinates in the y-axis column. After an update to one of the libraries I use below, the variance in a column can no longer be 0.

In this example we have somebody moving horizontally over the first few time intervals and then standing still for the last three. If we were to generate a heatmap here, you would expect hot colours around (210, 300) and cooler colours along the path from (200, 301) through to (208, 300).

But how do we get these hot and cold colours around our points and make the heatmap look smooth and beautiful? Well, some of you may have heard of a thing called a Gaussian kernel. That’s just a fancy name for a particular type of curve. Let me show you a 2D image of one:

sm-gaussian

That curve can also be drawn in 3D, like so (notice the hot and cold colours here!):

gaussian-kernel-3D

Now, I’m not going to go into too much detail on Gaussian kernels because it would involve venturing into university mathematics. If you would like to read up on them, this pdf goes into a lot of explanatory detail and this page explains nicely why it is so commonly used in the trade. For this post, all you need to know is that it’s a specific type of curve.

With respect to our heatmaps, then, the idea is to place one of these Gaussian kernels at each coordinate location that we have in our “coords.txt” file. The trick here is to notice that when Gaussian kernels overlap, their values are added together (where the overlapping occurs). So, for example, with the 2D kernel image above, if we were to put another kernel at the exact same location, the peak of the kernel would reach 0.8 (0.4 + 0.4 = 0.8).
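You can verify this additivity with a few lines of numpy (the peak of a standard Gaussian is 1/√(2π) ≈ 0.4):

import numpy as np

def gaussian(x, mu=0.0, sigma=1.0):
    # standard Gaussian kernel; its peak value is ~0.4 when sigma = 1
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 801)
one_kernel = gaussian(x)
two_kernels = gaussian(x) + gaussian(x)  # two kernels at the same location

print(one_kernel.max())   # ~0.3989
print(two_kernels.max())  # ~0.7979 - the overlapping kernels add up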

If you have clusters of points at a similar location, the Gaussian kernels at these locations would all push each other up.

The following image shows this idea well. There are 6 coordinates (the black marks on the x-axis) and a kernel placed at each of these (red dashed lines). The resulting curve is depicted in blue. The three congested kernels on the left push (by addition) the resulting curve up the highest.

pde
Gaussian kernels stacked on top of each other (image source)

This final plot of Gaussian kernels is actually called a kernel density estimation (KDE). It’s just a fancy name for a concept that really, at its core, isn’t too hard to understand!

A kernel density estimation can be performed over 2D points as well, and this is exactly what can be done with the coordinates in your “coords.txt” file. Take a look at the 3D picture of a single Gaussian kernel above and picture looking down at that curve from above. You would be looking at a heatmap!

Here’s a top-down view example but with more kernels (at the locations of the white points). Notice the hot colours at the more congested locations. This is where the kernels have pushed the resulting KDE up the highest:

density-heat-map

And that, ladies and gentlemen, is how you create a heatmap from a file containing coordinate locations.

And what about some accompanying code? For the project that I worked on, I used the seaborn Python visualisation library. That library has a kernel density estimator function called kdeplot:

# import the required packages
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt


# Library versions used in this code:
# Python: 3.7.3
# Pandas: 1.3.5
# Numpy: 1.21.6
# Seaborn: 0.12.1
# Matplotlib: 3.5.3


# load the coordinates file into a dataframe
coords = pd.read_csv('coords.txt')
# call the kernel density estimator function
ax = sns.kdeplot(data = coords, x="x_coords", y="y_coords", fill=True, thresh=0, levels=100, cmap="mako")
# the function has additional parameters, e.g. to change the colour palette,
# so if you need things customised, there are plenty of options

# plot your KDE
# once again, there are plenty of customisations available to you in pyplot
plt.show()


# save your KDE to disk
fig = ax.get_figure()
fig.savefig('kde.png', transparent=True, bbox_inches='tight', pad_inches=0)

It’s amazing what you can do with basic CCTV footage, computer vision, and a little bit of mathematical knowledge, isn’t it?

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

data-mining-image

Mapping Camera Coordinates to a 2D Floor Plan

Data mining is big business. Everyone is analysing mouse clicks, mouse movements, and customer purchase patterns. Such analysis has proven to give profitable insights that are driving businesses further than ever before.

But not many people have considered data mining videos. What about all that security footage that has stacked up over the years? Can we mine those for profitable insights also? Of course!

In this blog post I’m going to present a task that video analytics can do for you: the plotting of tracked customers or staff from video footage onto a 2D floor plan. 

Why would you want to do this? Well, plotting on a 2D plane will allow you to more easily data mine for things such as common movement patterns or common places of congestion at particular times of the day. This is powerful information to possess. For example, if you can deduce what products customers reached for in what order you can make important decisions with respect to the layout of your shelves and placement of advertising. 

Another benefit of this technique is that it is much easier to visualise movement patterns presented on a 2D plane than when they are shown on distorted CCTV footage. (In fact, in my next post I extend what I present here by showing you how to generate heatmaps from your tracking data – check it out).

However, if you have ever tried to undertake this task you may have come to the understanding that it is not as straightforward as you initially thought. A major dilemma is that your security camera images are distorted. For example, a one pixel movement at the top of your image corresponds to a much larger movement in the real world than a one pixel movement at the bottom of your image.

Where to begin? In tackling this problem, the first thing to realise is that we are dealing with two planes in Euclidean space. One plane (the floor in your camera footage) is “stretched out”, while the other is “laid flat”. We therefore need a transformation function to map points from one plane to the other.

The following image shows what we are trying to achieve (assume the chessboard is the floor in your shop/business):

chessboard-person-transformation
The task: map the plane from your camera to a perspective view

The next step, then, is to deduce what kind of transformation is necessary. Once we know this, we can start to look at the mathematics behind it and use this maths accordingly in our application. Here are some possible transformations:

different-transformations
Different types of transformations (image source)

Translations (the first transformation in the image above) are shifts in the x and y plane that preserve orientation. Euclidean transformations change the orientation of the plane but preserve the distances between points – definitely not our case, as mentioned earlier. Affine transformations are a combination of translation, rotation, scale, and shear. They can change the distances between points but parallel lines remain parallel after transformation – also not our case. Lastly, we have homographic transformations that can change a square into any form of a quadrilateral. This is what we are after.

Mathematically, homographic transformations are represented as such:

homographic-transformation

where (x,y) represent pixel coordinates in one plane, (x’, y’) represent pixel coordinates in another plane and H is the homography matrix represented as this 3×3 matrix:

homography-matrix
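For reference, the same relationship written out explicitly in homogeneous coordinates (with s an arbitrary scale factor, since H is only defined up to scale) is:

s \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} =
\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}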

Basically, the equation states this: given a point (x’, y’) in one plane, if I multiply it by the homography matrix H I will get the corresponding point (x, y) in the other plane. So, if we calculate H, we can map any pixel from our camera image to the flat image.

But how do you calculate this magic matrix H? To gloss over some intricate mathematics: we need at least 4 point pairs (4 corresponding points) from the two images to solve for H. The more point pairs we provide, the better the estimate of H will be.

Getting the corresponding point pairs from our images is easy, too. You can use an image editing application like GIMP. If you move your mouse over an image, the pixel coordinates of your mouse position are given at the bottom of the window. Jot down the pixel coordinates from one image and the corresponding pixel coordinates in the matching image. Get at least four such point pairs and you can then estimate H and use it to map any other point between the two images.

chessboard-point-pairs
Example of 3 corresponding points in two images

Now you can take the tracking information from your security camera and plot the positions of people on your perspective 2D floor plan. You can then analyse their walking paths, where they spent most of their time, where congestion frequently occurs, etc. Nice! But what’s even nicer is the simple code needed to do everything discussed here. The OpenCV library (the best image/video processing library around) provides all the methods you’ll need:

import cv2 # import the OpenCV library
import numpy as np # import the numpy library

# provide points from image 1
pts_src = np.array([[154, 174], [702, 349], [702, 572],[1, 572], [1, 191]])
# corresponding points from image 2 (i.e. (154, 174) matches (212, 80))
pts_dst = np.array([[212, 80],[489, 80],[505, 180],[367, 235], [144,153]])

# calculate matrix H
h, status = cv2.findHomography(pts_src, pts_dst)

# provide a point you wish to map from image 1 to image 2
a = np.array([[154, 174]], dtype='float32')
a = np.array([a])

# finally, get the mapping and print the corresponding point in image 2
pointsOut = cv2.perspectiveTransform(a, h)
print(pointsOut)

Piece of cake. Here is a short animation showing what you can do:

animation-after-transformation

Be sure to check out my next post where I show you how to generate heatmaps from the tracking data you just obtained from your security cameras.

To be informed when new content like this is posted, subscribe to the mailing list:

NASA-Mars-Rover

Computer Vision on Mars from the 2004 Exploration Rovers

I was doing my daily trawl of the internet a few days ago looking at the latest news in artificial intelligence (especially computer vision) and an image caught my eye. The image was of one of the Mars Exploration Rovers (MER) that landed on the Red Planet in 2004. Upon seeing the image I thought to myself: “Heck, those rovers must surely have used computer vision up there!?” So, I spent the day looking into this and, sure as can be, not only was computer vision used by these rovers, it in fact played an integral part in their missions.

In this post, then, I’m going to present to you how, where, and when computer vision was used by those MERs. It’s been a fascinating few days for me researching into this and I’m certain you’ll find this an interesting read also. I won’t go into too much detail here but I’ll give you enough to come to appreciate just how neat and important computer vision can be.

If you would like to read more about this topic, a good place to start is “Computer Vision on Mars” (Matthies, Larry, et al. International Journal of Computer Vision 75.1, 2007: 67-92.), which is an academic paper published by NASA in 2007. You can also follow any additional referenced publications there. All images in this post, unless otherwise stated, were taken from this paper.

Background Information

In 2003, NASA launched two rovers into space with the intention of landing them on Mars to study rocks and soils for traces of past water activity. MER followed upon three other missions to the surface of Mars: the two Viking lander missions of 1975 and 1976 and the Mars Pathfinder rover mission of 1997.

Due to constraints in processing power and memory capacity, no image processing was performed by the Viking landers. They only took pictures with their on-board cameras to be sent back to Earth.

The Sojourner (the name of the Mars Pathfinder rover), on the other hand, performed computer vision in one way only. It used stereoscopic vision to provide scientists with detailed maps of the terrain around the rover, which operators on Earth used to plan movement trajectories. Stereoscopic vision provides visual information from two viewing angles a short distance apart, just like our eyes do. This kind of vision is important because two views of the same scene allow for the extraction of 3D data (i.e. depth data). See this OpenCV tutorial on extracting depth maps from stereo images for more information on this.

The MER Rovers

The MER rovers, Spirit and Opportunity as they were named, were identical. Both had a 20 MHz processor, 128 MB of RAM, and 256 MB of flash memory. Not much to work with there, as you can see! Phones nowadays are about 1000 times more powerful.

The rovers also had a monocular descent camera facing directly down and three sets of stereo camera pairs: one pair each at the front and back of the rovers (called hazard cameras, or “hazcams” for short) and a pair of cameras (called navigation cameras, or “navcams” for short) on a mast 1.3m (4.3 feet) above the ground. All these cameras took 1024 x 1024 greyscale photos.

But wait, those colour photos we’ve seen so many times from these missions were fake, then? Nope! Cleverly, each of the stereoscopic camera lenses also had a wheel of 8 filters that could be rotated. Consecutive images could be taken with a different filter (e.g. infrared, ultra-violet, etc.) and colour extracted from a combination of these. Colour extraction was only done on Earth, however. All computer vision processing on Mars was therefore performed in greyscale. Fascinating, isn’t it?

MER-rover-components
Components of an MER rover (image source)

The Importance of Computer Vision in Space

If you’ve been around computer vision for a while you’ll know that for things such as autonomous vehicles, vision solutions are not necessarily the most efficient. For example, lidar (Light Detection And Ranging – a technique similar to sonar for constructing 3D representations of scenes by emitting pulsating laser light and then measuring reflections of it) can give you 3D obstacle avoidance/detection information much more easily and quickly. So, why did NASA choose to use computer vision (and so much of it, as I’ll be presenting to you below) instead of other solutions? Because laser equipment is fragile and it may not have withstood the harsh conditions of Mars. So, digital cameras were chosen instead.

Computer Vision on Mars

We now have information on the background of the mission and the technical hardware relevant to us so let’s move to the business side of things: computer vision.

The first thing I will talk about is the importance of autonomy in space exploration. Due to communication latency and bandwidth limitations, it is advantageous to minimise human intervention by allowing vehicles or spacecraft to make decisions on their own. The Sojourner had minimal autonomy and only ended up travelling approximately 100 metres (328 feet) during its entire mission (which lasted a good few months). NASA wanted the MER rovers to travel on average that much every day, so they put a lot of time and research into autonomy to help them reach this target.

In this respect, the result was that they used computer vision for autonomy on Mars in 3 ways:

  1. Descent motion estimation
  2. Obstacle detection for navigation
  3. Visual odometry

I will talk about each of these below. As mentioned in the introduction, I won’t go into great detail here but I’ll give you enough to satisfy that inner nerd in you 😛

1. Descent Image Motion Estimation System

Two years before the launch of the rocket that was to take the rovers to Mars, scientists realised that their estimates of near-surface wind velocities of the planet were too low. This could have proven catastrophic because severe horizontal winds could have caused irreparable damage upon an ill-judged landing of the rover. Spirit and Opportunity had horizontal impulse rockets that could be used to reduce horizontal velocity upon descent but no system to detect actual horizontal speed of the rovers.

Since a regular horizontal velocity sensor could not be installed due to cost and time constraints, it was decided to turn to computer vision for assistance! A monocular camera was attached to the base of the rover that would take pictures of the surface of the planet as the rovers were descending onto it. These pictures would be analysed in-flight to provide estimates of horizontal speeds in order to trigger the impulse rockets, if necessary.

The computer vision system for motion estimation worked by tracking a single feature (features are small “interesting” or “stand-out” patches in images). The feature was located in photos taken by the rovers and then the position of these patches was tracked between consecutive images.

Coupling this feature tracking information with measurements from the angular velocity and vertical velocity sensors (which were already installed for the purpose of on-surface navigation), the entire velocity vector (i.e. the magnitude and direction of the rover’s speed) could be calculated.

The feature tracking algorithm, called the Descent Image Motion Estimation System (DIMES), consisted of 7 steps, as summarised by the following image:

DIMES-algorithm

The first step reduces the image size to 256 x 256 resolution. The smaller the resolution, the faster that image processing calculations can be performed – but at the possible expense of accuracy. The second step was responsible for estimating the maximum possible area of overlap in consecutive images to minimise the search area for features (there’s no point in detecting features in regions of an image that you know are not going to be present in the second). This was done by taking into consideration knowledge from sensors of things such as the rover’s altitude and orientation. The third step picked out two features from an image using the Harris corner detector (discussed here in this OpenCV tutorial). Only one feature is needed for the algorithm to work but two were detected in case one feature could not be located in the following image. A few noise “clean-up” operations on images were performed in step 4 to reduce effects of things such as blurring.

Step 5 is interesting. The feature patches (aka feature templates) and search windows in consecutive images were rectified (rotated, twisted, etc.) to remove orientation and scale differences in order to make searching for features easier. In other words, the images were rotated, twisted and enlarged/diminished to be placed on the same plane. An example of this from the actual mission (from the Spirit rover’s descent) is shown in the image below. The red squares in the first image are the detected feature patches that are shown in green in the second image with the search windows shown in blue. You can see how the first and second images have been twisted and rotated such that the feature size, for example, is the same in both images.

DIMES-rectification-example

Step 6 was responsible for locating in the second image the two features found in the first image. Moravec’s correlator (an algorithm developed by Hans Moravec and published in his PhD thesis way back in 1980) was used for this. The general idea in this algorithm is to minimise the search area first instead of searching over every possible location in an image for a match. This is done by first selecting potential regions in an image for matches and only there is a more exhaustive search performed.
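As a loose modern analogue (this is not Moravec’s correlator, just an illustration of correlation-based patch search), OpenCV’s matchTemplate slides a feature patch over a search window and reports where it correlates best:

import cv2

# two consecutive frames (placeholder file names)
frame1 = cv2.imread('descent_frame1.png', cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread('descent_frame2.png', cv2.IMREAD_GRAYSCALE)

# take a feature patch (template) around a detected corner in the first frame
x, y, size = 120, 80, 32
template = frame1[y:y + size, x:x + size]

# correlate the template against the second frame and keep the best response
response = cv2.matchTemplate(frame2, template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(response)
print(best_loc, best_score)  # where the feature reappears, and how confidently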

The final step is combining all this information to calculate the velocity vector. In total, the DIMES algorithm took 14 seconds to run up there in the atmosphere of Mars. It was run by both rovers during their descent. The Spirit rover was the only one that fired its impulse rockets as a result of calculations from DIMES. Its horizontal velocity was at one stage reduced from 23.5 m/s (deemed to be slightly over a safe limit) to 11 m/s, which ensured a safe landing. Computer vision to the rescue! Opportunity’s horizontal speed was never calculated to be too fast so firing its stabilising rockets was considered to be unnecessary. It also had a successful landing.

All the above steps were performed autonomously on Mars without any human intervention. 

2. Stereo Vision for Navigation

To give the MER rovers as much autonomy as possible, NASA scientists developed a stereo-vision-based obstacle detection and navigation system. The idea behind it was to give the scientists the ability to simply provide the rovers each day with a destination and for the vehicles to work things out on their own with respect to navigation to this target (e.g. to avoid large rocks).

And their system performed beautifully.

The algorithm worked by extracting disparity (depth) maps from stereo images – as I’ve already mentioned, see this OpenCV tutorial for more information on this technique. What was done, however, by the rovers was slightly different to that tutorial (for example a simpler feature matching algorithm was employed), but the gist of it was the same: feature point detection and matching was performed to find the relationship between images and knowledge of camera properties such as focal lengths and baseline distances allowed for the derivation of depth for all pixels in an image. An example of depth maps calculated in this way by the Spirit rover is shown below:
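If you want to experiment with the general technique, OpenCV’s block-matching stereo correspondence produces a disparity map from a rectified stereo pair (a simple stand-in for the rovers’ pipeline; file names are placeholders):

import cv2

# rectified left/right images from a stereo camera pair
left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# block matching: compare small windows along corresponding scan lines
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# given the focal length f and camera baseline b, depth = (f * b) / disparity
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype('uint8')
cv2.imwrite('disparity.png', vis)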

depth-maps-spirit-rover
The middle picture was taken by Spirit and shows a rock on Mars approximately 0.5 m (1.6 feet) in height. The left image shows corresponding range information (red is closest, blue furthest). The right image shows corresponding height information.

Interestingly, the Opportunity rover, because it landed on a smoothly-surfaced plain, was forced to use its navcams (mounted on a mast) for navigation. Looking down from a higher angle meant that detailed texture from the sand could be used for feature detection and matching. Its hazcams returned only the smooth surface of the sand, and smooth surfaces are not amenable to feature detection (because, for example, they lack corners and edges). The Spirit rover, on the other hand, because it landed in a crater full of rocks, could use its hazcams for stereoscopic navigation.

3. Visual Odometry

Finally, computer vision on Mars was used at certain times to estimate the rovers’ position and travelling distance. No GPS is available on Mars (yet), and standard means of estimating distance travelled, such as counting the number of wheel rotations, were deemed during desert testing on Earth to be vulnerable to significant error due to one thing: wheel slippage. So, NASA scientists decided to employ motion estimation via computer vision instead.

Motion estimation was performed using feature tracking in 3D across successive shots taken by the navcams. To obtain 3D information, once again depth maps were extracted from stereoscopic images. Distances to features could easily be calculated from these and then the rovers’ poses were estimated. On average, 80 features were tracked per frame and a photo was taken for visual odometry calculations every 75 cm (30 inches) of travel.

Using computer vision to assist in motion estimation proved to be a wise decision because wheel slippage was quite severe on Mars. In fact, at one time the rover got stuck in sand and the wheels rotated in place for the equivalent of 50m (164 feet) of driving distance. Without computer vision the rovers’ estimated positions would have been severely inaccurate. 

There was another instance where this was strikingly the case. At one point the Opportunity rover was operating on a 17-20 degree slope in a crater, attempting to manoeuvre around a large rock. It had been trying to escape the rock for several days and had slid down the crater many times in the process. The image below shows the rover’s estimated trajectory (from a top-down view) using just wheel odometry (left), and the rover’s corrected trajectory (right) as assisted by computer vision calculations. The large rock is represented by the black ellipse. The corrected trajectory proved to be the more accurate estimation.

visual-odometry-example

Summary

In this post I presented the three ways computer vision was used by the Spirit and Opportunity rovers during their MER missions on Mars. These three ways were:

  1. Estimating horizontal speeds during their descent onto the Red Planet to ensure the rovers had a smooth landing.
  2. Extracting 3D information of its surroundings using stereoscopic imagery to assist in navigation and obstacle detection.
  3. Using stereoscopic imagery once again but this time to provide motion and pose estimation on difficult terrain.

In this way, computer vision gave the rovers a significant amount of autonomy (much, much more autonomy than its predecessor, the Sojourner rover) that ultimately gave the rovers a safe landing and allowed the robots to traverse up to 370 m (1213 feet) per day. In fact, the Opportunity rover is still active on Mars now. This means that the computer vision techniques described in this post are churning away as we speak. If that isn’t neat, I don’t know what is!

To be informed when new content like this is posted, subscribe to the mailing list (or subscribe to my YouTube channel!):

scale-cv-dl

Why Deep Learning Has Not Superseded Traditional Computer Vision

This is another post that’s been inspired by a question that has been regularly popping up in forums:

Has deep learning superseded traditional computer vision?

Or in a similar vein:

Is there still a need to study traditional computer vision techniques when deep learning seems to be so effective? 

These are good questions. Deep learning (DL) has certainly revolutionised computer vision (CV) and artificial intelligence in general. So many problems that once seemed intractable are now solved to a point where machines obtain better results than humans. Image classification is probably the prime example of this. Indeed, deep learning is responsible for placing CV on the map in the industry, as I’ve discussed in previous posts of mine.

But deep learning is still only a tool of computer vision. And it certainly is not the panacea for all problems. So, in this post I would like to elaborate on this. That is, I would like to lay down my arguments for why traditional computer vision techniques are still very much useful and therefore should be learnt and taught.

I will break the post up into the following sections/arguments:

  • Deep learning needs big data
  • Deep learning is sometimes overkill
  • Traditional CV will help you with deep learning

But before I jump into these arguments, I think it’s necessary to first explain in detail what I mean by “traditional computer vision”, what deep learning is, and also why it has been so revolutionary.

Background Knowledge

Before the emergence of deep learning if you had a task such as image classification, you would perform a step called feature extraction. Features are small “interesting”, descriptive or informative patches in images. You would look for these by employing a combination of what I am calling in this post traditional computer vision techniques, which include things like edge detection, corner detection, object detection, and the like.

In using these techniques – for example, with respect to feature extraction and image classification – the idea is to extract as many features as possible from images of one class of object (e.g. chairs, horses, etc.) and treat these features as a sort of “definition” (known as a bag-of-words) of the object. You would then search for these “definitions” in other images. If a significant number of features from one bag-of-words is located in another image, the image is classified as containing that specific object (i.e. a chair, a horse, etc.).
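A bare-bones sketch of such a bag-of-words pipeline might look like this (ORB and k-means are my illustrative choices; the file names, vocabulary size, and similarity measure are all placeholders you would tune):

import cv2
import numpy as np
from sklearn.cluster import KMeans

orb = cv2.ORB_create()

def descriptors(path):
    # extract local feature descriptors from an image
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(img, None)
    return desc.astype(np.float32)

# 1. pool descriptors from training images of one class into a 'vocabulary'
train_desc = np.vstack([descriptors(p) for p in ['chair1.png', 'chair2.png']])
vocab = KMeans(n_clusters=50, n_init=10).fit(train_desc)

def bag_of_words(path):
    # histogram of vocabulary 'words' found in an image - its 'definition'
    words = vocab.predict(descriptors(path))
    return np.bincount(words, minlength=50) / len(words)

# 2. classify a new image by comparing its histogram to the class definition
class_hist = bag_of_words('chair1.png')
query_hist = bag_of_words('query.png')
print(np.dot(class_hist, query_hist))  # crude similarity score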

The difficulty with this approach of feature extraction in image classification is that you have to choose which features to look for in each given image. This becomes cumbersome and pretty much impossible when the number of classes you are trying to classify for starts to grow past, say, 10 or 20. Do you look for corners? edges? texture information? Different classes of objects are better described with different types of features. If you choose to use many features, you have to deal with a plethora of parameters, all of which have to be fine-tuned by you.

Well, deep learning introduced the concept of end-to-end learning where (in a nutshell) the machine is told to learn what to look for with respect to each specific class of object. It works out the most descriptive and salient features for each object. In other words, neural networks are told to discover the underlying patterns in classes of images.

So, with end-to-end learning you no longer have to manually decide which traditional computer vision techniques to use to describe your features. The machine works this all out for you. Wired magazine puts it this way:

If you want to teach a [deep] neural network to recognize a cat, for instance, you don’t tell it to look for whiskers, ears, fur, and eyes. You simply show it thousands and thousands of photos of cats, and eventually it works things out. If it keeps misclassifying foxes as cats, you don’t rewrite the code. You just keep coaching it.

The image below portrays this difference between feature extraction (using traditional CV) and end-to-end learning:

traditional-cv-and-dl

So, that’s the background. Let’s jump into the arguments as to why traditional computer vision is still necessary and beneficial to learn.

Deep Learning Needs Big Data

First of all, deep learning needs data. Lots and lots of data. Those famous image classification models mentioned above are trained on huge datasets – ImageNet, with its million-plus labelled images, being the best-known example.

Easier tasks than general image classification will not require this much data but you will still need a lot of it. What happens if you can’t get that much data? You’ll have to train on what you have (yes, some techniques exist to boost your training data but these are artificial methods).

But chances are a poorly trained model will perform badly outside of your training data because a machine doesn’t have insight into a problem – it can’t generalise for a task without seeing data.

And it’s too difficult for you to look inside the trained model and tweak things around manually because a deep learning model has millions of parameters inside of it – each of which is tuned during training. In a way, a deep learning model is a black box.

Traditional computer vision gives you full transparency and allows you to better gauge and judge whether your solution will work outside of a training environment. You have insight into a problem that you can transfer into your algorithm. And if anything fails, you can much more easily work out what needs to be tweaked and where.

Deep Learning is Sometimes Overkill

This is probably my favourite reason for supporting the study of traditional computer vision techniques.

Training a deep neural network takes a very long time. You need dedicated hardware (high-powered GPUs, for example) to train the latest state-of-the-art image classification models in under a day. Want to train it on your standard laptop? Go on a holiday for a week and chances are the training won’t even be done when you return.

Moreover, what happens if your trained model isn’t performing well? You have to go back and redo the whole thing again with different training parameters. And this process can be repeated sometimes hundreds of times.

But there are times when all this is totally unnecessary, because sometimes traditional CV techniques can solve a problem much more efficiently and in fewer lines of code than DL. For example, I once worked on a project to detect whether each tin passing along a conveyor belt had a red spoon in it. Now, you can train a deep neural network to detect spoons and go through the time-consuming process outlined above, or you can write a simple colour thresholding algorithm for the colour red (any pixel within a certain range of red is coloured white, every other pixel is coloured black) and then count how many white pixels you have. Simple. You’re done in an hour!
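
For the curious, here’s a rough sketch of what such a colour thresholding check might look like in OpenCV; the HSV bounds and pixel-count threshold are assumptions you would tune for your own camera and lighting:

```python
import cv2
import numpy as np

# A minimal sketch of the red-spoon check described above (parameter
# values are assumptions to be tuned for your camera and lighting).
def tin_has_red_spoon(image_path, min_red_pixels=500):
    img = cv2.imread(image_path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Red wraps around the hue axis in HSV, so threshold two ranges
    # and combine them: white = red pixels, black = everything else.
    mask1 = cv2.inRange(hsv, np.array([0, 120, 70]), np.array([10, 255, 255]))
    mask2 = cv2.inRange(hsv, np.array([170, 120, 70]), np.array([180, 255, 255]))
    mask = cv2.bitwise_or(mask1, mask2)

    # Count the white pixels; enough of them means a spoon is present.
    return cv2.countNonZero(mask) > min_red_pixels
```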

Knowing traditional computer vision can potentially save you a lot of time and unnecessary headaches.

Traditional Computer Vision will Improve your Deep Learning Skills

Understanding traditional computer vision can actually help you be better at deep learning.

For example, the most common neural network used in computer vision is the Convolutional Neural Network. But what is a convolution? It’s in fact a widely used image processing technique (e.g. see Sobel edge detection). Knowing this can help you understand what your neural network is doing under the hood and hence design and fine-tune it better for the task you’re trying to solve.
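
Here’s a small illustration of a convolution applied by hand with a Sobel kernel; the input file name is a placeholder:

```python
import cv2
import numpy as np

# A convolution slides a small kernel across the image. Here we apply
# the classic Sobel kernel for vertical edges by hand -- exactly the
# kind of operation a CNN's early layers end up learning on their own.
img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

edges = cv2.filter2D(img, cv2.CV_32F, sobel_x)  # convolve kernel with image
```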

Then there is also pre-processing: steps frequently performed on the data you’re feeding into your model to prepare it for training. These pre-processing steps are predominantly performed with traditional computer vision techniques. For example, if you don’t have enough training data, you can perform a task called data augmentation. Data augmentation involves performing random rotations, shifts, shears, etc. on the images in your training set to create “new” images. By performing these computer vision operations you can greatly increase the amount of training data that you have.
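
As an example, here’s one possible augmentation setup using Keras’ ImageDataGenerator; the parameter values are assumptions you would tune for your own dataset:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One possible augmentation setup (all parameter values are assumptions):
# random rotations, shifts and shears create "new" training images
# from the ones you already have.
datagen = ImageDataGenerator(
    rotation_range=20,       # rotate up to +/- 20 degrees
    width_shift_range=0.1,   # shift horizontally by up to 10%
    height_shift_range=0.1,  # shift vertically by up to 10%
    shear_range=0.15,        # apply shear transformations
    horizontal_flip=True,    # mirror images left-right
)

# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches.
```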

Conclusion

In this post I explained why deep learning has not superseded traditional computer vision techniques and hence why the latter should still be studied and taught. Firstly, I looked at the problem of DL frequently requiring lots of data to perform well. Sometimes gathering that much data is simply not possible, and in these situations traditional computer vision can serve as an alternative. Secondly, deep learning can occasionally be overkill for a specific task; standard computer vision can solve some problems much more efficiently and in fewer lines of code than DL. Thirdly, knowing traditional computer vision can actually make you better at deep learning, because you can better understand what is happening under the hood of DL and you can perform the pre-processing steps that improve DL results.

In a nutshell, deep learning is just one tool in the computer vision toolbox and certainly not a panacea. Don’t use it only because it’s trendy right now. Traditional computer vision techniques are still very much useful, and knowing them can save you time and many headaches.

brain-reading

Machines That Can Read Our Minds from MRI Scans

Who here watches Black Mirror? Oh, I love that show (except maybe for Season 4 – only one episode there was any good, IMHO). For those that haven’t seen it yet, Black Mirror basically tries to extrapolate what society will look like if technological advances were to continue on their current trajectory.

A few of the technologies presented in that show are science fiction on steroids. But I think that many are not and that it’s only a matter of time before we reach that (dark?) high-tech reality.

A reader of this blog approached me recently with a (preprint) academic publication that sounded pretty much like it had been taken out of that series. It was a paper purporting that scientists can now reconstruct our visual thoughts by analysing our brain scans.

“No way. Too good to be true”, was my first reaction. There has to be a catch. So, I went off to investigate further.

And boy was my mind blown. Allow me in this post to present to you that publication out of Kyoto, Japan. Try telling me at the end that you’re not impressed!

(Note: although this post is not directly related to computer vision, I’m still including it on my blog because it deals with trying to understand how our brains work when it comes to visual stimuli. This is important because, many times in the history of computer vision and AI, it has been shown that biologically inspired solutions (i.e. solutions that mimic the way we behave) can obtain results as good as, if not superior to, other solutions. The prime example of this, of course, is neural networks.)

Machines that can read our minds

The paper in question is entitled “Deep image reconstruction from human brain activity” (Shen et al., bioRxiv, 2017). Although it is a preprint, meaning that it hasn’t been peer-reviewed and technically doesn’t hold much academic weight, the site it was published on is of repute (www.biorxiv.org), as are the institutes that co-authored the work (ATR Computational Neuroscience Laboratories and Kyoto University, the latter being one of Asia’s highest-ranked universities). Hence, I’m confident in the credibility of their research.

In a nutshell, the paper is claiming to have developed a way to reconstruct images that people are looking at or thinking about by solely analysing functional magnetic resonance imaging (fMRI). fMRI machines measure brain activity by detecting changes in blood flow through regions of the brain – generally speaking, the more blood flow, the more brain activity. Here’s an example fMRI image showing increased brain activity in orange/yellow:

(image source)

Let’s look at some details of their paper.

The first step was to use machine learning to examine the relationship between brain activity as detected by fMRI scans and the images that were being shown to test subjects. What’s interesting is that there was a focus on mapping hierarchical features between the fMRI scans and the real images.

Hierarchical features are a way to deconstruct images into ascending levels of detail. For example, to describe a car you could start with edges, work up to curves, then to particular shapes like circles, then to objects like wheels; finally you’d get four wheels, a chassis and some windows.

Deconstructing images by looking at hierarchical features (edges, curves, shapes, and so on) is also the way neural networks and our brains work with respect to visual stimuli. So, this plan of action to focus on hierarchical features in fMRI images seems intuitive.
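
As a small aside, here’s a sketch of hierarchical features in practice: pulling activations from an early, middle, and late layer of a pretrained VGG16 network. This is just an illustration of the general idea, not the authors’ setup:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Grab activations from three depths of a pretrained VGG16.
# Early layers respond to edges and colours; later layers respond
# to increasingly complex shapes and object parts.
base = VGG16(weights="imagenet", include_top=False)
layer_names = ["block1_conv1", "block3_conv1", "block5_conv1"]  # low to high level
extractor = Model(
    inputs=base.input,
    outputs=[base.get_layer(name).output for name in layer_names],
)
# extractor.predict(batch_of_images) returns one feature map per level.
```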

The hierarchical features extracted from fMRI scans were then used to replace the features of a deep neural network (DNN). Next, an iterative process was started in which base images (produced by a deep generator network) were fed into the DNN and, at each iteration, individual pixels of the image were adjusted until the image’s features matched those decoded from the fMRI scans to within a certain error threshold.
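
To give a rough feel for this, here’s an illustrative sketch of such an iterative feature-matching loop in TensorFlow. This is emphatically not the authors’ code: `dnn_features` and `target_features` are hypothetical stand-ins for the network’s activations and the features decoded from the fMRI scans:

```python
import tensorflow as tf

# Hypothetical: dnn_features(image) computes the DNN activations of an
# image; target_features holds the features decoded from fMRI scans.
image = tf.Variable(tf.random.normal([1, 224, 224, 3]))  # base image
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)

for step in range(500):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(dnn_features(image) - target_features))
    grads = tape.gradient(loss, [image])
    optimizer.apply_gradients(zip(grads, [image]))  # nudge the pixels themselves
    if loss < 1e-3:  # stop once the features match closely enough
        break
```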

It’s a complicated process, I know. I’ve tried to summarise it as best as I can here! To understand it more, you’ll need to go back to the original publication and then follow a trail of referenced articles. The image below shows this process – not that it helps much 🙂 But I think the most important part of this study is the results – and they’re coming up in the next section.

deep-image-reconstruction
Deep image reconstruction (source: original publication)

Test subjects were shown various images from three different classes: natural colour images (sampled from ImageNet), artificial geometrical shapes (in 8 colours), and black and white alphabetical letters. Scans of brain activity were conducted during the viewing sessions.

Interestingly, test subjects were also asked to recollect images they had been previously shown and brain scans during this task were also taken. The idea was to try to see if a machine could truly discern what a person was actually thinking rather than just viewing.

Results from the Study

This is where the fun begins!

But before we dive into the results, I need to mention that this is not the first time fMRI imaging has been used to attempt to reconstruct images. Similar projects (referenced in the publication we’re discussing) go back to 2011 and 2013. The novelty of this study, however, lies in working with hierarchical features rather than with the fMRI images directly.

Take a look at this video showing past results from previous studies. The images are monochromatic and of minimal resolution:

Now let’s take a look at some results from this particular study. (All images are taken from the original publication.)

How about a reconstructed image of a swan? Remember, this reconstructed image (right) is generated from someone’s brain scan (taken while the test subject was viewing the original image):

Example-Result-Swan
Example result #1: the test subject was shown the image on the left; the image on the right is the reconstructed image from brain scans.

You must be impressed with that!

Here’s a reconstructed picture of a duck:

experiment-result-duck
Example result #2: the test subject was shown the image on the left; the image on the right is the reconstructed image from brain scans.

For a presentation of results, take a look at this video released by the authors of the paper:

What about those letters I mentioned that were also shown to test subjects? Check this out:

results-from-text-experiment
Example results #3: the test subject was shown the letters in the top row; the bottom row shows the reconstructed images from brain scans.

Scary how good the reconstructions are, isn’t it? Imagine being able to tell what a person is reading at any given time!? I’ve already written a post about non-invasive lie detection from thermal imaging. If we can just work out how to capture fMRI data as unobtrusively as that (i.e. without the subject’s active participation), we’ll be able to tell what a person is looking at without them even knowing about it. Black Mirror stuff!

Results from the artificial geometric shapes images can be viewed here – also impressive, of course.

And how about reconstructed images from scans performed when people were thinking about the images they had been previously shown? Well, results here aren’t as impressive (see below). The authors concede this also. But hey! One step at a time, right?

results-imagined-images
Example results #4: the test subject was told to think about the images in the top row; the bottom row shows the attempted reconstructions of these images from brain scans.

Conclusion

In this post I presented a preprint publication in which it is purported that scientists were able to reconstruct visual thoughts by analysing brain scans. fMRI images were analysed for hierarchical features that were then used in a deep neural network. Base images were fed through this network and iteratively transformed until the features of each image matched the features in the neural network. Results from this study were presented in this post and I think we can all agree that they were quite impressive. This is the future, folks! Black Mirror stuff perhaps.

The potential uses for good of this technology are numerous – communicating with people in comas is one such use case. Of course, the abuse that could come from it is frightening, too. Issues with privacy (ah, that pesky perennial question in computer science!) spring to mind immediately.

But as you may have gathered from all my posts here, I look forward to what the future holds for us, especially with respect to artificial intelligence. And likewise with this technology. I can’t wait to see it progress.

Amazon Go – Computer Vision at the Forefront of Innovation

Where would a computer vision blog be without a post about the new cashier-less store recently opened to the public by Amazon? Absolutely nowhere.

But I don’t need additional motivation to write about Amazon Go (as the store is called) because I am, to put it simply, thrilled and excited about this new venture. This is innovation at its finest where computer vision is playing a central role.

How can you not get enthusiastic about it, then? I always love it when computer vision makes the news and this is no exception.

In this post, I wish to talk about Amazon Go under four headings:

  1. How it all works from a technical (as much as is possible) and non-technical perspective,
  2. Some of the reported issues prior to public opening,
  3. Some reported in-store issues post public opening, and
  4. Some potential unfavourable implications cashier-less stores may have in the future (just to dampen the mood a little)

So, without further ado…

How it Works – Non-Technically & Technically

The store has a capacity of around 90 people – so it’s fairly small in size, like a convenience store. To enter it you first need to download the official Amazon app and connect it to your Amazon Prime account. You then walk up to a gate like you would at a metro/subway and scan a QR code from the app. The gate opens and your shopping experience begins.

Inside the store, if you wish to purchase something, you simply pick it up off the shelf and put it in your bag or pocket. Once you’re done, you walk out of the shop and a few minutes later you get emailed a receipt listing all your purchases. No cashiers. No digging around for money or cards. Easy as pie!

What happens on the technical side of things behind the scenes? Unfortunately, Amazon hasn’t disclosed much at all, which is a bit of a shame for nerds like me. But I shouldn’t complain too much, I guess.

What we do know is that sensor fusion is employed (sensor fusion is when data is combined from multiple sensors/sources to provide a higher degree of accuracy) along with deep learning.

Hundreds of cameras and depth sensors are attached to the ceiling around the store:

amazon-go-cameras
The cameras and depth sensors located on the ceiling of the store (image source)

These track you and your movements (using computer vision!) throughout your expedition. Weight sensors can also be found on the shelves to assist the cameras in discerning which products you have chosen to put into your shopping basket.

(Note: sensor fusion is also being employed in autonomous cars. Hopefully I’ll be writing about this as well soon.)
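
To illustrate the general idea of sensor fusion (and only the general idea – Amazon hasn’t told us how its system actually works), here’s a toy sketch in Python in which a camera’s guess is re-weighted by the shelf’s weight sensor; every name and number in it is made up:

```python
# A toy sketch of sensor fusion (definitely not Amazon's actual system).
# The camera produces a probability for each candidate item; the shelf's
# weight sensor reports how many grams disappeared. We re-weight the
# camera's guess by how well each item's known weight explains that change.
def fuse_estimates(vision_probs, weight_delta_g, item_weights_g):
    fused = {}
    for item, p_vision in vision_probs.items():
        # Likelihood shrinks as an item's expected weight diverges
        # from the observed weight change on the shelf.
        error = abs(item_weights_g[item] - weight_delta_g)
        fused[item] = p_vision / (1.0 + error)
    total = sum(fused.values())
    return {item: p / total for item, p in fused.items()}

# The camera alone is unsure, but a 165 g weight change points
# strongly at the yoghurt:
vision = {"yoghurt": 0.55, "ice cream": 0.45}
weights_g = {"yoghurt": 170, "ice cream": 500}
print(fuse_estimates(vision, 165, weights_g))
```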

In 2015, Amazon filed a patent application for its cashier-less store in which it stated the use of RGB cameras (i.e. colour cameras) along with facial recognition. TechCrunch, however, has reported that the Vice President of Technology at Amazon Go told them that no facial recognition algorithms are currently being used.

In-Store Issues Prior to Public Opening

Although the store opened its doors to the public a few weeks ago, it has been open to employees since December 2016. Initially, Amazon expected the store to be ready for public use a few months after that, but public opening was delayed by nearly a year due to “technical problems”.

We know what some of the dilemmas behind these “technical problems” were.

Firstly, Amazon had problems tracking more than 20 people in the store. If you’ve ever worked on person-tracking software, you’ll know how hard it is to track a crowd of people with similar body types and wearing similar clothes. But it looks like this has been resolved (to at least a satisfactory level for them). It’s a shame for us, though, to not be given more information on how Amazon managed to get this to work.

Funnily enough, some employees of Amazon knew about this problem and in November last year tried to see if a solution had been developed. Three employees dressed up in Pikachu costumes (as reported by Bloomberg here) while doing their round of shopping to attempt to fool the system. Amazon Go passed this thorough, systematic, and very scientific test. Too bad I couldn’t find any images or videos of this escapade!

We also know that initially engineers were assisting the computer vision system behind the scenes. The system would let these people know when it was losing confidence with its tracking results and would ask them to intervene, at least minimally. Nobody is supposedly doing this task any more.

Lastly, I also found information stating that the system would run into trouble when products were taken off the shelf and placed back on a different shelf. This was reported to have occurred when employees brought their children into the store and they ran wild a little (as children do).

This also appears to have been taken care of, because someone from the public attempted to do this on purpose last week (see this video) with no adverse effects, it would seem.

It’s interesting to see the growing pains that Amazon Go had to go through, isn’t it? How they needed an extra year to try to iron out all these creases. This is such a huge innovation. Makes you wonder what “creases” autonomous cars will have when they become more prominent!

In-Store Issues Post Public Opening

But, alas. It appears as though not all creases were ironed out to perfection. Since Amazon Go’s opening a few weeks ago, two issues have been written about.

The first is of Deirdre Bosa of CNBC not being charged for a small tub of yoghurt:


The Vice President of Amazon Go responded in the following way:

First and foremost, enjoy the yogurt on us. It happens so rarely that we didn’t even bother building in a feature for customers to tell us it happened. So thanks for being honest and telling us. I’ve been doing this a year and I have yet to get an error.

The yoghurt manufacturer replied to that tweet also:

To which Deirdre responded: “Thanks Siggi’s! But I think it’s on Amazon :)”

LOL! 🙂

But as Amazon Go stated, it’s a rarity for these mistakes to happen. Or is that only the case until someone works out a flaw in the system?

Well, it seems as though someone has!

In this video, Tim Pool states that he managed to walk out of the Amazon Go store with a bag full of products and was only charged for one item. According to him it is “absurdly easy to take a bag full of things and not get charged”. That’s a little disconcerting. It’s one thing when the system makes a mistake every now and then. It’s another thing when someone has worked out how to break it entirely.

Tim Pool says he has contacted Amazon Go to let them know of the major flaw. Amazon confirmed with him that he did not commit a crime but “if used in practice what we did would in fact be shoplifting”.

Ouch. I bet engineers are working on this frantically as we speak.

One more issue worth mentioning – not really a flaw, but something that could also be abused – is that at the moment you can request a refund on any item without returning it. No questions asked. Linus Tech Tips shows in this video how easily this can be done. Of course, since your Amazon Go account needs to be linked to your Amazon Prime account, if you do this too many times Amazon will catch on and will probably take some form of preventative action against you, or will even verify your claims by looking back at past footage of you.

Cons of Amazon Go

Like I said earlier, I am really excited about Amazon Go. I always love it when computer vision spearheads innovation. But I also think it’s important to talk in this post about the potential unfavourable implications of a cashier-less store.

Potential Job Losses

The first most obvious potential con of Amazon Go is the job losses that might ensue if this innovation catches on. Considering that 3.5 million people in the US are employed as cashiers (it’s the second-most common job in that country), this issue needs to be raised and discussed. Heck, there have already been protests in this respect outside of Amazon Go:

amazon-go-protest
Protests in front of the Amazon Go store (image source)

Bill Ingram, the organiser of the protest shown above, asks: “What will all the cashiers do once their jobs are automated?”

Amazon, not surprisingly, has issued statements on this topic. It has said that although some jobs may be taken by automation, people can be relocated to improve other areas of the store by, for example:

Working in the kitchen and the store, prepping ingredients, making breakfast, lunch and dinner items, greeting customers at the door, stocking shelves and helping customers

Let’s not forget that new jobs have also been created. For example, additional people need to be hired to manage the technological infrastructure behind this huge endeavour.

Personally, I’m not a pessimist about automation either. The industrial revolution brought automation to so many walks of life; the transition was hard at first, but society found ways to re-educate workers into other areas. The same will happen, I believe, if cashier-less stores become prominent (and autonomous cars too, for that matter).

An Increase in Unhealthy Impulse Purchases

Manoj Thomas, a professor of marketing at Cornell University, has stated that our shopping behaviour will change around cashier-less stores:

[W]e know that when people use any abstract form of payment, they spend more. And the type of products they choose changes too.

What he’s saying is that psychological research has shown that the more distance we put between ourselves and the “pain of paying”, the more discipline we need to avoid those pesky impulse purchases. Having cash physically in your hand means you can keep track of what you’re doing with your money more easily. And that extra bit of time waiting in line at the cashier could be time enough to reconsider purchasing that tub of chocolate and vanilla ice cream :/

Even More Surveillance

And then we have the perennial question of surveillance. When is too much, too much? How much more data about us can be collected?

With such sophisticated surveillance in-store, companies are going to have access to even more behavioural data about us: which products I looked at for a long time; which products I picked up but put back on the shelf; my usual path around a store; which advertisements made me smile – the list goes on. Targeted advertising will become even more effective.

Indeed, Bill Ingram’s protest pictured above was also about this (hence the masks worn at it). According to him, we’re heading in the wrong direction:

If people like that future, I guess they can jump into it. But to me, it seems pretty bleak.

Harsh, but there might be something to it.

Less Human Interaction

Albert Borgmann, a great philosopher on technology, coined the term device paradigm in his book “Technology and the Character of Contemporary Life” (1984). In a nutshell, the term is used to explain the hidden, detrimental nature and power of technology in our world (for a more in-depth explanation of the device paradigm, I highly recommend you read his philosophical works).

One of the things he laments is how we are increasingly losing daily human interactions due to the proliferation of technology. The sense of a community with the people around us is diminishing. Cashier-less stores are pushing this agenda further, it would seem. And considering, according to Aristotle anyway, that we are social creatures, the more we move away from human interaction, the more we act against our nature.

The Chicago Tribune wrote a little about this at the bottom of this article.

Is this something worth considering? Yes, definitely. But only in the bigger picture of things, I would say. At the moment, I don’t think accusing Amazon Go of trying to damage our human nature is the way to go.

Personally, I think this initiative is something to celebrate – albeit, perhaps, with just the faintest touch of reservation. 

Summary

In this post I discussed the cashier-less store “Amazon Go” recently opened to the public. I looked at how the store works from a technical and non-technical point of view. Unfortunately, I couldn’t say much from a technical angle because of how little information Amazon has disclosed. I also discussed some of the issues that the store has dealt with and is dealing with now. I mentioned, for example, that initially there were problems in trying to track more than 20 people in the store, but this appears to have been solved to a satisfactory level (for Amazon, at least). Finally, I dampened the mood a little by discussing the potential unfavourable implications that a proliferation of cashier-less stores may have on our societies. Some of the issues raised here are important but ultimately, in my humble opinion, this endeavour is something to celebrate – especially since computer vision is playing such a prominent role in it.

graph-upward-trend-why

The Reasons Behind the Recent Growth of Computer Vision

In my previous post I looked at the unprecedented growth of computer vision in the industry. 10 years ago computer vision was nowhere to be seen outside of academia. But things have since changed significantly. A telling sign of this is the consistent tripling each year of venture capital funding in computer vision. And Intel’s acquisition of Mobileye in March 2017 for a whopping US$15.3 billion just sums up the field’s achievements.

In that post, however, I only briefly touched upon the reasons behind this incredible growth. The purpose of this article, then, is to fill that gap.

In this respect, I will discuss the top 4 reasons behind the growth of computer vision in the industry. I’ll do so in the following order:

  1. Advancements in hardware
  2. The emergence of deep learning
  3. The advent of large datasets
  4. The increase in computer vision applications

Better and More Dedicated Hardware

I mentioned in another post of mine that one of the main reasons why image processing is such a difficult problem is that it deals with an immense amount of data. To process this data you need memory and processing power. These have been increasing in size and power regularly for over 50 years (cf. Moore’s Law).

Such increases have allowed for algorithms to run faster to the point where more and more things are now capable of being run in real-time (e.g. face recognition).

We have also seen the emergence and proliferation of dedicated hardware for graphics and image processing calculations. GPUs are the prime example of this. A GPU’s clock speed may generally be slower than a regular CPU’s, but with its thousands of cores working in parallel it can vastly outperform a CPU on these specific tasks.

Dedicated hardware is becoming so highly prized in computer vision that numerous companies have started designing and producing their own. Just two weeks ago, for example, Ambarella announced two new chips designed for computer vision processing, chiefly with autonomous cars, drones, and security cameras in mind. And last year, Mythic, a startup based in California, raised over US$10 million to commercialise its own deep-learning-focused hardware.

The Emergence of Deep Learning

Deep learning, a subfield of machine learning, has been revolutionary in computer vision. Because of it machines are now getting better results than humans in important tasks such as image classification (i.e. detecting what object is in an image).

Previously, if you had a task such as image classification, you would perform a step called feature extraction. Features are small “interesting”, descriptive or informative patches in images. The idea is to extract as many of these as possible from images of one class of object (e.g. chairs, horses, etc.) and treat these features as a sort of “definition” (known as a bag-of-words) of the object. You would then search for these “definitions” in other images. If a significant number of features from one bag-of-words are located in another image, the image is classified as containing that specific object (i.e. a chair, a horse, etc.).

The difficulty with this approach is that you have to choose which features to look for in each given image. This becomes cumbersome and pretty much impossible when the number of classes you are trying to classify for starts to grow past, say, 10 or 20. Do you look for corners? edges? texture information? Different classes of objects are better described with different types of features. If you choose to use many features, you have to deal with a plethora of parameters, all of which have to be fine-tuned.

Well, deep learning introduced the concept of end-to-end learning where (in a nutshell) the machine is told to learn what to look for with respect to each specific class of object. It works out the most descriptive and salient features for each object. In other words, neural networks are told to discover the underlying patterns in classes of images.

The image below portrays this difference between feature extraction and end-to-end learning:

traditional-cv-and-dl

Deep learning has proven to be extremely successful for computer vision. If you look below at the graph I used in my previous post showing capital investments into US-based computer vision companies since 2011, you can see that when deep learning became mainstream in around 2014/2015, investments suddenly doubled and have been growing at a regular rate since.

cv-growth-graph
(image source)

You can safely say that deep learning put computer vision on the map in the industry. Without it, chances are computer vision would still be stuck in academia (not that there’s anything wrong with academia, of course).

Large Datasets

To allow a machine to learn the underlying patterns of classes of objects it needs A LOT of data. That is, it needs large datasets. More and more of these have been emerging and have been instrumental in the success of deep learning and therefore computer vision.

Before around 2012, a dataset was considered relatively large if it contained 100+ images or videos. Now, datasets exist with numbers ranging in the millions.

Here are some of the best-known image classification datasets currently being used to test and train the latest state-of-the-art object classification/recognition models. They have all been meticulously hand annotated.

  • ImageNet – 15 million images, 22,000 object categories. It’s HUGE! (I hope to write more about this dataset in the near future, so stay tuned for that).
  • Open Images – 9 million images, 5,000 object categories.
  • Microsoft Common Objects in Context (COCO) – 330K images, 80 object categories.
  • PASCAL VOC Dataset – a few versions exist, 20 object categories.
  • CALTECH-101 – 9,000 images with 101 object categories.

I also need to mention Kaggle and the University of California, Irvine (UCI) Machine Learning Repository. Kaggle hosts 351 image datasets ranging from flowers and wines to forest fires. It is also the home of the famous Facial Expression Recognition Challenge (FER), the aim of which is to correctly classify the emotion shown on faces in nearly 35,000 images into one of seven categories.

All these datasets and many more have raised computer vision to its current position in the industry. Certainly, deep learning would not be where it is now without them.

More Applications

Faster machines, larger memories, and other advances in technology have increased the number of useful things machines have been able to do for us in our lives. We now have autonomous cars (well, we’re close to having them), drones, factory robots, cleaning robots – the list goes on. With an increase in such vehicles, devices, tools, appliances, etc. has come an increase in the need for computer vision.

Let’s take a look at some examples of recent new ways computer vision is being used today.

Walmart, for example, released shelf-scanning robots into 50 of its stores a few months ago. The purpose of these robots is to detect out-of-stock and misplaced items and other problems such as incorrect labelling. Here’s a picture of one of these robots at work:

(image source)

A British online supermarket is using computer vision to determine the best ways to grasp goods for its packaging robots. This video shows their robot in action:

Agriculture, too, is capitalising on the growth of computer vision. iUNU, for example, is developing a network of cameras on rails to help greenhouse owners keep track of how their plants are growing.

The famous Roomba autonomous vacuum cleaner got an upgrade a few years ago with a new computer vision system to more smartly manoeuvre around your home.

And our phones? Plenty of computer vision being used in them! Ever noticed your phone camera tracking and focusing on your face when you’re trying to take a picture? That’s computer vision. And how about the face recognition services to unlock your phones? Well, Samsung’s version of it can be classified as computer vision (I write about it in this post).

There’s no need to mention autonomous cars here. We are constantly hearing about them on the news. It’s only a matter of time before we’ll be jumping into one.

Computer vision is definitely here to stay. In fact, it’s only going to get bigger with time.

Summary

In this post I looked at the four main reasons behind the recent growth of computer vision: 1) the advancements in hardware such as faster CPUs and availability of GPUs; 2) the emergence of deep learning, which has changed our way of performing tasks such as image classification; 3) the advent of large datasets that have allowed us to more meticulously study the underlying patterns in images; and 4) the increase in computer vision applications.

All these factors (not always mutually exclusive) have contributed to the unprecedented position of computer vision in the industry.

As I mentioned in my previous post, it’s been an absolute pleasure to have witnessed this growth and to have seen these factors in action. I truly look forward to what the future holds for computer vision.
