Football/Soccer on Your Tabletop

Well, it’s World Cup season now, isn’t it? Australia got eliminated this week so I’m feeling a bit depressed at the moment. However, seeing teams like Germany also not make it past the group stage makes me feel a little better (sorry German people reading this post :P).

But since the World Cup is on, it is only fitting that I write about something from the field of Computer Vision that is related to football. So, in this post I’m going to present to you quite an amazing paper I stumbled upon entitled “Soccer on Your Tabletop” (Rematas et al., CVPR 2018, pp. 4738-4747).

(Recall that CVPR is a world-class academic conference on computer vision. Anything published there is always worth reading.)

The goal of the paper is to present an algorithm that reconstructs a 3D representation of a football game from a single 2D video – just like you would find on YouTube. The 3D video could then be projected onto a tabletop (like a hologram) and viewed by everyone in the room from multiple angles. An interesting concept!

Usually something like this is obtained by having multiple cameras set up that can then work together to provide 3D information of the football pitch. But the idea here is to get all information from a single 2D video.

Here’s a clip of the entire project that has been released by the authors. Watch it!

Ah, you just have to love computer vision!

Let’s take a look at the (slightly simplified here) steps involved in the 2D -> 3D reconstruction process:

  1. Input frame: obtained from any 2D video of a football game (captured from stationary cameras).
  2. Camera calibration: this is performed using the football pitch line markings as guidance. The line markings provide excellent reference points to obtain the 2D plane of the football pitch from which players’ measurements can be deduced.
  3. Player detection, pose estimation, and tracking: this is done using existing techniques. Specifically, the authors use a CVPR 2015 paper for detecting bounding boxes around players (top left image below), a CVPR 2016 paper for estimating poses (top right image below), and a simple player tracking algorithm that compares bounding boxes from adjacent frames and matches them according to closest 2D Euclidean distance (bottom left image).

    (image taken from original publication)
  4. Player segmentation: the idea here is to highlight the entire contour of the player after performing the above steps (see bottom right image above). This is performed by taking each pixel and analysing its neighbouring pixels for similarities in colour and edge information until each player is extracted. (Several more steps are performed to fine-tune this process but I’ll skip over these).
  5. Player depth estimation and mesh generation. This is the tricky part. What the authors did is quite intuitive. To constrain the solution space to just football related poses, body shapes, and clothing, the authors created a training dataset from FIFA video games. Lol! What they found was that it was possible to intercept calls between the game engine and the GPU while playing the video game and then to extract depth maps from these intercepted calls. In doing so, they were able to train a deep neural network to extract depth maps from 2D videos. This trained network was then used on 2D YouTube videos. Absolutely brilliant!

    (image obtained from project’s video)
  6. Scene reconstruction. Once player depth estimation and mesh information (which is 3D information) is obtained, the scene can then be reconstructed. What the authors ended up doing is to use Microsoft HoloLens (a mixed reality lens that enables you to see and interact with holograms in real life). So the football pitch on the tabletop you see in the image below isn’t real! Can you imagine watching a match like this around a table with your mates!? There is a catch with the project, however. It’s not good enough yet to reconstruct the ball, which means that at the moment all you can view in 3D are players running around chasing an invisible object 🙂 But that’s work in progress and the job and essence of research.

    (image obtained from project’s video)
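The simple tracking idea from step 3 (match bounding boxes in adjacent frames by closest 2D Euclidean distance) can be sketched in a few lines. This is only an illustration under my own assumptions, not the authors' code: the box format `(x_min, y_min, x_max, y_max)`, the greedy matching order, and the function names `box_centre` and `match_boxes` are all made up for this example.

```python
import math

def box_centre(box):
    """Centre (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def match_boxes(prev_boxes, curr_boxes):
    """Greedily pair each previous box with the closest unused current box.

    Returns a dict mapping an index in prev_boxes to an index in curr_boxes.
    """
    # Every (distance, previous index, current index) combination, closest first.
    candidates = sorted(
        (math.dist(box_centre(p), box_centre(c)), i, j)
        for i, p in enumerate(prev_boxes)
        for j, c in enumerate(curr_boxes)
    )
    matches, used_prev, used_curr = {}, set(), set()
    for _, i, j in candidates:
        if i not in used_prev and j not in used_curr:
            matches[i] = j
            used_prev.add(i)
            used_curr.add(j)
    return matches

# Two players in the previous frame; the detector lists them in a
# different order in the current frame.
prev_frame = [(100, 50, 120, 100), (300, 60, 320, 110)]
curr_frame = [(304, 62, 324, 112), (103, 51, 123, 101)]
matches = match_boxes(prev_frame, curr_frame)
print(matches)
```

Here player 0 from the previous frame is paired with box 1 in the current frame and vice versa, because those are the closest centres. A scheme this simple will struggle when players cross paths, which is presumably why it is described as just one step in a longer pipeline.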

Amazing, if you ask me! I can’t wait to see what the future holds for computer vision.

And believe it or not, code for this project is available online for you to play around with as much as you like. So, enjoy!



Generating Heatmaps from Coordinates

In last week’s post I talked about plotting tracked customers or staff from video footage onto a 2D floor plan. This is an example of video analytics and data mining that can be performed on standard CCTV footage that can give you insightful information such as common movement patterns or common places of congestion at particular times of the day.

There is, however, another thing that can be done with these extracted 2D coordinates of tracked people: generation of heatmaps.

A heatmap is a visual representation or summary of data that uses colour to represent data values. Generally speaking, the more congested the data is at a particular location, the hotter the colour used to represent it.

The diagram at the top of this post shows an example heatmap for eye-tracking data (I did my PhD in eye-tracking, so this brings back memories :P) on a Wikipedia page. There, the hotter regions denote where more time was spent gazing by viewers.

There are many ways to create these heatmaps. In this post I will present one way, with some supporting code at the end.

I’m going to assume that you have a list of coordinates in a file denoting the location of people on a 2D floor plan (see my previous post for how to obtain such a file from CCTV footage). Each line in the file is a coordinate at a specific point in time. For example, you might have something like this in a file called “coords.txt”:

In this example we have somebody moving horizontally 5 pixels for two time intervals and then standing still for 3 time intervals. If we were to generate a heatmap here, you would expect there to be hot colours around (210, 300) and cooler colours at (200, 300) through to (210, 300).

But how do we get these hot and cold colours around our points and make the heatmap look smooth and beautiful? Well, some of you may have heard of a thing called a Gaussian kernel. That’s just a fancy name for a particular type of curve. Let me show you a 2D image of one:

That curve can also be drawn in 3D, like so (notice the hot and cold colours here!):


Now, I’m not going to go into too much detail on Gaussian kernels because it would involve venturing into university mathematics. If you would like to read up on them, this pdf goes into a lot of explanatory detail and this page explains nicely why it is so commonly used in the trade. For this post, all you need to know is that it’s a specific type of curve.

With respect to our heatmaps, then, the idea is to place one of these Gaussian kernels at each coordinate location that we have in our “coords.txt” file. The trick here is to notice that when Gaussian kernels overlap, their values are added together (where the overlapping occurs). So, for example, with the 2D kernel image above, if we were to put another kernel at the exact same location, the peak of the kernel would reach 0.8 (0.4 + 0.4 = 0.8).

If you have clusters of points at a similar location, the Gaussian kernels at these locations would all push each other up.
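To make the "kernels add up" idea concrete, here is a minimal sketch in plain Python. It assumes the standard Gaussian curve with a spread (sigma) of 1, whose peak is roughly 0.4; the function names are my own.

```python
import math

def gaussian_kernel(x, centre, sigma=1.0):
    """Height at x of a Gaussian curve centred on `centre` (area under the curve is 1)."""
    z = (x - centre) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def stacked_kernels(x, centres, sigma=1.0):
    """Sum of one Gaussian kernel per coordinate, evaluated at x."""
    return sum(gaussian_kernel(x, c, sigma) for c in centres)

single_peak = stacked_kernels(0.0, [0.0])       # one kernel at 0
double_peak = stacked_kernels(0.0, [0.0, 0.0])  # two kernels at the same spot
print(round(single_peak, 2), round(double_peak, 2))
```

One kernel peaks at about 0.4 and two kernels at the same location peak at about 0.8, exactly as described above.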

The following image shows this idea well. There are 6 coordinates (the black marks on the x-axis) and a kernel placed at each of these (red dashed lines). The resulting curve is depicted in blue. The three congested kernels on the left push (by addition) the resulting curve up the highest.

Gaussian kernels stacked on top of each other (image source)

This final plot of Gaussian kernels is actually called a kernel density estimation (KDE). It’s just a fancy name for a concept that really, at its core, isn’t too hard to understand!

A kernel density estimation can be performed in 3D as well and this is exactly what can be done with the coordinates in your “coords.txt” file. Take a look at the 3D picture of a single Gaussian kernel above and picture looking down at that curve from above. You would be looking at a heatmap!

Here’s a top-down view example but with more kernels (at the locations of the white points). Notice the hot colours at the more congested locations. This is where the kernels have pushed the resulting KDE up the highest:


And that, ladies and gentlemen, is how you create a heatmap from a file containing coordinate locations.

And what about some accompanying code? For the project that I worked on, I used the seaborn Python visualisation library. That library has a kernel density estimator function called kdeplot:
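To show what a kernel density estimator is doing underneath, here is a library-free sketch rather than a seaborn call. Everything in it is an assumption made for illustration: the comma-separated file format, the six coordinates (chosen to match the "move 5 pixels twice, then stand still" description earlier), the kernel spread `sigma`, and the function names.

```python
import math

# Hypothetical contents of "coords.txt": one "x,y" pair per line.
coords_text = """200,300
205,300
210,300
210,300
210,300
210,300"""

def parse_coords(text):
    """Turn "x,y" lines into a list of (x, y) float tuples."""
    return [tuple(float(v) for v in line.split(","))
            for line in text.splitlines() if line.strip()]

def density(x, y, points, sigma=5.0):
    """Sum of 2D Gaussian kernels, one per tracked point, evaluated at (x, y)."""
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)
    return total

points = parse_coords(coords_text)

# Evaluate the density along the row y = 300; a full heatmap would do this
# for every pixel of the floor plan and map the values to colours.
xs = list(range(190, 221, 5))
row = [density(x, 300, points) for x in xs]
hottest_x = xs[row.index(max(row))]
print(hottest_x)
```

The hottest location on that row comes out at x = 210, where the person stood still, and the density at (200, 300) is noticeably lower. A plotting library then only has to turn these numbers into colours.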

It’s amazing what you can do with basic CCTV footage, computer vision, and a little bit of mathematical knowledge, isn’t it?



Mapping Camera Coordinates to a 2D Floor Plan

Data mining is big business. Everyone is analysing mouse clicks, mouse movements, and customer purchase patterns. Such analysis has proven to give profitable insights that are driving businesses further than ever before.

But not many people have considered data mining videos. What about all that security footage that has stacked up over the years? Can we mine those for profitable insights also? Of course!

In this blog post I’m going to present a task that video analytics can do for you: the plotting of tracked customers or staff from video footage onto a 2D floor plan. 

Why would you want to do this? Well, plotting on a 2D plane will allow you to more easily data mine for things such as common movement patterns or common places of congestion at particular times of the day. This is powerful information to possess. For example, if you can deduce what products customers reached for in what order you can make important decisions with respect to the layout of your shelves and placement of advertising. 

Another benefit of this technique is that it is also much easier to visualise movement patterns presented on a 2D plane than when shown on distorted CCTV footage. (In fact, in my next post I extend what I present here by showing you how to generate heatmaps from your tracking data – check it out).

However, if you have ever tried to undertake this task you may have come to the understanding that it is not as straightforward as you initially thought. A major dilemma is that your security camera images are distorted. For example, a one pixel movement at the top of your image corresponds to a much larger movement in the real world than a one pixel movement at the bottom of your image.

Where to begin? In tackling this problem, the first thing to realise is that we are dealing with two planes in the Euclidean space. One plane (the floor in your camera footage) is “stretched out”, while the other is “laid flat”. We, therefore, need a transformation function to map points from one plane to the other.

The following image shows what we are trying to achieve (assume the chessboard is the floor in your shop/business):

The task: map the plane from your camera to a perspective view

The next step, then, is to deduce what kind of transformation is necessary. Once we know this, we can start to look at the mathematics behind it and use this maths accordingly in our application. Here are some possible transformations:

Different types of transformations (image source)

Translations (the first transformation in the image above) are shifts in the x and y plane that preserve orientation. Euclidean transformations change the orientation of the plane but preserve the distances between points – definitely not our case, as mentioned earlier. Affine transformations are a combination of translation, rotation, scale, and shear. They can change the distances between points but parallel lines remain parallel after transformation – also not our case. Lastly, we have homographic transformations that can change a square into any form of a quadrilateral. This is what we are after.

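Mathematically, homographic transformations are represented as such:

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = H \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}$$

where (x,y) represent pixel coordinates in one plane, (x’, y’) represent pixel coordinates in another plane (both written in homogeneous coordinates, so the equality holds up to a scale factor) and H is the homography matrix represented as this 3×3 matrix:

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$$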

Basically, the equation states this: given a point in one plane (x’,y’), if I multiply it by the homography matrix H I will get the corresponding point (x,y) from the other plane. So, if we calculate H, we can get the coordinates of any pixel from our camera image to the flat image.

But how do you calculate this magic matrix H? To gloss over some intricate mathematics, what we need is at least 4 point pairs (4 corresponding points) from the two images to get a minimal solution (a “close enough” solution) of H. But the more point pairs we provide, the better the estimate of H will be.

Getting the corresponding point pairs from our images is easy, too. You can use an image editing application like GIMP. If you move your mouse over an image, the pixel coordinates of your mouse position are given at the bottom of the window. Jot down the pixel coordinates from one image and the corresponding pixel coordinates in the matching image. Get at least four such point pairs and you can then get an estimate of H and use it to calculate any other corresponding point pairs.

Example of 3 corresponding points in two images

Now you can take the tracking information from your security camera and plot the position of people on your perspective 2D floor plan. You can now analyse their walking paths, where they spent most of their time, where congestion frequently occurs, etc. Nice! But what’s even nicer is the simple code needed to do everything discussed here. The OpenCV library (the best image/video processing library around) provides all necessary methods that you’ll need:
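To show what is going on under the hood, here is a library-free sketch of the minimal four-pair case. OpenCV offers functions for this job (for example `cv2.findHomography` and `cv2.perspectiveTransform`), but the sketch below deliberately does not use them, so treat it as an illustration rather than the code used in the project. The point coordinates, the function names, and the choice to fix the bottom-right entry of H to 1 are all assumptions.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def estimate_homography(camera_pts, plan_pts):
    """3x3 matrix H (last entry fixed to 1) from exactly four point pairs."""
    a, b = [], []
    for (u, v), (x, y) in zip(camera_pts, plan_pts):
        # Two linear equations per pair, from x = (h11*u + h12*v + h13) / w
        # and y = (h21*u + h22*v + h23) / w with w = h31*u + h32*v + 1.
        a.append([u, v, 1, 0, 0, 0, -x * u, -x * v])
        b.append(x)
        a.append([0, 0, 0, u, v, 1, -y * u, -y * v])
        b.append(y)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def map_point(h, u, v):
    """Map a camera pixel (u, v) onto the floor plan, dividing out the scale w."""
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return ((h[0][0] * u + h[0][1] * v + h[0][2]) / w,
            (h[1][0] * u + h[1][1] * v + h[1][2]) / w)

# Made-up example: a floor that looks like a trapezoid in the camera image
# and a 400 x 300 rectangle on the floor plan.
camera_pts = [(100, 400), (540, 400), (420, 200), (220, 200)]
plan_pts = [(0, 300), (400, 300), (400, 0), (0, 0)]

H = estimate_homography(camera_pts, plan_pts)
print(map_point(H, 320, 300))  # some tracked position in the camera image
```

With more than four pairs you would switch to a least-squares fit, which is where the "more pairs give a better estimate" advice above comes from.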

Piece of cake. Here is a short animation showing what you can do:


Be sure to check out my next post where I show you how to generate heatmaps from the tracking data you just obtained from your security cameras.



Computer Vision on Mars

I was doing my daily trawl of the internet a few days ago looking at the latest news in artificial intelligence (especially computer vision) and an image caught my eye. The image was of one of the Mars Exploration Rovers (MER) that landed on the Red Planet in 2004. Upon seeing the image I thought to myself: “Heck, those rovers must surely have used computer vision up there!?” So, I spent the day looking into this and, sure as can be, not only was computer vision used by these rovers, it in fact played an integral part in their missions.

In this post, then, I’m going to present to you how, where, and when computer vision was used by those MERs. It’s been a fascinating few days for me researching into this and I’m certain you’ll find this an interesting read also. I won’t go into too much detail here but I’ll give you enough to come to appreciate just how neat and important computer vision can be.

If you would like to read more about this topic, a good place to start is “Computer Vision on Mars” (Matthies, Larry, et al. International Journal of Computer Vision 75.1, 2007: 67-92.), which is an academic paper published by NASA in 2007. You can also follow any additional referenced publications there. All images in this post, unless otherwise stated, were taken from this paper.

Background Information

In 2003, NASA launched two rovers into space with the intention of landing them on Mars to study rocks and soils for traces of past water activity. MER followed three other surface missions: the two Viking lander missions, which launched in 1975 and landed in 1976, and the Mars Pathfinder mission of 1997.

Due to constraints in processing power and memory capacity, no image processing was performed by the Viking landers. They only took pictures with their on-board cameras to be sent back to Earth.

The Sojourner (the name of the Mars Pathfinder rover), on the other hand, performed computer vision in one way only. It used stereoscopic vision to provide detailed maps of the terrain around the rover for operators on Earth to use in planning movement trajectories. Stereoscopic vision provides visual information from two viewing angles a short distance apart just like our eyes do. This kind of vision is important because two views of the same scene allow for the extraction of 3D data (i.e. depth data). See this OpenCV tutorial on extracting depth maps from stereo images for more information on this.

The MER Rovers

The MER rovers, Spirit and Opportunity as they were named, were identical. Both had a 20 MHz processor, 128 MB of RAM, and 256 MB of flash memory. Not much to work with there, as you can see! Phones nowadays are about 1000 times more powerful.

The rovers also had a monocular descent camera facing directly down and three sets of stereo camera pairs: one pair each at the front and back of the rovers (called hazard cameras, or “hazcams” for short) and a pair of cameras (called navigation cameras, or “navcams” for short) on a mast 1.3m (4.3 feet) above the ground. All these cameras took 1024 x 1024 greyscale photos.

But wait, those colour photos we’ve seen so many times from these missions were fake, then? Nope! Cleverly, each of the stereoscopic camera lenses also had a wheel of 8 filters that could be rotated. Consecutive images could be taken with a different filter (e.g. infrared, ultra-violet, etc.) and colour extracted from a combination of these. Colour extraction was only done on Earth, however. All computer vision processing on Mars was therefore performed in greyscale. Fascinating, isn’t it?

Components of an MER rover (image source)

The Importance of Computer Vision in Space

If you’ve been around computer vision for a while you’ll know that for things such as autonomous vehicles, vision solutions are not necessarily the most efficient. For example, lidar (Light Detection And Ranging – a technique similar to sonar for constructing 3D representations of scenes by emitting pulsating laser light and then measuring reflections of it) can give you 3D obstacle avoidance/detection information much more easily and quickly. So, why did NASA choose to use computer vision (and so much of it, as I’ll be presenting to you below) instead of other solutions? Because laser equipment is fragile and it may not have withstood the harsh conditions of Mars. So, digital cameras were chosen instead.

Computer Vision on Mars

We now have information on the background of the mission and the technical hardware relevant to us so let’s move to the business side of things: computer vision.

The first thing I will talk about is the importance of autonomy in space exploration. Due to communication latency and bandwidth limitations, it is advantageous to minimise human intervention by allowing vehicles or spacecraft to make decisions on their own. The Sojourner had minimal autonomy and only ended up travelling approximately 100 metres (328 feet) during its entire mission (which lasted a good few months). NASA wanted the MER rovers to travel on average that much every day, so they put a lot of time and research into autonomy to help them reach this target.

In this respect, the result was that they used computer vision for autonomy on Mars in 3 ways:

  1. Descent motion estimation
  2. Obstacle detection for navigation
  3. Visual odometry

I will talk about each of these below. As mentioned in the introduction, I won’t go into great detail here but I’ll give you enough to satisfy that inner nerd in you 😛

1. Descent Image Motion Estimation System

Two years before the launch of the rocket that was to take the rovers to Mars, scientists realised that their estimates of near-surface wind velocities of the planet were too low. This could have proven catastrophic because severe horizontal winds could have caused irreparable damage upon an ill-judged landing of the rover. Spirit and Opportunity had horizontal impulse rockets that could be used to reduce horizontal velocity upon descent but no system to detect actual horizontal speed of the rovers.

Since a regular horizontal velocity sensor could not be installed due to cost and time constraints, it was decided to turn to computer vision for assistance! A monocular camera was attached to the base of the rover that would take pictures of the surface of the planet as the rovers were descending onto it. These pictures would be analysed in-flight to provide estimates of horizontal speeds in order to trigger the impulse rockets, if necessary.

The computer vision system for motion estimation worked by tracking a single feature (features are small “interesting” or “stand-out” patches in images). The feature was located in photos taken by the rovers and its position was then tracked between consecutive images.

Coupled with this feature tracking information and measurements from the angular velocity and vertical velocity sensors (that were already installed for the purpose of on-surface navigation), the entire velocity vector (i.e. information about the magnitude and direction of the rover’s speed) was able to be calculated.

The feature tracking algorithm, called the Descent Image Motion Estimation System (DIMES), consisted of 7 steps as summarised by the following image:


The first step reduces the image size to 256 x 256 resolution. The smaller the resolution, the faster that image processing calculations can be performed – but at the possible expense of accuracy. The second step was responsible for estimating the maximum possible area of overlap in consecutive images to minimise the search area for features (there’s no point in detecting features in regions of an image that you know are not going to be present in the second). This was done by taking into consideration knowledge from sensors of things such as the rover’s altitude and orientation. The third step picked out two features from an image using the Harris corner detector (discussed here in this OpenCV tutorial). Only one feature is needed for the algorithm to work but two were detected in case one feature could not be located in the following image. A few noise “clean-up” operations on images were performed in step 4 to reduce effects of things such as blurring.

Step 5 is interesting. The feature patches (aka feature templates) and search windows in consecutive images were rectified (rotated, twisted, etc.) to remove orientation and scale differences in order to make searching for features easier. In other words, the images were rotated, twisted and enlarged/diminished to be placed on the same plane. An example of this from the actual mission (from the Spirit rover’s descent) is shown in the image below. The red squares in the first image are the detected feature patches that are shown in green in the second image with the search windows shown in blue. You can see how the first and second images have been twisted and rotated such that the feature size, for example, is the same in both images.


Step 6 was responsible for locating in the second image the two features found in the first image. Moravec’s correlator (an algorithm developed by Hans Moravec and published in his PhD thesis way back in 1980) was used for this. The general idea in this algorithm is to minimise the search area first instead of searching over every possible location in an image for a match. This is done by first selecting potential regions in an image for matches and only there is a more exhaustive search performed.
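The "search only where it is worth searching" idea can be sketched as follows. This is not Moravec's correlator itself, just a minimal stand-in under my own assumptions: it scores candidate locations with a sum of squared differences, the images are tiny made-up grids, and the search window is given by hand.

```python
def extract_patch(image, top, left, size):
    """Square patch of side `size` whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]

def ssd(patch_a, patch_b):
    """Sum of squared differences: 0 means the patches are identical."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))

def find_in_window(image, template, window):
    """Best top-left corner for `template`, searching only inside `window`.

    `window` is (top, left, bottom, right), with bottom and right exclusive.
    """
    size = len(template)
    top, left, bottom, right = window
    candidates = [(r, c)
                  for r in range(top, bottom - size + 1)
                  for c in range(left, right - size + 1)]
    return min(candidates,
               key=lambda rc: ssd(template, extract_patch(image, rc[0], rc[1], size)))

def blank_image_with_patch(side, patch, top, left):
    """Greyscale test image: all zeros except for `patch` pasted at (top, left)."""
    image = [[0] * side for _ in range(side)]
    for dr, row in enumerate(patch):
        for dc, value in enumerate(row):
            image[top + dr][left + dc] = value
    return image

feature = [[9, 7], [5, 8]]
first = blank_image_with_patch(8, feature, 2, 3)   # feature at row 2, column 3
second = blank_image_with_patch(8, feature, 4, 5)  # same feature, moved

template = extract_patch(first, 2, 3, 2)
found = find_in_window(second, template, (2, 2, 8, 8))
shift = (found[0] - 2, found[1] - 3)
print(found, shift)
```

The feature is found at row 4, column 5 of the second image, a shift of two pixels in each direction. It is this kind of pixel shift between consecutive images that feeds into the velocity calculation.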

The final step is combining all this information to calculate the velocity vector. In total, the DIMES algorithm took 14 seconds to run up there in the atmosphere of Mars. It was run by both rovers during their descent. The Spirit rover was the only one that fired its impulse rockets as a result of calculations from DIMES. Its horizontal velocity was at one stage reduced from 23.5 m/s (deemed to be slightly over a safe limit) to 11 m/s, which ensured a safe landing. Computer vision to the rescue! Opportunity’s horizontal speed was never calculated to be too fast so firing its stabilising rockets was considered to be unnecessary. It also had a successful landing.

All the above steps were performed autonomously on Mars without any human intervention. 

2. Stereo Vision for Navigation

To give the MER rovers as much autonomy as possible, NASA scientists developed a stereo-vision-based obstacle detection and navigation system. The idea behind it was to give the scientists the ability to simply provide the rovers each day with a destination and for the vehicles to work things out on their own with respect to navigation to this target (e.g. to avoid large rocks).

And their system performed beautifully.

The algorithm worked by extracting disparity (depth) maps from stereo images – as I’ve already mentioned, see this OpenCV tutorial for more information on this technique. What was done, however, by the rovers was slightly different to that tutorial (for example a simpler feature matching algorithm was employed), but the gist of it was the same: feature point detection and matching was performed to find the relationship between images and knowledge of camera properties such as focal lengths and baseline distances allowed for the derivation of depth for all pixels in an image. An example of depth maps calculated in this way by the Spirit rover is shown below:

The middle picture was taken by Spirit and shows a rock on Mars approximately 0.5 m (1.6 feet) in height. The left image shows corresponding range information (red is closest, blue furthest). The right image shows corresponding height information.
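The relationship that turns a disparity into a depth is short enough to write down: for a rectified stereo pair, depth equals focal length times baseline divided by disparity. The sketch below uses made-up camera numbers purely for illustration; they are not the MER cameras' actual parameters.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in metres of a point seen `disparity_px` pixels apart in the two images."""
    if disparity_px <= 0:
        # No measurable shift between the two views: treat the point as infinitely far.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px

# Assumed values: focal length of 1000 pixels, cameras 0.2 m apart.
near = depth_from_disparity(100, focal_length_px=1000, baseline_m=0.2)
far = depth_from_disparity(50, focal_length_px=1000, baseline_m=0.2)
print(near, far)
```

A larger disparity means a closer object: with these numbers, a 100 pixel shift corresponds to 2 m and a 50 pixel shift to 4 m.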

Interestingly, the Opportunity rover, because it landed on a smoothly-surfaced plain, was forced to use its navcams (that were mounted on a mast) for its navigation. Looking down from a higher angle meant that detailed texture from the sand could be used for feature detection and matching. Its hazcams returned only the smooth surface of the sand. Smooth surfaces are not agreeable to feature detection (because, for example, they don’t have corners or edges). The Spirit rover, on the other hand, because it landed in a crater full of rocks, could use its hazcams for stereoscopic navigation.

3. Visual Odometry

Finally, computer vision on Mars was used at certain times to estimate the rovers’ position and travelling distance. No GPS is available on Mars (yet) and standard means of estimating distance travelled such as counting the number of wheel rotations was deemed during desert testing on Earth to be vulnerable to significant error due to one thing: wheel slippage. So, NASA scientists decided to employ motion estimation via computer vision instead.

Motion estimation was performed using feature tracking in 3D across successive shots taken by the navcams. To obtain 3D information, once again depth maps were extracted from stereoscopic images. Distances to features could easily be calculated from these and then the rovers’ poses were estimated. On average, 80 features were tracked per frame and a photo was taken for visual odometry calculations every 75 cm (30 inches) of travel.
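To give a flavour of how a pose change can be recovered from tracked features, here is a heavily simplified sketch. The rovers worked with 3D feature positions; this example flattens the problem to a 2D top-down view and uses a standard least-squares rigid alignment of matched points. It is an illustration under those assumptions, not the algorithm that flew, and the feature coordinates are invented.

```python
import math

def estimate_rigid_motion_2d(before, after):
    """Rotation angle and translation that best map `before` points onto `after`.

    Solves after ~= R(theta) @ before + t in the least-squares sense.
    """
    n = len(before)
    cbx = sum(p[0] for p in before) / n
    cby = sum(p[1] for p in before) / n
    cax = sum(p[0] for p in after) / n
    cay = sum(p[1] for p in after) / n
    s_dot = s_cross = 0.0
    for (bx, by), (ax, ay) in zip(before, after):
        px, py = bx - cbx, by - cby   # centred "before" point
        qx, qy = ax - cax, ay - cay   # centred "after" point
        s_dot += px * qx + py * qy
        s_cross += px * qy - py * qx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = cax - (c * cbx - s * cby)
    ty = cay - (s * cbx + c * cby)
    return theta, (tx, ty)

# Invented feature positions (metres, in the rover's own frame) before a move.
before = [(2.0, 1.0), (3.5, -0.5), (5.0, 2.0), (4.0, 0.0)]

# Simulate how they appear after the move: rotated by 10 degrees and shifted.
true_theta = math.radians(10.0)
true_t = (-0.75, 0.1)
after = [(math.cos(true_theta) * x - math.sin(true_theta) * y + true_t[0],
          math.sin(true_theta) * x + math.cos(true_theta) * y + true_t[1])
         for x, y in before]

theta, t = estimate_rigid_motion_2d(before, after)
print(round(math.degrees(theta), 3), tuple(round(v, 3) for v in t))
```

The recovered transform describes how the features moved relative to the rover; the rover's own motion is the inverse of it. Unlike a wheel counter, this estimate does not care whether the wheels were slipping.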

Using computer vision to assist in motion estimation proved to be a wise decision because wheel slippage was quite severe on Mars. In fact, at one time the rover got stuck in sand and the wheels rotated in place for the equivalent of 50m (164 feet) of driving distance. Without computer vision the rovers’ estimated positions would have been severely inaccurate. 

There was another instance where this was strikingly the case. At one time the Opportunity rover was operating on a 17-20 degree slope in a crater and was attempting to maneuver around a large rock. It had been trying to escape the rock for several days and had slid down the crater many times in the process. The image below shows the rover’s estimated trajectory (from a top-down view) using just wheel odometry (left), and the rover’s corrected trajectory (right) as assisted by computer vision calculations. The large rock is represented by the black ellipse. The corrected trajectory proved to be the more accurate estimation.



In this post I presented the three ways computer vision was used by the Spirit and Opportunity rovers during their MER missions on Mars. These three ways were:

  1. Estimating horizontal speeds during their descent onto the Red Planet to ensure the rovers had a smooth landing.
  2. Extracting 3D information of their surroundings using stereoscopic imagery to assist in navigation and obstacle detection.
  3. Using stereoscopic imagery once again but this time to provide motion and pose estimation on difficult terrain.

In this way, computer vision gave the rovers a significant amount of autonomy (much, much more autonomy than their predecessor, the Sojourner rover) that ultimately gave the rovers a safe landing and allowed the robots to traverse up to 370 m (1213 feet) per day. In fact, the Opportunity rover is still active on Mars now. This means that the computer vision techniques described in this post are churning away as we speak. If that isn’t neat, I don’t know what is!



Amazon Go – Computer Vision at the Forefront of Innovation

Where would a computer vision blog be without a post about the new cashier-less store recently opened to the public by Amazon? Absolutely nowhere.

But I don’t need additional motivation to write about Amazon Go (as the store is called) because I am, to put it simply, thrilled and excited about this new venture. This is innovation at its finest where computer vision is playing a central role.

How can you not get enthusiastic about it, then? I always love it when computer vision makes the news and this is no exception.

In this post, I wish to talk about Amazon Go under four headings:

  1. How it all works from a technical (as much as is possible) and non-technical perspective,
  2. Some of the reported issues prior to public opening,
  3. Some reported in-store issues post public opening, and
  4. Some potential unfavourable implications cashier-less stores may have in the future (just to dampen the mood a little)

So, without further ado…

How it Works – Non-Technically & Technically

The store has a capacity of around 90 people – so it’s fairly small in size, like a convenience store. To enter it you first need to download the official Amazon app and connect it to your Amazon Prime account. You then walk up to a gate like you would at a metro/subway and scan a QR code from the app. The gate opens and your shopping experience begins.

Inside the store, if you wish to purchase something, you simply pick it up off the shelf and put it in your bag or pocket. Once you’re done, you walk out of the shop and a few minutes later you get emailed a receipt listing all your purchases. No cashiers. No digging around for money or cards. Easy as pie!

What happens on the technical side of things behind the scenes? Unfortunately, Amazon hasn’t disclosed much at all, which is a bit of a shame for nerds like me. But I shouldn’t complain too much, I guess.

What we do know is that sensor fusion is employed (sensor fusion is when data is combined from multiple sensors/sources to provide a higher degree of accuracy) along with deep learning.

Hundreds of cameras and depth sensors are attached to the ceiling around the store:

The cameras and depth sensors located on the ceiling of the store (image source)

These track you and your movements (using computer vision!) throughout your expedition. Weight sensors can also be found on the shelves to assist the cameras in discerning which products you have chosen to put into your shopping basket.
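Since Amazon hasn't told us how its fusion actually works, here is only a toy sketch of the general idea, not Amazon's method. It assumes the camera gives a probability for each product and the shelf reports how many grams were removed; the two are then combined with a simple Bayes-style product. The catalogue, the weights, the `fuse` function and the `sigma_g` noise value are all made up for illustration.

```python
import math

# Hypothetical catalogue: product name -> unit weight in grams (made-up values).
CATALOGUE = {"yoghurt": 150.0, "sandwich": 230.0, "soda": 355.0}


def fuse(camera_probs, weight_drop_g, sigma_g=10.0):
    """Combine camera class probabilities with a shelf weight reading.

    Treats the two sensors as independent: the camera probability for each
    product is multiplied by a Gaussian likelihood of the measured weight
    drop, and the results are normalised to sum to 1.
    """
    scores = {}
    for name, unit_weight in CATALOGUE.items():
        z = (weight_drop_g - unit_weight) / sigma_g
        likelihood = math.exp(-0.5 * z * z)
        scores[name] = camera_probs.get(name, 0.0) * likelihood
    total = sum(scores.values())
    if total == 0.0:
        return {name: 0.0 for name in scores}
    return {name: score / total for name, score in scores.items()}


# The camera alone cannot decide between the yoghurt and the sandwich...
camera = {"yoghurt": 0.45, "sandwich": 0.50, "soda": 0.05}
# ...but the shelf sensor says roughly 152 g was lifted off.
posterior = fuse(camera, weight_drop_g=152.0)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

The point of the sketch is that a sensor that is weak on its own (a weight reading says nothing about what an item looks like) can still settle a tie that the cameras can't.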

(Note: sensor fusion is also being employed in autonomous cars. Hopefully I’ll be writing about this as well soon.)

In 2015, Amazon filed a patent application for its cashier-less store in which it stated the use of RGB cameras (i.e. colour cameras) along with facial recognition. TechCrunch, however, has reported that the Vice President of Technology at Amazon Go told them that no facial recognition algorithms are currently being used.

In-Store Issues Prior to Public Opening

Although the store opened its doors to the public a few weeks ago, it has been open to employees since December 2016. Initially, Amazon expected the store to be ready for public use a few months after that, but the public opening was delayed by nearly a year due to “technical problems”.

We know what some of the dilemmas behind these “technical problems” were.

Firstly, Amazon had problems tracking more than 20 people in the store. If you’ve ever worked on person-tracking software, you’ll know how hard it is to track a crowd of people with similar body types and wearing similar clothes. But it looks like this has been resolved (to at least a satisfactory level for them). It’s a shame for us, though, to not be given more information on how Amazon managed to get this to work.
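To give a feel for why crowds are hard, here is a minimal sketch of about the simplest tracker you could write: match each detection in the current frame to the nearest person from the previous frame by 2D Euclidean distance. This is emphatically not what Amazon uses (we don't know that); the `match_tracks` function, the `max_dist` value and the coordinates are my own assumptions for illustration.

```python
import math


def match_tracks(previous, detections, max_dist=50.0):
    """Greedily assign each new detection to the nearest existing track.

    previous:   dict of track_id -> (x, y) centroid in the last frame
    detections: list of (x, y) centroids found in the current frame
    Returns a dict of track_id -> (x, y) for the tracks that were matched.
    """
    # All candidate pairings, closest first.
    pairs = sorted(
        (math.dist(pos, det), track_id, i)
        for track_id, pos in previous.items()
        for i, det in enumerate(detections)
    )
    assigned, used = {}, set()
    for dist, track_id, i in pairs:
        if dist > max_dist:
            break  # every remaining pair is even further apart
        if track_id in assigned or i in used:
            continue
        assigned[track_id] = detections[i]
        used.add(i)
    return assigned


previous = {"shopper_a": (100.0, 100.0), "shopper_b": (300.0, 120.0)}
detections = [(305.0, 118.0), (104.0, 103.0)]
tracks = match_tracks(previous, detections)
print(tracks)
```

This works nicely when shoppers are far apart. Once two people stand shoulder to shoulder, the distances become nearly identical and identities can swap, which is where appearance cues (and similar clothes!) start to matter.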

Funnily enough, some employees of Amazon knew about this problem and in November last year tried to see if a solution had been developed. Three employees dressed up in Pikachu costumes (as reported by Bloomberg here) while doing their round of shopping to attempt to fool the system. Amazon Go passed this thorough, systematic, and very scientific test. Too bad I couldn’t find any images or videos of this escapade!

We also know that initially engineers were assisting the computer vision system behind the scenes. The system would let these people know when it was losing confidence in its tracking results and would ask them to intervene, at least minimally. Nobody is supposedly doing this task any more.
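This kind of "ask a human when unsure" setup can be sketched in a few lines. Again, this is only my guess at the general pattern, not Amazon's implementation: the event format, the `route` function and the 0.8 threshold are all hypothetical.

```python
REVIEW_THRESHOLD = 0.8  # hypothetical cut-off, chosen only for illustration


def route(events, threshold=REVIEW_THRESHOLD):
    """Split events into those handled automatically and those sent to a human."""
    automatic, needs_review = [], []
    for event in events:
        if event["confidence"] >= threshold:
            automatic.append(event)
        else:
            needs_review.append(event)
    return automatic, needs_review


events = [
    {"shopper": "a", "item": "yoghurt", "confidence": 0.95},
    {"shopper": "b", "item": "sandwich", "confidence": 0.55},
]
automatic, needs_review = route(events)
print(len(automatic), "automatic,", len(needs_review), "for human review")
```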

Lastly, I also found information stating that the system would run into trouble when products were taken off the shelf and placed back on a different shelf. This was reported to have occurred when employees brought their children into the store and they ran wild a little (as children do).

This also appears to have been taken care of, because a member of the public attempted to do this on purpose last week (see this video) with no adverse effects, it would seem.

It’s interesting to see the growing pains that Amazon Go had to go through, isn’t it? How they needed an extra year to try to iron out all these creases. This is such a huge innovation. Makes you wonder what “creases” autonomous cars will have when they become more prominent!

In-Store Issues Post Public Opening

But, alas. It appears as though not all creases were ironed out to perfection. Since Amazon Go’s opening a few weeks ago, two issues have been written about.

The first is of Deirdre Bosa of CNBC not being charged for a small tub of yoghurt:

The Vice President of Amazon Go responded in the following way:

First and foremost, enjoy the yogurt on us. It happens so rarely that we didn’t even bother building in a feature for customers to tell us it happened. So thanks for being honest and telling us. I’ve been doing this a year and I have yet to get an error.

The yoghurt manufacturer replied to that tweet also:

To which Deirdre responded: “Thanks Siggi’s! But I think it’s on Amazon :)”

LOL! 🙂

But as Amazon Go stated, it’s a rarity for these mistakes to happen. Or is that only the case until someone works out a flaw in the system?

Well, it seems as though someone has!

In this video, Tim Pool states that he managed to walk out of the Amazon Go store with a bag full of products and was only charged for one item. According to him it is “absurdly easy to take a bag full of things and not get charged”. That’s a little disconcerting. It’s one thing when the system makes a mistake every now and then. It’s another thing when someone has worked out how to break it entirely.

Tim Pool says he has contacted Amazon Go to let them know of the major flaw. Amazon confirmed with him that he did not commit a crime but “if used in practice what we did would in fact be shoplifting”.

Ouch. I bet engineers are working on this frantically as we speak.

One more issue worth mentioning, which isn’t really a flaw but could still be abused, is that at the moment you can request a refund on any item without returning it. No questions asked. Linus Tech Tips shows in this video how easily this can be done. Of course, since your Amazon Go account needs to be linked to your Amazon Prime account, if you do this too many times Amazon will catch on and will probably take some form of preventative action against you, or will even verify everything by looking back at past footage of you.


Cons of Amazon Go

Like I said earlier, I am really excited about Amazon Go. I always love it when computer vision spearheads innovation. But I also think it’s important in this post to talk about the potential unfavourable implications of a cashier-less store.

Potential Job Losses

The first and most obvious potential con of Amazon Go is the job losses that might ensue if this innovation catches on. Considering that 3.5 million people in the US are employed as cashiers (it’s the second-most common job in that country), this issue needs to be raised and discussed. Heck, there have already been protests in this respect outside of Amazon Go:

Protests in front of the Amazon Go store (image source)

Bill Ingram, the organiser of the protest shown above, asks: “What will all the cashiers do once their jobs are automated?”

Amazon, not surprisingly, has issued statements on this topic. It has said that although some jobs may be taken by automation, people can be relocated to improve other areas of the store by, for example:

Working in the kitchen and the store, prepping ingredients, making breakfast, lunch and dinner items, greeting customers at the door, stocking shelves and helping customers

Let’s not forget that new jobs have also been created. For example, additional people need to be hired to manage the technological infrastructure behind this huge endeavour.

Personally, I’m not a pessimist about automation either. The industrial revolution that brought automation to so many walks of life was hard at first but society found ways to re-educate into other areas. The same will happen, I believe, if cashier-less stores become a prominent thing (and autonomous cars also, for that matter).

An Increase in Unhealthy Impulse Purchases

Manoj Thomas, a professor of marketing at Cornell University, has stated that our shopping behaviour will change around cashier-less stores:

[W]e know that when people use any abstract form of payment, they spend more. And the type of products they choose changes too.

What he’s saying is that psychological research has shown that the more distance we put between us and the “pain of paying”, the more discipline we need to avoid those pesky impulse purchases. Having cash physically in your hand means you can see what you’re doing with your money more easily. And that extra bit of time waiting in line at the cashier could be time enough to reconsider purchasing that chocolate and vanilla tub of ice cream :/

Even More Surveillance

And then we have the perennial question of surveillance. When is too much, too much? How much more data about us can be collected?

With such sophisticated surveillance in-store, companies are going to have access to even more behavioural data about us: which products I looked at for a long time; which products I picked up but put back on the shelf; my usual path around a store; which advertisements made me smile – the list goes on. Targeted advertising will become even more effective.

Indeed, Bill Ingram’s protest pictured above was also about this (hence why masks were worn to it). According to him, we’re heading in the wrong direction:

If people like that future, I guess they can jump into it. But to me, it seems pretty bleak.

Harsh, but there might be something to it.

Less Human Interaction

Albert Borgmann, a great philosopher on technology, coined the term device paradigm in his book “Technology and the Character of Contemporary Life” (1984). In a nutshell, the term is used to explain the hidden, detrimental nature and power of technology in our world (for a more in-depth explanation of the device paradigm, I highly recommend you read his philosophical works).

One of the things he laments is how we are increasingly losing daily human interactions due to the proliferation of technology. The sense of a community with the people around us is diminishing. Cashier-less stores are pushing this agenda further, it would seem. And considering, according to Aristotle anyway, that we are social creatures, the more we move away from human interaction, the more we act against our nature.

The Chicago Tribune wrote a little about this at the bottom of this article.

Is this something worth considering? Yes, definitely. But only in the bigger picture of things, I would say. At the moment, I don’t think accusing Amazon Go of trying to damage our human nature is the way to go.

Personally, I think this initiative is something to celebrate – albeit, perhaps, with just the faintest touch of reservation. 


In this post I discussed the cashier-less store “Amazon Go” recently opened to the public. I looked at how the store works from a technical and non-technical point of view. Unfortunately, I couldn’t say much from a technical angle because of how little information has been disclosed to us by Amazon. I also discussed some of the issues that the store has dealt with and is dealing with now. I mentioned, for example, that initially there were problems in trying to track more than 20 people in the store. But this appears to have been solved to a satisfactory level (for Amazon, at least). Finally, I dampened the mood a little by discussing the potential unfavourable implications that a proliferation of cashier-less stores may have on our societies. Some of the issues raised here are important but ultimately, in my humble opinion, this endeavour is something to celebrate – especially since computer vision is playing such a prominent role in it.

