My Top 5 Posts So Far

It’s been nearly 18 months since I started this blog. I did it to share my journey in computer vision with you. I love this field and I’m always stumbling across such fascinating things that I feel as though more people should know about them.

I’ve seen this blog grow in popularity – much, much more than I had anticipated when I first started it. In this little “bonus” post, I thought I’d list my top posts thus far with additional comments about them.

I also thought I’d compile a second list with my personal favourite posts. These have not been as popular but I sure as hell had fun writing them!

Enjoy! And thanks for the support over the last 18 months.

My top 5 posts thus far:

  1. Why Deep Learning Has Not Superseded Traditional Computer Vision – I wrote this post on a Friday evening directly after work with a beer bottle in one hand and people playing pool or foosball around me. I wrote it up in an hour or so and didn’t think much of it, to be honest. The next day I woke up and saw, to my extreme surprise, that it had gone slightly viral! It was featured in Deep Learning Weekly (Issue #76), was being reposted by people such as Dr Adrian Rosebrock from PyImageSearch, and was getting about 1000 hits/day. Not bad, hey!?
  2. The Top Image Datasets and Their Challenges
  3. Finding a Good Thesis Topic in Computer Vision – I wrote this post after constantly seeing people asking this question on forums. Considering it’s consistently in my top 3 posts every week, I guess people are still searching for inspiration.
  4. Mapping Camera Coordinates to a 2D Floor Plan – This post came about after I had to work on security footage from a bank for a project at work. The boss was very pleased with what I had done and writing about my experiences in a post was a no-brainer after that.
  5. The Early History of Computer Vision – History is something that really interests me so it was only a matter of time before I was going to read up on the history of computer vision. Once I did and saw how fascinating it was, I just had to write a post about it.

My favourite posts thus far. Like I said, these are not popular (some barely get a single hit in a week) but I really enjoyed researching for and writing them:

  1. Soccer on Your Tabletop – The coolest thing going around in computer vision.
  2. Amazon Go – Computer Vision at the Forefront of Innovation – This to me is something amazing.
  3. The Baidu and ImageNet Controversy – Nothing like a good controversy!
  4. Computer Vision on Mars – Computer vision in space. Imagine working on that project!
  5. The Growth of Computer Vision in the Industry / The Reasons Behind the Recent Growth of Computer Vision – I’m proud of how far computer vision has come over the years. It’s been a pleasure to be a part of the adventure.

Enjoy looking back over my posts. Thanks once again for your support over the last 18 months.


To be informed when new content like this is posted, subscribe to the mailing list:

Please share what you just read:

Smartphone Camera Technology from Google and Nokia

A few days ago Nokia unveiled its new smartphone: the Nokia 9 PureView. It looks kind of weird (or maybe funky?) with its 5 cameras at its rear (see image above). But what’s interesting is how Nokia uses these 5 cameras to give you better quality photos with a technique called High Dynamic Range (HDR) imaging.

HDR has been around in smartphones for a while, though. In fact, Google has had this imaging technique available in some of its phones since at least 2014. And in my opinion it does a much better job than Nokia. 

In this post I would like to discuss what HDR is and then present what Nokia and Google are doing with it to provide some truly amazing results. I will break the post up into the following sections:

  • High Dynamic Range Imaging (what it is)
  • The Nokia 9 PureView
  • Google’s HDR+ (some amazing results here)

High Dynamic Range Imaging

I’m sure you’ve attempted to take photos of high luminosity range scenarios such as dimly lit scenes or ones where the backdrop is brightly radiant. Frequently such photos come out either overexposed, underexposed and/or blurred. The foreground, for example, might be completely in shadow or details will be blurred out because it’s hard to keep the camera still when you have the shutter speed set to low to let in extra light.

HDR attempts to alleviate these high range scenario problems by capturing additional shots of the same scene (at different exposure levels, for instance) and then taking what’s best out of each photo and merging this into one picture.
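To make the "take what's best out of each photo" idea concrete, here is a minimal sketch of one simple way to merge exposures. It is not what Nokia or Google actually do; it just weights each pixel by how close it is to mid-grey, so well-exposed pixels dominate the merged result. The function names, the `sigma` parameter, and the tiny three-pixel "photos" are all made up for illustration.

```python
import math

def well_exposedness(value, sigma=0.2):
    """Weight in (0, 1]: highest for mid-grey (0.5), lowest near pure black or white."""
    return math.exp(-((value - 0.5) ** 2) / (2 * sigma ** 2))

def fuse_exposures(shots):
    """Merge equally sized greyscale shots (lists of floats in [0, 1]) pixel by pixel."""
    fused = []
    for pixels in zip(*shots):
        weights = [well_exposedness(p) for p in pixels]
        total = sum(weights)  # always > 0 because exp() never returns zero here
        fused.append(sum(w * p for w, p in zip(weights, pixels)) / total)
    return fused

# Hypothetical data: the same three pixels shot with a short and a long exposure.
underexposed = [0.02, 0.05, 0.45]
overexposed = [0.40, 0.55, 0.98]
fused = fuse_exposures([underexposed, overexposed])
print([round(v, 3) for v in fused])
```

The dark pixels are taken mostly from the long exposure and the bright pixel mostly from the short one, which is the basic intuition behind HDR merging.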

Photo by Gustave Le Gray (image taken from Wikipedia)

Interestingly, the idea of taking multiple shots of a scene to provide a better single photo goes back to the 1850s. Gustave Le Gray, a highly noted French photographer, rendered seascapes showing both the sky and the sea by using one negative for the sky, and another one with a longer exposure for the sea. He then combined the two into one picture in the positive. Quite innovative for the period. The picture on the right was captured by him using the HDR technique.


The Nokia 9 PureView

As you’ve probably already guessed, Nokia uses the five cameras on the Nokia 9 PureView to take photos of the same scene. However, the cameras are not all the same. Two are standard RGB sensors that capture colour. The remaining three are monochrome sensors that capture nearly three times as much light as the RGB cameras. Each of the 5 cameras has a resolution of 12 megapixels. There is also an infrared sensor for depth readings.

Depending on the scene and lighting conditions each camera can be triggered up to four times in quick succession (commonly referred to as burst photography).

One colour photo is then selected to act as the primary shot and the other photos are used to improve it with details.

The final result is a photo of up to 240 megapixels. Interestingly, you also have control over how much photo merging takes place and where this merging occurs. For example, you can choose to add additional detail to the foreground and ignore the background. The depth map from the depth sensor undoubtedly assists in this. And yes, you have access to all the RAW files taken by the cameras.

Not bad, but in my opinion Google does a much better job… and with only one camera. Read on!

Google’s HDR+

Google’s HDR technology is dubbed HDR+. It has been around for a while, first appearing in the Nexus 5 and 6 phones. It is now a standard on the Pixel range of phones. It is standard because HDR+ uses the regular single camera on Google’s phones. 

It gets away with just using one camera by taking up to 10 photos in quick succession – much more than Nokia does. Although the megapixel quality of the resulting photos may not match Nokia’s, the results are nonetheless impressive. Just take a look at this:

(image taken from here)

That is a dimly lit indoor scene. The final result is truly astonishing, isn’t it?

Here’s another example:

Both pictures were taken with the same camera. The picture on the left was captured with HDR+ turned off while the picture on the right had it turned on. (image taken from here)

What makes HDR+ stand out from the crowd is its academic background. This isn’t some black-box technology that we know nothing about – it’s a technology that has been peer-reviewed by world-class academics and published in a world-class conference (SIGGRAPH Asia 2016).

Moreover, only last month, Google released to the public a dataset of archives of image bursts to help improve this technology.

When Google does something, it (usually) does it with a bang. You have to love this. This is HDR imaging done right.



Google’s Dataset Search Engine

Did you know that there is now a search engine for datasets that is powered by Google? Well, there is! And it’s something that the research community and the industry have been needing (whether they knew it or not) for years now.

This new search engine is called Dataset Search and can be found at this link.

This is a big deal. Datasets have become crucial since the prominent arrival of deep learning onto the scene a few years ago. Deep learning needs data. Lots and lots of data. This is because in deep learning, neural networks are told to (more or less) autonomously discover the underlying patterns in data. In computer vision, for example, you would want a machine to learn that bicycles are composed of two wheels, a handlebar, and a seat. But you need to provide enough examples for a machine to be able to learn these patterns.

Creating such large datasets is not an easy task. Some of the top image datasets (as I have documented here), contain millions of hand annotated images. These are famous datasets that most people in the computer vision world know about. But what about datasets that are more niche and hence less known? Some of these can be very difficult to find – and you certainly would not want to spend months or years creating them only to find that someone had already gone to all the trouble before you.

Up until now, then, there was no central location to search for these datasets. You had to manually traverse the web in the hope of finding what you were looking for. But that was until Dataset Search came along! Thank the heavens for that. Although Dataset Search is still in its beta stage, this is definitely something the research and industry communities have been needing.

For datasets to be listed in a coherent and informative manner on Dataset Search, Google has developed guidelines for dataset providers. These guidelines are based on schema.org, which is an open standard for describing such information (in metadata tags). As Google states:

We encourage dataset providers, large and small, to adopt this common standard so that all datasets are part of this robust ecosystem.

It would be a good idea to start adhering to these guidelines when creating datasets because a central place of reference for datasets is something we all need.
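To give a rough idea of what such metadata can look like, here is a minimal sketch that builds a dataset description as JSON-LD. It assumes the open standard in question is schema.org's `Dataset` type; the dataset name, URL, and creator below are invented placeholders, and Google's guidelines should be consulted for the properties that are actually required or recommended.

```python
import json

# Hypothetical dataset description using schema.org's Dataset vocabulary.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example Street Sign Images",
    "description": "A made-up collection of annotated street sign photos.",
    "url": "https://example.org/datasets/street-signs",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Lab"},
}

# JSON-LD is usually embedded in the dataset's web page inside a script tag.
json_ld = json.dumps(dataset_metadata, indent=2)
snippet = '<script type="application/ld+json">\n' + json_ld + "\n</script>"
print(snippet)
```

The resulting snippet would sit in the HTML of the page that hosts the dataset, where a crawler can pick it up.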

As a side note, Dataset Search has been in development for at least three years (interestingly, Dataset Search’s previous name was actually Goods – Google Dataset Search). Google released two academic papers on this in 2016 – see here and here. It’s nice to see that their work has finally culminated in what they have offered us now.

Dataset Search is definitely a step in the right direction.



Seeing Around Corners with a Laser

In this post I would like to show you some results of another interesting paper I came across recently that was published last year in the prestigious Nature journal. It’s on the topic of non-line-of-sight (NLOS) imaging or, in other words, it’s about research that helps you see around corners. NLOS could be something particularly useful for use cases such as autonomous cars in the future.

I’ll break this post up into the following sections:

  • The LIDAR laser-mapping technology
  • LIDAR and NLOS
  • Current Research into NLOS

Let’s get cracking, then.


You may have heard of LIDAR (a term which combines “light” and “radar”). It is used very frequently as a tool to scan surroundings in 3D. It works similarly to radar but instead of emitting radio waves, it sends out pulses of infrared light and then calculates the time it takes for this light to return to the emitter. Closer objects will reflect this laser light quicker than distant objects. In this way, a 3D representation of the scene can be acquired, like this one which shows a home damaged by the 2011 Christchurch Earthquake:

(image obtained from here)
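The timing idea behind LIDAR boils down to one line of arithmetic: the pulse travels to the object and back, so the distance is half the round-trip time multiplied by the speed of light. Here is a minimal sketch of that calculation (the function name and the example timings are mine, purely for illustration):

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458

def distance_from_round_trip(seconds):
    """Distance to the reflecting surface; the pulse covers it twice (there and back)."""
    return SPEED_OF_LIGHT_M_PER_S * seconds / 2

# Hypothetical timings: a pulse returning after 20 ns versus one returning after 200 ns.
near = distance_from_round_trip(20e-9)
far = distance_from_round_trip(200e-9)
print(f"near: {near:.2f} m, far: {far:.2f} m")
```

A 20 nanosecond round trip corresponds to roughly 3 metres, which shows how precise the timing electronics need to be.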

LIDAR has been around for decades and I came across it very frequently in my past research work in computer vision, especially in the field of robotics. More recently, LIDAR has been experimented with in autonomous vehicles for obstacle detection and avoidance. It really is a great tool to acquire depth information of the scene.

NLOS Imaging

But what if where you want to see is obscured by an object? What if you want to see what’s behind a wall or what’s in front of the car in front of you? LIDAR does not, by default, allow you to do this:

The rabbit object is not reachable by the LIDAR system (image adapted from this video)

This is where the field of NLOS comes in.

The idea behind NLOS is to use sensors like LIDAR to bounce laser light off walls and then read back any reflected light.

The laser is bounced off the wall to reach the object hidden behind the occluder (image adapted from this video)

This process is repeated around a particular point (p in the image above) to obtain as much reflected light as possible. The reflected light is then analysed in an attempt to reconstruct any objects on the other side of the occlusion.

This is still an open area of research with many assumptions (e.g. that light is not reflected multiple times by the occluded object but bounces straight back to the wall and then the sensors) but the work on this done so far is quite intriguing.
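To get a feel for the geometry, here is a rough sketch under simplifying assumptions of my own: the laser and the detector both look at the same wall point p, the sensor-to-wall distance is known, and the light bounces wall, hidden object, wall without any extra reflections. Under those assumptions the timing of a returning pulse tells you how far the hidden surface is from p (though not in which direction, which is why many wall points are scanned). The function name and numbers are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def hidden_distance(total_round_trip_s, sensor_to_wall_m):
    """Distance from wall point p to the hidden surface, assuming a single
    wall -> object -> wall bounce with laser and detector aimed at the same p."""
    total_path = SPEED_OF_LIGHT_M_PER_S * total_round_trip_s
    hidden_path = total_path - 2 * sensor_to_wall_m  # remove sensor<->wall legs
    if hidden_path < 0:
        raise ValueError("round trip is shorter than the sensor-to-wall path")
    return hidden_path / 2  # the p<->object leg is travelled twice

# Hypothetical setup: wall 1.5 m from the sensor, object 0.4 m from p.
round_trip = (2 * 1.5 + 2 * 0.4) / SPEED_OF_LIGHT_M_PER_S
recovered = hidden_distance(round_trip, sensor_to_wall_m=1.5)
print(f"{recovered:.3f} m")
```

Real reconstruction methods combine many such measurements into a 3D shape; this only illustrates where the raw distance information comes from.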

Current Research into NLOS

The paper that I came across is entitled “Confocal non-line-of-sight imaging based on the light-cone transform“. It was published in March of last year in the Nature journal (555, no. 7696, p. 338). Nature is one of the world’s top and most famous academic journals, so anything published there is more than just world-class – it’s unique and exceptional.

The experiment setup from this paper was as shown here:

The setup of the experiment for NLOS. The laser light is bounced off the white wall to hit and reflect off the hidden object (image taken from original publication)

The idea, then, was to try and reconstruct anything placed behind the occluder by bouncing laser light off the white wall. In the paper, two objects were scrutinised: an “S” (as shown in the image above) and a road sign. With a novel method of reconstruction, the authors were able to obtain the following reconstructed 3D images of the two objects:


(image adapted from original publication)

Remember, these results are obtained by bouncing light off a wall. Very interesting, isn’t it? What’s even more interesting is that the text on the street sign has been detected as well. Talk about precision! You can clearly see how one day, this could come in handy with autonomous cars, which could use information such as this to increase safety on the roads.

A computer simulation was also created to accurately ascertain the error rates involved in the reconstruction process. The simulated setup was as shown in the above images with the bunny rabbit. The results of the simulation were as follows:

(image adapted from original publication)

The green in the image is the reconstructed parts of the bunny superimposed on the original object. You can clearly see how the 3D shape and structure of the object is extremely well-preserved. Obviously, the parts of the bunny not visible to the laser could not be reconstructed.


This post introduced the field of non-line-of-sight imaging, which is, in a nutshell, research that helps you see around corners. The idea behind NLOS is to use sensors like LIDAR to bounce laser light off walls and then read back any reflected light. An attempt is then made to reconstruct the scene behind the occlusion.

Recent results from state-of-the-art research in NLOS published in the Nature journal were also presented in this post. Although much more work is needed in this field, the results are quite impressive and show that NLOS could one day be very useful with, for example, autonomous cars, which could use information such as this to increase safety on the roads.



Heart Rate Estimation Using Computer Vision

This post is another that has been inspired by a forum question: “What are some lesser known use cases for computer vision?” I jumped at the opportunity to answer this question because if there’s one thing I’m proud of with respect to this blog, it is the weird and wacky use cases that I have documented here.

Some things I’ve talked about include:

In this post I would like to add to the list above and discuss another lesser known use case for computer vision: heart rate estimation from colour cameras.

Vital Signal Estimation

Heart rate estimation belongs in the field called “Vital Signal Estimation” (VSE). In computer vision, VSE has been around for a while. One of the more famous attempts at this comes from 2012 from a paper entitled “Eulerian Video Magnification for Revealing Subtle Changes in the World” that was published at SIGGRAPH by MIT.

(Note: as I’ve mentioned in the past, SIGGRAPH, which stands for “Special Interest Group on Computer GRAPHics and Interactive Techniques”, is a world-renowned annual conference held for computer graphics researchers. But you do sometimes get papers from the world of computer vision being published there as is the case with this one.)

Basically, the way that VSE was implemented by MIT was that they analysed images captured from a camera for small illumination changes on a person’s face produced by varying amounts of blood flow to it. These changes were then magnified to make it easier to scrutinise. See, for example, this image from their paper:

(image source)

Amazing that these illumination changes can be extracted like this, isn’t it?

This video describes the Eulerian Video Magnification technique developed by these researchers (colour amplification begins at 1:25):

Interestingly, most research in VSE has focused around this idea of magnifying minute changes to estimate heart rates.
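To illustrate how a heart rate can fall out of tiny brightness changes, here is a heavily simplified sketch. It is not the Eulerian Video Magnification method itself: it assumes you already have one number per video frame (say, the average brightness of the face region) and simply looks for the frequency in the plausible heart-rate range that best matches that signal. The function name, the bpm range, and the synthetic signal are all my own inventions.

```python
import math

def dominant_bpm(signal, fps, low_bpm=40, high_bpm=180):
    """Return the beats-per-minute whose sinusoid correlates best with the signal."""
    mean = sum(signal) / len(signal)
    centred = [s - mean for s in signal]  # drop the constant brightness level
    best_bpm, best_power = None, -1.0
    for bpm in range(low_bpm, high_bpm + 1):
        freq = bpm / 60.0  # beats per second
        re = sum(v * math.cos(2 * math.pi * freq * i / fps) for i, v in enumerate(centred))
        im = sum(v * math.sin(2 * math.pi * freq * i / fps) for i, v in enumerate(centred))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm

# Synthetic stand-in for "mean face brightness per frame":
# a faint 72 bpm pulse on top of a constant level, 10 seconds at 30 fps.
fps = 30
frames = [100 + 0.5 * math.sin(2 * math.pi * (72 / 60) * i / fps) for i in range(300)]
estimate = dominant_bpm(frames, fps)
print(estimate)
```

Real footage is far noisier than this synthetic signal, which is exactly why the research focuses on isolating and magnifying such minute changes first.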

Uses of Heart Rate Estimation

What could heart rate estimation by computer vision be used for? Well, medical scenarios automatically come to mind because of the non-invasive (i.e. non-contact) feature of this technique. The video above (from 3:30) suggests using this technology for SIDS detection. Stress detection is also another use case. And what follows from this is lie detection. I’ve already written about lie detection using thermal imaging – here is one more way for us to be monitored unknowingly.

On the topic of being monitored, this paper suggests (using a slightly different technique of magnifying minute changes in blood flow to the face) detecting emotional reactions to TV programs and advertisements.

Ugh! It’s just what we need, isn’t it? More ways of being watched.

Making it Proprietary

From what I can see, VSE in computer vision is still mostly in the research domain. However, this research from Utah State University has recently been patented and a company called Photorithm Inc. formed around it. The company produces baby monitoring systems to detect abnormal breathing in sleeping infants. In fact, Forbes wrote an interesting article about this company this year. Among other things, the article talks about the reasons behind the push by the authors for this research and how the technology behind the application works. It’s a good read.

Here’s a video demonstrating how Photorithm’s product works:


This post talked about another lesser known use case of computer vision: heart rate estimation. MIT’s famous research from 2012 was briefly presented. This research used a technique of magnifying small changes in an input video for subsequent analysis. Magnifying small changes like this is how most VSE technologies in computer vision work today.

After this, a discussion of what heart rate estimation by computer vision could be used for followed. Finally, it was mentioned that VSE is still predominantly something in the research domain, although one company has recently appeared on the scene that sells baby monitoring systems to detect abnormal breathing in sleeping infants. The product being sold by this company uses computer vision techniques presented in this post.



Image Steganography – Simple Examples

In my last post I introduced the field of image steganography, which is the practice of concealing secret messages in digital images. I looked at the history of steganography and presented some recently reported real-life cases (including one from the FBI) where digital steganography was used for malicious purposes.

In this post I would like to present to you the following two very simple ways messages can be hidden in digital images:

  • JPEG Concealing
  • Least Significant Bit Technique

These techniques, although trivial and easy to detect, will give you an idea of how simple (and therefore potentially dangerous) digital image steganography can be.

JPEG Concealing

Image files in general are composed of two sections: header data + image data. The header data section can contain metadata information pertaining to the image such as date of creation, author, image resolution, and compression algorithm used if the image is compressed. This is the standard for JPEGs, BMPs, TIFFs, GIFs, etc.

Knowing this, one can work around these file structures to conceal messages. 

Let’s take JPEGs as an example. The file structure for this format is as follows:


Notice that every single JPEG file starts and ends with the SOI and EOI markers, respectively.

What this means is that any image interpreting application (e.g. Photoshop or GIMP, any internet browser, the standard photo viewing software that comes with your operating system, etc.) looks for these markers inside the file and knows that it should interpret and display whatever comes between them. Everything else is automatically ignored. 

Hence, you can insert absolutely anything after the EOI marker like this:


And even though the hidden message will be part of the JPEG file and travel with this file wherever it goes, no standard application will see anything out of the ordinary. It will just read whatever comes before EOI.

Of course, if you put a lot of data after EOI, your file size will increase significantly and might, therefore, arouse suspicion – so you have to be wary of that. In this case, it might be an idea to use a high resolution JPEG file (that naturally has a large file size) to turn attention away from your hidden message.

If you would like to try this steganography technique out yourself, download a hex editor for your machine (if you use Windows, WinHex is a good program), search for FF D9 (which is the hex version of EOI), paste anything you want after this section marker, and save your changes. You will notice that the file is opened like any other JPEG file. The hidden message simply piggy backs on top of the image file. Quite neat!

(Note: hexadecimal is a number system made up of 16 symbols. Our decimal system uses 10 digits: 0-9. The hex system uses the 10 digits from the decimal system plus the first 6 letters of the alphabet. To cut a long story short, hexadecimal is a shorthand and therefore much easier way to read/write binary digits, i.e. 1s and 0s. Most file formats will not save data in human readable form and we therefore need help if we want to view the raw data of these files – this is why hex is sometimes used.)
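If you would rather script this than use a hex editor, the same trick can be sketched in a few lines. This is only an illustration under my own assumptions: the hidden message is plain text (UTF-8 text can never contain the byte FF, so the last FF D9 in the file is still the real EOI marker), and the "JPEG" used below is a made-up stand-in consisting of just the two markers and some filler bytes, not a viewable image. The function names are hypothetical.

```python
SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker

def hide_after_eoi(jpeg_bytes: bytes, message: str) -> bytes:
    """Append a text message after the final EOI marker."""
    if not jpeg_bytes.endswith(EOI):
        raise ValueError("expected data ending with the EOI marker")
    return jpeg_bytes + message.encode("utf-8")

def reveal_after_eoi(data: bytes) -> str:
    """Return whatever text follows the last EOI marker."""
    end = data.rfind(EOI)
    if end == -1:
        raise ValueError("no EOI marker found")
    return data[end + len(EOI):].decode("utf-8")

# Stand-in bytes only: real JPEG data would sit between the two markers.
fake_jpeg = SOI + b"\x00\x01\x02" + EOI
stego = hide_after_eoi(fake_jpeg, "meet at noon")
revealed = reveal_after_eoi(stego)
print(revealed)
```

With a real file you would read the bytes with `open(path, "rb")` and write the result back out with `open(path, "wb")`.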

The Least Significant Bit Technique

Although easy to detect (if you know what you’re looking for), the Least Significant Bit (LSB) technique is a very sly way of hiding data in images. The way that it works is by taking advantage of the fact that small changes in pixel colour are invisible to the naked eye.

Let’s say we’re encoding images in the RGB colour space – i.e. each pixel’s colour is represented by a combination of a certain amount of red (R), a certain amount of green (G), and a certain amount of blue (B). The amount of red, green, and blue is given in the range of 0 to 255. So, pure red would be represented as (255, 0, 0) in this colour space – i.e. the maximum amount of red, no green, and no blue.

Now, in this scenario (and abstracting over a few things), a machine would represent each pixel in 3 bytes – one byte for each of red, green and blue. Since a byte is 8 bits (i.e. 8 ones and zeros) each colour would be stored as something like this:


That would be the colour red (11111111 in binary is 255 in our number system).

What about if we were to change the 255 into 254 – i.e. change 11111111 into 11111110? Would we notice the difference in the colour red? Absolutely not. How about  changing 11111111 to 11111100 (255 to 252)? We still would not notice the difference – especially if this change is happening to single pixels!
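The arithmetic here is easy to check for yourself. A minimal sketch (variable names are mine) that clears the last one or two bits of the red value with a bit mask:

```python
red = 0b11111111  # 255, the maximum amount of red

one_bit_cleared = red & 0b11111110   # clear the last bit
two_bits_cleared = red & 0b11111100  # clear the last two bits

for value in (red, one_bit_cleared, two_bits_cleared):
    print(format(value, "08b"), value)
```

Even with the last two bits cleared, the value moves by at most 3 out of 255, which is a little over 1% of the full range.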

The idea behind the LSB technique, then, is to use this fact that slightly changing the colour of each pixel would be imperceptible to the naked eye.

Since the last few digits in a byte are insignificant this is where LSB gets its name: the Least Significant Bit technique.

We know, then, that the last few bits in each byte can be manipulated. So, we can use this knowledge to set aside these bits of each pixel to store a hidden message.

Let’s look at an example. Suppose we want to hide a message like “SOS“. We choose to use the ASCII format to encode our letters. In this format each character has its own binary representation. The binary for our message would be:


What we do now is split each character into two-bit pairs (e.g. S has the following four pairs: 01, 01, 00, 11) and spread these pairs successively along multiple pixels. So, if our image had four pixels, our message would be encoded like this:


Notice that each letter is spread across two pixels: the first S, for example, has its first 3 pairs stored in the three colour channels of the first pixel, and the next pixel takes its last pair. Very neat, isn’t it? You can choose to use more than 2 bits per colour channel to store your message but remember that by using more bits you risk changes to each pixel becoming perceptible.

Also, the larger the image, the more you can encode. And, since images can be represented in binary, you can store an image inside an image using the exact same technique. In this respect, I would highly recommend that you take a look at this little website that will allow you to do just that using the method described here.

I would also recommend going back to the first post of this series and looking at the image-inside-image steganography examples there provided by the FBI. It shows brilliantly how sneaky image steganography can be.
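To make the “SOS” example concrete, here is a minimal sketch of the two-bit LSB technique on a tiny four-pixel image. The function names and the all-red cover image are my own choices for illustration; a real tool would also need to store the message length somewhere and work on actual image files.

```python
def to_bit_pairs(message):
    """Split each ASCII character into four 2-bit values, most significant first."""
    pairs = []
    for byte in message.encode("ascii"):
        for shift in (6, 4, 2, 0):
            pairs.append((byte >> shift) & 0b11)
    return pairs

def embed(pixels, message):
    """Overwrite the last 2 bits of successive colour channels with the message."""
    pairs = to_bit_pairs(message)
    channels = [c for pixel in pixels for c in pixel]
    if len(pairs) > len(channels):
        raise ValueError("image too small for this message")
    for i, pair in enumerate(pairs):
        channels[i] = (channels[i] & 0b11111100) | pair
    return [tuple(channels[i:i + 3]) for i in range(0, len(channels), 3)]

def extract(pixels, length):
    """Read back `length` characters from the last 2 bits of each channel."""
    channels = [c for pixel in pixels for c in pixel]
    chars = []
    for k in range(length):
        byte = 0
        for c in channels[4 * k:4 * k + 4]:
            byte = (byte << 2) | (c & 0b11)
        chars.append(chr(byte))
    return "".join(chars)

cover = [(255, 0, 0)] * 4     # four pure-red pixels
stego = embed(cover, "SOS")   # 3 letters x 4 pairs = 12 channels = 4 pixels
recovered = extract(stego, 3)
print(stego, recovered)
```

No colour channel changes by more than 3, so the four pixels still look pure red to the eye while carrying the whole message.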


In this post I looked at two simple techniques of image steganography. The first technique takes advantage of the fact that JPEG files end with an end-of-image (EOI) marker. This means that any program opening these images will read everything up to and including this marker. If you were to put anything after the EOI, it would be hidden from view. The second technique takes advantage of the fact that slightly changing the colour of a pixel is imperceptible to the naked eye. In this sense, the least significant bits of each pixel can be used to spread a message (e.g. text or image) across the pixels in an image. A program would then be used to extract this message.



Image Steganography – An Introduction

In this post (part 1 of 2) I would like to introduce the topic of image steganography, which is the practice of concealing secret messages in digital images. I’ve always been fascinated by this subject so I have taken the excuse to research for this post as a way to delve into the topic. Turns out that image steganography is a fascinating field that should be garnering much more attention than it is.

I will divide the post into two sections:

  1. Steganography: what it is and its early history
  2. Digital image steganography and some recently reported real-life cases – including one from an FBI report on Russian spying in the US (like something out of the cold war)

In my next post I will detail some simple techniques of hiding messages in images, so stay tuned for that.

What is Steganography

Usually today if we want to send sensitive data (e.g. credit card information), we encrypt this data before sending it across the internet. Sending messages like this, however, can arouse suspicion: there is obviously sensitive/secret data in your encrypted message that you are trying to conceal. Attackers know exactly where to look to try to obtain this information.

But steganography works differently: you hide the message in plain sight in order for your message to not attract any attention at all.

The first recorded case of steganography goes back to 499 BC when the Greek tyrant Histiaeus shaved the head of his slave and “marked” (probably tattooed) a secret message onto it. The message was intended for Aristagoras and it was telling him to start a revolt against the Persians. Histiaeus waited for the slave’s hair to grow back before sending him on his way. When the slave reached Aristagoras, his head was shaved again to reveal the hidden message.

(image source)

Who would have thought to stop the slave and look for a hidden message tattooed on his head? Ingenious, wasn’t it? (Well, maybe not for the slave who was probably left with that message permanently on his head…).

That’s the way steganography works: through deception.

It is an important topic because of how seemingly common it is becoming. A report from 2017 by the global computer security software company McAfee says that steganography is being used in more ways today than ever before. However, Simon Wiseman, the chief technology officer of the network security firm Deep Secure, argues that it’s not so much that steganography is becoming more popular, just that we are discovering it more often by learning how it is being done: “now that people are waking up to the fact that it’s out there, the discovery rate is going up.”

Either way, as McAfee claims: “Steganography will continue to become more popular.”

Digital Image Steganography

As mentioned earlier, digital image steganography is the hiding of secret messages inside images. Take a look at these two images distributed by the FBI:

Figure 8 shows a JPEG carrier file containing the airport map.  Figure 6 shows a GIF carrier file containing the airport map.

You wouldn’t think that both of them contain the following map of an airport, would you?

Figure 5 shows this map is hidden in the various carriers in this article.

Well, they do. The FBI doesn’t lie 🙂

It’s a scary thing when you consider the huge number of images being sent across the internet every day. You would really have to know precisely where to scan for this stuff and what to look for otherwise you’re searching for a needle in a haystack.

Now, the first recorded case of image steganography in a cyberattack dates back to 2011. It was called the Duqu malware attack and it worked by encrypting and embedding data into small JPEG image files. These files were then sent to servers to obtain sensitive information (rather than doing destructive work directly like deleting files). McAfee says that it was used to, for example, steal digital certificates from its victims. How Duqu worked exactly, however, remains unknown. Researchers are still trying to work this out (although all sources I could find on this are fairly outdated). Quite amazing.

I found earlier reported cases, however, of image steganography being used for malicious purposes, not necessarily in cyberattacks. My favourite one is from the FBI.

Here’s an official report from them from 2010 accusing the Russian foreign intelligence agency of embedding encrypted text messages inside image files for communications with agents stationed abroad. This reportedly all took place in the 90s in the US. Turns out that the 10 spies mentioned in the report later pleaded guilty to being Russian agents and were used as part of a spy swap between the U.S. and Russian governments. The FBI and the Russians… and spy swapping! Like something out of a movie. Shows you how serious the topic of digital image steganography is.


You can see how this way of embedding communication in images is a much more sophisticated version of the “tattooing a message on a shaved head” example from Ancient Greece described above.

Is image steganography being used like this by ISIS to communicate secretly among themselves? Chances are it is.

Early this year a communication tool was discovered called MuslimCrypt (poor choice of name, in my opinion). As Wired reports, the tool was found in a private, pro-ISIS Telegram channel on January 20.  It is dead simple to use (take a look at the video on the Wired page to see this for yourselves): you select an image, write a message in text form, select a password, and click one button to hide this message inside the image. This image can then be sent across the internet after which the recipient puts it into MuslimCrypt and with one click of a button retrieves the hidden message. Sneaky, dangerous stuff.

What would make detection even more difficult is a hidden message distributed over multiple images. Well, models for this already exist, as the academic paper “Distributed Steganography”  (Liao et al., International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2011) presents.

Moreover, a patent for distributed steganography was filed by a certain William Charles Easttom II in 2010. This image from the patent summarises distributed steganography nicely:

(Image adapted from the original patent)
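To make the idea of hiding a message, and of spreading it over several images, a bit more concrete, here is a minimal sketch. It assumes the classic textbook least-significant-bit (LSB) trick and a simple round-robin split. It is not how Duqu, MuslimCrypt, the paper or the patent actually work; there is no password or encryption; and an "image" here is just a flat list of 0-255 pixel values. All function names are made up for illustration.

```python
def embed_lsb(pixels, payload):
    """Hide payload bytes in the least significant bit of successive pixel values."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("carrier too small for payload")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # overwrite only the lowest bit
    return stego


def extract_lsb(pixels, n_bytes):
    """Read n_bytes back out of the least significant bits."""
    out = bytearray()
    for b in range(n_bytes):
        byte = 0
        for bit_index in range(8):
            byte = (byte << 1) | (pixels[b * 8 + bit_index] & 1)
        out.append(byte)
    return bytes(out)


def split_round_robin(message, n_carriers):
    """Deal the message bytes out over n_carriers chunks, like dealing cards."""
    return [message[i::n_carriers] for i in range(n_carriers)]


def join_round_robin(chunks):
    """Inverse of split_round_robin."""
    out = bytearray(sum(len(c) for c in chunks))
    for i, chunk in enumerate(chunks):
        out[i::len(chunks)] = chunk
    return bytes(out)


# Three made-up "images" of 64 pixels each act as carriers.
message = b"meet at noon"
carriers = [[(37 * i + 11 * k) % 256 for i in range(64)] for k in range(3)]

chunks = split_round_robin(message, len(carriers))
stego_images = [embed_lsb(c, chunk) for c, chunk in zip(carriers, chunks)]

recovered = join_round_robin(
    [extract_lsb(s, len(chunk)) for s, chunk in zip(stego_images, chunks)]
)
```

No pixel changes by more than 1, which is why the carriers look untouched to the eye, and any single image on its own only holds every third byte of the message.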

Fascinating stuff, isn’t it?

Stay tuned for my next post where I will look in detail at some simple examples of digital image steganography.



Image Colourisation – Converting B&W Photos to Colour

I’ve got another great academic publication to present to you – and this publication also comes with an online interactive website for you to use to your heart’s content. The paper is from the field of image colourisation.

Image colourisation (or ‘colorization’ for our US readers :P) is the act of taking a black and white photo and converting it to colour. Currently, this is a tedious, manual process, usually performed in Photoshop, that can typically take up to a month for a single black and white photo. But the results can be astounding. Just take a look at the following video illustrating this process to give you an idea of how laborious but amazing image colourisation can be:

Up to a month to do that for each image!? That’s a long time, right?

But then came along some researchers from the University of California in Berkeley who decided to throw some deep learning and computer vision at the task. Their work, published at the European Conference on Computer Vision in 2016, has produced a fully automatic image colourisation algorithm that creates vibrant and realistic colourisations in seconds. 

Their results truly are astounding. Here are some examples:



Not bad, hey? Remember, this is a fully automatic solution that is only given a black and white photo as input.

How about really old monochrome photographs? Here is one from 1936:


And here’s an old one of Marilyn Monroe:



Quite remarkable. For more example images, see the official project page (where you can also download the code).

How did the authors manage to get such good results? It’s obvious that deep learning (DL) was used as part of the solution. Why is it obvious? Because DL is ubiquitous nowadays – and considering the difficulty of the task, no other solution is going to come near. Indeed, the authors report that their results are significantly better than previous solutions.

What is less obvious is how they implemented their solution. One might choose to go down the standard route of designing a neural network that maps a black and white image directly to a colour image (see my previous post for an example of this). But this idea will not work here. The reason is that similar objects can have very different colours.

Let’s take apples as an example to explain this. Consider an image dataset that has four pictures of an apple: 2 pictures showing a yellow apple and 2 showing a red one. A standard neural network solution that just maps black and white apples to colour apples will calculate the average colour of apples in the dataset and colour the black and white photo this way. So, 2 yellow + 2 red apples will give you an average colour of orange. Hence, all apples will be coloured orange because this is the way the dataset is being interpreted. The authors report that going down this path will produce very desaturated (bland) results.

So, their idea was to instead calculate what the probability is of each pixel being a particular colour. In other words, each pixel in a black and white image has a list of percentages calculated that represent the probability of that particular pixel being each specific colour. That’s a long list of colour percentages for every pixel! The final colour of the pixel is then chosen from the top candidates on this list.

Going back to our apples example, the neural network would tell us that pixels belonging to the apple in the image would have a 50% probability of being yellow and 50% probability of being red (because our dataset consists of only red and yellow apples). We would then choose either of these two colours – orange would never make an appearance.
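The apples example can be sketched in a few lines. This is only a toy illustration of the reasoning above, assuming plain RGB triples and a four-image "dataset"; it is not the authors' network or colour representation, and the function names are hypothetical.

```python
from collections import Counter

YELLOW = (255, 255, 0)
RED = (255, 0, 0)

# The toy dataset from the text: two yellow apples and two red ones.
apple_colours = [YELLOW, YELLOW, RED, RED]


def average_colour(colours):
    """What a direct 'map to one colour' model tends towards: the mean."""
    n = len(colours)
    return tuple(sum(c[channel] for c in colours) // n for channel in range(3))


def colour_probabilities(colours):
    """The alternative: a probability for each candidate colour."""
    counts = Counter(colours)
    return {colour: count / len(colours) for colour, count in counts.items()}


def most_likely_colour(probabilities):
    """Pick a colour from the top of the list instead of blending."""
    return max(probabilities, key=probabilities.get)


averaged = average_colour(apple_colours)      # (255, 127, 0): an orange nobody photographed
probs = colour_probabilities(apple_colours)   # 50% yellow, 50% red
chosen = most_likely_colour(probs)            # always one of the real apple colours
```

The averaged colour is one that appears nowhere in the dataset, whereas choosing from the probability list can only ever return a colour that was actually seen.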

As is usually the case, ImageNet with its 1.3 million images (cf. this previous blog post that describes ImageNet) is used to train the neural network. Because of the large array of objects in ImageNet, the neural network can hence learn to colour many, many scenes in the amazing way that it does.

What is quite neat is that the authors have also set up a website where you can upload your own black and white photos to be converted by their algorithm into colour. Try it out yourself – especially if you have old photos that you have always wanted to colourise.

Ah, computer vision wins again. What a great area in which to be working and researching.



Image Completion from SIGGRAPH 2017

Oh, I love stumbling upon fascinating publications from the academic world! This post will present to you yet another one of those little gems that has recently fallen into my lap. It’s on the topic of image completion and comes from a paper published in SIGGRAPH 2017 entitled “Globally and Locally Consistent Image Completion” (project page can be found here).

(Note: SIGGRAPH, which stands for “Special Interest Group on Computer GRAPHics and Interactive Techniques”, is a world-renowned annual conference held for computer graphics researchers. But you do sometimes get papers from the world of computer vision being published there, as is the case with the one I’m presenting here.)

This post will be divided into the following sections:

  1. What is image completion and some of its prior weaknesses
  2. An outline of the solution proposed by the above mentioned SIGGRAPH publication
  3. A presentation of results

If anything, please scroll down to the results section and take a look at the video published by the authors of the paper. There’s some amazing stuff to be seen there!

1. What is image completion?

Image completion is a technique for filling in target regions with alternative content. A major use for image completion is in the task of object removal, where an object from a photo is erased and the remaining hole is automatically substituted with content that hopefully maintains the contextual integrity of the image.

Image completion has been around for a while. Perhaps the most famous algorithm in this area is called PatchMatch which is used by Photoshop in its Content Aware Fill feature. Take a look at this example image generated by PatchMatch after the flowers in the bottom right corner were removed from the left image:

An image completion example on a natural scene generated by PatchMatch

Not bad, hey? But the problem with existing solutions such as PatchMatch is that images can only be completed with textures that solely come from the input image. That is, calculations for what should be plugged into the hole are done using information obtained just from the input image. So, for images like the flower picture above, PatchMatch works great because it can work out that green leaves are the dominant texture and make do with that.
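To see what "textures that solely come from the input image" means in practice, here is a brute-force toy. It is not PatchMatch itself (whose search is far cleverer and faster); it simply assumes a small greyscale grid with a rectangular hole that does not touch the border, looks for the patch elsewhere in the same image whose one-pixel surround best matches the surround of the hole, and copies it in. The function name and the striped test texture are made up for illustration.

```python
def fill_hole_from_same_image(image, top, left, height, width):
    """Fill a rectangular hole with the best-matching patch from the same image."""
    rows, cols = len(image), len(image[0])

    def ring(t, l):
        # Coordinates of the one-pixel border around a height x width patch at (t, l).
        cells = []
        for y in range(t - 1, t + height + 1):
            for x in range(l - 1, l + width + 1):
                inside = t <= y < t + height and l <= x < l + width
                if not inside:
                    cells.append((y, x))
        return cells

    target_ring = [image[y][x] for y, x in ring(top, left)]
    best_cost, best_pos = None, None
    for t in range(1, rows - height):
        for l in range(1, cols - width):
            # Skip candidates whose patch-plus-ring would touch the hole.
            overlaps_rows = t - 1 < top + height and top < t + height + 1
            overlaps_cols = l - 1 < left + width and left < l + width + 1
            if overlaps_rows and overlaps_cols:
                continue
            cost = sum(
                (a - image[y][x]) ** 2 for a, (y, x) in zip(target_ring, ring(t, l))
            )
            if best_cost is None or cost < best_cost:
                best_cost, best_pos = cost, (t, l)
    if best_pos is None:
        raise ValueError("no candidate patch available")

    bt, bl = best_pos
    filled = [row[:] for row in image]
    for dy in range(height):
        for dx in range(width):
            filled[top + dy][left + dx] = image[bt + dy][bl + dx]
    return filled


# A repeating texture, a bit like the leaves in the flower photo.
original = [[(x % 4) * 10 + (y % 3) for x in range(14)] for y in range(12)]
damaged = [row[:] for row in original]
for y in range(4, 6):
    for x in range(5, 8):
        damaged[y][x] = -1  # the hole

restored = fill_hole_from_same_image(damaged, top=4, left=5, height=2, width=3)
```

On a repeating texture this works because the right answer already exists somewhere else in the picture. When nothing suitable exists elsewhere in the image, there is nothing useful to copy.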

But what about more complex images… and faces as well? You can’t work out what should go into a gap in an image of a face just from its input image. This is an image completion example done on a face by PatchMatch:

An image completion example on a face generated by PatchMatch

Yeah, not so good now, is it? You can see how trying to work out what should go into a gap from other areas of the input image is not going to work for a lot of cases like this.

2. Proposed solution

This is where the paper “Globally and Locally Consistent Image Completion” comes in. The idea behind it, in a nutshell, is to use a massive database of images of natural scenes to train a single deep learning network for image completion. The Places2 dataset is used for this, which contains over 8 million images of diverse natural scenes – a massive database from which the network basically learns the consistency inherent in natural scenes. This means that information to fill in missing gaps in images is obtained from these 8 million images rather than just one single image!

Once this deep neural network is trained for image completion, a GAN (Generative Adversarial Network) approach is utilised to further improve this network.

GAN is an unsupervised neural network training technique where two or more neural networks are used to improve each other in the training phase. One neural network tries to fool another and all neural networks are updated according to results obtained from this step. You can leave these neural networks running for a long time and watch them improving each other.

The GAN technique is very common in computer vision nowadays in scenarios where one needs to artificially produce images that appear realistic. 

Two additional networks are used in order to improve the image completion network: a global and a local context discriminator network. The former discriminator looks at the entire image to assess if it is coherent as a whole. The latter looks only at the small area centered at the completed region to ensure local consistency of the generated patch. In other words, you get two additional networks assisting in the training: one for global consistency and one for local consistency. These two auxiliary networks return a result stating whether the generated image is realistic-looking or artificial. The image completion network then tries to generate completed images to fool the auxiliary networks into thinking that they’re real.
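The difference between what the two discriminators get to see can be sketched as follows. This only shows how the two inputs might be cut out of a completed image; the networks themselves are left out, and the patch size, the clamping at the image border and the function names are my own assumptions rather than details taken from the paper.

```python
def local_window(image_height, image_width, hole, patch_size):
    """Return (top, left, bottom, right) of a square patch centred on the hole.

    hole is (top, left, height, width). The window is shifted, if needed,
    so that it stays inside the image (assumes patch_size fits in the image).
    """
    hole_top, hole_left, hole_height, hole_width = hole
    centre_y = hole_top + hole_height // 2
    centre_x = hole_left + hole_width // 2
    top = min(max(centre_y - patch_size // 2, 0), image_height - patch_size)
    left = min(max(centre_x - patch_size // 2, 0), image_width - patch_size)
    return top, left, top + patch_size, left + patch_size


def discriminator_inputs(image, hole, patch_size):
    """Split a completed image into the global view and the local view."""
    top, left, bottom, right = local_window(len(image), len(image[0]), hole, patch_size)
    global_view = image  # the whole picture: is it coherent overall?
    local_view = [row[left:right] for row in image[top:bottom]]  # just around the patch
    return global_view, local_view


# An 8 x 8 toy image with a 2 x 2 completed region at row 2, column 3.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
global_view, local_view = discriminator_inputs(image, hole=(2, 3, 2, 2), patch_size=4)
```

Each view would then be scored by its own discriminator as realistic-looking or artificial.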

In total, it took 2 months for the entire training stage to complete on a machine with four high-end GPUs. Crazy!

The following image shows the solution’s training architecture:

Overview of architecture for training for image completion (image taken from original publication)

Typically, to complete an image of 1024 x 1024 resolution that has one gap takes about 8 seconds on a machine with a single CPU or 0.5 seconds on one with a decent GPU. That’s not bad at all considering how good the generated results are – see the next section for this.

3. Results

The first thing you need to do is view the results video released by the authors of the publication. Visit their project page for this and scroll down a little. I can provide a shorter version of this video from YouTube here:

As for concrete examples, let’s take a look at some faces first. One of these faces is the same one as in the PatchMatch example above.

Examples of image completion on faces (image adapted from original publication)

How impressive is this?

My favourite examples are of object removal. Check this out:

Examples of image completion (image taken from original publication)

Look how the consistency of the image is maintained with the new patch in the image. It’s quite incredible!

My all-time favourite example is this one:

Another example of image completion (taken from original publication)

Absolutely amazing. More results can be viewed in supplementary material released by the authors of the paper. It’s well worth a look!


In this post I presented a paper on image completion from SIGGRAPH 2017 entitled “Globally and Locally Consistent Image Completion”. I first introduced the topic of image completion, which is a technique for filling in target regions with alternative content, and described some weaknesses of previous solutions – mainly that calculations for what should be generated for a target region are done using information obtained just from the input image. I then presented the more technical aspect of the proposed solution as presented in the paper. I showed that the image completion deep learning network learnt about global and local consistency of natural scenes from a database of over 8 million images. Then, a GAN approach was used to further train this network. In the final section of the post I showed some examples of image completion as generated by the presented solution.



The Baidu and ImageNet Controversy

Two months ago I wrote a post about some recent controversies in the computer vision industry. In this post I turn to the world of academia/research and write about something controversial that occurred there.

But since the world of research isn’t as aggressive as that of the industry, I had to go back three years to find anything worth presenting. However, this event really is interesting, despite its age, and people in research circles talk about it to this day.

The controversy in question pertains to the ImageNet challenge and the Baidu research group. Baidu is one of the largest AI and internet companies in the world. Based in Beijing, it has the 2nd largest search engine in the world and is hence commonly referred to as China’s Google. So, when it is involved in a controversy, you know it’s no small matter!

I will divide the post into the following sections:

  1. ImageNet and the Deep Learning Arms Race
  2. What Baidu did and ImageNet’s response
  3. Ren Wu’s (Ex-Baidu Researcher’s) later response (here is where things get really interesting!)

Let’s get into it.

ImageNet and the Deep Learning Arms Race

(Note: I wrote about what ImageNet is in my last post, so please read that post for a more detailed explanation.) 

ImageNet is the most famous image dataset by a country mile. Currently there are over 14 million images in ImageNet for nearly 22,000 synsets (WordNet has ~100,000 synsets). Over 1 million images also have hand-annotated bounding boxes around the dominant object in the image.

However, when the term “ImageNet” is used in CV literature, it usually refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) which is an annual competition for object detection and image classification organised by computer scientists at Stanford University, the University of North Carolina at Chapel Hill and the University of Michigan.

This competition is very famous. In fact, the deep learning revolution of the 2010s is widely considered to have originated from this challenge after a deep convolutional neural network blitzed the competition in 2012. Since then, deep learning has revolutionised our world and the industry has been forming research groups like crazy to push the boundary of artificial intelligence. Facebook, Amazon, Google, IBM, Microsoft – all the major players in IT are now in the research game, which is phenomenal to think about for people like me who remember the days of the 2000s when research was laughed at by people in the industry.

With such large names in the deep learning world, a certain “computing arms race” has ensued. Big bucks are being pumped into these research groups to obtain (and trumpet far and wide) results better than other rivals. Who can prove to be the master of the AI world? Who is the smartest company going around? Well, competitions such as ImageNet are a perfect benchmark for questions like this, which makes the ImageNet scandal quite significant.

Baidu and ImageNet

To have your object classification algorithm scored on the ImageNet Challenge, you first get it trained on 1.5 million images from the ImageNet dataset. Then, you submit your code to the ImageNet server where this code is tested against a collection of 100,000 images that are not known to anybody. What is key, though, is that to avoid people fine-tuning the parameters in their algorithms to this specific testing set of 100,000 images, ImageNet only allows 2 evaluations/submissions on the test set per week (otherwise you could keep resubmitting until you’ve hit that “sweet spot” specific to this test set).
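Why does a cap on submissions matter so much? Here is a rough simulation of the “sweet spot” effect. It assumes every submission comes from an equally good model, so the only thing that differs between submissions is luck on the hidden test set. The accuracy, the test set size (kept much smaller than 100,000 so that it runs quickly) and the function names are all invented for illustration.

```python
import random


def measured_accuracy(true_accuracy, n_test_images, rng):
    """Score one submission: each test image is classified correctly with the same odds."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_test_images))
    return correct / n_test_images


def best_score_after(n_submissions, true_accuracy=0.95, n_test_images=2000, seed=0):
    """Best score a team could report if it keeps only its luckiest submission."""
    rng = random.Random(seed)
    scores = [
        measured_accuracy(true_accuracy, n_test_images, rng)
        for _ in range(n_submissions)
    ]
    return max(scores)


few = best_score_after(2)    # roughly one week's allowance
many = best_score_after(50)  # far more attempts at the same test set
```

Because the same seed is used, the first two submissions are identical in both runs, so the best of 50 can never be lower than the best of 2. Reporting only the luckiest of many attempts can therefore only flatter a model that, by construction, has not improved at all.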

Before the deep learning revolution, a good ILSVRC classification error rate was 25% (that’s 1 out of 4 images being classified incorrectly). Since 2014, error rates have dropped to below 5%!

In 2015, Baidu announced that with its new supercomputer called Minwa it had obtained a record low error rate of 4.58%, which was an improvement on Google’s error rate of 4.82% as well as Microsoft’s of 4.9%. Massive news in the computing arms race, even though the error rate differences appear to be minimal (and some would argue, therefore, that they’re insignificant – but that’s another story).

However, a few days after this declaration, an initial announcement was made by ImageNet:

It was recently brought to our attention that one group has circumvented our policy of allowing only 2 evaluations on the test set per week.

Three weeks later, a follow-up announcement was made stating that the perpetrator of this act was Baidu. ImageNet had conducted an analysis and found that 30 accounts connected to Baidu had been used in the period of November 28th, 2014 to May 13th, 2015 to make on average four times the permitted number of submissions.

As a result, ImageNet disqualified Baidu from that year’s competition and banned them from re-entering for a further 12 months.

Ren Wu, a distinguished AI scientist and head of the research group at the time, apologised for this mistake. A week later he was dismissed from the company. But that’s not the end of the saga.

Ren Wu’s Response

Here is where things get really interesting. 

A few days after being fired from Baidu, Ren Wu sent an email to Enterprise Technology in which he denied any wrongdoing:

We didn’t break any rules, and the allegation of cheating is completely baseless

Whoa! Talk about opening a can of worms!

Ren stated that there is “no official rule specify [sic] how many times one can submit results to ImageNet servers for evaluation” and that this regulation only appears once a submission is made from one account. From this he came to understand that 2 submissions per week can be made from each account/individual rather than a whole team. Since Baidu had 5 authors working on the project, he argues that he was allowed to make 10 submissions per week.

I’m not convinced, though, because he still used 30 accounts (purportedly owned by junior students assisting in the research) to make these submissions. Moreover, he still admits that on two occasions the 10 submission threshold was breached – so, he definitely did break the rules.

Things get even more interesting, however, when he states that he officially apologised just for those two occasions as requested by his management:

A mistake in our part, and it was the reason I made a public apology, requested by my management. Of course, this was my biggest mistake. And things have been gone crazy since. [emphasis mine]

Whoa! Another can of worms. He apologised as a result of a request by his management and he states that this was a mistake. It looks like he’s accusing Baidu of using him as a scapegoat in this whole affair. Two months later he confirms this to the EE Times, by stating that

I think I was set up

Well, if that isn’t big news, I don’t know what is! I personally am not convinced by Ren’s arguments. But it at least shows that the academic/research world can be exciting at times, too 🙂

