
Artificial Intelligence is Slowing Down – Part 2

(Update: part 3 of this post was posted recently here.)

In July of last year I wrote an opinion piece entitled “Artificial Intelligence is Slowing Down” in which I shared my judgement that as AI and Deep Learning (DL) currently stand, their growth is slowly becoming unsustainable. The main reason for this is that training costs are starting to go through the roof the more DL models are scaled up in size to accommodate more and more complex tasks. (See my original post for a discussion on this).

In this post, part 2 of “AI Slowing Down”, I wanted to present findings from an article written a few months after mine for IEEE Spectrum. The article, entitled “Deep Learning’s Diminishing Returns – The cost of improvement is becoming unsustainable”, came to the same conclusions as I did (and more) regarding AI but it presented much harder facts to back its claims.

I would like to share some of these claims on my blog because they’re very good and backed up by solid empirical data.


The first thing that should be noted is that the claims presented by the authors are based on an analysis of 1,058 research papers (plus additional benchmark sources). That’s a decent dataset from which significant conclusions can be gathered (assuming the analyses were done correctly, of course, but given that the four authors are researchers of repute, I think it is safe to assume the veracity of their findings).

One thing the authors found was that the computational cost of a DL model grows with (at least) the fourth power of its improvement in performance (i.e. to improve performance by a factor of k, the computational cost scales roughly as k^4). I stated in my post that the larger the model the more complex tasks it can perform, but also the more training time is required. We now have a number to estimate just how much computation is required per improvement in performance. A fourth-power relationship is staggering.
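To make that fourth-power relationship concrete, here is a tiny back-of-the-envelope calculation (my own illustration of the scaling the article reports, not anything taken from it):

```python
# Rough arithmetic for the k^4 scaling: improving performance by a factor of k
# is reported to cost roughly k^4 times the computation.
for k in (2, 3, 10):
    print(f"{k}x better performance -> roughly {k**4:,}x the computation")

# 2x better  -> ~16x the compute
# 10x better -> ~10,000x the compute, which is why training costs blow out so quickly
```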

Another thing I liked about the analysis performed was that it took into consideration the environmental impact of growing and training more complex DL models.

The following graph speaks volumes. It shows the error rate (y-axis and dots on the graph) on the famous ImageNet dataset/challenge (I’ve written about it here) decreasing over the years once DL entered the scene in 2012 and smashed previous records. The line shows the corresponding carbon-dioxide emissions accompanying training processes for these larger and larger models. A projection is then shown (dashed line) of where carbon emissions will be in the years to come assuming AI grows at its current rate (and no new steps are taken to alleviate this issue – more on this later).

As DL models get better (y-axis), the computations required to train them (bottom x-axis) increase, and so do the carbon emissions (top x-axis).

Just look at the comments in red in the graph. Very interesting.

And the costs of these future models? To achieve an error rate of 5%, the authors extrapolated a cost of US$100 billion. That’s just ridiculous and definitely untenable.

We won’t, of course, get to a 5% error rate the way we are going (nobody has this much money) so scientists will find other ways to get there or DL results will start to plateau:

We must either adapt how we do deep learning or face a future of much slower progress

At the end of the article, then, the authors provide some insight into what is happening in this respect as science begins to realise its limitations and looks for solutions. Meta-learning is one such solution that is presented and discussed (meta-learning is the training of models that are designed for broader tasks, which are then used for a multitude of more specific cases. In this scenario, only one round of training needs to take place to cover multiple tasks).

However, the research so far indicates that the gains from these innovations are minimal. We need a much bigger breakthrough for significant results to appear.

And like I said in my previous article, big breakthroughs like this don’t come willy-nilly. It’s highly likely that one will come along but when that will be is anybody’s guess. It could be next year, it could be at the end of the decade, or it could be at the end of the century.

We really could be reaching the max speed of AI – which obviously would be a shame.

Note: the authors of the aforementioned article have published a scientific paper as an arXiv preprint (available here) that digs into all these issues in even more detail. 



The Largest Cat Video Dataset in the World

This is a bit of a fun post about a “dataset” I stumbled upon a few days ago… a dataset of cat videos.

As I’ve mentioned numerous times in various posts of mine, the deep learning revolution that is driving the recent advancements in AI around the world needs data. Lots and lots of data. For example, the famous image classification models that are able to tell you, with better precision than humans, what objects are in an image, are trained on datasets containing sometimes millions of images (e.g. ImageNet). Large datasets have basically become essential fuel for the AI boom of recent years.

So, I had a really good laugh when a few days ago I found this innocuous and unexposed YouTube channel owned by a Japanese man who has been posting a few videos every day of him feeding stray cats. Since he has been doing this for the past 9 years, he has managed to accumulate over 19,000 cat videos on his channel. And in doing so he has most probably and inadvertently created the largest cat video dataset in the world. My goodness!

Technically speaking, unless you’re a die-hard cat lover (like me!), these videos aren’t all that interesting. They’re simply of stray cats having a decent feed or drink with their good-hearted caretaker on occasion uttering a few sentences here and there. Here’s one, for example, of two cats eating out of a bowl:

Or here’s one of two cats enjoying a good ol’ scratch behind the ears:

On average these videos are about 30-60 seconds in length. And they’re all titled with the default names given by his camera (e.g. MVI 3985, etc.). Hence, nothing about these clips is designed for them to be found by anybody out there.

However, despite all this mundanity, to the computer vision community (which needs datasets to survive like humans need oxygen), these videos could come in handy… one day. I’m not sure how just yet, but somebody out there could surely find a use for them. I mean, there are over 19,000 cat videos just sitting there. This is just too good to pass up.

So, if there are any academics out there: please, please, please use this “dataset” in your publishable studies. It would make my year, for sure! The cats would be proud, too.

Oh, and one more thing. I found this guy’s Twitter account (@niiyan1216). And, you guessed it: it is full of pictures of cats.



Smart Glasses for the Blind – Why has it Taken This Long?

Remember Google Glass? Those smart glasses that were released by Google to the public in May of 2014 (see image below). Less than a year later production was halted because, well, not many people wanted to walk around with a goofy-looking pair of specs on their noses. They really did look wacky. I’m not surprised the gadget never caught on.

Google Glass in action (image source)

Well, despite the (predictable?) flop of Google Glass, it turns out that there is a fantastic use case for such smart glasses: assisting people with visual impairments.


There is a company out there called Aira that provides an AI-guided service used in conjunction with smart glasses and an app on a smartphone. Images are captured by the glasses’ forward-facing camera, image and text recognition are performed on them, and an AI assistant, dubbed “Chloe”, describes aloud what is present in the footage: whether it be everyday objects such as products on a shelf in your pantry, words on medication bottles or even words in a book.

Quite amazing, isn’t it? 

Simple tasks like object and text recognition are performed locally on the smartphone. However, more complex tasks can be sent to Aira’s cloud services (powered by Amazon’s AWS).
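Just to illustrate the split between on-device and cloud processing described above, here is a minimal sketch of the pattern. This is not Aira’s actual API; every function name below is a hypothetical placeholder.

```python
# A sketch of the "easy tasks on the phone, hard tasks in the cloud" pattern.
# None of these functions correspond to real Aira or AWS APIs.

def recognise_locally(frame):
    """Cheap on-device model: return a description, or None if not confident."""
    return None  # pretend the on-device model couldn't handle this frame

def recognise_in_cloud(frame):
    """Stand-in for a call to a heavier cloud-hosted vision service."""
    return "medication bottle: 'take two tablets daily'"

def describe(frame):
    description = recognise_locally(frame)
    if description is None:
        description = recognise_in_cloud(frame)  # fall back to the cloud model
    return description  # this text would then be read aloud to the user

print(describe(frame=object()))  # toy call with a dummy frame
```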

Furthermore, the user has the option to, at the tap of a button on the glasses or app, connect to a live agent who is then able to access a live video stream from the smart glasses and other data from the smartphone like GPS location. With these the live agent is able to provide real-time assistance by speaking directly to the visually impaired person. A fantastic idea.

According to NVIDIA, Aira trains its object recognition deep learning networks not on image datasets like ImageNet but on 3 million minutes’ worth of footage captured by its users and annotated by Aira’s agents. An interesting approach, considering how time-consuming such annotation must have been. But it has given the service an edge, as training on real-world scenarios has reportedly produced better results.

The uses for Aira’s product and service are pretty much endless! As suggested on their site, you can use Aira for things like: reading to a child, locating a stadium seat, reading a whiteboard, navigating premises, sorting and reading mail and the paper, enjoying the park or the zoo, roaming historical sites. Gosh, the list goes on and on!

And thankfully, the glasses don’t look goofy at all! That’s definitely a win right there.

Aira’s nicely-designed smart glasses (image source)

Finally, I would encourage you to take a look at this official video demonstrating the uses of Aira. This is computer vision serving society in the right way.

(Unfortunately, the video that was once here has been taken down by the publisher)




My Top 5 Posts So Far

It’s been nearly 18 months since I started this blog. I did it to share my journey in computer vision with you. I love this field and I’m always stumbling across such fascinating things that I feel as though more people should know about them.

I’ve seen this blog grow in popularity – much, much more than I had anticipated when I first started it. In this little “bonus” post, I thought I’d list my top posts thus far with additional comments about them.

I also thought I’d compile a second list with my personal favourite posts. These have not been as popular but I sure as hell had fun writing them!

Enjoy! And thanks for the support over the last 18 months.

My top 5 posts thus far:

  1. Why Deep Learning Has Not Superseded Traditional Computer Vision – I wrote this post on a Friday evening directly after work with a beer bottle in one hand and people playing pool or foosball around me. I wrote it up in an hour or so and didn’t think much of it, to be honest. The next day I woke up and saw, to my extreme surprise, that it had gone slightly viral! It was featured in Deep Learning Weekly (Issue #76), was being reposted by people such as Dr Adrian Rosebrock from PyImageSearch, and was getting about 1000 hits/day. Not bad, hey!?
  2. The Top Image Datasets and Their Challenges
  3. Finding a Good Thesis Topic in Computer Vision – I wrote this post after constantly seeing people asking this question on forums. Considering it’s consistently in my top 3 posts every week, I guess people are still searching for inspiration.
  4. Mapping Camera Coordinates to a 2D Floor Plan – This post came about after I had to work on security footage from a bank for a project at work. The boss was very pleased with what I had done and writing about my experiences in a post was a no-brainer after that.
  5. The Early History of Computer Vision – History is something that really interests me so it was only a matter of time before I was going to read up on the history of computer vision. Once I did and saw how fascinating it was, I just had to write a post about it.

My favourite posts thus far:

Like I said, these are not popular (some barely get a single hit in a week) but I really enjoyed researching and writing them.

  1. Soccer on Your Tabletop – The coolest thing going around in computer vision.
  2. Amazon Go – Computer Vision at the Forefront of Innovation – This to me is something amazing.
  3. The Baidu and ImageNet Controversy – Nothing like a good controversy!
  4. Computer Vision on Mars – Computer vision in space. Imagine working on that project!
  5. The Reasons Behind the Recent Growth of Computer Vision – I’m proud of how far computer vision has come over the years. It’s been a pleasure to be a part of the adventure.

Enjoy looking back over my posts. Thanks once again for your support over the last 18 months.



Image Colourisation – Converting B&W Photos to Colour

I’ve got another great academic publication to present to you – and this one also comes with an online interactive website for you to use to your heart’s content. The paper is from the field of image colourisation.

Image colourisation (or ‘colorization’ for our US readers :P) is the act of taking a black and white photo and converting it to colour. Traditionally, this is a tedious, manual process, usually performed in Photoshop, that can take up to a month for a single black and white photo. But the results can be astounding. Just take a look at the following video illustrating this process to give you an idea of how laborious but amazing image colourisation can be:

Up to a month to do that for each image!? That’s a long time, right?



But then along came some researchers from the University of California, Berkeley, who decided to throw some deep learning and computer vision at the task. Their work, published at the European Conference on Computer Vision in 2016, produced a fully automatic image colourisation algorithm that creates vibrant and realistic colourisations in seconds.

Their results truly are astounding. Here are some examples:

(Example colourisation results – see the project page for more)

Not bad, hey? Remember, this is a fully automatic solution that is only given a black and white photo as input.

How about really old monochrome photographs? Here is one from 1936:

(A colourised photograph from 1936)

And here’s an old one of Marilyn Monroe:

(A colourised photograph of Marilyn Monroe)

Quite remarkable. For more example images, see the official project page (where you can also download the code).

How did the authors manage to get such good results? It’s obvious that deep learning (DL) was used as part of the solution. Why is it obvious? Because DL is ubiquitous nowadays – and considering the difficulty of the task, no other solution is going to come near. Indeed, the authors report that their results are significantly better than previous solutions.

What is less obvious is how they implemented their solution. One might choose to go down the standard route of designing a neural network that maps a black and white image directly to a colour image (see my previous post for an example of this). But this idea will not work here. The reason is that similar objects can have very different colours.

Let’s take apples as an example to explain this. Consider an image dataset that has four pictures of an apple: 2 pictures showing a yellow apple and 2 showing a red one. A standard neural network solution that just maps black and white apples to colour apples will, in effect, learn the average colour of apples in the dataset and colour the black and white photo that way. So, 2 yellow + 2 red apples will give you an average colour of orange. Hence, all apples will be coloured orange, because this is how the dataset is being interpreted. The authors report that going down this path produces very desaturated (bland) results.

So, their idea was to instead calculate what the probability is of each pixel being a particular colour. In other words, each pixel in a black and white image has a list of percentages calculated that represent the probability of that particular pixel being each specific colour. That’s a long list of colour percentages for every pixel! The final colour of the pixel is then chosen from the top candidates on this list.

Going back to our apples example, the neural network would tell us that pixels belonging to the apple in the image would have a 50% probability of being yellow and 50% probability of being red (because our dataset consists of only red and yellow apples). We would then choose either of these two colours – orange would never make an appearance.
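Here is a toy numerical sketch of the difference between the two approaches, using the apples example. The numbers are made up purely for illustration, and the paper itself actually works with binned colours in the Lab colour space rather than raw RGB:

```python
import numpy as np

# Two yellow-ish and two red-ish training "apples", as RGB colours.
apple_colours = np.array([
    [255, 210, 0], [250, 220, 10],   # yellow
    [200, 20, 20], [210, 10, 30],    # red
], dtype=float)

# Naive regression: predict the average colour -> a muddy orange.
print("regression output:", apple_colours.mean(axis=0).round())

# Classification over colour bins: keep a probability per candidate colour
# and pick from the top candidates instead of averaging them away.
colour_bins = {"yellow": [252, 215, 5], "red": [205, 15, 25]}
probabilities = {"yellow": 0.5, "red": 0.5}   # toy per-pixel distribution
best = max(probabilities, key=probabilities.get)
print("classification output:", colour_bins[best], f"(p={probabilities[best]})")
```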

As is usually the case, ImageNet with its 1.3 million images (cf. this previous blog post that describes ImageNet) is used to train the neural network. Because of the large array of objects in ImageNet, the neural network can hence learn to colour many, many scenes in the amazing way that it does.

What is quite neat is that the authors have also set up a website where you can upload your own black and white photos to be converted by their algorithm into colour. Try it out yourself – especially if you have old photos that you have always wanted to colourise.

Ah, computer vision wins again. What a great area in which to be working and researching.



Image Completion from SIGGRAPH 2017

Oh, I love stumbling upon fascinating publications from the academic world! This post will present to you yet another one of those little gems that has recently fallen into my lap. It’s on the topic of image completion and comes from a paper published in SIGGRAPH 2017 entitled “Globally and Locally Consistent Image Completion” (project page can be found here).

(Note: SIGGRAPH, which stands for “Special Interest Group on Computer GRAPHics and Interactive Techniques”, is a world-renowned annual conference for computer graphics researchers. But you do sometimes get papers from the world of computer vision published there, as is the case with the one I’m presenting here.)

This post will be divided into the following sections:

  1. What is image completion and some of its prior weaknesses
  2. An outline of the solution proposed by the above mentioned SIGGRAPH publication
  3. A presentation of results

If anything, please scroll down to the results section and take a look at the video published by the authors of the paper. There’s some amazing stuff to be seen there!

1. What is image completion?

Image completion is a technique for filling-in target regions with alternative content. A major use for image completion is in the task of object removal where an object from a photo is erased and the remaining hole is automatically substituted with content that hopefully maintains the contextual integrity of the image.

Image completion has been around for a while. Perhaps the most famous algorithm in this area is called PatchMatch which is used by Photoshop in its Content Aware Fill feature. Take a look at this example image generated by PatchMatch after the flowers in the bottom right corner were removed from the left image:

An image completion example on a natural scene generated by PatchMatch

Not bad, hey? But the problem with existing solutions such as PatchMatch is that images can only be completed with textures that come from the input image itself. That is, calculations for what should be plugged into the hole are done using information obtained just from the input image. So, for images like the flower picture above, PatchMatch works great because it can work out that green leaves are the dominant texture and make do with that.

But what about more complex images… and faces as well? You can’t work out what should go into a gap in an image of a face just from its input image. This is an image completion example done on a face by PatchMatch:

An image completion example on a face generated by PatchMatch

Yeah, not so good now, is it? You can see how trying to work out what should go into a gap from other areas of the input image is not going to work for a lot of cases like this.

2. Proposed solution

This is where the paper “Globally and Locally Consistent Image Completion” comes in. The idea behind it, in a nutshell, is to use a massive database of images of natural scenes to train a single deep learning network for image completion. The Places2 dataset is used for this, which contains over 8 million images of diverse natural scenes – a massive database from which the network basically learns the consistency inherent in natural scenes. This means that information to fill in missing gaps in images is obtained from these 8 million images rather than just one single image!

Once this deep neural network is trained for image completion, a GAN (Generative Adversarial Network) approach is utilised to further improve this network.

A GAN is an unsupervised neural network training technique in which two (or more) neural networks are pitted against each other so that they mutually improve during the training phase. A generator network tries to fool a discriminator network with artificially produced images, the discriminator learns to tell real images from generated ones, and both are updated in alternation based on the results of this contest. You can leave these networks running for a long time and watch them improving each other.

The GAN technique is very common in computer vision nowadays in scenarios where one needs to artificially produce images that appear realistic. 

Two additional networks are used in order to improve the image completion network: a global and a local context discriminator network. The former looks at the entire image to assess whether it is coherent as a whole. The latter looks only at a small area centred on the completed region to ensure local consistency of the generated patch. In other words, you get two additional networks assisting in the training: one for global consistency and one for local consistency.

These two auxiliary networks return a result stating whether the generated image is realistic-looking or artificial. The image completion network then tries to generate completed images that fool the auxiliary networks into thinking they’re real.
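To make that training setup a little more concrete, here is a heavily simplified PyTorch-style sketch of how the three networks and their losses fit together. The tiny network definitions, the crop function and the loss weighting are stand-ins of my own, not the architecture or hyperparameters from the paper:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the three networks (the real ones are much deeper).
completion_net = nn.Sequential(
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),    # input: masked RGB image + mask channel
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid()  # output: completed RGB image
)

def make_discriminator():
    return nn.Sequential(nn.Conv2d(3, 8, 4, stride=2, padding=1), nn.ReLU(),
                         nn.Flatten(), nn.LazyLinear(1))

global_disc = make_discriminator()  # judges the whole image
local_disc = make_discriminator()   # judges a crop around the filled-in region

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def completion_losses(real, masked, mask, crop_around_hole):
    """Losses for one step of training the completion network (no optimiser wiring)."""
    completed = completion_net(torch.cat([masked, mask], dim=1))
    # Reconstruction term: the filled region should match the original pixels.
    reconstruction = mse(completed * mask, real * mask)
    # Adversarial terms: try to make both discriminators score the fake as real.
    real_labels = torch.ones(real.size(0), 1)
    adv_global = bce(global_disc(completed), real_labels)
    adv_local = bce(local_disc(crop_around_hole(completed)), real_labels)
    return reconstruction + 0.001 * (adv_global + adv_local)  # weighting is illustrative
```

In the actual training procedure the two discriminators are themselves trained in alternation with the completion network, each seeing both real and completed images.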

In total, it took 2 months for the entire training stage to complete on a machine with four high-end GPUs. Crazy!

The following image shows the solution’s training architecture:

Overview of architecture for training for image completion (image taken from original publication)

Typically, to complete an image of 1024 x 1024 resolution that has one gap takes about 8 seconds on a machine with a single CPU or 0.5 seconds on one with a decent GPU. That’s not bad at all considering how good the generated results are – see the next section for this.

3. Results

The first thing you need to do is view the results video released by the authors of the publication. Visit their project page for this and scroll down a little. I can provide a shorter version of this video from YouTube here:

As for concrete examples, let’s take a look at some faces first. One of these faces is the same from the PatchMatch example above.

Examples of image completion on faces (image adapted from original publication)

How impressive is this?

My favourite examples are of object removal. Check this out:

Examples of image completion (image taken from original publication)

Look how the consistency of the image is maintained with the new patch. It’s quite incredible!

My all-time favourite example is this one:

Another example of image completion (taken from original publication)

Absolutely amazing. More results can be viewed in the supplementary material released by the authors of the paper. It’s well worth a look!

Summary

In this post I presented a paper on image completion from SIGGRAPH 2017 entitled “Globally and Locally Consistent Image Completion”. I first introduced the topic of image completion, which is a technique for filling-in target regions with alternative content, and described some weaknesses of previous solutions – mainly that calculations for what should be generated for a target region are done using information obtained just from the input image. I then presented the more technical aspect of the proposed solution as presented in the paper. I showed that the image completion deep learning network learnt about global and local consistency of natural scenes from a database of over 8 million images. Then, a GAN approach was used to further train this network. In the final section of the post I showed some examples of image completion as generated by the presented solution.



The Baidu and ImageNet Controversy

Two months ago I wrote a post about some recent controversies in the computer vision industry. In this post I turn to the world of academia/research and write about something controversial that occurred there.

But since the world of research isn’t as aggressive as that of the industry, I had to go back three years to find anything worth presenting. However, this event really is interesting, despite its age, and people in research circles talk about it to this day.

The controversy in question pertains to the ImageNet challenge and the Baidu research group. Baidu is one of the largest AI and internet companies in the world. Based in Beijing, it has the 2nd largest search engine in the world and is hence commonly referred to as China’s Google. So, when it is involved in a controversy, you know it’s no small matter!

I will divide the post into the following sections:

  1. ImageNet and the Deep Learning Arms Race
  2. What Baidu did and ImageNet’s response
  3. Ren Wu’s (Ex-Baidu Researcher’s) later response (here is where things get really interesting!)

Let’s get into it.

ImageNet and the Deep Learning Arms Race

(Note: I wrote about what ImageNet is in my last post, so please read that post for a more detailed explanation.) 

ImageNet is the most famous image dataset by a country mile. Currently there are over 14 million images in ImageNet for nearly 22,000 synsets (WordNet has ~100,000 synsets). Over 1 million images also have hand-annotated bounding boxes around the dominant object in the image.

However, when the term “ImageNet” is used in CV literature, it usually refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) which is an annual competition for object detection and image classification organised by computer scientists at Stanford University, the University of North Carolina at Chapel Hill and the University of Michigan.

This competition is very famous. In fact, the deep learning revolution of the 2010s is widely considered to have originated from this challenge, after a deep convolutional neural network blitzed the competition in 2012. Since then, deep learning has revolutionised our world and the industry has been forming research groups like crazy to push the boundary of artificial intelligence. Facebook, Amazon, Google, IBM, Microsoft – all the major players in IT are now in the research game, which is phenomenal to think about for people like me who remember the days of the 2000s when research was laughed at by people in the industry.

With such large names in the deep learning world, a certain “computing arms race” has ensued. Big bucks are being pumped into these research groups to obtain (and trumpet far and wide) results better than other rivals. Who can prove to be the master of the AI world? Who is the smartest company going around? Well, competitions such as ImageNet are a perfect benchmark for questions like this, which makes the ImageNet scandal quite significant.

Baidu and ImageNet

To have your object classification algorithm scored on the ImageNet Challenge, you first train it on about 1.2 million images from the ImageNet dataset. Then, you submit your code to the ImageNet server where it is tested against a collection of 100,000 images that are not known to anybody. What is key, though, is that to avoid people fine-tuning the parameters in their algorithms to this specific testing set of 100,000 images, ImageNet only allows 2 evaluations/submissions on the test set per week (otherwise you could keep resubmitting until you’ve hit that “sweet spot” specific to this test set).

Before the deep learning revolution, a good ILSVRC classification error rate was 25% (that’s 1 out of 4 images being classified incorrectly). After 2014, error rates dropped to below 5%!

In 2015, Baidu announced that with its new supercomputer called Minwa it had obtained a record low error rate of 4.58%, which was an improvement on Google’s error rate of 4.82% as well as Microsoft’s of 4.9%. Massive news in the computing arms race, even though the error rate differences appear to be minimal (and some would argue, therefore, that they’re insignificant – but that’s another story).

However, a few days after this declaration, an initial announcement was made by ImageNet:

It was recently brought to our attention that one group has circumvented our policy of allowing only 2 evaluations on the test set per week.

Three weeks later, a follow-up announcement was made stating that the perpetrator of this act was Baidu. ImageNet had conducted an analysis and found that 30 accounts connected to Baidu had been used in the period of November 28th, 2014 to May 13th, 2015 to make on average four times the permitted number of submissions.

As a result, ImageNet disqualified Baidu from that year’s competition and banned them from re-entering for a further 12 months.

Ren Wu, a distinguished AI scientist and head of the research group at the time, apologised for this mistake. A week later he was dismissed from the company. But that’s not the end of the saga.

Ren Wu’s Response

Here is where things get really interesting. 

A few days after being fired from Baidu, Ren Wu sent an email to Enterprise Technology in which he denied any wrongdoing:

We didn’t break any rules, and the allegation of cheating is completely baseless

Whoa! Talk about opening a can of worms!

Ren stated that there is “no official rule specify [sic] how many times one can submit results to ImageNet servers for evaluation” and that this regulation only appears once a submission is made from one account. From this he came to understand that 2 submissions per week could be made per account/individual rather than per team. Since Baidu had 5 authors working on the project, he argues that he was allowed to make 10 submissions per week.

I’m not convinced though, because he still used 30 accounts (purportedly owned by junior students assisting in the research) to make these submissions. Moreover, he admits that on two occasions the 10-submission threshold was breached – so, he definitely did break the rules.

Things get even more interesting, however, when he states that he officially apologised just for those two occasions as requested by his management:

A mistake in our part, and it was the reason I made a public apology, requested by my management. Of course, this was my biggest mistake. And things have been gone crazy since. [emphasis mine]

Whoa! Another can of worms. He apologised as a result of a request by his management and he states that this was a mistake. It looks like he’s accusing Baidu of using him as a scapegoat in this whole affair. Two months later he confirmed this to the EE Times, stating that

I think I was set up

Well, if that isn’t big news, I don’t know what is! I personally am not convinced by Ren’s arguments. But it at least shows that the academic/research world can be exciting at times, too 🙂



The Top Image Datasets and Their Challenges

In previous posts of mine I have discussed how image datasets have become crucial in the deep learning (DL) boom of computer vision of the past few years. In deep learning, neural networks are told to (more or less) autonomously discover the underlying patterns in classes of images (e.g. that bicycles are composed of two wheels, a handlebar, and a seat). Since images are visual representations of our reality, they contain the inherent complex intricacies of our world. Hence, to train good DL models that are capable of extracting the underlying patterns in classes of images, deep learning needs lots of data, i.e. big data. And it’s crucial that this big data that feeds the deep learning machine be of top quality.

In light of Google’s recent announcement of an update to its image dataset as well as its new challenge, in this post I would like to present to you the top 3 image datasets that are currently being used by the computer vision community, as well as their associated challenges:

  1. ImageNet and ILSVRC
  2. Open Images and the Open Images Challenge
  3. COCO Dataset and the four COCO challenges of 2018

I wish to talk about the challenges associated with these datasets because challenges are a great way for researchers to compete against each other and in the process to push the boundary of computer vision further each year!

ImageNet

This is the most famous image dataset by a country mile. But confusion often accompanies what ImageNet actually is because the name is frequently used to describe two things: the ImageNet project itself and its visual recognition challenge.

The former is a project whose aim is to label and categorise images according to the WordNet hierarchy. WordNet is an open-source database of words that are organised hierarchically into sets of synonyms. For example, words like “dog” and “cat” can be found in the following knowledge structure:

An example of a WordNet synset graph (image taken from here)

Each node in the hierarchy is called a “synonym set” or “synset”. This is a great way to categorise words because whatever noun you may have, you can easily extract its context (e.g. that a dog is a carnivore) – something very useful for artificial intelligence.
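If you want to poke around this hierarchy yourself, the WordNet database is directly accessible from Python via NLTK. A minimal sketch (assuming nltk is installed and the WordNet corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn  # requires: pip install nltk, then nltk.download('wordnet')

dog = wn.synsets('dog')[0]      # the first "dog" synset (the canine sense)
print(dog.definition())

# Walk up the hypernym chain to see the context: dog -> canine -> carnivore -> ...
node = dog
while node.hypernyms():
    node = node.hypernyms()[0]
    print(node.name())
```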

The idea with the ImageNet project, then, is to have 1000+ images for each and every synset in order to also have a visual hierarchy to accompany WordNet. Currently there are over 14 million images in ImageNet for nearly 22,000 synsets (WordNet has ~100,000 synsets). Over 1 million images also have hand-annotated bounding boxes around the dominant object in the image.

Example image of a kit fox from ImageNet showing hand-annotated bounding boxes

You can explore the ImageNet and WordNet dataset interactively here. I highly recommend you do this!

Note: by default only URLs to images in ImageNet are provided because ImageNet does not own the copyright to them. However, a download link can be obtained to the entire dataset if certain terms and conditions are accepted (e.g. that the images will be used for non-commercial research). 

Having said this, when the term “ImageNet” is used in CV literature, it usually refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which is an annual competition for object detection and image classification. This competition is very famous. In fact, the DL revolution of the 2010s is widely considered to have originated from this challenge, after a deep convolutional neural network blitzed the competition in 2012.

The motivation behind ILSVRC, as the website says, is:

… to allow researchers to compare progress in detection across a wider variety of objects — taking advantage of the quite expensive labeling effort. Another motivation is to measure the progress of computer vision for large scale image indexing for retrieval and annotation.

The ILSVRC competition has its own image dataset that is actually a subset of the ImageNet dataset. This meticulously hand-annotated dataset has 1,000 object categories (the full list of these synsets can be found here) spread over ~1.2 million images. Half of these images also have bounding boxes around the class category object.

The ILSVRC dataset is most frequently used to train object classification neural network architectures such as VGG16, InceptionV3, ResNet, etc., whose pre-trained weights are publicly available for use. If you ever download one of these pre-trained models (e.g. InceptionV3) and it says that it can detect 1000 different classes of objects, then it most certainly was trained on this dataset.
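As a quick illustration of how these pre-trained models are typically used, here is a minimal sketch with torchvision’s ResNet-50 (any of the architectures above would do the same job; the image path is hypothetical):

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(pretrained=True)  # weights learned on the ILSVRC subset of ImageNet
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("my_photo.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))
print(logits.argmax(dim=1).item())  # index into the 1,000 ILSVRC classes
```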

Google’s Open Images

Google is a new player in the field of datasets but you know that when Google does something it will do it with a bang. And it has not disappointed here either.

Open Images is a new dataset first released in 2016 that contains ~9 million images – which is fewer than ImageNet. What makes it stand out is that these images are mostly of complex scenes that span thousands of classes of objects. Moreover, ~2 million of these images are hand-annotated with bounding boxes making Open Images by far the largest existing dataset with object location annotations. In this subset of images, there are ~15.4 million bounding boxes of 600 classes of object. These objects are also part of a hierarchy (see here for a nice image of this hierarchy) but one that is nowhere near as complex as WordNet.

Open Images example image with bounding box annotation

As of a few months ago, there is also a challenge associated with Open Images called the “Open Images Challenge”. It is an object detection challenge and, more interestingly, it also includes a visual relationship detection track (e.g. “woman playing a guitar” rather than just “guitar” and “woman”). The inaugural challenge will be held at this year’s European Conference on Computer Vision. It looks like this will be a super interesting event considering the complexity of the images in the dataset and, as a result, I foresee this challenge becoming the de facto object detection challenge in the near future. I am certainly looking forward to seeing the results, which should be posted around the time of the conference (September 2018).

Microsoft’s COCO Dataset

Microsoft is also in this game with its Common Objects in Context (COCO) dataset. Containing ~200K images, it’s relatively small, but what makes it stand out are the additional features it provides for each image and the challenges built around them, for example:

  • object segmentation information rather than just bounding boxes of objects (see image below)
  • five textual captions per image such as “the a380 air bus ascends into the clouds” and “a plane flying through a cloudy blue sky”.

The first of these points is worth providing an example image of:

(Example COCO image with per-object segmentation masks)

Notice how each object is segmented rather than outlined by a bounding box as is the case with ImageNet and Open Images examples? This object segmentation feature of the dataset makes for very interesting challenges because segmenting an object like this is many times more difficult than just drawing a rectangular box around it.
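If you want to play with these segmentation annotations yourself, the official pycocotools package makes it straightforward. A minimal sketch (the annotation file path assumes the standard COCO download layout, so adjust as needed):

```python
from pycocotools.coco import COCO  # pip install pycocotools

coco = COCO("annotations/instances_val2017.json")  # path to a downloaded annotation file
img_id = coco.getImgIds()[0]                       # grab some image from the set
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))

for ann in anns:
    mask = coco.annToMask(ann)                     # per-pixel 0/1 mask, not just a box
    category = coco.loadCats(ann["category_id"])[0]["name"]
    print(category, int(mask.sum()), "pixels")
```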

COCO challenges are also held annually. But each year’s challenge is slightly different. This year the challenge has four tracks:

  1. Object segmentation (as in the example image above)
  2. Panoptic segmentation task, which requires object and background scene segmentation, i.e. a task to segment the entire image rather than just the dominant objects in an image
  3. Keypoint detection task, which involves simultaneously detecting people and localising their keypoints
  4. DensePose task, which involves simultaneously detecting people and localising their dense keypoints (i.e. mapping all human pixels to a 3D surface of the human body)

Very interesting, isn’t it? There is always something engaging taking place in the world of computer vision!



Recent Controversies in Computer Vision – From Facebook to Uber

Computer vision is a fascinating area in which to work and perform research. And, as I’ve mentioned a few times already, it’s been a pleasure to witness its phenomenal growth, especially in the last few years. However, as with pretty much anything in the world, contention also plays a part in its existence.

In this post I would like to present two recent events from the world of computer vision that have caused controversy:

  1. A judge’s ruling that Facebook must stand trial for its facial recognition software
  2. Uber’s autonomous car death of a pedestrian

Facebook and Facial Recognition

This is an event that has seemingly passed under the radar – at least for me it did. Probably because of the Facebook-Cambridge Analytica scandal that has been recently flooding the news and social discussions. But I think this is also an important event to mull over because it touches upon underlying issues associated with an important topic: facial recognition and privacy.

So, what has happened?

In 2015, Facebook was hit with a class action lawsuit (the original can be found here) by three residents of Chicago, Illinois. They are accusing Facebook of violating the state’s biometric privacy laws by collecting and storing biometric data of each user’s face. This data is being stored without written notification. Moreover, it is not clear exactly what the data is to be used for, nor how long it will reside in storage, nor was there an opt-out option ever provided.

Facebook began to collect this data, as the lawsuit states, in a “purported attempt to make the process of tagging friends easier”.


In other words, what Facebook is doing (yes, even now) is summarising the geometry of your face with certain parameters (e.g. distance between eyes, shape of chin, etc.). This data is then used to try to locate your face elsewhere to provide tag suggestions. But for this to be possible, the biometric data needs to be stored somewhere for it to be recalled when needed.
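To illustrate the idea with a toy example: once a face has been reduced to a handful of numbers, finding it elsewhere is essentially a nearest-neighbour comparison. The vectors and threshold below are made up purely for illustration and have nothing to do with Facebook’s actual system.

```python
import numpy as np

# Hypothetical "face templates": each face summarised as a short vector of
# geometric measurements (eye distance, chin shape, ...), as described above.
stored_template = np.array([0.42, 1.37, 0.88, 2.10])  # kept in storage somewhere
new_face = np.array([0.41, 1.35, 0.90, 2.07])         # measured in a freshly uploaded photo

distance = np.linalg.norm(stored_template - new_face)
THRESHOLD = 0.1  # arbitrary cut-off for this toy example
print("suggest tag" if distance < THRESHOLD else "no match", f"(distance={distance:.3f})")
```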

The Illinois residents are not happy that a firm is doing this without their knowledge or consent. Considering the Cambridge Analytica scandal, they kind of have a point, you would think? Who knows where this data could end up. They are suing for $75,000 and have requested a jury trial.

Anyway, Facebook protested against this lawsuit and asked that it be thrown out of court, stating that the law in question does not cover its tagging suggestion feature. A year ago, a District Judge rejected Facebook’s appeal.

Facebook appealed again stating that proof of actual injury needs to be shown. Wow! As if violating privacy isn’t injurious enough!?

But on the 14th of May, the same judge rejected (official ruling here) Facebook’s appeal:

[It’s up to a jury] to resolve the genuine factual disputes surrounding facial scanning and the recognition technology.

So, it looks like Facebook will be facing the jury on July 9th this year! Huge news, in my opinion. Even if any verdict will only pertain to the United States. There is still so much that needs to be done to protect our data but at least things seem to be finally moving in the right direction.

Uber’s Autonomous Car Death of Pedestrian

You probably heard on the news that on March 18th this year a woman was hit by an autonomous car owned by Uber in Arizona as she was crossing the road. She died in hospital shortly after the collision. This is believed to be the first ever fatality of a pedestrian in which an autonomous car was involved. There have been other deaths in the past (3 in total) but all of them have been of the driver.


3 weeks ago the US National Transportation Safety Board (NTSB) released its first report (short read) into this crash. It was only a preliminary report but it provides enough information to state that the self-driving system was at least partially at fault.


The report gives the timeline of events: the pedestrian was detected about 6 seconds before impact, but the system had trouble identifying what it had detected. She was first classified as an unknown object, then a vehicle, then a bicycle – and even then the system couldn’t work out the object’s direction of travel. At 1.3 seconds before impact, the system realised that it needed to engage an emergency braking maneuver, but this maneuver had been disabled earlier to prevent erratic vehicle behaviour on the roads. Moreover, the system was not designed to alert the driver in such situations. The driver began braking less than 1 second before impact but it was tragically too late.

Bottom line is, if the self-driving system had immediately recognised the object as a pedestrian walking directly into its path, it would have known that avoidance measures would have needed to be taken – well before the emergency braking maneuver was called to be engaged. This is a deficiency of the artificial intelligence implemented in the car’s system. 

No statement has been made with respect to who is legally at fault. I’m no expert but it seems like Uber will be given the all-clear: the pedestrian had hard drugs detected in her blood and was crossing outside a designated crossing area of the road.

Nonetheless, this is a significant event for AI and computer vision (which plays a pivotal role in self-driving cars) because if these technologies had performed better, the crash would have been avoided (as researchers have shown).

Big ethical questions are being taken seriously. For example, who will be held accountable if a fatal crash is deemed to be the fault of the autonomous car? The car manufacturer? The people behind the algorithms? One sole programmer who messed up a for-loop? Stanford scholars have been openly discussing the ethics behind autonomous cars for a long time (it’s an interesting read, if you have the time).

And what will be the future for autonomous cars in the aftermath of this event? Will their inevitable delivery into everyday use be pushed back?

Testing of autonomous cars has been halted by Uber in North America. Toyota has followed suit. And Chris Jones, who leads the Autonomous Vehicle Analysis service at the technology analyst company Canalys, says that these events will set the industry back considerably:

It has put the industry back. It’s one step forward, two steps back when something like this happens… and it seriously undermines trust in the technology.

Furthermore, a former US Secretary of Transportation has deemed the crash a “wake up call to the entire [autonomous vehicle] industry and government to put a high priority on safety.”

But other news reports seem to indicate a different story.

Volvo, the make of car that Uber was using in the fatal crash, stated only last week that it expects a third of its cars sold to be autonomous by 2025. Other car manufacturers are making similar announcements. Two weeks ago General Motors and Fiat Chrysler unveiled self-driving deals with companies like Google to push for a lead in the self-driving car market.

And Baidu (China’s Google, so to speak) is heavily invested in the game, too. Even Chris Jones is admitting that for them this is a race:

The Chinese companies involved in this are treating it as a race. And that’s worrying. Because a company like Baidu – the Google of China – has a very aggressive plan and will try to do things as fast as it can.

And when you have a race among large corporations, there isn’t much that is going to even slightly postpone anything. That’s been my experience in the industry anyway.

Summary

In this post I looked at two recent events from the world of computer vision that have caused controversy.

The first was a judge’s ruling in the United States that Facebook must stand trial for its facial recognition software. Facebook is being accused of violating the Illinois’ biometric privacy laws by collecting and storing biometric data of each user’s face. This data is being stored without written notification. Moreover, it is not clear exactly what the data is being used for, nor how long it is going to reside in storage, nor was there an opt-out option ever provided.

The second event was the first recorded death of a pedestrian by an autonomous car in March of this year. A preliminary report was released by the US National Transportation Safety Board 3 weeks ago that states that AI is at least partially at fault for the crash. Debate over the ethical issues inherent to autonomous cars has heated up as a result but it seems as though the incident has not held up the race to bring self-driving cars onto our streets.
