Zbigatron

How Facial Recognition Works – Part 1

Posted on February 11, 2020 (updated January 25, 2024) by Zbigatron

(Update: part 2 of this post can be found here)

Facial recognition. What a hot topic this is today! Hardly a week goes by without it hitting the news in one way or another, and usually for the wrong reasons – i.e. privacy and its growing ubiquity. But have you ever wondered how this technology works? How is it that only now it has become such a hot topic of interest, whereas 10 years ago not many people cared about it at all?

In this blog post I hope to demystify facial recognition and present its general workings in a lucid way. I especially want to explain why it is that facial recognition has seen tremendous performance improvements over the last 5 or so years.

I will break this post up according to the steps taken in most facial recognition technologies and discuss each step one by one:

1. Face Detection and Aligning
2. Face Representation
3. Model Training
4. Recognition

(Note: in my next blog post I will describe Google’s FaceNet technology: the facial recognition algorithm behind much of the hype we are witnessing today.)

1. Face Detection and Aligning

The first step in any (reputable) facial recognition technology is to detect the location of faces in an image/video. This step is called face detection and should not be confused with actual face recognition. Generally speaking, face detection is a much simpler task than facial recognition and in many ways it is considered a solved problem.

There are numerous face detection algorithms out there. A popular one, especially in the days when CPUs and memories weren’t what they are today, used to be the Viola-Jones algorithm because of its impressive speed. However, much more accurate algorithms have since been developed. This paper benchmarked a few of these under various conditions and the Tiny Face Detector (published at the world-class Conference on Computer Vision and Pattern Recognition in 2017 – code can be downloaded here) came out on top:

Face detection being performed by Tiny Face Detector. *(Image taken from paper’s website)*

If you wish to implement a face detection algorithm, here is a good article that shows how you can do this using Python and OpenCV. How face detection algorithms work, however, is beyond the scope of this article. Here, I would like to focus on actual face recognition and assume that faces have already been located in a given image.

Once faces have been detected, the next (usual) step is to rotate and scale the face in order for its main features to be located in the same place, more or less, as the features of other detected faces. Ideally, you want the face to be looking directly at you with the lips and the position of the eyes parallel to the ground. The aligning of faces is an important step much akin to the cleaning of data (for those that work in data analysis). It makes further processing a lot easier to perform.
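To give a feel for what aligning involves, here is a minimal sketch that works out how far a face is tilted and how much it needs to be scaled, using only the positions of the two eyes. The function name, the eye coordinates and the target eye distance of 60 pixels are all assumptions made up for this illustration; real systems typically use more facial landmarks and then warp the whole image.

```python
import math

def alignment_from_eyes(left_eye, right_eye, target_eye_distance=60.0):
    """Return (tilt_degrees, scale) describing how to level the eyes and
    normalise the distance between them. Inputs are (x, y) pixel
    coordinates; every value used here is hypothetical."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    # Angle of the line joining the eyes, relative to the horizontal.
    tilt = math.degrees(math.atan2(dy, dx))
    # Scale factor that maps the current eye distance onto the target.
    scale = target_eye_distance / math.hypot(dx, dy)
    return tilt, scale

# A tilted face: the second eye sits 30 px further along both axes.
tilt, scale = alignment_from_eyes((100, 120), (130, 150))
print(f"eye line tilted by {tilt:.1f} degrees; scale by {scale:.2f}")
```

Rotating the image to undo the tilt and resizing it by the scale factor would put the eyes in roughly the same place for every face, which is the whole point of this step.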

2. Face Representation

When we have a clean picture of a face, the next step is to extract a representation of it. A representation (often also called a signature) for a machine is a description/summary of a thing in a form that can be processed and analysed by it. For example, when dealing with faces, it is common to represent a face as a vector of numbers. The best way to explain this is with a simple example.

Suppose we choose to represent a face with a 2-dimensional vector: the first dimension represents the distance between the eyes, the second dimension the width of the nose. We have two people, Alice and Bob, and a photo of each of their faces. We detect and align the two faces in these photos and work out that the distance between the eyes of Alice is 12 pixels and that of Bob is 15 pixels. Similarly, the width of Alice’s nose is 4 pixels, Bob’s is 7 pixels. We therefore have the following representations of the two faces:

So, for example, Alice’s face is represented by the vector (12, 4) – the first dimension stores the distance between the eyes, the second dimension stores the width of the nose. Bob’s face is represented by the vector (15, 7).

Of course, one photo of a person is not enough for a machine to get a robust representation/understanding of a person’s face. We need more examples. So, let’s say we grab 3 photos of each person and extract the representation from each of them. We might come up with the following list of vectors (remember this list as I’ll use these numbers later in the post to explain further concepts):

Having numbers like these to represent faces is much easier to deal with for a machine than raw pictures – machines are unlike us!

Now, in the above example, we used two dimensions to describe a face. We can easily increase the dimensionality of our representations to include even more features. For example, the third dimension could represent the colour of the eyes, the fourth dimension the colour of the skin – and so on. The more dimensions we choose to use to describe each face, generally speaking, the more precise our descriptions will be. And the more precise our descriptions are, the easier it will be for our machines to perform facial recognition. In today’s facial recognition algorithms, it is not uncommon to see vectors with 128+ dimensions.
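For those who like to see things in code, here is a small sketch of face vectors in Python. Alice’s and Bob’s vectors come from the example above; the helper function and the idea of comparing two faces by the straight-line (Euclidean) distance between their vectors are my additions for illustration. The 128-dimensional vectors are filled with placeholder values only.

```python
import math

# Representations from the example: (eye distance, nose width) in pixels.
alice = (12, 4)
bob = (15, 7)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean_distance(alice, bob))  # sqrt(3**2 + 3**2) = sqrt(18)

# The same function works unchanged for longer vectors, e.g. 128 dimensions.
u = [0.0] * 128  # placeholder values, not real face data
v = [1.0] * 128
print(euclidean_distance(u, v))  # sqrt(128)
```

Nothing about the code changes when the number of dimensions grows, which is why adding more features to a representation is cheap from the machine’s point of view.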

Let’s talk about the facial features used in representations. The above example, where I chose the distance between the eyes and the width of the nose as features, is a VERY crude example. In reality, reputable facial recognition algorithms use “lower-level” features. For instance, in the late 1990s facial recognition algorithms were being published that considered the gradient direction (using Local Binary Patterns (LBP)) around each pixel on a face. That is, each pixel was analysed to see how much brighter or darker it was compared to its neighbouring pixels. Hence, the brightness/darkness of pixels relative to their neighbours was the feature chosen to create representations of faces. (You can see how this would work: each pixel location would have a different relative brightness/darkness score depending on a person’s facial structure.)
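Here is a minimal sketch of the basic LBP idea for a single pixel: compare the centre of a 3x3 patch with its eight neighbours and pack the results into one number. The pixel values are invented, and the clockwise ordering starting at the top-left corner is just one convention I picked for this example. Full LBP face descriptors then typically build histograms of these codes over regions of the face, which is not shown here.

```python
def lbp_code(patch):
    """Compute the basic 8-neighbour Local Binary Pattern code for the
    centre pixel of a 3x3 patch of grey-level values."""
    centre = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left corner.
    neighbours = [
        patch[0][0], patch[0][1], patch[0][2],
        patch[1][2], patch[2][2], patch[2][1],
        patch[2][0], patch[1][0],
    ]
    code = 0
    for bit, value in enumerate(neighbours):
        # A neighbour at least as bright as the centre contributes a 1.
        if value >= centre:
            code |= 1 << bit
    return code

# Hypothetical grey levels around a pixel with value 25.
patch = [
    [10, 20, 30],
    [40, 25, 60],
    [5, 25, 90],
]
print(lbp_code(patch))
```

Because only the comparison with the centre matters, the code stays the same if the whole patch gets uniformly brighter or darker.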

Other lower-level features have been proposed for facial recognition algorithms, too. Some solutions, in fact, used hybrid approaches – i.e. they used more than one low-level feature. There is an interesting paper from 2018 (M. Wang and W. Deng, “Deep Face Recognition: A Survey“, arXiv preprint) that summarises the history of facial recognition algorithms. This picture from the paper shows the evolution of features chosen for facial recognition:

Notice how the accuracy of facial recognition algorithms (benchmarked on the LFW dataset) has increased over time depending on the type of representation used? Notice also the LBP algorithm, described above, making an appearance in the late 1990s? It gets an accuracy score of around 70%.

And what do you see at the very top of the graph? Deep Learning, of course! Deep Learning changed the face of Computer Vision and AI in general (as I discussed in this earlier post of mine). But how did it revolutionise facial recognition to the point that it is now achieving human-level precision? The key difference is that instead of choosing the features manually (aka “hand-crafting features” – i.e. when you say: “let’s use pixel brightness as a feature”), you let the machine decide which features should be used. In other words, the machine generates the representation itself. You build a neural network and train it to deliver vectors for you that describe faces. You can put any face through this network and you will get a vector out at the end.

But what does each dimension in the vector represent? Well, we don’t really know! We would have to break down the neural network used to see what is going on inside of it step by step. DeepFace, the algorithm developed by Facebook (you can see it mentioned in the graph above), churns out vectors with 4,096 dimensions. But considering that the neural network used has 120 million parameters, it’s probably infeasible to break it apart to see what each dimension in the vector represents exactly. The important thing, however, is that this strategy works! And it works exceptionally well.

3. Model Training

It’s time to move on to the next step: model training. This is the step where we train a classifier (an algorithm to classify things) on a list of face representations of people in order to be able to recognise more examples (representations) of these people’s faces.

Let’s go back to our example with our friends Alice and Bob to explain this step. Recall the table of six vector representations of their faces we came up with earlier? Well, those vectors can be plotted on a 2-dimensional graph like so:

*A scatter-plot of face representations (unit of measurement is pixels; plot generated here)*

Notice how Alice and Bob’s representations cluster around each other (shown in red)? The bottom left data points belong to Alice, the top right belong to Bob. The job of a machine now will be to learn to differentiate between these two clusters. Typically, a well-known classification algorithm such as SVM is used for this.

If more than two clusters are present in the data (i.e. we’re dealing with more than two people in our data), the classifier will need to be able to deal with this. And then if we’re working in higher dimensions (e.g. 128+), the algorithm will need to operate in these dimensions, too.
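As a rough sketch of what “training” can look like, the code below uses a nearest-centroid model, which is a much simpler stand-in for an SVM: it just averages each person’s vectors to find the centre of their cluster. The training vectors are hypothetical (only the first one for each person comes from the example earlier), and the function name is mine.

```python
def train_centroids(samples_by_person):
    """'Train' a nearest-centroid model: average each person's vectors.
    Works for any number of people and any number of dimensions."""
    centroids = {}
    for person, vectors in samples_by_person.items():
        dims = len(vectors[0])
        centroids[person] = tuple(
            sum(v[d] for v in vectors) / len(vectors) for d in range(dims)
        )
    return centroids

# Hypothetical training data: three (eye distance, nose width) vectors each.
samples = {
    "Alice": [(12, 4), (11, 5), (13, 3)],
    "Bob": [(15, 7), (16, 8), (14, 6)],
}
print(train_centroids(samples))
```

Adding a third person is just another entry in the dictionary, and longer vectors need no code changes, which covers the two situations mentioned above. An SVM would instead learn the boundaries between the clusters, which usually copes better with messy real-world data.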

4. Recognition

Recognition is the final step in the facial recognition process. Given a new image of a face, a representation will be generated for it, and the classification algorithm will provide a score of how close it is to its nearest cluster. If the score is high enough (according to some threshold) the face will be marked as recognised/identified.

So, in our example case, let’s say a new photo has emerged with a face on it. We would first generate a representation of it, e.g.: (13, 4). The classification algorithm will take this vector and see which cluster it lies closest to – in this case it will be Alice’s. Since this data point is very close to the cluster, a high recognition score will also be generated. The picture below illustrates this example:

*A scatter-plot of face representations. The green point represents the new face that will be classified as Alice’s*
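The recognition step for this example can be sketched in a few lines. I am assuming here that a cluster centre has already been learned for each person (the two centres below are hypothetical), and the score formula 1 / (1 + distance) and the threshold of 0.4 are arbitrary choices for illustration, not values taken from any real system.

```python
import math

# Cluster centres assumed to have been learned in the training step.
centroids = {"Alice": (12.0, 4.0), "Bob": (15.0, 7.0)}

def recognise(vector, centroids, threshold=0.4):
    """Return (name, score); name is None when the best score is too low.
    The score 1 / (1 + distance) is an arbitrary choice for this sketch."""
    best_name, best_score = None, 0.0
    for name, centre in centroids.items():
        score = 1.0 / (1.0 + math.dist(vector, centre))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

print(recognise((13, 4), centroids))   # the new face from the example
print(recognise((30, 20), centroids))  # a face far from both clusters
```

The first call lands right next to Alice’s cluster and is accepted; the second is far from everyone, so its best score falls below the threshold and no name is returned.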

This recognition step is usually extremely fast. And the accuracy of it is highly dependent on the preceding steps – the most important of which is the quality of the second step (the one that generates representations of faces).

Conclusion

In this post I described the major steps taken in a robust facial recognition algorithm, using a running example to illustrate the concepts behind each one. The major breakthrough in facial recognition came in 2014 when deep learning was used to generate representations of faces in place of hand-crafted features. As a result, facial recognition algorithms can now achieve near human-level precision.

(Update: part 2 of this post can be found here)

—

To be informed when new content like this is posted, subscribe to the mailing list:

De-Identification of Faces in Live Videos – ICCV 2019

Posted on December 13, 2019 (updated January 25, 2024) by Zbigatron

Facial recognition. What a hot topic this is today. A week hardly goes by without it making the news in one way or another. This technology seems to be infiltrating more and more of our everyday lives: from ID verification on our phones (to unlock them) to automated border processing systems at airports. In some ways this is a good thing but in some this is not. The most controversial aspect of the growing ubiquity of facial recognition technology (FRT) is arguably the erosion of privacy. The more that FRT is used in our lives, the more it seems that we are turning into a highly monitored society.

This erosion of privacy is such a pressing issue for some that three cities in the USA have banned the use of FRT: San Francisco and Oakland in California and Somerville in Massachusetts. These bans, however, only affect city agencies such as police departments. Portland, Oregon, on the other hand, may soon be introducing a bill that could also cover private retailers and airlines. Moreover, according to Mutale Nkonde, a Harvard fellow and AI policy advisor, a federal ban could be around the corner.

FRT is undoubtedly controversial with respect to the debate on privacy.

In this post I would like to introduce to you a paper from the International Conference on Computer Vision 2019 (ICCV) that attempts to provide that little bit of additional privacy in our lives by proposing a fast and impressive method to de-identify videos in real-time. The de-identification process is purported to be effective against machines rather than humans such that we are still able to perceive the original identity of the speaker in the resulting video.

(TL;DR: jump to the end to see the results generated by the researchers. It’s quite impressive.)

The paper in question here is entitled “Live Face De-Identification in Videos” by Gafni et al. published by the Facebook Research Group (isn’t it ironic that Facebook is writing academic papers on privacy?).

The de-identification algorithm itself is a bit tricky to explain but I’ll do my best. First, an adversarial autoencoder network is paired with a trained facial classifier. An autoencoder network (which is a special case of the encoder-decoder architecture) works by imposing a bottleneck in the neural network, which forces the network to learn a compressed representation of the original input image. (Here’s a fantastic video explaining what this means exactly). So, what happens is that a compressed version of your face is generated – but the important aspects such as your pose, lip positioning, expression, illumination conditions, any occlusions, etc. are all retained. What is discarded are the identifying elements. This retaining/discarding is controlled by having a trained facial classifier nearby that the autoencoder tries to fool. During training, the autoencoder gets better and better at fooling the facial classifier by learning to more effectively discard identifying elements from faces, while retaining the important aspects of them.

What results is a face that is still easily recognisable to us, an algorithm that works in real-time (meaning that you can “turn it on” for Skype sessions, for example) and one that doesn’t need to be retrained for each particular face – it just works out of the box for all faces.

Here is a video released by the authors of the paper showing some of their results:

Very impressive, if you ask me! Remember, the generated faces in the video have been de-identified, meaning that a facial recognition algorithm (FaceNet or ArcFace, for example) will find them extremely difficult to recognise. In fact, experiments were performed to see how well the researchers’ algorithm performs against popular FRTs. For one experiment, FaceNet was tested on images before and after de-identification. The true positive rate for one dataset dropped from almost 0.99 to less than 0.04. Very nice, indeed.
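In case the term is unfamiliar, the true positive rate is simply the fraction of genuine matches that the recogniser accepts. The sketch below shows the calculation on made-up numbers: the distances, the threshold and the function name are all hypothetical and are not taken from the paper.

```python
def true_positive_rate(distances, threshold):
    """Fraction of genuine pairs (two photos of the same person) that a
    verifier accepts, i.e. whose embedding distance is below the threshold."""
    accepted = sum(1 for d in distances if d < threshold)
    return accepted / len(distances)

# Made-up embedding distances between two photos of the same person.
before = [0.4, 0.6, 0.5, 0.7, 0.9]  # original images
after = [1.3, 1.5, 0.8, 1.6, 1.4]   # after de-identification
threshold = 1.0                      # hypothetical decision threshold

print(true_positive_rate(before, threshold))
print(true_positive_rate(after, threshold))
```

A successful de-identification method pushes the distances for genuine pairs above the threshold, so the rate collapses, which is the effect reported above.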

Moreover, the paper also goes into detail on a number of key steps of their algorithm. One of them is the de-identification distance from the original image. That is, they play around a bit with how much a person’s face is de-identified. The image below shows Nicolas Cage being gradually anonymised by increasing a variable in the algorithm. This is also something quite interesting.

*Image showing Nicolas Cage being gradually de-identified (image taken from original publication)*

Summary

In this post I presented a paper from the Facebook Research Group on the de-identification of faces in real-time. In the context of FRTs and the current hot debate on privacy, this is an important piece of work, especially considering the fact that this algorithm is getting impressive results and works in real-time. Whether we will see this technology in use in the near future is hard to say, but I wouldn’t be surprised if a de-identification app that works much like the face swap filter on Instagram becomes available to the general public. There is certainly a demand for it.


The Largest Cat Video Dataset in the World

Posted on September 9, 2019 (updated January 27, 2024) by Zbigatron

This is a bit of a fun post about a “dataset” I stumbled upon a few days ago… a dataset of cat videos.

As I’ve mentioned numerous times in various posts of mine, the deep learning revolution that is driving the recent advancements in AI around the world needs data. Lots and lots of data. For example, the famous image classification models that are able to tell you, with better precision than humans, what objects are in an image, are trained on datasets containing sometimes millions of images (e.g. ImageNet). Large datasets have basically become essential fuel for the AI boom of recent years.

So, I had a really good laugh when a few days ago I found this innocuous and unexposed YouTube channel owned by a Japanese man who has been posting a few videos every day of him feeding stray cats. Since he has been doing this for the past 9 years, he has managed to accumulate over 19,000 cat videos on his channel. And in doing so he has most probably and inadvertently created the largest cat video dataset in the world. My goodness!

Technically speaking, unless you’re a die-hard cat lover (like me!), these videos aren’t all that interesting. They’re simply of stray cats having a decent feed or drink with their good-hearted caretaker on occasion uttering a few sentences here and there. Here’s one, for example, of two cats eating out of a bowl:

Or here’s one of two cats enjoying a good ol’ scratch behind the ears:

On average these videos are about 30-60 seconds in length. And they’re all titled by the default name given by his cameras (e.g. MVI 3985, etc.). Hence, nothing about these clips is designed in order for them to be found by anybody out there.

However, despite all this mundanity, to the computer vision community (that needs datasets to survive like humans need oxygen), these videos could come in handy… one day. I’m not sure how just yet, but I’m sure somebody out there could find a use for them. I mean, there are over 19,000 cat videos just sitting there. This is just too good to pass up.

So, if there are any academics out there: please, please, please use this “dataset” in your publishable studies. It would make my year, for sure! The cats would be proud, too.

Oh, and one more thing. I found this guy’s Twitter account (@niiyan1216). And, you guessed it: it is full of pictures of cats.


Capturing the Moment in Photography – SIGGRAPH 2019 Award

Posted on July 31, 2019 (updated January 27, 2024) by Zbigatron

SIGGRAPH 2019 is coming to an end today. SIGGRAPH, which stands for “Special Interest Group on Computer GRAPHics and Interactive Techniques”, is a world-renowned annual conference held predominantly for computer graphics researchers – but you do sometimes get papers from the world of computer vision being published there. In fact, I’ve presented a few such papers on this blog in the past (e.g. see here).

I’m not going to present any papers from this conference today. What I would like to do is mention a person who is being recognised at this year’s conference with a special award. Michael F. Cohen, the current Director of Facebook’s Computational Photography Research team, a few days ago received the 2019 Steven A. Coons Award for Outstanding Creative Contributions to Computer Graphics. This is an award given every two years to honour outstanding lifetime contributions to computer graphics and interactive techniques.

For the full, very impressive list of Michael’s achievements, see the SIGGRAPH award’s page. But there are a few that stand out: in particular, his significant contributions to Facebook’s 3D photos feature and, most interestingly (for me), his work on The Moment Camera.

You may recall that in March of this year I wrote about Smartphone Camera Technology from Google and Nokia. At the time, I didn’t realise that the foundations for the technologies I discussed there were laid down by Michael nearly 15 years ago.

In that post I talked about High Dynamic Range (HDR) Imaging, which is a technique employed by some cameras to give you better quality photos. The basic idea behind HDR is to capture additional shots of the same scene (at different exposure levels, for instance) and then take what’s best out of each photo to create a single picture. For example, the image on the right below was created by a Google phone using a single camera and HDR technology. A quick succession of 10 photos (called an image burst) was taken of a dimly lit indoor scene. The final merged picture gives a vivid representation of the scene. Quite astonishing, really.

(image taken from here)
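To illustrate the “take what’s best out of each photo” idea, here is a toy sketch that merges differently exposed greyscale frames by giving more weight to pixels that are neither too dark nor too bright. The weighting function, the function names and the pixel values are all assumptions of mine; real pipelines also align the frames first and are far more sophisticated, so this is not how any particular phone does it.

```python
def well_exposedness(value):
    """Weight in (0, 1]: highest for mid-grey, lowest near black or white."""
    return 1.0 - abs(value - 127.5) / 127.5 + 1e-6

def fuse_exposures(frames):
    """Merge equally sized greyscale frames (rows of 0-255 values) using a
    per-pixel average weighted by how well exposed each pixel is."""
    height, width = len(frames[0]), len(frames[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            pixels = [frame[y][x] for frame in frames]
            weights = [well_exposedness(p) for p in pixels]
            total = sum(w * p for w, p in zip(weights, pixels))
            row.append(total / sum(weights))
        fused.append(row)
    return fused

# Hypothetical 1x2 frames of the same scene at two exposure levels.
under_exposed = [[0, 40]]     # left pixel crushed to black
over_exposed = [[120, 255]]   # right pixel blown out to white
print(fuse_exposures([under_exposed, over_exposed]))
```

For each pixel the result leans almost entirely on whichever frame captured it properly, so the crushed and blown-out values barely contribute.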

Well, Michael F. Cohen laid out the basic ideas behind HDR for combining images/photos to create better pictures at the beginning of this century. For example, he along with Richard Szeliski published this fantastic paper in 2006. In it he talks about the idea of capturing a moment rather than an image. Capturing a moment is a much better description of what HDR is all about!

The abstract to the paper says it best:

Future cameras will let us “capture the moment,” not just the instant when the shutter opens. The moment camera will gather significantly more data than is needed for a single image. This data, coupled with automated and user-assisted algorithms, will provide powerful new paradigms for image making.

Ah, the moment camera. What a good name for HDR-capable phones!

It’s interesting to note that it has taken a long time for the moment camera to become available to the general public. I would guess that we just had to wait for faster CPUs on our phones for Michael’s work to become a reality. However, some features of the “moment camera” described in the 2006 paper are yet to be implemented in our HDR-enabled phones. For example, this idea of a group shot being improved by image segmentation:

The original caption to the image reads: “Working with stored images, the user indicates when each person photographed looks best. The system automatically finds the best regions around each selection to compose into a final group shot.” (image taken from original publication)

Anyway, a well-deserved lifetime achievement award, Michael. And thank you for the “moment camera”.


Smart Glasses for the Blind – Why has it Taken This Long?

Posted on May 18, 2019 (updated January 27, 2024) by Zbigatron

Remember Google Glass? Those smart glasses that were released by Google to the public in May of 2014 (see image below). Less than a year later production was halted because, well, not many people wanted to walk around with a goofy-looking pair of specs on their noses. They really did look wacky. I’m not surprised the gadget never caught on.

Well, despite the (predictable?) flop of Google Glass, it turns out that there has proven to be a fantastic use case for such smart glasses: for people with visual impairments.

There is a company out there called Aira that provides an AI-guided service used in conjunction with smart glasses and an app on a smartphone. When images are captured by the glasses’ forward-facing camera, image and text recognition are used and an AI assistant, dubbed “Chloe”, describes in speech what is present in these images: whether it be everyday objects such as products on a shelf in your pantry, words on medication bottles or even words in a book.

Quite amazing, isn’t it?

Simple tasks like object and text recognition are performed locally on the smartphone. However, more complex tasks can be sent to Aira’s cloud services (powered by Amazon’s AWS).

Furthermore, the user has the option to, at the tap of a button on the glasses or app, connect to a live agent who is then able to access a live video stream from the smart glasses and other data from the smartphone like GPS location. With these the live agent is able to provide real-time assistance by speaking directly to the visually impaired person. A fantastic idea.

According to NVIDIA, Aira trains its object recognition deep learning neural networks not on image datasets like ImageNet but on 3 million minutes’ worth of data captured by their users, which has been annotated by Aira’s agents. An interesting idea considering how time-consuming such a task must have been. But this has given the service an edge, as training on real-world scenarios has reportedly provided better results.

The uses for Aira’s product and service are pretty much endless! As suggested on their site, you can use Aira for things like: reading to a child, locating a stadium seat, reading a whiteboard, navigating premises, sorting and reading mail and the paper, enjoying the park or the zoo, roaming historical sites. Gosh, the list goes on and on!

And thankfully, the glasses don’t look goofy at all! That’s definitely a win right there.

Finally, I would encourage you to take a look at this official video demonstrating the uses of Aira. This is computer vision serving society in the right way.

(Unfortunately, the video that was once here has been taken down by the publisher)


Delivery Drones and the Google Wing Project

Posted on April 15, 2019 (updated January 27, 2024) by Zbigatron

I gave a guest lecture last Thursday at Carnegie Mellon University at their Adelaide campus in South Australia. (A special shout-out to the fantastic students that I met there!). The talk was on the recent growth of computer vision (CV) in the industry. At the end of the presentation I showed the students some really interesting projects that are being worked on today in the CV domain such as Amazon Go, Soccer/Football on Your Tabletop, autonomous cars (which I am yet to write about), CV in the fashion industry, and the like.

I missed one project, however, that has been making news in the past few days in Australia: delivery drones. Three days ago, Google announced that it is officially launching the first home delivery drone service in Australia in our capital city, Canberra, to deliver takeaway food, coffee, and medicines. Google Wing is the name of the project behind all this.

Big, big news, especially for computer vision.

In this post I am going to look at the story behind this. I will present:

the benefits of delivery drones,
the potential drawbacks of them,
and then I’ll take a look at (as much as is possible) the technology behind Google’s drones.

The Benefits of Delivery Drones

There was an official report prepared two months ago by AlphaBeta Advisors on behalf of Google Wing for the Standing Committee on Economic Development at the Parliament of the Australian Capital Territory (Canberra). The report, entitled “Inquiry into drone delivery systems in the ACT“, analysed the benefits of delivery drones in order to sway the government to give permission for drones to be utilised in this city for the purposes described above. The report was successful since, as I’ve mentioned, the requested permission was granted a few days ago.

Let’s take a look (in summary) at the benefits discussed by the report. Note that numbers presented here are specific to Canberra.

Benefits for local businesses:

More households can be brought into range by delivery drones. More households means more consumers.
Reduction of delivery costs. It is estimated that delivery costs could fall by up to 80-90% in the long term.
Lower costs will generate more sales.
More businesses delivering means a more competitive market.

Benefits for consumers:

Drones will be able to reach the more underserved members of the public such as the elderly, disabled, and homebound.
Since delivery times are faster by 60-70%, it is estimated that 3 million hours will be saved per year. This includes scenarios where customer pick-up journeys are replaced by drones.
As a result of lower delivery costs, drones could save households $5 million in fees per year.
Product variety will be expanded for the consumer as up to 4 times more merchants could be brought into range for them.

Benefits for society:

35 million km of driving per year will be eliminated as a result of more delivery vehicles being taken off the road. This will reduce traffic congestion.
The above benefit will also result in a reduction of emissions by 8,000 tonnes, which is equivalent to the carbon storage of 250,000 trees (huge!).
Fewer cars on the road means fewer road accidents.

Some convincing arguments here. The benefits to society are my personal favourites. I hate traffic congestion!

The Potential Drawbacks of Delivery Drones

Drawbacks are not discussed in the aforementioned report. But some have been raised by the public living in Canberra. These are definitely worth mulling over:

Noise pollution. Ever since 2014 when Google started testing these delivery drones people have complained about how noisy they are. Some have even mentioned that wildlife seems to have disappeared from delivery areas as a result of this noise pollution. In fact, residents from this area have created an action group, called Bonython Against Drones, “to raise awareness of the negative impact of the drone delivery trial on people, pets and wildlife in Bonython [a suburb in Canberra] and to ensure governance and appropriate legislative orders are in place to protect the community“. Below is a video of a delivery in progress. Bonython Against Drones appears to have a strong case. This noise really is irritating.
Invasion of privacy. Could flying low over people’s properties be deemed as an invasion of privacy? A fair question to ask. Also, could Google use these drones to collect private information from the households they fly over? Of course, the company says that they comply with privacy laws and regulations but, well, their track record on privacy isn’t stellar. Heck, there’s even an entire Wikipedia article on the Privacy Concerns Regarding Google.
Bad weather conditions such as strong winds would render drones unusable. Can we rely on weather conditions so heavily?

The first point is definitely a drawback worth considering.

Google Wing Drones

Let’s take a look at the drones in operation in Canberra.

It seems as if this drone is a hybrid between a plane and a helicopter. The drone has wings with 2 large propellers but also 9 smaller hover propellers. Google says that the hover propellers are designed specifically to reduce noise. From the video above, though, a little bit more is probably needed to curtail that obnoxious buzzing sound.

There’s not much information out there on the technical side of things. For example, no white papers have been released by Google as of yet. But I dug around a bit and managed to come up with some interesting things. I stumbled upon this job description for the position of Perception Software Engineer at Google Wing HQ in California. What a find 🙂

(If you’re reading this post some time after April 2019, chances are the job description has been taken down… sorry about that)

The job description gives us hints as to what is going on in the background of this project. For example, we know that Google has developed “an unmanned traffic management platform–a kind of air traffic control for unmanned aircraft–to safely route drones through the sky”. Very cool.

More importantly for us, we also know that computer vision plays a prominent role in the guidance of these drones:

“Our perception solutions run on real-time embedded systems, and familiarity with computer vision, optical sensors, flight control, and simulation is a plus.”

And the job requirements specifically request 2 years of experience working with camera sensors for computer vision applications.

One interesting task that these drones perform is visual odometry, which is the process of determining the position and orientation of a device/vehicle by analysing camera images. As I’ve documented earlier, visual odometry was a CV technique used on Mars by the MER rovers from way back in the early 2000s.
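To make the idea a little more concrete, here is a minimal sketch of the last step of visual odometry: chaining per-frame motion estimates into a position and orientation. The hard part (working out each motion from camera images) is skipped entirely, and the function name and sample motions are hypothetical. None of this is taken from Google Wing.

```python
import math

def integrate_poses(relative_motions, start=(0.0, 0.0, 0.0)):
    """Chain per-frame motions (forward, sideways, turn) into world poses.

    Each motion is expressed in the vehicle's own frame, in metres and radians.
    Returns a list of (x, y, heading) tuples, starting with the start pose.
    """
    x, y, heading = start
    trajectory = [(x, y, heading)]
    for forward, sideways, turn in relative_motions:
        # Rotate the motion from the vehicle frame into the world frame.
        x += forward * math.cos(heading) - sideways * math.sin(heading)
        y += forward * math.sin(heading) + sideways * math.cos(heading)
        heading += turn
        trajectory.append((x, y, heading))
    return trajectory

# Hypothetical estimates: move 1 m forward and turn 90 degrees left,
# then move another 1 m forward.
motions = [(1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)]
path = integrate_poses(motions)
final_x, final_y, final_heading = path[-1]
```

Because every new pose is built on top of the previous one, small errors in the per-frame estimates add up over time, which is why odometry is usually combined with other sensors.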

It’s interesting to note that the CV techniques listed by the job description are performed on embedded systems and are coded in C++. A lot of people (including me) are predicting that embedded systems (e.g. IoTs, edge computing) are the next big thing for CV, so it’s worth taking note of this. Oh, and notice also that C++ is being used here. This language is not dead yet, despite it not being taught at universities any more. C++ is just damn fast – something that is a must in embedded CV solutions.

Summary

This post looked at some background information pertaining to the Google Wing project that, as of a few days ago, officially launched the first home delivery drone service in Australia’s capital city, Canberra. The first section of the post discussed the benefits and drawbacks of delivery drones. The last part of the post presented the Google Wing project from the technical side. Not much technical information is available on this project but a job description for the position of Perception Software Engineer gives us a sneak peek at the inner workings of Google Wing, especially from the perspective of computer vision.

It will be interesting to see whether delivery drones will be deemed a success by Google and also, most importantly, by the public of Canberra.

To be informed when new content like this is posted, subscribe to the mailing list:

My Top 5 Posts So Far

Posted on March 20, 2019 (updated January 27, 2024) by Zbigatron

It’s been nearly 18 months since I started this blog. I did it to share my journey in computer vision with you. I love this field and I’m always stumbling across such fascinating things that I feel as though more people should know about them.

I’ve seen this blog grow in popularity – much, much more than I had anticipated when I first started it. In this little “bonus” post, I thought I’d list my top posts thus far with additional comments about them.

I also thought I’d compile a second list with my personal favourite posts. These have not been as popular but I sure as hell had fun writing them!

Enjoy! And thanks for the support over the last 18 months.

My top 5 posts thus far:

Why Deep Learning Has Not Superseded Traditional Computer Vision – I wrote this post on a Friday evening directly after work with a beer bottle in one hand and people playing pool or foosball around me. I wrote it up in an hour or so and didn’t think much of it, to be honest. The next day I woke up and saw, to my extreme surprise, that it had gone slightly viral! It was featured in Deep Learning Weekly (Issue #76), was being reposted by people such as Dr Adrian Rosebrock from PyImageSearch, and was getting about 1000 hits/day. Not bad, hey!?
The Top Image Datasets and Their Challenges
Finding a Good Thesis Topic in Computer Vision – I wrote this post after constantly seeing people asking this question on forums. Considering it’s consistently in my top 3 posts every week, I guess people are still searching for inspiration.
Mapping Camera Coordinates to a 2D Floor Plan – This post came about after I had to work on security footage from a bank for a project at work. The boss was very pleased with what I had done and writing about my experiences in a post was a no-brainer after that.
The Early History of Computer Vision – History is something that really interests me so it was only a matter of time before I was going to read up on the history of computer vision. Once I did and saw how fascinating it was, I just had to write a post about it.

My favourite posts thus far:

Like I said, these are not popular (some barely get a single hit in a week) but I really enjoyed researching for and writing them.

Soccer on Your Tabletop – The coolest thing going around in computer vision.
Amazon Go – Computer Vision at the Forefront of Innovation – This to me is something amazing.
The Baidu and ImageNet Controversy – Nothing like a good controversy!
Computer Vision on Mars – Computer vision in space. Imagine working on that project!
The Growth of Computer Vision in the Industry / The Reasons Behind the Recent Growth of Computer Vision – I’m proud of how far computer vision has come over the years. It’s been a pleasure to be a part of the adventure.

Enjoy looking back over my posts. Thanks once again for your support over the last 18 months.

Smartphone Camera Technology from Google and Nokia

Posted on March 13, 2019 (updated February 1, 2024) by Zbigatron

A few days ago Nokia unveiled its new smartphone: the Nokia 9 PureView. It looks kind of weird (or maybe funky?) with its 5 cameras at its rear (see image above). But what’s interesting is how Nokia uses these 5 cameras to give you better quality photos with a technique called High Dynamic Range (HDR) imaging.

HDR has been around in smartphones for a while, though. In fact, Google has had this imaging technique available in some of its phones since at least 2014. And in my opinion it does a much better job than Nokia.

In this post I would like to discuss what HDR is and then present what Nokia and Google are doing with it to provide some truly amazing results. I will break the post up into the following sections:

High Dynamic Range Imaging (what it is)
The Nokia 9 PureView
Google’s HDR+ (some amazing results here)

High Dynamic Range Imaging

I’m sure you’ve attempted to take photos of scenes with a high range of luminosity, such as dimly lit scenes or ones where the backdrop is brightly lit. Frequently such photos come out overexposed, underexposed and/or blurred. The foreground, for example, might be completely in shadow, or details might be blurred because it’s hard to keep the camera still when you slow the shutter speed down to let in extra light.

HDR attempts to alleviate these high range scenario problems by capturing additional shots of the same scene (at different exposure levels, for instance) and then taking what’s best out of each photo and merging this into one picture.
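Here is a minimal sketch of that "take the best of each photo" idea. It assumes the shots are already aligned and reduced to brightness values between 0 and 1, and it scores each pixel by how close it is to mid-grey. That scoring rule, the function names and the sample values are my own illustrative assumptions, not what Nokia or Google actually do.

```python
import math

def well_exposedness(value, sigma=0.2):
    """Score a brightness value in [0, 1]: highest near mid-grey (0.5)."""
    return math.exp(-((value - 0.5) ** 2) / (2 * sigma ** 2))

def fuse_exposures(shots, sigma=0.2):
    """Blend aligned shots pixel by pixel, favouring well-exposed pixels.

    shots: list of equally long lists of brightness values in [0, 1].
    """
    fused = []
    for pixels in zip(*shots):
        weights = [well_exposedness(p, sigma) for p in pixels]
        total = sum(weights)  # always positive, since exp() is never zero
        fused.append(sum(w * p for w, p in zip(weights, pixels)) / total)
    return fused

# Hypothetical two-pixel scene: the left pixel is in shadow, the right one is bright.
under = [0.05, 0.50]  # short exposure: shadow crushed, highlight fine
over = [0.45, 1.00]   # long exposure: shadow fine, highlight clipped
fused = fuse_exposures([under, over])
```

For the shadow pixel the result leans heavily on the long exposure, and for the bright pixel it leans on the short one, so neither ends up crushed or clipped.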

Photo by Gustave Le Gray (image taken from Wikipedia)

Interestingly, the idea of taking multiple shots of a scene to provide a better single photo goes back to the 1850s. Gustave Le Gray, a highly noted French photographer, rendered seascapes showing both the sky and the sea by using one negative for the sky, and another one with a longer exposure for the sea. He then combined the two into one picture in the positive. Quite innovative for the period. The picture on the right was captured by him using the HDR technique.

The Nokia 9 PureView

As you’ve probably already guessed, Nokia uses the five cameras on the Nokia 9 PureView to take photos of the same scene. However, the cameras are not all the same. Two are standard RGB sensors that capture colour. The remaining three are monochrome sensors that capture nearly three times as much light as the RGB cameras. Each of the 5 cameras has a resolution of 12 megapixels. There is also an infrared sensor for depth readings.

Depending on the scene and lighting conditions each camera can be triggered up to four times in quick succession (commonly referred to as burst photography).

One colour photo is then selected to act as the primary shot and the other photos are used to improve it with details.

The final result is a photo built from up to 240 megapixels of data (5 cameras × 12 megapixels × up to 4 shots each). Interestingly, you also have control over how much photo merging takes place and where this merging occurs. For example, you can choose to add additional detail to the foreground and ignore the background. The depth map from the depth sensor undoubtedly assists in this. And yes, you have access to all the RAW files taken by the cameras.

Not bad, but in my opinion Google does a much better job… and with only one camera. Read on!

Google’s HDR+

Google’s HDR technology is dubbed HDR+. It has been around for a while, first appearing in the Nexus 5 and 6 phones. It is now a standard on the Pixel range of phones. It is standard because HDR+ uses the regular single camera on Google’s phones.

It gets away with just using one camera by taking up to 10 photos in quick succession – many more per camera than Nokia does. Although the megapixel count of the resulting photos may not match Nokia’s, the results are nonetheless impressive. Just take a look at this:

google-hdr-eg — *(image taken from here)*

That is a dimly lit indoor scene. The final result is truly astonishing, isn’t it?

Here’s another example:

What makes HDR+ stand out from the crowd is its academic background. This isn’t some black-box technology that we know nothing about – it’s a technology that has been peer-reviewed by world-class academics and published at a world-class conference (SIGGRAPH Asia 2016).

Moreover, only last month, Google released to the public a dataset of archives of image bursts to help improve this technology.
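To get a feel for why merging a burst of frames helps so much in dim light, here is a small simulation. It is not Google's pipeline: it assumes the frames are already perfectly aligned, models sensor noise as simple Gaussian noise on a single pixel, and merges by plain averaging. All names and numbers are made up for illustration.

```python
import random
import statistics

def simulate_burst(true_value, noise_std, n_frames, rng):
    """Simulate one pixel captured n_frames times with Gaussian sensor noise."""
    return [true_value + rng.gauss(0.0, noise_std) for _ in range(n_frames)]

def merge_burst(frames):
    """Merge a burst by plain averaging (assumes the frames are aligned)."""
    return sum(frames) / len(frames)

rng = random.Random(0)           # fixed seed so the run is repeatable
true_value, noise_std = 0.2, 0.05
single_errors, merged_errors = [], []
for _ in range(2000):
    burst = simulate_burst(true_value, noise_std, 10, rng)
    single_errors.append(burst[0] - true_value)           # one frame on its own
    merged_errors.append(merge_burst(burst) - true_value)  # ten frames merged

single_noise = statistics.pstdev(single_errors)
merged_noise = statistics.pstdev(merged_errors)
```

Averaging N independent noisy readings shrinks the noise by roughly a factor of √N, so with 10 frames `merged_noise` should come out at around a third of `single_noise`.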

When Google does something, it (usually) does it with a bang. You have to love this. This is HDR imaging done right.

Google’s Dataset Search Engine – Find The Dataset You Need Here

Posted on February 27, 2019 (updated January 27, 2024) by Zbigatron

Did you know that there is now a search engine for datasets that is powered by Google? Well, there is! And it’s something that the research community and the industry have been needing (whether they knew it or not) for years now.

This new search engine is called Dataset Search and can be found at this link.

This is a big deal. Datasets have become crucial since the prominent arrival of deep learning onto the scene a few years ago. Deep learning needs data. Lots and lots of data. This is because in deep learning, neural networks are told to (more or less) autonomously discover the underlying patterns in data. In computer vision, for example, you would want a machine to learn that bicycles are composed of two wheels, a handlebar, and a seat. But you need to provide enough examples for a machine to be able to learn these patterns.

Creating such large datasets is not an easy task. Some of the top image datasets (as I have documented here) contain millions of hand-annotated images. These are famous datasets that most people in the computer vision world know about. But what about datasets that are more niche and hence less known? Some of these can be very difficult to find – and you certainly would not want to spend months or years creating them only to find that someone had already gone to all the trouble before you.

Up until now, then, there was no central location to search for these datasets. You had to manually traverse the web in the hope of finding what you were looking for. But that was until Dataset Search came along! Thank the heavens for that. Although Dataset Search is still in its beta stage, this is definitely something the research and industry communities have been needing.

For datasets to be listed in a coherent and informative manner on Dataset Search, Google has developed guidelines for dataset providers. These guidelines are based on schema.org, which is an open standard for describing such information (in metadata tags). As Google states:

We encourage dataset providers, large and small, to adopt this common standard so that all datasets are part of this robust ecosystem.

It would be a good idea to start adhering to these guidelines when creating datasets because a central place of reference for datasets is something we all need.
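As a rough illustration of what such metadata can look like, here is a sketch that builds a schema.org `Dataset` description and wraps it in a JSON-LD script tag for a dataset's web page. The property names (`name`, `description`, `url`, `license`, `creator`) come from schema.org, but the helper function and every value below are hypothetical placeholders. Check Google's guidelines for the fields they actually expect.

```python
import json

def dataset_jsonld(name, description, url, license_url, creator_name):
    """Build a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "creator": {"@type": "Person", "name": creator_name},
    }

# Hypothetical dataset; none of these values refer to a real resource.
metadata = dataset_jsonld(
    name="Example Street Sign Images",
    description="A placeholder collection of annotated street sign photos.",
    url="https://example.com/datasets/street-signs",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Jane Doe",
)

# JSON-LD is typically embedded in the page inside a script tag like this one.
markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(metadata, indent=2)
    + "\n</script>"
)
```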

As a side note, Dataset Search has been in development for at least three years (interestingly, Dataset Search’s previous name was actually Goods – Google Dataset Search). Google released two academic papers on this in 2016 – see here and here. It’s nice to see that their work has finally culminated in what they have offered us now.

Dataset Search is definitely a step in the right direction.

Seeing Around Corners with a Laser

Posted on January 24, 2019 (updated January 27, 2024) by Zbigatron

In this post I would like to show you some results of another interesting paper I came across recently that was published last year in the prestigious Nature journal. It’s on the topic of non-line-of-sight (NLOS) imaging or, in other words, it’s about research that helps you see around corners. NLOS could be something particularly useful for use cases such as autonomous cars in the future.

I’ll break this post up into the following sections:

The LIDAR laser-mapping technology
LIDAR and NLOS
Current Research into NLOS

Let’s get cracking, then.

LIDAR

You may have heard of LIDAR (a term which combines “light” and “radar”). It is used very frequently as a tool to scan surroundings in 3D. It works similarly to radar, but instead of emitting radio waves it sends out pulses of infrared light and then measures the time it takes for this light to return to the emitter. Light reflected by closer objects returns sooner than light reflected by distant ones. In this way, a 3D representation of the scene can be acquired, like this one which shows a home damaged by the 2011 Christchurch Earthquake:

LIDAR has been around for decades and I came across it very frequently in my past research work in computer vision, especially in the field of robotics. More recently, LIDAR has been experimented with in autonomous vehicles for obstacle detection and avoidance. It really is a great tool to acquire depth information of the scene.
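The timing idea behind LIDAR boils down to one line of arithmetic: the pulse travels to the object and back, so the distance is the speed of light times the round-trip time, divided by two. Here is a minimal sketch of that, plus the conversion of one reading into a 3D point. The function names, the angle convention and the sample numbers are my own assumptions rather than any particular scanner's API.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def distance_from_round_trip(seconds):
    """Distance to the reflecting surface, given the pulse's round-trip time."""
    return SPEED_OF_LIGHT * seconds / 2.0

def to_point(distance, azimuth, elevation):
    """Turn a range reading plus the beam's direction (radians) into (x, y, z)."""
    x = distance * math.cos(elevation) * math.cos(azimuth)
    y = distance * math.cos(elevation) * math.sin(azimuth)
    z = distance * math.sin(elevation)
    return (x, y, z)

# Hypothetical reading: the pulse comes back after 100 nanoseconds,
# with the beam pointing straight ahead and level.
distance = distance_from_round_trip(100e-9)  # just under 15 metres
point = to_point(distance, azimuth=0.0, elevation=0.0)
```

Repeat this for thousands of beam directions and you get the kind of point cloud that makes up a 3D scan.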

NLOS Imaging

But what if the thing you want to see is obscured by another object? What if you want to see what’s behind a wall or what’s in front of the car in front of you? LIDAR does not, by default, allow you to do this:

lidar-eg-with-occlusion — *The rabbit object is not reachable by the LIDAR system (image adapted from this video)*

This is where the field of NLOS comes in.

The idea behind NLOS is to use sensors like LIDAR to bounce laser light off walls and then read back any reflected light.

lidar-eg-with-NLOS — *The laser is bounced off the wall to reach the object hidden behind the occluder (image adapted from this video)*

This process is repeated around a particular point (p in the image above) to obtain as much reflected light as possible. The reflected light is then analysed in an attempt to reconstruct any objects on the other side of the occlusion.

This is still an open area of research with many assumptions (e.g. that light is not reflected multiple times by the occluded object but bounces straight back to the wall and then the sensors) but the work on this done so far is quite intriguing.
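Under that single-bounce assumption, the timing works much like ordinary LIDAR with an extra leg added. Here is a rough sketch which further assumes that the laser and the sensor are aimed at the same wall point, so the light travels sensor → wall → hidden object → wall → sensor. The function name and the distances are hypothetical, and real reconstruction methods are far more involved than this.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def hidden_object_range(total_seconds, wall_distance):
    """Distance from the wall point to the hidden surface.

    total_seconds: time for the full trip sensor -> wall -> object -> wall -> sensor.
    wall_distance: known distance from the sensor to the wall point, in metres.
    """
    one_way = SPEED_OF_LIGHT * total_seconds / 2.0  # wall leg + hidden leg
    hidden_leg = one_way - wall_distance
    if hidden_leg < 0:
        raise ValueError("light returned too early to have gone past the wall")
    return hidden_leg

# Hypothetical setup: wall point 2 m from the sensor, hidden object 0.5 m from it.
round_trip = 2 * (2.0 + 0.5) / SPEED_OF_LIGHT
hidden_range = hidden_object_range(round_trip, wall_distance=2.0)
```

A single reading like this only says how far the hidden surface is from the wall point, not in which direction. That is why many readings have to be combined before a shape can be recovered.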

Current Research into NLOS

The paper that I came across is entitled “Confocal non-line-of-sight imaging based on the light-cone transform”. It was published in March of last year in the Nature journal (555, no. 7696, p. 338). Nature is one of the world’s top and most famous academic journals, so anything published there is more than just world-class – it’s unique and exceptional.

The experiment setup from this paper was as shown here:

nlos-experiment-setup — *The setup of the experiment for NLOS. The laser light is bounced off the white wall to hit and reflect off the hidden object (image taken from original publication)*

The idea, then, was to try and reconstruct anything placed behind the occluder by bouncing laser light off the white wall. In the paper, two objects were scrutinised: an “S” (as shown in the image above) and a road sign. With a novel method of reconstruction, the authors were able to obtain the following reconstructed 3D images of the two objects:

NLOS-results — *(image adapted from original publication)*

Remember, these results are obtained by bouncing light off a wall. Very interesting, isn’t it? What’s even more interesting is that the text on the street sign has been detected as well. Talk about precision! You can clearly see how one day this could come in handy with autonomous cars, which could use information such as this to increase safety on the roads.

A computer simulation was also created to measure precisely the error rates involved in the reconstruction process. The simulated setup was as shown in the above images with the bunny rabbit. The results of the simulation were as follows:

NLOS-simulation-results — *(image adapted from original publication)*

The green in the image shows the reconstructed parts of the bunny superimposed on the original object. You can clearly see how the 3D shape and structure of the object are extremely well preserved. Obviously, the parts of the bunny not visible to the laser could not be reconstructed.

Summary

This post introduced the field of non-line-of-sight imaging, which is, in a nutshell, research that helps you see around corners. The idea behind NLOS is to use sensors like LIDAR to bounce laser light off walls and then read back any reflected light. An attempt is then made to reconstruct the scene behind the occlusion.

Recent results from state-of-the-art research in NLOS published in the Nature journal were also presented in this post. Although much more work is needed in this field, the results are quite impressive and show that NLOS could one day be very useful with, for example, autonomous cars, which could use information such as this to increase safety on the roads.

Be an Optimist Prime in the world of Computer Vision and AI

Author: Zbigatron
