
Loss of Class Dynamics Amid Distance Learning

In this article I would like to step away a bit from computer vision and talk about education at university/college level and how I hope it won’t change too much as we (hopefully) slowly recover from the coronavirus pandemic (at least here in Australia).

I bumped into a colleague of mine from Carnegie Mellon University last week while having coffee outside the nearby cafe in town. We got talking about the usual things and then the topic ventured, of necessity, towards COVID-19 and how things will change post-pandemic. For example, we’re already seeing remote work opportunities being given permanent status in some firms because (finally) benefits for both employers and employees are being recognised. But my colleague and I then started talking about how things might change at our university. Perhaps students at CMU should be given more opportunities to attend lectures remotely even after the need for social distancing has passed? This is an interesting question, centred on the whole idea of distance learning, which has been discussed for decades. My colleague and I talked about it for a little bit, too.

The benefits of distance learning are easily discernible: students and lecturers no longer need to commute to a campus, or even reside in the same country; and the university does not need to maintain as many facilities or as much equipment on campus, saving money that could be put towards improving the learning experience in other ways.

The drawbacks of distance learning are also fairly well known: it is more difficult for students to stay motivated, and project work involving teams is much more cumbersome to complete, let alone do well.

But in this whole debate on the benefits and drawbacks of remote learning I think one thing is being significantly disregarded: class dynamics. This is what I would like to write about in this article.

Before I continue, I need to define what I mean by “class dynamics”. Class dynamics, at least the way that I will be using the term here, is a certain atmosphere or ambience in a classroom/lecture environment that can foster or impede the interaction between a pedagogue and their students. Many factors contribute to class dynamics: for example, the attitude and mood of the lecturer, the attitudes and moods of the students, the topics being discussed, and so on.

Class dynamics is just so important. It can significantly affect students’ learning outcomes. It can be the decisive factor between good class engagement and no class engagement, and in whether students come to seek out the lecturer after a session to delve deeper into a topic or to have things explained further. All of this has an impact on the teacher as well. A lecturer will be spurred on by positive class engagement and find satisfaction in what they are doing, and that contentment then flows back onto the students and boosts their satisfaction even more. Class dynamics affects students and teachers in a cyclical way. Like I said, it is just so important.

Since the beginning of the pandemic, I have delivered countless lectures via video conference. Yes, it has been convenient in many respects (e.g. I have worn comfy pyjamas and slippers on my bottom half) but I have come to truly appreciate what a physical classroom environment really gives towards the whole educational experience, predominantly in the context of class dynamics.

Indeed, physical presence just gives so much. Firstly, there is the notion of body language. We’ve all heard just how much body language can convey. It truly can communicate a lot. Little reactions to things I’m saying, people turning around to others at particular moments to seek explanation, slouching – things that a camera cannot properly capture. We read body language and consciously or subconsciously react to it. A good pedagogue will react accordingly and steer discussions or lectures in the right direction, keeping people’s attention at full capacity or noticing when concepts need to be reiterated, perhaps in a different way. You lose all (or at least most) of this when you’re delivering lectures via video conference. I miss this aspect so much. I just can’t read my students’ “presence” in a given lecture at all, and it’s seriously draining and detrimental to all involved, especially since concepts in computer science (and science in general) build on top of each other. Whether a student grasps something now will have a knock-on effect on any future classes he or she attends.

Class dynamics is paramount. And it is fostered by physical presence.

Something else that contributes to class dynamics is the building up of a community in a class. When students attend a campus in person they can get to know each other so much better. They can “hang out” after class or in the evenings and friendships can be formed. Classroom interaction becomes so much better when everyone is relaxed around each other! When you teach via video conference the ability to form a community is significantly diminished. Everyone loses out.

These are really important points to consider. Because, ultimately, with learning via video conferencing, the students, the class dynamic, the relationships between the pedagogue and their pupils, the entire learning experience all get flattened into two dimensions, much like everyone’s face on the screen in front of you.

So much is lost.

It is, hence, important to think about this when weighing up the pros and cons of distance learning. We want to keep the standards at our universities/colleges high while, of course, keeping costs at a minimum and convenience at a maximum. Class dynamics cannot be ignored, even though it is difficult to measure and to put into argument form when discussing these things with the people in charge. But it has to be discussed and argued for, especially now that it looks like the world will slowly be returning to normality in the near future.



Apple’s and Samsung’s Face Unlocking Technologies

Have you ever wondered how the technology that unlocks your phone with your face works? This is a fascinating question and, interestingly, Samsung and Apple provide very different technologies for this feature on their devices. This post will examine the differences between the two technologies and will also show you how each of them can be fooled into granting access to somebody else’s phone.

(Please note: this is the third part of my series of articles on how facial recognition works. Hence, I will breeze over some more complex topics. If you wish to delve deeper into this area of computer vision, please see my first and second posts in this series.)

Samsung’s Face Recognition

Samsung’s face unlocking feature has, perhaps surprisingly, been around since 2011. Over the years, especially recently, it has undergone some improvements. Up until the Galaxy S8 model, face unlocking was done using the regular front camera of the phone to take a picture of your face. This picture was analysed for facial features such as the distance between the eyes, facial contours, iris colour, iris size, etc. The information was stored on the device so that the next time you tried to unlock it, the phone would take a picture of you, extract the same data, and compare it to what it had stored. If everything matched, your phone was unlocked.

This was a cheap, fast, and easy way to implement facial recognition. Unfortunately, it was not very secure. The major problem was that all processing was done using 2D images. So, as you may have guessed, a simple printed photo of your face or even one displayed on another phone could fool the system. Need proof? Here’s a video of someone unlocking a Galaxy Note 8, which was released in 2017, with a photo shown on another phone. It’s quite amusing.

A “liveness check” was added to this technology with the release of Android Jelly Bean in 2012. It worked by attempting to detect blinking. I never tried this feature but, from what I’ve read on forums, it wasn’t very accurate and required a longer time to process your face – probably why it wasn’t turned on by default. And yes, it could also be fooled by a close-up video of you, though this would be much harder to acquire.

With the release of the Galaxy S8, a new biometric identification technology was introduced: iris scanning. Irises, like fingerprints, are unique to each person. Iris scanning on Samsung phones works by illuminating your eye with infrared light (invisible to the naked eye). However, this technology could also be fooled with photographs and contact lenses. Here’s a video of a security researcher from Berlin doing just that. He took a photo of his friend’s eye from a few metres away (!) in infrared mode (i.e. night mode), printed it out on paper, and then stuck a contact lens on the printed eye. Clever.

Perhaps because of this flaw, Samsung’s Galaxy S9 introduced Intelligent Scan, which combined facial scanning and iris scanning. Facial scanning, however, is still only performed on 2D images (as described above) taken from the front camera of the phone. But a combination of the two technologies was seen as improving face unlocking technology in general.

Unfortunately, the Samsung Galaxy S10 (and subsequently the S20) dropped Intelligent Scan and went back to standard 2D photo face recognition. The reason for this was to make room for a larger screen, because the iris scanning components were taking up a little too much room at the top of the phone for Samsung’s liking. This move brought back the possibility of unlocking people’s phones with photos or images. For example, here’s a video showing a Galaxy S10 phone being unlocked with an image on another phone. According to some users, however, if you manually tweak the settings on your phone by going to Settings > Biometrics and Security > Face recognition and toggling “Faster recognition” to off, the system becomes a lot harder to defeat.

(Interestingly, in this period of coronavirus pandemic, people have been crying out for the iris scanning technology to return because face recognition just does not work when you’re wearing a mask!)

Apple’s Face ID

This is where the fun begins. Apple really took face recognition seriously.

The Apple technology in question is called Face ID and it first appeared in November 2017 with the iPhone X.

In a nutshell, Face ID works by first illuminating your face with infrared light (as with iris scanning) and then projecting a further 30,000 (!) infrared points onto your face to build a super-detailed 3D map of your facial features. These 3D maps are then converted into mathematical representations (to understand how this is performed, see my first blog post on how facial recognition works). So, each time you try to unlock your phone, it’s these representations that are compared. Quite impressive.

What’s more, this technology can recognise faces with glasses, clothing, makeup, and facial hair (not face masks, though!), and adapts to changes in appearance over time. The latter works by simply monitoring how your face may be changing over time – e.g. you may be gaining or losing weight, which will of course affect the general structure of your face, and hence its 3D map.

This impressive infrared technology, however, has been in use for a very long time. If you are familiar with the Microsoft Kinect camera/sensor (initially released in 2010), it uses the same concept of infrared point projection to capture and analyse 3D motion.

So, how do you fool the ‘TrueDepth camera system’, as Apple calls it? It’s definitely not easy because this technology is quite sophisticated. But successful attempts have already been documented.

To start off with, here’s a video showing identical twins unlocking each other’s phones. Also quite amusing. How about relatives that look similar? It’s been done! Here’s a video showing a 10-year-old boy unlocking his mother’s phone. Now that’s a little more worrisome. However, it shows that the iPhone X can be an alternative to DNA paternity/maternity tests 🙂 Finally, here’s a video posted by Vietnamese hackers documenting how their 3D-printed face mask fooled Apple’s technology. Some elements of this mask, like the eyes, were printed on a standard colour printer. The model of the face was acquired in 5 minutes using a hand-held scanner.

Conclusion

In summary, if you’re truly worried about security, face unlocking on Samsung phones is just not up to scratch. I would recommend using their new (ultrasonic) fingerprint scanning technology instead. Because Apple’s Face ID works with 3D maps of faces, it is much more secure. In this respect, Apple wins the battle of the phones, for sure.



Review of The Last Lecture by Randy Pausch

I detest receiving books as gifts in the workplace. Such books are usually of the soulless sort, the sort that are written around some lifeless corporate motto that is supposed to inspire a new employee to work overtime when needed. So, when I walked into my first day at work at Carnegie Mellon University (Australia campus) and saw the person in charge of my induction training holding a book obviously destined to end up in my possession, the invisible eyes of my invisible soul rolled as far back as possible into my invisible soul’s head. But I felt guilty about this reaction shortly after when I was told what the book was about: a past Carnegie Mellon University professor’s last lecture shortly before dying of cancer. “This could actually be interesting”, I thought, and mentally added the book to the bottom of my “To Read” list.

That happened over a year ago. I’ve been incessantly pestered ever since by the person in charge of my induction to read this book. But what can one do when one’s “To Read” list is longer than the actual book itself? However, I finally got round to it, and I’m glad I did. Here are my thoughts on it.

The book in question is entitled “The Last Lecture” and was written by Randy Pausch. Randy was a computer scientist (like me!), a professor at CMU (like me!), with significant previous experience in industry (also like me!) – a kindred soul, it seems. In August 2007, he was told he had only 3-6 months to live as a result of pancreatic cancer. The following month he gave his final lecture at CMU in Pittsburgh and then wrote this book about that event.

His final lecture was entitled “Really Achieving Your Childhood Dreams”. During this talk he showed approximately 60 slides, each with a single picture meant to, in one way or another, reference a childhood dream that he was able to fulfil or at least attempted to fulfil. But, as he states at the end of the book, the lecture topic was really a feint (or “head fake”, to use his NFL terminology) for his primary aim: to give a lecture for his three children, aged 6, 3, and 18 months, so that they could see who their father was, and to pass down the wisdom he had accrued in his life that he would have liked to have given them over time. It was heart-wrenching for him to think about his children growing up without their father present and without solid memories of him. So, he wanted to give them something concrete to look back on as the years after his death progressed.

randy-pausch-photo
A photo of Randy Pausch

Randy was very concise in squeezing a lifetime of thoughts into a 60-minute talk – but from it a few things definitely stood out for me.

Firstly, there was his career as an educator, rather than as an academic. He definitely emphasised the former over the latter. Professor Pausch had a passion for teaching. He was damn good at it, too. The stories he tells about how he inspired students throughout the years, and about his (sometimes unorthodox) teaching methods, are stirring and stimulating to a fellow educator like myself. He strove to make a difference in each and every student’s life. In a way, he felt like he was an extension of their parents and that it was his duty to convey to students as much as he could, including things like life experiences. Yes, he was a true educator and he showed this well in his book. He undoubtedly wanted his children to know this part of himself. He wanted them to be proud of his passion and his great adeptness at it.

Another thing that stood out for me was the wisdom conveyed in this book. When faced with death, any honest person is going to make significant re-evaluations of their values, will inevitably see and experience things from a different perspective, and will undoubtedly view past experiences in a different light. It is always worth reading the thoughts of such a person because you know that they will be rich and profound and definitely not soulless. Randy’s short book is full of such thoughts.

The last thing I want to mention is that “The Last Lecture” is permeated with a fighting spirit that overflows into a sense of celebration of life. Despite staring death in the face Randy still managed to let an optimistic outlook govern his everyday workings:

Look, I’m not in denial about my situation. I am maintaining my clear-eyed sense of the inevitable. I’m living like I’m dying. But at the same time, I’m very much living like I’m still living.

He lived his final months in this spirit and conveyed it well in his book, too, if only through the simple fact that the book is full of humour. We can learn a lot from such an outlook on life. The man would have been a great guy to have a coffee with in the staff room, for sure.

In conclusion, Professor Pausch achieved his aim of leaving something for his children to remember him by, to be proud of, and to inspire and teach them as they themselves tread through life. Simultaneously, however, he left a lot for us, too. I can see why I was given this book on my induction day at Carnegie Mellon University. Randy’s children, I’m sure, are proud of him. And now I am proud myself knowing that I am teaching at the same institution as he once did.

(Employers please note: this is how you legitimately make an employee want to work overtime after an induction session)



How Facial Recognition Works – Part 2 (FaceNet)

This post is the second post in my series on “How Facial Recognition Works”. In the first post I talked about the difference between face detection and face recognition, how machines represent (i.e. see) faces, how these representations are generated, and then what happens with them later for facial recognition to work.

Here, I would like to describe a specific facial recognition algorithm – one that changed things forever in this particular domain of artificial intelligence. The algorithm is called FaceNet and it was developed by Google in 2015.

FaceNet was published in a paper entitled “FaceNet: A Unified Embedding for Face Recognition and Clustering” at CVPR 2015 (a world-class conference for computer vision). When it was released it smashed the records on two top facial recognition academic datasets (Labeled Faces in the Wild and YouTube Faces DB), cutting the best published error rate by a whopping 30% on both datasets! This is an utterly HUGE margin by which to beat past state-of-the-art algorithms.

FaceNet’s major innovation lies in the fact that it developed a system that:

…directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. (Quote from original publication)

What this means is that it was the first algorithm to develop a deep neural network (DNN) whose sole task was to create embeddings for any face that was fed through it. That is, any image of a face inputted into the neural network would be given a 128-dimensional vector representation in Euclidean space.

What this also means is that similar-looking faces are clustered/grouped together because they receive similar vector representations. Hence, standard techniques such as k-means clustering (to group faces) or an SVM (to classify them) can be employed on the generated embeddings to perform facial recognition directly.

(To understand how such clustering algorithms work and to understand terms and concepts such as “embeddings” and “vector representations”, please refer to my first post on facial recognition where, as I have said earlier, I explain the fundamentals of facial recognition). 
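To give a feel for how recognition with these embeddings works in practice, here is a minimal Python sketch of the simplest possible verification step: compare two embeddings by Euclidean distance. The threshold value is purely illustrative (not taken from the paper) and would normally be tuned on a validation set.

```python
import numpy as np

def same_person(embedding_a, embedding_b, threshold=1.1):
    """Decide whether two FaceNet-style embeddings belong to the same person.

    embedding_a and embedding_b are 128-dimensional vectors produced by the
    network. The threshold here is an illustrative value only.
    """
    distance = np.linalg.norm(embedding_a - embedding_b)   # Euclidean distance
    return distance < threshold, distance

# Hypothetical usage with two pre-computed embeddings:
# match, d = same_person(emb_photo_1, emb_photo_2)
```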

Another important contribution of Google’s paper was its choice of method for training the deep neural network to generate embeddings for faces. Usually, one would train a DNN for a fixed number of classes, e.g. 10 different types of objects to be detected in images. You would collect a large number of example images from each of these 10 classes and tell the DNN during training which image contained which class. In tandem, you would use, for instance, a cross-entropy loss function that would indicate the error rate of the model being trained – i.e. how far away you were from an “ideally” trained neural network. However, because the neural network is going to be used to generate embeddings (rather than, for example, to state which of 10 particular objects is in an image), you don’t really know how many classes you are training your DNN for. It’s a different problem that you are trying to solve. You need a different loss function – something specific to generating embeddings. In this respect, Google decided to opt for the triplet-based loss function.

The idea behind the triplet-based loss function is to, during the training phase, take three example images from the training data:

  • A random image of a person – we call this image the anchor image
  • A random but different image of the same person – we call this image the positive image
  • A random image of another person – we call this image the negative image.

During training, then, embeddings will be created for these three images and the triplet-based loss function’s task is to minimise the distance (in the Euclidean space) between the anchor and positive image and maximise the distance between the anchor and negative image. The following image from the original publication depicts this idea:

triplet-based-loss-function

Notice how the negative image is initially closer to the anchor than the positive image is. The neural network would then adjust itself so that the positive image ends up closer to the anchor than the negative one. The process is repeated for different anchor, positive, and negative images.
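If you like to think in code, here is a minimal NumPy sketch of the triplet-based loss for a single triplet. The margin value is illustrative, and the real FaceNet training operates on batches of carefully selected triplets, but the core idea is just this:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for a single (anchor, positive, negative) embedding triple.

    The loss is zero once the positive is closer to the anchor than the
    negative is, by at least `margin`; otherwise the network is penalised.
    A sketch of the idea, not the exact FaceNet training code.
    """
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return max(d_pos - d_neg + margin, 0.0)
```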

Employing the triplet-based loss function to guide the training of the DNN was an incredibly intelligent move by Google. Likewise the decision to use a DNN to generate embeddings for faces outright. It really is no surprise that FaceNet burst onto the scene like it did and subsequently laid a solid foundation for facial recognition. The current state of this field owes an incredible amount to this particular publication.

If you would like to play around with FaceNet, take a look at its GitHub repository here.

(Update: part 1 of this post can be found here)



How Facial Recognition Works – Part 1

(Update: part 2 of this post can be found here)

Facial recognition. What a hot topic this is today! Hardly a week goes by without it hitting the news in one way or another, and usually for the wrong reasons – i.e. privacy and its growing ubiquity. But have you ever wondered how this technology works? How is it that only now it has become such a hot topic of interest, whereas 10 years ago not many people cared about it at all?

In this blog post I hope to demystify facial recognition and present its general workings in a lucid way. I especially want to explain why it is that facial recognition has seen tremendous performance improvements over the last 5 or so years.

I will break this post up according to the steps taken in most facial recognition technologies and discuss each step one by one:

  1. Face Detection and Alignment
  2. Face Representation
  3. Model Training
  4. Recognition

(Note: in my next blog post I will describe Google’s FaceNet technology: the facial recognition algorithm behind much of the hype we are witnessing today.)



1. Face Detection and Alignment

The first step in any (reputable) facial recognition technology is to detect the location of faces in an image/video. This step is called face detection and should not be confused with actual face recognition. Generally speaking, face detection is a much simpler task than facial recognition and in many ways it is considered a solved problem.

There are numerous face detection algorithms out there. A popular one, especially in the days when CPUs and memory weren’t what they are today, used to be the Viola-Jones algorithm because of its impressive speed. However, much more accurate algorithms have since been developed. This paper benchmarked a few of these under various conditions and the Tiny Face Detector (published in the world-class Conference on Computer Vision and Pattern Recognition in 2017 – code can be downloaded here) came out on top:

teaser
Face detection being performed by Tiny Face Detector. (Image taken from paper’s website)

If you wish to implement a face detection algorithm, here is a good article that shows how you can do this using Python and OpenCV. How face detection algorithms work, however, is beyond the scope of this article. Here, I would like to focus on actual face recognition and assume that faces have already been located on a given image.
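As a taster, here is a minimal face detection sketch using OpenCV’s bundled Haar cascade (a classic, fast detector rather than a state-of-the-art one). It assumes you have the opencv-python package installed; the input filename is just a placeholder.

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("group_photo.jpg")            # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Returns a list of (x, y, width, height) boxes, one per detected face
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detected.jpg", img)
```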

Once faces have been detected, the next (usual) step is to rotate and scale each face so that its main features are located in more or less the same place as the features of other detected faces. Ideally, you want the face to be looking directly at the camera, with the line of the eyes and the lips parallel to the ground. The aligning of faces is an important step, much akin to the cleaning of data (for those that work in data analysis). It makes further processing a lot easier to perform.
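To illustrate what alignment involves, here is a rough sketch that levels a face by rotating it about the midpoint between the eyes. It assumes the eye coordinates have already been found by a facial landmark detector (not shown here).

```python
import cv2
import numpy as np

def align_face(face_img, left_eye, right_eye):
    """Rotate a cropped face so the line through the eyes is horizontal.

    left_eye and right_eye are (x, y) pixel coordinates, which in practice
    come from a landmark detector.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))          # tilt of the eye line

    eyes_centre = ((left_eye[0] + right_eye[0]) / 2.0,
                   (left_eye[1] + right_eye[1]) / 2.0)

    # Rotate about the midpoint between the eyes to level the face
    M = cv2.getRotationMatrix2D(eyes_centre, angle, 1.0)
    h, w = face_img.shape[:2]
    return cv2.warpAffine(face_img, M, (w, h))
```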

2. Face Representation

When we have a clean picture of a face, the next step is to extract a representation of it. A representation (often also called a signature) for a machine is a description/summary of a thing in a form that can be processed and analysed by it. For example, when dealing with faces, it is common to represent a face as a vector of numbers. The best way to explain this is with a simple example.

Suppose we choose to represent a face with a 2-dimensional vector: the first dimension represents the distance between the eyes, the second dimension the width of the nose. We have two people, Alice and Bob, and a photo of each of their faces. We detect and align the two faces in these photos and work out that the distance between the eyes of Alice is 12 pixels and that of Bob is 15 pixels. Similarly, the width of Alice’s nose is 4 pixels, Bob’s is 7 pixels. We therefore have the following representations of the two faces:

Person | Distance between eyes (px) | Width of nose (px)
Alice  | 12                         | 4
Bob    | 15                         | 7

So, for example, Alice’s face is represented by the vector (12, 4) – the first dimension stores the distance between the eyes, the second dimension stores the width of the nose. Bob’s face is represented by the vector (15, 7).

Of course, one photo of a person is not enough for a machine to get a robust representation/understanding of a person’s face. We need more examples. So, let’s say we grab 3 photos of each person and extract the representation from each of them. We might come up with the following list of vectors (remember this list as I’ll use these numbers later in the post to explain further concepts):

table-of-face-representations

Numbers like these are much easier for a machine to deal with than raw pictures – machines are unlike us!

Now, in the above example, we used two dimensions to describe a face. We can easily increase the dimensionality of our representations to include even more features. For example, the third dimension could represent the colour of the eyes, the fourth dimension the colour of the skin – and so on. The more dimensions we choose to use to describe each face, generally speaking, the more precise our descriptions will be. And the more precise our descriptions are, the easier it will be for our machines to perform facial recognition. In today’s facial recognition algorithms, it is not uncommon to see vectors with 128+ dimensions.

Let’s talk about the facial features used in representations. The above example, where I chose the distance between the eyes and the width of the nose as features, is a VERY crude one. In reality, reputable facial recognition algorithms use “lower-level” features. For instance, in the late 1990s facial recognition algorithms were being published that considered the local texture around each pixel on a face using Local Binary Patterns (LBP). That is, each neighbouring pixel was analysed to see whether it was brighter or darker than the pixel at the centre. Hence, the relative brightness/darkness of pixels was the feature chosen to create representations of faces. (You can see how this would work: each pixel location would have a different relative brightness/darkness pattern depending on a person’s facial structure.)
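For the curious, here is a small NumPy sketch of the basic 8-neighbour LBP computation described above. It is a simplified version of the operator; real implementations add refinements such as circular sampling and uniform patterns.

```python
import numpy as np

def lbp_image(gray):
    """Compute the basic 8-neighbour Local Binary Pattern code for each pixel.

    Each neighbour that is at least as bright as the centre pixel contributes
    one bit, giving a value between 0 and 255 that describes local texture.
    """
    gray = gray.astype(np.int32)
    centre = gray[1:-1, 1:-1]
    # Offsets of the 8 neighbours, walked clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:gray.shape[0] - 1 + dy,
                         1 + dx:gray.shape[1] - 1 + dx]
        codes |= (neighbour >= centre).astype(np.int32) << bit
    return codes

# A histogram of these codes (often computed per image region) can then be
# concatenated into the face's feature vector.
```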

Other lower-level features have been proposed for facial recognition algorithms, too. Some solutions, in fact, used hybrid approaches – i.e. they used more than one low-level feature. There is an interesting paper from 2018 (M. Wang and W. Deng, “Deep Face Recognition: A Survey“, arXiv preprint) that summarises the history of facial recognition algorithms. This picture from the paper shows the evolution of features chosen for facial recognition:

facial-recognition-history

Notice how the accuracy of facial recognition algorithms (benchmarked on the LFW dataset) has increased over time depending on the type of representation used? Notice also the LBP algorithm, described above, making an appearance in the late 1990s? It gets an accuracy score of around 70%.

And what do you see at the very top of the graph? Deep Learning, of course! Deep Learning changed the face of Computer Vision and AI in general (as I discussed in this earlier post of mine). But how did it revolutionise facial recognition to the point that it is now achieving human-level precision? The key difference is that instead of choosing the features manually (aka “hand-crafting features” – i.e. when you say: “let’s use pixel brightness as a feature”), you let the machine decide which features should be used. In other words, the machine generates the representation itself. You build a neural network and train it to deliver vectors for you that describe faces. You can put any face through this network and you will get a vector out at the end.

But what does each dimension in the vector represent? Well, we don’t really know! We would have to break down the neural network used to see what is going on inside of it step by step. DeepFace, the algorithm developed by Facebook (you can see it mentioned in the graph above), churns out vectors with 4,096 dimensions. But considering that the neural network used has 120 million parameters, it’s probably infeasible to break it apart to see what each dimension in the vector represents exactly. The important thing, however, is that this strategy works! And it works exceptionally well.
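To make the “network that delivers vectors” idea concrete, here is a toy PyTorch sketch. It is nowhere near the scale of DeepFace or FaceNet, and the layer sizes are made up, but it shows the shape of the idea: an image goes in, a fixed-length embedding comes out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEmbeddingNet(nn.Module):
    """A toy convolutional network that maps a face crop to a 128-D vector.

    Only a sketch of the idea; real systems use far larger architectures
    and huge training sets.
    """
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embedding_dim)

    def forward(self, x):                       # x: (batch, 3, 160, 160) face crops
        h = self.features(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)   # unit-length embedding

# embedding = FaceEmbeddingNet()(torch.randn(1, 3, 160, 160))  # shape (1, 128)
```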

3. Model Training

It’s time to move on to the next step: model training. This is the step where we train a classifier (an algorithm to classify things) on a list of face representations of people in order to be able to recognise more examples (representations) of these people’s faces.

Let’s go back to our example with our friends Alice and Bob to explain this step. Recall the table of six vector representations of their faces we came up with earlier? Well, those vectors can be plotted on a 2-dimensional graph like so:

facial-recognition-graph
A scatter-plot of face representations (unit of measurement is pixels; plot generated here)

Notice how Alice’s and Bob’s representations each cluster together (shown in red)? The bottom-left data points belong to Alice, the top-right belong to Bob. The job of a machine now is to learn to differentiate between these two clusters. Typically, a well-known classification algorithm such as an SVM is used for this.

If more than two clusters are present in the data (i.e. we’re dealing with more than two people in our data), the classifier will need to be able to deal with this. And then if we’re working in higher dimensions (e.g. 128+), the algorithm will need to operate in these dimensions, too.

4. Recognition

Recognition is the final step in the facial recognition process. Given a new image of a face, a representation will be generated for it, and the classification algorithm will provide a score of how close it is to its nearest cluster. If the score is high enough (according to some threshold) the face will be marked as recognised/identified.

So, in our example case, let’s say a new photo has emerged with a face on it. We would first generate a representation of it, e.g.: (13, 4). The classification algorithm will take this vector and see which cluster it lies closest to – in this case it will be Alice’s. Since this data point is very close to the cluster, a high recognition score will also be generated. The picture below illustrates this example:

face-representation-graph
A scatter-plot of face representations. The green point represents the new face that will be classified as Alice’s

This recognition step is usually extremely fast. And the accuracy of it is highly dependent on the preceding steps – the most important of which is the quality of the second step (the one that generates representations of faces).
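Here is a minimal scikit-learn sketch tying steps 3 and 4 together. The six training vectors are made-up values loosely based on the Alice (12, 4) and Bob (15, 7) numbers from earlier; real systems would of course use far higher-dimensional representations.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up example vectors (distance between eyes, width of nose) in pixels
X = np.array([[12, 4], [11, 5], [12, 5],     # Alice's three photos (assumed)
              [15, 7], [16, 7], [15, 6]])    # Bob's three photos (assumed)
y = ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"]

clf = SVC(kernel="linear").fit(X, y)         # step 3: train the classifier

new_face = np.array([[13, 4]])               # step 4: representation of a new photo
print(clf.predict(new_face))                 # -> ['Alice']
print(clf.decision_function(new_face))       # signed distance from the boundary,
                                             # a rough confidence score
```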

Conclusion

In this post I described the major steps taken in a robust facial recognition algorithm, using a simple example to illustrate the concepts behind each step. The major breakthrough in facial recognition came in 2014 when deep learning started being used to generate representations of faces rather than hand-crafted features. As a result, facial recognition algorithms can now achieve near human-level precision.

(Update: part 2 of this post can be found here)



De-Identification of Faces in Live Videos – ICCV 2019

Facial recognition. What a hot topic this is today. A week hardly goes by without it making the news in one way or another. This technology seems to be infiltrating more and more of our everyday lives: from ID verification on our phones (to unlock them) to automated border processing systems at airports. In some ways this is a good thing but in some this is not. The most controversial aspect of the growing ubiquity of facial recognition technology (FRT) is arguably the erosion of privacy. The more that FRT is used in our lives, the more it seems that we are turning into a highly monitored society.

This erosion of privacy is such a foremost issue to some that three cities in the USA have banned the use of FRT: San Francisco and Oakland in California, and Somerville in Massachusetts. These bans, however, only affect city agencies such as police departments. Portland, Oregon, on the other hand, may soon be introducing a bill that could also cover private retailers and airlines. Moreover, according to Mutale Nkonde, a Harvard fellow and AI policy advisor, a federal ban could be around the corner.

FRT is undoubtedly controversial with respect to the debate on privacy.

In this post I would like to introduce you to a paper from the International Conference on Computer Vision 2019 (ICCV) that attempts to provide that little bit of additional privacy in our lives by proposing a fast and impressive method to de-identify faces in videos in real time. The de-identification process is purported to be effective against machines rather than humans, such that we are still able to perceive the original identity of the speaker in the resulting video.

(TL;DR: jump to the end to see the results generated by the researchers. It’s quite impressive.)

The paper in question here is entitled “Live Face De-Identification in Videos” by Gafni et al. published by the Facebook Research Group (isn’t it ironic that Facebook is writing academic papers on privacy?).

The de-identification algorithm itself is a bit tricky to explain but I’ll do my best. First, an adversarial autoencoder network is paired with a trained facial classifier. An autoencoder network (which is a special case of the encoder-decoder architecture) works by imposing a bottleneck in the neural network, which forces the network to learn a compressed representation of the original input image. (Here’s a fantastic video explaining what this means exactly). So, what happens is that a compressed version of your face is generated – but the important aspects such as your pose, lip positioning, expression, illumination conditions, any occlusions, etc. are all retained. What is discarded are the identifying elements. This retaining/discarding is controlled by having a trained facial classifier nearby that the autoencoder tries to fool. During training, the autoencoder gets better and better at fooling the facial classifier by learning to more effectively discard identifying elements from faces, while retaining the important aspects of them.
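For readers who have not met autoencoders before, here is a toy PyTorch sketch of the bottleneck idea. It is only meant to illustrate the compress-then-reconstruct structure; the network in the paper is adversarially trained and vastly more sophisticated.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """A minimal autoencoder: the narrow bottleneck forces the network to keep
    only the most essential information about its input.
    """
    def __init__(self, bottleneck_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),          # the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * 64 * 64), nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (batch, 3, 64, 64)
        z = self.encoder(x)                          # compressed representation
        return self.decoder(z).view(x.shape)         # reconstructed image
```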

What results is a face that is still easily recognisable to us, an algorithm that works in real time (meaning that you can “turn it on” for Skype sessions, for example), and one that doesn’t need to be retrained for each particular face – it just works out of the box for all faces.

Here is a video released by the authors of the paper showing some of their results:

Very impressive, if you ask me! Remember, the generated faces in the video have been de-identified, meaning that a facial recognition algorithm (FaceNet or ArcFace, for example) will find them extremely difficult to deal with. In fact, experiments were performed to see how well the researchers’ algorithm performs against popular FRTs. For one experiment, FaceNet was tested on images before and after de-identification. The true positive rate for one dataset dropped from almost 0.99 to less than 0.04. Very nice, indeed.

Moreover, the paper goes into detail on a number of key steps of the algorithm. One of them is the de-identification distance from the original image. That is, the authors play around a bit with how much a person’s face is de-identified. The image below shows Nicolas Cage being gradually anonymised by increasing a variable in the algorithm. This is also quite interesting.

cage-de-identified
Image showing Nicolas Cage being gradually de-identified (image taken from original publication)

Summary

In this post I presented a paper from the Facebook Research Group on the de-identification of faces in real time. In the context of FRTs and the current hot debate on privacy, this is an important piece of work, especially considering that the algorithm gets impressive results and works in real time. Whether we will see this technology in use in the near future is hard to say, but I wouldn’t be surprised if a de-identification app that works much like the face swap filter on Instagram becomes available to the general public. There is certainly a demand for it.



The Largest Cat Video Dataset in the World

This is a bit of a fun post about a “dataset” I stumbled upon a few days ago… a dataset of cat videos.

As I’ve mentioned numerous times in various posts of mine, the deep learning revolution that is driving the recent advancements in AI around the world needs data. Lots and lots of data. For example, the famous image classification models that are able to tell you, with better precision than humans, what objects are in an image, are trained on datasets containing sometimes millions of images (e.g. ImageNet). Large datasets have basically become essential fuel for the AI boom of recent years.

So, I had a really good laugh when a few days ago I found this innocuous and little-known YouTube channel owned by a Japanese man who has been posting a few videos every day of himself feeding stray cats. Since he has been doing this for the past 9 years, he has managed to accumulate over 19,000 cat videos on his channel. And in doing so he has most probably, and inadvertently, created the largest cat video dataset in the world. My goodness!

Technically speaking, unless you’re a die-hard cat lover (like me!), these videos aren’t all that interesting. They’re simply of stray cats having a decent feed or drink with their good-hearted caretaker on occasion uttering a few sentences here and there. Here’s one, for example, of two cats eating out of a bowl:

Or here’s one of two cats enjoying a good ol’ scratch behind the ears:

On average these videos are about 30-60 seconds in length. And they’re all titled with the default names given by his cameras (e.g. MVI 3985, etc.). Hence, nothing about these clips is designed for them to be found by anybody out there.

However, despite all this mundanity, to the computer vision community (that needs datasets to survive like humans need oxygen), these videos could come in handy… one day. I’m not sure how just yet, but I’m sure somebody out there could find a use for them. I mean, there’s over 19,000 cat videos just sitting there. This is just too good to pass up.

So, if there are any academics out there: please, please, please use this “dataset” in your publishable studies. It would make my year, for sure! The cats would be proud, too.

Oh, and one more thing. I found this guy’s Twitter account (@niiyan1216). And, you guessed it: it is full of pictures of cats.



Capturing the Moment in Photography – SIGGRAPH 2019 Award

SIGGRAPH 2019 is coming to an end today. SIGGRAPH, which stands for “Special Interest Group on Computer GRAPHics and Interactive Techniques”, is a world-renowned annual conference held predominantly for computer graphics researchers – but you do sometimes get papers from the world of computer vision being published there. In fact, I’ve presented a few such papers on this blog in the past (e.g. see here).

Michael F. Cohen

I’m not going to present any papers from this conference today. What I would like to do is mention a person who is being recognised at this year’s conference with a special award. Michael F. Cohen, the current Director of Facebook’s Computational Photography Research team, a few days ago received the 2019 Steven A. Coons Award for Outstanding Creative Contributions to Computer Graphics. This is an award given every two years to honour outstanding lifetime contributions to computer graphics and interactive techniques.

For the full, very impressive list of Michael’s achievements, see the SIGGRAPH award’s page. But there are a few that stand out. In particular, his significant contributions to Facebook’s 3D photos feature and, most interestingly for me, his work on The Moment Camera.

You may recall that in March of this year I wrote about Smartphone Camera Technology from Google and Nokia. At the time, I didn’t realise that the foundations for the technologies I discussed there were laid down by Michael nearly 15 years ago.

In that post I talked about High Dynamic Range (HDR) Imaging, which is a technique employed by some cameras to give you better quality photos. The basic idea behind HDR is to capture additional shots of the same scene (at different exposure levels, for instance) and then take what’s best out of each photo to create a single picture. For example, the image on the right below was created by a Google phone using a single camera and HDR technology. A quick succession of 10 photos (called an image burst) was taken of a dimly lit indoor scene. The final merged picture gives a vivid representation of the scene. Quite astonishing, really.

(image taken from here)
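If you want to experiment with this idea yourself, OpenCV ships with exposure fusion out of the box. The sketch below merges a burst of differently exposed photos into a single picture (file names are placeholders); it is a simple stand-in for the far more sophisticated pipelines running on modern phones.

```python
import cv2
import numpy as np

# A hypothetical burst of the same scene shot at different exposures
paths = ["burst_dark.jpg", "burst_mid.jpg", "burst_bright.jpg"]
images = [cv2.imread(p) for p in paths]

# Mertens exposure fusion: picks the best-exposed parts of each frame
# (no exposure times needed, unlike classic HDR plus tone mapping)
merge = cv2.createMergeMertens()
fused = merge.process(images)                 # float image in roughly [0, 1]

cv2.imwrite("fused.jpg", np.clip(fused * 255, 0, 255).astype("uint8"))
```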

Well, Michael F. Cohen laid out the basic ideas behind HDR, namely combining images/photos to create better pictures, at the beginning of this century. For example, he, along with Richard Szeliski, published this fantastic paper in 2006. In it he talks about the idea of capturing a moment rather than an image. Capturing a moment is a much better description of what HDR is all about!

The abstract to the paper says it best:

Future cameras will let us “capture the moment,” not just the instant when the shutter opens. The moment camera will gather significantly more data than is needed for a single image. This data, coupled with automated and user-assisted algorithms, will provide powerful new paradigms for image making.

Ah, the moment camera. What a good name for HDR-capable phones!

It’s interesting to note that it has taken a long time for the moment camera to become available to the general public. I would guess that we just had to wait for faster CPUs on our phones for Michael’s work to become a reality. However, some features of the “moment camera” described in the 2006 paper are yet to be implemented in our HDR-enabled phones. For example, this idea of a group shot being improved by image segmentation:

capturing-moment-group-shot
The original caption to the image reads: “Working with stored images, the user indicates when each person photographed looks best. The system automatically finds the best regions around each selection to compose into a final group shot.” (image taken from original publication)

Anyway, a well-deserved lifetime achievement award, Michael. And thank you for the “moment camera”.



Smart Glasses for the Blind – Why has it Taken This Long?

Remember Google Glass? Those smart glasses that were released by Google to the public in May of 2014 (see image below). Less than a year later production was halted because, well, not many people wanted to walk around with a goofy-looking pair of specs on their noses. They really did look wacky. I’m not surprised the gadget never caught on.

google-glass
Google Glass in action (image source)

Well, despite the (predictable?) flop of Google Glass, it turns out there is a fantastic use case for such smart glasses: assisting people with visual impairments.


There is a company out there called Aira that provides an AI-guided service used in conjunction with smart glasses and an app on a smartphone. Images are captured by the glasses’ forward-facing camera, image and text recognition are performed on them, and an AI assistant, dubbed “Chloe”, describes in speech what is present in the footage: whether it be everyday objects such as products on a shelf in your pantry, words on medication bottles, or even words in a book.

Quite amazing, isn’t it? 

Simple tasks like object and text recognition are performed locally on the smartphone. However, more complex tasks can be sent to Aira’s cloud services (powered by Amazon’s AWS).

Furthermore, the user has the option to, at the tap of a button on the glasses or app, connect to a live agent who is then able to access a live video stream from the smart glasses and other data from the smartphone like GPS location. With these the live agent is able to provide real-time assistance by speaking directly to the visually impaired person. A fantastic idea.

According to NVIDIA, Aira trains its object recognition deep learning networks not on image datasets like ImageNet but on 3 million minutes’ worth of data captured by its users, which has been annotated by Aira’s agents. An interesting choice considering how time-consuming such a task must have been. But it has given the service an edge, as training on real-world scenarios has reportedly provided better results.

The uses for Aira’s product and service are pretty much endless! As suggested on their site, you can use Aira for things like reading to a child, locating a stadium seat, reading a whiteboard, navigating premises, sorting and reading mail and the paper, enjoying the park or the zoo, and roaming historical sites.

And thankfully, the glasses don’t look goofy at all! That’s definitely a win right there.

aira-glasses
Aira’s nicely-designed smart glasses (image source)

Finally, I would encourage you to take a look at this official video demonstrating the uses of Aira. This is computer vision serving society in the right way.

(Unfortunately, the video that was once here has been taken down by the publisher)




Delivery Drones and the Google Wing Project

I gave a guest lecture last Thursday at Carnegie Mellon University at their Adelaide campus in South Australia. (A special shout-out to the fantastic students that I met there!). The talk was on the recent growth of computer vision (CV) in the industry. At the end of the presentation I showed the students some really interesting projects that are being worked on today in the CV domain such as Amazon Go, Soccer/Football on Your Tabletop, autonomous cars (which I am yet to write about), CV in the fashion industry, and the like.

I missed one project, however, that has been making news in the past few days in Australia: delivery drones. Three days ago, Google announced that it is officially launching the first home delivery drone service in Australia in our capital city, Canberra, to deliver takeaway food, coffee, and medicines. Google Wing is the name of the project behind all this.

Big, big news, especially for computer vision.

In this post I am going to look at the story behind this. I will present:

  • the benefits of delivery drones,
  • the potential drawbacks of them,
  • and then I’ll take a look at (as much as is possible) the technology behind Google’s drones.

The Benefits of Delivery Drones

There was an official report prepared two months ago by AlphaBeta Advisors on behalf of Google Wing for the Standing Committee on Economic Development at the Parliament of the Australian Capital Territory (Canberra). The report, entitled “Inquiry into drone delivery systems in the ACT“, analysed the benefits of delivery drones in order to sway the government to give permission for drones to be used in the city for the purposes described above. The report was successful since, as I’ve mentioned, the requested permission was granted a few days ago.

Let’s take a look (in summary) at the benefits discussed in the report. Note that the numbers presented here are specific to Canberra.

Benefits for local businesses:

  1. More households can be brought into range by delivery drones. More households means more consumers.
  2. Reduction of delivery costs. It is estimated that delivery costs could fall by up to 80-90% in the long term.
  3. Lower costs will generate more sales.
  4. More businesses delivering means a more competitive market.

Benefits for consumers:

  1. Drones will be able to reach the more underserved members of the public such as the elderly, disabled, and homebound.
  2. Since delivery times are faster by 60-70%, it is estimated that 3 million hours will be saved per year. This includes scenarios where customer pick-up journeys are replaced by drones.
  3. As a result of lower delivery costs, drones could save households $5 million in fees per year.
  4. Product variety will be expanded for the consumer as up to 4 times more merchants could be brought into range for them.

Benefits for society:

  1. 35 million km of driving per year will be eliminated as delivery vehicles are taken off the road. This will reduce traffic congestion.
  2. The above benefit will also result in a reduction of emissions by 8,000 tonnes, which is equivalent to the carbon storage of 250,000 trees (huge!).
  3. Fewer cars on the road means fewer road accidents.

Some convincing arguments here. The benefits to society are my personal favourites. I hate traffic congestion!

The Potential Drawbacks of Delivery Drones

Drawbacks are not discussed in the aforementioned report. But some have been raised by the public living in Canberra. These are definitely worth mulling over:

  1. Noise pollution. Ever since Google started testing these delivery drones in 2014, people have complained about how noisy they are. Some have even mentioned that wildlife seems to have disappeared from delivery areas as a result of this noise pollution. In fact, residents from the area have created an action group, called Bonython Against Drones, “to raise awareness of the negative impact of the drone delivery trial on people, pets and wildlife in Bonython [a suburb in Canberra] and to ensure governance and appropriate legislative orders are in place to protect the community“. Below is a video of a delivery in progress. Bonython Against Drones appears to have a strong case. This noise really is irritating.
  2. Invasion of privacy. Could flying low over people’s properties be deemed an invasion of privacy? A fair question to ask. Also, could Google use these drones to collect private information from the households they fly over? Of course, the company says that it complies with privacy laws and regulations but, well, its track record on privacy isn’t stellar. Heck, there’s even an entire Wikipedia article on the Privacy Concerns Regarding Google.
  3. Bad weather conditions such as strong winds would render drones unusable. Can we afford to rely on the weather so heavily?

The first point is definitely a drawback worth considering.

Google Wing Drones

Let’s take a look at the drones in operation in Canberra.

google-wing-drone
The Google Wing drone currently in operation (image taken from here)

It seems as if this drone is a hybrid between a plane and a helicopter. The drone has wings with 2 large propellers but also 9 smaller hover propellers. Google says that the hover propellers are designed specifically to reduce noise. Judging from the video above, though, a little more is probably needed to curtail that obnoxious buzzing sound.

There’s not much information out there on the technical side of things. For example, no white papers have been released by Google as of yet. But I dug around a bit and managed to come up with some interesting things. I stumbled upon this job description for the position of Perception Software Engineer at Google Wing HQ in California. What a find 🙂

(If you’re reading this post some time after April 2019, chances are the job description has been taken down… sorry about that)

The job description gives us hints as to what is going on in the background of this project. For example, we know that Google has developed “an unmanned traffic management platform–a kind of air traffic control for unmanned aircraft–to safely route drones through the sky”. Very cool.

More importantly for us, we also know that computer vision plays a prominent role in the guidance of these drones:

“Our perception solutions run on real-time embedded systems, and familiarity with computer vision, optical sensors, flight control, and simulation is a plus.” 

And the job requirements specifically request 2 years of experience working with camera sensors for computer vision applications.

One interesting task that these drones perform is visual odometry, which is the process of determining the position and orientation of a device/vehicle by analysing camera images. As I’ve documented earlier, visual odometry was a CV technique used on Mars by the MER rovers from way back in the early 2000s.
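Wing has not published its pipeline, but the core of two-frame visual odometry can be sketched with OpenCV in a few lines: match features between consecutive frames, estimate the essential matrix, and recover the relative rotation and translation of the camera. The intrinsic camera matrix below is a made-up placeholder.

```python
import cv2
import numpy as np

def relative_pose(frame1, frame2, K):
    """Estimate the camera's rotation R and (unit-scale) translation t between
    two consecutive greyscale frames - the heart of visual odometry."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(frame1, None)
    kp2, des2 = orb.detectAndCompute(frame2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t

# Hypothetical intrinsic camera matrix (focal length and principal point)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
```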

It’s interesting to note that the CV techniques listed in the job description are performed on embedded systems and are coded in C++. A lot of people (including me) are predicting that embedded systems (e.g. IoT devices, edge computing) are the next big thing for CV, so it’s worth taking note of this. Oh, and notice also that C++ is being used here. This language is not dead yet, despite it not being taught at universities as much any more. C++ is just damn fast – something that is a must in embedded CV solutions.

Summary

This post looked at some background information pertaining to the Google Wing project that, as of a few days ago, officially launched the first home delivery drone service in Australia’s capital city, Canberra. The first section of the post discussed the benefits and drawbacks of delivery drones. The last part of the post presented the Google Wing project from the technical side. Not much technical information is available on this project but a job description for the position of Perception Software Engineer gives us a sneak peek at the inner workings of Google Wing, especially from the perspective of computer vision.

It will be interesting to see whether delivery drones will be deemed a success by Google and also, most importantly, by the public of Canberra.
