
Loss of Class Dynamics Amid Distance Learning

In this article I would like to step away a bit from computer vision and talk about education at university/college level and how I hope it won’t change too much as we (hopefully) slowly recover from the coronavirus pandemic (at least here in Australia).

I bumped into a colleague of mine from Carnegie Mellon University last week while having coffee outside the nearby cafe in town. We got talking about the usual things and then the topic ventured, of necessity, towards COVID-19 and how things will change post-pandemic. For example, we’re already seeing remote work opportunities being given permanent status in some firms because (finally) benefits for both employers and employees are being recognised. But my colleague and I then started talking about how things might change at our university. Perhaps students at CMU should be given more opportunities to attend lectures remotely even after the need for social distancing has passed? This is an interesting question, centred on the whole idea of distance learning, which has been discussed for decades. My colleague and I talked about it for a little bit, too.

The benefits of distance learning are easily discernible: students and lecturers no longer need to commute to a campus, or even reside in the same country; and the university does not need to maintain as many facilities or as much equipment on campus, saving money that could be put towards improving the learning experience in other ways.

The drawbacks of distance learning are also fairly well known: it is more difficult for students to stay motivated, and project work involving teams is much more cumbersome to complete, let alone do well.

But in this whole debate on the benefits and drawbacks of remote learning I think one thing is being significantly disregarded: class dynamics. This is what I would like to write about in this article.

Before I continue, I need to define what I mean by “class dynamics”. Class dynamics, at least the way that I will be using the term here, is a certain atmosphere or ambience in a classroom/lecture environment that can foster or impede the interaction between a pedagogue and their students. Many factors contribute to class dynamics: for example, the attitude and mood of the lecturer, the attitudes and moods of the students, the topics being discussed, and so on.

Class dynamics is just so important. It can significantly affect students’ learning outcomes. It can be the decisive factor between good class engagement and no class engagement, and in whether students come to seek out the lecturer after a session to delve deeper into a topic or to have things explained further. All of this has an impact on the teacher as well. A lecturer will be spurred on by positive class engagement and find satisfaction in what they are doing, and that contentment then flows back onto the students and boosts their satisfaction even more. Class dynamics affects students and teachers in a cyclical way. Like I said, it is just so important.

Since the beginning of the pandemic, I have delivered countless lectures via video conference. Yes, it has been convenient in many respects (e.g. I have worn comfy pyjamas and slippers on my bottom half) but I have come to truly appreciate what a physical classroom environment really gives towards the whole educational experience, predominantly in the context of class dynamics.

Indeed, physical presence just gives so much. Firstly, there is the notion of body language. We’ve all heard just how much body language can convey. It truly can communicate a lot. Little reactions to things I’m saying, people turning around to others at particular moments to seek explanation, slouching – things that a camera cannot properly capture. We read body language and consciously or subconsciously react to it. A good pedagogue will react accordingly and steer discussions or lectures in the right direction, keeping people’s attention at full capacity or noticing when concepts need to be reiterated, perhaps in a different way. You lose all (or at least most) of this when you’re delivering lectures via video conference. I miss this aspect so much. I just can’t read my students’ “presence” in a given lecture at all, and it’s seriously draining and detrimental to all involved, especially since concepts in computer science (and science in general) build on top of each other. Whether a student grasps something now will have a knock-on effect on any future classes he or she attends.

Class dynamics is paramount. And it is fostered by physical presence.

Something else that contributes to class dynamics is the building up of a community in a class. When students attend a campus in person they can get to know each other so much better. They can “hang out” after class or in the evenings and friendships can be formed. Classroom interaction becomes so much better when everyone is relaxed around each other! When you teach via video conference the ability to form a community is significantly diminished. Everyone loses out.

These are really important points to consider. Because, ultimately, with learning via video conferencing, the students, the class dynamic, the relationships between the pedagogue and their pupils, the entire learning experience all get flattened into two dimensions, much like everyone’s face on the screen in front of you.

So much is lost.

It is, hence, important to think about this when weighing up the pros and cons of distance learning. We want to keep the standards at our universities/colleges high while, of course, keeping costs at a minimum and convenience at a maximum. Class dynamics cannot be ignored, even though it is difficult to measure and to put into argument form when discussing these things with the people in charge. But it has to be discussed and argued for, especially now that it looks like the world will slowly be returning to normality in the near future.



Apple’s and Samsung’s Face Unlocking Technologies

Have you ever wondered how the technology that unlocks your phone with your face works? This is a fascinating question and, interestingly, Samsung and Apple provide very different technologies for this feature on their devices. This post will examine the differences between the two technologies and will also show you how each of them can be fooled into granting access to somebody else’s phone.

(Please note: this is the third part of my series of articles on how facial recognition works. Hence, I will breeze over some more complex topics. If you wish to delve deeper into this area of computer vision, please see my first and second posts in this series.)

Samsung’s Face Recognition

Samsung’s face unlocking feature has, perhaps surprisingly, been around since 2011. Over the years, especially recently, it has undergone some improvements. Up until the Galaxy S8 model, face unlocking was done using the regular front camera of the phone to take a picture of your face. This picture was analysed for facial features such as the distance between the eyes, facial contours, iris colour, iris size, etc. The information was stored on the device so that the next time you tried to unlock it, the phone would take a picture of you, extract the same data, and compare it to what it had stored. If everything matched, your phone was unlocked.

This was a cheap, fast, and easy way to implement facial recognition. Unfortunately, it was not very secure. The major problem was that all processing was done using 2D images. So, as you may have guessed, a simple printed photo of your face or even one displayed on another phone could fool the system. Need proof? Here’s a video of someone unlocking a Galaxy Note 8, which was released in 2017, with a photo shown on another phone. It’s quite amusing.

A “liveness check” was added to this technology with the release of Android Jelly Bean in 2012. It worked by attempting to detect blinking. I never tried this feature but, from what I’ve read on forums, it wasn’t very accurate and required a longer time to process your face – probably why it wasn’t turned on by default. And yes, it could also be fooled by a close-up video of you, though this would be much harder to acquire.

With the release of the Galaxy S8, a new biometric identification technology was introduced: iris scanning. Irises, like fingerprints, are unique to each person. Iris scanning on Samsung phones works by illuminating your eye with infrared light (invisible to the naked eye). However, this technology could also be fooled with photographs and contact lenses. Here’s a video of a security researcher from Berlin doing just that. He took a photo of his friend’s eye from a few metres away (!) in infrared mode (i.e. night mode), printed it out on paper, and then stuck a contact lens on the printed eye. Clever.

Perhaps because of this flaw, Samsung’s Galaxy S9 introduced Intelligent Scan, which combined facial scanning and iris scanning. Facial scanning, however, is still only performed on 2D images (as described above) taken from the front camera of the phone. But a combination of the two technologies was seen as improving face unlocking technology in general.

Unfortunately, the Samsung Galaxy S10 (and subsequently the S20) dropped Intelligent Scan and went back to standard 2D photo face recognition. The reason for this was to make room for a larger screen, because the iris scanning components were taking up a little too much room at the top of the phone for Samsung’s liking. This move brought back the possibility of unlocking people’s phones with photos or images. For example, here’s a video showing a Galaxy S10 phone being unlocked with an image on another phone. According to some users, however, if you manually tweak the settings on your phone by going to Settings > Biometrics and Security > Face recognition and toggling “Faster recognition” to off, the system becomes a lot harder to defeat.

(Interestingly, in this period of coronavirus pandemic, people have been crying out for the iris scanning technology to return because face recognition just does not work when you’re wearing a mask!)

Apple’s Face ID

This is where the fun begins. Apple really took face recognition seriously.

The Apple technology in question is called Face ID and it first appeared in November 2017 with the iPhone X.

In a nutshell, Face ID works by first illuminating your face with infrared light (as with iris scanning) and then projecting a further 30,000 (!) infrared points onto your face to build a super-detailed 3D map of your facial features. These 3D maps are then converted into mathematical representations (to understand how this is performed, see my first blog post on how facial recognition works). So, each time you try to unlock your phone, it’s these representations that are compared. Quite impressive.

What’s more, this technology can recognise faces with glasses, clothing, makeup, and facial hair (not face masks, though!), and adapts to changes in appearance over time. The latter works by simply monitoring how your face may be changing over time – e.g. you may be gaining or losing weight, which will of course affect the general structure of your face, and hence its 3D map.

This impressive infrared technology, however, has been in use for a very long time. If you are familiar with the Microsoft Kinect camera/sensor (initially released in 2010), it uses the same concept of infrared point projection to capture and analyse 3D motion.

So, how do you fool the ‘TrueDepth camera system’, as Apple calls it? It’s definitely not easy because this technology is quite sophisticated. But successful attempts have already been documented.

To start off with, here’s a video showing identical twins unlocking each other’s phones. Also quite amusing. How about relatives that look similar? It’s been done! Here’s a video showing a 10-year-old boy unlocking his mother’s phone. Now that’s a little more worrisome. However, it shows that the iPhone X can be an alternative to DNA paternity/maternity tests 🙂 Finally, here’s a video posted by Vietnamese hackers documenting how their 3D-printed face mask fooled Apple’s technology. Some elements of this mask, like the eyes, were printed on a standard colour printer. The model of the face was acquired in 5 minutes using a hand-held scanner.

Conclusion

In summary, if you’re truly worried about security, face unlocking on Samsung phones is just not up to scratch. I would recommend using their new (ultrasonic) fingerprint scanning technology instead. Because Apple’s Face ID works with 3D maps of faces, it is much more secure. In this respect, Apple wins the battle of the phones, for sure.



Review of The Last Lecture by Randy Pausch

I detest receiving books as gifts in the workplace. Such books are usually of the soulless sort, the sort that are written around some lifeless corporate motto that is supposed to inspire a new employee to work overtime when needed. So, when I walked into my first day at work at Carnegie Mellon University (Australia campus) and saw the person in charge of my induction training holding a book obviously destined to end up in my possession, the invisible eyes of my invisible soul rolled as far back as possible into my invisible soul’s head. But I felt guilty about this reaction shortly after when I was told what the book was about: a past Carnegie Mellon University professor’s last lecture shortly before dying of cancer. “This could actually be interesting”, I thought, and mentally added the book to the bottom of my “To Read” list.

That happened over a year ago. I’ve been incessantly pestered ever since by the person in charge of my induction to read this book. But what can one do when one’s “To Read” list is longer than the actual book itself? However, I finally got round to it, and I’m glad I did. Here are my thoughts on it.

The book in question is entitled “The Last Lecture” and was written by Randy Pausch. Randy was a computer scientist (like me!), a professor at CMU (like me!), with significant previous experience in industry (also like me!) – a kindred soul, it seems. In August 2007, he was told he had only 3-6 months to live as a result of pancreatic cancer. The following month he gave his final lecture at CMU in Pittsburgh and then wrote this book about that event.

His final lecture was entitled “Really Achieving Your Childhood Dreams”. During this talk he showed approximately 60 slides, each with a single picture meant to, in one way or another, reference a childhood dream that he was able to fulfil or at least attempted to fulfil. But, as he states at the end of the book, the lecture topic was really a feint (or “head fake”, to use his NFL terminology) for his primary aim: to give a lecture for his three children, aged 6, 3, and 18 months, so that they could see who their father was, and to pass down the wisdom he had accrued in his life that he would have liked to have given them over time. It was heart-wrenching for him to think about his children growing up without their father present and without solid memories of him. So, he wanted to give them something concrete to look back on as the years after his death progressed.

randy-pausch-photo
A photo of Randy Pausch

Randy was very concise in squeezing a lifetime of thoughts into a 60-minute talk – but from it a few things definitely stood out for me.

Firstly, there was his career as an educator, rather than as an academic. He definitely emphasised the former over the latter. Professor Pausch had a passion for teaching. He was damn good at it, too. The stories he tells about how he inspired students throughout the years, and about his (sometimes unorthodox) teaching methods, are stirring and stimulating to a fellow educator like myself. He strove to make a difference in each and every student’s life. In a way, he felt like he was an extension of their parents and that it was his duty to convey to students as much as he could, including things like life experiences. Yes, he was a true educator and he showed this well in his book. He undoubtedly wanted his children to know this part of himself. He wanted them to be proud of his passion and his great adeptness at it.

Another thing that stood out for me was the wisdom conveyed in this book. When faced with death, any honest person is going to make significant re-evaluations of their values, will inevitably see and experience things from a different perspective, and will undoubtedly view past experiences in a different light. It is always worth reading the thoughts of such a person because you know that they will be rich and profound and definitely not soulless. Randy’s short book is full of such thoughts.

The last thing I want to mention is that “The Last Lecture” is permeated with a fighting spirit that overflows into a sense of celebration of life. Despite staring death in the face Randy still managed to let an optimistic outlook govern his everyday workings:

Look, I’m not in denial about my situation. I am maintaining my clear-eyed sense of the inevitable. I’m living like I’m dying. But at the same time, I’m very much living like I’m still living.

He lived his final months in this spirit and conveyed it well in his book, too, if only through the simple fact that the book is full of humour. We can learn a lot from such an outlook on life. The man would have been a great guy to have a coffee with in the staff room, for sure.

In conclusion, Professor Pausch achieved his aim of leaving something for his children to remember him by, to be proud of, and to inspire and teach them as they themselves tread through life. Simultaneously, however, he left a lot for us, too. I can see why I was given this book on my induction day at Carnegie Mellon University. Randy’s children, I’m sure, are proud of him. And now I am proud myself knowing that I am teaching at the same institution as he once did.

(Employers please note: this is how you legitimately make an employee want to work overtime after an induction session)



How Facial Recognition Works – Part 2 (FaceNet)

This post is the second post in my series on “How Facial Recognition Works”. In the first post I talked about the difference between face detection and face recognition, how machines represent (i.e. see) faces, how these representations are generated, and then what happens with them later for facial recognition to work.

Here, I would like to describe a specific facial recognition algorithm – one that changed things forever in this particular domain of artificial intelligence. The algorithm is called FaceNet and it was developed by Google in 2015.

FaceNet was published in a paper entitled “FaceNet: A Unified Embedding for Face Recognition and Clustering” at CVPR 2015 (a world-class conference for computer vision). When it was released it smashed the records on two top facial recognition academic datasets (Labeled Faces in the Wild and YouTube Faces DB), cutting the best published error rate by a whopping 30% on both datasets! This is an utterly HUGE margin by which to beat past state-of-the-art algorithms.

FaceNet’s major innovation lies in the fact that it developed a system that:

…directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. (Quote from original publication)

What this means is that it was the first algorithm to develop a deep neural network (DNN) whose sole task was to create embeddings for any face that was fed through it. That is, any image of a face inputted into the neural network would be given a 128-dimensional vector representation in Euclidean space.

What this also means is that similar-looking faces are clustered/grouped together because they receive similar vector representations. Hence, standard techniques such as k-means clustering (to group faces) or an SVM (to classify them) can be employed on the generated embeddings to perform facial recognition directly.

(To understand how such clustering algorithms work and to understand terms and concepts such as “embeddings” and “vector representations”, please refer to my first post on facial recognition where, as I have said earlier, I explain the fundamentals of facial recognition). 
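To give a feel for how recognition with these embeddings works in practice, here is a minimal Python sketch of the simplest possible verification step: compare two embeddings by Euclidean distance. The threshold value is purely illustrative (not taken from the paper) and would normally be tuned on a validation set.

```python
import numpy as np

def same_person(embedding_a, embedding_b, threshold=1.1):
    """Decide whether two FaceNet-style embeddings belong to the same person.

    embedding_a and embedding_b are 128-dimensional vectors produced by the
    network. The threshold here is an illustrative value only.
    """
    distance = np.linalg.norm(embedding_a - embedding_b)   # Euclidean distance
    return distance < threshold, distance

# Hypothetical usage with two pre-computed embeddings:
# match, d = same_person(emb_photo_1, emb_photo_2)
```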

Another important contribution of Google’s paper was its choice of method for training the deep neural network to generate embeddings for faces. Usually, one would train a DNN for a fixed number of classes, e.g. 10 different types of objects to be detected in images. You would collect a large number of example images from each of these 10 classes and tell the DNN during training which image contained which class. In tandem, you would use, for instance, a cross-entropy loss function that would indicate the error rate of the model being trained – i.e. how far away you were from an “ideally” trained neural network. However, because the neural network is going to be used to generate embeddings (rather than, for example, to state which of 10 particular objects is in an image), you don’t really know how many classes you are training your DNN for. It’s a different problem that you are trying to solve. You need a different loss function – something specific to generating embeddings. In this respect, Google decided to opt for the triplet-based loss function.

The idea behind the triplet-based loss function is to, during the training phase, take three example images from the training data:

  • A random image of a person – we call this image the anchor image
  • A random but different image of the same person – we call this image the positive image
  • A random image of another person – we call this image the negative image.

During training, then, embeddings will be created for these three images and the triplet-based loss function’s task is to minimise the distance (in the Euclidean space) between the anchor and positive image and maximise the distance between the anchor and negative image. The following image from the original publication depicts this idea:

triplet-based-loss-function

Notice how the negative image is initially closer to the anchor than the positive image is. The neural network would then adjust itself so that the positive image ends up closer to the anchor than the negative one. The process is repeated for different anchor, positive, and negative images.
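If you like to think in code, here is a minimal NumPy sketch of the triplet-based loss for a single triplet. The margin value is illustrative, and the real FaceNet training operates on batches of carefully selected triplets, but the core idea is just this:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for a single (anchor, positive, negative) embedding triple.

    The loss is zero once the positive is closer to the anchor than the
    negative is, by at least `margin`; otherwise the network is penalised.
    A sketch of the idea, not the exact FaceNet training code.
    """
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return max(d_pos - d_neg + margin, 0.0)
```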

Employing the triplet-based loss function to guide the training of the DNN was an incredibly intelligent move by Google. Likewise the decision to use a DNN to generate embeddings for faces outright. It really is no surprise that FaceNet burst onto the scene like it did and subsequently laid a solid foundation for facial recognition. The current state of this field owes an incredible amount to this particular publication.

If you would like to play around with FaceNet, take a look at its GitHub repository here.

(Update: part 1 of this post can be found here)



How Facial Recognition Works – Part 1

(Update: part 2 of this post can be found here)

Facial recognition. What a hot topic this is today! Hardly a week goes by without it hitting the news in one way or another, and usually for the wrong reasons – i.e. privacy and its growing ubiquity. But have you ever wondered how this technology works? How is it that only now it has become such a hot topic of interest, whereas 10 years ago not many people cared about it at all?

In this blog post I hope to demystify facial recognition and present its general workings in a lucid way. I especially want to explain why it is that facial recognition has seen tremendous performance improvements over the last 5 or so years.

I will break this post up according to the steps taken in most facial recognition technologies and discuss each step one by one:

  1. Face Detection and Alignment
  2. Face Representation
  3. Model Training
  4. Recognition

(Note: in my next blog post I will describe Google’s FaceNet technology: the facial recognition algorithm behind much of the hype we are witnessing today.)



1. Face Detection and Alignment

The first step in any (reputable) facial recognition technology is to detect the location of faces in an image/video. This step is called face detection and should not be confused with actual face recognition. Generally speaking, face detection is a much simpler task than facial recognition and in many ways it is considered a solved problem.

There are numerous face detection algorithms out there. A popular one, especially in the days when CPUs and memory weren’t what they are today, used to be the Viola-Jones algorithm because of its impressive speed. However, much more accurate algorithms have since been developed. This paper benchmarked a few of these under various conditions and the Tiny Face Detector (published in the world-class Conference on Computer Vision and Pattern Recognition in 2017 – code can be downloaded here) came out on top:

teaser
Face detection being performed by Tiny Face Detector. (Image taken from paper’s website)

If you wish to implement a face detection algorithm, here is a good article that shows how you can do this using Python and OpenCV. How face detection algorithms work, however, is beyond the scope of this article. Here, I would like to focus on actual face recognition and assume that faces have already been located on a given image.
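As a taster, here is a minimal face detection sketch using OpenCV’s bundled Haar cascade (a classic, fast detector rather than a state-of-the-art one). It assumes you have the opencv-python package installed; the input filename is just a placeholder.

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("group_photo.jpg")            # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Returns a list of (x, y, width, height) boxes, one per detected face
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detected.jpg", img)
```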

Once faces have been detected, the next (usual) step is to rotate and scale each face so that its main features are located in more or less the same place as the features of other detected faces. Ideally, you want the face to be looking directly at the camera, with the line of the eyes and the lips parallel to the ground. The aligning of faces is an important step, much akin to the cleaning of data (for those that work in data analysis). It makes further processing a lot easier to perform.
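To illustrate what alignment involves, here is a rough sketch that levels a face by rotating it about the midpoint between the eyes. It assumes the eye coordinates have already been found by a facial landmark detector (not shown here).

```python
import cv2
import numpy as np

def align_face(face_img, left_eye, right_eye):
    """Rotate a cropped face so the line through the eyes is horizontal.

    left_eye and right_eye are (x, y) pixel coordinates, which in practice
    come from a landmark detector.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))          # tilt of the eye line

    eyes_centre = ((left_eye[0] + right_eye[0]) / 2.0,
                   (left_eye[1] + right_eye[1]) / 2.0)

    # Rotate about the midpoint between the eyes to level the face
    M = cv2.getRotationMatrix2D(eyes_centre, angle, 1.0)
    h, w = face_img.shape[:2]
    return cv2.warpAffine(face_img, M, (w, h))
```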

2. Face Representation

When we have a clean picture of a face, the next step is to extract a representation of it. A representation (often also called a signature) for a machine is a description/summary of a thing in a form that can be processed and analysed by it. For example, when dealing with faces, it is common to represent a face as a vector of numbers. The best way to explain this is with a simple example.

Suppose we choose to represent a face with a 2-dimensional vector: the first dimension represents the distance between the eyes, the second dimension the width of the nose. We have two people, Alice and Bob, and a photo of each of their faces. We detect and align the two faces in these photos and work out that the distance between the eyes of Alice is 12 pixels and that of Bob is 15 pixels. Similarly, the width of Alice’s nose is 4 pixels, Bob’s is 7 pixels. We therefore have the following representations of the two faces:

Person | Distance between eyes (px) | Width of nose (px)
Alice  | 12                         | 4
Bob    | 15                         | 7

So, for example, Alice’s face is represented by the vector (12, 4) – the first dimension stores the distance between the eyes, the second dimension stores the width of the nose. Bob’s face is represented by the vector (15, 7).

Of course, one photo of a person is not enough for a machine to get a robust representation/understanding of a person’s face. We need more examples. So, let’s say we grab 3 photos of each person and extract the representation from each of them. We might come up with the following list of vectors (remember this list as I’ll use these numbers later in the post to explain further concepts):

table-of-face-representations

Numbers like these are much easier for a machine to deal with than raw pictures – machines are unlike us!

Now, in the above example, we used two dimensions to describe a face. We can easily increase the dimensionality of our representations to include even more features. For example, the third dimension could represent the colour of the eyes, the fourth dimension the colour of the skin – and so on. The more dimensions we choose to use to describe each face, generally speaking, the more precise our descriptions will be. And the more precise our descriptions are, the easier it will be for our machines to perform facial recognition. In today’s facial recognition algorithms, it is not uncommon to see vectors with 128+ dimensions.

Let’s talk about the facial features used in representations. The above example, where I chose the distance between the eyes and the width of the nose as features, is a VERY crude one. In reality, reputable facial recognition algorithms use “lower-level” features. For instance, in the late 1990s facial recognition algorithms were being published that considered the local texture around each pixel on a face using Local Binary Patterns (LBP). That is, each neighbouring pixel was analysed to see whether it was brighter or darker than the pixel at the centre. Hence, the relative brightness/darkness of pixels was the feature chosen to create representations of faces. (You can see how this would work: each pixel location would have a different relative brightness/darkness pattern depending on a person’s facial structure.)
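For the curious, here is a small NumPy sketch of the basic 8-neighbour LBP computation described above. It is a simplified version of the operator; real implementations add refinements such as circular sampling and uniform patterns.

```python
import numpy as np

def lbp_image(gray):
    """Compute the basic 8-neighbour Local Binary Pattern code for each pixel.

    Each neighbour that is at least as bright as the centre pixel contributes
    one bit, giving a value between 0 and 255 that describes local texture.
    """
    gray = gray.astype(np.int32)
    centre = gray[1:-1, 1:-1]
    # Offsets of the 8 neighbours, walked clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:gray.shape[0] - 1 + dy,
                         1 + dx:gray.shape[1] - 1 + dx]
        codes |= (neighbour >= centre).astype(np.int32) << bit
    return codes

# A histogram of these codes (often computed per image region) can then be
# concatenated into the face's feature vector.
```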

Other lower-level features have been proposed for facial recognition algorithms, too. Some solutions, in fact, used hybrid approaches – i.e. they used more than one low-level feature. There is an interesting paper from 2018 (M. Wang and W. Deng, “Deep Face Recognition: A Survey“, arXiv preprint) that summarises the history of facial recognition algorithms. This picture from the paper shows the evolution of features chosen for facial recognition:

facial-recognition-history

Notice how the accuracy of facial recognition algorithms (benchmarked on the LFW dataset) has increased over time depending on the type of representation used? Notice also the LBP algorithm, described above, making an appearance in the late 1990s? It gets an accuracy score of around 70%.

And what do you see at the very top of the graph? Deep Learning, of course! Deep Learning changed the face of Computer Vision and AI in general (as I discussed in this earlier post of mine). But how did it revolutionise facial recognition to the point that it is now achieving human-level precision? The key difference is that instead of choosing the features manually (aka “hand-crafting features” – i.e. when you say: “let’s use pixel brightness as a feature”), you let the machine decide which features should be used. In other words, the machine generates the representation itself. You build a neural network and train it to deliver vectors for you that describe faces. You can put any face through this network and you will get a vector out at the end.

But what does each dimension in the vector represent? Well, we don’t really know! We would have to break down the neural network used to see what is going on inside of it step by step. DeepFace, the algorithm developed by Facebook (you can see it mentioned in the graph above), churns out vectors with 4,096 dimensions. But considering that the neural network used has 120 million parameters, it’s probably infeasible to break it apart to see what each dimension in the vector represents exactly. The important thing, however, is that this strategy works! And it works exceptionally well.
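To make the “network that delivers vectors” idea concrete, here is a toy PyTorch sketch. It is nowhere near the scale of DeepFace or FaceNet, and the layer sizes are made up, but it shows the shape of the idea: an image goes in, a fixed-length embedding comes out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEmbeddingNet(nn.Module):
    """A toy convolutional network that maps a face crop to a 128-D vector.

    Only a sketch of the idea; real systems use far larger architectures
    and huge training sets.
    """
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embedding_dim)

    def forward(self, x):                       # x: (batch, 3, 160, 160) face crops
        h = self.features(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)   # unit-length embedding

# embedding = FaceEmbeddingNet()(torch.randn(1, 3, 160, 160))  # shape (1, 128)
```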

3. Model Training

It’s time to move on to the next step: model training. This is the step where we train a classifier (an algorithm to classify things) on a list of face representations of people in order to be able to recognise more examples (representations) of these people’s faces.

Let’s go back to our example with our friends Alice and Bob to explain this step. Recall the table of six vector representations of their faces we came up with earlier? Well, those vectors can be plotted on a 2-dimensional graph like so:

facial-recognition-graph
A scatter-plot of face representations (unit of measurement is pixels; plot generated here)

Notice how Alice’s and Bob’s representations each cluster together (shown in red)? The bottom-left data points belong to Alice, the top-right belong to Bob. The job of a machine now is to learn to differentiate between these two clusters. Typically, a well-known classification algorithm such as an SVM is used for this.

If more than two clusters are present in the data (i.e. we’re dealing with more than two people in our data), the classifier will need to be able to deal with this. And then if we’re working in higher dimensions (e.g. 128+), the algorithm will need to operate in these dimensions, too.

4. Recognition

Recognition is the final step in the facial recognition process. Given a new image of a face, a representation will be generated for it, and the classification algorithm will provide a score of how close it is to its nearest cluster. If the score is high enough (according to some threshold) the face will be marked as recognised/identified.

So, in our example case, let’s say a new photo has emerged with a face on it. We would first generate a representation of it, e.g.: (13, 4). The classification algorithm will take this vector and see which cluster it lies closest to – in this case it will be Alice’s. Since this data point is very close to the cluster, a high recognition score will also be generated. The picture below illustrates this example:

face-representation-graph
A scatter-plot of face representations. The green point represents the new face that will be classified as Alice’s

This recognition step is usually extremely fast. And the accuracy of it is highly dependent on the preceding steps – the most important of which is the quality of the second step (the one that generates representations of faces).
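Here is a minimal scikit-learn sketch tying steps 3 and 4 together. The six training vectors are made-up values loosely based on the Alice (12, 4) and Bob (15, 7) numbers from earlier; real systems would of course use far higher-dimensional representations.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up example vectors (distance between eyes, width of nose) in pixels
X = np.array([[12, 4], [11, 5], [12, 5],     # Alice's three photos (assumed)
              [15, 7], [16, 7], [15, 6]])    # Bob's three photos (assumed)
y = ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"]

clf = SVC(kernel="linear").fit(X, y)         # step 3: train the classifier

new_face = np.array([[13, 4]])               # step 4: representation of a new photo
print(clf.predict(new_face))                 # -> ['Alice']
print(clf.decision_function(new_face))       # signed distance from the boundary,
                                             # a rough confidence score
```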

Conclusion

In this post I described the major steps taken in a robust facial recognition algorithm, using a simple example to illustrate the concepts behind each step. The major breakthrough in facial recognition came in 2014 when deep learning started being used to generate representations of faces rather than hand-crafted features. As a result, facial recognition algorithms can now achieve near human-level precision.

(Update: part 2 of this post can be found here)



De-Identification of Faces in Live Videos – ICCV 2019

Facial recognition. What a hot topic this is today. A week hardly goes by without it making the news in one way or another. This technology seems to be infiltrating more and more of our everyday lives: from ID verification on our phones (to unlock them) to automated border processing systems at airports. In some ways this is a good thing but in some this is not. The most controversial aspect of the growing ubiquity of facial recognition technology (FRT) is arguably the erosion of privacy. The more that FRT is used in our lives, the more it seems that we are turning into a highly monitored society.

This erosion of privacy is such a foremost issue to some that three cities in the USA have banned the use of FRT: San Francisco and Oakland in California, and Somerville in Massachusetts. These bans, however, only affect city agencies such as police departments. Portland, Oregon, on the other hand, may soon be introducing a bill that could also cover private retailers and airlines. Moreover, according to Mutale Nkonde, a Harvard fellow and AI policy advisor, a federal ban could be around the corner.

FRT is undoubtedly controversial with respect to the debate on privacy.

In this post I would like to introduce you to a paper from the International Conference on Computer Vision 2019 (ICCV) that attempts to provide that little bit of additional privacy in our lives by proposing a fast and impressive method to de-identify faces in videos in real time. The de-identification process is purported to be effective against machines rather than humans, such that we are still able to perceive the original identity of the speaker in the resulting video.

(TL;DR: jump to the end to see the results generated by the researchers. It’s quite impressive.)

The paper in question here is entitled “Live Face De-Identification in Videos” by Gafni et al. published by the Facebook Research Group (isn’t it ironic that Facebook is writing academic papers on privacy?).

The de-identification algorithm itself is a bit tricky to explain but I’ll do my best. First, an adversarial autoencoder network is paired with a trained facial classifier. An autoencoder network (which is a special case of the encoder-decoder architecture) works by imposing a bottleneck in the neural network, which forces the network to learn a compressed representation of the original input image. (Here’s a fantastic video explaining what this means exactly). So, what happens is that a compressed version of your face is generated – but the important aspects such as your pose, lip positioning, expression, illumination conditions, any occlusions, etc. are all retained. What is discarded are the identifying elements. This retaining/discarding is controlled by having a trained facial classifier nearby that the autoencoder tries to fool. During training, the autoencoder gets better and better at fooling the facial classifier by learning to more effectively discard identifying elements from faces, while retaining the important aspects of them.
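For readers who have not met autoencoders before, here is a toy PyTorch sketch of the bottleneck idea. It is only meant to illustrate the compress-then-reconstruct structure; the network in the paper is adversarially trained and vastly more sophisticated.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """A minimal autoencoder: the narrow bottleneck forces the network to keep
    only the most essential information about its input.
    """
    def __init__(self, bottleneck_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),          # the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * 64 * 64), nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (batch, 3, 64, 64)
        z = self.encoder(x)                          # compressed representation
        return self.decoder(z).view(x.shape)         # reconstructed image
```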

What results is a face that is still easily recognisable to us, an algorithm that works in real time (meaning that you can “turn it on” for Skype sessions, for example), and one that doesn’t need to be retrained for each particular face – it just works out of the box for all faces.

Here is a video released by the authors of the paper showing some of their results:

Very impressive, if you ask me! Remember, the generated faces in the video have been de-identified, meaning that a facial recognition algorithm (FaceNet or ArcFace, for example) will find them extremely difficult to deal with. In fact, experiments were performed to see how well the researchers’ algorithm performs against popular FRTs. For one experiment, FaceNet was tested on images before and after de-identification. The true positive rate for one dataset dropped from almost 0.99 to less than 0.04. Very nice, indeed.

Moreover, the paper goes into detail on a number of key steps of the algorithm. One of them is the de-identification distance from the original image. That is, the authors play around a bit with how much a person’s face is de-identified. The image below shows Nicolas Cage being gradually anonymised by increasing a variable in the algorithm. This is also quite interesting.

cage-de-identified
Image showing Nicolas Cage being gradually de-identified (image taken from original publication)

Summary

In this post I presented a paper from the Facebook Research Group on the de-identification of faces in real time. In the context of FRTs and the current hot debate on privacy, this is an important piece of work, especially considering that the algorithm gets impressive results and works in real time. Whether we will see this technology in use in the near future is hard to say, but I wouldn’t be surprised if a de-identification app that works much like the face swap filter on Instagram becomes available to the general public. There is certainly a demand for it.



The Largest Cat Video Dataset in the World

This is a bit of a fun post about a “dataset” I stumbled upon a few days ago… a dataset of cat videos.

As I’ve mentioned numerous times in various posts of mine, the deep learning revolution that is driving the recent advancements in AI around the world needs data. Lots and lots of data. For example, the famous image classification models that are able to tell you, with better precision than humans, what objects are in an image, are trained on datasets containing sometimes millions of images (e.g. ImageNet). Large datasets have basically become essential fuel for the AI boom of recent years.

So, I had a really good laugh when a few days ago I found this innocuous and little-known YouTube channel owned by a Japanese man who has been posting a few videos every day of himself feeding stray cats. Since he has been doing this for the past 9 years, he has managed to accumulate over 19,000 cat videos on his channel. And in doing so he has most probably, and inadvertently, created the largest cat video dataset in the world. My goodness!

Technically speaking, unless you’re a die-hard cat lover (like me!), these videos aren’t all that interesting. They’re simply of stray cats having a decent feed or drink with their good-hearted caretaker on occasion uttering a few sentences here and there. Here’s one, for example, of two cats eating out of a bowl:

Or here’s one of two cats enjoying a good ol’ scratch behind the ears:

On average these videos are about 30-60 seconds in length. And they’re all titled with the default names given by his cameras (e.g. MVI 3985, etc.). Hence, nothing about these clips is designed for them to be found by anybody out there.

However, despite all this mundanity, to the computer vision community (that needs datasets to survive like humans need oxygen), these videos could come in handy… one day. I’m not sure how just yet, but I’m sure somebody out there could find a use for them. I mean, there’s over 19,000 cat videos just sitting there. This is just too good to pass up.

So, if there are any academics out there: please, please, please use this “dataset” in your publishable studies. It would make my year, for sure! The cats would be proud, too.

Oh, and one more thing. I found this guy’s Twitter account (@niiyan1216). And, you guessed it: it is full of pictures of cats.



Capturing the Moment in Photography – SIGGRAPH 2019 Award

SIGGRAPH 2019 is coming to an end today. SIGGRAPH, which stands for “Special Interest Group on Computer GRAPHics and Interactive Techniques”, is a world-renowned annual conference held predominantly for computer graphics researchers – but you do sometimes get papers from the world of computer vision being published there. In fact, I’ve presented a few such papers on this blog in the past (e.g. see here).

Michael F. Cohen

I’m not going to present any papers from this conference today. What I would like to do is mention a person who is being recognised at this year’s conference with a special award. Michael F. Cohen, the current Director of Facebook’s Computational Photography Research team, a few days ago received the 2019 Steven A. Coons Award for Outstanding Creative Contributions to Computer Graphics. This is an award given every two years to honour outstanding lifetime contributions to computer graphics and interactive techniques.

For the full, very impressive list of Michael’s achievements, see the SIGGRAPH award’s page. But there are a few that stand out. In particular, his significant contributions to Facebook’s 3D photos feature and, most interestingly for me, his work on The Moment Camera.

You may recall that in March of this year I wrote about Smartphone Camera Technology from Google and Nokia. At the time, I didn’t realise that the foundations for the technologies I discussed there were laid down by Michael nearly 15 years ago.

In that post I talked about High Dynamic Range (HDR) Imaging, which is a technique employed by some cameras to give you better quality photos. The basic idea behind HDR is to capture additional shots of the same scene (at different exposure levels, for instance) and then take what’s best out of each photo to create a single picture. For example, the image on the right below was created by a Google phone using a single camera and HDR technology. A quick succession of 10 photos (called an image burst) was taken of a dimly lit indoor scene. The final merged picture gives a vivid representation of the scene. Quite astonishing, really.

(image taken from here)
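If you want to experiment with this idea yourself, OpenCV ships with exposure fusion out of the box. The sketch below merges a burst of differently exposed photos into a single picture (file names are placeholders); it is a simple stand-in for the far more sophisticated pipelines running on modern phones.

```python
import cv2
import numpy as np

# A hypothetical burst of the same scene shot at different exposures
paths = ["burst_dark.jpg", "burst_mid.jpg", "burst_bright.jpg"]
images = [cv2.imread(p) for p in paths]

# Mertens exposure fusion: picks the best-exposed parts of each frame
# (no exposure times needed, unlike classic HDR plus tone mapping)
merge = cv2.createMergeMertens()
fused = merge.process(images)                 # float image in roughly [0, 1]

cv2.imwrite("fused.jpg", np.clip(fused * 255, 0, 255).astype("uint8"))
```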

Well, Michael F. Cohen laid out the basic ideas behind HDR, namely combining images/photos to create better pictures, at the beginning of this century. For example, he, along with Richard Szeliski, published this fantastic paper in 2006. In it he talks about the idea of capturing a moment rather than an image. Capturing a moment is a much better description of what HDR is all about!

The abstract to the paper says it best:

Future cameras will let us “capture the moment,” not just the instant when the shutter opens. The moment camera will gather significantly more data than is needed for a single image. This data, coupled with automated and user-assisted algorithms, will provide powerful new paradigms for image making.

Ah, the moment camera. What a good name for HDR-capable phones!

It’s interesting to note that it has taken a long time for the moment camera to become available to the general public. I would guess that we just had to wait for faster CPUs on our phones for Michael’s work to become a reality. However, some features of the “moment camera” described in the 2006 paper are yet to be implemented in our HDR-enabled phones. For example, this idea of a group shot being improved by image segmentation:

capturing-moment-group-shot
The original caption to the image reads: “Working with stored images, the user indicates when each person photographed looks best. The system automatically finds the best regions around each selection to compose into a final group shot.” (image taken from original publication)

Anyway, a well-deserved lifetime achievement award, Michael. And thank you for the “moment camera”.



Smart Glasses for the Blind – Why has it Taken This Long?

Remember Google Glass? Those smart glasses that were released by Google to the public in May of 2014 (see image below). Less than a year later production was halted because, well, not many people wanted to walk around with a goofy-looking pair of specs on their noses. They really did look wacky. I’m not surprised the gadget never caught on.

google-glass
Google Glass in action (image source)

Well, despite the (predictable?) flop of Google Glass, it turns out there is a fantastic use case for such smart glasses: assisting people with visual impairments.


There is a company out there called Aira that provides an AI-guided service used in conjunction with smart glasses and an app on a smartphone. Images are captured by the glasses’ forward-facing camera, image and text recognition are performed on them, and an AI assistant, dubbed “Chloe”, describes in speech what is present in the footage: whether it be everyday objects such as products on a shelf in your pantry, words on medication bottles, or even words in a book.

Quite amazing, isn’t it? 

Simple tasks like object and text recognition are performed locally on the smartphone. However, more complex tasks can be sent to Aira’s cloud services (powered by Amazon’s AWS).

Furthermore, the user has the option to, at the tap of a button on the glasses or app, connect to a live agent who is then able to access a live video stream from the smart glasses and other data from the smartphone like GPS location. With these the live agent is able to provide real-time assistance by speaking directly to the visually impaired person. A fantastic idea.

According to NVIDIA, Aira trains its object recognition deep learning networks not on image datasets like ImageNet but on 3 million minutes’ worth of data captured by its users, which has been annotated by Aira’s agents. An interesting choice considering how time-consuming such a task must have been. But it has given the service an edge, as training on real-world scenarios has reportedly provided better results.

The uses for Aira’s product and service are pretty much endless! As suggested on their site, you can use Aira for things like reading to a child, locating a stadium seat, reading a whiteboard, navigating premises, sorting and reading mail and the paper, enjoying the park or the zoo, and roaming historical sites.

And thankfully, the glasses don’t look goofy at all! That’s definitely a win right there.

aira-glasses
Aira’s nicely-designed smart glasses (image source)

Finally, I would encourage you to take a look at this official video demonstrating the uses of Aira. This is computer vision serving society in the right way.

(Unfortunately, the video that was once here has been taken down by the publisher)




Delivery Drones and the Google Wing Project

I gave a guest lecture last Thursday at Carnegie Mellon University at their Adelaide campus in South Australia. (A special shout-out to the fantastic students that I met there!). The talk was on the recent growth of computer vision (CV) in the industry. At the end of the presentation I showed the students some really interesting projects that are being worked on today in the CV domain such as Amazon Go, Soccer/Football on Your Tabletop, autonomous cars (which I am yet to write about), CV in the fashion industry, and the like.

I missed one project, however, that has been making news in the past few days in Australia: delivery drones. Three days ago, Google announced that it is officially launching the first home delivery drone service in Australia in our capital city, Canberra, to deliver takeaway food, coffee, and medicines. Google Wing is the name of the project behind all this.

Big, big news, especially for computer vision.

In this post I am going to look at the story behind this. I will present:

  • the benefits of delivery drones,
  • the potential drawbacks of them,
  • and then I’ll take a look at (as much as is possible) the technology behind Google’s drones.

The Benefits of Delivery Drones

There was an official report prepared two months ago by AlphaBeta Advisors on behalf of Google Wing for the Standing Committee on Economic Development at the Parliament of the Australian Capital Territory (Canberra). The report, entitled “Inquiry into drone delivery systems in the ACT“, analysed the benefits of delivery drones in order to sway the government to give permission for drones to be used in the city for the purposes described above. The report was successful since, as I’ve mentioned, the requested permission was granted a few days ago.

Let’s take a look (in summary) at the benefits discussed in the report. Note that the numbers presented here are specific to Canberra.

Benefits for local businesses:

  1. More households can be brought into range by delivery drones. More households means more consumers.
  2. Reduction of delivery costs. It is estimated that delivery costs could fall by up to 80-90% in the long term.
  3. Lower costs will generate more sales.
  4. More businesses delivering means a more competitive market.

Benefits for consumers:

  1. Drones will be able to reach the more underserved members of the public such as the elderly, disabled, and homebound.
  2. Since delivery times are faster by 60-70%, it is estimated that 3 million hours will be saved per year. This includes scenarios where customer pick-up journeys are replaced by drones.
  3. As a result of lower delivery costs, drones could save households $5 million in fees per year.
  4. Product variety will be expanded for the consumer as up to 4 times more merchants could be brought into range for them.

Benefits for society:

  1. 35 million km of driving per year will be eliminated as delivery vehicles are taken off the road. This will reduce traffic congestion.
  2. The above benefit will also result in a reduction of emissions by 8,000 tonnes, which is equivalent to the carbon storage of 250,000 trees (huge!).
  3. Fewer cars on the road means fewer road accidents.

Some convincing arguments here. The benefits to society are my personal favourites. I hate traffic congestion!

The Potential Drawbacks of Delivery Drones

Drawbacks are not discussed in the aforementioned report. But some have been raised by the public living in Canberra. These are definitely worth mulling over:

  1. Noise pollution. Ever since Google started testing these delivery drones in 2014, people have complained about how noisy they are. Some have even mentioned that wildlife seems to have disappeared from delivery areas as a result of this noise pollution. In fact, residents from the area have created an action group, called Bonython Against Drones, “to raise awareness of the negative impact of the drone delivery trial on people, pets and wildlife in Bonython [a suburb in Canberra] and to ensure governance and appropriate legislative orders are in place to protect the community“. Below is a video of a delivery in progress. Bonython Against Drones appears to have a strong case. This noise really is irritating.
  2. Invasion of privacy. Could flying low over people’s properties be deemed an invasion of privacy? A fair question to ask. Also, could Google use these drones to collect private information from the households they fly over? Of course, the company says that it complies with privacy laws and regulations but, well, its track record on privacy isn’t stellar. Heck, there’s even an entire Wikipedia article on the Privacy Concerns Regarding Google.
  3. Bad weather conditions such as strong winds would render drones unusable. Can we afford to rely on the weather so heavily?

The first point is definitely a drawback worth considering.

Google Wing Drones

Let’s take a look at the drones in operation in Canberra.

google-wing-drone
The Google Wing drone currently in operation (image taken from here)

It seems as if this drone is a hybrid between a plane and a helicopter. The drone has wings with 2 large propellers but also 9 smaller hover propellers. Google says that the hover propellers are designed specifically to reduce noise. Judging from the video above, though, a little more is probably needed to curtail that obnoxious buzzing sound.

There’s not much information out there on the technical side of things. For example, no white papers have been released by Google as of yet. But I dug around a bit and managed to come up with some interesting things. I stumbled upon this job description for the position of Perception Software Engineer at Google Wing HQ in California. What a find 🙂

(If you’re reading this post some time after April 2019, chances are the job description has been taken down… sorry about that)

The job description gives us hints as to what is going on in the background of this project. For example, we know that Google has developed “an unmanned traffic management platform–a kind of air traffic control for unmanned aircraft–to safely route drones through the sky”. Very cool.

More importantly for us, we also know that computer vision plays a prominent role in the guidance of these drones:

“Our perception solutions run on real-time embedded systems, and familiarity with computer vision, optical sensors, flight control, and simulation is a plus.” 

And the job requirements specifically request 2 years of experience working with camera sensors for computer vision applications.

One interesting task that these drones perform is visual odometry, which is the process of determining the position and orientation of a device/vehicle by analysing camera images. As I’ve documented earlier, visual odometry was a CV technique used on Mars by the MER rovers from way back in the early 2000s.
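Wing has not published its pipeline, but the core of two-frame visual odometry can be sketched with OpenCV in a few lines: match features between consecutive frames, estimate the essential matrix, and recover the relative rotation and translation of the camera. The intrinsic camera matrix below is a made-up placeholder.

```python
import cv2
import numpy as np

def relative_pose(frame1, frame2, K):
    """Estimate the camera's rotation R and (unit-scale) translation t between
    two consecutive greyscale frames - the heart of visual odometry."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(frame1, None)
    kp2, des2 = orb.detectAndCompute(frame2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t

# Hypothetical intrinsic camera matrix (focal length and principal point)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
```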

It’s interesting to note that the CV techniques listed in the job description are performed on embedded systems and are coded in C++. A lot of people (including me) are predicting that embedded systems (e.g. IoT devices, edge computing) are the next big thing for CV, so it’s worth taking note of this. Oh, and notice also that C++ is being used here. This language is not dead yet, despite it not being taught at universities as much any more. C++ is just damn fast – something that is a must in embedded CV solutions.

Summary

This post looked at some background information pertaining to the Google Wing project that, as of a few days ago, officially launched the first home delivery drone service in Australia’s capital city, Canberra. The first section of the post discussed the benefits and drawbacks of delivery drones. The last part of the post presented the Google Wing project from the technical side. Not much technical information is available on this project but a job description for the position of Perception Software Engineer gives us a sneak peek at the inner workings of Google Wing, especially from the perspective of computer vision.

It will be interesting to see whether delivery drones will be deemed a success by Google and also, most importantly, by the public of Canberra.
