[Image: AI Superpowers book cover]

AI Superpowers by Kai-Fu Lee – Review

Summary of review: This book is the best analysis of the current state of Artificial Intelligence in print. It is a cool, level-headed discussion of a broad range of topics. An easy 5 stars.

“AI Superpowers: China, Silicon Valley, and the New World Order” is a book about the current state of Artificial Intelligence. Although published in late 2018 – light years ago for computer science – it is still very much relevant. This is important because it is the best book on AI that I have read to date. I’d hate for it to become obsolete because it is simply the most level-headed and accurate analysis of the topic currently in print.

Professor Kai-Fu Lee knows what he’s talking about. He has been at the forefront of AI research for decades. From Assistant Professor at Carnegie Mellon University (where I currently teach at the Australian campus), to Principal Speech Scientist at Apple, to founding director of Microsoft Research Asia, and then President of Google China – you really cannot top Kai-Fu’s resume in the field of AI. He is a top authority, so we cannot help but take heed of his words.

However, what made me take notice of his analyses was that he was speaking from the perspective of an “outsider”. I’ll explain what I mean.


So often, when it comes to AI, we see people being consumed by hype and/or by greed. The field of AI has moved at a lightning pace in the past decade. The media has stirred up a frenzy, imaginations are running wild, investors are pumping billions into projects, and as a result an atmosphere of excitement has descended upon researchers and the industry that makes impartial judgement and analysis extremely difficult. Moreover, greedy and charismatic people like Elon Musk are blatantly lying about the capabilities of AI and hence adding fuel to the fire of elation.

Until Kai-Fu Lee was diagnosed with Stage IV lymphoma and given only a few months to live, he too was a full-fledged player and participant in this craze. The diagnosis made him reassess his life and his career, and he decided to step away from his maniacal work schedule (to use, pretty much, his own words). Time spent in a Buddhist monastery gave him further perspective on life and a clarity and composure of thought that shines through his book. He writes, then, in some respects as an “outsider” – but with forceful authority.

This is what I love about his work. Too often I cringe at people talking about AI soon taking over the world, AI being smarter than humans, etc. – opinions based on fantasy. Kai-Fu Lee says outright that, as things stand, machines are nowhere near that level of intelligence. Not by 2025 (as Elon Musk has claimed in the past), not even by 2040 (as many others are touting) will we achieve it. His discussion of why this is the case is based on cold, hard facts. Nothing else. (In fact, his reasoning echoes arguments I’ve made before on my blog, e.g. in this post: “Artificial Intelligence is Slowing Down“.)

All analyses in this book are level-headed in this way, and it’s hard to argue with them as a result.

Some points of discussion in “AI Superpowers” that I, as a veteran of the field of AI myself, found particularly interesting are as follows:

  • Data, the fuel of Deep Learning (as I discuss in this post), is going to be a principal factor in determining who will be the world leader in AI. The more data one has, the more powerful AI can be. In this respect, China, with its lax data-privacy laws, larger population, cut-throat tactics in procuring AI research, and heavy government assistance and encouragement, has a good chance of surpassing the USA as the superpower of AI. For example, China makes 10 times more food deliveries and 4 times more ride-sharing calls than the US. That equates to a lot more data that companies can process to fuel the algorithms that improve their services.
  • Despite AI not currently being capable of human-level intelligence, Kai-Fu, along with organisations such as Gartner, predicts that around 30-40% of professions will be significantly affected by AI. This means that huge upheavals and even revolutions in the workforce are due to take place. This, however, is my one major disagreement with Lee’s opinions. Personally, I believe the influence of AI will be a lot more gradual than Prof. Lee surmises, and hence the time available to adjust to the upcoming changes will be enough to avoid potentially ruinous effects.
  • No US company has made significant inroads into Chinese society. Uber, Google, eBay, Amazon – all these internet juggernauts have utterly failed in China. The very insightful analysis of this phenomenon could only have been conducted so thoroughly by somebody who has lived and worked in China at the highest level.
  • There is a large section in the book discussing the difference between humans and machines. This was another highlight for me. So often, in the age of online learning (as I discuss in this post), remote working, social media, and especially automation, we neglect to factor in the importance of human contact and human presence. Once again, a level-headed analysis is presented that ultimately concludes that machines (chat-bots, robots, etc.) simply cannot entirely replace humans and human presence. There is something fundamentally different between us, no matter how far technology may progress. I’ve mentioned this adage of mine before: “Machines operate on the level of knowledge. We operate on the level of knowledge and understanding.” It’s nice to see an AI guru echoing this thought.

Conclusion

To conclude, then, “AI Superpowers: China, Silicon Valley, and the New World Order” is a fantastic dive into the current state of affairs surrounding AI in the world. Since China and the US are the world leaders in this field, a lot of time is devoted to these two countries: mostly to where they currently stand and where they’re headed. Kai-Fu Lee is a world authority on everything he writes about. And since he does not have a vested interest in promoting his opinions, his words carry a lot more weight than most. As I’ve said above, this is to me the best book currently in print on the topic of AI. And since Prof. Lee writes clearly and accessibly, even those unacquainted with technical terminology will be able to follow all that is presented in this work.

Rating: An easy 5 stars. 



Artificial Intelligence is Slowing Down – Part 2

In July of last year I wrote an opinion piece entitled “Artificial Intelligence is Slowing Down” in which I shared my judgement that as AI and Deep Learning (DL) currently stand, their growth is slowly becoming unsustainable. The main reason for this is that training costs are starting to go through the roof the more DL models are scaled up in size to accommodate more and more complex tasks. (See my original post for a discussion on this).

In this post, part 2 of “AI Slowing Down”, I wanted to present findings from an article written a few months after mine for IEEE Spectrum. The article, entitled “Deep Learning’s Diminishing Returns – The cost of improvement is becoming unsustainable“, came to the same conclusions as I did (and more) regarding AI but it presented much harder facts to back its claims.

I would like to share some of these claims on my blog because they’re very good and backed up by solid empirical data.


The first thing to note is that the claims presented by the authors are based on an analysis of 1,058 research papers (plus additional benchmark sources). That’s a decent dataset from which significant conclusions can be drawn (assuming the analyses were done correctly, of course, but given the repute of the four authors, I think it is safe to assume the veracity of their findings).

One thing the authors found was that as the performance of a DL model increases, the computational cost grows as roughly the ninth power of the improvement (i.e. to improve performance by a factor of k, the computational cost scales by about k^9). I stated in my post that the larger the model, the more complex the tasks it can perform, but also the more training time is required. We now have a number to estimate just how much computational power is required per improvement in performance. A ninth-power relationship is staggering.
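To get a feel for what a ninth-power relationship means in practice, here is a tiny back-of-the-envelope sketch (my own illustration, not code from the article) that plugs a few hypothetical improvement factors into the cost ∝ k^9 relationship:

```python
# Back-of-the-envelope illustration of the cost ~ k^9 relationship reported
# in the IEEE Spectrum article. The baseline cost of 1.0 is arbitrary; only
# the relative growth matters.

def relative_compute_cost(improvement_factor: float, exponent: float = 9.0) -> float:
    """Relative compute needed to improve performance by `improvement_factor`."""
    return improvement_factor ** exponent

for k in (1.5, 2.0, 3.0, 10.0):
    print(f"{k:>4}x better performance -> ~{relative_compute_cost(k):,.0f}x more compute")

# Output:
#  1.5x better performance -> ~38x more compute
#  2.0x better performance -> ~512x more compute
#  3.0x better performance -> ~19,683x more compute
# 10.0x better performance -> ~1,000,000,000x more compute
```

Doubling performance costing roughly 500 times more compute is exactly why the authors call the trend unsustainable.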

Another thing I liked about the analysis performed was that it took into consideration the environmental impact of growing and training more complex DL models.

The following graph speaks volumes. It shows the error rate (y-axis and dots on the graph) on the famous ImageNet dataset/challenge (I’ve written about it here) decreasing over the years once DL entered the scene in 2012 and smashed previous records. The line shows the corresponding carbon-dioxide emissions accompanying training processes for these larger and larger models. A projection is then shown (dashed line) of where carbon emissions will be in the years to come assuming AI grows at its current rate (and no new steps are taken to alleviate this issue – more on this later).

[Figure: ImageNet error rate versus computation and carbon emissions]
As DL models get better (y-axis), the computations required to train them (bottom x-axis) increase, and hence so do carbon emissions (top x-axis).

Just look at the comments in red in the graph. Very interesting.

And the costs of these future models? To achieve an error rate of 5%, the authors extrapolated a cost of US$100 billion. That’s just ridiculous and definitely untenable.

We won’t, of course, get to a 5% error rate the way we are going (nobody has this much money) so scientists will find other ways to get there or DL results will start to plateau:

We must either adapt how we do deep learning or face a future of much slower progress

At the end of the article, then, the authors provide an insight into what is happening in this respect as science begins to realise its limitations and look for solutions. Meta-learning is one such solution that is presented and discussed (meta-learning here meaning the training of models designed for broader tasks that are then reused for a multitude of more specific cases; in this scenario, only one expensive training run needs to take place to cover multiple tasks).
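To make that economy concrete, here is a minimal sketch of the general pattern being described: one expensive, broadly trained model reused across many cheap task-specific heads. This is my own toy illustration (in the spirit of transfer learning with a frozen backbone); the dimensions, task names, and model are all invented and it is not the authors’ actual proposal.

```python
import torch
import torch.nn as nn

# One expensive, general-purpose "trunk" trained once (left untrained here for brevity).
# In practice this would be a large pretrained encoder.
trunk = nn.Sequential(
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
)
for p in trunk.parameters():      # freeze it: its (costly) training happens only once
    p.requires_grad = False

# Many cheap, task-specific heads reuse the same trunk.
heads = {
    "task_a": nn.Linear(128, 10),   # e.g. a 10-class problem
    "task_b": nn.Linear(128, 3),    # e.g. a 3-class problem
}

def train_head(task: str, x: torch.Tensor, y: torch.Tensor, epochs: int = 5) -> None:
    """Train only the small head for `task`; the shared trunk is never retrained."""
    head = heads[task]
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        features = trunk(x)              # cheap forward pass through the frozen trunk
        loss = loss_fn(head(features), y)
        loss.backward()
        opt.step()

# Dummy data just to show the call pattern.
train_head("task_a", torch.randn(32, 256), torch.randint(0, 10, (32,)))
train_head("task_b", torch.randn(32, 256), torch.randint(0, 3, (32,)))
```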

However, all the current research so far indicates that the gains from these innovations are minimal. We need a much bigger breakthrough for significant results to appear. 

And like I said in my previous article, big breakthroughs like this don’t come willy-nilly. It’s highly likely that one will come along but when that will be is anybody’s guess. It could be next year, it could be at the end of the decade, or it could be at the end of the century.

We really could be reaching the max speed of AI – which obviously would be a shame.

Note: the authors of the aforementioned article have published a scientific paper as an arXiv preprint (available here) that digs into all these issues in even more detail. 


[Image: super-resolution imaging example]

Image Enhancing – Part 2

This is the 50th post on my blog. Golden anniversary, perhaps? Maybe not. To celebrate this milestone, however, I thought I’d return to my very first post, made at the end of 2017 (4 years ago!), on the topic of image-enhancing scenes in Hollywood films. We all know what scenes I’m talking about: we see some IT expert scanning security footage and zooming in on a face or a vehicle licence plate; when the image becomes blurry, the detective standing over the expert’s shoulder asks for the image to be enhanced. The IT guy waves his wand and presto!, we see a full-resolution image on the screen.

In that previous post of mine I stated that, although what Hollywood shows is rubbish, there are actually some scenarios where image enhancing like this is possible. In fact, we see it in action in some online tools that you may even use every day – e.g. Google Maps.

In today’s post, I wish to talk about new technology that has recently emerged from Google that’s related to the image enhancing topic discussed in my very first post. The technology I wish to present to you, entitled “High Fidelity Image Generation Using Diffusion Models“, was published on the Google AI Blog in July of this year and is on the topic of super-resolution imaging. That is, the task of transforming low-resolution images into detailed high resolution images. 

The difference between image enhancing (as discussed in my first post) and super-resolution imaging is that the former gives you faithful, high-resolution representations of the original object, face, or scene, whereas the latter generates high-resolution images that look real but may not be 100% authentic to the original scene of which the low-resolution image was a photograph. In other words, while super-resolution imaging can increase the information content of an image, there is no guarantee that the upscaled features in the image exist in the original scene. Hence, the technique should be used with caution by law enforcement agencies for things like enhancing images of faces or licence plate numbers!

Despite this, super-resolution imaging has its uses too – especially since the generated images can be quite similar to the original low-resolution photo/image. Applications include restoring old family photos, improving medical imaging systems, and the simple but much-desired task of deblurring images.

Google’s product is a fascinating one, not least because its results are amazing. Interestingly, the technology behind the research is not based on deep generative models such as GANs (Generative Adversarial Networks – I talk about these briefly in this post), as one would usually expect for this kind of use case. Google decided to experiment with diffusion models, an idea first published in 2015 but largely neglected since then.

Diffusion models are very interesting in the way they train their neural networks. The idea is to first progressively corrupt the training data by adding Gaussian noise to it. A deep learning model is then trained to reverse this corruption process with reference to the original training data. A model trained in this way is well suited to the task of “denoising” a low-resolution image into a higher-resolution one.
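For the curious, here is a heavily simplified sketch of that training idea in code: corrupt clean data with Gaussian noise at a random step of a noise schedule, then train a network to predict the noise that was added (the standard DDPM-style objective). This is my own toy illustration of the concept, not Google’s actual super-resolution implementation; the network and schedule are deliberately tiny.

```python
import torch
import torch.nn as nn

T = 1000                                        # number of diffusion (noising) steps
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal-retention factors

# A toy "denoiser": given a noisy sample and its timestep, predict the added noise.
# Real systems use a large U-Net here (conditioned on the low-res image for super-resolution).
class TinyDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x_noisy, t):
        t_feat = (t.float() / T).unsqueeze(-1)   # crude timestep embedding
        return self.net(torch.cat([x_noisy, t_feat], dim=-1))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(x0: torch.Tensor) -> torch.Tensor:
    """One step: corrupt clean data x0 with Gaussian noise, learn to predict that noise."""
    t = torch.randint(0, T, (x0.shape[0],))      # random timestep per sample
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_noisy = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward (corruption) process
    pred_noise = model(x_noisy, t)
    loss = nn.functional.mse_loss(pred_noise, noise)           # learn to reverse the corruption
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss

loss = training_step(torch.randn(16, 64))        # dummy "clean" data just to show the call
```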

Let’s take a look at some of the results produced by this process and presented by Google to the world:

[Figure: The image on the right shows the result of super-resolution imaging applied to the picture on the left.]

That’s pretty impressive considering that no additional information is given to the system about how the super-resolved image should look. The result looks like a faithful upscaling of the original image. Here’s another example:

[Image: another super-resolution example]

Google reports that its results far surpass those of previous state-of-the-art solutions for super-resolution imaging. Very impressive.

But there’s more. The researchers behind this work tried out another interesting idea. If one can get impressive results upscaling images as shown above, how about taking things a step further and chaining together multiple models, each trained to upscale at a different resolution? The result is a cascading effect of upscaling that can create high-resolution images from mere thumbnails. Have a look at some of these results:

[Image: cascaded upscaling example 1]

[Image: cascaded upscaling example 2]

It’s very impressive how these programs can “fill in the blanks”, so to speak, and create more detail in an image where it’s needed. Some results aren’t always accurate (images may contain errors such as discontinuities or gaps where none should appear), but generally speaking these upscaled images would pass as genuine at first glance for most users.
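To make the chaining idea concrete, here is a small sketch of how such a cascade could be wired together. The stage names, resolutions, and placeholder upscalers are all invented for illustration; Google’s actual cascade is of course built from trained diffusion models rather than simple interpolation.

```python
from typing import Callable, List
import torch

# Hypothetical super-resolution stages, each a trained model that upscales by a fixed factor.
UpscaleModel = Callable[[torch.Tensor], torch.Tensor]

def cascade(thumbnail: torch.Tensor, stages: List[UpscaleModel]) -> torch.Tensor:
    """Feed the output of each upscaling stage into the next one."""
    image = thumbnail
    for stage in stages:
        image = stage(image)          # e.g. 32x32 -> 64x64 -> 256x256 -> 1024x1024
    return image

# Placeholder stages: in reality each would be a trained network.
sr_32_to_64    = lambda img: torch.nn.functional.interpolate(img, scale_factor=2)
sr_64_to_256   = lambda img: torch.nn.functional.interpolate(img, scale_factor=4)
sr_256_to_1024 = lambda img: torch.nn.functional.interpolate(img, scale_factor=4)

thumb = torch.randn(1, 3, 32, 32)     # a 32x32 RGB "thumbnail"
result = cascade(thumb, [sr_32_to_64, sr_64_to_256, sr_256_to_1024])
print(result.shape)                   # torch.Size([1, 3, 1024, 1024])
```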

Google has undoubtedly struck again.

That’s it for my 50th post. It’s been a great ride so far on this blog. Hopefully there’s lots more to come. Tell your friends 🙂



AI is Still Fundamentally Unintelligent

My last post talked about how AI’s growth to date is unsustainable and will hence slow down unless a new paradigm is discovered. A discovery like this may or may not happen, as discoveries go. In this post I wish to follow a similar vein. Today, I wish to discuss how AI, despite all its achievements, is still fundamentally unintelligent and, hence, how the hype behind it is to a significant degree unjustified. I will use the GPT-3 language model as a representative of AI to argue my case. Conclusions from my analyses can be generalised to AI as a whole, as I will discuss below.

I’ve praised the GPT-3 language model in the past (see this previous post). In summary, GPT-3 can create code for you based on text descriptions, answer in-depth medical questions correctly, compose guitar tabs, fill in missing parts of an image, write creative fiction, and so on. Truly incredible things. In the field of language processing and understanding, one can safely say that GPT-3 is state-of-the-art, even though larger language models have since been released (e.g. Google’s Switch Transformer or Wu Dao 2.0).

But when one looks at GPT-3 closer, one starts to question, at least a little, the hype behind it and behind AI in general. Could AI still be fundamentally unintelligent?

There is no public access to GPT-3, so we can’t test things out for ourselves. GPT-J has been created by other researchers to alleviate this (and to disrupt OpenAI’s monopoly), but it still isn’t the real deal, so to speak. Hence, even a year after the initial release, we have to lean on other people’s experiences of the product. Thankfully, there’s plenty to work with and analyse.

Once again, what GPT-3 can do is astounding. Take a look at some of these responses (taken from here) that GPT-3 gave to quiz questions:

Q: What is your favorite animal?
A: My favorite animal is a dog.

Q: Why?
A: Because dogs are loyal and friendly.

Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many legs does a frog have?
A: A frog has four legs.

Q: Are there any animals with three legs?
A: No, there are no animals with three legs.

Q: Why don’t animals have three legs?
A: Animals don’t have three legs because they would fall over.

These are amazing responses (especially the last one) that could fool anybody into thinking that they were conversing with a human being. GPT-3 is unreal in this respect. You really can hold a “reasonable” conversation with it for a long time without stumbling upon inconsistencies in its responses.

But let’s keep digging and start to ask different questions, ones not really based on knowledge but more on understanding.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many eyes does my foot have?
A: Your foot has two eyes.

Q: How many eyes does a spider have?
A: A spider has eight eyes.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.

Suddenly, you start to question the intelligence of this thing. Does it have understanding? If you can’t correctly answer that a foot has no eyes, despite having been trained on 43 terabytes of data from the internet and books, then perhaps there is something fundamentally missing in you as an entity.

Let’s take a look at another example taken from this academic (pre-print) paper. In these experiments, GPT-3 was asked to analyse sentences and state the relationships between them. (Disclaimer: the paper actually analysed the underpinning technology that drives state-of-the-art language models like GPT-3. It did not explicitly examine GPT-3 itself, but for simplicity’s sake, I’m going to generalise here).

[Image: a pair of sentences analysed by the model]

The above two sentences were correctly analysed as being paraphrases of each other. Nothing wrong here. But let’s jumble up some of the words a bit and create two different “sentences”:

[Image: the same sentences with their words jumbled]

These two sentences are completely nonsensical, yet they are still classified as paraphrases. How can two “sentences” like this be classified as not only having meaning but having similar meaning, too? Nonsense cannot be a paraphrase of nonsense. There is no understanding being exhibited here. None at all.

Another example:

[Image: a pair of clearly different sentences classified as identical in meaning]

These two sentences were classified as having the exact same meaning this time. They are definitely not the same. If the machine “understood” what marijuana was and what cancer was, it would know that these are not identical phrases. It should know these things, however, considering the data that it was trained on. But the machine is operating on a much lower level of “comprehension”. It is operating on the level of patterns in languages, on pure and simple language statistics rather than understanding.

I could give you plenty more examples to show that what I’m presenting here is a universal dilemma in AI (this lack of “understanding”), but I’ll refrain from doing so as the article is already getting a little too verbose. For more, though, see this, this, and this link.

The problem with AI today and the way that it is being marketed is that all examples, all presentations of AI are cherry picked. AI is a product that needs to be sold. Whether it be in academic circles for publications, or in the industry for investment money, or in the media for a sensationalistic spin to a story: AI is being predominantly shown from only one angle. And of course, therefore, people are going to think that it is intelligent and that we are ever so close to AGI (Artificial General Intelligence).

But when you work with AI, when you see what is happening under the hood, you cannot but question some of the hype behind it (unless you’re a crafty and devious person – I’m looking at you Elon Musk). Even one of the founders of OpenAI downplays the “intelligence” behind GPT-3:

[Image: tweet by Sam Altman downplaying the “intelligence” behind GPT-3]

You can argue, as some do, that AI just needs more data, that it just needs to be tweaked a bit more. But, like I said earlier, GPT-3 was trained on 43 terabytes of text. That is an insane amount. Would it not be fair to say that any living person with access to this amount of information would not make nonsensical remarks like GPT-3 does? Even if such a person were to make mistakes, there is a difference between a mistake and nonsense of the type above. There is still an underlying element of intelligence behind a mistake. Nonsense is nothingness. Machine nonsense is empty, hollow, barren – machine-like, if you will.

Give me any AI entity and with enough time, I could get it to converge to something nonsensical, whether in speech, action, etc. No honest scientist alive would dispute this claim of mine. I could not do this with a human being, however. They would always be able to get out of a “tight situation”.

“How many eyes does my foot have?”

Response from a human: “Are you on crack, my good man?”, and not: “Your foot has two eyes”.

Any similar situation, a human being would escape from intelligently.

Fundamentally, I think the problem is the way that we scientists understand intelligence. We confound visible, perceived intelligence with inherent intelligence. But this is a discussion for another time. The purpose of my post is to show that AI, even with its recent breathtaking leaps, is still fundamentally unintelligent. All state-of-the-art models/machines/robots/programs can be pushed to nonsensical results or actions. And nonsensical means unintelligent.

When I give lectures at my university here I always present this little adage of mine (which I particularly like, I’ll admit): “Machines operate on the level of knowledge. We operate on the level of knowledge and understanding.”

It is important to discuss this distinction because otherwise AI will remain over-hyped. And an over-hyped product is not a good thing, especially a product as powerful as AI. Artificial Intelligence operates in mission-critical fields. Further, big decisions are being made with AI in mind by governments around the world (in healthcare, for instance). If we don’t truly grasp the limitations of AI, if we make decisions based on a false image, particularly one founded on hype, then there will be serious consequences. There already have been: people have suffered and died as a result. I plan to write on this topic in a future post, however.

For now, I would like to stress once more: current AI is fundamentally unintelligent, and the hype surrounding it is to a significant degree unjustified. It is important that we become aware of this, if only for the sake of truth. But then again, truth in itself is important, because if one operates in truth, one operates in the real world rather than a fictitious one.



Artificial Intelligence is Slowing Down

(Update: part 2 of this post was posted recently here.)

Over the last few months here at Carnegie Mellon University (Australia campus) I’ve been giving a set of talks on AI and the great leaps it has made in the last 5 or so years. I focus on disruptive technologies and give examples ranging from smart fridges and jackets to autonomous cars, robots, and drones. The title of one of my talks is “AI and the 4th Industrial Revolution”.

Indeed, we are living in the 4th industrial revolution – a significant time in the history of mankind. The first revolution occurred in the 18th century with the advent of mechanisation and steam power; the second came about 100 years later with the discovery of electrical energy (among other things); and the big one, the 3rd industrial revolution, occurred another 100 years after that (roughly around the 1970s) with things like nuclear energy, space expeditions, electronics, telecommunications, etc. coming to the fore.

So, yes, we are living in a significant time. The internet, IoT devices, robotics, 3D printing, virtual reality: these technologies are drastically “revolutionising” our way of life. And behind the aforementioned technologies of the 4th industrial revolution sits artificial intelligence. AI is the engine that is pushing more boundaries than we could have possibly imagined 10 years ago. Machines are doing more work “intelligently” for us at an unprecedented level. Science fiction writers of days gone by would be proud of what we have achieved (although, of course, predictions of where we should be now in terms of technological advances have fallen way short according to those made at the advent of AI in the middle of the 20th century).

The current push in AI is being driven by data. “Data is the new oil” is a phrase I keep repeating in my conference talks. Why? Because if you have (clean) data, you have facts, and with facts you can make insightful decisions or judgments. The more data you have, the more facts you have, and therefore the more insightful your decisions can potentially be. And with insightful decisions comes the possibility of making more money. If you want to see how powerful data can be, watch the film “The Social Dilemma”, which shows how every little thing we do on social media (e.g. where we click, what we hover our mouse over) is harvested and converted into facts about us that drive algorithms to keep us addicted to these platforms or to form our opinions on important matters. It truly is scary. But we’re talking here about loads and loads and loads of data – or “big data” as it is now referred to.

Once again: the more data you have, the more facts you have, and therefore the more insightful your decisions can be. The logic is simple. But why haven’t we put this logic into practice earlier? Why only now are we able to unleash the power of data? The answer is two-fold: firstly, we only now have the means to be thrifty in the way we store big data. Today storing big data is cheap: hard drive storage sizes have sky-rocketed while their costs have remained stable – and then let’s not forget about cloud storage.

The bottom line is that endless storage capabilities are accessible to everybody.

The second answer to why the power of big data is only now being harnessed is that we finally have the means to process it and get those precious facts/insights out of it. A decade ago, machine learning could not handle big data. Algorithms like the SVM just couldn’t deal with data that had too many features (i.e. was too complex). They could only deal with relatively simple data – and not a lot of it, for that matter. They couldn’t find the patterns in big data that now, for example, drive the social media algorithms mentioned above, nor could they deal with things like language, image, or video processing.

But then there came a breakthrough in 2012: deep learning (DL). I won’t describe here how deep learning works or why it has been so revolutionary (I have already done so in this post), but the important thing is that DL has allowed us to process extremely complex data, using models that can have millions or even billions of parameters rather than just hundreds or thousands.

It’s fair to say that all the artificial intelligence you see today has a deep learning engine behind it. Whether it be autonomous cars, drones, business intelligence, chatbots, fraud detection, visual recognition, recommendation engines – chances are that DL is powering all of these. It truly was a breakthrough. An amazing one at that.

Moreover, the fantastic thing about DL models is that they are scalable, meaning that if you have too much data for your current model to handle, you can, theoretically, just increase its size (that is, increase its number of parameters). This is where the old adage – the more data you have, the more facts you have, and therefore the more insightful your decisions can be – comes to the fore. Thus, if you have more data, you just grow your model size.
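As a toy illustration of that scalability (my own example, with arbitrary sizes), the same model “recipe” can simply be built bigger when more data becomes available:

```python
import torch.nn as nn

def make_model(width: int, depth: int, in_dim: int = 128, out_dim: int = 10) -> nn.Module:
    """Build the same architecture at different sizes: more data -> bigger width/depth."""
    layers = [nn.Linear(in_dim, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)

small = make_model(width=256, depth=4)     # for a modest dataset
large = make_model(width=4096, depth=24)   # same recipe, scaled up for far more data

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"small: {count(small):,} parameters, large: {count(large):,} parameters")
```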

Deep learning truly was a huge breakthrough.

There is a slight problem, however, in all of this. DL has an Achilles heel – or a major weakness, let’s say. This weakness is its training time. Processing big data, that is, training these DL models, is a laborious task that can take days, weeks, or even months! The larger and more complex the model, the more training time is required.

Let’s discuss, for example, the GPT-3 language model that I talked about in my last blog post. At its release last year, GPT-3 was the largest and most powerful natural language processing model. If you were to train GPT-3 yourself, it would take you 355 years to do so on a decent, home machine. Astonishing, isn’t it? Of course, GPT-3 was trained on state-of-the-art clusters of GPUs but undoubtedly it still would have taken a significant amount of time to do.

But what about the cost of these training tasks? It is estimated that OpenAI spent US$4.6 million to train the GPT-3 model. And that’s only counting the one iteration of this process. What about all the failed attempts? What about all the fine-tunings of the model that had to have taken place? Goodness knows how many iterations the GPT-3 model went through before OpenAI reached their final (brilliant) product.
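Those two headline figures are at least consistent with a simple back-of-the-envelope calculation. Assuming a total of roughly 3.14 × 10^23 floating-point operations to train GPT-3, a single GPU sustaining about 28 TFLOPS, and cloud pricing of around US$1.50 per GPU-hour (all three inputs are assumptions I’m plugging in for illustration, not figures from this post), the numbers land close to 355 years and several million dollars:

```python
# Rough sanity check of the "355 years" and "millions of dollars" figures.
# All inputs below are assumptions for illustration.
TOTAL_FLOPS = 3.14e23        # approximate total compute to train GPT-3
GPU_FLOPS = 28e12            # ~28 TFLOPS sustained on a single high-end GPU
PRICE_PER_GPU_HOUR = 1.50    # assumed cloud price in USD at the time

seconds = TOTAL_FLOPS / GPU_FLOPS
hours = seconds / 3600
years = hours / (24 * 365)

print(f"Single-GPU training time: ~{years:,.0f} years")                   # ~356 years
print(f"Cloud cost at that rate:  ~${hours * PRICE_PER_GPU_HOUR:,.0f}")   # ~$4.7 million
```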

We’re talking about a lot of money here. And who has this amount of money? Not many people.

Hence, can we keep growing our deep learning models to accommodate more and more complex tasks? Can we keep increasing the number of parameters in these things to allow current AI to get better and better at what it does? Surely we are going to hit a wall soon with our current technology? Surely the current growth of AI is unsustainable? We’re now spending months training some state-of-the-art products, and millions and millions of dollars on top of that.

Don’t believe me that AI is slowing down and reaching a plateau? How about a higher authority on this topic? Let’s listen to what Jerome Pesenti, the current head of AI at Facebook, has to say on this (original article here):

When you scale deep learning, it tends to behave better and to be able to solve a broader task in a better way… But clearly the rate of progress is not sustainable… Right now, an experiment might [cost] seven figures, but it’s not going to go to nine or ten figures, it’s not possible, nobody can afford that…

In many ways we already have [hit a wall]. Not every area has reached the limit of scaling, but in most places, we’re getting to a point where we really need to think in terms of optimization, in terms of cost benefit

This is all true, folks. The current growth of AI is unsustainable. Sure, there is research in progress to optimise the training processes, to improve the hardware being utilised, to devise more efficient ways that already trained models can be reused in other contexts, etc. But at the end of the day, the current engine that powers today’s AI is reaching its max speed. Unless that engine is replaced with something bigger and better, i.e. another astonishing breakthrough, we’re going to be stuck with what we have.

Will another breakthrough happen? It’s possible. Highly likely, in fact. But when that will be is anybody’s guess. It could be next year, it could be at the end of the decade, or it could be at the end of the century. Nobody knows when such breakthroughs come along. It requires an inspiration, a moment of brilliance, usually coupled with luck. And inspirations and luck together don’t come willy-nilly. These things just happen. History attests to this.

So, to conclude, AI is slowing down. There is ample evidence to back my claim. We’ve achieved a lot with what we’ve had – truly amazing things. And new uses of DL will undoubtedly appear. But DL itself is slowly reaching its top speed.

It’s hard to break this kind of news to people who think that AI will just continue growing exponentially until the end of time. It’s just not going to happen. And besides, that’s never been the case in the history of AI anyway. There have always been AI winters followed by hype cycles. ALWAYS. Perhaps we’re heading for an AI winter now? It’s definitely possible.

Update: part 2 of this post was posted recently here.



The Incredible Power of GPT-3 by OpenAI

Article content:

  1. GPT-1
  2. GPT-2
  3. GPT-3
  4. DALL·E (text to image translation)

Last week I gave a few conference talks in which I discussed the exponential growth of AI over the past few years. One of the topics I covered was Natural Language Processing/Understanding (NLP/NLU) – the branch of AI that helps machines understand, manipulate, and use human language. We have come a seriously long way since Google Translate made its debut in 2006 (yes, it’s been that long!) or since chatbots first came to the fore. Machines can do incredible things now. And a lot of this progress can be attributed to the research done by OpenAI.

OpenAI was founded in 2015 by none other than Elon Musk and friends (e.g. Sam Altman). It was initially a not-for-profit organisation whose mission was to safely improve AI for the betterment of human society. Since then a lot has changed: Elon Musk left the board in February 2018, the organisation changed its official status to for-profit in 2019, and it has since attracted a lot of attention from the corporate world, especially from Microsoft.

Over the years OpenAI has truly delivered incredible advancements in AI. Some of their products can only be labelled as truly exceptional: OpenAI Gym (its platform for reinforcement learning) and OpenAI Five (AI for the video game Dota 2) deserve honourable mentions here. But the real headlines have been made by their Generative Pre-Trained Transformer (GPT) language models.

GPT-1

GPT-1 was released by OpenAI in 2018. It was trained on the BooksCorpus dataset (7,000 unpublished books), and what made this model stand out from others was that it was trained as a “task-agnostic” model, meaning that it was designed for, let’s say, “general purpose” use (rather than just sentiment analysis, for example). Moreover, it had a significant unsupervised pre-training phase, which allowed the model to learn from unannotated (raw) data. As the published academic paper states, this was a significant achievement because the abundant unlabeled text corpora available on the internet could now potentially be tapped:

By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification. [emphasis mine]

GPT-1 made headlines. It was something new and powerful and showed great potential because it demonstrated that language models could be generalised with relatively minimal human interference.

GPT-2

The following year in 2019 came GPT-2. And this bad boy made headlines worldwide (e.g. this BBC article). It was huge. And scary. Scary because of how good it was. So good that OpenAI decided to release the model in stages to allow people to get used to its capabilities. They said that GPT-2 could easily be used for nefarious purposes (e.g. to write fake news at large scale) and it was important that people and researchers were given a head-start to investigate properly this new “weapon” in NLP/NLU.

The power of GPT-2 came not necessarily from any new algorithm/process, per se, but simply from using a larger dataset and more parameters in the model. Basically, GPT-1 was improved by throwing more data at it: whereas GPT-1 was trained on 7,000 books, GPT-2 was trained on 8,000,000 web pages (scraped from outbound links on Reddit – the dataset can be found here). To allow for the discovery of patterns in the English language in such a massive dataset, the number of parameters used had to be increased: from 117 million (GPT-1) to 1.5 billion (GPT-2). A staggering increase. But like I said, this is where the power of GPT-2 came from – and it just shows that data, data, and once again data is the key to AI at this present moment. If you can process this data, why not use it?

Let’s have a look at how good GPT-2 proved to be. This website allows you to get the model to generate text for you. You can type anything you want as a prompt and GPT-2 will continue the story, poem or article for you. I just typed the following into the prompt:

I am writing a blog post on computer vision.

GPT-2 came up with the following:

I am writing a blog post on computer vision. So please, if you are on facebook, do not leave comments there. I want to read your comments on my blog post. Thank you.

And then, as if magically, it went ahead and began to write my blog post:

So since I’m so busy with my part – time job, I have been making fun things for my students. Last week we played a game and the object was to cut more than everyone else. The most frugal way I had was to cut with a paper cutter. I was going to use scissors, but that would have been very expensive.

That is absolutely phenomenal. No wonder OpenAI got scared of GPT-2’s possibilities. How about writing books? Priya Dwivedi here famously trained GPT-2 on all the Harry Potter books just to see if she could get the model to write something in the Hogwarts universe. It didn’t disappoint:

[Image: GPT-2-generated Harry Potter text]

That is pristine English. It also shows an amazing ability to learn and memorise contexts – for example, that Hagrid is a big person, that Lockhart writes books, and that new books can appear in the library. Relations like this are found in the original books and then used when generating text. Truly fantastic.
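If you’d like to try this yourself, the GPT-2 weights are openly available. Here is a minimal sketch using the Hugging Face transformers library (an assumption on my part – it is not necessarily what the website linked above uses under the hood); the text you get back will of course differ from run to run:

```python
from transformers import pipeline, set_seed

# Load the openly released GPT-2 model and generate a continuation of a prompt.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # only for reproducibility of this illustration

prompt = "I am writing a blog post on computer vision."
outputs = generator(prompt, max_length=80, num_return_sequences=1)

print(outputs[0]["generated_text"])
```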

GPT-3

But then came GPT-3 in mid-2020. It caught a lot of people by surprise because we were still only just getting used to playing around with the power of GPT-2. And if GPT-2 was large, GPT-3 was monstrous in size. The largest thing the AI world had seen to date. It had over 100 times more parameters (175 billion) than GPT-2, meaning that, crucially, even more data could be thrown at it so that the model could learn significantly more of the patterns and relations in our beautiful language.

Let’s have a look at some of the things GPT-3 is capable of. Here is Qasim Munye asking the model a very detailed question in medicine:

[Image: Qasim Munye’s tweet posing a detailed medical question to GPT-3]

GPT-3’s response was this:

[Image: GPT-3’s response to the medical question]

As Qasim explains, this was not an easy task at all. One that even human doctors would have trouble with:

[Image: Qasim Munye’s follow-up tweet]

That’s pretty good. But it gets better. Here is Francis Jervis asking GPT-3 to translate “normal” English to legal English. This is incredible. Just look at how precise the result is:

[Image: Francis Jervis’s tweet showing GPT-3 translating plain English into legal English]

Lastly, what I want to show you is that GPT-3 is so good that it can even be used to program in computer languages. In fact, GPT-3 can code in CSS, JSX, and Python among others.

[Image: GPT-3 generating code (original tweet link)]

Since its release, numerous start-ups, as well as countless projects, have popped up to try to tap into GPT-3’s power. But to me, what came at the beginning of this year blew my mind completely. And it is more in line with computer vision (and the general scope of this blog).

DALL·E

DALL·E, released by OpenAI on January 5, 2021, is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions. That is, OpenAI modified GPT-3 to generate images rather than text! You provide it with a text caption and it will produce a set of images to match the text – without any human interaction. A completely astounding concept. But it works!

Let’s take a look at how well it does this:

[Image: DALL·E example output]

What you see is the text input at the top given to DALL·E and its generated output.

Let me say that again: you provide a machine with text and it generates images for you that match the text. When I first saw this I got up, walked out of my office, and paced around the campus for a bit because I was in awe at what I had just seen.

Some more examples:

[Image: another DALL·E example output]

[Image: a further DALL·E example output]

Incredible, isn’t it? Especially the last example, where two types of images are generated simultaneously, as requested by the user.

The sky’s the limit with technology such as this. It truly is. I can’t wait to see what the future holds in NLP/NLU and computer vision.

In my next post I will look at why I think, however, that AI is beginning to slow down. The exponential growth in innovation that I mentioned at the beginning of this post, I suspect, is coming to an end. But more on this next time. For now, enjoy the awe that I’m sure I’ve awoken in you.



Loss of Class Dynamics Amid Distance Learning

In this article I would like to step away a bit from computer vision and talk about education at university/college level and how I hope it won’t change too much as we (hopefully) slowly recover from the coronavirus pandemic (at least here in Australia).

I bumped into a colleague of mine from Carnegie Mellon University last week while having coffee outside the nearby cafe in town. We got talking about the usual things and then the topic ventured, of necessity, towards COVID-19 and how things will change post-pandemic. For example, we’re already seeing remote work opportunities being given permanent status in some firms because (finally) benefits for both employers and employees are being recognised. But my colleague and I then started talking about how things might change at our university. Perhaps students at CMU should be given more opportunities to attend lectures remotely even when the necessity for social distancing will be removed? This is an interesting question, centred around this whole idea of distance learning, that has been discussed for decades. My colleague and I talked about it for a little bit, too.

The benefits of such a plan (for distance learning) are easily discernible: e.g. there is no longer the necessity for students and lecturers to commute to a campus or even to reside in a given country; or there is the benefit of the university not needing to maintain as many facilities or equipment on campus, thus saving money that could be put into many other things to improve the learning experience.

The drawbacks of distance learning are also fairly well known: e.g. it is more difficult for students to stay motivated, and project work involving teams is much more cumbersome to complete, let alone do well.

But in this whole debate on the benefits and drawbacks of remote learning I think one thing is being significantly disregarded: class dynamics. This is what I would like to write about in this article.

Before I continue, I need to define what I mean by “class dynamics”. Class dynamics, at least the way I will be using the term here, is a certain atmosphere or ambience in a classroom/lecture environment that can foster or impede the interaction between a pedagogue and their students. Many factors contribute to class dynamics: for example, the attitude and mood of the lecturer, the attitudes and moods of the students, the topics being discussed, etc.

Class dynamics is just so important. It can significantly affect the learning outcomes of students. It can be the decisive factor between good class engagement and no class engagement. It can decide whether students come to seek out the lecturer after a session to delve deeper into a topic or to have things explained further. All of this has an impact on the teacher as well. He will be spurred on by positive class engagement and find satisfaction in what he is doing. And then this contentment will flow over onto the students even more and boost their satisfaction. Class dynamics affects students and teachers in a cyclical way. Like I said, it is just so important.

Since the beginning of the pandemic, I have delivered countless lectures via video conference. Yes, it has been convenient in many respects (e.g. I have worn comfy pyjamas and slippers on my bottom half) but I have come to truly appreciate what a physical classroom environment really gives towards the whole educational experience, predominantly in the context of class dynamics.

Indeed, physical presence just gives so much. Firstly, there is the notion of body language. We’ve all heard just how much body language can convey. It truly can communicate a lot. Little reactions to things I’m saying, people turning around to others at particular moments to seek explanation, slouching – things that a camera cannot properly capture. We read body language and consciously or subconsciously react to it. A good pedagogue will be able to react accordingly and steer discussions or lectures in the right direction to keep people’s attention at full capacity or to notice when concepts need to be reiterated perhaps in a different way. You lose all (or at least most) of this when you’re delivering lectures via video conference. I miss this aspect so much. I just can’t read my students’ “presence” in a given lecture at all. And it’s seriously draining and detrimental to all involved. Especially since concepts in computer science (and science in general) build on top of each other. So, whether a student grasps something now will have a knock on effect for any future classes he/she will attend.

Class dynamics is paramount. And it is fostered by physical presence.

Something else that contributes to class dynamics is the building up of a community in a class. When students attend a campus in person they can get to know each other so much better. They can “hang out” after class or in the evenings and friendships can be formed. Classroom interaction becomes so much better when everyone is relaxed around each other! When you teach via video conference the ability to form a community is significantly diminished. Everyone loses out.

These are really important points to consider. Because, ultimately, with learning via video conferencing, the students, the class dynamic, the relationships between the pedagogue and his pupils – the entire learning experience – gets flattened into two dimensions, much like everyone’s face on the screen in front of you.

So much is lost.

And it’s, hence, important to think about this when weighing up the pros and cons of distance learning. We want to keep the standards at our universities/colleges high while, of course, maintaining costs at a minimum and leisure at a maximum. Class dynamics cannot be ignored even though it is difficult to measure and put into argument form when discussing these things with the people in charge. But it has to be discussed and argued for, especially when it looks like the world will slowly be returning to normality in the near future.



Apple’s and Samsung’s Face Unlocking Technologies

Have you ever wondered how the technology that unlocks your phone with your face works? This is a fascinating question and, interestingly, Samsung and Apple provide very different technologies for this feature on their devices. This post will examine the differences between the two technologies and will also show you how either of the two can be fooled to grant you access to anybody else’s phone.

(Please note: this is the third part of my series of articles on how facial recognition works. Hence, I will breeze over some more complex topics. If you wish to delve deeper into this area of computer vision, please see my first and second posts in this series.)

Samsung’s Face Recognition

Samsung’s face unlocking feature has, perhaps surprisingly, been around since 2011. Over the years, especially recently, it has undergone some improvements. Up until the Galaxy S8 model, face unlocking was done using the regular front camera of the phone to take a picture of your face. This picture was analysed for facial features such as the distance between the eyes, facial contours, iris colour, iris size, etc. The information was stored on your phone so that next time you tried to unlock it, the phone would take a picture of you, process it for the aforementioned data, and then compare it to the information it had stored on your phone. If everything matched, your phone was unlocked.
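Conceptually, the comparison step boils down to something like the sketch below: measure a handful of facial features at enrolment, store them, and later check whether a fresh measurement is close enough to the stored template. This is my own simplified illustration of the general idea, not Samsung’s actual implementation – the feature names, values, and threshold are all invented.

```python
import math

# Hypothetical facial measurements extracted from a 2D photo at enrolment time.
enrolled_template = {"eye_distance": 62.0, "face_width": 142.0, "iris_size": 11.5}

def distance(a: dict, b: dict) -> float:
    """Euclidean distance between two feature vectors with the same keys."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def unlock(candidate: dict, template: dict, threshold: float = 5.0) -> bool:
    """Unlock if the new measurement is close enough to the stored template."""
    return distance(candidate, template) < threshold

# A fresh capture of the owner (small measurement noise) vs. someone else entirely.
owner_capture = {"eye_distance": 61.4, "face_width": 143.1, "iris_size": 11.7}
stranger      = {"eye_distance": 55.0, "face_width": 150.0, "iris_size": 10.0}

print(unlock(owner_capture, enrolled_template))  # True
print(unlock(stranger, enrolled_template))       # False

# Note: a printed photo of the owner would yield essentially the same 2D
# measurements, which is exactly why this approach is so easy to fool.
```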

This was a cheap, fast, and easy way to implement facial recognition. Unfortunately, it was not very secure. The major problem was that all processing was done on 2D images. So, as you may have guessed, a simple printed photo of your face, or even one displayed on another phone, could fool the system. Need proof? Here’s a video of someone unlocking a Galaxy Note 8, released in 2017, with a photo shown on another phone. It’s quite amusing.

There was a “liveness check” added to this technology with the release of Android Jelly Bean in 2012. This worked by attempting to detect blinking. I never tried this feature but from what I’ve read on forums, it wasn’t very accurate and required a longer time to process your face – hence probably why the feature wasn’t turned on by default. And yes, it could also be fooled by a close-up video of you, though this would be much harder to acquire.

With the release of the Galaxy S8, a new biometric identification technology was introduced: iris scanning. Irises, like fingerprints, are unique to each person. Iris scanning on Samsung phones works by illuminating your eye with infrared light (invisible to the naked eye). However, this technology could also be fooled with photographs and contact lenses. Here’s a video of a security researcher from Berlin doing just that. He took a photo of his friend’s eye from a few metres away (!) in infrared mode (i.e. night mode), printed it out on paper, and then stuck a contact lens on the printed eye. Clever.

Perhaps because of this flaw, Samsung’s Galaxy S9 introduced Intelligent Scan, which combined facial scanning and iris scanning. Facial scanning, however, is still only performed on 2D images (as described above) taken from the front camera of the phone. But a combination of the two technologies was seen as improving face unlocking technology in general.

Unfortunately, the Samsung Galaxy S10 (and subsequently the S20) dropped Intelligent Scan and went back to standard 2D photo face recognition. The reason for this was to make room for a larger screen: the iris-scanning components were taking up a little too much room at the top of the phone for Samsung’s liking. With this move returned the possibility of unlocking people’s phones with photos or images. For example, here’s a video showing a Galaxy S10 being unlocked with an image on another phone. According to some users, however, if you manually tweak the settings on your phone by going to Settings > Biometrics and Security > Face recognition and toggling “Faster recognition” off, it becomes a lot harder to defeat.

(Interestingly, in this period of coronavirus pandemic, people have been crying out for the iris scanning technology to return because face recognition just does not work when you’re wearing a mask!)

Apple’s Face ID

This is where the fun begins. Apple really took face recognition seriously.

The Apple technology in question is called Face ID and it first appeared in November 2017 with the iPhone X.

In a nutshell, Face ID works by first illuminating your face with infrared light (as with iris scanning) and then projecting a further 30,000 (!) infrared points onto your face to build a super-detailed 3D map of your facial features. These 3D maps are then converted into mathematical representations (to understand how this is performed, see my first blog post on how facial recognition works). So, each time you try to unlock your phone, it’s these representations that are compared. Quite impressive.

What’s more, this technology can recognise faces with glasses, clothing, makeup, and facial hair (not face masks, though!), and it adapts to changes in appearance over time. The latter works by simply monitoring how your face changes over time – e.g. you may be gaining or losing weight, which will of course affect the general structure of your face, and hence the 3D map of it.

This impressive infrared technology, however, has been in use for a very long time. If you are familiar with the Microsoft Kinect camera/sensor (initially released in 2010), it uses the same concept of infrared point projection to capture and analyse 3D motion.

So, how do you fool the ‘TrueDepth camera system’, as Apple calls it? It’s definitely not easy because this technology is quite sophisticated. But successful attempts have already been documented.

To start off with, here’s a video showing identical twins unlocking each other’s phones. Also quite amusing. How about relatives that look similar? It’s been done! Here’s a video showing a 10-year-old boy unlocking his mother’s phone. Now that’s a little more worrisome. However, it shows that iPhone Xs can be an alternative to DNA paternity/maternity tests 🙂 Finally, here’s a video posted by Vietnamese hackers documenting how their 3D-printed face mask fooled Apple’s technology. Some elements, like the eyes, on this mask were printed on a standard colour printer. The model of the face was acquired in 5 minutes using a hand-held scanner.

Conclusion

In summary, if you’re truly worried about security, face unlocking on Samsung phones is just not up to scratch. I would recommend using their new (ultrasonic) fingerprint scanning technology instead. Because Apple works with 3D images of faces, it is much more secure. In this respect, Apple wins the battle of the phones, for sure.


[Image: The Last Lecture book cover]

Review of The Last Lecture by Randy Pausch

I detest receiving books as gifts in the workplace. Such books are usually of the soulless sort, the sort that are written around some lifeless corporate motto that is supposed to inspire a new employee to work overtime when needed. So, when I walked into my first day at work at Carnegie Mellon University (Australia campus) and saw the person in charge of my induction training holding a book obviously destined to end up in my possession, the invisible eyes of my invisible soul rolled as far back as possible into my invisible soul’s head. But I felt guilty about this reaction shortly after when I was told what the book was about: a past Carnegie Mellon University professor’s last lecture shortly before dying of cancer. “This could actually be interesting”, I thought, and mentally added the book to the bottom of my “To Read” list.

That happened over a year ago. I’ve been incessantly pestered ever since by the person in charge of my induction to read this book. But what can one do when one’s “To Read” list is longer than the actual book itself? However, I finally got round to it, and I’m glad I did. Here are my thoughts on it.

The book in question is entitled “The Last Lecture” and was written by Randy Pausch. Randy was a computer scientist (like me!), a professor at CMU (like me!), with significant prior experience in industry (also like me!) – a kindred soul, it seems. In August 2007, he was told that, as a result of pancreatic cancer, he had only 3–6 months left to live. The following month he gave his final lecture at CMU in Pittsburgh and then wrote this book about that event.

His final lecture was entitled “Really Achieving Your Childhood Dreams”. During this talk he showed approximately 60 slides, each with a single picture meant to, in one way or another, reference a childhood dream that he was able to fulfil or at least attempted to fulfil. But, as he states at the end of the book, the lecture topic was really a feint (or “head fake”, to use his NFL terminology) masking his primary aim: to give a lecture for his three children, aged 6, 3, and 18 months, so that they could see who their father was, and to pass down the wisdom he had accrued in his life – wisdom he would have liked to give them gradually over the years. It was heart-wrenching for him to think of his children growing up without their father present and without solid memories of him. So, he wanted to give them something concrete to look back on in the years after his death.

randy-pausch-photo
A photo of Randy Pausch

Randy managed to squeeze a lifetime of thoughts into a 60-minute talk – but a few things definitely stood out for me.

Firstly, there was his career as an educator, rather than as an academic. He definitely emphasised the former over the latter. Professor Pausch had a passion for teaching. He was damn good at it, too. The stories he tells about how he inspired students throughout the years, and about his (sometimes unorthodox) teaching methods, are stirring and stimulating to a fellow educator like myself. He strove to make a difference in each and every student’s life. In a way, he felt like he was an extension of their parents, and that it was his duty to convey to students as much as he could, including things like life experiences. Yes, he was a true educator, and he shows this well in his book. He undoubtedly wanted his children to know this part of him. He wanted them to be proud of his passion and of his great adeptness at it.

Another thing that stood out for me was the wisdom conveyed in this book. When faced with death, any honest person is going to make significant re-evaluations of their values, will inevitably see and experience things from a different perspective, and will undoubtedly view past experiences in a different light. It is always worth reading the thoughts of such a person because you know that they will be rich and profound and definitely not soulless. Randy’s short book is full of such thoughts.

The last thing I want to mention is that “The Last Lecture” is permeated with a fighting spirit that overflows into a celebration of life. Despite staring death in the face, Randy still managed to let an optimistic outlook govern his everyday workings:

Look, I’m not in denial about my situation. I am maintaining my clear-eyed sense of the inevitable. I’m living like I’m dying. But at the same time, I’m very much living like I’m still living.

He lived his final months in this spirit and conveys it well in his book, if only through the simple fact that the book is full of humour. We can learn a lot from such an outlook on life. The man would have been a great guy to have a coffee with in the staff room, for sure.

In conclusion, Professor Pausch achieved his aim of leaving something for his children to remember him by, to be proud of, and to inspire and teach them as they themselves tread through life. At the same time, however, he left a lot for us, too. I can see why I was given this book on my induction day at Carnegie Mellon University. Randy’s children, I’m sure, are proud of him. And now I am proud myself, knowing that I teach at the same institution as he once did.

(Employers please note: this is how you legitimately make an employee want to work overtime after an induction session)


facial-recognition

How Facial Recognition Works – Part 2 (FaceNet)

This post is the second post in my series on “How Facial Recognition Works”. In the first post I talked about the difference between face detection and face recognition, how machines represent (i.e. see) faces, how these representations are generated, and then what happens with them later for facial recognition to work.

Here, I would like to describe a specific facial recognition algorithm – one that changed things forever in this particular domain of artificial intelligence. The algorithm is called FaceNet and it was developed by Google in 2015.

FaceNet was published in a paper entitled “FaceNet: A Unified Embedding for Face Recognition and Clustering” at CVPR 2015 (a world-class conference for computer vision). When it was released, it smashed the records on two top academic facial recognition datasets (Labeled Faces in the Wild and YouTube Faces DB), cutting the error rate of the previous best published results by a whopping 30% (on both datasets!). That is an utterly HUGE margin by which to beat past state-of-the-art algorithms.

FaceNet’s major innovation lies in the fact that it developed a system that:

…directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. (Quote from original publication)

What this means is that it was the first algorithm to develop a deep neural network (DNN) whose sole task was to create embeddings for any face fed through it. That is, any image of a face input into the neural network is mapped to a 128-dimensional vector representation (an embedding) in Euclidean space.

What this also means is that similar-looking faces are clustered/grouped together because they receive similar vector representations. Hence, standard algorithms – e.g. an SVM classifier, a k-nearest-neighbours classifier, or k-means clustering – can be employed directly on the generated embeddings to perform facial recognition or face clustering.

(To understand how such algorithms work, and to understand terms and concepts such as “embeddings” and “vector representations”, please refer to my first post on facial recognition where, as I said earlier, I explain the fundamentals of facial recognition.)
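To make this concrete, here is a minimal sketch (in Python, using scikit-learn) of what “performing recognition on embeddings” might look like. The embeddings, the names, and the choice of a k-nearest-neighbours classifier are purely illustrative assumptions on my part – the point is simply that once every face is a point in Euclidean space, off-the-shelf algorithms do the rest.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Pretend these 128-dimensional embeddings came out of a FaceNet-style
# network for three known people (random toy data, not real faces).
rng = np.random.default_rng(42)
alice = rng.normal(loc=0.0, size=(5, 128))
bob   = rng.normal(loc=1.0, size=(5, 128))
carol = rng.normal(loc=2.0, size=(5, 128))

X = np.vstack([alice, bob, carol])
y = ["alice"] * 5 + ["bob"] * 5 + ["carol"] * 5

# Because similar faces end up close together in the embedding space,
# a simple nearest-neighbour rule is enough to recognise a new face.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

new_face = rng.normal(loc=1.0, size=(1, 128))  # an embedding that "looks like" bob
print(clf.predict(new_face))                   # ['bob'] – the nearest cluster
```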

Another important contribution of Google’s paper was the method it chose to train the deep neural network to generate embeddings for faces. Usually, one trains a DNN for a fixed number of classes, e.g. 10 different types of objects to be detected in images. You would collect a large number of example images from each of these 10 classes and tell the DNN during training which image contained which class. In tandem, you would use, for instance, a cross-entropy loss function to indicate the error rate of the model being trained – i.e. how far away you are from an “ideally” trained neural network. However, because this neural network is going to be used to generate embeddings (rather than, say, to state which of 10 particular objects is in an image), you don’t really know how many classes you are training your DNN for. It’s a different problem that you are trying to solve, so you need a different loss function – something designed specifically for generating embeddings. In this respect, Google opted for a triplet-based loss function.

The idea behind the triplet-based loss function is to, during the training phase, take three example images from the training data:

  • A random image of a person – we call this image the anchor image
  • A random but different image of the same person – we call this image the positive image
  • A random image of another person – we call this image the negative image.

During training, then, embeddings are created for these three images, and the triplet-based loss function’s task is to minimise the distance (in the Euclidean space) between the anchor and the positive image while ensuring the distance between the anchor and the negative image is larger (by at least a set margin). The following image from the original publication depicts this idea:

triplet-based-loss-function

Notice that the negative image is initially closer to the anchor than the positive image. The neural network then adjusts itself so that the positive image ends up closer to the anchor than the negative one, and the process is repeated for many different anchor, positive, and negative triplets.
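For the record, the loss minimised for each triplet can be written as max(0, ‖f(a) − f(p)‖² − ‖f(a) − f(n)‖² + α), where f(·) is the embedding function and α is the margin. Below is a minimal NumPy sketch of that single formula; the embeddings are random toy data, the margin value is merely illustrative, and I am ignoring the careful triplet selection that the real training pipeline performs.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss for a single triplet:
    max(0, ||a - p||^2 - ||a - n||^2 + margin).

    Minimising this pulls the positive embedding towards the anchor and
    pushes the negative away until it is at least `margin` (in squared
    Euclidean distance) further out than the positive."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(0.0, pos_dist - neg_dist + margin)

# Toy, L2-normalised 128-dimensional embeddings (in FaceNet these would
# come out of the deep neural network).
rng = np.random.default_rng(0)
a = rng.normal(size=128); a /= np.linalg.norm(a)             # anchor
p = a + 0.05 * rng.normal(size=128); p /= np.linalg.norm(p)  # positive
n = rng.normal(size=128); n /= np.linalg.norm(n)             # negative

print(triplet_loss(a, p, n))  # 0.0: this triplet is already well separated
```

In the real system this is computed over batches of carefully selected “hard” triplets inside the training loop, but the objective itself is exactly this simple.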

Employing the triplet-based loss function to guide the training of the DNN was an incredibly intelligent move by Google. Likewise was the decision to use a DNN to generate embeddings for faces outright. It really is no surprise that FaceNet burst onto the scene like it did and subsequently laid a solid foundation for facial recognition. The current state of this field owes an incredible amount to this particular publication.

If you would like to play around with FaceNet, take a look at its GitHub repository here.

(Update: part 1 of this post can be found here)
