Did you know that there is now a search engine for datasets that is powered by Google? Well, there is! And it’s something that the research community and the industry have been needing (whether they knew it or not) for years now.
This new search engine is called Dataset Search and can be found at this link.
This is a big deal. Datasets have become crucial since the prominent arrival of deep learning onto the scene a few years ago. Deep learning needs data. Lots and lots of data. This is because in deep learning, neural networks are told to (more or less) autonomously discover the underlying patterns in data. In computer vision, for example, you would want a machine to learn that bicycles are composed of two wheels, a handlebar, and a seat. But you need to provide enough examples for a machine to be able to learn these patterns.
Creating such large datasets is not an easy task. Some of the top image datasets (as I have documented here) contain millions of hand-annotated images. These are famous datasets that most people in the computer vision world know about. But what about datasets that are more niche and hence less well known? Some of these can be very difficult to find, and you certainly would not want to spend months or years creating one only to discover that someone had already gone to all that trouble before you.
Until now, there was no central location to search for these datasets. You had to manually traverse the web in the hope of finding what you were looking for. But that was until Dataset Search came along! Thank the heavens for that. Although Dataset Search is still in its beta stage, this is definitely something the research and industry communities have been needing.
For datasets to be listed in a coherent and informative manner on Dataset Search, Google has developed guidelines for dataset providers. These guidelines are based on schema.org, which is an open standard for describing such information (in metadata tags). As Google states:
We encourage dataset providers, large and small, to adopt this common standard so that all datasets are part of this robust ecosystem.
It would be a good idea to start adhering to these guidelines when creating datasets because a central place of reference for datasets is something we all need.
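To give a rough idea of what such metadata can look like, here is a minimal sketch in Python that builds a schema.org `Dataset` description and wraps it in a JSON-LD script tag for a dataset's web page. The property names (`name`, `description`, `url`, `license`, `creator`, `keywords`) come from the schema.org `Dataset` type. The helper functions and all of the example values (the dataset name, URLs, and author) are made up for illustration, so check Google's guidelines for the properties they actually require or recommend.

```python
import json


def build_dataset_jsonld(name, description, url, license_url, creator_name, keywords):
    """Return a schema.org Dataset description as a plain dict.

    Hypothetical helper: the keys are schema.org Dataset properties,
    but this function itself is not part of any library.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "creator": {"@type": "Person", "name": creator_name},
        "keywords": list(keywords),
    }


def to_script_tag(metadata):
    """Wrap the metadata in a JSON-LD script tag for embedding in HTML."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# All values below are placeholders, not a real dataset.
example = build_dataset_jsonld(
    name="Example Bicycle Images",
    description="A hypothetical collection of hand-annotated bicycle photos.",
    url="https://example.com/datasets/bicycles",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Jane Doe",
    keywords=["computer vision", "bicycles", "image classification"],
)

snippet = to_script_tag(example)
print(snippet)
```

The printed snippet would be pasted into the HTML of the page that describes the dataset, which is where a crawler would look for it.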
As a side note, Dataset Search has been in development for at least three years (interestingly, Dataset Search’s previous name was actually Goods – Google Dataset Search). Google released two academic papers on this in 2016 – see here and here. It’s nice to see that their work has finally culminated in what they are offering us now.
Dataset Search is definitely a step in the right direction.