Remember “big data”? Yeah. Ugh. As if having too much data was the problem.
“Now he’s really gone off the deep end!”, I hear you say. “We are drowning in data!”
Well, yes, but consider this: back in 2015, Peter Sweeney wrote an excellent article titled “Where Big Data Fails.. and Why”. He points out that most current self-learning technologies require large amounts of data in order to “learn” something (i.e. to estimate their parameters). So why is this a problem?
Here’s the rub: while we do have tons of data, these data are usually far from homogeneous. It is actually lots and lots of small data sets in most real-world scenarios. And learning from lots of small data sets is something quite different from learning from one big data set. For example, consider personalization: while we do have lots of people who generate lots of data (e.g. what do they click on?), we typically only have very little individual-level data. Just think about the number of news articles you read per week. It may be a lot, but it is probably not 10,000 each week, right? Most likely not even one hundred. So how is one of these parameter-rich algorithms supposed to learn how your interests may or may not shift from week to week, if it only has such a small data set to learn from?
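To put a rough number on this, here is a minimal sketch of how uncertain an estimate is when it rests on only a handful of observations. It uses the textbook standard error of a proportion; the click rate of 0.3 and the sample sizes are made-up assumptions for illustration, not figures from any real system.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent observations."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical: a reader clicks on 30% of the articles shown to them.
assumed_click_rate = 0.3

# 20 and 100 stand in for one person's weekly reading, 100,000 for "big data".
for n in (20, 100, 100_000):
    margin = 1.96 * standard_error(assumed_click_rate, n)  # approx. 95% interval
    print(f"n={n:>7}: click rate = {assumed_click_rate:.2f} +/- {margin:.3f}")
```

The margin shrinks only with the square root of the number of observations, so an estimate built on 20 data points stays very coarse. And this is a single parameter; a parameter-rich model has to estimate many of them from the same 20 points.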
Isn’t there something that can learn from 10 or 20 examples instead of 100,000? Or from just one example? Not “big data” but “small data”?
I used Mergeflow’s tech discovery software to look into this. Here are some of my findings:
There are quite a number of companies that aim to “un-silo” (i.e. connect) small datasets. The idea is that patterns are easier to see once you can analyze and search across previously separate datasets. Of course, this only works if the resulting combined dataset is somewhat homogeneous. But since I was interested in scenarios where this is not the case, I had to keep looking.
I did find one analytics company: Primal, based in Kitchener (Ontario). They specifically point out that their algorithms can learn from small amounts of data. How small? They do not specify a number (at least I could not find one), but it sounds like they really mean 5, 10, or 20, rather than 10,000. From what I could learn, it seems like they use rule-based approaches to constrain what their models learn. And such constraints mean you need less data, because the data only have to settle the few things the rules leave open.
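Why would constraints reduce the amount of data you need? Here is a minimal sketch of the general idea. It is not Primal's method, which I do not know in detail. The data points, the assumed "rule" (the output is proportional to the input), and all function names are my own assumptions for illustration. A constrained model with one parameter and an unconstrained degree-4 polynomial both learn from the same five noisy examples, and both are then asked about a new input.

```python
# Five noisy observations of a quantity that (by assumption) follows y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.3, 3.6, 6.5, 7.7, 10.2]

def fit_proportional(xs, ys):
    """Constrained model: the rule y = slope * x leaves one parameter to learn."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: slope * x

def fit_interpolating_polynomial(xs, ys):
    """Unconstrained model: the degree-4 polynomial through all five points
    (evaluated in Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

constrained = fit_proportional(xs, ys)
flexible = fit_interpolating_polynomial(xs, ys)

# Ask both models about an input they have not seen; the assumed truth is 2 * 7.
x_new, y_true = 7.0, 14.0
constrained_error = abs(constrained(x_new) - y_true)
flexible_error = abs(flexible(x_new) - y_true)
print(f"constrained model error at x=7: {constrained_error:.2f}")
print(f"flexible model error at x=7:    {flexible_error:.2f}")
```

The flexible model reproduces all five training points exactly, noise included, and that is precisely why it goes astray on the new input. The constrained model cannot chase the noise, so five points are enough for it. The price is that the rule has to be roughly right in the first place.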
R&D on small data
After looking for companies, I turned to R&D (i.e. scientific publications, research papers). There, I found a concept called “one-shot learning”. The idea behind one-shot learning is to build algorithms that can learn from just one, or at least just a few, examples.
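To give a feel for what "learning from one example" can look like, here is a minimal sketch: a nearest-neighbor classifier that gets exactly one labeled example per class. The three-number feature vectors and the class names are made up for illustration. Systems described in the research literature typically learn such feature representations from other, related data first; this sketch simply assumes they are given.

```python
import math

# One labeled example per class. The feature vectors are hypothetical.
support_set = {
    "cat": (0.9, 0.1, 0.2),
    "car": (0.1, 0.9, 0.3),
    "tree": (0.2, 0.2, 0.9),
}

def classify_one_shot(query, support_set):
    """Return the label whose single example is closest to the query
    (Euclidean distance)."""
    return min(support_set, key=lambda label: math.dist(query, support_set[label]))

# A new, unlabeled item that lies close to the single "cat" example.
prediction = classify_one_shot((0.8, 0.2, 0.1), support_set)
print(prediction)
```

Note where the "knowledge" sits in this sketch: almost all of it is in the feature representation, which has to place similar things close together before the single example becomes useful.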
In order to get an overview of one-shot learning and related research questions, I’d suggest a relatively recent article by Ben Lorica and Mike Loukides, “Building tools for the AI applications of tomorrow”. If you want to do a deep-dive, I’d recommend “Building machines that learn and think like people” by Josh Tenenbaum and colleagues (or watch a video lecture to see Josh in action).
So, what do people do with one-shot learning?
In order to find out, I used Mergeflow’s patent class and technology detection algorithm. Results included image processing and robotics, drug discovery, and machine translation. There were others too, but let’s look at some examples for these applications for now.
Image processing and robotics
Image processing in the context of robotics, for example…
Other applications of one-shot learning in image processing include semantic segmentation of images. Semantic segmentation means assigning a class label (e.g. person, object, or landmark) to each pixel of an image, so that the regions belonging to each class are marked out.
Drug discovery
Most “AI disrupting drug discovery” stories tend to be about how very sophisticated algorithms help analyze very large datasets. So I found it all the more interesting to find “small data” approaches there too. Examples include “Low Data Drug Discovery with One-Shot Learning” and “Molecular Structure-Based Large-Scale Prediction of Chemical-Induced Gene Expression Changes”. Just think about what this might mean for personalized medicine, for example.
Machine translation
I think that machine translation is a (or the) prototypical example of how having lots and lots of data drastically improves algorithm performance. But even there, I found approaches like “Towards one-shot learning for rare-word translation with external experts”.
What have I learned?
One lesson I learned is that I am certainly not the only one who finds “small data” interesting. There does seem to be a vibrant community of people working on “small data”, or one-shot learning, even if this community is, so far, relatively small (no pun intended).
But I also have a question: if we use rules or heuristics to constrain what a model can learn, how much knowledge have we then put into the system ourselves? Is this not some form of cheating? In other words, how much of an algorithm’s “learning capability” comes from our model design, and how much of it is truly learnt? This, to me, is one of the many elephants in the room. What do you think? Do you agree?