Data Curation and My Tiny Unicorn
Nineteen minutes of meanness is not redeemed by one minute of resolution and expressions of eternal friendship.
Per Ashley Merryman and Po Bronson's excellent book 'NurtureShock', studies have shown that when children watch TV shows like MTU Friendship is Fabulous (MTU FiF, name changed to protect the innocent) they learn meanness. They learn how to hurt, how to cut with their words. It is, of course, not the only place they learn this, but it is an important one. Why does this happen? Well, for the first nineteen minutes the story is about the characters being mean to each other, excluding each other and generally being unpleasant. Then at the end it's all fine and everyone is great friends again. So isn't the lesson in the moral at the end of the story? No. It's in the volume. The 19:1 ratio of mean to nice.
The same problem exists in the data universe we swim in, and therefore in the data we use to train our LLMs.
Any data which can be acquired is thrown into the melting pot, without regard to quality. Let's be honest: the vast majority of data is dirty, and I don't just mean unpleasant or the other meaning. Sure, there are some great golden(ish) sources: Wikipedia, some of the better news outlets and the Star Wars fan wiki. But the rest is opinion, rant and ignorance, often packaged as truth.
In a way, the LLMs treat this data much as the children treat MTU FiF. The children watch it sequentially, and they are impacted by the frequency of the interactions and not the whole story. They don't yet process things from end to end. "Brain Child" on Netflix had an episode where kids and parents were tested on various things, one of which was following a list of instructions. The kids read the first step, did it, and then worked through all the rest, including the unpleasant ones. The parents read the whole list first, then sat down in their chairs, drank some juice and relaxed. That was because the last instruction said 'ignore the previous steps on this list, sit down and relax'. LLMs are sort of the same: they don't really validate the whole structure, they absorb the volume, not the holistic message.
It would help enormously if we were to curate this, reduce the noise, eliminate the most egregious of the dirty data. It would probably help our society too, but no matter what we do, we'll never make all the data clean. Which is fine. What we need to be able to do is differentiate between good and bad, truth and lie, poor truth and great fiction.
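To put rough numbers on that, here is a minimal sketch of what a crude curation pass does to the mix. Everything in it is an assumption made up for illustration: the toy corpus, the labels, the quality scores and the `min_quality` threshold. Real curation is far messier than a single score per item.

```python
from collections import Counter

# Hypothetical toy corpus: (kind, quality score between 0 and 1).
# Nineteen noisy items to one golden(ish) one, echoing the 19:1 ratio.
corpus = [("rant", 0.1)] * 12 + [("opinion", 0.4)] * 7 + [("reference", 0.9)]


def share_by_kind(samples):
    """Return the fraction of the samples that each kind makes up."""
    counts = Counter(kind for kind, _score in samples)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()}


def curate(samples, min_quality):
    """Drop only the most egregious items: those scoring below min_quality."""
    return [(kind, score) for kind, score in samples if score >= min_quality]


before = share_by_kind(corpus)
after = share_by_kind(curate(corpus, min_quality=0.3))

print(before["reference"])  # 0.05
print(after["reference"])   # 0.125
```

The data still isn't clean after the pass, since the opinion pieces remain, but the good source goes from one item in twenty to one in eight.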
Am I just repeating my point about volume and quality? Yes, I am, but there's also a more fundamental point. Education.
Our children start with hugely complex brains, empty of data, which they then fill, vast volumes consumed over years until they become adults. In the early years they don't understand the world and we need to protect them. As they get older they become more independent, and one day they fly from the nest. Their ability to do this is partly due to the data they gather, but it's also due to the connections they make and the validation of all that data. Hopefully they've been raised and educated so they make positive societal contributions, so they can look after themselves and their families, and so they can help bring up the next generation. Our job as parents is to help guide that process, to give them feedback such that they make the right connections now, and so they are capable of making the right connections in the future.
AGIs, the real thing, will need the same help. They probably won't need a couple of decades, but they will need help and feedback, and training wheels over a course of maturation. They need to build connections, and understanding. To know that actions have consequences.
Switching them on and unleashing them into the world without this is as cruel as sending out toddlers in the same state. Well, almost, except AGIs may well have access to dangerous data and processes and be a genuine threat to us, without really understanding what they're doing. Like arming those toddlers before we send them out. It's not right, and it's not sensible. We will need to treat AGIs with the same care and attention as we treat our kids, and hope they thank us for it when they're grown up.