During my thesis research in the ’80s, I started thinking about rational decision-making and the problem that it’s actually impossible. If you were rational you would think: Here’s my current state, here are the actions I could do right now, and after that I can do those actions and then those actions and then those actions; which path is guaranteed to lead to my goal? The definition of rational behavior requires you to optimize over the entire future of the universe. It’s just completely infeasible computationally.
It didn’t make much sense that we should define what we’re trying to do in AI as something that’s impossible, so I tried to figure out: How do we really make decisions?
So, how do we do it?
One trick is to think about a short horizon and then guess what the rest of the future is going to look like. So chess programs, for example—if they were rational they would only play moves that guarantee checkmate, but they don’t do that. Instead they look ahead a dozen moves into the future and make a guess about how useful those states are, and then they choose a move that they hope leads to one of the good states.
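The short-horizon idea can be sketched in code. This is an illustration only, not how any real chess program is written: it swaps chess for a toy take-away game (players alternately remove 1 to 3 stones, and whoever takes the last stone wins), and the names `lookahead` and `guess` are hypothetical. The search looks a fixed number of moves ahead and then falls back on a guess about how good the resulting state is.

```python
from typing import Callable, Optional, Tuple

def lookahead(pile: int, depth: int, maximizing: bool,
              evaluate: Callable[[int, bool], float]) -> Tuple[float, Optional[int]]:
    """Depth-limited minimax: return (value, best_move) for the player to move."""
    if pile == 0:
        # The previous player took the last stone and has already won.
        return (-1.0 if maximizing else 1.0), None
    if depth == 0:
        # Horizon reached: stop searching and guess the value of this state.
        return evaluate(pile, maximizing), None
    best_value: Optional[float] = None
    best_move: Optional[int] = None
    for take in (1, 2, 3):
        if take > pile:
            break
        value, _ = lookahead(pile - take, depth - 1, not maximizing, evaluate)
        if best_value is None or (value > best_value if maximizing else value < best_value):
            best_value, best_move = value, take
    return best_value, best_move

def guess(pile: int, maximizing: bool) -> float:
    """Heuristic stand-in for the rest of the future (an assumption for this toy
    game): facing a multiple of 4 is treated as bad for the player to move."""
    mover_in_trouble = pile % 4 == 0
    if maximizing:
        return -0.5 if mover_in_trouble else 0.5
    return 0.5 if mover_in_trouble else -0.5

value, move = lookahead(pile=5, depth=2, maximizing=True, evaluate=guess)
```

With a pile of 5 and a two-move horizon, the search prefers taking one stone: the other options let the opponent win immediately, while this one leads only to states the heuristic rates as promising.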
Another thing that’s really essential is to think about the decision problem at multiple levels of abstraction, so-called “hierarchical decision making.” A person does roughly 20 trillion physical actions in their lifetime. Coming to this conference to give a talk works out to 1.3 billion actions or something. If you were rational you’d be trying to look ahead 1.3 billion steps, which is completely, absurdly impossible. So the way humans manage this is by having this very rich store of abstract, high-level actions. You don’t think, “First I can either move my left foot or my right foot, and then after that I can either…” You think, “I’ll go on Expedia and book a flight. When I land, I’ll take a taxi.” And that’s it. I don’t think about it anymore until I actually get off the plane at the airport and look for the sign that says “taxi”; then I get down into more detail. This is how we live our lives, basically. The future is spread out, with a lot of detail very close to us in time, but further out there are these big chunks where we’ve made commitments to very abstract actions, like “get a Ph.D.” or “have children.”
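A minimal sketch of this kind of lazy, hierarchical refinement, under stated assumptions: the action names and the `REFINEMENTS` table are invented for illustration, and real planners are far richer. The point is only that the front of the plan is expanded into detail while later steps stay abstract until they are reached.

```python
from typing import Dict, List

# Hypothetical library of abstract actions and how each breaks down.
# Anything not listed here is treated as a primitive, directly doable step.
REFINEMENTS: Dict[str, List[str]] = {
    "give talk at conference": ["book flight", "fly", "take taxi", "give talk"],
    "book flight": ["open travel site", "choose flight", "pay"],
    "take taxi": ["find taxi sign", "queue for taxi", "ride to venue"],
}

def refine_front(plan: List[str]) -> List[str]:
    """Expand only the first step until it is primitive; leave the rest abstract."""
    plan = list(plan)
    while plan and plan[0] in REFINEMENTS:
        plan[0:1] = REFINEMENTS[plan[0]]
    return plan

def run(goal: str) -> List[str]:
    """Carry out a goal, adding detail to each step only when it comes up next."""
    plan, done = [goal], []
    while plan:
        plan = refine_front(plan)
        if plan:
            done.append(plan.pop(0))
    return done

first_look = refine_front(["give talk at conference"])
```

After the first refinement, `first_look` starts with the concrete steps for booking the flight, while "take taxi" is still a single abstract item. It is broken down only once the earlier steps are done.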
What about differences in human values?
That’s an intrinsic problem. You could say machines should err on the side of doing nothing in areas where there’s a conflict of values. That might be difficult. I think we will have to build in these value functions. If you want to have a domestic robot in your house, it has to share a pretty good cross-section of human values; otherwise it’s going to do pretty stupid things, like put the cat in the oven for dinner because there’s no food in the fridge and the kids are hungry. Real life is full of these tradeoffs. If the machine makes these tradeoffs in ways that reveal that it just doesn’t get it—that it’s just missing some chunk of what’s obvious to humans—then you’re not going to want that thing in your house.
I don’t see any real way around the fact that there’s going to be, in some sense, a values industry. And I also think there’s a huge economic incentive to get it right. It only takes one or two things like a domestic robot putting the cat in the oven for dinner for people to lose confidence and not buy them.
You’ve argued that we need to be able to mathematically verify the behavior of AI under all possible circumstances. How would that work?
One of the difficulties people point to is that a system can arbitrarily produce a new version of itself that has different goals. That’s one of the scenarios that science fiction writers always talk about; somehow, the machine spontaneously gets this goal of defeating the human race. So the question is: Could you prove that your systems can’t ever, no matter how smart they are, overwrite their original goals as set by the humans?
It would be relatively easy to prove that the DQN system, as it’s written, could never change its goal of optimizing that score. Now, there is a hack that people talk about called “wire-heading” where you could actually go into the console of the Atari game and physically change the thing that produces the score on the screen. At the moment that’s not feasible for DQN, because its scope of action is entirely within the game itself; it doesn’t have a robot arm. But that’s a serious problem if the machine has a scope of action in the real world. So, could you prove that your system is designed in such a way that it could never change the mechanism by which the score is presented to it, even though it’s within its scope of action? That’s a more difficult proof.
Are there any advances in this direction that you think hold promise?
There’s an emerging area called “cyber-physical systems” that deals with systems that couple computers to the real world. With a cyber-physical system, you’ve got a bunch of bits representing an air traffic control program, and then you’ve got some real airplanes, and what you care about is that no airplanes collide. You’re trying to prove a theorem about the combination of the bits and the physical world. What you would do is write a very conservative mathematical description of the physical world (airplanes can accelerate within such-and-such envelope), and your theorems would still be true in the real world as long as the real world is somewhere inside the envelope of behaviors.
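As a rough illustration of the envelope idea, here is a deliberately simplified sketch. It assumes two aircraft on a single straight track and bounds only their speeds rather than their accelerations; the `Envelope` class, the units, and the numbers are all hypothetical. The check looks at the worst case the envelope allows (the rear aircraft as fast as possible, the front one as slow as possible), so if it passes, separation holds for any real behavior inside the envelope.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    position: float  # km along the shared track at time 0
    v_min: float     # slowest speed the model allows, km/min
    v_max: float     # fastest speed the model allows, km/min

def provably_separated(rear: Envelope, front: Envelope,
                       horizon: float, min_sep: float) -> bool:
    """True if every behavior inside both envelopes keeps the aircraft at least
    min_sep apart for all times in [0, horizon]."""
    if rear.position > front.position:
        raise ValueError("rear aircraft must start behind the front aircraft")

    def worst_gap(t: float) -> float:
        # Smallest gap the envelopes permit at time t.
        return (front.position + front.v_min * t) - (rear.position + rear.v_max * t)

    # worst_gap is linear in t, so its minimum over [0, horizon] is at an endpoint.
    return min(worst_gap(0.0), worst_gap(horizon)) >= min_sep

rear = Envelope(position=0.0, v_min=12.0, v_max=15.0)
front = Envelope(position=50.0, v_min=11.0, v_max=14.0)
safe_for_10_min = provably_separated(rear, front, horizon=10.0, min_sep=9.0)
safe_for_11_min = provably_separated(rear, front, horizon=11.0, min_sep=9.0)
```

The guarantee is only as good as the envelope: if a real aircraft flies faster than `v_max` or slower than `v_min`, the result no longer applies, which is why the description has to be conservative.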