The Bitter Lesson

Original link: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

The Bitter Lesson
Rich Sutton
March 13, 2019
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent.
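
To make the scale of "exponentially falling cost per unit of computation" concrete, here is a minimal arithmetic sketch. The two-year doubling time is an illustrative assumption, not a figure from the essay, and `compute_multiplier` is a hypothetical helper name.

```python
def compute_multiplier(years: float, doubling_time_years: float = 2.0) -> float:
    """Factor by which compute per unit cost grows after `years`.

    Assumes a fixed doubling time (2 years by default); this default is an
    illustrative assumption, not a number taken from the essay.
    """
    return 2.0 ** (years / doubling_time_years)


# Compare a short project horizon with the longer horizons the essay discusses.
for horizon in (2, 10, 20):
    factor = compute_multiplier(horizon)
    print(f"{horizon:>2} years -> about {factor:,.0f}x compute per unit cost")
```

Under that assumption, a method that can absorb extra computation has roughly a thousand times more of it to use after 20 years, while a method tuned to a fixed budget gains nothing from the same trend.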

In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that “brute force” search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
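
As a rough illustration of what "search" means here, the sketch below runs exhaustive game-tree search (negamax with memoization) on a toy subtraction game. The game, its rules, and the names `MOVES`, `negamax`, and `best_move` are assumptions chosen for brevity; this is not the algorithm of the 1997 chess program or of any Go program. The point is only that the code contains no strategic knowledge of the game beyond its rules, and its strength comes entirely from computation.

```python
from functools import lru_cache

# Hypothetical toy rules: players alternate removing 1, 2, or 3 stones;
# whoever takes the last stone wins.
MOVES = (1, 2, 3)


@lru_cache(maxsize=None)
def negamax(stones: int) -> int:
    """Return +1 if the player to move can force a win, otherwise -1."""
    if stones == 0:
        # The previous player took the last stone, so the player to move has lost.
        return -1
    # A position is worth the best of the negated values of its successors.
    return max(-negamax(stones - m) for m in MOVES if m <= stones)


def best_move(stones: int) -> int:
    """Pick the move whose resulting position is worst for the opponent.

    Requires stones >= 1, since there is no legal move from an empty pile.
    """
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -negamax(stones - m))


print(negamax(4))    # value of a 4-stone position for the player to move
print(best_move(7))  # move chosen from a 7-stone position
```

The search rediscovers the well-known rule for this game (leave the opponent a multiple of four) without being told it, which is the pattern the essay describes at a much larger scale.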

In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge: knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked (they tried to put that knowledge in their systems), but it proved ultimately counterproductive, and a colossal waste of researchers’ time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.

In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
