In recent years, artificial intelligence has made dramatic breakthroughs, drawing strong interest from students and practitioners across many technical fields. For anyone getting started in AI, The Master Algorithm is arguably a must-read, and its importance needs little elaboration. The author, Pedro Domingos, may appear to give only a rough tour of the main schools of thought in machine learning, yet nearly every important application, existing or yet to come, gets a mention. The book suits beginners who want a bird's-eye view of the field, while planting countless pointers that let motivated readers dig deeper into specific technical problems; it is a rare guiding introduction, and its witty writing style makes it a pleasure to read.
With this book as the reading material, the 机器之心 AI study group for beginners (人工智能研学社 · 入门组) will officially open soon (see how to join)! We invite all beginners interested in artificial intelligence and machine learning to join us and, through reading and discussing The Master Algorithm, gain a broad, holistic view of the history and technical principles of AI.
Chapter 4 Summary
Hebb's Rule, stated by the Canadian psychologist Donald Hebb, is the cornerstone of connectionism; the book paraphrases it as “neurons that fire together wire together.” Connectionists believe that “learning is what the brain does, so what we need to do is reverse engineer it,” which leads to the view that knowledge is stored in the connections between neurons. This understanding of how the brain is built inspired Frank Rosenblatt to invent the perceptron model in the 1950s.
With some simplified abstractions, the perceptron has parts corresponding to the neuron, with the weights carrying additional meaning. The grandmother-cell example is a good illustration of how the perceptron algorithm learns its weights. The perceptron was at first regarded as a powerful general-purpose learning algorithm, but people later found that it cannot learn the XOR function, and that it offers no way to learn layers of interconnected neurons or to solve the credit-assignment problem. The book summarizes the perceptron as “mathematically unimpeachable, searing in clarity, and disastrous in effects.”
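To make the weight-learning idea concrete, here is a minimal perceptron sketch in Python (our own illustration, not code from the book; the helper names and toy datasets are hypothetical). Trained with the classic perceptron rule, it fits the linearly separable OR function but never settles on a correct answer for XOR, which is exactly the limitation noted above.

```python
# A minimal perceptron sketch: a single unit with weighted inputs and a
# step-function threshold, trained with the classic perceptron rule.
def step(x):
    return 1 if x >= 0 else 0

def train_perceptron(samples, epochs=20, lr=0.1):
    w = [0.0, 0.0]   # one weight per input
    b = 0.0          # bias (threshold)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            y = step(w[0] * x1 + w[1] * x2 + b)
            err = target - y
            # Perceptron rule: nudge weights toward the inputs whenever the unit errs
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

OR_data  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for name, data in [("OR", OR_data), ("XOR", XOR_data)]:
    w, b = train_perceptron(data)
    preds = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in data]
    print(name, "predictions:", preds)   # OR is learned; XOR is not
```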
In 1982, the Caltech physicist John Hopfield noticed an inspiring analogy between the brain and spin glasses, which led to the first algorithm for the credit-assignment problem. Since then, machine learning has been replacing knowledge engineering. The idea is to define a type of neural network that evolves over time and whose memories are its minimum-energy states. The most significant application was in pattern recognition: the network can recognize a distorted image by letting it converge to the ideal one.
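As a rough picture of the energy-minimization idea, the sketch below (our own toy, not the book's) stores one pattern with a Hebbian rule and lets a distorted copy settle back into it; the energy printed before and after shows the network sliding into a minimum-energy memory.

```python
# A toy Hopfield-style network: store one pattern with the Hebbian
# outer-product rule, then recover it from a corrupted starting state.
import numpy as np

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])   # the "ideal" image, as +/-1 pixels
W = np.outer(pattern, pattern).astype(float)        # Hebbian storage
np.fill_diagonal(W, 0)                               # no self-connections

def energy(state):
    return -0.5 * state @ W @ state

state = pattern.copy()
state[[0, 3, 6]] *= -1                               # distort three "pixels"
print("energy of distorted input:", energy(state))

for _ in range(5):                                   # asynchronous threshold updates
    for i in range(len(state)):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("energy after settling:   ", energy(state))
print("recovered original?      ", np.array_equal(state, pattern))
```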
However, neural connections are not quite the same as spin interactions: spin interactions are symmetric and deterministic, while neural connections are not. The Boltzmann machine offered a good solution, modeling the statistical properties of neurons by assigning a probability distribution over all states of the neural network. In principle the Boltzmann machine solved the credit-assignment problem, but its slow and painful learning made it impractical for most applications.
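The statistical flavor can be seen in the unit update rule itself. The sketch below illustrates only the stochastic unit at the heart of a Boltzmann machine (our own illustration, not a full trained model): a unit turns on with a probability given by an S-shaped function of its summed input, controlled by a temperature.

```python
# Stochastic unit of a Boltzmann machine: at high temperature the unit fires
# almost at random; at low temperature it approaches a deterministic threshold.
import math
import random

def p_on(weighted_input, temperature=1.0):
    """Probability that the unit switches on, given the summed input from its neighbours."""
    return 1.0 / (1.0 + math.exp(-weighted_input / temperature))

random.seed(0)
for T in (4.0, 1.0, 0.25):
    samples = [1 if random.random() < p_on(0.8, T) else 0 for _ in range(10000)]
    print(f"T={T}: unit is on {sum(samples) / len(samples):.2f} of the time")
```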
A newer solution to the credit-assignment problem is to replace the perceptron's step function with an S curve, a nice halfway house between the dumbness of the linear function and the hardness of the step function. The book devotes a whole section to the importance of the S curve in mathematics and a variety of related fields. The S curve reveals how error can be propagated to the hidden neurons: each neuron decides how much more or less to fire, and based on that decision it can strengthen or weaken its own connections.
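For reference, the S curve discussed here is the logistic (sigmoid) function, and its derivative takes a conveniently simple form, which is what makes propagating error through hidden neurons tractable:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\frac{d\sigma}{dx} = \sigma(x)\bigl(1 - \sigma(x)\bigr)
```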
Error propagation turns out to be an important and widely used strategy in both nature and technology. For a multilayer perceptron, propagating the error from layer to layer is a very efficient way to assign it. However, backpropagation does not solve the machine-learning problem outright, because it cannot guarantee finding the global error minimum, and a local minimum can in principle be arbitrarily bad. That said, this is not a serious issue, since a local minimum works fine most of the time.
Because of the properties of hyperspace, a local minimum is less likely to have overfitted the data than the global one, and there are probably many very different perceptrons that learn just as well. The perceptron model with an S curve now solves the XOR learning problem, and the success of backpropagation led to powerful applications such as NETtalk, stock-market prediction, and CMU's early driverless car.
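A compact backpropagation sketch (our own illustration; the layer sizes, learning rate, and epoch count are arbitrary choices) shows the point made above: a multilayer perceptron of S-curve units, trained by propagating the output error back to the hidden layer, learns XOR where the single perceptron failed.

```python
# Multilayer perceptron with sigmoid units trained by backpropagation on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 8)); b1 = np.zeros((1, 8))        # input -> hidden
W2 = rng.normal(size=(8, 1)); b2 = np.zeros((1, 1))        # hidden -> output
lr = 1.0

for epoch in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the output error back to the hidden layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2).ravel())   # should approach [0, 1, 1, 0]
```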
Now let us return to biology. It was originally hard to find even simple linear relationships between the relevant quantities, but with backpropagation, building a complete model of a cell becomes promising: in principle, backprop can learn the nonlinear functions efficiently if we have a complete map of the cell's metabolic pathways and enough observations of all the relevant variables. The problem then becomes how to do the job with only partial knowledge of the required information. One immediate idea is to use induction to infer the structure of the cell's network from prior knowledge. As for building an artificial brain, which needs a great many layers, backpropagation breaks down because the error signal becomes more and more diffuse.
The author introduces the autoencoder to support the claim that the progress made in connectionism is not just a direct result of faster computation and bigger data. An autoencoder is a multilayer perceptron whose output is the same as its input; the key is to make the hidden layer much smaller than the input and output layers. One problem is that it is hard to learn even with only a single hidden layer. A sparse autoencoder plays a trick: it makes the hidden layer larger than the input and output layers, but forces all but a few of the hidden units to be off at any given time. Another clever idea is to stack sparse autoencoders, where the hidden layer of the first autoencoder becomes the input/output layer of the next.
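The sketch below is a structural illustration of these two ideas (our own untrained toy, not the book's code): an ordinary autoencoder squeezing an 8-dimensional input through a 3-unit hidden layer, and a sparse variant with a larger hidden layer in which all but a few units are forced off, here crudely approximated by keeping only the k strongest activations.

```python
# Structural sketch of an autoencoder and a sparse autoencoder (weights untrained).
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_decode(x, W_enc, W_dec, k=None):
    h = sigmoid(x @ W_enc)                 # hidden code
    if k is not None:                      # sparse variant: all but k units off
        mask = np.zeros_like(h)
        mask[np.argsort(h)[-k:]] = 1.0
        h = h * mask
    return h, sigmoid(h @ W_dec)           # reconstruction (would match x after training)

x = rng.random(8)                          # an 8-dimensional "input image"

# ordinary autoencoder: 8 -> 3 -> 8
W_enc, W_dec = rng.normal(size=(8, 3)), rng.normal(size=(3, 8))
code, recon = encode_decode(x, W_enc, W_dec)
print("compressed code size:", code.shape[0])

# sparse autoencoder: 8 -> 32 -> 8, but only 4 hidden units active at a time
W_enc, W_dec = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
code, recon = encode_decode(x, W_enc, W_dec, k=4)
print("active hidden units:", int((code > 0).sum()))
```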
Without all of the major progress described above, how would our understanding of the brain improve? The author points out that our models are still a long way from the human brain: we need a complete map of the brain, which we do not yet have, and we also need to understand precisely what that map does, which we are even further from achieving. The problem cannot be solved just by scaling up backpropagation; the solution lies in a deeper understanding of how humans actually learn.
Week 4 Q & A Collection
How does the original perceptron model correspond to the neuron?
a. “With some simplified abstractions, the perceptron has parts corresponding to the neuron, with the weights carrying additional meaning.”
What is the core of the Boltzmann machine?
a. “The Boltzmann machine offered a good solution, modeling the statistical properties of neurons by assigning a probability distribution over all states of the neural network.”
Why does the author claim that a local minimum is good enough most of the time?
a. “Because of the properties of hyperspace, a local minimum is less likely to have overfitted the data than the global one, and there are probably many very different perceptrons that learn just as well.”
What is the trick of the sparse autoencoder?
a. “A sparse autoencoder plays a trick: it makes the hidden layer larger than the input and output layers, but forces all but a few of the hidden units to be off at any given time.”
b. “Another clever idea is to stack sparse autoencoders, where the hidden layer of the first autoencoder becomes the input/output layer of the next.”
Why are we still far away from building an artificial brain?
a. See the last paragraph of the review.
What is the difference between deduction and inverse deduction (induction)?
a. Deduction reasons from the general to the specific; inverse deduction goes from the specific to the general.
b. Deduction applies existing knowledge to judge a particular case, while induction distills new knowledge from cases. For example: we know that people need to eat and that Jack Ma is a person, so Jack Ma needs to eat; that is deduction. We know that Jack Ma eats, Wang Jianlin eats, and we ourselves eat, and that we, Jack Ma, and Wang Jianlin are all people, so people need to eat; that is induction.
Chapter 5 Preview
In this chapter, we follow the core idea of evolution to build machine learning algorithms. Will nature's code lift us to a new level? Will ideas borrowed from natural evolution follow a specific mathematical pattern? How will these whims lead us to the Master Algorithm? Even though we are initially inspired by Darwin's arguments, a variety of questions remain open: the definitions of sex, mortality, and the fittest individual, the interactions between creatures, the nature of our algorithm, and so on. It seems these factors may somehow contribute to the success of the derived algorithms, or even supply deeper insights into their nature.
【Important Sections】
Darwin's algorithm: In the 1960s, Holland read Ronald Fisher's classic treatise The Genetical Theory of Natural Selection, elaborated on its core idea, and devised the genetic algorithm. This section introduces the basic theory relating definitions from nature to computer algorithms, and a concrete example, spam filtering, illustrates these concepts.
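As a hedged illustration of these concepts in the spirit of the spam-filter example (our own toy; the word list, training set, and fitness rule are invented), the sketch below evolves bit strings that decide which suspicious words flag an e-mail as spam, using selection, crossover, and mutation.

```python
# Toy genetic algorithm: each genome is a bit string choosing which words flag spam;
# fitness is accuracy on a tiny hand-made training set.
import random

random.seed(42)
WORDS = ["free", "winner", "meeting", "viagra", "report", "prize"]
EMAILS = [({"free", "prize", "winner"}, True), ({"meeting", "report"}, False),
          ({"viagra", "free"}, True), ({"report", "free"}, False),
          ({"winner", "prize"}, True), ({"meeting"}, False)]

def fitness(genome):
    flagged = {w for w, bit in zip(WORDS, genome) if bit}
    preds = [len(words & flagged) >= 2 for words, _ in EMAILS]   # rule: 2+ flagged words = spam
    return sum(p == label for p, (_, label) in zip(preds, EMAILS))

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.1):
    return [bit ^ 1 if random.random() < rate else bit for bit in genome]

population = [[random.randint(0, 1) for _ in WORDS] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                                    # survival of the fittest
    population = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                            for _ in range(10)]

best = max(population, key=fitness)
print("best rule flags:", [w for w, bit in zip(WORDS, best) if bit],
      "accuracy:", fitness(best), "/", len(EMAILS))
```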
The exploration-exploitation dilemma: This section compares exploration with exploitation; each has its own advantages and drawbacks. A genetic algorithm following the exploration pattern is like a band of treasure hunters who search as widely as they can, even though some of them may die along the way; in the end, the hunter holding the most valuable treasure wins. But is exploration really better than exploitation (e.g. the backprop algorithm)?
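The dilemma itself can be felt in a standard toy setting that is not from the book: choosing repeatedly among a few "treasure sites" with unknown payoffs. Pure exploitation locks onto the first site that happens to pay out, while mixing in a little exploration usually discovers the genuinely best one.

```python
# Epsilon-greedy illustration of exploration vs. exploitation on three unknown payoffs.
import random

random.seed(7)
TRUE_PAYOFF = [0.3, 0.5, 0.8]          # hidden average reward of each site

def run(epsilon, steps=2000):
    counts, totals, reward = [0, 0, 0], [0.0, 0.0, 0.0], 0.0
    for _ in range(steps):
        if random.random() < epsilon:                      # explore: try a random site
            arm = random.randrange(3)
        else:                                              # exploit: best estimate so far
            arm = max(range(3), key=lambda i: totals[i] / counts[i] if counts[i] else 0.0)
        r = 1.0 if random.random() < TRUE_PAYOFF[arm] else 0.0
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward / steps

print("pure exploitation:", round(run(epsilon=0.0), 2))
print("10% exploration:  ", round(run(epsilon=0.1), 2))
```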
Survival of the fittest programs: How can evolution be defined as the core of an algorithm? We should design an algorithm that learns and switches between complex representations at each iteration instead of manipulating simple binary strings. The challenge, however, is to assemble the right structure of behaviors with the right parameters. Once we manage that, we can believe such an algorithm is able to learn any program, which points toward the Master Algorithm.
What is sex for: The role of sex in evolution is still a notable mystery. Some believe sex maintains and increases diversity in the population, but the argument is inconclusive. Others hold that sex optimizes mixability, making the whole population more robust under evolution. In machine learning, however, this approach has proved less effective than gradient optimization, so sex may not turn out to be a success for machine learning.
Nurturing nature: Our algorithms, models, and structures are the "nature" side of computation. To get the most out of these models, data is undoubtedly indispensable; nature evolves in response to the nurture it receives, and that is how data nourishes "nature." In the real world, though, nature usually comes first and nurture follows. The key is figuring out how to combine the two.
He who learns fastest wins: In the real world it takes decades for genes, carrying only a little variation, to pass from one generation to the next. For a computer, learning at that pace is far too slow. Even with the power of fast computation, how to speed up this drawn-out process remains a long-standing problem.
【Key Concepts】
Genetic programming
Evolutionary computation
The Baldwin effect
【Quiz】
Name and describe the main features of genetic algorithms.
How does the spam filter work?
What are the advantages and disadvantages of genetic algorithms?
What is the Baldwin effect? How does it relate to machine learning?
Chapter #4 Review
【Chapter Summary】
Hebb's Rule, stated by the Canadian psychologist Donald Hebb, is the cornerstone of connectionism, paraphrased in the book as “Neurons that fire together wire together”. Connectionists think that “learning is what the brain does, so what we need to do is reverse engineer it”, which leads to the belief that knowledge is stored in the connections between neurons. This understanding of how the brain is built inspired Frank Rosenblatt to invent the perceptron model in the 1950s.
With some simplified abstractions, the perceptron has parts corresponding to the neuron, with the weights carrying additional meaning. The grandmother-cell example is a good illustration of how the perceptron algorithm learns its weights. The perceptron was at first thought of as a powerful general-purpose learning algorithm, but people later found that it cannot learn the XOR function, and that there was no way to learn layers of interconnected neurons or to solve the credit-assignment problem. The book summarizes the perceptron as “mathematically unimpeachable, searing in clarity, and disastrous in effects”.
In 1982, an inspiring analogy between the brain and spin glasses was noticed by the Caltech physicist John Hopfield, leading to the first algorithm to solve the credit-assignment problem. Since then, machine learning has been replacing knowledge engineering. The idea is to define a type of neural network that evolves over time and whose memories are its minimum-energy states. One significant application was in pattern recognition: the network could recognize a distorted image by converging to the ideal one.
However, neural connections are not exactly the same as spin interactions: spin interactions are symmetric and deterministic, but neural connections are not. The Boltzmann machine came as a good solution, modeling the statistical properties of neurons by assigning a probability distribution over all states of the neural network. The Boltzmann machine effectively solved the credit-assignment problem in principle, though it was impractical for most applications because of its slow and painful learning.
A new solution to the credit-assignment problem is to replace the perceptron's step function with an S curve, which is “a nice halfway house between the dumbness of linear function and the hardness of the step function”. The book devotes a whole section to the importance of the S curve in mathematics and a variety of related fields. The S curve unlocks an important interpretation: how error propagates to the hidden neurons. Each neuron decides how much more or less to fire, and based on that decision it can strengthen or weaken its own connections.
Error propagation turns out to be an important and common strategy in both nature and technology. For a multilayer perceptron, propagating the error from layer to layer is an efficient way to assign it. However, backprop cannot solve the machine-learning problem by itself, because it doesn't know how to find the global error minimum, and local ones can be arbitrarily bad. That being said, the issue is not a big deal, since a local minimum works fine most of the time.
Because of the properties of hyperspace, a local minimum is less likely to have overfitted the data than the global one, and there are probably many very different perceptrons that learn just as well. The perceptron model with an S curve now solves the XOR learning problem, and the success of backprop led to powerful applications such as NETtalk, stock-market prediction, and CMU's early driverless car.
Now take one step back to biology. Although it was originally hard to find even simple linear relationships between the relevant quantities, building a complete model of a cell becomes promising with backprop, which can in principle learn nonlinear functions efficiently if we have a complete map of the cell's metabolic pathways and enough observations of all the relevant variables. The problem then shifts to how to complete the work with only partial knowledge of the required information. One immediate idea is to use induction to infer the structure of the cell's network from previous knowledge. In terms of building an artificial brain, which needs a large number of layers, backprop breaks down because the error signal becomes more and more diffuse.
The author introduces the autoencoder to support the claim that the progress made in connectionism is not just a direct result of faster computation and bigger data. An autoencoder is a multilayer perceptron whose output is the same as its input. The key is to make the hidden layer much smaller than the input and output layers. One issue is that it is hard to learn even with only a single hidden layer. A sparse autoencoder plays a trick by making the hidden layer larger than the input and output ones, and forcing all but a few of the hidden units to be off at any given time. Another clever idea is to stack sparse autoencoders, where the hidden layer of the first autoencoder becomes the input/output layer of the next.
Without all the significant progress mentioned above, how would our understanding of the brain be improved? The author states that our models are still far from the human brain: we need a complete map of the brain, which we don't have right now, and we also need to figure out precisely what the map does, which we are even further from achieving. The issue cannot be solved by just scaling up backprop; the solutions are buried in a deeper understanding of how humans actually learn.
Week 4 Q & A Collection
How does the original perceptron model correspond to the neuron?
“With some simplified abstractions, the perceptron has parts corresponding to the neuron, with the weights carrying additional meaning.”
What is the core of the Boltzmann machine?
“The Boltzmann machine came as a good solution, modeling the statistical properties of neurons by assigning a probability distribution over all states of the neural network.”
Why does the author state that a local minimum is sufficient most of the time in backprop?
“Because of the properties of hyperspace, a local minimum is less likely to have overfitted the data than the global one, and there are probably many very different perceptrons that learn just as well.”
What is the trick of the sparse autoencoder?
“A sparse autoencoder plays a trick by making the hidden layer larger than the input and output ones, and forcing all but a few of the hidden units to be off at any given time.”
“Another clever idea is to stack sparse autoencoders, where the hidden layer of the first autoencoder becomes the input/output layer of the next.”
Why are we still far away from building an artificial brain?
See the last paragraph of the review.
Chapter #5 Preview
【Chapter Summary】
In this chapter, we will follow the core idea of evolution to build machine learning algorithms. Will the code of nature lift us to a new heaven filled with intricate designs? Will the ideas borrowed from natural evolution follow a specific mathematical pattern? How will these whims lead us to the Master Algorithm? Though we are initially inspired by Darwin's arguments, a variety of questions remain open: the definitions of sex, mortality, and the fittest individual, the interaction between creatures, the nature of our algorithm, and so on. It seems that these factors may somehow contribute to the success of the derived algorithms, or even supply deeper insights into their nature.
【Important Sections】
Darwin's algorithm:
In the 1960s, Holland read Ronald Fisher's classic treatise The Genetical Theory of Natural Selection, elaborated on its core idea, and devised the genetic algorithm. This section introduces basic concepts related to the definitions of nature and computer algorithms. A concrete example, a spam e-mail filter, is presented to illustrate those concepts.
The exploration-exploitation dilemma:
This section compares exploration with exploitation; each behavior has its own advantages and drawbacks. A genetic algorithm following the exploration pattern is like a band of treasure hunters who search as widely as they can, even though some of them may die along the way. In the end, the hunter who owns the most valuable treasure wins. But is exploration better than exploitation (e.g. the backprop algorithm)?
Survival of the fittest programs
How can evolution be defined as the core of an algorithm? We should design an algorithm that learns and switches between complex representations at each iteration instead of manipulating simple binary strings. However, the challenge is to assemble the right structure of behaviors with the proper parameters. Once we manage that, we can believe that such an algorithm is able to learn any program, which leads toward the Master Algorithm.
What is sex for
The role of sex in evolution is still a notable mystery. Some believe that sex maintains and increases diversity in the population, but this argument is inconclusive. Others believe that sex optimizes mixability, making the whole population more robust under evolution. However, this approach turns out to be less effective than gradient optimization in machine learning, so sex may not come to success in machine learning.
Nurturing nature
Our algorithms, models, and structures are the 'nature' side of computation. To fully enhance those models, data undoubtedly plays an indispensable role: nature evolves in response to the nurture it gets, and that is how data nourishes 'nature'. In the real world, however, nature usually comes first and nurture follows. The key is figuring out how to combine the two.
He who learns fastest wins
In the real world, it takes decades for genes, which carry only a little variation, to pass from one generation to the next. That is far too slow for a computer to learn. Even with the power of fast computation, how to speed up this drawn-out process remains a long-standing problem.
【Key Concepts】
Genetic programming
Evolutionary computation
The Baldwin effect
【Quiz】
Name and describe some main features of genetic algorithms.
How does the spam filter work?
What are the advantages and disadvantages of genetic algorithms?
What is the Baldwin effect? How does it relate to machine learning?