In recent years, breakthrough advances in artificial intelligence have drawn strong interest from students and practitioners across science and technology. On the road into AI, The Master Algorithm is arguably essential reading, and its importance needs little elaboration. The author, Pedro Domingos, appears to give only a broad sketch of the main schools of thought in machine learning, yet he touches on nearly every important application, whether it already exists or is still to come. The book gives beginners a big-picture overview of machine learning while planting countless leads that let motivated readers dig deeper into specific technical problems, making it a rare kind of introductory guide. Its witty writing style also makes it a pleasure to read.
With this book as the study material, 机器之心 (Synced) will soon officially launch the "AI Study Group · Beginner Track"!
How to join: we invite all beginners interested in artificial intelligence and machine learning to join us and, by reading and discussing The Master Algorithm, gain a broad and comprehensive understanding of the history and technical principles of AI.
Previous installments:
- AI Study Group · Beginner Track | Reading Pedro Domingos' The Master Algorithm together
- Study Group · Beginner Track | Summary of the first two chapters of The Master Algorithm and reading of Chapter 3
- Study Group · Beginner Track | Summary of Chapter 3 of The Master Algorithm and reading of Chapter 4
- Study Group · Beginner Track | Session 4: Evolution is nature's learning algorithm
- Study Group · Beginner Track | Session 5: Entering the Bayesian temple
- Study Group · Beginner Track | Session 6: A first look at Bayesian networks
- Study Group · Beginner Track | Session 7: Unsupervised learning, no teacher needed
Chapter #8 Review
【Chapter Summary】
Unlike the previous chapters, which focus on learning from labeled data, Chapter 8, "Learning Without a Teacher," introduces unsupervised learning. Cognitive scientists describe their theories of how children learn in the form of algorithms, and the author does the same, seeking algorithmic solutions in clustering, dimensionality reduction, reinforcement learning, chunking, and relational learning. The chapter is a collection of concepts and algorithms for recreating the brain's learning process in a newborn robot.
Clustering, the first topic, is the spontaneous grouping of similar objects. The author explains the Expectation-Maximization (EM) algorithm, one of the most popular in machine learning, along with two well-known special cases: Markov models and k-means. To learn a hidden Markov model, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on that inference. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use the EM algorithm. K-means corresponds to the case where every attribute is normally distributed with very small variance. We can start with a coarse clustering and then further divide each primary cluster into smaller subclusters.
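To make the EM flavor of k-means concrete, here is a minimal NumPy sketch (not code from the book): the assignment step plays the role of the E-step, inferring each point's hidden cluster, and the update step plays the role of the M-step, re-estimating each cluster's mean. The two-blob toy data and all parameter choices are invented for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternate E-like assignments and M-like mean updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial means
    for _ in range(n_iters):
        # "E-step": assign each point to its nearest center (hard cluster inference)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # "M-step": re-estimate each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data: two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```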
Subsequently, the author introduces two popular unsupervised techniques used mainly for dimensionality reduction: PCA (principal-component analysis) and Isomap. Dimensionality reduction is the process of reducing a large number of visible dimensions (the pixels, say) to a few implicit ones (expression, facial features), which is essential for coping with big data. PCA finds a linear combination of the original dimensions in the hyperspace such that the total variance of the data along the new dimensions is maximized. Isomap, a nonlinear dimensionality reduction technique, connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network, and finds the reduced coordinates that best approximate these distances.
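As a rough illustration of the PCA idea of keeping the directions of maximum variance, here is a small NumPy sketch (my own, not the book's): center the data, take the leading right singular vectors, and project onto them. The random "pixel" matrix simply stands in for real images.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the directions of maximum variance (classic linear PCA)."""
    X_centered = X - X.mean(axis=0)                     # remove each feature's mean
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                      # top principal directions
    explained_variance = (S[:n_components] ** 2) / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

# Stand-in for high-dimensional observations (e.g., flattened face images)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))                          # 200 samples, 64 "pixels"
Z, components, var = pca(X, n_components=5)             # 64 dims -> 5 latent dims
print(Z.shape, var)
```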
After introducing clustering and dimensionality reduction, the author turns to reinforcement learning, a technique that relies on the environment's immediate response to the learner's actions. The author describes the history and development of reinforcement learning in detail. In the early 1980s, Rich Sutton and Andy Barto at the University of Massachusetts observed that learning depends crucially on interaction with the environment, which supervised algorithms did not capture, and so they found inspiration instead in the psychology of animal learning. Sutton became the leading proponent of reinforcement learning. Another key step came in 1989, when Chris Watkins at Cambridge, initially motivated by his experimental observations of children's learning, arrived at the modern formulation of reinforcement learning as optimal control in an unknown environment. A recent example of a successful startup combining neural networks and reinforcement learning is DeepMind, a company acquired by Google for half a billion dollars.
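To ground the discussion, below is a minimal tabular Q-learning sketch, the kind of value-update rule Watkins proposed, run on a made-up five-state corridor rather than any real task; the agent learns action values purely from reward feedback.

```python
import random

# Hypothetical toy environment: a 1-D corridor of 5 states; reaching state 4 pays +1.
N_STATES, ACTIONS = 5, (-1, +1)                     # actions: step left or step right

def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

# Q-learning: learn action values from interaction alone (no labeled targets).
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state, done = 0, False
    while not done:
        if random.random() < epsilon:
            action = random.choice(ACTIONS)                        # explore
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit
        next_state, reward, done = step(state, action)
        # Move Q(s,a) toward the reward plus the discounted best future value
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Greedy policy after learning: every non-terminal state should prefer "go right" (+1)
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```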
Inspired by psychology, chunking is an algorithm with the potential to become part of the Master Algorithm, and the author gives a basic outline of its core idea. Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction. A simpler type of learning by interacting with the environment is A/B testing.
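Since the chapter cites A/B testing as the simplest way to learn from the environment, here is a hedged sketch of how a basic A/B comparison might be scored with a two-proportion z-test; the visitor and conversion counts are invented.

```python
from math import sqrt

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b, z_crit=1.96):
    """Two-proportion z-test: does variant B's conversion rate differ from A's?"""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    return p_a, p_b, z, abs(z) > z_crit      # True => difference unlikely to be chance

# Made-up numbers: 10,000 visitors per variant
print(ab_test(conversions_a=480, visitors_a=10_000,
              conversions_b=540, visitors_b=10_000))
```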
The chapter ends with an explanation of another potential killer algorithm, relational learning: the world is a web of interconnections, and every feature template we create ties together the parameters of all its instances. The author concludes by recommending that all the elements, solutions, and algorithms mentioned in Chapter 8 be combined into the ultimate Master Algorithm.
Week 8 Q & A Collection
- Give two applications each of Markov models and the k-means algorithm.
- Markov models: speech recognition; handwriting recognition; biological sequence analysis, etc.
- K-means: making recommendations (Amazon, Netflix, etc.); choosing store locations; summarizing articles, etc.
- Describe the situations in which the Isomap algorithm performs best for dimensionality reduction.
- “One of the most popular algorithms for nonlinear dimensionality reduction, called Isomap, does just this.” “From understanding motion in video to detecting emotion in speech, Isomap has a surprising ability to zero in on the most important dimensions of complex data.” [from the book]
- Why does reinforcement learning with generalization often fail to settle on a stable solution?
- “In supervised learning the target value for a state is always the same, but in reinforcement learning, it keeps changing as a consequence of updates to nearby states.” [from the book]
- What is the difference between clustering and chunking?
- They are fundamentally different. Clustering uses unsupervised learning to group similar data together, while chunking intentionally breaks a problem into pieces so that it can be solved more easily.
Chapter #9 Preview
【Chapter Summary】
Welcome to Chapter 9. In this chapter, the author lays out a path to the Master Algorithm by combining some of the algorithms mentioned before and comparing them explicitly. Of course, he encounters many challenges along the way; he explains how to overcome them and how to build the unifier of machine learning, Alchemy. Finally, he shares his insights about Alchemy from different perspectives, such as its advantages, disadvantages, and applications.
【Important Sections】
- Out of many models, one
- The author surveys the metalearning methods (stacking, bagging, and boosting) and summarizes the characteristics and shortcomings of each; a minimal code sketch contrasting the three appears after this list.
- The Master Algorithm
- The author guides us on a fantastical tour of an imaginary city. Along the way, we learn what the important algorithms do, how they relate to one another, and how to combine them all to reach the Master Algorithm.
- Markov logic networks
- The author introduces the Markov network, closes the final gap in the combination, and derives the Markov logic network. He also defines the role each algorithm plays in the Master Algorithm.
- From Hume to your housebot
- The author names the initial version of the Master Algorithm Alchemy. He draws some conclusions about Alchemy and shows how to customize it for different situations; some practical applications are also shown in this section.
- Planetary-scale machine learning
- Alchemy does not yet scale because its computation is very expensive. What should we do? Why can it work well in the real world? What is the next step? This section answers these questions.
- The doctor will see you now
- CanceRx is one of the successful applications of Alchemy; it has tremendous potential and is constantly being improved.
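As noted in the first section above, here is a minimal scikit-learn sketch (an illustration of mine, not the book's code) contrasting the three metalearning strategies on a synthetic dataset: bagging gives equally weighted votes to models trained on bootstrap samples, boosting reweights toward past mistakes, and stacking trains a meta-model on the base models' predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for any labeled classification task
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: many trees trained on bootstrap samples, each vote counted equally
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: models added sequentially, each emphasizing previously misclassified examples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

# Stacking: a meta-learner (logistic regression) combines the base models' predictions
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))
```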
【Key Concepts】
- Unifier: A unifier is someone or something that brings others together.
- Metalearning: Metalearning is the process of learning to learn.
- Stacking: Stacking involves training a learning algorithm to combine the predictions of several other learning algorithms.
- Bagging: Bagging has each model in the ensemble vote with equal weight; each model is typically trained on a random bootstrap sample of the data.
- Boosting: Boosting builds an ensemble incrementally by training each new model to emphasize the training instances that previous models misclassified.
- Representation: It is the formal language in which the learner expresses the model.
- Evaluation: It is a scoring function that shows how good a model is.
- Optimization: It is an algorithm that searches for the highest-scoring model and returns it.
- Markov network: It is defined by a weighted sum of features, much like a perceptron; a tiny scoring sketch follows this list.
- Markov Logic Network (MLN): An MLN is a set of logical formulas and their weights.
- Alchemy: It is a simple embodiment of our candidate universal learner; here it mainly refers to the Markov logic network algorithms and models developed by the author's team.
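To make the "weighted sum of features" idea concrete, here is a tiny, purely illustrative sketch of how a Markov logic network scores possible worlds: each logical formula contributes its weight whenever it is satisfied, and exponentiating the total gives an unnormalized probability. The formulas, weights, and atoms (Smokes, Cancer) are invented for illustration and are not Alchemy's actual model.

```python
from itertools import product
from math import exp

# Hypothetical MLN over two ground atoms, Smokes(x) and Cancer(x):
#   Formula 1 (weight 1.5): Smokes(x) => Cancer(x)
#   Formula 2 (weight 0.5): Cancer(x)   (a weak prior that cancer occurs)
def weighted_feature_sum(world):
    smokes, cancer = world
    f1 = (not smokes) or cancer          # material implication Smokes => Cancer
    f2 = cancer
    return 1.5 * f1 + 0.5 * f2           # weighted sum of satisfied formulas

worlds = list(product([False, True], repeat=2))           # all truth assignments
scores = {w: exp(weighted_feature_sum(w)) for w in worlds}  # unnormalized probabilities
Z = sum(scores.values())                                   # partition function
for w, s in scores.items():
    print(f"Smokes={w[0]!s:5} Cancer={w[1]!s:5}  P={s / Z:.3f}")
```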
【Quiz】
- What are the differences among stacking, bagging, and boosting?
- How does the author combine logic and probability?
- What challenges does Alchemy face?
- Share your opinion on this section.