In recent years, breakthroughs in artificial intelligence have drawn strong interest from students and practitioners across many technical fields. On the road into AI, The Master Algorithm is arguably one of the must-read books, and its importance needs little elaboration. The author, Pedro Domingos, may seem to give only a rough tour of the mainstream schools of thought in machine learning, yet he touches on nearly every important application, existing or yet to come. The book suits beginners who want a bird's-eye view of the field, while planting countless seeds that invite motivated readers to dig deeper into specific technical problems; its witty style also makes reading it a pleasure.
With this book as the text, Synced (机器之心) will soon officially open its "AI Study Group: Beginner Track" (see how to join)! We invite all beginners interested in artificial intelligence and machine learning to join us and, through reading and discussing The Master Algorithm, gain a broad, well-rounded view of AI's history and technical principles.
Chapter #5 Review
【Chapter Summary】
Chapter 5 presents the power of evolution through Darwin's theory of natural selection. As thousands of agents interact with the environment simultaneously and iteratively, only the fittest are selected to pass their genes on to the next generation. Unlike multilayer perceptrons, evolution reaches its ultimate goal through sheer numbers.
The Genetic Algorithm:
Initialize: P ← p random hypotheses
Evaluate: for each h in P, compute Fitness(h)
While max Fitness(h) over P < Fitness_threshold
    Select: probabilistically select (1-r)*p members of P to add to Ps.
    Crossover: probabilistically select (r*p)/2 pairs of hypotheses from P. For each pair, produce two offspring by crossover. Add all offspring to Ps.
    Mutate: invert a randomly selected bit in m*p randomly chosen members of Ps.
    Update: P ← Ps
    Evaluate: for each h in P, compute Fitness(h)
Return the hypothesis from P with the highest fitness
The key input to a genetic algorithm is the fitness function, which assigns a program a numeric score reflecting how well it fits the purpose. Starting with a population of not-very-fit individuals, possibly completely random ones, the genetic algorithm has to come up with variations that can be selected according to fitness. Like selective breeding, it picks the highest-scoring individuals as parents of the next generation, then randomly flips bits of their offspring to introduce mutations.
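The loop above can be sketched as a minimal runnable program. The population size, crossover and mutation rates, and the bit-counting fitness function are all illustrative assumptions; the chapter's "immortality" trick is included as elitism:

```python
import random

def genetic_algorithm(fitness, threshold, p=20, r=0.6, m=0.05, n_bits=16):
    """Evolve bit-string hypotheses until one reaches the fitness threshold."""
    population = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(p)]
    while max(fitness(h) for h in population) < threshold:
        weights = [fitness(h) ** 2 + 1 for h in population]  # fitness-weighted selection
        # Select: carry (1 - r) * p members into the next generation
        next_gen = [h[:] for h in
                    random.choices(population, weights=weights, k=round((1 - r) * p))]
        # Crossover: (r * p) / 2 pairs each produce two offspring
        for _ in range(round(r * p / 2)):
            a, b = random.choices(population, weights=weights, k=2)
            cut = random.randrange(1, n_bits)
            next_gen += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # Mutate: flip one random bit in m * p randomly chosen members
        for h in random.sample(next_gen, max(1, round(m * p))):
            i = random.randrange(n_bits)
            h[i] = 1 - h[i]
        # The chapter's "immortality" trick: the current best always survives
        next_gen.append(max(population, key=fitness))
        population = next_gen
    return max(population, key=fitness)

# Toy fitness: the count of 1-bits, standing in for "% of emails classified correctly"
best = genetic_algorithm(fitness=sum, threshold=14)
print(sum(best) >= 14)  # True
```

Without the elitism line, the best hypothesis can be lost between generations, which is exactly the backsliding the chapter's immortality trick avoids.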
Training data: the ten thousand words that appear most frequently in spam
For each word in the training data, we use two bits to represent the rule: the first bit marks the word's presence, and the second marks its absence. For example, if the email contains the target word, the first bit corresponding to that word is set to one; if it does not, the second bit is set to one. Why use two bits when one bit seems to achieve the same result? Because we are building a filter that we hope will improve with every generation, and crossover needs that redundancy to work with. If both bits are one for a given word, the email is classified as spam regardless of whether it contains the word; conversely, if both bits are zero, the email is never classified as spam on account of that word. Either way, the rule is no longer sound. After several generations, the less sound rules are wiped out by the fitness measure, which is the percentage of emails classified correctly. Unlike any creature in the real world, our programs let us cheat liberally by making them immortal: the individual with the highest fitness score is compared with its offspring, and if its score is still higher than theirs, it outlives them. Immortality helps the algorithm reach the desired fitness sooner by avoiding backsliding between generations.
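To make the encoding concrete, here is a small sketch of how such a two-bit rule might classify an email. The per-word bit semantics follow the paragraph above; how evidence is aggregated across words (any firing bit means spam) is an assumption of this sketch, not something the chapter pins down:

```python
def is_spam(rule, email_words, vocab):
    """For word i, bit 2i fires when the word is present and bit 2i+1 fires
    when it is absent; the rule flags the email as spam if any bit fires."""
    for i, word in enumerate(vocab):
        present = word in email_words
        if (present and rule[2 * i]) or (not present and rule[2 * i + 1]):
            return True
    return False

vocab = ["free", "winner", "meeting"]
rule = [1, 0,   1, 1,   0, 0]  # "free" present => spam; "winner": spam either way
print(is_spam(rule, {"meeting"}, vocab))     # True: "winner" has both bits set
print(is_spam([0, 0] * 3, {"free"}, vocab))  # False: all-zero bits never fire
```

The two prints illustrate the paragraph's point: an all-ones pair fires unconditionally, and an all-zeros rule can never accuse an email of being spam.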
To evaluate the genetic algorithm, we can compare it with backpropagation. Backprop entertains a single hypothesis at any given time, and that hypothesis changes gradually until it settles into a local optimum. Genetic algorithms consider an entire population of hypotheses at each step, and these can make big jumps from one generation to the next. The genetic algorithm's problem is the curse of dimensionality: the algorithm makes no a priori assumptions about structure, resulting in a larger state space, so the search problem becomes very complicated and hard to analyze. Take the spam filter as an example: we assume 2^2000 states in total, which is a relatively small rule set. In machine learning, as elsewhere in computer science, there is nothing better than getting such a combinatorial explosion to work for you instead of against you. Luckily, better rules have a higher probability of surviving, so we should expect to see more good ones in every generation.
Until now, we have been manipulating bit strings. We may ask whether there are effective structures or representations to learn other than bit strings. A familiar concept from computer programming, the tree structure, applies naturally here. Instead of evolving simple things, why not evolve full-blown computer programs? We can cross over parents' subtrees to evolve new trees, then evaluate their fitness. In general, program trees can include the full range of programming constructs, such as if...else statements, loops, and recursion. This idea gives us infinitely many ways to achieve a specific goal: the resulting programs can have any size, which makes learning more flexible.
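A toy sketch of subtree crossover, representing program trees as nested tuples; the tree format and path-based node addressing are illustrative choices for this sketch, not the book's:

```python
import random

# Program trees as nested tuples, e.g. ("if", condition, then, else).
def paths(tree, prefix=()):
    """Yield the address of every node; children are indexed from 1
    (index 0 is the operator name)."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], 1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, subtree):
    if not path:
        return subtree
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], subtree),) + tree[i + 1:]

def crossover(a, b):
    """Swap a random subtree of a with a random subtree of b."""
    pa = random.choice(list(paths(a)))
    pb = random.choice(list(paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))

parent1 = ("if", ("contains", "free"), "spam", "ham")
parent2 = ("if", ("contains", "winner"), "spam", "ham")
child1, child2 = crossover(parent1, parent2)
print(child1)
```

Because trees can grow or shrink under crossover, the resulting programs can have any size, which is the flexibility the paragraph above describes.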
In biology, the Baldwin effect is the result of the interaction between evolution and learning by individual animals over their lifetimes. It turns out that individual learning can enhance evolutionary learning at the species level. Similarly, in machine learning, the Baldwin effect tells us that we may improve the results of an evolutionary algorithm by applying it to systems that can learn by themselves. Starting from a uniform distribution, programs cluster together through self-learning and eventually concentrate around a few points; a single program near a dense point is more likely to move onto it.
Evolution takes billions of years to complete, and the brain takes a lifetime; learning algorithms, however, should be able to learn in minutes or seconds. The goal of machine learning is to find the best possible learning algorithm by any means available, and evolution and the brain are unlikely to provide it. Today there are few successful applications of genetic algorithms. Nevertheless, genetic algorithms and the concept of search do inspire us to seek more effective learning methods by optimizing specific structures rather than sticking to the hope of general learning.
Week 5 Q & A Collection
1. Name and describe some main features of genetic algorithms.
GA is like selective breeding, and its key input is a fitness function
There are three main operators in GA: selection, crossover, and mutation
GA "makes no a priori assumptions about structure, resulting in a larger state space; the search problem thus becomes very complicated and hard to analyze." [from the review]
2. How does the spam filter work?
See the paragraph beginning with "For example, suppose we want to evolve a rule for filtering spam." [from the book]
3. What are the advantages and disadvantages of genetic algorithms?
Advantages:
i. GAs integrate easily with other algorithms and can be applied to a wide variety of problems regardless of the application domain
ii. GAs "are much less likely than backprop to get stuck in a local optimum and in principle better able to come up with something truly new" [from the book]
Disadvantages:
i. GAs are much more difficult to analyze than backprop, given their complexity
ii. GAs' parameters strongly affect the results, yet they are mostly set based on experience
4. What is the Baldwin effect? How does it relate to machine learning?
See the paragraph beginning with "Of all the possible genomes, very few correspond to viable organisms." [from the book]
"In machine learning, the Baldwin effect tells us that we may improve the results of an evolutionary algorithm by applying it to systems that can learn by themselves." [from the review]
Chapter #6 Preview
【Chapter Summary】
Chapter 6 covers Bayesian statistics, a set of mathematical rules for using new data to iteratively update beliefs or existing knowledge. The Bayesian method was invented in the 18th century by Thomas Bayes, and its appearance shocked frequentists, who use frequencies to measure an event's probability. In this chapter, the author discusses how we can use Bayes' theorem to see the world from a different angle than frequentists do. The author also presents Bayesian inference, an important application of Bayes' theorem, along with some models that exploit it, including the Naive Bayes classifier, Markov chains, hidden Markov models, and Markov random fields.
【Important Sections】
All models are wrong, but some are useful
This section introduces the Naive Bayes classifier, which makes a naive assumption about the independent and dependent variables (features and labels). In practical problems, this assumption is often false. Empirically, however, the Naive Bayes classifier is a simple model with O(nc) time complexity (n is the sample size and c the number of features), so it often performs quite well and does not tend to overfit. Naive Bayes is a good choice for a starter machine learning model, because it gives a general idea of how machine learning models will perform on the given data.
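A compact sketch of a Naive Bayes classifier for word-list documents; Laplace (add-one) smoothing is assumed, and the toy training data is made up:

```python
from collections import Counter, defaultdict
import math

def train(docs):
    """docs: list of (word_list, label) pairs."""
    labels = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return labels, word_counts, vocab

def predict(model, words):
    labels, word_counts, vocab = model
    total = sum(labels.values())
    best, best_lp = None, -math.inf
    for label, n in labels.items():
        lp = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace denominator
        for w in words:
            # naive assumption: features are conditionally independent given the label
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([(["free", "win"], "spam"), (["meeting", "today"], "ham"),
               (["win", "cash"], "spam")])
print(predict(model, ["free", "cash"]))  # spam
```

Training and prediction are each a single pass over the data, which is where the O(nc) behavior described above comes from.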
From Eugene Onegin to Siri
This section introduces Markov chains and hidden Markov models. Both models assume the Markov property and are used to model sequential data.
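A toy first-order Markov chain over words, in the spirit of the Eugene Onegin example; the transition table here is invented:

```python
import random

# Each word lists the words that may follow it (uniform choice for simplicity).
transitions = {
    "the": ["cat", "dog"],
    "cat": ["sat", "ran"],
    "dog": ["ran"],
    "sat": ["by"],
    "ran": ["by"],
    "by":  ["the"],
}

def sample(start, length, seed=0):
    random.seed(seed)
    words = [start]
    while len(words) < length:
        # Markov property: the next word depends only on the current word
        words.append(random.choice(transitions[words[-1]]))
    return words

print(" ".join(sample("the", 7)))
```

A hidden Markov model adds one more layer: the chain's states are unobserved, and each state emits an observable symbol, which is what makes it suitable for speech systems like Siri.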
Everything is connected, but not directly
The Markov property implies that the future state of an event depends only on the current state. But what if the future state also depends on events other than the current one? An HMM or similar model cannot capture this situation, but a Bayesian network can: it is a probabilistic model that enables us to specify any directional probabilistic relationships between events. In this section, the author introduces Bayesian networks and their applications.
Markov weighs the evidence
Although a Bayesian network allows us to specify any directional probabilistic relationships between events, in some cases we know that events A and B are related but the exact causal relationship is unknown, i.e., whether A causes B or B causes A. In such situations, Markov networks and Markov random fields are the types of models to look for.
【Key Concepts】
Frequentist vs. Bayesian:
A frequentist claims that probability is frequency: we should estimate a probability by counting how often the corresponding event occurs.
A Bayesian holds that probability is a subjective degree of belief: you should update your prior beliefs with new evidence to obtain your posterior beliefs.
Bayes' theorem: mathematically, Bayes' theorem is P(A|B) = P(B|A) P(A) / P(B), where:
P(A) and P(B) are the probabilities of events A and B. P(A) is called the prior, representing our initial belief in event A.
P(B|A) is called the likelihood, the probability of B given A.
P(A|B) is called the posterior, representing our belief after evidence B is taken into account.
We often ignore the constant P(B) and rewrite the above as P(A|B) ∝ P(B|A) P(A).
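A worked numeric instance of the theorem, with made-up numbers for a spam test:

```python
# Prior: P(spam) = 0.2. Likelihoods: P("free" | spam) = 0.5, P("free" | ham) = 0.1.
p_spam = 0.2
p_free_given_spam = 0.5
p_free_given_ham = 0.1

# P(B) by total probability, summing over spam and ham
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | "free") = P("free" | spam) P(spam) / P("free")
posterior = p_free_given_spam * p_spam / p_free
print(round(posterior, 3))  # 0.556
```

Seeing the word raises our belief that the email is spam from the prior 0.2 to a posterior of about 0.56, which is exactly the "update beliefs with new evidence" step described above.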
Naive assumption in Naive Bayes classifier:
All features are conditionally independent given the label.
Markov property:
The future state of an event depends only on the current state, not on the sequence of previous states.
Maximum likelihood estimation and maximum a posteriori (MAP) estimation
These are two different estimation methods, depending on what you want to maximize: the likelihood P(B|A) or the posterior P(A|B).
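A worked contrast between the two, for estimating a coin's probability of heads; the Beta(2, 2) prior is an illustrative assumption:

```python
heads, tails = 7, 3

# MLE: maximize the likelihood alone — the observed frequency
mle = heads / (heads + tails)

# MAP: maximize likelihood * prior; with a Beta(a, b) prior the posterior
# mode is (heads + a - 1) / (heads + tails + a + b - 2)
a, b = 2, 2
map_est = (heads + a - 1) / (heads + tails + a + b - 2)

print(mle, round(map_est, 3))  # 0.7 0.667
```

The prior pulls the MAP estimate toward 0.5, so it is less extreme than the MLE on small samples; with a flat prior (a = b = 1) the two coincide.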
【Quiz】
What is Bayes' theorem?
What is the naive assumption, and what role does it play in the Naive Bayes classifier?
What is the difference between HMM and Markov model?
Why is domain knowledge important in building and inferencing graphical models?