本文研究目标为:
使用NLTK预处理/清理文本数据
使用word2vec创建词汇和标题嵌入,并通过t-SNE使其显示为数据集
将标题情绪与文章转发量之间的关系可视化
尝试通过嵌入及其他相关方法预测文章转发量
使用模型融合来提升文章转发量预测的准确性(此步最终未实现)
操作的全过程可以参见:
https://nbviewer.jupyter.org/github/chambliss/Notebooks/blob/master/Word2Vec_News_Analysis.ipynb
数据输入和预处理
首先,数据输入如下:
import pandas as pd
import gensim
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb
然后读取数据:
main_data = pd.read_csv('News_Final.csv')
main_data.head()
# Grab all the titles
article_titles = main_data['Title']
# Create a list of strings, one for each title
titles_list = [title for title in article_titles]
# Collapse the list of strings into a single long string for processing
big_title_string = ' '.join(titles_list)
from nltk.tokenize import word_tokenize
# Tokenize the string into words
tokens = word_tokenize(big_title_string)
# Remove non-alphabetic tokens, such as punctuation
words = [word.lower() for word in tokens if word.isalpha()]
# Filter out stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if not word in stop_words]
# Print first 10 words
words[:10]
接下来,需要加载预先设定好的word2vec模型(可参照https://github.com/RaRe-Technologies/gensim-data)。由于这是一个新闻数据集,笔者使用的是谷歌新闻模型,该模型已经通过大约1000亿词汇量的测试。
# Load word2vec model (trained on an enormous Google corpus)
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)
# Check dimension of word vectors
model.vector_size
300
该模型将生成300维的词汇向量,生成的向量必须通过模型传递下去。向量列举如下:
economy_vec = model['economy']
economy_vec[:20] # First 20 components
word2vec只能依据自己的词汇库来创造向量。因此,在创建词汇向量的完整列表时,要指定“if model in model.vocab”
# Filter the list of vectors to include only those that Word2Vec has a vector for
vector_list = [model[word] for word in words if word in model.vocab]
# Create a list of the words corresponding to these vectors
words_filtered = [word for word in words if word in model.vocab]
# Zip the words together with their vector representations
word_vec_zip = zip(words_filtered, vector_list)
# Cast to a dict so we can turn it into a DataFrame
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
df.head(3)
使用t-SNE降低维数
接下来,使用t-SNE压缩这些词汇向量(即降维)。如果想了解更多关于t-SNE的相关知识,请查看:https://distill.pub/2016/misread-tsne/。
选择t-SNE的参数非常重要,因为不同的参数会产生截然不同的结果。笔者测试了0到100之间的几个值的准确性,每次产生的结果大致相同。笔者又测试了20到400之间几个学习率的数值,最终决定将学习率设为在默认值(200)。
为了节省处理时间,笔者并没有使用全部的20000个单词向量,而是只选用了其中的400个。
from sklearn.manifold import TSNE
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)
# Use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(df[:400])
现在,可以准备测绘减少的词汇向量数组的二维图。在这里,笔者使用adjust_text将单词自动分开。
sns.set()
# Initialize figure
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)
# Import adjustText, initialize list of texts
from adjustText import adjust_text
texts = []
words_to_plot = list(np.arange(0, 400, 10))
# Append words to list
for word in words_to_plot:
texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df.index[word], fontsize = 14))
# Plot text using adjust_text (because overlapping text is hard to read)
adjust_text(texts, force_points = 0.4, force_text = 0.4,
expand_points = (2,1), expand_text = (1,2),
arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
如果你有兴趣使用adjust_text来进行测绘,可以参照:https://github.com/Phlya/adjustText。注意要使用驼峰格式adjustText进行导入,且adjustText目前与matplotlib3.0或其更高版本不兼容。
即使将矢量嵌入减少到2维,还是可以看到某些项聚集在一起。例如,我们在左/左上角导入“month”,在底部导入“corporatefinance terms”,在中间导入较多的通用非主题词(如“full”, “really”, “slew”)。
需要注意的是,如果使用不同的参数再次运行t-SNE,可能会得到相似的结果,但不一定完全相同。t-SNE不是确定性的,数据集的紧密度和它们之间的分散距离并不能说明什么。t-SNE主要是一种探索性工具,而不是相似性的决定性指标。
平均词嵌入
上文已经介绍了如何将词嵌入的方法应用于此数据集中。接下来,本文将探讨ML的应用:找到聚集的标题集,并研究它们的关系。
Doc2Vec没有预先设定好的模型,因此需要更长时间的测试过程。笔者在这里使用更简洁更有效率的方法:即平均每个文档中单词向量的嵌入。在这个例子中,文档指的就是标题。
此处需要对标题再次进行预处理步骤,这要比分隔词汇的处理更复杂。Dimitris Spathis开发了一系列功能,在本例中可以得到很好的应用。
def document_vector(word2vec_model, doc):
# remove out-of-vocabulary words
doc = [word for word in doc if word in model.vocab]
return np.mean(model[doc], axis=0)
# Our earlier preprocessing was done when we were dealing only with word vectors
# Here, we need each document to remain a document
def preprocess(text):
text = text.lower()
doc = word_tokenize(text)
doc = [word for word in doc if word not in stop_words]
doc = [word for word in doc if word.isalpha()]
return doc
# Function that will help us drop documents that have no word vectors in word2vec
def has_vector_representation(word2vec_model, doc):
"""check if at least one word of the document is in the
word2vec dictionary"""
return not all(word not in word2vec_model.vocab for word in doc)
# Filter out documents
def filter_docs(corpus, texts, condition_on_doc):
"""
Filter corpus and texts given the function condition_on_doc which takes a doc. The document doc is kept if condition_on_doc(doc) is true.
"""
number_of_docs = len(corpus)
if texts is not None:
texts = [text for (text, doc) in zip(texts, corpus)
if condition_on_doc(doc)]
corpus = [doc for doc in corpus if condition_on_doc(doc)]
print("{} docs removed".format(number_of_docs - len(corpus)))
return (corpus, texts)
根据上述结果,进行下一步处理:
# Preprocess the corpus
corpus = [preprocess(title) for title in titles_list]
# Remove docs that don't include any words in W2V's vocab
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: has_vector_representation(model, doc))
# Filter out any empty docs
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: (len(doc) != 0))
x = []
for doc in corpus: # append the vector for each document
x.append(document_vector(model, doc))
X = np.array(x) # list to array
t-SNE操作:文件向量
现在,我们已成功创建了文档向量数组,对它们进行t-SNE测算可能会得到与上文类似的结果。
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)
# Again use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(X[:400])
fig, ax = plt.subplots(figsize = (14, 10))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)
from adjustText import adjust_text
texts = []
titles_to_plot = list(np.arange(0, 400, 40)) # plots every 40th title in first 400 titles
# Append words to list
for title in titles_to_plot:
texts.append(plt.text(tsne_df[title, 0], tsne_df[title, 1], titles_list[title], fontsize = 14))
# Plot text using adjust_text
adjust_text(texts, force_points = 0.4, force_text = 0.4,
expand_points = (2,1), expand_text = (1,2),
arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
根据上述操作,可以发现t-SNE将文档向量折叠成一个维度空间,文档根据其内容与某些领域的相关性而展开,如与国家、世界领导者、外交事务及技术公司等方面的相关性。
现在可以对文章的转发量进行研究。通常,人们认为标题要耸人听闻才会点击率越高,文章被转发分享的可能性也就越高。那么这种共识在数据集中是否能够得到验证呢?下一部分将进行具体介绍。
文章转发量与标题情绪分析
首先,我们需要删除所有没有来源或者无法判定转发量的文章。转发量的零位测量在数据中表示为-1.
# Drop all the rows where the article popularities are unknown (this is only about 11% of the data)
main_data = main_data.drop(main_data[(main_data.Facebook == -1) |
(main_data.GooglePlus == -1) |
(main_data.LinkedIn == -1)].index)
# Also drop all rows where we don't know the source
main_data = main_data.drop(main_data[main_data['Source'].isna()].index)
main_data.shape
删除一部分文章后,仍有81000篇文章可供使用。接下来,可以研究标题情绪与文章的转发量之间是否存在某种相关性。
fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]
platforms = ['Facebook', 'GooglePlus', 'LinkedIn']
colors = list(sns.husl_palette(10, h=.5)[1:4])
for platform, subplot, color in zip(platforms, subplots, colors):
sns.scatterplot(x = main_data[platform], y = main_data['SentimentTitle'], ax=subplot, color=color)
subplot.set_title(platform, fontsize=18)
subplot.set_xlabel('')
fig.suptitle('Plot of Popularity (Shares) by Title Sentiment', fontsize=24)
plt.show()
经过上述操作,依然无法明确两者之间的相关性,因为某些文章的转发量显著高于其他文章。可以尝试对x轴进行对数转换,并使用regplot,这样seaborn可以覆盖所有的线性回归。
# Our data has over 80,000 rows, so let's also subsample it to make the log-transformed scatterplot easier to read
subsample = main_data.sample(5000)
fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]
for platform, subplot, color in zip(platforms, subplots, colors):
# Regression plot, so we can gauge the linear relationship
sns.regplot(x = np.log(subsample[platform] + 1), y = subsample['SentimentTitle'],
ax=subplot,
color=color,
# Pass an alpha value to regplot's scatterplot call
scatter_kws={'alpha':0.5})
# Set a nice title, get rid of x labels
subplot.set_title(platform, fontsize=18)
subplot.set_xlabel('')
fig.suptitle('Plot of log(Popularity) by Title Sentiment', fontsize=24)
plt.show()
与我们的预期相反,在这个数据集中,吸睛的标题与文章转发量并没有直接的联系。为了更清楚文章转发量的影响因素,可以再做一次关于文章转发量的二维图测绘。
fig, ax = plt.subplots(3, 1, figsize=(15, 10))
subplots = [a for a in ax]
for platform, subplot, color in zip(platforms, subplots, colors):
sns.distplot(np.log(main_data[platform] + 1), ax=subplot, color=color, kde_kws={'shade':True})
# Set a nice title, get rid of x labels
subplot.set_title(platform, fontsize=18)
subplot.set_xlabel('')
fig.suptitle('Plot of Popularity by Platform', fontsize=24)
plt.show()
接下来,当发布者不同时,吸睛的标题所产生的影响是否与上述结论一致呢?
# Get the list of top 12 sources by number of articles
source_names = list(main_data['Source'].value_counts()[:12].index)
source_colors = list(sns.husl_palette(12, h=.5))
fig, ax = plt.subplots(4, 3, figsize=(20, 15), sharex=True, sharey=True)
ax = ax.flatten()
for ax, source, color in zip(ax, source_names, source_colors):
sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
ax=ax, color=color, kde_kws={'shade':True})
ax.set_title(source, fontsize=14)
ax.set_xlabel('')
plt.xlim(-0.75, 0.75)
plt.show()
得出的图像大体一致,但难以分辨细节的区别。因此,尝试将这些图像全部叠加在一起。
# Overlay each density curve on the same plot for closer comparison
fig, ax = plt.subplots(figsize=(12, 8))
for source, color in zip(source_names, source_colors):
sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
ax=ax, hist=False, label=source, color=color)
ax.set_xlabel('')
plt.xlim(-0.75, 0.75)
plt.show()
可以发现,消息来源与标题情绪的分布图非常相似,没有任何一个来源的值是过高或过低的,12个最常见的源都以0为中心,分布图的尾部值也都很接近。但是这能说明真正的关系吗?请看下面这组数据:
# Group by Source, then get descriptive statistics for title sentiment
source_info = main_data.groupby('Source')['SentimentTitle'].describe()
# Recall that `source_names` contains the top 12 sources
# We'll also sort by highest standard deviation
source_info.loc[source_names].sort_values('std', ascending=False)[['std', 'min', 'max']]
非常明显的是,与其他新闻发布者相比,华尔街日报分布图的标准差最大,范围最大。这表明,华尔街日报经常使用带有负面情绪的标题。这是一个很有趣的潜在发现。但要严格地对这个结论进行验证,需要进行假设检验。这超出了本文的研究范围。
文章转发量预测
第一个任务是重新加入带有各自标题的文档向量。在进行预处理时,笔者同时处理了语料库和titles_list,因此向量及其所代表的标题仍将匹配。同时,在main_df中,笔者删除了所有人气值为-1的文章,因此也需要删除代表这些文章标题的向量。
在如此巨大的数据集上构建模型绝非易事。因此,可以减少一部分向量的维度。基于Unix时间,笔者根据文章的发布日期设计了一个新的变量。
具体可以参见:
https://en.wikipedia.org/wiki/Unix_time
import datetime
# Convert publish date column to make it compatible with other datetime objects
main_data['PublishDate'] = pd.to_datetime(main_data['PublishDate'])
# Time since Linux Epoch
t = datetime.datetime(1970, 1, 1)
# Subtract this time from each article's publish date
main_data['TimeSinceEpoch'] = main_data['PublishDate'] - t
# Create another column for just the days from the timedelta objects
main_data['DaysSinceEpoch'] = main_data['TimeSinceEpoch'].astype('timedelta64[D]')
main_data['TimeSinceEpoch'].describe()
根据上述操作可见,这些文章发布的时间形成了一个250天的数据集合。
from sklearn.decomposition import PCA
pca = PCA(n_components=15, random_state=10)
# as a reminder, x is the array with our 300-dimensional vectors
reduced_vecs = pca.fit_transform(x)
df_w_vectors = pd.DataFrame(reduced_vecs)
df_w_vectors['Title'] = titles_list
# Use pd.concat to match original titles with their vectors
main_w_vectors = pd.concat((df_w_vectors, main_data), axis=1)
# Get rid of vectors that couldn't be matched with the main_df
main_w_vectors.dropna(axis=0, inplace=True)
现在,需要删除非数字和非虚拟列,以便更好地建模。笔者还将比例缩放功能应用于DaysSinceEpoch,因为与其他变量相比,发布时间的数据范围更广。
# Drop all non-numeric, non-dummy columns, for feeding into the models
cols_to_drop = ['IDLink', 'Title', 'TimeSinceEpoch', 'Headline', 'PublishDate', 'Source']
data_only_df = pd.get_dummies(main_w_vectors, columns = ['Topic']).drop(columns=cols_to_drop)
# Standardize DaysSinceEpoch since the raw numbers are larger in magnitude
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Reshape so we can feed the column to the scaler
standardized_days = np.array(data_only_df['DaysSinceEpoch']).reshape(-1, 1)
data_only_df['StandardizedDays'] = scaler.fit_transform(standardized_days)
# Drop the raw column; we don't need it anymore
data_only_df.drop(columns=['DaysSinceEpoch'], inplace=True)
# Look at the new range
data_only_df['StandardizedDays'].describe()
# Get Facebook data only
fb_data_only_df = data_only_df.drop(columns=['GooglePlus', 'LinkedIn'])
# Separate the features and the response
X = fb_data_only_df.drop('Facebook', axis=1)
y = fb_data_only_df['Facebook']
# 80% of data goes to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
现在对数据进行非优化的XGBoost处理:
from sklearn.metrics import mean_squared_error
# Instantiate an XGBRegressor
xgr = xgb.XGBRegressor(random_state=2)
# Fit the classifier to the training set
xgr.fit(X_train, y_train)
y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
353495.42
可以说,得出的结果并不令人满意。那么,是否可以通过超参数调整来改善结果呢?笔者根据以下文章重新设计了一个超参数调整布局,具体参见:
https://www.kaggle.com/jayatou/xgbregressor-with-gridsearchcv
from sklearn.model_selection import GridSearchCV
# Various hyper-parameters to tune
xgb1 = xgb.XGBRegressor()
parameters = {'nthread':[4],
'objective':['reg:linear'],
'learning_rate': [.03, 0.05, .07],
'max_depth': [5, 6, 7],
'min_child_weight': [4],
'silent': [1],
'subsample': [0.7],
'colsample_bytree': [0.7],
'n_estimators': [250]}
xgb_grid = GridSearchCV(xgb1,
parameters,
cv = 2,
n_jobs = 5,
verbose=True)
xgb_grid.fit(X_train, y_train)
根据xgb_grid,最佳参数如下:
{'colsample_bytree': 0.7, 'learning_rate': 0.03,'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4,'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}
使用新参数再试一次:
params = {'colsample_bytree': 0.7, 'learning_rate':0.03, 'max_depth': 5, 'min_child_weight': 4,
'n_estimators': 250, 'nthread':4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}
# Try again with new params
xgr = xgb.XGBRegressor(random_state=2, **params)
# Fit the classifier to the training set
xgr.fit(X_train, y_train)
y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
351220.13
可以看出,数据量在35000左右的时候,结果更好。因此,可以推断,当前状态下的数据量不足导致模型产生的结果不好。那么,是否可以对数据特征进行改善呢?笔者利用分类方法将文章分为两组:Duds(0或1次转载)与Not Duds。
如果能够赋予回归量一个新的特征,将更有利于预测转发量更高的文章,从而使残值降低,并减少均方误差。
进行Detour操作: 检测Dud文章
从之前制作的对数转换图中可以看出,总共有两个文章集合,一个集合在0的位置,而另一个从1开始。笔者用一些分类方法来识别文章是否属于“Duds”(即转载次数只有0或1次),并将这种分类作为最终回归量的特征,以此来预测文章的转发量的可能性。这种方法被称为模型融合。
# Define a quick function that will return 1 (true) if the article has 0-1 share(s)
def dud_finder(popularity):
if popularity <= 1:
return 1
else:
return 0
# Create target column using the function
fb_data_only_df['is_dud'] = fb_data_only_df['Facebook'].apply(dud_finder)
fb_data_only_df[['Facebook', 'is_dud']].head()
# 28% of articles can be classified as "duds"
fb_data_only_df['is_dud'].sum() / len(fb_data_only_df)
0.28
现在,dud分类已经完成。接下来,要使用随机森林算法(RandomForest)、优化的XGBC分类法和K-Nearest Neighbors分类法。这里对于优化XGB的过程不做赘述。
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X = fb_data_only_df.drop(['is_dud', 'Facebook'],axis=1)
y = fb_data_only_df['is_dud']
# 80% of data goes to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state = 10)
# Best params, produced by HP tuning
params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5,'min_child_weight': 4,
'n_estimators': 200, 'nthread':4, 'silent': 1, 'subsample': 0.7}
# Try xgc again with new params
xgc = xgb.XGBClassifier(random_state=10, **params)
rfc = RandomForestClassifier(n_estimators=100, random_state=10)
knn = KNeighborsClassifier()
preds = {}
for model_name, model in zip(['XGClassifier', 'RandomForestClassifier','KNearestNeighbors'], [xgc, rfc, knn]):
model.fit(X_train, y_train)
preds[model_name] =model.predict(X_test)
测试模型,获取分类报告:
from sklearn.metrics import classification_report,roc_curve, roc_auc_score
for k in preds:
print("{}performance:".format(k))
print()
print(classification_report(y_test,preds[k]), sep='\n')
可以看出,f1分数最高的是XGC分类法,其次是RF,最后是KNN,但KNN在数据撤回方面做得最好(能够成功识别duds)。这足以说明模型融合的意义,因为像XGBoost这样的优秀模型也会在识别任务上表现不佳,而KNN的预测缺乏一定的多样性。
# Plot ROC curves
for model in [xgc, rfc, knn]:
fpr, tpr, thresholds =roc_curve(y_test, model.predict_proba(X_test)[:,1])
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.show()
文章人气预测:第2轮
现在我们可以从三种分类方法中得出预测概率的平均值,并将其作为回归量的特征。
averaged_probs = (xgc.predict_proba(X)[:, 1] +
knn.predict_proba(X)[:,1] +
rfc.predict_proba(X)[:,1]) / 3
X['prob_dud'] = averaged_probs
y = fb_data_only_df['Facebook']
接下来将进行再一轮对向量特征的调整,此处不作赘述。结果如下所示:
xgb.XGBRegressor(random_state=2,**params)
# Fit the classifier to the training set
xgr.fit(X_train, y_train)
y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
314551.59
所得结果与模型融合之前所得的结果基本相同。这就表明,MSE在测量误差时,异常值所占的比重较大。其实,我们还可以计算平均绝对误差(MAE),以此评估显著异常值产生的影响。在数学术语中,MAE可以用来计算残差绝对值的L1范数,而不是MSE应用的L2范数。我们可以将MAE与MSE的平方根进行比较,也称为均方根误差(RMSE)。
mean_absolute_error(y_test,y_pred), np.sqrt(mean_squared_error(y_test, y_pred))
(180.528167661507,560.8489939081992)
平均绝对误差仅为RMSE的1/3左右。模型的准确性比当初设想得要好很多。最后一步,根据XGRegressor了解每个特征的重要性:
zip(list(X.columns),xgr.feature_importances_):
print('Model weight for feature {}:{}'.format(feature, importance))
结果发现,prob_dud是最重要的特征,自定义的StandardsDays功能是第二重要的特征。(特征值0到14对应于缩小的标题嵌入向量。)
尽管通过这一轮模型融合没有改善整体的结果,但我们成功发现了数据变化的重要来源。
如果继续研究下去,笔者会考虑使用外部数据来增加数据量,包括通过分箱或散列将Source作为变量,在原始的300维向量上运行模型,使用每篇文章在不同时间的人气的“分片”数据来进行预测。
本文所使用的原始数据可以参见:
https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms