[Exclusive] Predicting Stock Index Futures Movements with Professor Zhou Zhihua's gcForest (Multi-Grained Cascade Forest) Algorithm (4)
2017-04-27
Slicing Sequence...
Training MGS Random Forests...
Slicing Sequence...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.5964125560538116
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.5695067264573991
After changing the parameters to shape_1X=[1, 13], window=[1, 6], the validation accuracy during training only reaches about 0.59, which is not ideal. This is only meant as a starting point; serious hyperparameter tuning is left to more experienced readers.
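For readers who want to experiment with the parameters, a simple starting point is to loop over a few candidate window settings and compare accuracies. The sketch below is only illustrative: it assumes the gcForest class and the X_tr, y_tr, X_te, y_te splits from the earlier parts of this series are already defined, the candidate windows are arbitrary, and a proper search should use a separate validation set rather than the test set.

# Illustrative tuning loop (a sketch, not the author's procedure)
from sklearn.metrics import accuracy_score

best_window, best_acc = None, 0.0
for w in ([1, 3], [1, 6], [1, 13]):  # arbitrary candidate window sizes
    model = gcForest(shape_1X=[1, 13], window=w, tolerance=0.0,
                     min_samples_mgs=10, min_samples_cascade=7)
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_true=y_te, y_pred=model.predict(X_te))
    print('window =', w, 'accuracy =', acc)
    if acc > best_acc:
        best_window, best_acc = w, acc
print('best window:', best_window, 'best accuracy:', best_acc)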
Now let's check the predictions on the test set:
pred_X = gcf.predict(X_te)
print(len(pred_X))
print(len(y_te))
print(pred_X)

Slicing Sequence...
Slicing Sequence...
549
549
[1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 ...]

# The most recent predictions
for i in range(1, len(pred_X)):
    print(y_te[-i], pred_X[-i], -i)

0 1 -1
0 0 -2
1 0 -3
1 0 -4
0 1 -5
...

# Save each day's prediction result: append 1 if that day's prediction was correct, 0 if it was wrong
result_list = []
# Check whether a prediction was correct
def checkPredict(i):
    if pred_X[i] == y_te[i]:
        result_list.append(1)
    else:
        result_list.append(0)
# Plot the accuracy over the (k+1)-th most recent period of length j
k = 0
j = len(y_te)
# j = 100
for i in range(len(y_te) - j * (k + 1), len(y_te) - j * k):
    checkPredict(i)
    # print(y_pred[i])
# return result_list

print(len(y_te))
print(len(result_list))
import matplotlib.pyplot as plt
# Plot the accuracy curve
x = range(0, len(result_list))
y = []
# z = []
for i in range(0, len(result_list)):
    # y.append((1 + float(sum(result_list[:i])) / (i + 1)) / 2)
    y.append(float(sum(result_list[:i])) / (i + 1))
print('accuracy over the last', j, 'predictions:', y[-1])
print(x, y)
line, = plt.plot(x, y)
plt.show()

549
549
accuracy over the last 549 predictions: 0.5300546448087432
range(0, 549) [0.0, 0.0, 0.3333333333333333, 0.25, ...]
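As a side note, the loop above recomputes sum(result_list[:i]) at every step, which is quadratic in the number of days. The same running-accuracy curve can be obtained in one pass; the sketch below assumes numpy is available and reproduces the same y values (including the 0.0 at the first point, where the slice is empty).

# One-pass equivalent of the running-accuracy loop above (illustrative sketch)
import numpy as np

res = np.asarray(result_list, dtype=float)
cum = np.concatenate(([0.0], np.cumsum(res)[:-1]))  # sum of result_list[:i] for each i
y_fast = cum / np.arange(1, len(res) + 1)           # divide by (i + 1)
print('accuracy over the last', j, 'predictions:', y_fast[-1])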
# evaluating accuracy
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))

gcForest accuracy : 0.5300546448087432
The prediction results are only mediocre, but the model still shows some predictive value.
Predicting market ups and downs doesn't look all that reliable, but gcForest is genuinely impressive at recognizing handwritten digits.
Only the results are shown below:
# loading the data
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X = digits.data
y = digits.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4)
gcf = gcForest(shape_1X=[7, 8], window=[4, 6], tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
#gcf = gcForest(shape_1X=13, window=13, tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
gcf.fit(X_tr, y_tr)
Slicing Images...
Training MGS Random Forests...
Slicing Images...
Training MGS Random Forests...
Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9814814814814815
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9814814814814815

# evaluating accuracy
pred_X = gcf.predict(X_te)
accuracy = accuracy_score(y_true=y_te, y_pred=pred_X)
print('gcForest accuracy : {}'.format(accuracy))
gcForest accuracy : 0.980528511821975
Impressive: even with these simple parameter settings, the handwritten-digit recognition accuracy reaches 98%.
Using Multi-Grained Scanning and the Cascade Forest Separately
Since the multi-grained scanning and cascade forest modules are fairly independent, they can be used on their own.
If a target 'y' is given, the code will automatically use it for training; otherwise it will reuse the last trained random forests to slice the data.
gcf = gcForest(shape_1X=[8, 8], window=5, min_samples_mgs=10, min_samples_cascade=7)
X_tr_mgs = gcf.mg_scanning(X_tr, y_tr)
Slicing Images...
Training MGS Random Forests...
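The test set has to go through the same multi-grained scanning before it can be fed to the cascade forest below. No target is passed here, so (as described above) the random forests trained just above are reused to slice the data; the resulting X_te_mgs is what the later cascade-forest calls rely on.

# No target given: the last trained MGS random forests are reused to slice X_te
X_te_mgs = gcf.mg_scanning(X_te)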
It is now possible to use the mg_scanning output as input for cascade forests using different parameters. Note that the cascade forest module does not directly return predictions but probability predictions from each Random Forest in the last layer of the cascade. Hence the need to first take the mean of the output and then find the max.
gcf = gcForest(tolerance=0.0, min_samples_mgs=10, min_samples_cascade=7)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)
Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9722222222222222
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9907407407407407
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9814814814814815
import numpy as np

pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

0.97774687065368571

gcf = gcForest(tolerance=0.0, min_samples_mgs=20, min_samples_cascade=10)
_ = gcf.cascade_forest(X_tr_mgs, y_tr)
pred_proba = gcf.cascade_forest(X_te_mgs)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
accuracy_score(y_true=y_te, y_pred=preds)

Adding/Training Layer, n_layer=1
Layer validation accuracy = 0.9629629629629629
Adding/Training Layer, n_layer=2
Layer validation accuracy = 0.9675925925925926
Adding/Training Layer, n_layer=3
Layer validation accuracy = 0.9722222222222222
Adding/Training Layer, n_layer=4
Layer validation accuracy = 0.9722222222222222

0.97218358831710705
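Because this mean-then-argmax step is needed every time the cascade forest is used directly, it can be handy to wrap it in a small helper. The function below is just a convenience sketch built around the calls shown above; it is not part of the gcForest API.

# Hypothetical helper (not part of the gcForest library): average the class
# probabilities returned by each forest of the last cascade layer, then take
# the most probable class per sample.
def cascade_predict(model, X):
    proba_per_forest = model.cascade_forest(X)       # list of probability arrays
    mean_proba = np.mean(proba_per_forest, axis=0)   # average over the forests
    return np.argmax(mean_proba, axis=1)             # predicted class labels

preds = cascade_predict(gcf, X_te_mgs)
print(accuracy_score(y_true=y_te, y_pred=preds))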
Skipping mg_scanning
It is also possible to use the cascade forest directly and skip the multi-grained scanning step.
gcf = gcForest(tolerance=0.0, min_samples_cascade=20)
_ = gcf.cascade_forest(X_tr, y_tr)
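Evaluating this cascade-only model follows the same mean-then-argmax pattern as before. The sketch below assumes the digits split X_te, y_te from above is still in scope; no reference accuracy is quoted here, so the printed number is simply whatever this run produces.

# Sketch: evaluate the cascade-only model on the raw (unscanned) test features
pred_proba = gcf.cascade_forest(X_te)
tmp = np.mean(pred_proba, axis=0)
preds = np.argmax(tmp, axis=1)
print('cascade-only accuracy : {}'.format(accuracy_score(y_true=y_te, y_pred=preds)))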