[Exclusive] Professor Zhou Zhihua's gcForest (Multi-Grained Cascade Forest) Algorithm for Predicting Stock Index Futures Moves (2)
2017-04-27
I have recently improved the memory usage when slicing the data (as of version 0.1.4) but will keep looking for ways to optimize the code.
OOB score error
During Random Forest training, the Out-Of-Bag (OOB) technique is used to obtain the prediction probabilities. This technique can sometimes raise an error when one or several samples end up being used in the training of every tree.
A potential workaround is to use cross-validation instead of the OOB score, although it slows down training. In practice, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.
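The trade-off above can be sketched with plain scikit-learn; note this is generic `RandomForestClassifier` usage, not the gcForest code itself:

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# With few trees, some samples land in every tree's bootstrap and get no
# OOB prediction; scikit-learn then warns and leaves gap rows (NaN or
# all-zero, depending on the version) in oob_decision_function_.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    few = RandomForestClassifier(n_estimators=5, oob_score=True,
                                 random_state=0).fit(X, y)
oob = few.oob_decision_function_
has_gaps = bool(np.isnan(oob).any() or (oob.sum(axis=1) == 0).any())

# More trees make such gaps very unlikely ...
many = RandomForestClassifier(n_estimators=200, oob_score=True,
                              random_state=0).fit(X, y)

# ... and cross_val_predict is the slower but gap-free alternative.
cv_proba = cross_val_predict(many, X, y, cv=3, method="predict_proba")
```

With 5 trees the chance that some sample was in-bag for all of them is high, so `has_gaps` is expected to be `True`; the cross-validated probabilities have no gaps by construction.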
Built With
PyCharm community edition
memory_profiler library
License
This project is licensed under the MIT License (see LICENSE for details)
Early Results
(will be updated as new results come out)
Scikit-learn handwritten digits classification:
training time ~ 5 min
accuracy ~ 98%
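For context, here is a plain random-forest baseline on the same scikit-learn digits data set. This is not gcForest; it just situates the ~98% figure against a standard classifier:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 8x8 handwritten digits, 10 classes
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(acc)  # usually well above 0.9 on this data set
```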
Partial code:
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

__author__ = "Pierre-Yves Lablanche"
__email__ = "plablanche@aims.ac.za"
__license__ = "MIT"
__version__ = "0.1.3"
__status__ = "Development"

# noinspection PyUnboundLocalVariable
class gcForest(object):

    def __init__(self, shape_1X=None, n_mgsRFtree=30, window=None, stride=1,
                 cascade_test_size=0.2, n_cascadeRF=2, n_cascadeRFtree=101,
                 cascade_layer=np.inf, min_samples_mgs=0.1,
                 min_samples_cascade=0.05, tolerance=0.0, n_jobs=1):
        """ gcForest Classifier.
About scale
The main technical issue in the current gcForest implementation is memory usage when slicing the input data. A quick calculation gives a feel for the number and size of the objects the algorithm will handle.
Take a problem with N samples of size [l,L] and C classes; the initial data size is then:
N × l × L
Slicing Step
If the window is of size [wl,wL] and the chosen strides are [sl,sL], then the number of slices per sample is:
((l − wl)/sl + 1) × ((L − wL)/sL + 1)
Obviously each slice has size [wl,wL], hence the total size of the sliced data set is:
N × ((l − wl)/sl + 1) × ((L − wL)/sL + 1) × wl × wL
This is when memory consumption reaches its peak.
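Plugging illustrative numbers into the slice count above (a 28×28 sample scanned with a 7×7 window at stride 1):

```python
# Worked example of the slicing arithmetic; all numbers are illustrative.
N = 1000            # samples
l, L = 28, 28       # sample dimensions
wl, wL = 7, 7       # window size
sl, sL = 1, 1       # strides

# slices per sample: ((l - wl)/sl + 1) * ((L - wL)/sL + 1)
n_slices = ((l - wl) // sl + 1) * ((L - wL) // sL + 1)
print(n_slices)      # 484

# each slice holds wl * wL values, so the sliced data set holds:
total_values = N * n_slices * wl * wL
print(total_values)  # 23716000 values at the memory peak
```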
Class Vector after Multi-Grain Scanning
All slices are then fed to the random forests to generate class vectors. The number of class vectors per random forest, per window, per sample is simply equal to the number of slices given to that random forest.
Hence, if we have Nrf random forests per window, the class-vector output per sample (recall we have N samples and C classes) has size:
Nrf × ((l − wl)/sl + 1) × ((L − wL)/sL + 1) × C
And finally the total size of the Multi-Grain Scanning output will be:
N × Nrf × ((l − wl)/sl + 1) × ((L − wL)/sL + 1) × C
This short calculation is just meant to give you an idea of the data processed during the Multi-Grain Scanning phase. The actual memory consumption depends on the data type used (i.e. float, int, double, etc.), and it is worth looking at carefully when dealing with large datasets.
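Continuing the same worked example for the Multi-Grain Scanning output (numbers are illustrative, matching the 28×28 / 7×7 slicing case):

```python
# Size estimate for the class-vector output of Multi-Grain Scanning.
N, C = 1000, 10     # samples and classes
n_slices = 484      # slices per sample from the slicing step above
n_rf = 2            # random forests per window (Nrf)

# each forest emits one C-dimensional class vector per slice, so per
# sample a single window contributes n_rf * n_slices * C values
per_sample = n_rf * n_slices * C
mgs_output = N * per_sample
print(per_sample)   # 9680
print(mgs_output)   # 9680000
```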
Predicting the direction of each bar
After obtaining the trading data for each bar (K-line), we use open, close, high, low, volume, ema, macd, linreg, momentum, rsi, var, cycle and atr as feature indicators, and the next bar's up/down move as the prediction target.
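As a sketch, two of the listed features (ema, rsi) and the up/down label can be built from a close-price series with pandas alone; TA-Lib's versions (e.g. Wilder-smoothed RSI) will differ slightly:

```python
import numpy as np
import pandas as pd

# toy close-price series, purely illustrative
close = pd.Series([100., 101., 102., 101., 103., 104., 103., 105.,
                   106., 105., 107., 108., 107., 109., 110., 111.])

# exponential moving average (stand-in for talib.EMA)
ema = close.ewm(span=12, adjust=False).mean()

# simple 14-period RSI from average gains/losses; TA-Lib's RSI applies
# Wilder's smoothing instead of a plain rolling mean
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
rsi = 100 - 100 / (1 + gain / loss)

# label: 1 if the NEXT bar closes higher, else 0 (the prediction target)
label = (close.shift(-1) > close).astype(int)
```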
# get the current time
from datetime import datetime
now = datetime.now()
startDate = '2010-4-16'
endDate = now
# get CSI 300 stock index futures data (frequency '1d' = daily bars)
df = get_price('IF88', start_date=startDate, end_date=endDate,
               frequency='1d', fields=None, country='cn')
open = df['open'].values
close = df['close'].values
volume = df['volume'].values
high = df['high'].values
low = df['low'].values
import talib as ta
import pandas as pd
import numpy as np
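Putting the pieces together, here is a minimal sketch of the train/predict setup described above, run on synthetic OHLCV-style data (the platform's `get_price` API and the actual futures data are not assumed here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic random-walk bars standing in for IF88 daily data
rng = np.random.default_rng(0)
n = 500
close = 100 + np.cumsum(rng.normal(0, 1, n))
open_ = np.roll(close, 1)
open_[0] = close[0]
high = np.maximum(open_, close) + rng.random(n)
low = np.minimum(open_, close) - rng.random(n)
volume = rng.integers(1_000, 10_000, n).astype(float)

# features for bar i, label = direction of bar i+1 (1 up, 0 down)
X = np.column_stack([open_, close, high, low, volume])[:-1]
y = (np.diff(close) > 0).astype(int)

# keep time order: no shuffling when splitting a price series
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          shuffle=False)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(acc)
```

On a pure random walk the accuracy hovers around chance; the point is only the shape of the pipeline (features from bar i, label from bar i+1, time-ordered split), into which gcForest or real indicator features can be substituted.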