
[Exclusive] Professor Zhou Zhihua's gcForest (multi-Grained Cascade Forest) algorithm for predicting stock index futures movements (2)

2017-04-27


  I have recently improved the memory usage (from version 0.1.4) when slicing the data but will keep looking at ways to optimize the code.

OOB score error. During Random Forests training, the Out-Of-Bag (OOB) technique is used for the prediction probabilities. This technique can sometimes raise an error when one or several samples are used in the training of every tree (leaving them with no out-of-bag prediction).

A potential solution is to use cross-validation instead of the OOB score, although this slows down training. In practice, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.
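A minimal sketch of the two options, using scikit-learn's iris data purely for illustration (the gcForest code itself is not required here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# Option 1: OOB probabilities -- fast, but a row can be NaN when that
# sample ended up in the bootstrap of every tree (no OOB prediction).
rf = RandomForestClassifier(n_estimators=30, oob_score=True, random_state=0)
rf.fit(X, y)
oob_proba = rf.oob_decision_function_      # shape (n_samples, n_classes)

# Option 2: cross-validated probabilities -- slower, but always defined.
cv_proba = cross_val_predict(
    RandomForestClassifier(n_estimators=30, random_state=0),
    X, y, cv=3, method="predict_proba")
```

Increasing `n_estimators` makes it less likely that any sample is in-bag for all trees, which is why simply re-running with more trees often avoids the error.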

Built With

  • PyCharm community edition

  • memory_profiler library

License

    This project is licensed under the MIT License (see LICENSE for details)

Early Results

    (will be updated as new results come out)

  • Scikit-learn handwritten digits classification:

      training time ~ 5min

      accuracy ~ 98%

  • Partial code:

      import itertools
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score

      __author__ = "Pierre-Yves Lablanche"
      __email__ = "plablanche@aims.ac.za"
      __license__ = "MIT"
      __version__ = "0.1.3"
      __status__ = "Development"

      # noinspection PyUnboundLocalVariable
      class gcForest(object):

          def __init__(self, shape_1X=None, n_mgsRFtree=30, window=None, stride=1,
                       cascade_test_size=0.2, n_cascadeRF=2, n_cascadeRFtree=101,
                       cascade_layer=np.inf, min_samples_mgs=0.1,
                       min_samples_cascade=0.05, tolerance=0.0, n_jobs=1):
              """ gcForest Classifier.
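The cascade structure hinted at by the `n_cascadeRF` and `cascade_layer` parameters can be sketched with plain scikit-learn. This is a simplified illustration, not the repository's implementation; the two-forests-per-level count follows the `n_cascadeRF=2` default above, and the feature-augmentation scheme follows the gcForest paper's description:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

level_input = X
for level in range(2):  # two cascade levels for the sketch
    # Each level trains several forests; their class-probability outputs
    # are concatenated to the ORIGINAL features to form the next level's input.
    probas = [cross_val_predict(
                  RandomForestClassifier(n_estimators=50, random_state=seed),
                  level_input, y, cv=3, method="predict_proba")
              for seed in (0, 1)]          # 2 forests per level (n_cascadeRF=2)
    level_input = np.hstack([X] + probas)  # 4 original features + 2 x 3 probas
```

In the real implementation, a validation-accuracy check (`tolerance`) decides when to stop adding levels instead of a fixed loop count.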

      About Scaling

    The main technical issue in the current gcForest implementation is memory usage when the data is fed in. A quick calculation gives an idea of the number and size of the objects the algorithm will handle.

    For a classification problem with C classes and N samples of shape [l, L], the initial data size is:

    N · l · L

    Slicing Step

      If the window is of size [wl, wL] and the chosen strides are [sl, sL], then the number of slices per sample is:

    ((l − wl)/sl + 1) · ((L − wL)/sL + 1)

    Each slice has size [wl, wL], hence the total size of the sliced data set is:

    N · ((l − wl)/sl + 1) · ((L − wL)/sL + 1) · wl · wL

    This is when the memory consumption reaches its peak.

    Class Vector after Multi-Grain Scanning

      Now all slices are fed to the random forests to generate class vectors. The number of class vectors per random forest, per window, per sample is simply equal to the number of slices fed to the random forest.

    Hence, if we have Nrf random forests per window, the size of the class-vector output is (recall we have N samples and C classes):

    N · Nrf · C · ((l − wl)/sl + 1) · ((L − wL)/sL + 1)

      And finally, the total size of the Multi-Grain Scanning output is this quantity summed over all windows.

    This short calculation is just meant to give you an idea of the data processing during the Multi-Grain Scanning phase. The actual memory consumption depends on the data type used (float, int, double, etc.), and it may be worth examining carefully when dealing with large datasets.
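Plugging assumed MNIST-like numbers into the formulas above makes the peak concrete (all parameter values here are illustrative choices, not values from the article):

```python
# Illustrative memory estimate for Multi-Grain Scanning.
N, C = 10000, 10     # samples, classes
l, L = 28, 28        # sample shape [l, L]
wl, wL = 7, 7        # window shape [wl, wL]
sl, sL = 1, 1        # strides [sl, sL]
Nrf = 2              # random forests per window

# Number of slices per sample: ((l - wl)/sl + 1) * ((L - wL)/sL + 1)
slices_per_sample = ((l - wl) // sl + 1) * ((L - wL) // sL + 1)

# Peak: total sliced data set, N * slices * wl * wL values
sliced_size = N * slices_per_sample * wl * wL

# Class-vector output of this window: N * Nrf * C * slices values
class_vec_size = N * Nrf * C * slices_per_sample

print(slices_per_sample, sliced_size, class_vec_size)
```

With float64 values, the ~237 million sliced entries alone would occupy roughly 1.9 GB for this single window, which is why the data type matters.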

      Predicting the Direction of Each Bar

    After obtaining the trading data for each bar (K-line), we use open, close, high, low, volume, ema, macd, linreg, momentum, rsi, var, cycle and atr as feature indicators, and the next bar's rise or fall as the prediction target.
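The label construction described above can be sketched as follows (a minimal, hypothetical example with made-up prices; the simple `diff` momentum stands in for the talib indicator):

```python
import pandas as pd

# Hypothetical close prices for six consecutive bars.
close = pd.Series([100.0, 101.5, 101.0, 102.3, 102.0, 103.1])

features = pd.DataFrame({
    "close": close,
    "momentum": close.diff(),   # stand-in for the momentum indicator
})

# Target: 1 if the NEXT bar closes higher than this one, else 0.
label = (close.shift(-1) > close).astype(int)

# Drop the first row (NaN momentum) and the last row (unknown next bar).
features, label = features.iloc[1:-1], label.iloc[1:-1]
```

The `shift(-1)` aligns each row with the following bar, so the model only ever sees features available before the outcome it predicts.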

      # Get the current time
      from datetime import datetime

      now = datetime.now()
      startDate = '2010-4-16'
      endDate = now

      # Fetch CSI 300 stock index futures (IF88) data at daily frequency
      df = get_price('IF88', start_date=startDate, end_date=endDate,
                     frequency='1d', fields=None, country='cn')
      open = df['open'].values
      close = df['close'].values
      volume = df['volume'].values
      high = df['high'].values
      low = df['low'].values


      import talib as ta
      import pandas as pd
      import numpy as np
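talib computes the listed indicators directly (e.g. `ta.EMA`, `ta.RSI`). Where talib is unavailable, two of them can be approximated in pandas; the formulas below are the commonly assumed textbook definitions, not taken from the article's code:

```python
import numpy as np
import pandas as pd

# Synthetic close series standing in for real bar data.
close = pd.Series(np.linspace(100, 110, 40) +
                  np.random.RandomState(0).randn(40))

# EMA: exponentially weighted mean, similar to ta.EMA(close, timeperiod=12)
ema = close.ewm(span=12, adjust=False).mean()

# RSI: ratio of average gains to average losses over a 14-bar window,
# similar in spirit to ta.RSI(close, timeperiod=14)
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
rsi = 100 - 100 / (1 + gain / loss)
```

Note that talib's RSI uses Wilder's smoothing rather than a simple rolling mean, so values will differ slightly from this sketch.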
