Learn by Playing, Easy Self-Study: [Machine Learning Practice] Predicting River Water Level Data with Scikit (Part 3)

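Barring surprises, this is the last post in this tutorial.

Three steps remain:

1. Choose the features; the label is, naturally, the water level.

2. Choose an optimization method and train.

3. Use the trained model to make predictions.

Let's begin!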

------------------------------------------------------------------------------------------------------------

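Step 5 : Choose the features

Last time, while choosing features, I mentioned that the water level's response time is the key factor,

but we don't know what the response time of the Dianbao River (典寶溪) basin actually is,

and I have never studied hydrology, so I can't compute it with a model.

That's fine: machine learning plus a little statistics might solve this.

First, take each rainfall station's data from one to ten hours earlier. The code: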

for i in range(1, 11):
    df['ctrf' + str(i)] = df['ctrf'].shift(i)    # CT rainfall i hours earlier
    df['dasrf' + str(i)] = df['dasrf'].shift(i)  # DAS rainfall i hours earlier

df.fillna(0, inplace=True)  # shift() leaves NaN in the first rows; fill them with 0
df_corr = df.corr()         # pairwise correlation between all df columns
print(df_corr['wll'])       # what we care about is the correlation with the water level (wll)
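To see what this lag-and-correlate step does, here is a minimal sketch on synthetic data (not the Dianbao River records). The 4-hour response, the column names rf and wll, and all numbers are assumptions made up for illustration; the point is only that the lag with the highest correlation recovers the delay built into the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic hourly rainfall (made up for illustration)
rain = pd.Series(rng.gamma(shape=0.5, scale=4.0, size=n))

# Assumption: the water level responds to rainfall 4 hours later, plus noise
level = rain.shift(4).fillna(0) * 0.1 + rng.normal(0, 0.05, n)

demo = pd.DataFrame({'rf': rain, 'wll': level})

# Same pattern as the post: one column per lag, 1 to 10 hours
for i in range(1, 11):
    demo['rf' + str(i)] = demo['rf'].shift(i)
demo.fillna(0, inplace=True)

# Correlation of every lagged column with the water level
lag_corr = demo.corr()['wll'].drop(['wll', 'rf'])
best_lag = lag_corr.idxmax()

print(lag_corr.round(2))
print('best lag column:', best_lag)
```

On the real df the same idea applies per station: the lagged column with the largest coefficient hints at the basin's response time.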

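In a terminal, from the project folder, run

python DBriver.py

and you get the following output (you can comment out plt.show() for now).

We can see that the current water level correlates most strongly with the rainfall four hours earlier.

The CT station's rainfall also correlates more strongly with the water level than the DS station's (dasrf in the code) does.

To be lazy, I only used CT for the features. If you're interested, try DS as well;

maybe the results will be even better?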

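OK, besides ctrf4, I also picked ctrf2, ctrf3, ctrf5, ctrf6, and ctrf7 as features.

Why? Because their correlation coefficients are above 0.5.

Why 0.5? I don't know, just a feeling.

Anyway, a nice thing about machine learning is that handing it a few extra features is fine:

as long as they really are related to the target, the optimization will weight them for you :)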
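The "it will weight them for you" remark can be checked on toy data. The sketch below is an illustration under assumptions (synthetic features, made-up coefficients), not the river model: the target depends on two of three features, and the fitted LinearRegression gives the unrelated one a coefficient near zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000

# Three synthetic features; only the first two influence the target
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

model = LinearRegression().fit(X, y)
coef = model.coef_

# The third coefficient should come out close to zero
print(coef.round(2))
```

This works well when there is plenty of data relative to the number of features; with many noisy features and few samples, extra features can still hurt.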

With rainfall covered, we still need water level data. This time I'm forecasting the water level two hours ahead,

so I added this line just before df.fillna(0, inplace=True):

df['wll_feature2'] = df['wll'].shift(2)
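If shift is unfamiliar, this tiny example (made-up water levels, purely hypothetical values) shows what the new column contains: each row gets the value from two rows earlier, and the first two rows are NaN until fillna replaces them.

```python
import pandas as pd

# Hypothetical hourly water levels, for illustration only
demo = pd.DataFrame({'wll': [1.0, 1.2, 1.5, 1.9, 1.6]})

# Each row now also carries the water level observed two rows (hours) earlier
demo['wll_feature2'] = demo['wll'].shift(2)

print(demo)
```

Filling those NaN rows with 0 treats them as a water level of zero; dropping them with dropna() is an alternative worth considering.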

Good! With the features chosen, comment out the corr() part (prefix the lines with #).

Also update the visualization code, since we no longer need the DS rainfall chart!

ax1 = plt.subplot2grid((2,2),(0,0))
ax2 = plt.subplot2grid((2,2),(1,0), sharex=ax1)
ax3 = plt.subplot2grid((2,2),(0,1)) # we are going to put forecasted data here
ax4 = plt.subplot2grid((2,2),(1,1))

df[['crc', 'wll']].plot(ax=ax1, linewidth=1, color=['r', 'b'])
df['ctrf'].plot(ax=ax2, label="rf1", linewidth=1, color='g')

ax1.set_ylabel('Waterlevel')
ax2.set_ylabel('CT rainfall')

Done! As practice, try print(df.head()) or plt.show() yourself to see what the data looks like now!

------------------------------------------------------------------------------------------------------------

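Step 6 : Choose an optimization method and train

In principle, this step is hard!

But thanks to good old Scikit, we only need to call it and the results come out~

(If I think of it later, I'll share the ideas behind SVM, KMeans, and LinearRegression and how to write them.)

Anyway: thank you, Scikit.

For trend-like data such as ours, LinearRegression is the obvious first choice.

You can of course try other algorithms or swap kernels, but to keep things simple

I use LinearRegression directly, and it also gave the best results here. (The cheat sheet for choosing an algorithm is reference [9] below.)

Add the following code to the program: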

# Imports needed by this block. train_test_split lived in sklearn.cross_validation
# before scikit-learn 0.18; that module was removed in 0.20.
import numpy as np
from sklearn import svm  # only needed for the commented-out SVR line
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df_feature = df[['wll_feature2','ctrf2','ctrf3','ctrf4','ctrf5','ctrf6','ctrf7']]  # extract the feature set
X = np.array(df_feature)  # convert to array form
#X = preprocessing.scale(X)  # I ended up deciding not to use this; the reason is straightforward...
y = np.array(df['wll'])  # water level as the label

# This step is very important: it's validation! Change the test_size parameter to
# set the size of the test set and the size of the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

clf = LinearRegression(n_jobs=10)  # as I said, LinearRegression
# Kept for reference: change the kernel parameter to try different support vector regressions
#clf = svm.SVR(kernel='poly')

clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # for a regressor, score() is R^2 on the test set, not an error

print('accuracy', accuracy)
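To compare LinearRegression with the commented-out SVR alternative, both can be fitted on the same split and scored. The sketch below uses synthetic, roughly linear data (an assumption; results on the river data may differ), so it only shows the mechanics of the comparison, not which model is better for water levels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and label (made up for illustration)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=300)

# Fix random_state so both models see exactly the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

results = {}
for name, model in [('linear', LinearRegression()),
                    ('svr_poly', SVR(kernel='poly'))]:
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)  # R^2 on the held-out data

print(results)
```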

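train_test_split(X, y, test_size=0.1)

I want to single this call out (it is cross_validation.train_test_split in older scikit-learn releases). I think it matters a lot, though I may be misunderstanding something.

What sets this step apart from traditional statistical practice: suppose I have ten thousand data points.

On each run I pick nine thousand of them at random to train the model, then hand the remaining thousand to the model to predict,

and compute the score from those. What does that imply?

The accuracy value will be different on every run! Yet it is objective, because the model is scored on data it never saw during training.

That completes the whole training step.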
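The run-to-run variation is easy to reproduce. In this minimal sketch (synthetic data, assumed only for illustration), the same model is trained on five different random 90/10 splits, and each split yields a slightly different score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data standing in for the features and water level
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.5, size=1000)

scores = []
for seed in range(5):
    # A different random_state gives a different 90/10 split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=seed)
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

print(np.round(scores, 3))
```

Passing random_state makes a split reproducible; averaging the scores of several splits gives a steadier estimate than any single run.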

In a terminal, from the project folder, run

python DBriver.py

My accuracy came out as 0.92519986705211898.
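A note on what this number means: for regressors such as LinearRegression, score() returns the coefficient of determination R^2, not a classification accuracy. The sketch below (synthetic data, assumed for illustration) computes R^2 by hand as 1 - SS_res / SS_tot and checks that it matches score().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic regression data (made up for illustration)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Train on the first 150 rows, test on the remaining 50
model = LinearRegression().fit(X[:150], y[:150])
X_test, y_test = X[150:], y[150:]

y_pred = model.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = model.score(X_test, y_test)
print(r2_manual, r2_sklearn)
```

Read this way, a score near 0.925 says the model explains roughly 92.5% of the variance in the held-out water levels.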

------------------------------------------------------------------------------------------------------------

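Step 7 : FORECASTING!

We've finally made it: time to see whether the little machine we trained is smart enough to forecast the water level.

I took a quick look at this year's heavy rain events;

they fall mainly around June 10 and around July 6, with Typhoon Nepartak.

After forecasting we still need to come up with a way to compute the error, so I'll add one more post after all~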
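As a rough preview of what forecasting plus an error measure can look like, here is a minimal sketch. Root-mean-square error (RMSE) is just one common choice, assumed here; it is not necessarily the method used in the next post, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features and "water level" (made up for illustration)
X = rng.normal(size=(500, 3))
y = X @ np.array([0.8, 0.4, 0.2]) + rng.normal(scale=0.1, size=500)

# Keep the time order: train on the first 400 rows, forecast the last 100
clf = LinearRegression().fit(X[:400], y[:400])
forecast = clf.predict(X[400:])

# RMSE is in the same unit as the label, which makes it easy to interpret
rmse = np.sqrt(np.mean((y[400:] - forecast) ** 2))
print('RMSE:', rmse)
```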

------------------------------------------------------------------------------------------------------------

Reference

[1] "Forecasting Time Series Water Levels on Mekong River Using Machine Learning Models", doi:10.1109/KSE.2015.53

[2] "Application of Support Vector Machine in Lake Water Level Prediction", http://ascelibrary.org/doi/abs/10.1061/(ASCE)1084-0699(2006)11%3A3(199)

[3] "Integrating Support Vector Regression and a geomorphologic Artificial Neural Network for daily rainfall-runoff modelling", http://www.sciencedirect.com/science/article/pii/S1568494615006304

[4] Scikit-Learn, http://machine-learning-python.kspax.io

[5] Scikit-Learn, http://scikit-learn.org/stable/

[6] Hydrological Information Network (水文資訊網), http://gweb.wra.gov.tw/hyis/index.aspx

[7] Water Resources Agency Disaster Prevention Information Service (水利署防災資訊服務網), http://fhy.wra.gov.tw/fhy/

[8] Dianbao River Drainage Management Plan (典寶溪排水治理計畫), Water Resources Agency, Ministry of Economic Affairs

[9] Scikit cheat sheet, http://scikit-learn.org/stable/tutorial/machine_learning_map/

Final part:

[Machine Learning Practice] Predicting River Water Level Data with Scikit (Part 4)

Monday, August 8, 2016
