Fill Missing Precipitation Data with Machine Learning in Python and Scikit-Learn - Tutorial
/Evaluation of hydrological processes as evapotranspiration, runoff, routing and infiltration require data as precipitation, flow, temperature and radiation on a daily basis. Required data for hydrological modeling need to be accurate and must be completed over the study period. It is common that historical data from hydrological stations are incomplete and has many gaps that can be filled by the use of machine learning algorithms like Scikit-Learn in Python3. This tutorial shows a applied procedure to run a complete script for the filling of missing precipitation in one station by the use of data from 2 nearby stations. The script will be run on Python 3.9 on a Anaconda Prompt environment.
Tutorial
Code
#!pip install sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
precipStations = pd.read_csv('../Data/Est1_Est2_Est3.csv',index_col=0,parse_dates=True)
precipStations.head()
Est1 | Est2 | Est3 | |
---|---|---|---|
Fecha | |||
2014-07-25 | 0.0 | NaN | 0.0 |
2014-07-26 | 0.6 | NaN | 0.0 |
2014-07-27 | 0.0 | NaN | 0.0 |
2014-07-28 | 0.0 | NaN | 0.0 |
2014-07-29 | 0.2 | NaN | 0.2 |
fig, axs=plt.subplots(3,1,figsize=(12,8),sharex=True,sharey=True)
axs[0].plot(precipStations.index,precipStations['Est1'],label='Est1')
axs[0].legend()
axs[1].plot(precipStations.index,precipStations['Est2'],label='Est2')
axs[1].legend()
axs[2].plot(precipStations.index,precipStations['Est3'],label='Est3')
axs[2].legend()
plt.xticks(rotation='vertical')
plt.show()
precipNotNan = precipStations.dropna()
precipNotNan
Est1 | Est2 | Est3 | |
---|---|---|---|
Fecha | |||
2014-08-23 | 0.0 | 0.0 | 0.0 |
2014-08-24 | 0.0 | 0.0 | 0.0 |
2014-08-25 | 0.0 | 0.0 | 0.0 |
2014-08-26 | 0.0 | 0.0 | 0.0 |
2014-08-27 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... |
2015-06-21 | 0.0 | 0.0 | 0.0 |
2015-06-22 | 0.0 | 0.0 | 0.0 |
2015-06-23 | 0.0 | 0.0 | 0.0 |
2015-06-24 | 0.0 | 0.0 | 0.0 |
2015-06-25 | 3.8 | 2.2 | 0.0 |
229 rows × 3 columns
xTrain = precipNotNan[['Est1','Est3']]
yTrain = precipNotNan[['Est2']].values.flatten()
print(xTrain[:10])
print(yTrain[:10])
Est1 Est3
Fecha
2014-08-23 0.0 0.0
2014-08-24 0.0 0.0
2014-08-25 0.0 0.0
2014-08-26 0.0 0.0
2014-08-27 0.0 0.0
2014-08-28 0.0 0.0
2014-08-29 0.0 0.0
2014-08-30 0.0 0.0
2014-08-31 0.0 0.0
2014-09-01 2.8 3.4
[0. 0. 0. 0. 0. 0. 0. 0. 0. 3.4]
scaler = StandardScaler().fit(xTrain.values)
xTrainScaled = scaler.transform(xTrain.values)
print(xTrainScaled[:10])
[[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.04157671 0.123051 ]]
#check scaler
print(xTrainScaled.mean(axis=0))
print(xTrainScaled.std(axis=0))
[ 1.00841218e-16 -4.65421006e-17]
[1. 1.]
#regressor
regr = MLPRegressor(random_state=1, max_iter=5000).fit(xTrainScaled, yTrain)
#test
xTest = precipStations[['Est1','Est3']].dropna()
xTestScaled = scaler.transform(xTest.values)
print(xTest.describe())
print(xTestScaled[:10])
Est1 Est3
count 431.000000 431.000000
mean 2.376798 2.364269
std 4.948763 4.888289
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 2.200000 2.400000
max 36.000000 43.800000
[[-0.53894584 -0.50809201]
[-0.43236674 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.50341948 -0.47096595]
[-0.50341948 -0.43383989]
[ 0.98868793 0.97695037]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]
[-0.53894584 -0.50809201]]
#regression
yPredict = regr.predict(xTestScaled)
print(yPredict[:10])
[0.05938569 0.15278002 0.05938569 0.05938569 0.17877184 0.26853265
8.31785189 0.05938569 0.05938569 0.05938569]
#comparison of station 2
fig, ax=plt.subplots(figsize=(12,8),sharex=True,sharey=True)
ax.plot(precipStations.index,precipStations['Est2'],label='Est2')
ax.plot(xTest.index,yPredict,label='Est2Predict')
plt.legend()
plt.xticks(rotation='vertical')
plt.show()
#insert predicted values
precipStations.head()
Est1 | Est2 | Est3 | |
---|---|---|---|
Fecha | |||
2014-07-25 | 0.0 | NaN | 0.0 |
2014-07-26 | 0.6 | NaN | 0.0 |
2014-07-27 | 0.0 | NaN | 0.0 |
2014-07-28 | 0.0 | NaN | 0.0 |
2014-07-29 | 0.2 | NaN | 0.2 |
#create new column
precipStations['Est2Completed'] = 0
#fill the new column with original and predicted values for Est2
for index, row in precipStations.iterrows():
if np.isnan(row['Est2']) and ~np.isnan(row['Est1']) and ~np.isnan(row['Est3']):
rowScaled = scaler.transform([[row['Est1'],row['Est3']]])
precipStations.loc[index,['Est2Completed']] = regr.predict(rowScaled)
elif ~np.isnan(row['Est2']):
precipStations.loc[index,['Est2Completed']] = row['Est2']
else:
row['Est2Completed'] = np.nan
#show original and filled values
fig, axs=plt.subplots(4,1,figsize=(12,8),sharex=True,sharey=True)
axs[0].plot(precipStations.index,precipStations['Est1'],label='Est1')
axs[0].legend()
axs[1].plot(precipStations.index,precipStations['Est2'],label='Est2',color='orangered')
axs[1].legend()
axs[2].plot(precipStations.index,precipStations['Est2Completed'],label='Est2Completed',color='darkorange')
axs[2].legend()
axs[3].plot(precipStations.index,precipStations['Est3'],label='Est3')
axs[3].legend()
plt.xticks(rotation='vertical')
plt.show()
Input data
You can download the input data from this link.