Statistical analysis of precipitation data with Python 3 - Tutorial
/Usually we use probabilistic approaches when dealing with extreme events since the size of available data is scarce to address the maximum for a determined return period. Precipitation data present challenges when we try to fit to a statistical distribution. This is an exercise done with 10 years record of precipitation for the Morropon station (Piura-Peru) affected by El Niño and La Niña events in years 98 and 99 respectively.
This tutorial is intended to be a basic reference in Python programming for water resources professionals. The script covers many steps on the data management, representation and analysis with the most common Python commands and libraries. It is encouraged to review first all the described steps with the provided input data and then apply to the user own hydrological data. The same analysis will be ported to Julia on future tutorials to have a overview of the similarities / differences between programming in Python or Julia for hydrological data.
Tutorial
Code
This is the python code used for the tutorial:
Install the required libraries
For this tutorial we use numpy, scipy and pandas plus a interactive mode on matplotlib.
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
Read and plot precipitation data
Pandas makes easy to read comma separated values, excel files or text files with precipitation data, the library has also options for value plot.
datos = pd.read_csv('../Csv/PptMorropon.csv', index_col=0, na_values=[-9999], parse_dates=True)
datos.plot(figsize=(12,8))
<matplotlib.axes._subplots.AxesSubplot at 0x98839b0>
Create a histogram of precipitation
A histogram of the precipitation values is represented , then values higher than 0 are filtered.
datos.hist(figsize=(12,8))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000A981358>]],
dtype=object)
datos_ppt = datos[datos>0]
datos_ppt.hist(figsize=(12,8))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000AA23A58>]],
dtype=object)
As we can see, precipitation data are quite skewed. We will work on the distribution to fit it with “acceptably uncertainty” to a statistical distribution.
Normal and lognormal distribution plot
Mean and the standard deviation calculation, stadistical distribution array and distribution plot with the histogram.
promedio = datos_ppt.mean()
desviacion = datos_ppt.std()
print(promedio)
print(desviacion)
Ppt_mm 8.495465
dtype: float64
Ppt_mm 16.213448
dtype: float64
tabulaciones=np.arange(-40,51,0.1)
distnormal = stats.norm.pdf(tabulaciones, loc=promedio, scale=desviacion)
distlognormal = stats.pearson3.pdf(tabulaciones,skew=1,loc=promedio, scale=desviacion)
datos_ppt.hist(bins=100,density=True,figsize=(12,8))
plt.plot(tabulaciones,distnormal, color='red', label='distnormal')
plt.plot(tabulaciones,distlognormal, color='grey', label='distlognormal')
plt.xlim(0,30)
plt.legend(loc='upper right')
<matplotlib.legend.Legend at 0xab28ef0>
Since the adjustment of the normal and lognormal distribution depend on the arithmetic mean and the standard deviation, those are measure locations not conservative and are highly affected by the extreme events. We have to “shape” the data in order to have a precipitation that fits reasonably a statistical distribution.
Trimming data
We decided to cut the upper 20 percent of the values
datos_ppt['Ppt_mm'].quantile(0.8)
8.439999771400004
datos_cut=datos_ppt[(datos_ppt['Ppt_mm']<8.4)]
datos_cut.describe()
Ppt_mm | |
---|---|
count | 335.000000 |
mean | 2.122687 |
std | 2.031877 |
min | 0.100000 |
25% | 0.600000 |
50% | 1.200000 |
75% | 3.000000 |
max | 8.400000 |
Plotting trimmed data
We define new statistics for the trimmed data and plot statistical distributions.
promedio_cut=datos_cut.mean()
desviacion_cut=datos_cut.std()
tabulaciones_cut=np.arange(-10,20,0.1)
distnormal_cut = stats.norm.pdf(tabulaciones_cut, loc=promedio_cut, scale=desviacion_cut)
distlognormal_cut = stats.pearson3.pdf(tabulaciones_cut,skew=1,loc=promedio_cut, scale=desviacion_cut)
datos_cut.hist(density=True,bins=15,figsize=(24,16))
plt.plot(tabulaciones_cut,distnormal_cut, color='red', label='distnormal')
plt.plot(tabulaciones_cut,distlognormal_cut, color='grey', label='distlognormal')
plt.xlim(0,15)
plt.legend(loc='upper right')
<matplotlib.legend.Legend at 0xaeecac8>
The two statistical distributions can be compared to analize the stadistical distribuition that fits better to the precipitation.