Time-Series Forecasting using Autoregressive Moving Average (ARMA)

Friday. April 20, 2018 - 8 mins

Time series :

data points that are collected sequentially at a regular interval with association over a time period is termed as time-series data.Time-series tend to have a linear relationship between lagged variables and this is called as autocorrelation. Hence a time-series historic data can be modelled to forecast the future data points without involvement of any other independent variables, these types of models are generally known as time-series forecasting.

Application of time series analysis

Some key areas of applications of time-series are sales forecasting, economic forecasting, stock market forecasting etc.

Autoregressive–moving-average model (ARMA)

Autoregressive–moving-average (ARMA) models provide a parsimonious description of a stationary stochastic process in terms of two polynomials, one for the autoregression and the second for the moving average.These models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARMA models are applied in some cases where data show evidence of non-stationarity. Source

Stationarity

A stationary process is a stochastic process whose unconditional joint probability distribution does not change when shifted in time. Consequently, parameters such as mean and variance, if they are present, also do not change over time. A time-series data having the mean and variance as constant is called stationary time-series.In order to apply ARMA for time series that is staionary we have to first check for stationarity and remove it.Source

Objective

We will analyse Hawaii Carbondioxide emission from from 1959 to 1990 and then forcast the future trend of the emission

Data

the dataset can be downloaded from Source

The Notebook

The notebook for this work can be found at [Autoregressive Moving Average (ARMA)]

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline

from scipy import stats
import statsmodels.api as sm
from statsmodels.graphics.api import qqplot

from statsmodels.tsa.stattools import adfuller

# function to calculate MAE, RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.api import qqplot

data = pd.read_csv('./time_series/Carbon_dioxide_emissions_in_Hawaii.csv')

Display first five row of the dataset

print data.head()

     Month  Carbondioxide
1959M01         315.42
1959M02         316.32
1959M03         316.49
1959M04         317.56
1959M05         318.13

check the data types

data.dtypes

Month             object
Carbondioxide    float64
dtype: object

We will replace M with - for easier conversion to datetime

data.Month = data.Month.str.replace('M','-')

print data.head()

     Month  Carbondioxide
1959-01         315.42
1959-02         316.32
1959-03         316.49
1959-04         317.56
1959-05         318.13

Convert Month to datetime and make it an index of the series

time_series = pd.Series(list(data['Carbondioxide']),
                        index=pd.to_datetime(data['Month'],format='%Y-%m'))

print time_series.head()

Month
1959-01-01    315.42
1959-02-01    316.32
1959-03-01    316.49
1959-04-01    317.56
1959-05-01    318.13
dtype: float64

Visualize the trend and the seasonality of the data set

import matplotlib
font = {'family' : 'normal',
        'weight' : 'normal',
        'size'   : 18}

matplotlib.rc('font', **font)
decomposition = seasonal_decompose(time_series)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.figure(figsize=(17,9))
plt.subplot(411)
plt.title('Time Series - Decomposed')
plt.plot(time_series, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.tight_layout()

png

checking for staionarity

s_test = adfuller(time_series, autolag='AIC')
# extract p value from test results
print "p value > 0.05 means data is non-stationary: ", s_test[1]

p value > 0.05 means data is non-stationary:  1.0

plt.figure(figsize=(17,8))
plt.plot(time_series)

[<matplotlib.lines.Line2D at 0xa60b448c>]

png

Lets visulize Auto correlation test

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(time_series.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(time_series, lags=40, ax=ax2)

png

Build the Model

arma_mod20 = sm.tsa.ARMA(time_series, (2,0)).fit(disp=False)

Summary of the model

print(arma_mod20.summary())

                              ARMA Model Results                              
==============================================================================
Dep. Variable:                      y   No. Observations:                  384
Model:                     ARMA(2, 0)   Log Likelihood                -477.909
Method:                       css-mle   S.D. of innovations              0.833
Date:                Sun, 22 Apr 2018   AIC                            963.818
Time:                        10:34:34   BIC                            979.620
Sample:                    01-01-1959   HQIC                           970.086
                         - 12-01-1990                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        332.6967      5.522     60.246      0.000     321.873     343.520
ar.L1.y        1.7039      0.036     47.523      0.000       1.634       1.774
ar.L2.y       -0.7109      0.036    -19.718      0.000      -0.782      -0.640
                                    Roots                                    
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.0266           +0.0000j            1.0266            0.0000
AR.2            1.3702           +0.0000j            1.3702            0.0000
-----------------------------------------------------------------------------

Predict and plot the predcition with confidence interval

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(18,9))
fig = arma_mod20.plot_predict(start='1959-01-01', end='1995-01-01', ax=ax)
legend = ax.legend(loc='upper left')

plt.figure(figsize=(18,8))
ts_predict = arma_mod20.predict('1959-01-01','1995-01-01')
plt.plot(time_series, label='Actual')
plt.plot(ts_predict, 'r--', label='Predicted')
plt.xlabel('Year-Month',)
plt.ylabel('Carbondioxide')
plt.legend(loc='best')

<matplotlib.legend.Legend at 0xa69381ec>

png

check the first ten and last ten prediction and compare with the actual emission

predcit = arma_mod20.predict(start='1959-01-01', end='1995-01-01')

print "\tACTUAL (1-5)"
print time_series.head()
print "\tPREDICTED (1-5)"
print predcit.head()
#predcit.tail(30)

	ACTUAL (1-5)
Month
1959-01-01    315.42
1959-02-01    316.32
1959-03-01    316.49
1959-04-01    317.56
1959-05-01    318.13
dtype: float64
	PREDICTED (1-5)
1959-01-01    332.696720
1959-02-01    315.490752
1959-03-01    317.074564
1959-04-01    316.724408
1959-05-01    318.426730
Freq: MS, dtype: float64

print "\tPREDICTED (last 30)"
print predcit.tail(30)

	PREDICTED (last 30)
1992-08-01    348.541119
1992-09-01    348.132717
1992-10-01    347.734226
1992-11-01    347.345574
1992-12-01    346.966640
1993-01-01    346.597268
1993-02-01    346.237284
1993-03-01    345.886495
1993-04-01    345.544700
1993-05-01    345.211695
1993-06-01    344.887272
1993-07-01    344.571222
1993-08-01    344.263340
1993-09-01    343.963421
1993-10-01    343.671265
1993-11-01    343.386675
1993-12-01    343.109457
1994-01-01    342.839423
1994-02-01    342.576388
1994-03-01    342.320171
1994-04-01    342.070597
1994-05-01    341.827493
1994-06-01    341.590693
1994-07-01    341.360034
1994-08-01    341.135356
1994-09-01    340.916504
1994-10-01    340.703328
1994-11-01    340.495680
1994-12-01    340.293418
1995-01-01    340.096400
Freq: MS, dtype: float64

Mustapha Omotosho

constant learner,machine learning enthusiast,huge Barcelona fan