July 7, 2021 05:55 pm GMT

Predicting the price of Heating Oil using PyCaret

In this notebook, we'll go over how to perform time series forecasting on the price of heating oil.

Background

In upstate New York, we are unable to get natural gas service to our home for heating. This is because the bedrock is mostly shale, which makes it tough to extract natural gas. So we have to rely on oil for heat.

As the price of gas rises and falls, the price of heating oil moves with it. There are many companies that distribute oil, and all of them have different prices. These prices also change between seasons. So I wanted to build an application that predicts the price of oil, so that I know when the best time to buy is.

Data

Before building the models, I needed data. I was unable to find an existing dataset with current oil prices, so I had to build my own. The website cheapestoil.com shows the price of oil for many companies in the northeast United States. The site shows the latest prices for these companies, but not their previous prices.

So in order to get the prices, I built an AWS Lambda function that scrapes the price of oil daily. I used an AWS CloudWatch Events rule to run the Lambda function every 12 hours and fetch the prices at that time. This Lambda extracts the last-updated date, price, and supplier, saves the results as JSON, and writes them to an S3 bucket. Once the JSON data is saved, another Lambda function, attached as an S3 trigger, reads each JSON file and saves it into DynamoDB. I'll be writing another post soon on how I gathered the data.
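To make the pipeline concrete, here is a minimal sketch of what the scraper Lambda could look like. This is only an illustration under assumptions, not the actual code: the page URL, table layout, column order, and bucket name are all hypothetical, and the real details will be covered in the follow-up post.

import json
from datetime import datetime, timezone

import boto3
import requests
from bs4 import BeautifulSoup

s3 = boto3.client("s3")

def handler(event, context):
    # Fetch the latest prices page (URL and table layout are hypothetical)
    html = requests.get("https://www.cheapestoil.com/heating-oil-prices/NY").text
    soup = BeautifulSoup(html, "html.parser")

    records = []
    for row in soup.select("table tr")[1:]:
        cells = [c.get_text(strip=True) for c in row.select("td")]
        records.append({
            "supplier": cells[0],
            "last_updated": cells[1],
            "price150": cells[2],
            "price300": cells[3],
            "price500": cells[4],
        })

    # Write the snapshot to S3; the S3-triggered Lambda then loads it into DynamoDB
    key = f"prices/{datetime.now(timezone.utc).isoformat()}.json"
    s3.put_object(Bucket="heating-oil-prices", Key=key, Body=json.dumps(records))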

I started this project back in Dec 2020 in order to build up my dataset. The Lambda function has been running for about 6 months, so I have a decent amount of data to work with. To expand my dataset, I was able to pull more data using the Wayback Machine on web.archive.org. The Wayback Machine stores snapshots of many pages on the internet. It doesn't have every site, but it did have some snapshots of cheapestoil.com. To get that data, I used https://github.com/hartator/wayback-machine-downloader to download the archived pages. The archive only had 7 snapshots, between the dates of Aug 2020 and Nov 2020.

In all, I have about 5k records of oil prices from the website.

Fetching Saved Data

I used DynamoDB to store the oil price data and Boto3 to fetch it, saving the results to a CSV.

import boto3
from boto3.dynamodb.conditions import Key
import pandas as pd

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('heating_oil_prices')
response = table.scan()
df = pd.DataFrame(response["Items"])
df.to_csv("data.csv")
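One caveat worth noting: a single DynamoDB scan call returns at most 1 MB of data, so with several thousand records it may be necessary to paginate. A small sketch of that loop:

items = []
response = table.scan()
items.extend(response["Items"])

# DynamoDB returns a LastEvaluatedKey whenever there are more pages to fetch
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

df = pd.DataFrame(items)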
def get_data():
    df = pd.read_csv("data.csv", usecols=["last_updated", "price150", "price500", "price300", "supplier", "state"])
    df["last_updated"] = pd.to_datetime(df["last_updated"])
    df = df.set_index("last_updated")
    # Missing state values default to New York
    df["state"] = df["state"].apply(lambda x: "NewYork" if str(x) == "nan" else x)
    df = df.sort_index()
    return df

df = get_data()

On cheapestoil.com, they list the price of oil per gallon, but the price differs slightly depending on how many gallons you buy. If we look at suppliers in the state of New York, we'll see the following:

    state = "NewYork"    suppliers_by_state = df[ (df["state"] == state)].dropna()    suppliers_by_state.iloc[1]
price500          1.449
price300          1.469
price150          1.549
supplier    Suffolk Oil
state           NewYork
Name: 2020-08-03 15:11:16, dtype: object

The row above shows that the price for 500 gallons (price500) is $1.449 per gallon, for 300 gallons (price300) is $1.469, and for 150 gallons (price150) is $1.549: the more you buy, the cheaper the per-gallon price.

Let's see how many price records we have from suppliers in New York.

suppliers_by_state["supplier"].value_counts().sum()
2708
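Note that value_counts().sum() counts total price records rather than distinct companies. If we wanted the number of unique suppliers instead, pandas' nunique would give that:

# Number of distinct suppliers, as opposed to total price records
suppliers_by_state["supplier"].nunique()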

Since we have so many suppliers, forecasting the average price of oil across all the suppliers might be a better way to go, since the prices are similar between companies.

df = df.reset_index()
data = df[df["state"] == state].resample('d', on='last_updated').mean().dropna()
data = data.reset_index()

Here is the mean daily price of oil across all the suppliers in New York.

data.head()
last_updated price500 price300 price150
0 2020-08-03 1.672000 1.719400 1.76140
1 2020-08-04 1.632429 1.625750 1.68075
2 2020-10-29 1.852500 1.819455 1.85300
3 2020-11-11 1.786250 1.759000 1.76900
4 2020-11-12 1.420000 1.460000 1.54000

For a quick check on our data, let's plot the price for 500 gallons

data["price500"].plot()
<AxesSubplot:>

[Plot of the mean price500 over time]

Model

Now, we can start building our model. We'll be using PyCaret to build our time series forecast.
Before modeling, we need to update the dataset to remove the date column and replace it with numeric values. To do this, I've used fastai's add_datepart function to expand the date into a series of features: year, month, day, day of week, and more.

from fastai.tabular.all import *

add_datepart(data, field_name="last_updated")
price500 price300 price150 last_updatedYear last_updatedMonth last_updatedWeek last_updatedDay last_updatedDayofweek last_updatedDayofyear last_updatedIs_month_end last_updatedIs_month_start last_updatedIs_quarter_end last_updatedIs_quarter_start last_updatedIs_year_end last_updatedIs_year_start last_updatedElapsed
0 1.672000 1.719400 1.761400 2020 8 32 3 0 216 False False False False False False 1.596413e+09
1 1.632429 1.625750 1.680750 2020 8 32 4 1 217 False False False False False False 1.596499e+09
2 1.852500 1.819455 1.853000 2020 10 44 29 3 303 False False False False False False 1.603930e+09
3 1.786250 1.759000 1.769000 2020 11 46 11 2 316 False False False False False False 1.605053e+09
4 1.420000 1.460000 1.540000 2020 11 46 12 3 317 False False False False False False 1.605139e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 2.220000 2.260000 2.340000 2021 6 26 28 0 179 False False False False False False 1.624838e+09
57 2.632621 2.626935 2.646290 2021 7 26 2 4 183 False False False False False False 1.625184e+09
58 2.606429 2.582633 2.611437 2021 7 27 5 0 186 False False False False False False 1.625443e+09
59 2.619422 2.609042 2.632478 2021 7 27 6 1 187 False False False False False False 1.625530e+09
60 2.571294 2.545455 2.575238 2021 7 27 7 2 188 False False False False False False 1.625616e+09

61 rows × 16 columns

The add_datepart function generates lots of features from the date, but we don't need to use all of them. For this model, we'll use last_updatedYear, last_updatedMonth, and last_updatedDay. In future models, we can try the other features.

data = data[['price500', 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
mask = round(len(data) * 0.7)
train = data[:mask]
test = data[mask:]
train.shape, test.shape
((43, 4), (18, 4))

After we split the data into train and test sets (chronologically rather than randomly, so no future prices leak into training), we must initialize the regression setup in PyCaret. This includes supplying the dataset, the target feature we are predicting (price500), and the other features to use (last_updatedYear, last_updatedMonth, and last_updatedDay).

from pycaret.regression import *

s = setup(data = train, test_data = test, target = 'price500',
          fold_strategy = 'timeseries',
          numeric_features = ['last_updatedYear', 'last_updatedMonth', 'last_updatedDay'],
          fold = 3, transform_target = True, session_id = 42)

Next we call the compare_models function to find the best model by Mean Absolute Error (MAE), which is the mean of the absolute differences between the model's predictions and the actual values.
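For intuition, MAE can be computed by hand in a couple of lines. This is a standalone illustration with made-up numbers, not part of the PyCaret workflow:

import numpy as np

def mae(y_true, y_pred):
    # Mean of the absolute differences between predictions and actuals
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

mae([1.50, 1.60], [1.55, 1.58])  # 0.035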

best = compare_models(sort = 'MAE')

From the results above, it looks like the Passive Aggressive Regressor has the lowest MAE (0.0605), so we'll use that model on the test set.

preds = predict_model(best)

Next, we'll use this model to generate a forecast for the next 8 days of prices.

forecast_df = pd.DataFrame()
forecast_df["last_updated"] = pd.date_range(start='2021-07-08', periods=8)
add_datepart(forecast_df, 'last_updated')
forecast_df = forecast_df[['last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
forecast_df
predictions = predict_model(best, data=forecast_df)
predictions

Once we make our predictions, we'll merge them with the original data and plot the last 15 days in the dataframe.

def pad_value(day):
    # Zero-pad single-digit months and days so the date string parses as %Y-%m-%d
    value = str(int(day))
    if len(value) == 1:
        return f'0{value}'
    return value

results_df = pd.concat([data, predictions], axis=0)
dates = []
for idx, x in results_df.iterrows():
    date_str = f'{pad_value(x["last_updatedYear"])}-{pad_value(x["last_updatedMonth"])}-{pad_value(x["last_updatedDay"])}'
    dates.append(pd.to_datetime(date_str, format='%Y-%m-%d'))
results_df["date"] = dates
results_df.drop(["last_updatedYear", "last_updatedMonth", "last_updatedDay"], axis=1, inplace=True)
results_df = results_df.set_index('date')
results_df[-15:].plot()

The blue line above shows the actual prices and the orange shows the predictions, which do not look the best. If we zoom in on the predictions, we'll see the price increasing each day.

results_df[-8:].plot()

Saving The Model

After training our model, we can save it and use it for forecasting on other data.

save_model(best, "model")
Transformation Pipeline and Model Succesfully Saved
(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=['last_updatedYear',
                                                           'last_updatedMonth',
                                                           'last_updatedDay'],
                                       target='price500', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_ca...
                  regressor=PassiveAggressiveRegressor(C=1.0, average=False,
                                                       early_stopping=False,
                                                       epsilon=0.1,
                                                       fit_intercept=True,
                                                       loss='epsilon_insensitive',
                                                       max_iter=1000,
                                                       n_iter_no_change=5,
                                                       random_state=42,
                                                       shuffle=True, tol=0.001,
                                                       validation_fraction=0.1,
                                                       verbose=0,
                                                       warm_start=False),
                  shuffle=True, tol=0.001, validation_fraction=0.1,
                  verbose=0, warm_start=False)]],
          verbose=False), 'model.pkl')
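To reuse the saved pipeline later, it can be restored with PyCaret's load_model. For example, using the forecast_df from earlier as the input data:

from pycaret.regression import load_model, predict_model

model = load_model("model")  # reads model.pkl from the working directory
predictions = predict_model(model, data=forecast_df)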

Training Model for all Prices

After we have an initial model, we can now train 3 separate models, one for each price (price150, price300, and price500).
We'll refactor the model training code into a function that trains on a given price, picks the model with the best score, and saves it to a separate file.

def train(data, price, state="NewYork"):
    data = data.reset_index()
    data = data[data["state"] == state].resample('d', on='last_updated').mean().dropna()
    data = data.reset_index()
    add_datepart(data, field_name="last_updated")
    data = data[[price, 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
    mask = round(len(data) * 0.7)
    train = data[:mask]
    test = data[mask:]
    print("train", train.shape)
    print("test", test.shape)
    s = setup(data = train, test_data = test, target = price,
              fold_strategy = 'timeseries',
              numeric_features = ['last_updatedYear', 'last_updatedMonth', 'last_updatedDay'],
              fold = 2, transform_target = True, session_id = 42)
    best = compare_models(sort = 'MAE')
    save_model(best, f'{price}_model')
for feature in ["price150", "price300", "price500"]:
    df = get_data()
    train(df, feature)
Transformation Pipeline and Model Succesfully Saved

Conclusion and What's Next

In this post, we've seen how to build a time series model for forecasting the price of heating oil. In the next post, we'll go over how to deploy these models in a Streamlit application.
We'll also go over the process of how the data was collected.


Original Link: https://dev.to/tbass134/predicting-the-price-of-heating-oil-using-pycaret-prices-5cjc
