An Interest In:
Web News this Week
- April 27, 2024
- April 26, 2024
- April 25, 2024
- April 24, 2024
- April 23, 2024
- April 22, 2024
- April 21, 2024
Predicting the price of Heating Oil using PyCaret
Predicting the price of Heating Oil using PyCaret
In this notebook, we'll go over how to perform a time series forecasting on the price of heating oil.
Background
In update New York, we are unable to get natural gas to service our home for heating. This is because the rock is mostly made of shale, which makes it tough to get pull natural gas. So we have to rely on oil for heat.
As the price of gas goes up and down, the price of heating oil is the same. There are many companies that distribute oil and all of them have different prices. Also, these prices change between seasons. So, I wanted to build an application that predicts the price of oil, so that I know when is the best time to buy.
Data
Before building the models, i needed to get data. I have been unable to find a dataset with the current price of oil, therefore I had to build my own. The website cheapestoil.com shows the price of oil for many companies in the northeast United States. The site shows the latest prices for these companies, but they do not show the previous prices.
So in order to get the prices, I build a AWS lambda function that scrapes the price of oil daily. I used AWS CloudWatch events to run a lambda function every 12 hours, in order to fetch the prices for that time. This lambda extracted the last updated date, price and supplier, and save these results as JSON and save to an S3 bucket. After the json data is saved, I have another lambda function, attached as a trigger, to read each json file, and save into DynamoDB. I'll be writing another post on how I gathered the data soon.
I started this project back in Dec 2020 in order to build up my dataset. The lambda function has been running for about 6 months, and I have a decent amount of data to work with. In order to expand my dataset, I was able to pull more data using The Wayback Machine on web.archive.org. The Wayback Machine stores a snapshot of many pages on the internet. It doesn't have every site, but it did have some snapshots from cheaptestoil.com. To get that data, I used https://github.com/hartator/wayback-machine-downloader to download the archive data. The archive only had 7 snapshots, between the dates of Aug 2020 and Oct 2021.
In all, I have about 5k records of all the oil prices from the website
Fetching Saved Data
I used DynamoDB to store the oil price data, and used Boto3 to fetch the data, which I then save to a CSV.
import boto3from boto3.dynamodb.conditions import Keyimport pandas as pddynamodb = boto3.resource('dynamodb')table = dynamodb.Table('heating_oil_prices')response = table.scan()df = pd.DataFrame(response["Items"])df.to_csv("data.csv")
def get_data(): df = pd.read_csv("data.csv", usecols=["last_updated", "price150","price500", "price300", "supplier", "state"]) df["last_updated"] = pd.to_datetime(df["last_updated"]) df = df.set_index("last_updated") df["state"] = df["state"].apply(lambda x: "NewYork" if str(x) == "nan" else x) df = df.sort_index() return dfdf = get_data()
On Cheapestoil.com, the have the price of oil in gallons, but the price is slightly different for how many gallons you buy. If we get suppliers in the state of New York, we'll see the following
state = "NewYork" suppliers_by_state = df[ (df["state"] == state)].dropna() suppliers_by_state.iloc[1]
price500 1.449price300 1.469price150 1.549supplier Suffolk Oilstate NewYorkName: 2020-08-03 15:11:16, dtype: object
The row above shows that the price for 500 gallons(price500
) is $1.449 per gallon, 300 gallons(price300
) is $1.69 and 150 gallons(price150
) is $1.549
Lets see how many suppliers we have for New York
suppliers_by_state["supplier"].value_counts().sum()
2708
Since we have so many suppliers, a forecasr for the average price of oil for all the suppliers might be a better way to go, since the prices are simlar between every company
df = df.reset_index()data = df[ df["state"] == state].resample('d', on='last_updated').mean().dropna()data = data.reset_index()
Here is the mean price of oil over all the suppliers in New York.
data.head()
.dataframe tbody tr th:only-of-type { vertical-align: middle;}.dataframe tbody tr th { vertical-align: top;}.dataframe thead th { text-align: right;}
last_updated | price500 | price300 | price150 | |
---|---|---|---|---|
0 | 2020-08-03 | 1.672000 | 1.719400 | 1.76140 |
1 | 2020-08-04 | 1.632429 | 1.625750 | 1.68075 |
2 | 2020-10-29 | 1.852500 | 1.819455 | 1.85300 |
3 | 2020-11-11 | 1.786250 | 1.759000 | 1.76900 |
4 | 2020-11-12 | 1.420000 | 1.460000 | 1.54000 |
For a quick check on our data, let's plot the price for 500 gallons
data["price500"].plot()
<AxesSubplot:>
Model
Now, we can start building our model. We'll be using PyCaret to build our time series forecast.
Before modeling, we need to update the dataset to remove the date and replace with numeric values. To do this, I've included fastai's add_datepart
function to convert the data is series of features, split by year, month, day, day of week, and much more
from fastai.tabular.all import *add_datepart(data, field_name="last_updated")
.dataframe tbody tr th:only-of-type { vertical-align: middle;}.dataframe tbody tr th { vertical-align: top;}.dataframe thead th { text-align: right;}
price500 | price300 | price150 | last_updatedYear | last_updatedMonth | last_updatedWeek | last_updatedDay | last_updatedDayofweek | last_updatedDayofyear | last_updatedIs_month_end | last_updatedIs_month_start | last_updatedIs_quarter_end | last_updatedIs_quarter_start | last_updatedIs_year_end | last_updatedIs_year_start | last_updatedElapsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.672000 | 1.719400 | 1.761400 | 2020 | 8 | 32 | 3 | 0 | 216 | False | False | False | False | False | False | 1.596413e+09 |
1 | 1.632429 | 1.625750 | 1.680750 | 2020 | 8 | 32 | 4 | 1 | 217 | False | False | False | False | False | False | 1.596499e+09 |
2 | 1.852500 | 1.819455 | 1.853000 | 2020 | 10 | 44 | 29 | 3 | 303 | False | False | False | False | False | False | 1.603930e+09 |
3 | 1.786250 | 1.759000 | 1.769000 | 2020 | 11 | 46 | 11 | 2 | 316 | False | False | False | False | False | False | 1.605053e+09 |
4 | 1.420000 | 1.460000 | 1.540000 | 2020 | 11 | 46 | 12 | 3 | 317 | False | False | False | False | False | False | 1.605139e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56 | 2.220000 | 2.260000 | 2.340000 | 2021 | 6 | 26 | 28 | 0 | 179 | False | False | False | False | False | False | 1.624838e+09 |
57 | 2.632621 | 2.626935 | 2.646290 | 2021 | 7 | 26 | 2 | 4 | 183 | False | False | False | False | False | False | 1.625184e+09 |
58 | 2.606429 | 2.582633 | 2.611437 | 2021 | 7 | 27 | 5 | 0 | 186 | False | False | False | False | False | False | 1.625443e+09 |
59 | 2.619422 | 2.609042 | 2.632478 | 2021 | 7 | 27 | 6 | 1 | 187 | False | False | False | False | False | False | 1.625530e+09 |
60 | 2.571294 | 2.545455 | 2.575238 | 2021 | 7 | 27 | 7 | 2 | 188 | False | False | False | False | False | False | 1.625616e+09 |
61 rows 16 columns
The add_datepart
function generates lots of feature for the date, but we don't need to use all of them. For this model, we'll use last_updatedYear
, last_updatedMonth
and last_updatedDay
. In future models, we can try to use other features.
data = data[['price500', 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
mask = round(len(data) * 0.7)train = data[:mask]test = data[mask:]train.shape, test.shape
((43, 4), (18, 4))
After we split the data into a train, test set, we then must initalize the regression setup in PyCaret. This includes suppling the dataset, the feature we are predicting on (price500
) and the other features to use(last_updatedYear
,last_updatedMonth
and last_updatedDay
)
from pycaret.regression import *s = setup(data = train, test_data = test, target = 'price500', fold_strategy = 'timeseries', numeric_features = ['last_updatedYear','last_updatedMonth', 'last_updatedDay'], fold = 3, transform_target = True, session_id = 42)
Next we call the compare_models
function to find the best model using the Mean Absolute Error(MAE), which is the mean of the absolute difference between the models prediction and expected values.
best = compare_models(sort = 'MAE')
From the results above, it looks like the Passive Aggressive Regressor has the lowest MAE error(0.0605), so we'll use that model on the test set
preds = predict_model(best)
Next, we'll use this model and generate a forecast for the next 7 days worth of prices
forecast_df = pd.DataFrame()forecast_df["last_updated"] = pd.date_range(start='2021-07-08', periods=8)add_datepart(forecast_df, 'last_updated')forecast_df = forecast_df[['last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]forecast_df
predictions = predict_model(best, data=forecast_df)predictions
Once we make our predictions, we'll then merge the predictions to the orginal data and plot the last 15 days in the dataframe
def pad_value(day): value = str(int(day)) print() if len(value) ==1: return f'0{value}' return valueresults_df = pd.concat([data,predictions], axis=0)dates = []for idx, x in results_df.iterrows(): date_str = f'{pad_value(x["last_updatedYear"])}-{pad_value(x["last_updatedMonth"])}-{pad_value(x["last_updatedDay"])}' dates.append(pd.to_datetime(date_str, format='%Y-%m-%d'))results_df["date"] = datesresults_df.drop(["last_updatedYear", "last_updatedMonth", "last_updatedDay"], axis=1,inplace=True)results_df = results_df.set_index('date')
results_df[-15:].plot()
The blue line above shows the actual prices and the orange are the predicitons, which do not look the best. If we zoom in to the predictions. we'll see a increasing in price per day
results_df[-8:].plot()
Saving The Model
After we trained our model, we can now save and use for forecasting on other data
save_model(best, "model")
Transformation Pipeline and Model Succesfully Saved(Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=[], id_columns=[], ml_usecase='regression', numerical_features=['last_updatedYear', 'last_updatedMonth', 'last_updatedDay'], target='price500', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_ca... regressor=PassiveAggressiveRegressor(C=1.0, average=False, early_stopping=False, epsilon=0.1, fit_intercept=True, loss='epsilon_insensitive', max_iter=1000, n_iter_no_change=5, random_state=42, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False), shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)]], verbose=False), 'model.pkl')
Training Model for all Prices
After we have a intial model, we can now train 3 seperate models for each price(price150
, price300
and price500
)
We'll refactor the model training code into a function that trains each price, gets the model with the best score, and saves the model to a seperate file
def train(data, price, state="NewYork"): data = data.reset_index() data = data[ data["state"] == state].resample('d', on='last_updated').mean().dropna() data = data.reset_index() add_datepart(data, field_name="last_updated") data = data[[price, 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']] mask = round(len(data) * 0.7) train = data[:mask] test = data[mask:] print("train",train.shape) print("test",test.shape) s = setup(data = train, test_data = test, target = price, fold_strategy = 'timeseries', numeric_features = ['last_updatedYear','last_updatedMonth', 'last_updatedDay'], fold=2, transform_target = True, session_id = 42) best = compare_models(sort = 'MAE') save_model(best, f'{price}_model')
for feature in ["price150", "price300", "price500"]: df = get_data() train(df, feature)
Transformation Pipeline and Model Succesfully Saved
Conclusion and What's Next
In this post, we've seen how to build a time series model for forecasting the price of heating oil. In the next post, we'll go over how to deploy these models into a StreamLit application.
We'll also go over how the process on how the data was collected.
Original Link: https://dev.to/tbass134/predicting-the-price-of-heating-oil-using-pycaret-prices-5cjc
Dev To
An online community for sharing and discovering great ideas, having debates, and making friendsMore About this Source Visit Dev To