This project focuses on predicting the balance of financial accounts based on historical transaction data using Time Series Forecasting and machine learning models. The dataset contains transaction details for multiple accounts, and the model aims to forecast future account balances using past data and time-based features.
The project is part of the Pesta Data Nasional (PEDAS) 2024 competition, specifically in the Data Scientist category. It provides insights into time series forecasting by applying machine learning algorithms on transaction data.
The dataset used for this project contains the following attributes:
- trx_code: A unique transaction code that identifies each transaction.
- trx_id: The unique transaction number.
- rek_code: A unique code representing the account.
- rek: The account number (unique identifier for each financial account).
- creationdate: The timestamp when the transaction was created.
- type: The type of transaction (e.g., Deposit, Withdrawal).
- amount: The transaction amount (deposit or withdrawal).
- balance: The account balance after the transaction.
The project uses the Train Data for model training, Test Data for evaluating model performance, and Inference Data for making predictions on missing balance values.
The data is preprocessed by:
- Feature Extraction: Extracting time-based features such as hour, day of the week, month, and year from the transaction dates.
- Lag Features: Creating lag features like `lag1`, `lag2`, and `lag3`, which represent the account balance from the previous time steps (1, 2, and 3).
- Handling Missing Values: Interpolating missing balance data using time-based interpolation, which estimates missing values based on existing time-stamped data. This interpolation method ensures that the balance data remains continuous and smooth over time without introducing abrupt changes.
- Resampling: The data is resampled to an hourly frequency to ensure consistent time intervals, with duplicate timestamps removed.
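The feature extraction, lag, and resampling steps above can be sketched with pandas roughly as follows. This is a minimal illustration on hypothetical toy data for a single account, not the project's actual pipeline: the column names follow the dataset description, while the use of `last()` as the hourly aggregate and the order of the steps are assumptions.

```python
import pandas as pd

# Hypothetical toy data for one account; real data would come from the train file.
df = pd.DataFrame(
    {
        "creationdate": pd.to_datetime(
            [
                "2024-01-01 09:15",
                "2024-01-01 09:15",  # duplicate timestamp
                "2024-01-01 11:40",
                "2024-01-01 12:05",
            ]
        ),
        "balance": [100.0, 100.0, 80.0, 130.0],
    }
)

# Remove duplicate timestamps, then resample to an hourly grid.
# Taking the last balance in each hour is an assumption; hours without
# transactions become NaN and are left for the interpolation step.
hourly = (
    df.drop_duplicates(subset="creationdate")
    .set_index("creationdate")
    .sort_index()[["balance"]]
    .resample("h")
    .last()
)

# Time-based features derived from the timestamp index.
hourly["hour"] = hourly.index.hour
hourly["dayofweek"] = hourly.index.dayofweek  # Monday = 0
hourly["month"] = hourly.index.month
hourly["year"] = hourly.index.year

# Lag features: the balance 1, 2, and 3 time steps earlier.
for k in (1, 2, 3):
    hourly[f"lag{k}"] = hourly["balance"].shift(k)

print(hourly)
```

With several accounts in one frame, the same steps would presumably be applied per account (for example with a `groupby` on `rek`), so that lags never cross account boundaries.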
To handle missing values in the balance data, time-based interpolation is used. This method estimates missing values based on the time index. The values are interpolated linearly between two existing data points, ensuring the balance follows a consistent pattern without sudden jumps. This method is suitable for time series data as it preserves the temporal relationship between observations.
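As a small illustration of how time-based interpolation differs from plain positional interpolation, the sketch below uses pandas `Series.interpolate(method="time")` on a hypothetical series with unevenly spaced timestamps. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical balances with an uneven gap: 1 hour, then 3 hours.
idx = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 04:00"])
balance = pd.Series([100.0, np.nan, 400.0], index=idx)

# method="time" weights the estimate by the elapsed time between observations.
by_time = balance.interpolate(method="time")

# method="linear" ignores the index and treats the points as equally spaced.
by_position = balance.interpolate(method="linear")

print(by_time.iloc[1])      # 175.0: one quarter of the way from 100 to 400
print(by_position.iloc[1])  # 250.0: midpoint, regardless of timestamps
```

On a regular hourly grid the two methods give the same result; the difference only shows up when the timestamps are irregular.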
The model is trained on time series data, using the hour, day of the week, month, and lag features as inputs and the account balance as the target. Several machine learning models, including Random Forest Regressor and Linear Regression, are fitted to the training data.
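A training run along these lines might look like the sketch below, which assumes scikit-learn is used for both models. The synthetic balance series, the 80/20 chronological split, and the hyperparameters are all assumptions made for illustration, not the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic hourly balance series (hypothetical) standing in for one account.
rng = np.random.default_rng(0)
index = pd.date_range("2024-01-01", periods=200, freq="h")
data = pd.DataFrame(
    {"balance": 1000 + np.cumsum(rng.normal(0, 5, size=200))}, index=index
)

# Time-based and lag features, as described in the preprocessing section.
data["hour"] = data.index.hour
data["dayofweek"] = data.index.dayofweek
data["month"] = data.index.month
for k in (1, 2, 3):
    data[f"lag{k}"] = data["balance"].shift(k)
data = data.dropna()  # the first three rows have incomplete lags

features = ["hour", "dayofweek", "month", "lag1", "lag2", "lag3"]

# Chronological split (assumed 80/20): the model never sees the future.
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(train[features], train["balance"])
    predictions[name] = model.predict(test[features])
```

Splitting by time rather than shuffling keeps later observations out of the training set, which matters here because the lag features carry information from neighbouring rows.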
- SMAPE (Symmetric Mean Absolute Percentage Error): Measures the absolute prediction error as a percentage of the average magnitude of the actual and predicted values.
- MAE (Mean Absolute Error): Evaluates the average prediction error.
- RMSE (Root Mean Squared Error): Measures the square root of the average squared differences between actual and predicted values.
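For reference, the three metrics can be computed with NumPy as in the sketch below. SMAPE has several variants in the literature; the form used here (absolute error divided by the mean of the absolute actual and predicted values, reported in percent) is one common choice and is an assumption about what the project uses. The function names are illustrative.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent; a 0/0 term is counted as zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    ratio = np.divide(
        np.abs(y_pred - y_true), denom, out=np.zeros_like(denom), where=denom != 0
    )
    return 100.0 * ratio.mean()

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the balance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_pred - y_true).mean()

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(((y_pred - y_true) ** 2).mean())

# Hypothetical actual and predicted balances.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]

print(mae(actual, predicted))    # 15.0
print(rmse(actual, predicted))   # sqrt(250), about 15.81
print(smape(actual, predicted))  # about 10.03 (percent)
```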
Once the model is trained, it can be used to predict the missing balances in the inference data. The predicted balances are inserted into the dataset, replacing the missing (`NaN`) values.
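The replacement step can be sketched as follows. To keep the example self-contained, `predict_balance` is a hypothetical stand-in for a trained model's `predict` method (it simply carries the previous balance forward), and the inference frame is toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical inference data: some balances are known, others are missing.
inference = pd.DataFrame(
    {
        "lag1": [100.0, 120.0, 90.0],
        "balance": [110.0, np.nan, np.nan],
    }
)

def predict_balance(features: pd.DataFrame) -> np.ndarray:
    # Stand-in for model.predict(features); NOT the project's model.
    return features["lag1"].to_numpy()

# Predict only for rows where the balance is missing, and write the
# predictions back into those rows, leaving known balances untouched.
missing = inference["balance"].isna()
inference.loc[missing, "balance"] = predict_balance(inference.loc[missing, ["lag1"]])

print(inference)
```

Using a boolean mask with `.loc` ensures that observed balances are never overwritten by predictions.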
- Time Series Forecasting: Predicting future balances based on historical transaction data.
- Lag Features: Using previous balance data to improve predictions.
- Model Evaluation: Using SMAPE, MAE, and RMSE to evaluate the accuracy and performance of the model.
The project predicts account balances from the historical transaction data, and the accuracy of those predictions is assessed with SMAPE, MAE, and RMSE.
- Model Improvement: Further fine-tuning and exploring advanced models like LSTM or ARIMA for better time series predictions.
- Real-Time Prediction: Implementing real-time forecasting for ongoing transaction data.
- Handling Outliers: Better detection and treatment of outliers in the transaction data to make the model more robust.
- Steve Marcello Liem
- Matthew Lefrandt
- Marvel Martawidjaja
This project is licensed under the MIT License - see the LICENSE file for details.