Predicting the Air Pressure System failure in Scania trucks
- Introduction
- Business Problem
- Mapping to ML Problem
- Understanding the Data
- Existing Approaches For The Problem
- My First Cut Approach
- EDA and Feature Selection
- Feature Engineering
- Machine Learning Modelling
- Results
- Conclusion
1. Introduction
As Industry 4.0 continues to generate media attention, many companies are struggling with the realities of AI implementation. The benefits of predictive maintenance, such as helping determine the condition of equipment and predicting when maintenance should be performed, are highly strategic. The implementation of ML-based solutions can therefore lead to major cost savings, higher predictability, and increased system availability. In this case study we will build a classifier that detects whether a Scania truck needs to be serviced, using the APS Failure at Scania Trucks Data Set.
2. Business Problem
The Air Pressure System (APS) generates pressurised air that is utilized in various functions of a truck, such as braking and gear changes. The dataset consists of data collected from the sensors of failed Scania trucks. It is crucial to the manufacturer because it allows the components that caused a failure to be isolated, which may help avoid failures during truck operation and thereby reduce maintenance costs. Our task: given the sensor data of a failed Scania truck, predict whether the failure was caused by the APS or not.
3. Mapping to ML Problem
We can make use of Machine Learning to build a classification model on top of this dataset to meet our objective of predicting if the truck needs to be serviced or not.
- It is a binary classification problem. For a given data point we need to predict if the failure occurred due to APS or not.
- The cost function for this problem is Total_cost = (Cost_1 x No_of_false_positives) + (Cost_2 x No_of_false_negatives), where Cost_1 refers to the cost of an unnecessary check by a mechanic at a workshop, and Cost_2 refers to the cost of missing a faulty truck, which may cause a breakdown. Here Cost_1 = 10 and Cost_2 = 500.
- We need to minimize the cost of misclassification, i.e., our model should have a very low number of false negatives and also a low number of false positives. Cost_1 corresponds to false positives (FP) and Cost_2 to false negatives (FN). To balance both FP and FN, we use the F1 score; a sketch of the cost computation follows.
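To make the metric concrete, here is a minimal sketch (the helper name is my own, with hypothetical toy labels) of how the challenge cost and the F1 score can be computed from predictions:

```python
# A minimal sketch of the challenge cost metric; cost_1 and cost_2
# come from the problem statement, the helper name is my own.
from sklearn.metrics import confusion_matrix, f1_score

def total_cost(y_true, y_pred, cost_1=10, cost_2=500):
    """Challenge cost: 10 per false positive, 500 per false negative."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return cost_1 * fp + cost_2 * fn

# Toy example (1 = APS failure, 0 = no APS failure):
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
print(total_cost(y_true, y_pred))  # 10*1 + 500*1 = 510
print(f1_score(y_true, y_pred))    # the metric we optimize for
```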
4. Understanding the Data
The dataset’s positive class corresponds to failures of a specific component of the APS. The negative class corresponds to trucks with failures in components not related to the APS. The dataset contains 60,000 samples, of which 59,000 belong to the negative class and the remaining 1,000 to the positive class.
Source: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks
5. Existing Approaches For The Problem
IDA 2016 Industrial Challenge: Using Machine Learning for Predicting Failures, by Costa CF et al.: The paper discusses various techniques to deal with missing values. The authors suggest mean imputation as one of the most common imputation techniques, and also discuss state-of-the-art techniques such as Expectation Maximization (EM) and Multiple Imputation.
After data preprocessing, they apply various ML models to the training data using the scikit-learn library, such as SVM, Logistic Regression, Random Forest, and K-NN. They build a baseline model by randomly classifying samples into the positive and negative classes, which serves as the comparison point for the other models.
Random Forest presented the best results, obtained using 50 estimators (trees) and a cutoff of 95%, both chosen empirically. It gave a misclassification cost of 40,570 with an accuracy of 92.6%.
6. My First Cut Approach
The first task is to perform a detailed EDA on the dataset and identify the missing values. I will then impute the missing values using an imputation technique. After imputation, I will select 20 important features using Recursive Feature Elimination, then balance the dataset using techniques like SMOTE. Next, features such as PCA components are added to the dataset. After preprocessing the data, I will try out various machine learning models and a custom classifier and observe the results. Here’s a detailed explanation of my approach.
7. EDA and Feature Selection
- After performing Exploratory Data Analysis, here are some observations:
- The training set consists of 60,000 observations of 171 variables.
- The missing values are coded as “na”.
- Some features have more than 60% of their values missing; we decided to remove those features from the dataset.
- There is no fixed pattern in the missingness of the data; it is completely random.
- The dataset is highly imbalanced.
7.1 Missing Values and Imputation
Features with more than 60% missing values were removed.
I then impute the missing values in the remaining features. The technique used is SoftImpute, a matrix-completion method based on nuclear norm regularization that iteratively computes the soft-thresholded SVD of the filled-in matrix. A sketch of this step follows.
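Here is a minimal sketch of the missing-value handling, assuming the fancyimpute package (which provides a SoftImpute solver) and the UCI training file:

```python
# Sketch of the missing-value handling; assumes the fancyimpute package.
import pandas as pd
from fancyimpute import SoftImpute

# Missing values are coded as "na" in the raw file.
df = pd.read_csv("aps_failure_training_set.csv", na_values="na")

# Drop features with more than 60% missing values, as described above.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.60].index)

# SoftImpute iteratively computes a soft-thresholded SVD of the
# filled-in matrix; it expects a numeric matrix, so exclude the label.
features = df.drop(columns=["class"])
df[features.columns] = SoftImpute().fit_transform(features.values)
```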
7.2 Feature Selection
Around 163 features are left in the dataset. Not all of them are important for modelling, so we compute the top 20 features based on feature importance using Recursive Feature Elimination (RFE); a code sketch follows the list below.
The set of new features is: ["aa_000", "ag_001", "ag_002", "ag_003", "al_000", "am_0", "aq_000", "ay_005", "ay_006", "al_000", "bj_000", "bt_000", "bu_000", "ci_000", "ck_000", "cn_000", "cn_001", "dn_000", "ee_005", "ee_007"]
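A minimal sketch of the selection step, assuming the imputed df from the previous sketch; the Random Forest base estimator and its parameters are my assumptions:

```python
# Select the top 20 features with Recursive Feature Elimination (RFE).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X = df.drop(columns=["class"])
y = (df["class"] == "pos").astype(int)  # 1 = APS failure

rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=20,
)
rfe.fit(X, y)

top_20 = X.columns[rfe.support_].tolist()
print(top_20)
```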
7.3 Correlation Matrix
- We now plot a correlation matrix and look for features that are highly correlated with the class (target) variable; a plotting sketch follows.
- Upon observation, no feature is highly correlated with the class variable.
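A sketch of this check, assuming X, y, and top_20 from the previous step; the heatmap styling is arbitrary:

```python
# Correlation matrix of the top 20 features together with the target.
import matplotlib.pyplot as plt
import seaborn as sns

corr = X[top_20].assign(target=y).corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation matrix of the top 20 features and the class")
plt.show()

# Correlation of each feature with the class variable:
print(corr["target"].drop("target").sort_values(ascending=False))
```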
7.4 Univariate Analysis on the Top 20 Feature Set
We plot the PDF, CDF, violin plot, and box plot for all the features (a plotting sketch follows this list).
Here are a few features with interesting conclusions:
- [ay_005]: Most of the values indicating no failure are close to zero; the higher values contain both classes.
- [bj_000]: Higher values of the feature indicate failure; most of the feature values indicate no failure.
- [cn_000]: Most of the feature values indicate APS failure.
- [ay_006]: The higher values of this feature contain both classes.
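A plotting sketch for one feature (ay_005 used as an example), assuming X and y from the steps above; the same code can be looped over all 20 features:

```python
# Univariate plots (PDF, CDF, box plot) for a single feature per class.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

feature = "ay_005"
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# PDF per class.
sns.kdeplot(x=X[feature], hue=y, ax=axes[0])
axes[0].set_title(f"PDF of {feature}")

# Empirical CDF per class.
for label in (0, 1):
    vals = np.sort(X.loc[y == label, feature])
    axes[1].plot(vals, np.arange(1, len(vals) + 1) / len(vals), label=str(label))
axes[1].legend(title="class")
axes[1].set_title(f"CDF of {feature}")

# Box plot per class.
sns.boxplot(x=y, y=X[feature], ax=axes[2])
axes[2].set_title(f"Box plot of {feature}")

plt.tight_layout()
plt.show()
```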
7.5 Bivariate Analysis on Features
cn_000 is the feature in which the majority of values correspond to APS failure, so we plot scatter plots of it against the other features.
7.6 Observations from EDA
- Most of the features have a large number of values corresponding to no APS failure, except for feature cn_000.
- For most features, the values corresponding to no failure are very small, often close to 0.
- The scatter plots do not reveal much information, because the points with no APS failure are close to zero and the dataset is highly imbalanced.
8. Feature Engineering
1.) PCA: We transform the dataset into 4 PCA components and use them as new features.
2.) SMOTE: We split the dataset, including the PCA features, into training and test data, then oversample the training data using SMOTE.
3.) We then standardize the dataset using StandardScaler (a sketch of these steps follows).
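A sketch of these three steps in the order described above; the split ratio, SMOTE settings, and random seeds are my assumptions:

```python
# Feature engineering: PCA features, train/test split + SMOTE, scaling.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1) Append 4 PCA components as new features.
pca = PCA(n_components=4)
X_aug = np.hstack([X[top_20].values, pca.fit_transform(X[top_20])])

# 2) Split, then oversample only the training split with SMOTE.
X_train, X_test, y_train, y_test = train_test_split(
    X_aug, y, test_size=0.2, stratify=y, random_state=42)
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 3) Standardize, fitting the scaler on the training data only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```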
9. Machine Learning Modelling
After completing data preprocessing and feature engineering, let’s move on to modelling. We will pass the training data through various models such as Logistic Regression, Decision Trees, Random Forest, XGBoost, AdaBoost, and a custom classifier, perform hyperparameter tuning, calculate the F1 score of each, and plot its confusion matrix.
The custom classifier gives us the best results. Below are the steps used to construct it (a code sketch follows the list).
- Split the train set into D1 and D2 (50-50).
- From D1, perform sampling with replacement to create d1, d2, d3, ..., dk (k samples).
- Create k models and train each of them on one of these k samples.
- Pass D2 through each of the k models, which gives us k predictions for D2, one from each model.
- Using these k predictions, create a new dataset; since we already know the corresponding target values for D2, we can train a meta model with these k predictions as features.
- For model evaluation, pass the test set through each of the base models to get k predictions, create a new dataset from them, and pass it to the previously trained meta model to get the final prediction.
- Using this final prediction and the targets of the test set, we calculate the model’s performance score.
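A minimal sketch of this stacking procedure; the base model (decision trees), meta model (logistic regression), and k = 10 are my assumptions:

```python
# Custom stacking classifier: k bootstrapped base models + a meta model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_stacking(X_tr, y_tr, k=10, seed=42):
    rng = np.random.RandomState(seed)
    # Split the train set into D1 and D2 (50-50).
    X_d1, X_d2, y_d1, y_d2 = train_test_split(
        X_tr, y_tr, test_size=0.5, stratify=y_tr, random_state=seed)
    y_d1 = np.asarray(y_d1)

    # Train k base models, each on a bootstrap sample of D1.
    base_models = []
    for _ in range(k):
        idx = rng.randint(0, len(X_d1), size=len(X_d1))
        base_models.append(
            DecisionTreeClassifier(max_depth=5).fit(X_d1[idx], y_d1[idx]))

    # The k predictions on D2 become the features of the meta model.
    meta_X = np.column_stack([m.predict(X_d2) for m in base_models])
    meta_model = LogisticRegression().fit(meta_X, y_d2)
    return base_models, meta_model

def predict_stacking(base_models, meta_model, X_new):
    meta_X = np.column_stack([m.predict(X_new) for m in base_models])
    return meta_model.predict(meta_X)

# Usage on the engineered train/test split from the previous sketch:
base_models, meta_model = fit_stacking(X_train, y_train)
y_pred = predict_stacking(base_models, meta_model, X_test)
```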
10. Results for the Custom Classifier
11. Conclusion
This dataset posed several problems, such as class imbalance and missing values. The choice of an appropriate performance metric was a very important factor, especially for such a highly imbalanced dataset. In the end, we observed that the tree-based models gave a decent F1 score, with our custom classifier outperforming all of them and achieving the lowest misclassification cost of 10,270.
The case study is deployed locally using Flask (a minimal sketch follows).
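A minimal sketch of such a Flask endpoint, assuming the fitted scaler and the stacking models above have been pickled to a hypothetical model.pkl:

```python
# Minimal Flask app serving predictions from the pickled models.
import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # hypothetical artifact
    scaler, base_models, meta_model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [ ... raw feature values ... ]}.
    x = scaler.transform(np.array(request.json["features"]).reshape(1, -1))
    meta_x = np.column_stack([m.predict(x) for m in base_models])
    return jsonify({"aps_failure": int(meta_model.predict(meta_x)[0])})

if __name__ == "__main__":
    app.run(debug=True)
```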
You can find the full code on : https://github.com/niccosnayak/Scania-APS-Failure/blob/main/ScaniaAPSFailure.ipynb