Looking into the literature, several machine-learning techniques have been used to predict accident severity; however, the "no free lunch" theorem states that no single model is best for all situations, and therefore exploring different comparative models could help obtain the best outcome. While machine-learning models have been applied to accident severity prediction globally, their application in Saudi Arabia, particularly with robust feature selection methods to handle high-dimensional traffic accident data, remains limited. This study aims to develop several machine-learning models for predicting the severity of road traffic accidents in the Kingdom of Saudi Arabia and to identify an optimal model for the case study area. The study's primary motivation is to employ three different dominant input selection algorithms to select the input parameters responsible for accident severity in the study area. Dominant input selection governs how effectively these machine-learning models increase the accuracy of accident severity prediction, and it will assist in making decisions on the effectiveness of the single models in improving the prediction. To the authors' knowledge, this study is the first to utilize these three feature selection algorithms (Kruskal-Wallis, Chi-square, and maximum relevance minimum redundancy) to select relevant input parameters prior to machine-learning modeling for predicting accident severity.
The study's methodology involves two main stages and is presented in Fig. 1. The first stage is selecting the dominant parameters and ranking their relevance in the model. The second stage is developing five machine learning-based models (BRT, ANN, SVM, NVB, and logistic regression (LGR)) for predicting traffic accident severity. All the stages are explained in the following sub-sections.
Traffic accident data from the years 2018 to 2022, covering 14 cities in the Eastern Province of Saudi Arabia, were used for this study. A total of 9,548 accident cases involving 17,100 vehicles resulted in 2,527 fatalities and 8,020 injuries during this period. These data were obtained from the General Directorate of Traffic (Ministry of Interior, Saudi Arabia), based on official accident records from police departments in 14 cities in the Eastern Province. The accidents were categorized into 19 distinct types, and the corresponding figures for cases, injuries, vehicles involved, and fatalities are presented in Table 1.
Grouping the accidents into these distinct types aids a comprehensive and clear understanding of the problem, so that prevention measures can be applied more easily. The moving vehicle accident was the most recorded accident type (3,665 cases), accounting for 3,272 injuries and 871 fatalities. The second and third most common accident types in the study area are vehicle overturns and run-over accidents, which killed 640 and 503 people, respectively. The number of people injured in vehicle overturns and run-over accidents was 1,809 and 1,794, respectively. Waste container accidents, bridge falls, and hit-traffic-light accidents were the least recorded accident types in the study area. The waste container and hit-traffic-light accidents recorded zero fatalities, while bridge fall accidents killed two people.
Table 2 provides the summary of 39 causative factors in the study area. Swerve, distracted driving, pedestrians crossing outside crossing areas, and spacing violations were the four major causes of accidents on the selected roads between 2018 and 2022. Swerve driving accounts for 28.1% of the accidents, while distracted driving, crossing outside specified areas, and spacing violations account for 18.8%, 10.2%, and 9.5% of the accidents, respectively. Swerve driving killed more people (765) than any other cause, followed by distracted driving, which claimed 453 lives, and the failure of pedestrians to cross at specified locations, which killed 255 people. The causative factors related to pedestrian behavior with zero fatalities were "violation of traffic light by pedestrians" and "playing in the street." Downhill accidents and short headway also resulted in zero fatal accidents. Steering failure, overload, and electrical faults resulted in 1 fatality each. It is worth noting that all the causative factors with the fewest fatality cases also have a lower frequency.
As shown in Table 3, 14 cities in the Kingdom of Saudi Arabia were selected for conducting the study. The results show that 28% (2,664) of the recorded accident cases during these five years occurred in Al-Ahsa, 18% (1,758) in Dammam, 13% (1,196) in Hafar Al-Batin, and 10% (917) in Qatif, while the remaining 32% took place in the remaining ten cities. The fatalities in these cities are also very high: 718, 339, 302, 238, and 163 fatalities were recorded in Al Ahsa, Hafar Al-Batin, Dammam, Al Jubail, and Dhahran, respectively. Looking at the ratio of fatalities to accidents, it can be seen that the city of Al Jubail has the highest fatality/accident ratio of 0.56, which translates to at least 1 person killed for every 2 accident cases in the recorded data. The cities of Kafji and Buqayp also have high ratios, translating to almost one fatality for every 2 accident cases. The cities of Aladid, Ras Tanura, and Dhahran recorded a fatality in every 3 accident cases. In the cities of Hafar Al-Batin, King Fahad Causeway, Al Ahsa, Qaryat al-Ulya, and Al Nairyah, a fatality is recorded in every 4 accident cases, while a fatality is recorded in every 5 accident cases reported in the cities of Al Khobar and Dammam. The lowest fatality/accident ratio (0.16) was reported in Qatif.
The data summary based on the accident year is given in Table 4. There were more accidents, fatalities, and injuries in 2019 than in the remaining years. There was a sharp decline in the number of accidents, injuries, and fatalities in 2020, which may be attributed to restricted movement during the coronavirus (COVID-19) pandemic. As some restrictions were relaxed in 2022, the number of accidents increased significantly compared to 2020 (full restrictions) and 2021 (partial restrictions). Even though the number of accidents increased from 2020 to 2022, the severity ratio (number of fatalities/number of cases) continued to decrease slightly from 2018 (0.27) to 2022 (0.24).
The selection of dominant input parameters is crucial in any machine learning modeling. This helps develop a model capable of reaching the global minima with the optimum resources of time and money. Including non-relevant parameters in machine learning models increases the complexity of the model, thus decreasing its efficiency. It is important to apply at least two different selection techniques to understand the relevant parameters better, because no single method gives the best selection for all scenarios. In this study, three different algorithms were used for ranking the relevance and importance of all the potential input parameters (No. of vehicles involved in an accident, accident type, cause of the accident, No. of injured people, No. of people involved in the accident, accident city, and location of the accident). The algorithms used are maximum relevance minimum redundancy (MRMR), Chi-square, and Kruskal-Wallis. MRMR is one of the most powerful filter algorithms, developed by Peng et al. (2005). Its major advantage over other techniques lies in its ability to select only one relevant feature when two or more relevant features carry the same information, leading to fast computational speed and accurate prediction (Ibrahim Bibi Farouk et al. 2022). Kruskal-Wallis, on the other hand, is a non-parametric test that removes parameters whose p-value is above a set threshold, effectively discarding non-relevant parameters and hence improving the performance of the models. The ranking of the parameters using MRMR, Chi-square, and Kruskal-Wallis is given in Figs. 2, 3, and 4, respectively. No. of vehicles involved, city, and No. of injured people are the three most critical parameters for predicting accident severity under the MRMR algorithm, while accident cause was the least essential parameter. The Chi-square algorithm ranked No. of injured people, No. of vehicles involved, and city as the most critical parameters, in that order of importance. Likewise, the number of people involved in an accident, No. of injured people, and whether the accident occurred in or out of the city are the most critical parameters under the Kruskal-Wallis algorithm, as shown in Fig. 4. Since the three algorithms did not rank the seven parameters in the same order, all seven parameters were included to avoid ignoring any critical parameter. The correlation matrix between the parameters is presented in Table 5. Knowing the correlation between the parameters helps eliminate collinearity issues in the model by removing one of any two parameters with a very high correlation from a single model. The correlation matrix shows that no two input parameters have a strong relationship; therefore, including all seven parameters will not cause any collinearity issues in the model. No. of people involved in accidents is the most critical parameter considering the three ranking algorithms and the correlation matrix. The results also show that the relationship between the input parameters and accident severity is not linear, as all parameters have only a slight correlation with accident severity except No. of people involved in the accident (CC = 0.8782).
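As an illustration of the filter-style ranking described above, the following sketch scores two features against a binary severity label with the Kruskal-Wallis test and a Chi-square contingency test. The feature names and data below are synthetic stand-ins, not the study's records, and MRMR is omitted for brevity:

```python
import numpy as np
from scipy.stats import kruskal, chi2_contingency

rng = np.random.default_rng(0)
n = 500
severity = rng.integers(0, 2, n)                 # 0 = non-fatal, 1 = fatal (synthetic)
n_people = severity * 3 + rng.integers(1, 4, n)  # informative stand-in feature
city = rng.integers(0, 14, n)                    # uninformative stand-in feature

def kruskal_p(feature, target):
    """Kruskal-Wallis p-value: does the feature's distribution differ by class?"""
    groups = [feature[target == c] for c in np.unique(target)]
    return kruskal(*groups).pvalue

def chi2_p(feature, target):
    """Chi-square p-value on the feature/target contingency table."""
    vals, inv = np.unique(feature, return_inverse=True)
    table = np.zeros((len(vals), 2))
    np.add.at(table, (inv, target), 1)   # count (feature value, class) pairs
    return chi2_contingency(table)[1]

# The informative feature should receive the smaller (more significant) p-value
print(kruskal_p(n_people, severity) < kruskal_p(city, severity))  # → True
```

Ranking features by these p-values (smallest first) reproduces the kind of ordering shown in Figs. 3 and 4.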
Support Vector Machine (SVM) is widely recognized in the field of machine learning for its effectiveness in managing uncertain and complex data structures. It is primarily used to construct an optimal separating boundary, known as a hyperplane, that distinguishes between different classes within a dataset. The technique aims to maximize the margin -- the distance between the hyperplane and the closest data points from each class, called support vectors -- which enhances the model's generalization ability and reduces classification errors. In this study, the SVM model is applied using Eq. 2, as illustrated in Fig. 5.
where w is the weight vector orthogonal to the hyperplane, x represents an input in the dataset, b is the bias term, and ∅ denotes the mapping function that projects the inputs into a higher-dimensional feature space. The study's proposed SVM algorithm is shown in Fig. 5.
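A minimal sketch of this decision rule follows; the values of w and b below are illustrative placeholders, not parameters fitted to the study's data. The sign of w·x + b assigns the class, and 2/||w|| is the margin width that SVM training maximizes:

```python
import numpy as np

# Illustrative hyperplane parameters (not fitted to the study's data)
w = np.array([2.0, -1.0])   # weight vector orthogonal to the hyperplane
b = -0.5                    # bias term

def classify(x):
    """The sign of the decision function w·x + b picks the side of the hyperplane."""
    return 1 if np.dot(w, x) + b >= 0 else -1

margin_width = 2.0 / np.linalg.norm(w)   # the quantity SVM training maximizes

print(classify(np.array([1.0, 0.0])))    # w·x + b = 1.5  → 1
print(classify(np.array([0.0, 2.0])))    # w·x + b = -2.5 → -1
```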
Artificial neural networks (ANN) are one of the most widely employed machine learning techniques, drawing inspiration from the intricate neural networks in the human brain. Among the various types of ANNs, feedforward neural networks are ubiquitous. They transmit the processed weight values of each artificial neuron as output to the subsequent layer, relying on inputs from neurons in the preceding layer. Within the category of feedforward neural networks, the Multilayer Perceptron (MLP) holds a significant position. For training MLP, the backpropagation algorithm stands out as the most frequently employed technique. It operates by adjusting the weights between neurons to minimize errors. This model excels in learning patterns and demonstrates adaptability to new data values. However, it is worth noting that this system may exhibit slow convergence and the potential for reaching local optima compared to EANN. This research proposed a "Feed-forward neural networks" (FFNN) algorithm consisting of 7 input parameters, 10 neurons in the hidden layer, and one output layer, as shown in Fig. 6.
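The forward pass of the 7-10-1 architecture described above can be sketched as follows. The weights here are random placeholders; the study trains them with backpropagation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Shapes mirror the architecture above: 7 inputs, 10 hidden neurons, 1 output.
# Weight values are random placeholders, not trained parameters.
W1 = rng.standard_normal((10, 7))   # input -> hidden
b1 = np.zeros(10)
W2 = rng.standard_normal((1, 10))   # hidden -> output
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """One feed-forward pass: each layer's weighted sums go through the activation."""
    h = sigmoid(W1 @ x + b1)        # hidden layer (10 neurons)
    return sigmoid(W2 @ h + b2)[0]  # single output in (0, 1)

y_hat = forward(rng.random(7))      # one normalized 7-feature accident record
print(0.0 < y_hat < 1.0)            # → True
```

Backpropagation would adjust W1, b1, W2, and b2 to reduce the error between y_hat and the observed severity label.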
The problem is categorized using Boosted Regression Trees (BRT), which aggregate many binary Classification and Regression Trees (CART). In a CART, each internal node applies a binary test on a single parameter, each branch shows the outcome of the test, and the leaf nodes represent the data categories. CART functions by choosing the most suitable parameter and dividing the data into two groups so that the records within each resulting node are as similar as possible; each region is then split again in turn, as shown in Fig. 7, and so on. If dataset A contains examples from n classes, the Gini index, Gini(A), is defined as:

Gini(A) = 1 - Σ p_j^2 (summed over j = 1, …, n)

where p_j is the relative frequency of class j in A. When dataset A is separated into two smaller subsets, A1 and A2, with sizes N1 and N2, the Gini index of the split data, which contains examples of n groups, is defined as:

Gini_split(A) = (N1/N) Gini(A1) + (N2/N) Gini(A2)
Boosted Regression Trees (BRT) systematically evaluate single-variable splits, selecting the attribute that minimizes the Gini index to determine the optimal node division. The model grows the tree recursively from the root, and once fully expanded, it undergoes a pruning process to remove unnecessary branches and reduce overfitting. One of the key advantages of decision trees lies in their interpretability -- each path from the root to a leaf node can be translated into a clear "if-then" rule, making the decision-making process transparent and easy to follow.
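The Gini-based split evaluation described above can be sketched as follows, on a toy label vector (not the study's data). A pure node scores 0, and the split that minimizes the weighted child score is chosen:

```python
import numpy as np

def gini(labels):
    """Gini index: 1 minus the sum of squared class frequencies (0 = pure node)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(labels, mask):
    """Weighted Gini of the two child nodes produced by a boolean split."""
    n, n1 = len(labels), mask.sum()
    return (n1 / n) * gini(labels[mask]) + ((n - n1) / n) * gini(labels[~mask])

y = np.array([0, 0, 0, 1, 1, 1])   # toy class labels
x = np.array([1, 2, 3, 10, 11, 12])  # toy parameter values

print(gini(y))               # mixed node: 1 - 2 * 0.5^2 = 0.5
print(split_gini(y, x < 5))  # both children pure, so the split scores 0.0
```

Tree growth repeats this evaluation over candidate parameters and thresholds, keeping the split with the lowest weighted Gini.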
Naive Bayes is a probabilistic classification technique commonly applied to tasks like text categorization. It operates under the simplifying assumption that the features used for prediction are statistically independent, an assumption that often does not hold in practical scenarios. The method is based on Bayes' theorem, which calculates the likelihood of an outcome based on prior knowledge of related conditions or events.
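A minimal categorical naive Bayes sketch on toy accident-like records follows; the feature names and counts are invented for illustration. It applies Bayes' theorem under the independence assumption, with add-one smoothing so unseen values keep a nonzero probability:

```python
from collections import Counter, defaultdict

# Toy training records (invented for illustration, not the study's data):
# P(class | features) ∝ P(class) · Π P(feature_i | class), assuming independence.
train = [({"type": "overturn", "in_city": "no"}, "fatal"),
         ({"type": "overturn", "in_city": "no"}, "fatal"),
         ({"type": "collision", "in_city": "yes"}, "nonfatal"),
         ({"type": "collision", "in_city": "yes"}, "nonfatal"),
         ({"type": "collision", "in_city": "no"}, "fatal")]

priors = Counter(label for _, label in train)
likelihood = defaultdict(Counter)          # (class, feature name) -> value counts
for feats, label in train:
    for name, value in feats.items():
        likelihood[(label, name)][value] += 1

def posterior(feats, label):
    p = priors[label] / len(train)
    for name, value in feats.items():
        # add-one (Laplace) smoothing; each toy feature has 2 possible values
        p *= (likelihood[(label, name)][value] + 1) / (priors[label] + 2)
    return p

def predict(feats):
    return max(priors, key=lambda label: posterior(feats, label))

print(predict({"type": "overturn", "in_city": "no"}))   # → fatal
```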
Binary logistic regression is employed to assess the influence of various factors on road accidents when the outcome variable is dichotomous. In logistic regression, the explanatory variables may be either categorical or continuous. For this study, all independent variables are categorical, while the response variable Y, indicating whether an accident is fatal or not, is binary. This justifies the use of binary logistic regression as an appropriate modeling technique. The formulation of the logistic regression model is presented in Eq. 4.
p is the probability of a fatal accident, b0 is the intercept, bi are the model coefficients of the independent variables, and xi are the independent variables.
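The logistic form can be sketched with illustrative coefficients; the values of b0, b1, and b2 below are made up, not the fitted values from the study:

```python
import math

# Illustrative coefficients (not the study's fitted values)
b0, b1, b2 = -2.0, 0.8, 1.5

def fatal_probability(x1, x2):
    """The logistic link maps the linear predictor b0 + b1*x1 + b2*x2 onto (0, 1)."""
    z = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-z))

p = fatal_probability(1, 1)   # z = -2.0 + 0.8 + 1.5 = 0.3
print(round(p, 3))            # → 0.574
```

A predicted p above a chosen cutoff (commonly 0.5) classifies the accident as fatal.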
Data preparation is one of the most important activities in any black-box modeling, and normalization is a key step before model formulation. Normalization brings all the different data parameters into a similar range, so that no parameter overshadows another due to differences in scale. Normalization also reduces the model's complexity and helps achieve the global minima with fewer resources, less time, and lower cost. The data were normalized between 0 and 1 using Eq. 5.
In this context, xn refers to the normalized value, calculated from the observed value x and the observed minimum and maximum values, xmin and xmax, respectively. To evaluate the performance of the developed models, five key metrics were applied: accuracy, sensitivity, specificity, precision, and geometric mean (G-mean), as outlined in Equations 6-10. Collectively, these measures provide a comprehensive understanding of each model's effectiveness.
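The min-max rescaling of Eq. 5 can be sketched as follows, on a toy parameter with a wide raw range:

```python
import numpy as np

def min_max_normalize(x):
    """Eq. 5: rescale each value into [0, 1] using the parameter's min and max."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

speeds = [20, 60, 100]             # toy raw values
print(min_max_normalize(speeds))   # → [0.  0.5 1. ]
```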
Accuracy reflects the overall correctness of the model's predictions; however, it may present a misleading picture, particularly when the dataset is imbalanced. In such cases, the majority class may dominate the accuracy score, masking poor performance on the minority class. Sensitivity, or recall, quantifies the model's ability to correctly identify true positives -- typically the minority class -- while specificity gauges its accuracy in recognizing true negatives, which often represent the majority class. These two metrics are often in tension, requiring careful balancing in evaluation.
Precision indicates the proportion of correct positive predictions among all predicted positives and is particularly important when false positives carry significant cost. A model with high precision is more reliable in its positive classifications.
The G-mean serves as a balanced indicator by taking the square root of the product of sensitivity and specificity. A high G-mean value suggests that the model performs well across both classes, whereas a low G-mean indicates disproportionate classification performance. This metric is especially valuable in imbalanced datasets, helping to detect and avoid tendencies toward overfitting to the dominant class or underfitting to the minority. The G-mean parameter is the best metric for comparing the performance of different models. The evaluation was achieved by using additional statistical measures, namely, True positive (TP), true negative (TN), false positive (FP), and false negative (FN). A true positive (TP) occurs when the model accurately predicts a fatal accident as fatal. A false positive (FP) arises when the model mistakenly labels a non-fatal accident as fatal. Conversely, a false negative (FN) happens when a fatal accident is wrongly classified as non-fatal. Lastly, a true negative (TN) is recorded when the model correctly identifies a non-fatal accident as non-fatal.
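The five metrics of Equations 6-10 follow directly from the four counts, as sketched below on illustrative, imbalanced counts (not results from the study's models):

```python
import math

def metrics(tp, tn, fp, fn):
    """Eqs. 6-10: accuracy, sensitivity, specificity, precision, and G-mean."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall on the fatal (positive) class
    specificity = tn / (tn + fp)          # recall on the non-fatal (negative) class
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    return accuracy, sensitivity, specificity, precision, g_mean

# Imbalanced toy counts: accuracy looks strong while minority-class recall is weak
acc, sens, spec, prec, gmean = metrics(tp=20, tn=160, fp=10, fn=10)
print(round(acc, 2), round(sens, 2), round(gmean, 2))   # → 0.9 0.67 0.79
```

This illustrates why G-mean is preferred for comparing the models: the high accuracy (0.9) hides the much weaker sensitivity (0.67) on the minority class.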