R. Rocha, L. Santos, R. Soares, F. Barbosa and M. D'Angelo,
“Classification of Failure Using Decision Trees Induced by Genetic Programming”,
Latin-American Journal of Computing (LAJC), vol. 11, no. 2, 2024.
Classification of Failure Using Decision Trees Induced by Genetic Programming
ARTICLE HISTORY
Received 2 January 2024
Accepted 19 April 2024
Rogério Costa Negro Rocha
Graduate Program in Computational Modeling and Systems
State University of Montes Claros
Montes Claros, Brazil
rogeriocostanegro@hotmail.com
ORCID: 0009-0002-4667-9656
Laércio Ives Santos
Campus Montes Claros, Federal Institute of Norte de Minas Gerais
Montes Claros, Brazil
laercio.ives@gmail.com
ORCID: 0000-0001-6504-7692
Rafael Almeida Soares
Graduate Program in Computational Modeling and Systems
State University of Montes Claros
Montes Claros, Brazil
rafael.almeida.soares2012@gmail.com
ORCID: 0009-0006-5544-9798
Franciele Alves Barbosa
Graduate Program in Computational Modeling and Systems
State University of Montes Claros
Montes Claros, Brazil
francielealvesb10@gmail.com
ORCID: 0009-0005-3964-7391
Marcos Flávio Silveira Vasconcelos D'Angelo
Department of Computer Science
State University of Montes Claros
Montes Claros, Brazil
marcos.dangelo@unimontes.br
ORCID: 0000-0001-5754-3397
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2024
DOI: 10.5281/zenodo.12192085
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XI, Issue 2, July 2024
Abstract: Fault classification in industrial processes is of paramount importance, as it allows preventive and corrective measures to be implemented before catastrophic failures occur, which can result in significant repair costs and production losses. Therefore, the purpose of this study was to
develop a classification model by merging the concepts of Decision
Trees with Genetic Programming. To accomplish this, the proposed
model randomly generates a set of decision trees using the adapted
Tennessee Eastman dataset. The generation of these trees does not
rely on classical construction logic; instead, they employ an
approach where the structure and characteristics of the trees are
randomly determined and adjusted throughout the evolutionary
process. This approach enables a broader exploration of the search
space and may lead to diverse solutions. The results obtained were
moderate, largely due to the high number of target classes for classification (21 fault classes plus the no-fault class), resulting in complex trees.
The average accuracy on the test data was 0.75, indicating the need
to implement new alternatives and enhancements in the algorithm to
improve the results.
Keywords: decision trees, multiclass classification, fault detection, genetic programming
I. INTRODUCTION
Fault Classification Problems have been widely explored
in various fields of engineering and science, being of utmost
relevance for the detection and prevention of anomalies in
systems and processes. Early fault detection plays a
fundamental role in proactive maintenance and ensuring
efficient and safe operations in complex and interconnected
environments, such as industrial production systems,
telecommunications networks, and transportation systems.
In this context, fault classification involves the
development of models and algorithms capable of identifying
subtle patterns in data, distinguishing between normal
situations and anomalous behaviors. Advanced machine
learning and data mining techniques have been applied to
address these challenges.
Precise fault detection and classification not only reduce
costs associated with unplanned downtime, but also contribute
to process optimization and worker safety.
To achieve this, models based on Machine Learning (ML)
techniques can be used for fault classification. ML techniques
are interesting because they emulate the human way of
thinking and decision-making, analyze large datasets
containing many features in a reasonable time, and can handle
complex relationships between data, making them more
accurate than human experts in some specific situations [1].
Given the above, this work considers decision trees, an ML technique for classification problems. The use of
decision trees has proven to be a promising approach in the
context of fault classification problems. Decision trees offer
an intuitive and interpretable way to model complex patterns
in data, allowing the identification of relevant features for the
classification of different types of faults in industrial systems.
Additionally, as highlighted by [2], the hierarchical nature of
decision trees mirrors human decision-making processes,
making them suitable for analyzing anomalous behaviors in
interconnected systems.
The application of decision trees for fault classification
can also be enhanced with data preprocessing techniques, such
as relevant feature selection. Thus, the use of decision trees
offers a versatile and effective approach to address the
challenges of fault detection and classification in complex
systems. However, such algorithms often use a greedy
strategy and tend to fall into local optima. Moreover, the
recursive partitioning policy in the construction phase can
result in datasets with low cardinality for the attribute
selection process in deeper tree nodes, causing data
overfitting.
Furthermore, researchers have considered the application
of evolutionary algorithms to induce decision trees,
specifically through Genetic Programming (GP). GP is an
evolutionary algorithm that evolves a set of individuals
represented in the form of trees [3]. When GP is applied to the induction of decision trees, it is possible to deal with
multiple attributes simultaneously, reducing the dependence
on feature selection methods in preprocessing and still
providing a global search strategy [4]. This approach is worth testing, given that works using evolutionary computation techniques to induce decision trees have been uncommon in recent years.
Therefore, this work aims to build a multiclass
classification algorithm based on decision trees induced by
genetic programming, with the purpose of classifying faults in
the adapted database of the Tennessee Eastman Process
Simulation and analyzing its accuracy results. Experiments
were conducted to assess the quality and complexity of the
solutions found. The results obtained indicated that the model
presents moderate results for fault classification in the chosen
database and results in complex trees; therefore, new
strategies need to be applied to the algorithm to achieve better
results and performance.
II. LITERATURE REVIEW
A. Decision Trees
Decision Trees are widely used algorithms in machine
learning to solve classification and regression problems. Data
is organized in a tree-like structure, wherein each inner node
signifies a decision derived from a particular attribute, and
each terminal node, or leaf, corresponds to either a
classification label or a regression value [5].
One of the advantages of decision trees is their
interpretability. Their representation, especially when viewed
graphically, is easily understandable. One can follow the logic
of each node and interpret it until reaching a leaf node, which
indicates the class of the instance, for example. Additionally,
decision trees have the ability to handle both numerical and
categorical data. They can represent complex relationships
between attributes and classes, making them suitable for
modeling nonlinear data [6].
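As an illustration only (not the authors' implementation), the structure described above can be sketched as a minimal node class with a traversal routine; the attribute indices, threshold, and class labels below are invented for the example:

```python
# Minimal sketch of a decision tree: internal nodes test one attribute
# against a threshold, and leaves hold a class label.
class Node:
    def __init__(self, attribute=None, threshold=None,
                 left=None, right=None, label=None):
        self.attribute = attribute   # index of the attribute tested here
        self.threshold = threshold   # split value for numeric attributes
        self.left = left             # subtree for attribute <= threshold
        self.right = right           # subtree for attribute > threshold
        self.label = label           # class label (leaf nodes only)

def predict(node, x):
    """Follow decisions from the root until a leaf is reached."""
    while node.label is None:
        node = node.left if x[node.attribute] <= node.threshold else node.right
    return node.label

# Example: a depth-1 tree classifying on attribute 0
tree = Node(attribute=0, threshold=0.5,
            left=Node(label="fault"), right=Node(label="normal"))
print(predict(tree, [0.3]))  # fault
print(predict(tree, [0.9]))  # normal
```

Following the logic from the root to a leaf in this way is what makes the representation easy to interpret.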
To evaluate a decision tree, the Misclassification Error
criterion can be used [6]. In this criterion, the number of
correct predictions is measured by comparing predicted
outputs with true outputs, resulting in accuracy. Accuracy
assesses the ratio of correctly classified examples to the total
number of evaluated examples. Higher accuracy indicates a
greater number of accurate predictions.
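The accuracy computation behind the Misclassification Error criterion reduces to a simple ratio:

```python
def accuracy(y_true, y_pred):
    # Fraction of correctly classified examples
    # (accuracy = 1 - misclassification error)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
```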
B. Genetic Programming
Genetic Programming (GP) is an artificial intelligence
technique that uses principles inspired by biological evolution
to evolve solutions for complex problems. In this approach, a
set of random solutions is represented as genetic structures
that can be combined and mutated over several generations,
generating new individuals representing new solutions, with
the aim of finding optimal or approximate solutions to a
problem [3].
Genetic programming starts with an initial population of
potential solutions (good or bad), known as individuals. In
each generation, these individuals are evaluated based on a
fitness function that quantifies how well they solve the given
problem. Individuals with higher fitness are more likely to be
selected for reproduction, where crossover (recombination)
and mutation operations occur, similar to the processes of
genetic evolution [3].
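The generational cycle described above can be sketched as follows; this is a generic, hedged example (not the paper's PG class), and all function names, rates, and the toy problem are illustrative assumptions:

```python
import random

def evolve(population, fitness, crossover, mutate,
           generations=50, tournament_k=3, mutation_rate=0.1):
    """Generic generational loop: evaluate, select by tournament,
    recombine, and mutate to form the next generation."""
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]

        def tournament():
            # Best of k randomly drawn individuals (highest fitness wins)
            return max(random.sample(scored, tournament_k),
                       key=lambda s: s[0])[1]

        population = []
        while len(population) < len(scored):
            child = crossover(tournament(), tournament())
            if random.random() < mutation_rate:
                child = mutate(child)
            population.append(child)
    return max(population, key=fitness)

# Toy usage: evolve real numbers toward the maximum of -(x - 3)^2
random.seed(0)
best = evolve(population=[random.uniform(-10, 10) for _ in range(20)],
              fitness=lambda x: -(x - 3) ** 2,
              crossover=lambda a, b: (a + b) / 2,
              mutate=lambda x: x + random.uniform(-0.5, 0.5))
```

In the decision-tree setting, individuals would be trees rather than numbers, and crossover would exchange subtrees instead of averaging values.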
The genetic programming approach allows the exploration
of a broad solution space in search of effective solutions for
complex and multidimensional problems. It is applied in
various fields, including optimization, machine learning, and
modeling.
C. Genetic Programming Applied to Decision Trees
Genetic programming (GP) applied to decision trees
represents an innovative approach in the field of artificial
intelligence. In this paradigm, decision trees are portrayed as
chromosomes, enabling the evolution of effective solutions
for multiclass classification problems. [7] emphasizes that this
genetic representation facilitates the application of
evolutionary operators, such as crossover and mutation, to
generate new generations of decision trees, allowing the
discovery of novel and improved solutions to the addressed
problem.
The evolutionary process unfolds over iterations, where
trees are selected for a reproduction pool, forming pairs that
crossbreed to produce new individuals. Trees that are more
adapted, as per a fitness function, have higher chances of
being chosen for reproduction. This evolutionary approach
aims to find decision trees that optimally fit the data patterns.
Nguyen et al. [8] underscore the importance of a well-defined
fitness function to efficiently guide the evolutionary process.
The advantages of this approach include the ability to
handle complex problems and the flexibility to evolve
decision tree structures without the need for manual definition.
However, challenges such as the potential uncontrolled
growth of the tree (overfitting) need to be addressed. [9]
discuss strategies, such as penalties in the fitness function, to
mitigate these challenges and ensure more generalized
solutions.
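A size penalty of the kind [9] discuss can be sketched as below; the penalty form and the weight `alpha` are assumptions for illustration, not values from the cited work:

```python
def penalized_fitness(predict, node_count, X, y, alpha=0.01):
    """Accuracy minus a size penalty. The weight alpha (an assumed
    value) trades predictive quality against tree complexity,
    discouraging the uncontrolled growth that leads to overfitting."""
    correct = sum(1 for xi, yi in zip(X, y) if predict(xi) == yi)
    return correct / len(y) - alpha * node_count

# Example: two trees with identical accuracy; the smaller one scores higher
X, y = [[0], [1], [2], [3]], [0, 0, 1, 1]
small = penalized_fitness(lambda x: int(x[0] >= 2), 3, X, y)
large = penalized_fitness(lambda x: int(x[0] >= 2), 15, X, y)
print(small > large)  # True
```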
In summary, genetic programming applied to decision
trees offers a promising approach to solve multiclass
classification problems, combining the flexibility of genetic
evolution with the structured representation of decision trees.
However, the careful selection of parameters and strategies to
prevent overfitting is crucial in the development and
implementation of this technique.
III. METHODOLOGY
A. Used Database
The Tennessee Eastman Process Simulation database is
widely recognized as a benchmark in the field of process
engineering and fault detection. Developed at the Eastman Chemical Company in the United States, this database was
designed to allow the evaluation and comparison of fault
detection, diagnosis, and prediction algorithms and methods
in a simulated environment of a complex chemical process
[10]. Researchers employ this dataset to test and compare
anomaly detection algorithms, pattern identification, and
diagnosis in a chemical process scenario, fostering
advancements in the field [11].
In total, the original database has 55 columns, with 54
input attributes and 1 output attribute. The column that
presents the output attribute is called "faultNumber",
representing the fault number, ranging from 0 to 21. This
expresses a classification of 22 fault classes, where class 0
means no fault, and the other classes (1 to 21) represent the
fault classification number.
In the present study, a database derived from a fault
detection model based on Qualitative Trend Analysis (QTA),
proposed by [12], was used. This model adopts a two-step
process for fault detection in the Tennessee Eastman database.
The first step uses the fuzzy set theory, while the second one
relies on a Bayesian approach for detecting change points in
time series, providing an indication of a possible fault. If this
indication is established, the proposed model takes
responsibility for identifying the specific nature of the fault.
Additionally, 22 variables were eliminated from the
dataset in two phases by [12]. In the first phase, a correlation matrix computed from the 52 input variables of the database used by the authors was employed. Variables with a correlation
below 0.6 were eliminated, resulting in a reduction of 15
variables.
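One possible reading of this filtering step can be sketched with NumPy; the interpretation that a variable is dropped when its strongest absolute correlation with the other variables falls below 0.6 is an assumption here, as [12] specifies only the cutoff:

```python
import numpy as np

def low_correlation_filter(X, names, threshold=0.6):
    """Drop variables whose strongest absolute correlation with any
    other variable falls below the threshold (assumed interpretation)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)  # ignore each variable's self-correlation
    keep = corr.max(axis=0) >= threshold
    return X[:, keep], [n for n, k in zip(names, keep) if k]

# Example: v2 is weakly correlated with v0 and v1, so it is dropped
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, -1.0],
              [3.0, 6.0, 1.0],
              [4.0, 8.0, -1.0]])
Xf, kept = low_correlation_filter(X, ["v0", "v1", "v2"])
print(kept)  # ['v0', 'v1']
```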
In the second phase, 7 more variables that showed no
indications of faults by the Fuzzy/Bayesian approach were
discarded. In other words, in the calculation of the new
probability vector for change points for these variables, no
changes were detected. Consequently, the vectors were zeroed
out, resulting in the variables having only zero values, which
does not affect fault classification.
Thus, 30 input variables were retained, which were used
to train and test the classifier proposed in this work.
In the end, the dataset proposed by [12] presented 4180
instances, with 30 input attributes and 1 output attribute. In
this sampling, all classes have 200 instances, except for
classes 1, 9, 15, 19, 18, and 20, which have 199, 190, 198, 195,
199, and 199 instances, respectively.
B. Developed Algorithm
Using the concepts of decision trees and genetic
programming, a predictive model was developed using
Python. Three classes were created in total: 'Node', which
stores the nodes of the tree; 'DecisionTree', which represents the classification trees; and finally the 'PG' class (Genetic Programming), which contains the genetic operators that create the tree population and evolve it with the aim of finding
better solutions for the fault classification problem.
When executed, the algorithm takes training parameters
and the maximum number of iterations as inputs. It can also
receive additional parameters such as the maximum depth of
the trees, population size, crossover rate, mutation rate,
elitism, the number of individuals participating in tournaments
during the selection phase, and the number of split points for
individuals during the crossover phase.
The pseudocode of the developed algorithm can be seen in
Fig. 1.
Fig. 1. Developed algorithm
In line 1 of the pseudocode, we have the first function to
be called, which is GenerateInitialPop() that takes the training
data as a parameter, aiming to generate the initial population
randomly.
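The random generation step might be sketched as follows; the nested-tuple representation, the 30% leaf probability, and the [0, 1] threshold range are illustrative assumptions, not the authors' implementation:

```python
import random

def random_tree(n_attributes, classes, max_depth):
    """Grow a random tree as nested tuples: a leaf is ('leaf', class);
    an internal node is ('split', attribute, threshold, left, right).
    Structure is random rather than driven by a split criterion, so the
    initial population covers a broad region of the search space."""
    if max_depth == 0 or random.random() < 0.3:
        return ('leaf', random.choice(classes))
    return ('split',
            random.randrange(n_attributes),   # attribute tested at this node
            random.uniform(0.0, 1.0),         # random split threshold
            random_tree(n_attributes, classes, max_depth - 1),
            random_tree(n_attributes, classes, max_depth - 1))

# Example sized after this study: 30 input attributes, 22 classes
random.seed(1)
population = [random_tree(n_attributes=30, classes=list(range(22)),
                          max_depth=5)
              for _ in range(10)]
print(len(population))  # 10
```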
After the initial population is formed, the algorithm enters a loop, which lasts for the specified maximum number of iterations.