LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol. XI, Issue 2, July 2024
ISSN: 1390-9266 e-ISSN: 1390-9134
DOI: 10.5281/zenodo.12192085
R. Rocha, L. Santos, R. Soares, F. Barbosa and M. D'Angelo, "Classification of Failure Using Decision Trees Induced by Genetic Programming", Latin-American Journal of Computing (LAJC), vol. 11, no. 2, 2024.
multiple attributes simultaneously, reducing the dependence
on feature selection methods in preprocessing and still
providing a global search strategy [4]. This approach is worth investigating, since works that use evolutionary computation techniques to induce decision trees have been uncommon in recent years.
Therefore, this work aims to build a multiclass
classification algorithm based on decision trees induced by
genetic programming, with the purpose of classifying faults in
the adapted database of the Tennessee Eastman Process
Simulation and analyzing its accuracy results. Experiments
were conducted to assess the quality and complexity of the
solutions found. The results indicate that the model achieves moderate fault-classification accuracy on the chosen database and produces complex trees; new strategies must therefore be applied to the algorithm to improve both accuracy and performance.
II. LITERATURE REVIEW
A. Decision Trees
Decision Trees are widely used algorithms in machine
learning to solve classification and regression problems. Data
is organized in a tree-like structure, wherein each inner node
signifies a decision derived from a particular attribute, and
each terminal node, or leaf, corresponds to either a
classification label or a regression value [5].
One of the advantages of decision trees is their
interpretability. Their representation, especially when viewed
graphically, is easily understandable. One can follow the logic
of each node and interpret it until reaching a leaf node, which
indicates the class of the instance, for example. Additionally,
decision trees have the ability to handle both numerical and
categorical data. They can represent complex relationships
between attributes and classes, making them suitable for
modeling nonlinear data [6].
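To make this interpretability concrete, the sketch below fits a shallow tree with scikit-learn's `DecisionTreeClassifier` on synthetic numeric data (the library, dataset, and depth limit are illustrative choices, not the paper's setup) and prints the fitted tree as readable if/else rules:

```python
# Illustrative decision-tree sketch; scikit-learn's DecisionTreeClassifier
# stands in here for any CART-style inducer.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic numeric data: 4 attributes, 3 classes.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# The fitted tree prints as nested if/else rules, which can be followed
# from the root down to a leaf that assigns the class label.
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```

Following any root-to-leaf path in the printed rules reproduces exactly the decision logic described above.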
To evaluate a decision tree, the Misclassification Error criterion can be used [6]. Under this criterion, predicted outputs are compared with the true outputs to count correct predictions, yielding the accuracy: the ratio of correctly classified examples to the total number of evaluated examples. The misclassification error is simply one minus the accuracy, so higher accuracy corresponds to fewer misclassified examples.
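The accuracy and misclassification-error computation just described reduces to a few lines of plain Python (a sketch with made-up labels, not the paper's evaluation code):

```python
# Accuracy: fraction of examples whose predicted label equals the true one.
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]
acc = accuracy(y_true, y_pred)   # 4 of 5 correct -> 0.8
error = 1.0 - acc                # misclassification error, approximately 0.2
print(acc, error)
```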
B. Genetic Programming
Genetic Programming (GP) is an artificial intelligence
technique that uses principles inspired by biological evolution
to evolve solutions for complex problems. In this approach, a
set of random solutions is represented as genetic structures
that can be combined and mutated over several generations,
generating new individuals representing new solutions, with
the aim of finding optimal or approximate solutions to a
problem [3].
Genetic programming starts with an initial population of
potential solutions (good or bad), known as individuals. In
each generation, these individuals are evaluated based on a
fitness function that quantifies how well they solve the given
problem. Individuals with higher fitness are more likely to be
selected for reproduction, where crossover (recombination)
and mutation operations occur, similar to the processes of
genetic evolution [3].
The genetic programming approach allows the exploration
of a broad solution space in search of effective solutions for
complex and multidimensional problems. It is applied in
various fields, including optimization, machine learning, and
modeling.
C. Genetic Programming Applied to Decision Trees
Genetic programming (GP) applied to decision trees
represents an innovative approach in the field of artificial
intelligence. In this paradigm, decision trees are portrayed as
chromosomes, enabling the evolution of effective solutions
for multiclass classification problems. [7] emphasizes that this
genetic representation facilitates the application of
evolutionary operators, such as crossover and mutation, to
generate new generations of decision trees, allowing the
discovery of novel and improved solutions to the addressed
problem.
The evolutionary process unfolds over iterations: trees are selected into a reproduction pool, paired, and recombined to produce new individuals. Trees that are better adapted, according to a fitness function, have a higher chance of being chosen for reproduction. This evolutionary approach aims to find decision trees that fit the data patterns optimally.
Nguyen et al. [8] underscore the importance of a well-defined
fitness function to efficiently guide the evolutionary process.
The advantages of this approach include the ability to
handle complex problems and the flexibility to evolve
decision tree structures without the need for manual definition.
However, challenges such as uncontrolled tree growth, which leads to overfitting, need to be addressed. [9] discuss strategies, such as size penalties in the fitness function, to mitigate these challenges and obtain more generalizable solutions.
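A size penalty of the kind cited above can be sketched as follows; the linear penalty form and the weight `alpha` are hypothetical illustrations, not values from [9] or from this paper:

```python
# Penalized fitness sketch: reward accuracy, penalize tree size to
# discourage uncontrolled growth. `alpha` is a hypothetical tuning weight.
def penalized_fitness(accuracy, num_nodes, alpha=0.01):
    """Higher is better: raw accuracy minus a complexity penalty."""
    return accuracy - alpha * num_nodes

# Two trees with equal accuracy: the smaller one scores higher.
big = penalized_fitness(accuracy=0.90, num_nodes=120)
small = penalized_fitness(accuracy=0.90, num_nodes=15)
print(big, small)
```

Under such a scheme, selection pressure favors compact trees, which tend to generalize better.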
In summary, genetic programming applied to decision
trees offers a promising approach to solve multiclass
classification problems, combining the flexibility of genetic
evolution with the structured representation of decision trees.
However, the careful selection of parameters and strategies to
prevent overfitting is crucial in the development and
implementation of this technique.
III. METHODOLOGY
A. Used Database
The Tennessee Eastman Process Simulation database is
widely recognized as a benchmark in the field of process
engineering and fault detection. Developed by the Oak Ridge
National Laboratory in the United States, this database was
designed to allow the evaluation and comparison of fault
detection, diagnosis, and prediction algorithms and methods
in a simulated environment of a complex chemical process
[10]. Researchers employ this dataset to test and compare
anomaly detection algorithms, pattern identification, and
diagnosis in a chemical process scenario, fostering
advancements in the field [11].
In total, the original database has 55 columns: 54 input attributes and 1 output attribute. The output column, named "faultNumber", holds the fault number, ranging from 0 to 21. This yields a 22-class classification problem, where class 0 means no fault and classes 1 to 21 identify the type of fault.
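The described layout can be mirrored with a small synthetic stand-in (random values replace the real simulation data here; only the column structure and the 0–21 class range follow the description above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the described layout: 54 input attributes plus
# the "faultNumber" target, whose classes run from 0 (no fault) to 21.
n_rows = 200
data = pd.DataFrame(rng.normal(size=(n_rows, 54)),
                    columns=[f"x{i}" for i in range(1, 55)])
data["faultNumber"] = rng.integers(0, 22, size=n_rows)

X = data.drop(columns="faultNumber")   # 54 input attributes
y = data["faultNumber"]                # 22-class target (0 to 21)
print(X.shape, sorted(y.unique())[:5])
```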