45
H. Darton,
“Malware Detection with CNNs on Entropy and Greyscale Images”,
Latin-American Journal of Computing (LAJC), vol. 13, no. 1, 2026.
Malware Detection with
CNNs on Entropy and
Greyscale Images
ARTICLE HISTORY
Received 11 October 2025
Accepted 26 November 2025
Published 6 January 2026
Harry Darton
Sheeld Hallam University
School of Computing and Digital Technologies
Sheeld, United Kingdom
harryphuket@gmail.com
ORCID: 0009-0004-5674-2609
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026
This work is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License.
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026 46
DOI:
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
https://doi.org/10.33333/lajc.vol13n1.04
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January - June 2026
Malware Detection with CNNs on Entropy and
Greyscale Images
Harry Darton
Sheffield Hallam University
School of Computing and Digital Technologies
Sheffield, United Kingdom
harryphuket@gmail.com
Abstract This study investigates whether convolutional neural
networks (CNNs) trained on visual representations of Portable
Executable (PE) files can rival traditional machine learning
classifiers trained on engineered features. A dataset of over 200,000
PE files [1] was used to derive two feature sets (Basic and Ember-
Lite) [2] and to generate 256x256 greyscale and entropy images
[3],[4]. Three CNNs (SimpleCNN, ResNet-18 [5], EfficientNet-B0
[6]) were trained and evaluated against five baselines (Random
Forest, XGBoost [7], CatBoost [8], LightGBM, Logistic
Regression). Tree-based models with enriched features achieved the
highest scores, with CatBoost reaching a ROC-AUC of 0.990. The
best CNN, EfficientNet-B0 on entropy images, obtained a ROC-
AUC of 0.954. Although CNNs did not surpass feature-based
models, they showed competitive results when feature engineering
was constrained. These findings indicate that visual approaches
offer a promising alternative for static malware detection,
particularly when combined with entropy-based representations [9].
Keywords malware detection, convolutional neural networks,
entropy images, greyscale images, static analysis
I. INTRODUCTION
The rapid evolution of malicious software continues to
pose a critical challenge to global cybersecurity. Traditional
static and dynamic analysis techniques remain central to
malware detection, yet their scalability and adaptability are
increasingly strained by the volume and complexity of new
samples emerging daily [10]. Static analysis, which inspects
binary structure without execution, provides efficiency and
safety but depends heavily on manually engineered features.
These handcrafted representations are sensitive to obfuscation
and require expert knowledge to maintain. Recent advances in
deep learning have introduced new possibilities for automated
feature extraction that may overcome such limitations [11].
Malware detection can be viewed as a binary classification
problem in which a model must learn discriminative patterns
between benign and malicious Portable Executable (PE) files.
Earlier static approaches focused on syntactic features such as
byte histograms, imported libraries, and section metadata [2].
While these features remain effective, they are often dataset-
specific and may fail when the underlying malware
distribution shifts. Convolutional Neural Networks (CNNs)
offer an alternative pathway [11] by learning directly from raw
or visually transformed data. In computer vision, CNNs have
achieved outstanding success in recognizing complex spatial
relationships within images. When applied to malware, they
can automatically extract high-level spatialstatistical
representations from a binary’s byte sequence, potentially
reducing reliance on expert feature engineering [12].
Transforming PE files into visual formats has gained
attention because it allows direct use of image-based
architectures without disassembling executables. Two
representations have become prominent: greyscale images,
formed by mapping byte values to pixel intensities, and
entropy images [3],[4],[13], which highlight structural
irregularities related to code density and compression. These
visualizations reveal characteristic patterns that correspond to
malware families, packing, and section entropy, providing
CNNs with texture-like information unavailable to
conventional static models. However, existing studies differ
widely in dataset size, preprocessing, and evaluation
methodology, making it difficult to assess [9],[14] whether
CNN-based visual analysis can truly compete with established
feature-based models.
Prior work has shown that tree-based ensembles such as
Random Forest, XGBoost [7], CatBoost [8], and LightGBM
deliver high accuracy on engineered feature sets like EMBER
[2], often exceeding 0.99 ROC-AUC. Although several
researchers have experimented with CNNs on image
representations [3],[9],[15], many comparisons are indirect,
rely on small datasets, or use pretrained vision networks
without systematic control of variables. The literature
therefore lacks a consistent large-scale benchmark that
contrasts CNN performance with traditional models under
identical data and evaluation conditions. Furthermore, while
some studies report promising CNN results, few investigate
how architectural complexity or input modality (greyscale vs
entropy vs combined) affect performance [6],[15] or
generalization.
This research addresses that gap through a controlled,
large-scale comparison between visual-representation CNNs
and feature-based classifiers for static malware detection. A
dataset of more than 200,000 PE files was processed to
generate both engineered features and 256x256 image
representations [1]. Three CNN architectures of increasing
depth, SimpleCNN, ResNet-18[5], and EfficientNet-B0 [6],
were trained from scratch using greyscale, entropy, and dual-
channel inputs. Their results were benchmarked against five
tree-based baselines (Random Forest, XGBoost [7], CatBoost
[8], LightGBM, and Logistic Regression) built on two feature
sets: a compact Basic subset and an Ember-Lite extension. All
models were evaluated under identical stratified splits and
metrics, including ROC-AUC, F1-score, precision, recall, and
accuracy.
The study contributes in three principal ways:
1. It provides a reproducible comparison between CNN-
based and feature-based static detection using the same
dataset and preprocessing pipeline.
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026 47
DOI:
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
https://doi.org/10.33333/lajc.vol13n1.04
H. Darton,
Malware Detection with CNNs on Entropy and Greyscale Images”,
Latin-American Journal of Computing (LAJC), vol. 13, no. 1, 2026.
2. It analyses the effect of image representation
(greyscale, entropy, and combined) on CNN
performance.
3. It evaluates how model depth influences the trade-off
between feature learning capacity and overfitting risk.
By unifying these aspects, the work offers an empirical
baseline for future research into vision-based malware
detection and clarifies the extent to which CNNs can replace
or complement engineered features [9].
II. BACKGROUND AND RELATED WORK
Research on static malware detection has evolved from
handcrafted feature extraction to representation-learning
approaches based on neural networks. Early static pipelines
focused on syntactic features derived from Portable
Executable (PE) headers, import tables, and byte histograms
[16], [17]. Ensemble methods such as Random Forest and
XGBoost became widely adopted because they balanced
predictive accuracy with interpretability.
Subsequent studies explored the visual encoding of
binaries as two-dimensional matrices, where raw bytes were
transformed into greyscale images to capture structural
patterns [3]. This approach enabled convolutional networks to
learn discriminative textures associated with packed or
obfuscated code without explicit feature engineering. Han et
al. [4] and Kalash et al. [12] extended this concept by
employing deeper CNNs that achieved performance
comparable to engineered-feature baselines.
Entropy-based representations introduced a further
dimension by quantifying local randomness across the file,
highlighting high-entropy regions characteristic of
compression and encryption [15]. Brosolo et al. [14]
demonstrated that combining byte and entropy channels
improved separability between benign and malicious samples.
However, most prior work relied on small datasets or
inconsistent preprocessing, limiting reproducibility.
The present study addresses these limitations through a
unified pipeline incorporating both feature-based and image-
based approaches under controlled preprocessing and multi-
seed evaluation. By comparing ensemble and CNN
architectures directly on 200,000+ samples, it contributes
empirical evidence on the relative merits of learned versus
handcrafted representations for static malware detection.
III. METHODOLOGY
In this section, the methodology featured in Fig. 2 is
explained as follows:
A. Dataset Construction
The experiments employed a static malware dataset
containing more than 200,000 Portable Executable (PE) files,
comprising roughly equal proportions of benign and malicious
samples. Each file was validated to ensure accessibility and
correct labelling. No dynamic execution or network traffic
data were used, maintaining strict static conditions. A group-
wise stratified split produced three subsets: 70% for training,
15% for validation, and 15% for independent testing.
Stratification ensured proportional representation of malware
families and preserved class balance across splits. All
preprocessing, feature extraction, and model training were
performed offline to eliminate any risk of infection.
Fig. 2. Methodological workflow for dataset processing, model training, and
evaluation.
The figure summarizes the experimental pipeline from dataset acquisition
and preprocessing through feature engineering, image generation, and model
evaluation across five random seeds.
B. Feature Engineering
Two engineered feature sets were created to provide
traditional baselines.
1. Basic set: extracted lightweight structural attributes
such as file size, section counts, imported library
frequencies, and entropy statistics.
2. Ember-Lite set: extended the Basic features with a
reduced subset of the EMBER 2018 dataset, including
byte histograms, header metadata, and string features.
The Ember-Lite feature set used here comprised only a
small, computationally lightweight subset of the
EMBER 2018 feature groups, selected to minimize
pre-processing overhead whilst still producing a more
detailed, EMBER-oriented set of features for
comparison.
Both sets were scaled using minmax normalization.
These vectors served as input to five machine-learning
models: Random Forest, XGBoost, CatBoost, LightGBM, and
Logistic Regression. Hyperparameters were tuned by grid
search on the validation split.
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026 48
DOI:
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
https://doi.org/10.33333/lajc.vol13n1.04
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January - June 2026
C. Image Generation
For the visual-representation branch, each PE file was
converted into two 256x256 images: a greyscale byte map and
an entropy image. PE files shorter than 65,536 bytes were
zero padded before reshaping and files exceeding this length
were truncated so that all samples produced a consistent
256x256 matrix.
The greyscale representation mapped each byte value
(0-255) to a pixel intensity, preserving sequential
order.
The entropy representation applied a sliding-window
Shannon entropy calculation to highlight structural
irregularities related to code density, packing, and
compression.
Images were stored as compressed PNGs and normalized
to the [0, 1] range at load time. Three input modalities were
tested: single-channel greyscale, single-channel entropy, and
dual-channel combined images. The dual-channel version
concatenated normalized greyscale and entropy matrices
along the channel axis to retain spatial alignment.
Representative examples of the three image modalities are
shown in Fig. 1.
Fig. 1. Visualization of greyscale, entropy, and dual-channel image
representations used for CNN training. The dual-channel view merges
greyscale and entropy channels into red and green for clarity; CNNs process
both as greyscale tensors.
D. CNN Architectures
Three convolutional neural networks of increasing depth
and complexity were implemented to explore architectural
effects.
1. SimpleCNN: a custom lightweight model with three
convolutional blocks and max-pooling, designed to
provide a minimal baseline.
2. ResNet-18: a residual architecture enabling deeper
feature learning while mitigating vanishing-gradient
issues.
3. EfficientNet-B0: a compound-scaled network
optimizing depth, width, and resolution for parameter
efficiency.
Each model concluded with a global average-pooling layer
and a fully connected single sigmoid output neuron optimized
with binary cross-entropy loss. All architectures were trained
from scratch rather than fine-tuned from natural-image
weights to maintain domain specificity.
More recent vision architectures exist, but the chosen trio
provides a controlled progression from shallow to moderately
deep networks, enabling a fair comparison of representational
capacity while keeping reproducibility and computational cost
practical for large-scale malware datasets.
E. Training Procedure
Training was performed using the PyTorch 2.8 framework
on GPU hardware. Batch size, learning rate, and weight decay
were tuned empirically through pilot runs. Early stopping
based on validation loss prevented overfitting, and the best
model weights were checkpointed. Feature vectors and image
tensors were normalized to the [0, 1] range prior to training.
Early experiments confirmed that standardization to [-1, 1] or
z-scoring did not improve convergence for models trained
from scratch. All random seeds, splits, and preprocessing
parameters were fixed for reproducibility. During training,
data augmentation was limited to horizontal and vertical flips
to avoid distorting binary layout information. Each
experiment was repeated across five random seeds to estimate
variability.
F. Evaluation Metrics
Model performance was assessed on the held-out test set
using multiple metrics:
ROC-AUC as the principal discrimination measure.
F1-score, precision, recall, and accuracy for
complementary evaluation.
Calibration curves and confusion matrices for selected
runs to examine reliability and error distribution.
All results were reported as mean ± standard deviation
across seeds. Timing and resource usage were recorded but
not compared formally because training occurred on
heterogeneous hardware.
IV. DEPLOYMENT AND IMPLEMENTATION
All experiments were executed on a Windows 11
workstation equipped with an NVIDIA RTX 5080 GPU (16
GB VRAM), a Ryzen 7 7800X3D CPU, and 32 GB RAM.
The environment used Python 3.12 and PyTorch 2.8.0 (CUDA
12.8) within a PyCharm-managed virtual environment.
Dataset preprocessing and feature extraction were performed
offline to prevent malware execution risk.
A. Dataset and Preprocessing
A total of 201,549 PE files from Lester (2021) were
processed into multiple representations: (i) Basic PE features,
(ii) extended Ember-Lite features, (iii) greyscale byte-maps,
(iv) entropy images computed over 32-byte windows, and (v)
dual-channel stacks combining greyscale and entropy tensors.
Group-wise stratification by Imphash ensured partition
integrity across train, validation, and test splits, mitigating
leakage.
B. Model Implementation
Tree-based models (Logistic Regression, Random Forest,
XGBoost, LightGBM, CatBoost) were implemented via
Scikit-learn and native library APIs using identical folds and
hyperparameter grids. For CNNs, three architectures were
selected to represent increasing structural depth: SimpleCNN
(three convolutional layers), ResNet-18, and EfficientNet-B0.
All networks employed batch normalization, ReLU
activations, early stopping, and Adam optimization with
learning-rate scheduling.
C. Training Protocol
Training used Adam (learning rate = 1x10⁻³) with
CrossEntropyLoss, ReduceLROnPlateau (patience = 2), early
stopping (patience = 5), and batch size 128, capped at 40
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026 49
DOI:
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
https://doi.org/10.33333/lajc.vol13n1.04
H. Darton,
Malware Detection with CNNs on Entropy and Greyscale Images”,
Latin-American Journal of Computing (LAJC), vol. 13, no. 1, 2026.
epochs. Only the best-performing checkpoint (lowest
validation loss) per run was retained for test evaluation.
Results were averaged across seeds (4246) and reported with
standard deviation.
D. Reproducibility Controls
All outputs (model weights, logs, and metrics.json files)
were stored under versioned directories for full traceability.
Reproducibility was validated by selectively rerunning a
subset of experiments on different hardware (MX250 laptop
and MacBook CPU), confirming consistent metrics within
variance bounds. The entire pipeline is available via a public
GitHub repository.
V. ANALYSIS AND DISCUSSION OF FINDINGS
A. Overview
The study provides a comprehensive comparative
analysis, combining large-scale data, multiple feature
representations, and a multi-seed evaluation to support robust
and generalizable conclusions.
This section presents the quantitative results obtained from
both the feature-based and CNN-based branches of the study.
All models were evaluated on the held-out test partition using
identical metrics and configuration to ensure comparability.
Performance values reported correspond to the mean of five
random-seed runs, with standard deviation shown where
applicable.
B. Feature-Based Baselines
Traditional classifiers trained on the Basic and Ember-Lite
feature sets produced strong results across all metrics. The
Basic feature set already provided a solid baseline, achieving
ROC-AUC scores above 0.96 for ensemble methods.
Incorporating the additional Ember-Lite attributes further
improved separability between benign and malicious samples.
Among the evaluated models, CatBoost delivered the
highest and most consistent performance, reaching 0.990 ±
0.001 ROC-AUC and an F1-score of 0.947 ± 0.005. XGBoost
and LightGBM followed closely, differing by less than 0.002
ROC-AUC, while Random Forest achieved comparable
accuracy with slightly higher variance. Logistic Regression,
as expected, under-fit the nonlinear relationships and
produced the lowest AUC (≈ 0.94). These outcomes confirm
that gradient-boosted tree ensembles remain a robust
benchmark for static malware detection when high-quality
engineered features are available.
C. CNN Performance on Greyscale and Entropy Images
The CNN experiments evaluated three architectures of
increasing complexity (SimpleCNN, ResNet-18, and
EfficientNet-B0) across three input modalities: greyscale,
entropy, and dual-channel combined images.
Greyscale Images: captured structural layout but
limited semantic variation. SimpleCNN achieved
0.886 ± 0.011 ROC-AUC, ResNet-18 0.937 ± 0.008,
and EfficientNet-B0 0.931 ± 0.010.
Entropy Images: provided higher discriminative
information due to encoding of randomness and
packing density. Here, EfficientNet-B0 achieved 0.954
± 0.005 and 0.898 ± 0.008 F1, the best CNN result
overall, indicating that entropy-based spatial cues are
particularly informative for distinguishing obfuscated
malware.
Combined Images: merging greyscale and entropy
channels offered only marginal gains for shallower
networks and, in some cases, introduced redundancy.
ResNet-18 reached 0.950 ± 0.009 ROC-AUC, while
the dual-channel variant of EfficientNet-B0 achieved
0.947 ± 0.004, showing little additional benefit.
Training variability across seeds remained low (±0.002
AUC), confirming stable convergence. Validation loss curves
indicated earlier saturation for SimpleCNN, while deeper
models continued improving over more epochs, reflecting
greater representational capacity.
D. Comparative Analysis
A direct comparison between the best feature-based and
CNN models highlights a clear but narrowing performance
gap. CatBoost (Ember-Lite) exceeded EfficientNet-B0
(Entropy) by approximately 0.036 ROC-AUC, achieved 0.983
± 0.001 precision and 0.913 ± 0.011 recall compared with
0.954 ± 0.004 precision and 0.848 ± 0.013 recall for the CNN.
When feature engineering is limited, CNNs trained on entropy
visualizations approach ensemble-level accuracy without any
handcrafted inputs.
Fig. 3 visualizes the overall comparison. Feature-based
ensembles dominate the upper bound, while CNNs occupy a
competitive mid-band with notably reduced preprocessing
overhead.
Fig. 3. Overall performance comparison of tree-based and CNN-based models
across all input modalities
Table I reports mean ± standard deviation over five
random-seed runs for all models across Basic, Ember-Lite,
greyscale, entropy, and combined inputs.
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026 50
DOI:
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
https://doi.org/10.33333/lajc.vol13n1.04
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January - June 2026
TABLE I. MODEL PERFORMANCE COMPARISON (FEATURE-BASED VS CNN)
Summary of quantitative performance metrics (mean ± standard deviation) for all models across Basic, Ember-Lite, Greyscale, Entropy, and Combined input
representations.
Model Feature Set Bound ROC-AUC F1 Accuracy Precision Recall
CatBoost Ember-Lite Upper 0.990 ± 0.001 0.947 ± 0.005 0.941 ± 0.005 0.983 ± 0.001 0.913 ± 0.011
XGBoost Ember-Lite Upper 0.990 ± 0.000 0.940 ± 0.000 0.933 ± 0.000 0.983 ± 0.000 0.900 ± 0.000
LightGBM Ember-Lite Upper 0.989 ± 0.000 0.936 ± 0.000 0.929 ± 0.000 0.982 ± 0.000 0.894 ± 0.000
RandomForest Ember-Lite Upper 0.989 ± 0.000 0.929 ± 0.004 0.922 ± 0.004 0.977 ± 0.003 0.886 ± 0.008
RandomForest Basic Upper 0.983 ± 0.001 0.932 ± 0.003 0.924 ± 0.003 0.966 ± 0.007 0.900 ± 0.010
CatBoost Basic Upper 0.983 ± 0.001 0.938 ± 0.002 0.930 ± 0.002 0.960 ± 0.003 0.916 ± 0.005
XGBoost Basic Upper 0.983 ± 0.000 0.937 ± 0.000 0.929 ± 0.000 0.963 ± 0.000 0.912 ± 0.000
LightGBM Basic Upper 0.976 ± 0.000 0.919 ± 0.000 0.911 ± 0.000 0.959 ± 0.000 0.882 ± 0.000
XGBoost Ember-Lite Lower 0.971 ± 0.000 0.898 ± 0.000 0.891 ± 0.000 0.980 ± 0.000 0.829 ± 0.000
CatBoost Ember-Lite Lower 0.969 ± 0.001 0.894 ± 0.005 0.887 ± 0.005 0.979 ± 0.003 0.822 ± 0.009
RandomForest Ember-Lite Lower 0.966 ± 0.002 0.904 ± 0.005 0.897 ± 0.005 0.976 ± 0.001 0.842 ± 0.008
LightGBM Ember-Lite Lower 0.965 ± 0.000 0.902 ± 0.000 0.895 ± 0.000 0.971 ± 0.000 0.843 ± 0.000
EfficientNetB0 Entropy - 0.954 ± 0.005 0.898 ± 0.008 0.889 ± 0.008 0.954 ± 0.004 0.848 ± 0.013
ResNet18 Combined - 0.950 ± 0.009 0.904 ± 0.010 0.890 ± 0.015 0.916 ± 0.044 0.895 ± 0.039
EfficientNetB0 Combined - 0.947 ± 0.004 0.898 ± 0.006 0.888 ± 0.006 0.949 ± 0.005 0.852 ± 0.011
ResNet18 Entropy - 0.943 ± 0.006 0.897 ± 0.008 0.888 ± 0.008 0.954 ± 0.014 0.847 ± 0.014
RandomForest Basic Lower 0.940 ± 0.001 0.879 ± 0.002 0.868 ± 0.002 0.930 ± 0.001 0.833 ± 0.003
LogisticRegression Ember-Lite Upper 0.940 ± 0.000 0.855 ± 0.000 0.844 ± 0.000 0.927 ± 0.000 0.793 ± 0.000
ResNet18 Greyscale - 0.937 ± 0.008 0.862 ± 0.013 0.854 ± 0.012 0.943 ± 0.006 0.795 ± 0.025
LogisticRegression Ember-Lite Lower 0.934 ± 0.000 0.876 ± 0.000 0.863 ± 0.000 0.917 ± 0.000 0.838 ± 0.000
EfficientNetB0 Greyscale - 0.931 ± 0.010 0.880 ± 0.008 0.871 ± 0.006 0.945 ± 0.017 0.824 ± 0.027
CatBoost Basic Lower 0.929 ± 0.001 0.874 ± 0.002 0.862 ± 0.003 0.927 ± 0.005 0.826 ± 0.002
XGBoost Basic Lower 0.924 ± 0.000 0.874 ± 0.000 0.862 ± 0.000 0.918 ± 0.000 0.835 ± 0.000
LightGBM Basic Lower 0.918 ± 0.000 0.872 ± 0.000 0.860 ± 0.000 0.918 ± 0.000 0.831 ± 0.000
SimpleCNN Combined - 0.898 ± 0.017 0.777 ± 0.153 0.788 ± 0.095 0.898 ± 0.055 0.728 ± 0.210
LogisticRegression Basic Upper 0.888 ± 0.000 0.801 ± 0.000 0.784 ± 0.000 0.853 ± 0.000 0.755 ± 0.000
SimpleCNN Greyscale - 0.886 ± 0.011 0.849 ± 0.011 0.833 ± 0.017 0.886 ± 0.035 0.817 ± 0.015
SimpleCNN Entropy - 0.879 ± 0.018 0.843 ± 0.012 0.835 ± 0.009 0.931 ± 0.020 0.771 ± 0.032
LogisticRegression Basic Lower 0.819 ± 0.000 0.757 ± 0.000 0.733 ± 0.000 0.798 ± 0.000 0.720 ± 0.000
To further illustrate error distribution between benign and
malicious classifications, Fig. 4 presents normalized
confusion matrices for the best-performing ensemble
(CatBoost) and CNN (EfficientNet-B0) models.
Fig. 4. Normalized confusion matrices comparing the best ensemble
(CatBoost Ember-Lite Upper Bound) and CNN (EfficientNet-B0 Entropy
Input) models. CatBoost achieves slightly superior precision on benign
samples, while EfficientNet-B0 maintains strong recall on malware detection,
demonstrating the narrowing gap between traditional and visual-based static
analysis approaches.
Examination of the confusion matrices showed that CNNs
occasionally misclassified small or lightly obfuscated
malware samples, particularly where entropy images
remained sparse after padding. In contrast, the tree-based
models sometimes mislabeled benign files exhibiting elevated
entropy or atypical section structures. These patterns are
consistent with the feature sensitivities of each model family
and help explain why tree-based models still retain a small
overall advantage.
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026 51
DOI:
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
https://doi.org/10.33333/lajc.vol13n1.04
H. Darton,
Malware Detection with CNNs on Entropy and Greyscale Images”,
Latin-American Journal of Computing (LAJC), vol. 13, no. 1, 2026.
E. Combined Image Representation and Interpretation
The hypothesis that merging greyscale and entropy inputs
would enhance classification performance was not supported
by the results. Across all CNN architectures, the dual-channel
variant produced near-identical or slightly inferior ROC-AUC
values compared with single-channel entropy inputs. This
outcome reflects an important characteristic of PE
visualization: while greyscale mappings and entropy images
appear visually distinct, they encode strongly correlated
structural information.
The greyscale representation captures the sequential byte
distribution of sections, making dense code regions appear
darker and sparse or zero-padded regions lighter. Entropy
visualization, computed over fixed 32-byte windows,
highlights the same structural boundaries by assigning higher
intensity to compressed or encrypted blocks and lower
intensity to static resources or padding. When concatenated,
both modalities effectively describe the same transitions in
spatial density and randomness. This redundancy dilutes
gradient salience during training, as filters in early CNN layers
receive conflicting but overlapping cues, hindering
convergence to strongly discriminative features.
From a signal-processing perspective, direct stacking of
channels also constrains the network to treat the two
modalities as spatially aligned, which may not reflect semantic
complementarity. A more effective fusion could involve
learned attention or adaptive weighting between channels,
allowing the model to emphasize entropy cues where they are
most informative. Another avenue would be RGB-style
compositing, in which greyscale, entropy, and a derived
statistical feature (such as local variance or byte-frequency
gradient) are encoded into separate color channels. This
approach may exploit richer feature interactions akin to
texture analysis in natural images. Alternatively, late fusion
strategies (where embeddings from separate CNN branches
are combined at the dense layer stage) could preserve each
modality’s distinct representational space while capturing
joint correlations.
The observed stagnation of performance in combined
inputs therefore does not imply redundancy between visual
and entropy domains per se, but rather that naive spatial
concatenation fails to capitalize on their complementary
nature. Future research should explore cross-channel
attention, feature-level fusion, or transformer-based
architectures capable of learning non-linear relationships
between modality-specific embeddings.
F. Interpretation of Key Findings
Three principal insights emerge from the experimental
results.
1. Entropy images outperform greyscale inputs, revealing
high-entropy packing and obfuscation regions strongly
correlated with malicious binaries. The results indicate
that entropy-based CNNs capture discriminative
information unavailable from structural byte layouts
alone.
2. Model depth directly influences performance. Deeper
CNNs such as ResNet-18 and EfficientNet-B0
consistently surpass SimpleCNN, confirming that
hierarchical feature extraction enhances visual
malware representation learning.
3. Feature engineering versus representation learning:
feature-based ensembles remain superior when rich
handcrafted features are available, but CNNs offer a
promising alternative for scalable, fully automated
pipelines where feature computation is infeasible or
costly.
The comparative results highlight that entropy
visualizations preserve generalizable statistical cues about
randomness and code density, while feature-engineered
models exploit explicit semantic attributes. The best CNN
configuration (EfficientNet-B0 trained on entropy images)
achieved 0.954 ± 0.005 ROC-AUC, compared with 0.990 ±
0.001 for the CatBoost ensemble on the Ember-Lite feature
set. This ~ 0.036 AUC difference demonstrates a narrowing
gap between learned visual and engineered feature
representations, even though ensemble methods remain the
upper bound.
Beyond comparative accuracy, these results highlight a
wider methodological shift in static malware research. As
handcrafted features reach maturity, incremental performance
gains often come at the cost of increased preprocessing
complexity and domain dependence. Visual learning
approaches, by contrast, demonstrate the capacity to
generalize across unseen binaries with minimal feature
engineering. Their scalability and model-agnostic input
pipeline make them well suited for future integration into
hybrid detection frameworks, combining the interpretability
of engineered features with the adaptability of deep
representation learning.
G. Broader Implications and Limitations
Although extensive, this comparison remains limited to
static byte-level analysis and does not incorporate dynamic or
hybrid behavioral features. Training and evaluation were
conducted on heterogeneous hardware, and while
reproducibility was confirmed, computational cost was not
measured formally. Furthermore, dataset balance was
maintained artificially, whereas real-world malware corpora
are often heavily skewed.
These constraints suggest several research extensions:
combining static and dynamic modalities; benchmarking
under naturally imbalanced distributions; and assessing
architecture efficiency across larger datasets or through
transfer learning.
The dataset contained 201,549 executables, comprising
86,812 benign files and 114,737 malware samples. Although
this distribution is only moderately skewed towards malicious
files, real-world environments are typically dominated by
benign software. Future evaluations should therefore assess
performance and calibration under more strongly imbalanced
conditions, or more benign-dominated conditions.
Nevertheless, the findings reinforce the viability of CNN-
based static detectors as interpretable, automated
complements to feature-based models. With entropy-derived
visual inputs, CNNs achieved competitive performance within
3-4% ROC-AUC of state-of-the-art ensembles while
eliminating manual feature design, indicating strong potential
for scalable, real-time malware triage systems.
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026 52
DOI:
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
https://doi.org/10.33333/lajc.vol13n1.04
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January - June 2026
VI. CONCLUSION
This study compared feature-based machine-learning
models and convolutional neural networks (CNNs) for static
malware detection using a dataset of more than 200,000
Portable Executable files [1]. The experimental design
provided a unified benchmark in which both traditional
classifiers and visual deep-learning models were trained and
evaluated under identical conditions.
The results demonstrated that tree-based ensembles,
particularly CatBoost [8], remain the most accurate static
detectors when high-quality handcrafted features are
available, achieving 0.990 ± 0.001 ROC-AUC and 0.947 ±
0.005 F1. However, CNNs trained directly on byte-derived
image representations achieved competitive performance
without manual feature engineering. Among the visual
models, entropy-based inputs consistently outperformed
greyscale and combined modalities, and deeper networks such
as ResNet-18 [5] and EfficientNet-B0 [6] significantly
exceeded the accuracy of the shallow SimpleCNN. These
findings confirm that entropy visualization provides strong
discriminative cues and that model depth enhances
representational capacity in image-based malware analysis
[3],[4].
The comparative analysis revealed a narrowing gap of
approximately 0.036 ROC-AUC between the best feature-
based and CNN models, suggesting that vision-driven
approaches can serve as scalable, automated alternatives when
handcrafted features are unavailable or costly to compute.
Future work should explore hybrid staticdynamic
pipelines that combine image-based deep learning with
lightweight engineered features to further improve robustness
against obfuscation and dataset drift [14]. Benchmarking
computational efficiency across uniform hardware and
assessing real-world, imbalanced data distributions would
also strengthen the practical applicability of visual malware
detection methods.
The limited improvement from the combined greyscale
entropy representation further highlights the challenge of
naive multimodal fusion. Although the two inputs differ
visually, their spatially correlated content may reduce
discriminative gradients when processed jointly; suggesting
that future architectures should employ attention-based fusion
or late-stage embedding integration to better exploit
complementary features without introducing redundancy.
FUNDING STATEMENT
This research received no external funding. All
computational work and analysis were conducted using
personal and institutional resources without third-party
support.
ACKNOWLEDGMENT
The author thanks Dr. Shahrzad Zargari of Sheffield
Hallam University for supervision and constructive feedback
throughout the research project.
AUTHOR CONTRIBUTIONS
Harry Darton: Conceptualization, Methodology, Data
Curation, Formal Analysis, Writing Original Draft, Writing
Review & Editing.
REFERENCES
[1] M. Lester, “PE malware machine learning dataset [Data set],” Practical
Security Analytics, 2021. [Online]. Available:
https://practicalsecurityanalytics.com/pe-malware-machine-learning-
dataset/
[2] H. S. Anderson and P. Roth, “EMBER: An open dataset for training
static PE malware machine learning models,” arXiv preprint, 2018.
[Online]. Available: https://arxiv.org/abs/1804.04637
[3] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, “Malware
images: Visualization and automatic classification,” in Proc. 8th Int.
Symp. Visualization for Cyber Security (VizSec 2011), pp. 17, ACM,
2011. doi: 10.1145/2016904.2016908
[4] K. S. Han, J. H. Lim, B. Kang, and E. G. Im, “Malware analysis using
visualized images and entropy graphs,” Int. J. Inf. Security, vol. 14, no.
1, p. 1, 2014. doi: 10.1007/s10207-014-0242-0
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proc. IEEE Conf. Computer Vision and Pattern
Recognition (CVPR 2016), pp. 770778, 2016. doi:
10.1109/CVPR.2016.90
[6] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for
convolutional neural networks,” in Proc. 36th Int. Conf. Machine
Learning (ICML 2019), vol. 97, pp. 61056114, PMLR, 2019.
[Online]. Available: https://proceedings.mlr.press/v97/tan19a.html
[7] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,”
in Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and
Data Mining (KDD ’16), pp. 785794, ACM, 2016. doi:
10.1145/2939672.2939785
[8] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A.
Gulin, “CatBoost: Unbiased boosting with categorical features,” in
Proc. 32nd Int. Conf. Neural Information Processing Systems (NeurIPS
2018), pp. 66396649, 2018. [Online]. Available:
https://proceedings.neurips.cc/paper_files/paper/2018/hash/14491b75
6b3a51daac41c24863285549-Abstract.html
[9] A. Bensaoud, N. Abudawaood, and J. Kalita, “Classifying malware
images with convolutional neural network models,” arXiv preprint,
2020. [Online]. Available: https://arxiv.org/abs/2010.16108
[10] AV-TEST Institute, “Malware statistics & trends report,” 2024.
[Online]. Available: https://www.av-test.org/en/statistics/malware/
[11] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.
521, no. 7553, pp. 436444, 2015. doi: 10.1038/nature14539
[12] M. Kalash et al., “Malware classification with deep convolutional
neural networks,” in Proc. 10th Int. Conf. New Technologies, Mobility
and Security (NTMS), pp. 15, IEEE, 2018. doi:
10.1109/NTMS.2018.8328749
[13] C. E. Shannon, “A mathematical theory of communication,” Bell Syst.
Tech. J., vol. 27, no. 3, pp. 379423, 1948. doi: 10.1002/j.1538-
7305.1948.tb01338.x
[14] M. Brosolo and M. Conti, “The road less travelled: Investigating
robustness and explainability in CNN malware detection,” arXiv
preprint, 2025. doi: 10.48550/arXiv.2503.01391
[15] B. Al-Masri, N. Bakir, A. El-Zaart, and K. Samrouth, “Dual
convolutional malware network (DCMN): An image-based malware
classification using dual convolutional neural networks,” Electronics,
vol. 13, no. 18, p. 3607, 2024. doi: 10.3390/electronics13183607
[16] J. Saxe and K. Berlin, “Deep neural network based malware detection
using two-dimensional binary program features,” arXiv preprint
arXiv:1508.03096, 2015.
[17] E. Raff et al., “Malware detection by eating a whole EXE,” arXiv
preprint arXiv:1710.09435, 2017.
Note on the Use of Artificial Intelligence (AI):
AI tools were used only for minor language editing and
reference formatting. All methodological design, analysis,
data interpretation, and writing decisions were performed by
the author.
ISSN:1390-9266 e-ISSN:1390-9134 LAJC 2026
53
LATIN-AMERICAN JOURNAL OF COMPUTING (LAJC), Vol XIII, Issue 1, January 2026
AUTHORS
Harry Darton graduated from Sheeld Hallam University with
a first class honours degree in Cyber Security and Digital
Forensics. His academic interests centre on digital forensics,
incident response, and the application of machine learning
techniques to security problems. His final year research focused
on static malware detection using image-based convolutional
neural networks, where he developed practical skills in Python
programming, data engineering, and large-scale model evaluation.
Outside his academic work, he has a long-standing involvement in
gaming and competitive esports. He is a former top-100 Overwatch
player and competed in the Contenders series as a semi-professional,
gaining experience in strategy, teamwork, and operating under
pressure. He also enjoys a range of outdoor activities including rock
climbing, hiking, and golf, and maintains a strong interest in emerging
technologies and personal technical projects. He is now developing
his professional portfolio as he prepares for a career in cyber security,
with particular interest in threat analysis, digital investigations, and
the role of artificial intelligence in defensive security.
Harry Darton
H. Darton,
“Malware Detection with CNNs on Entropy and Greyscale Images”,
Latin-American Journal of Computing (LAJC), vol. 13, no. 1, 2026.