Case study: Detecting IoT Malware

Case study

Analyzing IoT malware network data to create a machine learning model to detect it in near real-time

tl;dr

The threat of IoT (Internet of Things) malware is increasingly significant as the adoption of IoT devices continues to grow. Endpoint security software can’t be installed on IoT devices and they often have very weak security.

Attackers compromise vulnerable IoT devices to gain a foothold inside an organization, then use that foothold as a difficult-to-trace pivot point to attack their real target.

I built a high-speed, high-performing machine learning model that detects IoT malware from netflow data alone, so it works even when the traffic is encrypted.

It’s like being able to catch malicious mail by only inspecting the outside of the envelopes, with no need to open them.

My final model has an F1 score of 0.985, meaning it has “excellent” performance, and a False Positive rate of 0.1%.

Even on my small laptop, the model can process over 15 million samples per second, which means it takes only a fraction of a second to detect malware, so a response system could act quickly to contain it.

High level methods

Data source

I used a dataset from Avast Security’s lab team that included over 16 million labeled instances of IoT malware.

Primary metric

False Positives reduce confidence in a tool and lead to alert fatigue, while False Negatives mean the tool failed to detect malware. I chose F1 since it is the classification metric best suited to balancing low alert fatigue against few missed detections.
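
To make the trade-off concrete: F1 is the harmonic mean of precision (which falls as False Positives rise) and recall (which falls as False Negatives rise). The sketch below plugs in the precision and recall reported in the metrics table of this write-up. The function name is only illustrative and is not taken from the project code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the final model, as reported in this write-up.
final_f1 = f1_score(precision=0.998, recall=0.973)
print(round(final_f1, 3))  # 0.985
```

Because it is a harmonic mean, F1 is pulled toward the weaker of the two values, so a model cannot score well by excelling at only one of them.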

Problem type

Malware is either present or not, so I approached this work as a binary classification problem.

Data preparation

Splitting up a large dataset

The uncompressed dataset is 3.2 GB, too large to train on all at once with XGBoost. I could have reimplemented everything using a distributed ML framework like PyTorch, but analysis showed the data was homogeneous enough to support “chunking” it into 9 pieces and averaging the results across chunks.
Apply specific data types: I used PyArrow data types to reduce memory usage and increase speed.
Drop features with weak correlation to the label: keeping only the 8 most important features increased both accuracy and speed.

Feature enhancements

I enhanced the dataset by implementing new features based on my InfoSec knowledge, and things I discovered through Exploratory Data Analysis.
For example, I:
Identified that malware tends to transmit outside normal business hours
Assigned countries to each IP address, since some countries are higher risk
Enhanced port numbers by mapping them to services
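
Two of these enhancements are easy to sketch with the standard library. The business-hours window, the field names, and the tiny port table below are assumptions made for illustration, not values from the project. The country lookup is left out because it needs an external GeoIP database.

```python
from datetime import datetime

# Assumed business hours: Monday to Friday, 08:00-18:00 local time.
BUSINESS_START_HOUR = 8
BUSINESS_END_HOUR = 18

# A handful of well-known ports; a real mapping would be much larger.
PORT_TO_SERVICE = {22: "ssh", 23: "telnet", 53: "dns", 80: "http", 443: "https"}


def is_off_hours(ts: datetime) -> bool:
    """True if the timestamp falls on a weekend or outside business hours."""
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    in_hours = BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR
    return is_weekend or not in_hours


def port_service(port: int) -> str:
    """Map a port number to a service name, or 'other' if unknown."""
    return PORT_TO_SERVICE.get(port, "other")


def enrich(flow: dict) -> dict:
    """Return a copy of a flow record with the two derived features added."""
    enriched = dict(flow)
    enriched["off_hours"] = is_off_hours(flow["timestamp"])
    enriched["dst_service"] = port_service(flow["dst_port"])
    return enriched


# A made-up flow record: Saturday 03:15, destination port 23.
example = enrich({"timestamp": datetime(2024, 1, 6, 3, 15), "dst_port": 23})
print(example["off_hours"], example["dst_service"])  # True telnet
```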

Winsorize & standardize

It’s still malware whether it exfiltrates 1 GB or 1 TB, so I clipped and winsorized features so that things like bandwidth correlate better with the label.
Standardizing the data vastly improved the F1 score. Since the F1 score was already almost perfect and the dataset was fairly balanced, there was no need to up/down sample or apply Principal Component Analysis.
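
As a rough sketch of these two steps: winsorizing replaces values beyond chosen percentiles with the percentile values themselves, and standardizing rescales a feature to zero mean and unit standard deviation. The percentile cut-offs here are assumptions, since the thresholds used in the project are not stated.

```python
import statistics


def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Clip values to the given lower and upper percentiles."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo = ordered[round(lower_pct * last)]
    hi = ordered[round(upper_pct * last)]
    return [min(max(v, lo), hi) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up byte counts with one extreme outlier.
byte_counts = [100, 120, 110, 130, 10**12]
clipped = winsorize(byte_counts, lower_pct=0.0, upper_pct=0.75)
print(clipped)  # [100, 120, 110, 130, 130]
scaled = standardize(clipped)
```

Without the clipping step, the single outlier would dominate the mean and standard deviation and squash every other value toward the same standardized score.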

Optimizing the model

The model performed surprisingly well even when it was very simple. Increasing complexity only led to a performance increase of 0.1%. Hyperparameter optimization found the point of diminishing returns early, so the model was in the sweet spot:

High performance - F1 was 0.985, almost perfect
High speed - an individual prediction took only 573 μs, and bulk processing took only 65 ns per sample.
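
These two timings translate into throughput with simple arithmetic. The snippet below uses only the figures reported above; the variable names are illustrative.

```python
NS_PER_SAMPLE_BULK = 65          # reported bulk cost per sample, in nanoseconds
US_PER_SINGLE_PREDICTION = 573   # reported latency of one prediction, in microseconds

# Convert each per-item time into items per second.
bulk_samples_per_second = 1e9 / NS_PER_SAMPLE_BULK
single_predictions_per_second = 1e6 / US_PER_SINGLE_PREDICTION

print(f"{bulk_samples_per_second:,.0f} samples/s in bulk")
print(f"{single_predictions_per_second:,.0f} predictions/s one at a time")
```

The bulk figure works out to a little over 15 million samples per second, which matches the throughput quoted in the tl;dr. The large gap between the two rates suggests that per-call overhead, not the model itself, dominates the cost of a single prediction.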

Additional metrics about the final model:

Metric               Value
Accuracy             0.981
Recall               0.973
Precision            0.998
ROC AUC              0.984
Brier Skill Score    0.970
Average Precision    0.988
Balanced Accuracy    0.984
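
For readers less familiar with these metrics, several of them come straight from the four cells of a confusion matrix. The sketch below shows the standard definitions. The counts are invented for illustration and are not the project's results.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive common binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of malware that was detected
    specificity = tn / (tn + fp)   # share of benign traffic left alone
    precision = tp / (tp + fp)     # share of alerts that were real malware
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```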

Takeaways

It’s possible and practical to detect the overwhelming majority (95%+) of IoT malware using just netflow data. It doesn’t matter if the traffic is encrypted, and it doesn’t matter that we have no insight into processes inside the IoT device.
Since the model is very efficient, it could run on low-cost hardware and result in a lower Total Cost of Ownership.
Spending more time on exploratory data analysis and feature engineering drastically reduced the time I spent tuning the model.

See the project code on GitHub

Tim Honker