Zeek-Pcap-Features-Extractor
A Zeek package that extends the functionality of the Zeek network analysis framework. The package automatically recognizes the connections in a pcap file and extracts features from them. The goal of the feature extraction is to describe each individual connection in the pcap file as accurately as possible.
Installation
To refresh the Zeek Package Manager's package metadata before adding the package, run the following command:
$ zkg refresh
To install the package using Zeek Package Manager, run the following command:
$ zkg install Zeek-Pcap-Features-Extractor
Run
To extract the features from a pcap file that contains one or more connections, run the following command in a terminal:
$ zeek Zeek-Pcap-Features-Extractor -r file.pcap
or, optionally, to make Zeek ignore invalid checksums:
$ zeek Zeek-Pcap-Features-Extractor -r file.pcap ignore_checksums=T
The output is stored in several log files in Zeek log format. The fullLog.log file contains all the extracted features and looks like this:
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path fullLog
#open 2023-12-07-11-05-39
#fields ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents sload dload smeansz dmeansz trans_depth reb_bdy_len start_time last_time is_sm_ips_ports is_ftp_login user_ftp pwd_ftp ct_ftp_cmd pkts_dropped_ sjit djit sinpkt dinpkt
#types string string addr port addr port enum string string count count string bool bool count string count count count count set[string] double double double double count count string string string count string string count count interval interval interval interval
2023-01-23 21:52:10 Ctcnkh1AHaKnCbfish 185.175.0.3 59244 185.175.0.5 502 tcp modbus 0m0s 12 11 SF - - 0 ShADadFf 6 332 4 227 - 125476.218136 115019.866625 2.0 2.0 0 0 2023-01-23 21:52:10 2023-01-23 21:52:10 0 - - - - 0 0.025494 0.000000 0.124872 0.157102
2023-01-23 21:52:10 CwOqfa3Ov8oO8eJm6l 185.175.0.3 59246 185.175.0.5 502 tcp modbus 0m0s 12 11 SF - - 0 ShADadFf 6 332 4 227 - 74017.129412 67849.035294 2.0 2.0 0 0 2023-01-23 21:52:10 2023-01-23 21:52:10 0 - - - - 0 -0.001232 0.000000 0.001895 0.000500
#close 2023-12-07-11-06-17
The flowFeatures.log file lists the endpoints and protocol of each connection:
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path flowFeatures
#open 2023-12-07-16-04-05
#fields id.orig_h id.orig_p id.resp_h id.resp_p proto
#types addr port addr port enum
185.175.0.3 58040 185.175.0.5 502 tcp
185.175.0.3 58042 185.175.0.5 502 tcp
#close 2023-12-07-16-05-17
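Because these logs use the tab-separated Zeek layout with `#`-prefixed header lines, they can also be read without extra dependencies. The following is a minimal sketch and not part of the package: the function name `parse_zeek_log` is hypothetical, the sample text reuses the flowFeatures.log rows shown above, and the code assumes the default tab separator declared by `#separator \x09`.

```python
def parse_zeek_log(text):
    """Parse Zeek TSV log text into a list of dicts keyed by field name."""
    fields = []
    rows = []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith('#'):
            # Header lines look like "#fields<TAB>name1<TAB>name2..."
            if line.startswith('#fields'):
                fields = line.split('\t')[1:]
            continue
        # Data line: pair each tab-separated value with its field name
        rows.append(dict(zip(fields, line.split('\t'))))
    return rows


# Sample in the same layout as flowFeatures.log (tabs written explicitly)
sample = (
    "#separator \\x09\n"
    "#path\tflowFeatures\n"
    "#fields\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\n"
    "#types\taddr\tport\taddr\tport\tenum\n"
    "185.175.0.3\t58040\t185.175.0.5\t502\ttcp\n"
    "185.175.0.3\t58042\t185.175.0.5\t502\ttcp\n"
    "#close\t2023-12-07-16-05-17\n"
)

rows = parse_zeek_log(sample)
print(rows[0]['id.resp_h'], rows[0]['id.resp_p'])
```

All values are returned as strings; converting them according to the `#types` line is left out of this sketch.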
The infoPackets.log file looks like this:
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path infoPackets
#open 2023-12-07-16-04-05
#fields stcpb dtcpb ttl win
#types count count count count
- - 64 509
- 1 64 502
#close 2023-12-07-16-05-12
The infoTCPConn.log file looks like this:
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path infoTCPConn
#open 2023-12-07-16-04-05
#fields synack ackdat tcprtt m_int_s m_int_d
#types interval interval interval interval interval
0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000026 0.000026 0.000004 0.000000
0.000000 0.000019 0.000019 0.000003 0.000000
#close 2023-12-07-16-05-12
The terminal output looks like this:
-----------Feature 43-----------
There are 75 connections that have the same IP destination '185.175.0.8'
There are 3 connections that have the same IP destination '185.175.0.5'
There are 20 connections that have the same IP destination '185.175.0.4'
There is 1 connection that has the same IP destination '185.175.0.6'
----------------------------------------
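A comparable per-destination summary can be reproduced in Python from a list of destination addresses. This is an illustration only, not the package's own code: the function name `count_by_destination` and the input list are hypothetical, and the wording of the output lines simply mirrors the terminal output above.

```python
from collections import Counter


def count_by_destination(dest_ips):
    """Return one summary line per destination IP, most frequent first."""
    counts = Counter(dest_ips)
    lines = []
    for ip, n in counts.most_common():
        if n == 1:
            lines.append(
                f"There is 1 connection that has the same IP destination '{ip}'"
            )
        else:
            lines.append(
                f"There are {n} connections that have the same IP destination '{ip}'"
            )
    return lines


# Hypothetical destination column, e.g. the id.resp_h values of a log
dest_ips = ['185.175.0.5'] * 3 + ['185.175.0.6']
summary = count_by_destination(dest_ips)
for line in summary:
    print(line)
```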
Description
Once the network traffic had been captured, we extracted the features from it. We analyzed the pcap files that contain the network traffic of the ModBus dataset, then calculated and printed 63 features: 49 for each connection recorded in the pcap file and 14 that describe all the connections taken as a whole. To do this, we used the events and functions that Zeek provides, together with custom functions of our own, either to simplify the code or to calculate values that are not directly available in the Zeek environment.
In addition, to make the most of the log files that Zeek produces automatically, we used machine learning algorithms to study and plot the occurrences and characteristics of the various attacks stored in the pcap files of the ModBus dataset, in the section dedicated to the attacks, called "attack".
Extracted Features
All the features to extract are highlighted in the Excel file "featureCyber", which is provided in the previous part together with the code.
Extension (Python and Machine Learning)
First, parse a Zeek log into a pandas DataFrame in Python using ZAT:
import zat
from zat.log_to_dataframe import LogToDataFrame
# Create a Pandas dataframe from the Zeek HTTP log for example
log_to_df = LogToDataFrame()
zeek_df = log_to_df.create_dataframe('http.log')
print('Read in {:d} Rows...'.format(len(zeek_df)))
zeek_df.head()
The following code example shows how to fit the Isolation Forest model that is used to detect anomalous instances:
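from sklearn.ensemble import IsolationForest
# zeek_matrix: numeric feature matrix built from zeek_df (assumed to exist)
odd_clf = IsolationForest(contamination=0.35)
odd_clf.fit(zeek_matrix)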
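Once fitted, the model labels rows through `predict`, which returns 1 for inliers and -1 for anomalies. The sketch below shows that step on synthetic data; the `matrix` array, the contamination value and the fixed `random_state` are assumptions made for illustration and do not come from the ModBus dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for zeek_matrix: 100 clustered points plus one far outlier
rng = np.random.RandomState(0)
matrix = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[50.0, 50.0]]])

clf = IsolationForest(contamination=0.05, random_state=0)
clf.fit(matrix)

# predict() returns 1 for inliers and -1 for anomalies
labels = clf.predict(matrix)
anomalies = matrix[labels == -1]
print('Flagged {:d} of {:d} rows as anomalous'.format(len(anomalies), len(matrix)))
```

The `contamination` parameter is the expected proportion of anomalies, so a value of 0.35 asks the model to flag roughly 35% of the rows.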
K-Means is a partitioning cluster analysis algorithm that divides a set of objects into k groups based on their attributes. It can be viewed as a variant of the Expectation-Maximization (EM) algorithm that aims to determine the k groups of data generated by Gaussian distributions, with each object assigned to exactly one group.
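To make the idea concrete, here is a minimal pure-Python sketch of the standard K-Means iteration (Lloyd's algorithm) on 2-D points. It is not the scikit-learn implementation, which uses a more careful initialization; the function name `kmeans`, the fixed seed and the toy points are hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


# Hypothetical toy data: two well-separated groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping matters.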
PCA, or Principal Component Analysis, is a dimensionality reduction technique designed to decrease the relatively high number of variables describing a dataset to a smaller set of latent variables, while minimizing information loss as much as possible.
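As a rough illustration of what PCA computes, the sketch below finds the first principal component of a small dataset by power iteration on the covariance matrix, then projects the centered rows onto it. It is a simplified stand-in for a library implementation: the function name `pca_first_component` and the toy rows are hypothetical, and the code assumes the data has non-zero variance and that the starting vector is not orthogonal to the dominant direction.

```python
import math


def pca_first_component(rows, iterations=100):
    """Return (unit direction of greatest variance, 1-D scores of each row)."""
    n, d = len(rows), len(rows[0])
    # Center the data: PCA works on deviations from the per-column mean
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Project each centered row onto the component to get one latent variable
    scores = [sum(c * v for c, v in zip(r, vec)) for r in centered]
    return vec, scores


# Hypothetical toy data lying exactly on the line y = 2x
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, scores = pca_first_component(rows)
print(component)
```

Here two correlated variables are reduced to a single score per row without losing information, because all points lie on one line.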
These two concepts are used in the following example:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# K-Means and PCA
# odd_matrix: numeric feature matrix built from odd_df (assumed to exist)
kmeans = KMeans(n_clusters=3).fit_predict(odd_matrix) # Try other values, e.g. 4 or 5
pca = PCA(n_components=3).fit_transform(odd_matrix)
# Now we can put our ML results back onto our dataframe
odd_df['x'] = pca[:, 0] # PCA X Column
odd_df['y'] = pca[:, 1] # PCA Y Column
odd_df['cluster'] = kmeans
odd_df.head()
Finally, print the observations in each cluster found:
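import pandas as pd
# Group the rows by their K-Means cluster label (assumed definition of cluster_groups)
cluster_groups = odd_df.groupby('cluster')
# features: list of the feature column names to display (assumed to exist)
# Now print out the details for each cluster
pd.set_option('display.width', 1000)
for key, group in cluster_groups:
    print('\nCluster {:d}: {:d} observations'.format(key, len(group)))
    print(group[features].head())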