Zeek-Pcap-Features-Extractor

A Zeek package that extends the functionality of the Zeek network analysis framework. It automatically recognizes the connections in a pcap file and extracts features from each of them. The goal of the feature extraction is to describe each individual connection in the pcap file as accurately as possible.

Installation

To refresh the Zeek Package Manager's package metadata before installing, run the following command:

$ zkg refresh

To install the package using Zeek Package Manager, run the following command:

$ zkg install Zeek-Pcap-Features-Extractor

Run

To extract the features from a pcap file that contains one or more connections, run the following command in a terminal:

$ zeek Zeek-Pcap-Features-Extractor -r file.pcap

or, optionally, with checksum validation disabled:

$ zeek Zeek-Pcap-Features-Extractor -r file.pcap ignore_checksums=T

The output is stored in several log files in Zeek log format. The fullLog.log file contains all the extracted features and looks like this:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   fullLog
#open   2023-12-07-11-05-39
#fields ts  uid id.orig_h   id.orig_p   id.resp_h   id.resp_p   proto   service duration    orig_bytes  resp_bytes  conn_state  local_orig  local_resp  missed_bytes    history orig_pkts   orig_ip_bytes   resp_pkts   resp_ip_bytes   tunnel_parents  sload   dload   smeansz dmeansz trans_depth reb_bdy_len start_time  last_time   is_sm_ips_ports is_ftp_login    user_ftp    pwd_ftp ct_ftp_cmd  pkts_dropped_   sjit    djit    sinpkt  dinpkt
#types  string  string  addr    port    addr    port    enum    string  string  count   count   string  bool    bool    count   string  count   count   count   count   set[string] double  double  double  double  count   count   string  string  string  count   string  string  count   count   interval    interval    interval    interval
2023-01-23 21:52:10 Ctcnkh1AHaKnCbfish  185.175.0.3 59244   185.175.0.5 502 tcp modbus  0m0s    12  11  SF  -   -   0   ShADadFf    6   332 4   227 -   125476.218136   115019.866625   2.0 2.0 0   0   2023-01-23 21:52:10 2023-01-23 21:52:10 0   -   -   -   -   0   0.025494    0.000000    0.124872    0.157102
2023-01-23 21:52:10 CwOqfa3Ov8oO8eJm6l  185.175.0.3 59246   185.175.0.5 502 tcp modbus  0m0s    12  11  SF  -   -   0   ShADadFf    6   332 4   227 -   74017.129412    67849.035294    2.0 2.0 0   0   2023-01-23 21:52:10 2023-01-23 21:52:10 0   -   -   -   -   0   -0.001232   0.000000    0.001895    0.000500
#close  2023-12-07-11-06-17

The flowFeatures.log file looks like this:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   flowFeatures
#open   2023-12-07-16-04-05
#fields id.orig_h   id.orig_p   id.resp_h   id.resp_p   proto
#types  addr    port    addr    port    enum
185.175.0.3 58040   185.175.0.5 502 tcp
185.175.0.3 58042   185.175.0.5 502 tcp
#close  2023-12-07-16-05-17

The infoPackets.log file looks like this:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   infoPackets
#open   2023-12-07-16-04-05
#fields stcpb   dtcpb   ttl win
#types  count   count   count   count
-   -   64  509
-   1   64  502
#close  2023-12-07-16-05-12

The infoTCPConn.log file looks like this:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   infoTCPConn
#open   2023-12-07-16-04-05
#fields synack  ackdat  tcprtt  m_int_s m_int_d
#types  interval    interval    interval    interval    interval
0.000000    0.000000    0.000000    0.000000    0.000000
0.000000    0.000026    0.000026    0.000004    0.000000
0.000000    0.000019    0.000019    0.000003    0.000000
#close  2023-12-07-16-05-12
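Because every log file starts with `#`-prefixed metadata lines, the files can also be read without extra dependencies. The following is a minimal sketch and not part of the package: `parse_zeek_log` is a hypothetical helper that assumes the tab separator and the `-` unset marker declared in the headers above, and returns one dictionary per data row, keyed by the names on the `#fields` line. The sample input is modelled on flowFeatures.log.

```python
def parse_zeek_log(lines, unset="-"):
    """Parse Zeek TSV log lines into a list of dicts keyed by the #fields names.

    Hypothetical helper: assumes a tab separator and '-' for unset values.
    """
    fields = []
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#fields<TAB>name1<TAB>name2..."
            parts = line.split("\t")
            if parts[0] == "#fields":
                fields = parts[1:]
            continue
        values = line.split("\t")
        rows.append({name: (None if value == unset else value)
                     for name, value in zip(fields, values)})
    return rows


# Small in-memory sample modelled on flowFeatures.log (tab-separated).
sample = [
    "#separator \\x09",
    "#fields\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto",
    "#types\taddr\tport\taddr\tport\tenum",
    "185.175.0.3\t58040\t185.175.0.5\t502\ttcp",
    "185.175.0.3\t58042\t185.175.0.5\t502\ttcp",
    "#close\t2023-12-07-16-05-17",
]
for row in parse_zeek_log(sample):
    print(row["id.orig_h"], row["id.orig_p"], "->", row["id.resp_h"], row["id.resp_p"])
```

All values are returned as strings; converting them according to the `#types` line is left out to keep the sketch short.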

The terminal output looks like this:

-----------Feature 43-----------
There are 75 connections that have the same IP destination '185.175.0.8'
There are 3 connections that have the same IP destination '185.175.0.5'
There are 20 connections that have the same IP destination '185.175.0.4'
There is 1 connection that has the same IP destination '185.175.0.6'
----------------------------------------
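The grouping behind this summary can be reproduced outside Zeek. The following is a minimal sketch, assuming the connections are available as (source, destination) address pairs; `count_by_destination`, `format_summary`, and the sample pairs are illustrative and not taken from the package.

```python
from collections import Counter


def count_by_destination(connections):
    """Count how many connections share each destination IP address."""
    return Counter(dst for _src, dst in connections)


def format_summary(counts):
    """Render the counts in the same wording as the terminal output."""
    lines = []
    for dst, n in counts.items():
        if n == 1:
            lines.append(f"There is 1 connection that has the same IP destination '{dst}'")
        else:
            lines.append(f"There are {n} connections that have the same IP destination '{dst}'")
    return lines


# Illustrative (source, destination) pairs, not real capture data.
connections = [
    ("185.175.0.3", "185.175.0.5"),
    ("185.175.0.3", "185.175.0.5"),
    ("185.175.0.3", "185.175.0.6"),
]
for line in format_summary(count_by_destination(connections)):
    print(line)
```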

Description

Once the network traffic had been captured, the features were extracted from it. We analyzed the pcap files that make up the network traffic of the ModBus dataset and calculated and printed 63 features: 49 for each connection recorded in the pcap file and 14 that describe all the connections taken as a whole. To do this, we used the events and functions that Zeek provides, together with custom functions of our own, either to simplify the code or to calculate values that Zeek does not expose directly.

In addition, to make the most of the log files that Zeek produces automatically, we used machine learning algorithms to study and plot the occurrences and characteristics of the various attacks stored in the pcap files of the ModBus dataset. This work is in the attack-related section, called "attack".

Extracted Features

All the features to extract are highlighted in the Excel file "featureCyber", which was provided in the previous part together with the code.
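As an illustration of one derived per-connection feature, the sketch below computes a load value from a byte count and a duration. The formula (bytes × 8 / duration, that is, bits per second) is an assumption borrowed from common flow-feature definitions and is not taken from the package's code; `load_bits_per_second` is a hypothetical name. The sample fullLog rows are at least consistent with a load that is proportional to the byte count, since dload / sload matches resp_bytes / orig_bytes (11/12) in both rows.

```python
def load_bits_per_second(byte_count, duration_seconds):
    """Return bits per second, or 0.0 for zero-length connections.

    Assumed definition; the package may compute sload/dload differently.
    """
    if duration_seconds <= 0:
        return 0.0
    return byte_count * 8 / duration_seconds


# Values from the first fullLog sample row: orig_bytes=12, resp_bytes=11.
# The exact duration is not shown in the log (it is rendered as "0m0s"),
# so an arbitrary placeholder is used; only the ratio is meaningful here.
duration = 0.001  # placeholder, not a measured value
sload = load_bits_per_second(12, duration)
dload = load_bits_per_second(11, duration)
print(dload / sload)                   # ratio implied by the assumed formula
print(115019.866625 / 125476.218136)   # ratio of the logged dload and sload
```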

Extension (Python and Machine Learning)

First, parse a Zeek log into a pandas DataFrame in Python using ZAT (Zeek Analysis Tools):

import zat
from zat.log_to_dataframe import LogToDataFrame

# Create a Pandas dataframe from the Zeek HTTP log for example
log_to_df = LogToDataFrame()
zeek_df = log_to_df.create_dataframe('http.log')
print('Read in {:d} Rows...'.format(len(zeek_df)))
zeek_df.head()

The following code example shows how to train (fit) an Isolation Forest model and use it to predict anomalous instances:

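from sklearn.ensemble import IsolationForest
odd_clf = IsolationForest(contamination=0.35)  # expect roughly 35% of the rows to be anomalous
odd_clf.fit(zeek_matrix)  # zeek_matrix: assumed to be a numeric feature matrix derived from zeek_df
predictions = odd_clf.predict(zeek_matrix)  # -1 = anomaly, 1 = normal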
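The `contamination` parameter states the expected fraction of anomalies, which in turn determines the score threshold below which a row is flagged. The sketch below illustrates that idea in plain Python on precomputed scores. It is not scikit-learn's implementation, and `flag_anomalies` and the scores are made up for illustration (a lower score means more anomalous, matching scikit-learn's convention).

```python
def flag_anomalies(scores, contamination):
    """Label the lowest-scoring fraction of rows as anomalies.

    Returns a list of -1 (anomaly) / 1 (normal) labels, in input order.
    """
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    n_flagged = int(len(scores) * contamination)
    # Indices of the rows sorted from most to least anomalous.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    flagged = set(order[:n_flagged])
    return [-1 if i in flagged else 1 for i in range(len(scores))]


# Made-up anomaly scores for ten rows.
scores = [-0.70, -0.10, -0.05, -0.62, -0.08, -0.12, -0.55, -0.09, -0.11, -0.07]
print(flag_anomalies(scores, contamination=0.35))
```

With ten rows and a contamination of 0.35, the three lowest-scoring rows are flagged. A high value such as 0.35 therefore marks a large share of the data as anomalous regardless of how unusual those rows actually are.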

K-Means is a partitioning cluster analysis algorithm that divides a set of objects into k groups based on their attributes, assigning each object to the group whose centroid is nearest. It can be seen as a variant of the Expectation-Maximization (EM) algorithm whose aim is to determine the k groups of data generated by Gaussian distributions.
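To make the procedure concrete, the following is a minimal pure-Python sketch of the standard K-Means iteration (Lloyd's algorithm) on 2-D points. It is for illustration only: the function name, the sample points, and the caller-supplied starting centroids are assumptions, and scikit-learn's `KMeans` adds smarter initialization and other refinements.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids."""
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of illustrative points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```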

PCA, or Principal Component Analysis, is a dimensionality reduction technique designed to decrease the relatively high number of variables describing a dataset to a smaller set of latent variables, while minimizing information loss as much as possible.
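The idea can be sketched in plain Python for the smallest interesting case, reducing two variables to one. The sketch below is an illustration under that restriction, with hypothetical function names and made-up points: it uses the closed-form principal axis of a 2x2 covariance matrix, whereas scikit-learn's `PCA` handles any number of variables.

```python
import math


def first_principal_component(points):
    """Return (mean, unit direction) of the first principal component of 2-D points."""
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    # Closed-form angle of the axis of largest variance for a 2x2 symmetric matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return mean, (math.cos(theta), math.sin(theta))


def project(points, mean, direction):
    """Reduce each 2-D point to a single coordinate along the given direction."""
    return [(p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]
            for p in points]


# Illustrative points lying exactly on the line y = 2x.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
mean, direction = first_principal_component(points)
reduced = project(points, mean, direction)
print(direction)
print(reduced)
```

Because the sample points lie on a single line, the one remaining coordinate preserves all of their variance; with real data, some information is lost in the discarded direction.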

These two concepts are used in the following example:

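# K-Means and PCA
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# odd_matrix is assumed to be the feature matrix of the rows flagged as anomalous (odd_df)
kmeans = KMeans(n_clusters=3).fit_predict(odd_matrix)  # Try other values, e.g. 4 or 5
pca = PCA(n_components=3).fit_transform(odd_matrix)

# Now we can put our ML results back onto our dataframe
odd_df['x'] = pca[:, 0] # PCA X Column
odd_df['y'] = pca[:, 1] # PCA Y Column
odd_df['cluster'] = kmeans
odd_df.head()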

It is then useful to print the observations in each cluster found:

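import pandas as pd

# Group the rows by their K-Means cluster label
cluster_groups = odd_df.groupby('cluster')

# Now print out the details for each cluster
# (features is assumed to be the list of column names used to build the matrix)
pd.set_option('display.width', 1000)
for key, group in cluster_groups:
    print('\nCluster {:d}: {:d} observations'.format(key, len(group)))
    print(group[features].head())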