% masterthesis/content.tex
% Valentin Brandl 7f285cd0ab Content
% 2022-04-25 21:46:41 +02:00

%{{{ introduction
\section{Introduction}
The Internet has become an irreplaceable part of our day-to-day lives.
We are always connected via numerous \enquote{smart} and \ac{iot} devices.
We use the Internet to communicate, shop, handle financial transactions, and much more.
Many personal and professional workflows are so dependent on the Internet that they stop working while offline, and with the pandemic we are living through, this dependency grew even stronger.
In 2021, there were around 10 billion Internet-connected \ac{iot} devices, and this number is estimated to more than double to 25 billion by 2030~\cite{bib:statista_iot_2020}.
Many of these devices run on outdated software, don't receive regular updates, and don't follow general security best practices.
While in 2016 only \SI{77}{\percent} of German households had a broadband connection with a bandwidth of \SI{50}{\mega\bit\per\second} or more, by 2020 already \SI{95}{\percent} had more than \SI{50}{\mega\bit\per\second} and \SI{59}{\percent} even had at least \SI{1000}{\mega\bit\per\second} available~\cite{bib:statista_broadband_2021}.
Their nature as small, always online devices---often without any direct user interaction---behind Internet connections that are getting faster and faster makes them a desirable target for \emph{botnet} operators.
A \emph{botnet} is a network of malware-infected computers, called \emph{bots}, controlled by a \emph{botmaster}.
Botnets are controlled via a \emph{\ac{c2} channel}.
The communication patterns of a \ac{c2} channel can be \emph{centralized}, \emph{decentralized}, or \emph{distributed}.
Centralized and decentralized botnets use one or more coordinating hosts that bots contact to receive new commands.
Distributed botnets create a \emph{\ac{p2p}} network as their communication layer.
The \ac{c2} channel for centralized and decentralized botnets can use anything from \ac{irc} over HTTP to Twitter~\cite{bib:pantic_covert_2015}.
In recent years, \ac{iot} botnets have been responsible for some of the biggest \ac{ddos} attacks ever recorded---creating up to \SI{1}{\tera\bit\per\second} of traffic~\cite{bib:ars_ddos_2016}.
Other malicious uses of bots include \ac{ddos} attacks, banking fraud, acting as proxies to hide the attacker's identity, and sending spam emails, just to name a few.
The constantly growing damage produced by botnets has many researchers and law enforcement agencies trying to shut down these operations~\cite{bib:nadji_beheading_2013, bib:nadji_still_2017, bib:dittrich_takeover_2012, bib:fbiTakedown2014}.
A coordinated operation with help from law enforcement, hosting providers, domain registrars, and platform providers could shut down or take over the operation by changing how requests are routed or simply shutting down the controlling servers/accounts.
The monetary value of these botnets directly correlates with the amount of effort botmasters are willing to put into implementing defense mechanisms against take-down attempts.
Botnet operators came up with a number of countermeasures: \acp{dga} use pseudorandomly generated domain names to render simple domain-blacklist-based approaches ineffective~\cite{bib:antonakakis_dga_2012}, and fast-flux DNS entries assign a large, changing pool of IP addresses to the \ac{c2} domains to prevent IP-based blacklisting and to hide the actual \ac{c2} servers~\cite{bib:nazario_as_2008}.
Analyzing and shutting down a centralized or decentralized botnet is comparatively easy since the central means of communication (the \ac{c2} IP addresses or domain names, Twitter handles, or \ac{irc} channels), can be extracted from the malicious binaries or determined by analyzing network traffic and can therefore be considered publicly known.
A number of botnet operations were taken down by shutting down the \ac{c2} channel~\cite{bib:nadji_beheading_2013, bib:msZloader} and as the defenders upped their game, so did attackers---the concept of \ac{p2p} botnets emerged.
The idea is to build a distributed network without \acp{spof} in the form of \ac{c2} servers as shown in \Fref{fig:p2p}.
%{{{ fig:c2vsp2p
\begin{figure}[h]
\centering
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=.9\linewidth]{c2.drawio.pdf}
\caption{Topology of a \ac{c2} controlled botnet}\label{fig:c2}
\end{subfigure}%
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=.9\linewidth]{p2p.drawio.pdf}
\caption{Topology of a \ac{p2p} botnet}\label{fig:p2p}
\end{subfigure}%
\caption{Communication paths in different types of botnets}\label{fig:c2vsp2p}
\end{figure}
%}}}fig:c2vsp2p
This lack of a \ac{spof} makes \ac{p2p} botnets more resilient to take-down attempts since there is no easy way to stop the communication and botmasters can easily rejoin the network and send new commands.
Taking down a \ac{p2p} botnet requires intricate knowledge of the botnet's characteristics, \eg{} size, risk, distribution over IP subnets or geolocations, network topology, participating peers, and protocol characteristics.
Just like for centralized and decentralized botnets, to take down a \ac{p2p} botnet, the \ac{c2} channel needs to be identified and disrupted.
By \emph{monitoring} peer activity of known participants in the botnet, this knowledge can be obtained and used to find attack vectors in the botnet protocol.
In this work, we will show how a collaborative system of crawlers and sensors can make the monitoring and information gathering phase of a \ac{p2p} botnet more efficient, resilient to detection and how collaborative monitoring can help circumvent anti-monitoring techniques.
%}}} introduction
%%{{{ motivation
%\clearpage{}
%\section{Motivation}
%%}}} motivation
\clearpage{}
\section{Background}
In a \ac{p2p} botnet, each node in the network knows a number of its neighbors and connects to those. Each of these neighbors has a list of neighbors of its own, and so on.
The botmaster only needs to join the network to send new commands or receive stolen data; there is no need for a coordinating host that is always connected to the network.
Any of the nodes in \Fref{fig:p2p} could be the botmaster but they don't even have to be online all the time since the peers will stay connected autonomously.
In fact, there have been arrests of \ac{p2p} botnet operators, but due to the autonomy offered by the distributed approach, the botnet stayed intact and continued operating~\cite{bib:netlab_mozi}.
Especially worm-like botnets, where each peer tries to actively find and infect other systems, can keep lingering for many years.
Bots in a \ac{p2p} botnet can be split into two distinct groups according to their reachability: peers that are not publicly reachable (\eg{} because they are behind a \ac{nat} router or firewall) and those that are publicly reachable, also known as \emph{superpeers}.
In contrast to centralized botnets with a fixed set of \ac{c2} servers, in a \ac{p2p} botnet, every superpeer might take the role of a \ac{c2} server and \emph{non-superpeers} will connect to those superpeers when joining the network.
As there is no well-known server in a \ac{p2p} botnet, they have to coordinate autonomously.
This is achieved by connecting the bots among each other.
Bot \textit{B} is considered a \emph{neighbor} of bot \textit{A}, if \textit{A} knows and connects to \textit{B}.
Since bots can become unavailable (\eg{} because the system was shut down or the malware infection was detected and removed), they have to consistently update their neighbor lists to avoid losing their connection into the botnet.
This is achieved by periodically querying their neighbor's neighbors in a process known as \emph{\ac{mm}}.
\Ac{mm} can be distinguished into two categories: \emph{structured} and \emph{unstructured}~\cite{bib:baileyNextGen}.
Structured \ac{p2p} botnets have strict rules for a bot's neighbors based on its unique ID and often use a \ac{dht}, which allows persisting data in a distributed network.
The \ac{dht} can define an ordered ring structure of IDs, where neighborhood in the structure also implies neighborhood in the network, as is the case for the Kademlia protocol~\cite{bib:kademlia2002}.
In \ac{p2p} botnets that employ unstructured \ac{mm}, on the other hand, bots ask any peer they know for new peers to connect to, in a process called \emph{peer discovery}.
To enable peers to join an unstructured \ac{p2p} botnet, the malware binaries include hardcoded lists of superpeers for the newly infected systems to connect to.
The concept of \emph{churn} describes when a bot becomes unavailable.
There are two types of churn:
\begin{itemize}
\item \emph{IP churn}: A bot becomes unreachable because it got assigned a new IP address. The bot is still available but under another IP address.
\item \emph{Device churn}: The device is actually offline, \eg{} because the infection was cleaned, it got shut down or lost its Internet connection.
\end{itemize}
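When bots carry a stable identifier in their \ac{mm} messages, the two churn types can be told apart: a known identifier reappearing under a new address indicates IP churn, while an identifier that never reappears suggests device churn. The following Go sketch illustrates this heuristic; the \mintinline{go}{lastIP} bookkeeping is an assumption for illustration, not a mechanism of any specific monitoring system.
\begin{minted}{go}
// IPChurned reports whether a known bot reappeared under a new address,
// i.e. the old address underwent IP churn. lastIP maps a bot's stable
// identifier to the address it was last seen under (illustrative only).
func IPChurned(lastIP map[string]string, botID, ip string) bool {
	prev, seen := lastIP[botID]
	return seen && prev != ip
}
\end{minted}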
%{{{ formal model
\subsection{Formal Model of \Acs*{p2p} Botnets}
A \ac{p2p} botnet can be modeled as a digraph
\begin{align*}
G &= (V, E)
\end{align*}
With the set of nodes \(V\) describing the peers in the network and the set of edges \(E\) describing the communication flow between bots.
\(G\) is not required to be a connected graph but might consist of multiple disjoint components~\cite{bib:rossow_sok_2013}. Components consisting of peers that are infected by the same malware are considered part of the same graph.
For a bot \(v \in V\), the \emph{predecessors} \(\text{pred}(v)\) and \emph{successors} (neighbors) \(\text{succ}(v)\) are defined as:
\begin{align*}
\text{succ}(v) &= \{ u \in V \mid (v, u) \in E \} \\
\text{pred}(v) &= \{ u \in V \mid (u, v) \in E \}
\end{align*}
The set of nodes \(\text{succ}(v)\) is also called the \emph{peer list} of \(v\).
Those are the nodes a peer will connect to in order to request new commands and other peers.
For a node \(v \in V\), the in-degree \(\deg^{+}(v)\) and out-degree \(\deg^{-}(v)\) describe how many bots know \(v\) and how many bots are known by \(v\), respectively.
\begin{align*}
\deg^{+}(v) &= \abs{\text{pred}(v)} \\
\deg^{-}(v) &= \abs{\text{succ}(v)}
\end{align*}
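The degree definitions translate directly into code. The following Go sketch (the types and names are ours, chosen for illustration) counts \(\deg^{+}\) and \(\deg^{-}\) from a list of directed edges:
\begin{minted}{go}
// Edge (u, v) means bot u knows and contacts bot v.
type Edge struct{ From, To string }

// Degrees computes the in-degree deg+ (size of pred(v)) and the
// out-degree deg- (size of succ(v)) for every node in an edge list.
func Degrees(edges []Edge) (in, out map[string]int) {
	in, out = map[string]int{}, map[string]int{}
	for _, e := range edges {
		out[e.From]++ // e.From gained a successor
		in[e.To]++    // e.To gained a predecessor
	}
	return in, out
}
\end{minted}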
%}}} formal model
%{{{ detection techniques
\subsection{Monitoring Techniques for \Acs*{p2p} Botnets}
There are two distinct methods to map and get an overview of the network topology of a \ac{p2p} botnet:
%{{{ passive detection
\subsubsection{Passive Monitoring}
For passive monitoring, traffic flows are analyzed in large amounts of collected network traffic, often obtained from \acp{isp} or network telescopes~\cite{bib:carnaNetworkTelescope2014}.
This has the advantage that botmasters cannot detect or prevent data collection of this kind, although it is not trivial to distinguish legitimate \ac{p2p} application traffic (\eg{} BitTorrent, Skype, cryptocurrencies, \ldots) from \ac{p2p} bot traffic.
\citeauthor{bib:zhang_building_2014} propose a system of statistical analysis to solve some of these problems in \citetitle{bib:zhang_building_2014}~\cite{bib:zhang_building_2014}.
Also, getting access to the required datasets might not be possible for everyone.
Like most botnet detection mechanisms, the passive ones work by building communication graphs and finding tightly coupled subgraphs that might be indicative of a botnet~\cite{bib:botgrep2010}.
An advantage of passive detection is that it is independent of protocol details, specific binaries, or the structure of the network (\ac{p2p} vs.\ centralized/decentralized)~\cite{bib:botminer2008}.
% \begin{itemize}
% \item Large scale network analysis (hard to differentiate from legitimate \ac{p2p} traffic (\eg{} BitTorrent), hard to get data, knowledge of some known bots required)~\cite{bib:zhang_building_2014}
% \item Heuristics: Same traffic patterns, same malicious behaviour
% \end{itemize}
% \todo{no context}
Passive monitoring is only mentioned for completeness and not further discussed in this thesis.
%}}} passive detection
%{{{ active detection
\subsubsection{Active Monitoring}
For active detection, a subset of the botnet protocol and behavior is reimplemented to participate in the network.
To do so, samples of the malware are reverse engineered to understand and recreate the protocol.
This partial implementation includes the communication part of the botnet but ignores the malicious functionality to not support and take part in illicit activity.
% The difference in behaviour from the reference implementation and conspicuous graph properties (\eg{} high \(\deg^{+}\) vs.\ low \(\deg^{-}\)) of these sensors allows botmasters to detect and block the sensor nodes.
There are two subtypes of active detection: \emph{sensors} wait to be contacted by other peers, while \emph{crawlers} actively query known bots and recursively ask for their neighbors~\cite{bib:karuppayah_sensorbuster_2017}.
Crawlers can only detect superpeers and therefore only see a small subset of the network, while sensors are also contacted by peers in private networks and behind firewalls.
To accurately monitor a \ac{p2p} botnet, a hybrid approach of crawlers and sensors is required.
A crawler starts with a predefined list of known bots, connects to those, and uses the peer exchange mechanism to request other bots.
Each found bot is crawled again, slowly building the graph of superpeers on the way.
Every entry \textit{E} in the peer exchange response received from bot \textit{A} represents an edge from \textit{A} to \textit{E} in the graph.
\citetitle{bib:andriesse_reliable_2015} describes disinformation attacks, in which bots will include invalid entries in their peer list replies~\cite{bib:andriesse_reliable_2015}.
Therefore, edges should only be considered valid if at least one crawler or sensor was able to contact, or was contacted by, peer \textit{E}, thereby confirming that \textit{E} is an existing participant in the botnet.
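This validation step can be sketched as a filter over reported edges. The \mintinline{go}{confirmed} set is assumed to be maintained elsewhere from successful contact attempts; the types and names are illustrative, not \ac{bms} internals.
\begin{minted}{go}
type Edge struct{ From, To string }

// ConfirmedEdges keeps only edges whose target peer was confirmed by at
// least one crawler or sensor (it was contacted, or contacted us),
// discarding potential disinformation entries from peer list replies.
func ConfirmedEdges(edges []Edge, confirmed map[string]bool) []Edge {
	valid := []Edge{}
	for _, e := range edges {
		if confirmed[e.To] {
			valid = append(valid, e)
		}
	}
	return valid
}
\end{minted}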
A sensor implements the passive part of the botnet's \ac{mm}.
It is announced into the network by crawlers or other sensors and waits for other peers to contact it.
Sensors cannot be used to create the botnet graph (they only produce edges into the sensor node) or to find new peers, but they are required to enumerate the whole network, including non-superpeers.
%}}} active detection
%{{{ monitoring prevention
\subsubsection{Anti-Monitoring Techniques}
\citeauthor{bib:andriesse_reliable_2015} explore some monitoring countermeasures in \citetitle{bib:andriesse_reliable_2015}.
These include deterrence, which limits the number of bots per IP address or subnet; blacklisting, where known crawlers and sensors are blocked from communicating with other bots in the network (mostly IP-based); disinformation, where fake bots are placed in the peer lists to invalidate the data collected by crawlers; and active retaliation like \ac{ddos} attacks against sensors or crawlers~\cite{bib:andriesse_reliable_2015}.
%}}} monitoring prevention
%}}} detection techniques
\subsection{Related Work}
In \citetitle{bib:karuppayah_boobytrap_2016}~\cite{bib:karuppayah_boobytrap_2016}, the authors evaluate criteria to detect monitoring attempts in a \ac{p2p} botnet:
\begin{description}
\item[Defiance] Peers that don't abide by the \ac{mm} protocol rules are most likely crawlers or sensors, \eg{} peers that query other peers that shouldn't be in their neighborhood according to geolocation or IP subnet rules.
\item[Abuse] A higher than usual \ac{mm} frequency is an indicator for a sensor or crawler
\item[Avoidance] Peers that avoid aiding the botnet, \eg{} by returning empty replies to \ac{mm} requests, are potential monitoring nodes
\end{description}
\citetitle{bib:karuppayah_sensorbuster_2017} explores graph ranking algorithms to detect monitoring activity in a \ac{p2p} botnet.
They depend on suspicious graph properties to enumerate candidate peers~\cite{bib:karuppayah_sensorbuster_2017}.
\begin{description}
\item[PageRank] The algorithm used by Google to rank their search results uses the ratio of \(\deg^{+}\) and \(\deg^{-}\) to detect sensors since they have many incoming but few outgoing edges~\cite{bib:page_pagerank_1998}
\item[SensorRank] A variation of PageRank that normalizes the result to better account for churn and for valid peers with few high-ranked predecessors but only a few successors
\item[SensorBuster] Evaluates \acp{wcc} in the graph since sensors have only incoming but no outgoing edges, thereby creating a disconnected graph component
\end{description}
\Ac{bms}\footnotemark is a monitoring platform for \ac{p2p} botnets described by \citeauthor{bib:bock_poster_2019} in \citetitle{bib:bock_poster_2019}.
\footnotetext{\url{https://github.com/Telecooperation/BMS}}
\Ac{bms} is intended for a hybrid active approach of crawlers and sensors (reimplementations of a botnet's \ac{p2p} protocol that won't perform malicious actions) to collect live data from active botnets.
In an earlier project, we implemented different graph ranking algorithms---among others \emph{PageRank}~\cite{bib:page_pagerank_1998} and \emph{SensorRank}---to detect sensor candidates in a botnet, as described in \citetitle{bib:karuppayah_sensorbuster_2017}.
%%{{{ detection criteria
%\subsection{Detection Criteria}
%\begin{itemize}
% \item \ac{p2p} online time vs host online time
% \item neighbourhood lists
% \item no/few \ac{dns} lookups; instead direct lookups from routing tables
%\end{itemize}
%%}}} detection criteria
%{{{ methodology
\clearpage{}
\section{Methodology}
The implementation of the concepts of this work will be done as part of \ac{bms}.
The goal of this work is to make detection and anti-monitoring mechanisms harder for botmasters to apply by coordinating the work of the system's crawlers and sensors.
The coordinated work distribution also helps in efficiently monitoring large botnets where one crawler is not enough to track all peers.
The changes should allow the current \ac{bms} crawlers and sensors to use the new implementation with as few changes as possible to the existing code.
We will explore how cooperative monitoring of a \ac{p2p} botnet can help with the following problems:
\begin{itemize}
\item Impede detection of monitoring attempts by reducing the impact of aforementioned graph metrics
\item Circumvent anti-monitoring techniques
\item Make crawling more efficient
\end{itemize}
The final results should be as general as possible and not depend on any botnet's specific behavior (except for the mentioned anti-monitoring techniques which might be unique to some botnets), but we assume that every \ac{p2p} botnet has some way of querying a bot's neighbors for the concept of crawlers and sensors to be applicable.
In the current implementation, each crawler will itself visit and monitor each new node it finds.
The general idea for the implementation of the concepts in this thesis is to report newly found nodes back to the \ac{bms} backend first, where the graph of the known network is created and a fitting worker is selected to achieve the goal of the current coordination strategy.
That worker will then be responsible for monitoring the new node.
If it is not possible to select a sensor so that the monitoring activity stays inconspicuous, the coordinator can do a complete shuffle of all nodes between the sensors to restore the desired graph properties, or warn if more sensors are required to fulfill the goal defined by the strategy.
The improved crawler system should allow new crawlers to register themselves and their capabilities (\eg{} bandwidth, geolocation), so the amount of work can be scaled accordingly between hosts.
%}}} methodology
%{{{ primitives
\subsection{Protocol Primitives}\label{sec:protPrim}
The coordination protocol must allow the following operations:
\begin{description}
\item[Register Worker] Register a new worker with capabilities (which botnet, the available bandwidth and processing power, if it is a crawler or sensor, \dots{}).
This is called periodically and used to determine which worker is still active when assigning new tasks.
\item[Report Peer] Report found peers.
Both successful and failed attempts are reported to detect churned peers and blacklisted crawlers as soon as possible.
\item[Report Edge] Report found edges.
Edges are created by querying the peer list of a bot.
This is how new peers are detected.
\item[Request Tasks] Receive a batch of crawl tasks from the coordinator.
The tasks consist of the target peer, whether the worker should start or stop monitoring the peer, when the monitoring should start and stop, and the frequency at which the peer should be contacted.
\item[Request Neighbors] Sensors can request a list of candidate peers to return when their peer list is queried.
\end{description}
\begin{listing}[H]
\begin{minted}{go}
type Peer struct {
	BotID string
	IP    string
	Port  uint16
}

type PeerTask struct {
	Peer         Peer
	StartAt      *Time
	StopAt       *Time
	Frequency    uint
	StopCrawling bool
}
\end{minted}
\caption{Relevant Fields for Peers and Tasks}\label{lst:peerFields}
\end{listing}
\Fref{lst:peerFields} shows the Go structures used for crawl tasks.
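The primitives above could be expressed as an interface on the coordinator side. The following is only a sketch: the method names and signatures are illustrative assumptions, not the actual \ac{bms} API, and the \mintinline{go}{Peer} and \mintinline{go}{PeerTask} types are repeated in trimmed form.
\begin{minted}{go}
type Peer struct {
	BotID string
	IP    string
	Port  uint16
}

type PeerTask struct {
	Peer         Peer
	Frequency    uint
	StopCrawling bool
}

// Worker describes a crawler or sensor and its capabilities.
type Worker struct {
	ID        string
	Botnet    string
	Bandwidth uint
	IsSensor  bool
}

// Coordinator is a hypothetical interface mirroring the protocol
// primitives; names and signatures are illustrative, not the BMS API.
type Coordinator interface {
	// RegisterWorker is called periodically and doubles as a liveness signal.
	RegisterWorker(w Worker) error
	// ReportPeer reports a contact attempt; ok is false for failed attempts.
	ReportPeer(workerID string, p Peer, ok bool) error
	// ReportEdge reports an edge found by querying a bot's peer list.
	ReportEdge(from, to Peer) error
	// RequestTasks returns a batch of crawl tasks for the given worker.
	RequestTasks(workerID string) ([]PeerTask, error)
	// RequestNeighbors returns candidate peers a sensor may hand out.
	RequestNeighbors(workerID string, limit int) ([]Peer, error)
}
\end{minted}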
%}}} primitives
%}}} methodology
%{{{ strategies
\clearpage{}
\section{Coordination Strategies}
Let \(C\) be the set of available crawlers.
Without loss of generality, if not stated otherwise, we assume that \(C\) is known when \ac{bms} is started and will not change afterward.
There will be no joining or leaving crawlers.
This assumption greatly simplifies the implementation due to the lack of changing state that has to be tracked while still exploring the described strategies.
A production-ready implementation of the described techniques can drop this assumption but might have to recalculate the work distribution once a crawler joins or leaves.
The protocol primitives described in \Fref{sec:protPrim} already allow for this to be implemented by first creating tasks with the \mintinline{go}{StopCrawling} flag set to true for all active tasks, running the strategy again, and creating the according tasks to start crawling again.
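Such a full reshuffle could be sketched as follows, with trimmed-down versions of the types from \Fref{lst:peerFields} and a pluggable \mintinline{go}{assign} function standing in for whichever coordination strategy is active; this is an illustrative sketch, not the \ac{bms} implementation.
\begin{minted}{go}
type Peer struct {
	BotID string
	IP    string
	Port  uint16
}

type PeerTask struct {
	Peer         Peer
	StopCrawling bool
}

// Rebalance first emits StopCrawling tasks for all active assignments,
// then reassigns the affected peers using the given strategy.
func Rebalance(active []PeerTask, assign func([]Peer) []PeerTask) []PeerTask {
	tasks := make([]PeerTask, 0, 2*len(active))
	peers := make([]Peer, 0, len(active))
	for _, t := range active {
		t.StopCrawling = true // stop the old assignment first
		tasks = append(tasks, t)
		peers = append(peers, t.Peer)
	}
	return append(tasks, assign(peers)...)
}
\end{minted}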
%{{{ load balancing
\subsection{Load Balancing}\label{sec:loadBalancing}
Depending on a botnet's size, a single crawler is not enough to monitor all superpeers.
While it is possible to run multiple, uncoordinated crawlers, two or more of them can find and monitor the same peer, making the approach inefficient with regard to the computing resources at hand.
The load balancing strategy solves this problem by systematically splitting the crawl tasks into chunks and distributing them among the available crawlers.
The following load balancing strategies will be investigated:
\begin{description}
\item[Round Robin] Evenly distribute the peers between crawlers in the order they are found
\item[IP-based partitioning] Use the uniform distribution of cryptographic hash functions to assign peers to crawlers in a pseudorandom but still evenly distributed manner
\end{description}
Load balancing in itself does not help prevent the detection of crawlers but it allows better usage of available resources.
It prevents unintentionally crawling the same peer with multiple crawlers and allows crawling of bigger botnets where the uncoordinated approach would reach its limit and could only be worked around by scaling up the machine where the crawler is executed.
Load balancing allows scaling out, which can be more cost-effective.
\subsubsection{Round Robin Distribution}\label{sec:rr}
This strategy distributes work evenly among crawlers by either naively assigning tasks to the crawlers rotationally or weighted according to their capabilities.
To keep the distribution as even as possible, we keep track of the last crawler a task was assigned to and start with the next in line in the subsequent round of assignments.
For the sake of simplicity, only the bandwidth will be considered as a capability but it can be extended by any shared property between the crawlers, \eg{} available memory or processing power.
For a given crawler \(c_i \in C\) let \(cap(c_i)\) be the capability of the crawler.
The total available capability is \(B = \sum\limits_{c \in C} cap(c)\).
With \(G\) being the greatest common divisor of all crawlers' capabilities, the weight of a crawler is \(W(c_i) = \frac{cap(c_i)}{G}\).
\(\frac{cap(c_i)}{B}\) gives us the percentage of the work a crawler is assigned.
% The set of target peers \(P = <p_0, p_1, \ldots, p_{n-1}>\), is partitioned into \(|C|\) subsets according to \(W(c_i)\) and each subset is assigned to its crawler \(c_i\).
% The mapping \mintinline{go}{gcd(C)} is the greatest common divisor of all peers in \mintinline{go}{C}, \(\text{maxWeight}(C) = \max \{ \forall c \in C : W(c) \}\).
The algorithm in \Fref{lst:wrr} distributes the work according to the crawler's capabilities.
\begin{listing}[H]
\begin{minted}{go}
func WeightCrawlers(crawlers ...Crawler) map[string]uint {
	weights := []int{}
	totalWeight := 0
	for _, crawler := range crawlers {
		totalWeight += crawler.Bandwidth
		weights = append(weights, crawler.Bandwidth)
	}
	gcd := Fold(Gcd, weights...)
	weightMap := map[string]uint{}
	for _, crawler := range crawlers {
		weightMap[crawler.ID] = uint(crawler.Bandwidth / gcd)
	}
	return weightMap
}

func WeightedCrawlerList(crawlers ...Crawler) []string {
	weightMap := WeightCrawlers(crawlers...)
	didSomething := true
	crawlerIds := []string{}
	for didSomething {
		didSomething = false
		for k, v := range weightMap {
			if v != 0 {
				didSomething = true
				crawlerIds = append(crawlerIds, k)
				weightMap[k] -= 1
			}
		}
	}
	return crawlerIds
}
\end{minted}
\caption{Pseudocode for weighted round-robin}\label{lst:wrr}
\end{listing}
This creates a list of crawlers where a crawler can occur more than once, depending on its capabilities.
To ensure better distribution, the list is built in rounds: first, every crawler is appended once; then every crawler with a remaining weight of two or more is appended again, and this process is repeated until all weights are exhausted.
The set of crawlers \(\{a, b, c\}\) with the capabilities \(cap(a) = 3\), \(cap(b) = 2\), \(cap(c) = 1\) would produce \(<a, b, c, a, b, a>\), allocating two and three times the work to crawlers \(b\) and \(a\) respectively.
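Consuming this weighted list is then a matter of cycling through it; each newly found peer goes to the next crawler in line. A minimal sketch (the function name is ours):
\begin{minted}{go}
// AssignRoundRobin cycles through the weighted crawler list (e.g.
// <a, b, c, a, b, a>) and assigns each newly found peer to the next
// crawler in line; the position in the list is the only state needed.
func AssignRoundRobin(peers []string, weighted []string) map[string][]string {
	assignment := map[string][]string{}
	for i, p := range peers {
		crawler := weighted[i%len(weighted)]
		assignment[crawler] = append(assignment[crawler], p)
	}
	return assignment
}
\end{minted}
With the weighted list \(<a, b, c, a, b, a>\) from above, six peers are split 3:2:1 between crawlers \(a\), \(b\), and \(c\).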
\subsubsection{IP-based Partitioning}\label{sec:ipPart}
The output of cryptographic hash functions is uniformly distributed.
Given the hash function \(H\), calculating the hash of an IP address and distributing the work with regard to \(H(\text{IP}) \mod \abs{C}\) creates almost evenly sized buckets for each worker to handle.
This gives us the mapping \(m(i) = H(i) \mod \abs{C}\) to sort peers into buckets.
Any hash function can be used, but since it must be calculated often, a fast one should be chosen.
While the \ac{md5} hash function must be considered broken for cryptographic use since collisions can be constructed~\cite{bib:stevensCollision}, it is faster to calculate than hash functions with longer output.
For the use case at hand, collision resistance is not required, only the uniform distribution property, so \ac{md5} can be used without sacrificing any kind of security.
This strategy can also be weighted using the crawlers' capabilities by modifying the list of available workers so that a worker can appear multiple times according to its weight.
The weighting algorithm from \Fref{lst:wrr} is used to create the weighted multiset of crawlers \(C_W\) and the mapping changes to \(m(i) = H(i) \mod \abs{C_W}\).
% \begin{figure}[H]
% \centering
% \includegraphics[width=1\linewidth]{./md5_ip_dist.png}
% \caption{Distribution of the lowest byte of \ac{md5} hashes over IPv4}\label{fig:md5IPDist}
% \end{figure}
% \todo{remove this?}
\ac{md5} returns a \SI{128}{\bit} hash value.
The Go standard library includes helpers for arbitrarily sized integers\footnote{\url{https://pkg.go.dev/math/big\#Int}}.
This helps us in implementing the mapping \(m\) from above.
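Using \mintinline{go}{crypto/md5} and \mintinline{go}{math/big} from the standard library, the mapping can be sketched as follows (the function name is ours; for the weighted variant, \(\abs{C_W}\) is passed instead of \(\abs{C}\)):
\begin{minted}{go}
import (
	"crypto/md5"
	"math/big"
)

// BucketFor implements the mapping m(i) = H(i) mod |C| using the full
// 128-bit MD5 digest interpreted as a big-endian unsigned integer.
func BucketFor(ip string, numCrawlers int64) int64 {
	sum := md5.Sum([]byte(ip)) // [16]byte digest
	h := new(big.Int).SetBytes(sum[:])
	return new(big.Int).Mod(h, big.NewInt(numCrawlers)).Int64()
}
\end{minted}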
By exploiting the even distribution offered by hashing, the work of each crawler is also about evenly distributed over all IP subnets, \acp{as}, and geolocations.
This ensures neighboring peers (\eg{} in the same \ac{as}, geolocation, or IP subnet) get visited by different crawlers.
It also allows us to get rid of the state in our strategy since we don't have to keep track of the last crawler we assigned a task to, making it easier to implement and reason about.
%}}} load balancing
%{{{ frequency reduction
\subsection{Reduction of Request Frequency}\label{sec:stratRedReqFreq}
The GameOver Zeus botnet limited the number of requests a peer was allowed to perform and blacklisted peers that exceeded the limit, as an anti-monitoring mechanism~\cite{bib:andriesse_goz_2013}.
In an uncoordinated crawler approach, the crawl frequency has to be limited to prevent hitting the request limit.
%{{{ fig:old_crawler_timeline
\begin{figure}[h]
\centering
\begin{chronology}[10]{0}{60}{0.9\textwidth}
\event{0}{\(C_0\)}
\event{20}{\(C_0\)}
\event{40}{\(C_0\)}
\event{60}{\(C_0\)}
\end{chronology}
\caption{Timeline of requests as received by a peer when crawled by a single crawler}\label{fig:old_crawler_timeline}
\end{figure}
%}}} fig:old_crawler_timeline
Using collaborative crawlers, an arbitrarily fast frequency can be achieved without being blacklisted.
Let \(l \in \mathbb{N}\) be the maximum allowed frequency as defined by the botnet's protocol and \(f \in \mathbb{N}\) the crawl frequency that should be achieved.
The number of crawlers \(n\) required to achieve the frequency \(f\) without being blacklisted and the offset \(o\) between crawlers are defined as
\begin{align*}
n &= \left\lceil \frac{f}{l} \right\rceil \\
o &= \frac{\SI{1}{\request}}{f}
\end{align*}
Taking advantage of the \mintinline{go}{StartAt} field from the \mintinline{go}{PeerTask} returned by the \mintinline{go}{requestTasks} primitive above, the crawlers can be scheduled offset by \(o\) at a frequency \(l\) to ensure that the overall requests to each peer are evenly distributed over time.
Given a limit \(l = \SI{6}{\request\per\minute}\), crawling a botnet at \(f = \SI{24}{\request\per\minute}\) requires \(n = \left\lceil \frac{\SI{24}{\request\per\minute}}{\SI{6}{\request\per\minute}} \right\rceil = 4\) crawlers.
Those crawlers must be scheduled \(o = \frac{\SI{1}{\request}}{\SI{24}{\request\per\minute}} = \SI{2.5}{\second}\) apart at a frequency of \(l\) to evenly distribute the requests over time.
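The calculation of \(n\) and \(o\) is simple enough to sketch directly (the function name is ours; frequencies are given in requests per minute and the offset is returned in seconds):
\begin{minted}{go}
import "math"

// CrawlersAndOffset computes the number of crawlers n = ceil(f/l) and
// the offset o = 1 request / f for a desired effective frequency f and
// a per-bot request limit l, both in requests per minute.
func CrawlersAndOffset(f, l float64) (n int, offsetSeconds float64) {
	n = int(math.Ceil(f / l))
	offsetSeconds = 60.0 / f // one minute spread over f requests
	return n, offsetSeconds
}
\end{minted}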
%{{{ fig:crawler_timeline
\begin{figure}[h]
\centering
\begin{chronology}[10]{0}{60}{0.9\textwidth}
\event{0}{\(C_0\)}
\event{10}{\(C_0\)}
\event{20}{\(C_0\)}
\event{30}{\(C_0\)}
\event{40}{\(C_0\)}
\event{50}{\(C_0\)}
\event{60}{\(C_0\)}
\event{2.5}{\(C_1\)}
\event{12.5}{\(C_1\)}
\event{22.5}{\(C_1\)}
\event{32.5}{\(C_1\)}
\event{42.5}{\(C_1\)}
\event{52.5}{\(C_1\)}
\event{5}{\(C_2\)}
\event{15}{\(C_2\)}
\event{25}{\(C_2\)}
\event{35}{\(C_2\)}
\event{45}{\(C_2\)}
\event{55}{\(C_2\)}
\event{7.5}{\(C_3\)}
\event{17.5}{\(C_3\)}
\event{27.5}{\(C_3\)}
\event{37.5}{\(C_3\)}
\event{47.5}{\(C_3\)}
\event{57.5}{\(C_3\)}
\end{chronology}
\caption{Timeline of crawler events when optimized for effective frequency}\label{fig:crawlerTimelineEffective}
\end{figure}
%}}} fig:crawler_timeline
As can be seen in~\Fref{fig:crawlerTimelineEffective}, each crawler \(C_0\) to \(C_3\) performs only \SI{6}{\request\per\minute} while overall achieving \(\SI{24}{\request\per\minute}\).
Conversely, given a number of crawlers \(n\) and a request limit \(l\), the effective frequency \(f\) can be maximized to \(f = n \times l\) without hitting the limit \(l\) and being blocked.
Using the example from above with \(l = \SI{6}{\request\per\minute}\) but now only \(n = 2\) crawlers, it is still possible to achieve an effective frequency of \(f = 2 \times \SI{6}{\request\per\minute} = \SI{12}{\request\per\minute}\) with \(o = \frac{\SI{1}{\request}}{\SI{12}{\request\per\minute}} = \SI{5}{\second}\):
%TODO: name
%{{{ fig:crawler_timeline
\begin{figure}[h]
\centering
\begin{chronology}[10]{0}{60}{0.9\textwidth}
\event{0}{\(C_0\)}
\event{10}{\(C_0\)}
\event{20}{\(C_0\)}
\event{30}{\(C_0\)}
\event{40}{\(C_0\)}
\event{50}{\(C_0\)}
\event{60}{\(C_0\)}
\event{5}{\(C_1\)}
\event{15}{\(C_1\)}
\event{25}{\(C_1\)}
\event{35}{\(C_1\)}
\event{45}{\(C_1\)}
\event{55}{\(C_1\)}
\end{chronology}
\caption{Timeline of crawler events when optimized over the number of crawlers}\label{fig:crawlerTimelineAmount}
\end{figure}
%}}} fig:crawler_timeline
While the effective frequency of the whole system is halved compared to~\Fref{fig:crawlerTimelineEffective}, it is still twice the per-peer request limit.
%}}} frequency reduction
%{{{ against graph metrics
\subsection{Creating and Reducing Edges for Sensors}
\citetitle*{bib:karuppayah_sensorbuster_2017} describes different graph metrics to find sensors in \ac{p2p} botnets.
These metrics depend on the uneven ratio between incoming and outgoing edges for sensors.
The \emph{SensorBuster} metric uses \acp{wcc} since naive sensors don't have any edges back to the main network in the graph.
With \(v \in V\), \(\text{succ}(v)\) being the set of successors of \(v\) and \(\text{pred}(v)\) being the set of predecessors of \(v\), \emph{PageRank} is recursively defined as~\cite{bib:page_pagerank_1998}:
\begin{align*}
\text{PR}_0(v) &= \text{initialRank} \\
\text{PR}_{n+1}(v) &= \text{dampingFactor} \times \sum\limits_{p \in \text{pred}(v)} \frac{\text{PR}_n(p)}{\abs{\text{succ}(p)}} + \frac{1 - \text{dampingFactor}}{\abs{V}}
\end{align*}
For the first iteration, the PageRank of all nodes is set to the same initial value. \citeauthor{bib:page_pagerank_1998} argue that when iterating often enough, any value can be chosen~\cite{bib:page_pagerank_1998}.
When PageRank is used to rank websites in search results, the dampingFactor models the probability that a person browsing the web will continue to follow links.
For simplicity---and since it is not required to model human behavior for automated crawling and ranking---a dampingFactor of \(1.0\) will be used, which simplifies the formula to
\[
\text{PR}_{n+1}(v) = \sum\limits_{p \in \text{pred}(v)} \frac{\text{PR}_n(p)}{\abs{\text{succ}(p)}}
\]
Based on this, \emph{SensorRank} is defined as
\[
\text{SR}(v) = \frac{\text{PR}(v)}{\abs{\text{succ}(v)}} \times \frac{\abs{\text{pred}(v)}}{|V|}
\]
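A minimal sketch of both metrics with the simplified dampingFactor of \(1.0\) from above; the graph representation, helper names, and example values are illustrative, not the actual implementation:

```go
package main

import "fmt"

// Graph maps each node to its successors (outgoing edges).
type Graph map[string][]string

// predecessors inverts the graph so each node knows its incoming edges.
func predecessors(g Graph) map[string][]string {
	pred := make(map[string][]string)
	for v, succs := range g {
		for _, s := range succs {
			pred[s] = append(pred[s], v)
		}
	}
	return pred
}

// pageRank iterates PR_{n+1}(v) = sum_{p in pred(v)} PR_n(p) / |succ(p)|,
// i.e. the simplified formula with dampingFactor 1.0.
func pageRank(g Graph, initial float64, iterations int) map[string]float64 {
	pred := predecessors(g)
	pr := make(map[string]float64)
	for v := range g {
		pr[v] = initial
	}
	for i := 0; i < iterations; i++ {
		next := make(map[string]float64)
		for v := range g {
			for _, p := range pred[v] {
				next[v] += pr[p] / float64(len(g[p]))
			}
		}
		pr = next
	}
	return pr
}

// sensorRank computes SR(v) = PR(v)/|succ(v)| * |pred(v)|/|V|.
// Nodes without outgoing edges are skipped to avoid division by zero.
func sensorRank(g Graph, pr map[string]float64) map[string]float64 {
	pred := predecessors(g)
	sr := make(map[string]float64)
	for v, rank := range pr {
		if len(g[v]) == 0 {
			continue
		}
		sr[v] = rank / float64(len(g[v])) * float64(len(pred[v])) / float64(len(g))
	}
	return sr
}

func main() {
	// Two peers that know each other and a sensor without outgoing edges.
	g := Graph{"n0": {"n1", "sensor"}, "n1": {"n0", "sensor"}, "sensor": {}}
	fmt.Println(pageRank(g, 0.25, 1)) // map[n0:0.125 n1:0.125 sensor:0.25]
}
```

Even in this tiny example, the sensor accumulates twice the rank of the regular peers after a single iteration.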
Since crawlers never respond to peer list requests, they will always be detectable by the described approach but sensors might benefit from the following technique.
The PageRank and SensorRank metrics are calculated over the sum of the ranks of a node's predecessors.
We will investigate how limiting the number of predecessors helps produce inconspicuous ranks for a sensor.
% By responding to peer list requests with plausible data and thereby producing valid outgoing edges from the sensors, we will try to make those metrics less suspicious.
To counter the SensorBuster metric, outgoing edges to valid peers from the botnet are required so the sensor does not build a \ac{wcc}.
The challenge here is deciding which peers can be returned without actually supporting the network.
The following candidates to place on the neighbor list will be investigated:
\begin{itemize}
\item Return the other known sensors, effectively building a complete graph \(K_{\abs{C}}\) containing all sensors
\item Detect churned peers from \ac{as} with dynamic IP allocation
\item Detect peers behind carrier-grade \ac{nat} that rotate IP addresses very often and pick random IP addresses from the IP range
\end{itemize}
% Knowledge of only \num{90} peers leaving due to IP rotation would be enough to make a crawler look average in Sality\todo{repeat analysis, actual number}.
% This number will differ between different botnets, depending on implementation details and size of the network\todo{upper limit for NL size as impl detail}.
% \subsubsection{Other Sensors or Crawlers}
\textbf{Other Sensors:} Returning all the other sensors when responding to peer list requests, thereby effectively creating a complete graph \(K_{\abs{C}}\) among the workers, creates valid outgoing edges.
The resulting graph will still form a \ac{wcc} with no edges back into the main network.
While this component \(G_C = K_{\abs{C}}\) is bigger and maybe not as obvious at first glance, it is still easily detectable since there is no path from \(G_C\) back to the main network (see~\Fref{fig:sensorbuster2} and~\Fref{tab:metricsTable}).
%{{{ churned peers
% \subsubsection{Churned Peers After IP Rotation}
\textbf{Churned peers after IP rotation:} Churn describes the dynamics of peer participation in \ac{p2p} systems, \eg{} join and leave events~\cite{bib:stutzbach_churn_2006}.
By detecting whether a peer just left the system and combining this with knowledge about \acp{as}, peers that left and came from an \ac{as} with dynamic IP allocation (\eg{} many consumer broadband providers in the US and Europe) can be placed into the crawler's peer list.
If the timing of the churn event correlates with IP rotation in the \ac{as}, it can be assumed that the peer left because it was assigned a new IP address---not due to connectivity issues or going offline---and will not return under the same IP address.
These peers, when placed in the peer list of the crawlers, will introduce paths back into the main network and defeat the \ac{wcc} metric.
It also helps with the PageRank and SensorRank metrics since the crawlers start to look like regular peers without actually supporting the network by relaying messages or propagating active peers.
%}}} churned peers
%{{{ cg nat
% \subsubsection{Peers Behind Carrier-Grade \acs*{nat}}
\textbf{Peers behind carrier-grade \acs{nat}:} Some peers show behavior where their IP address changes after almost every request.
Those peers can be used as fake neighbors and create valid-looking outgoing edges for the sensor.
%}}} cg nat
In theory, it would be possible to detect churned peers or peers behind carrier-grade \acs{nat} without coordinating the sensors, but the coordination gives us a few advantages:
\begin{itemize}
\item From the point of view of a single uncoordinated sensor, a peer that blacklisted the sensor looks exactly like a churned peer.
The coordination backend has more knowledge and can detect this case if another sensor is still being contacted by the peer in question.
\item The coordination backend can include different streams of information when deciding which peers to place in a sensor's neighborhood.
Knowledge about geolocation, \acp{as}, and their IP rotation behavior can be consulted to make better-informed choices of neighborhood candidates.
\end{itemize}
%}}} against graph metrics
%}}} strategies
%{{{ evaluation
\clearpage{}
\section{Evaluation}
To evaluate the strategies from above, we took a snapshot of the Sality~\cite{bib:falliere_sality_2011} botnet obtained from \ac{bms} over the period \daterange{2021-04-21}{2021-04-28}, if not stated otherwise.
%{{{ eval load balancing
\subsection{Load Balancing}
To evaluate the real-world applicability of IP-based partitioning, we partition the dataset containing \num{1595} distinct IP addresses among \num{2}, \num{4}, \num{6}, and \num{10} crawlers and verify whether the work is evenly distributed between the crawlers.
We will compare the variance \(\sigma^2\) and standard deviation \(\sigma\) to evaluate the applicability of this method.
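Assuming peers are assigned to crawlers by hashing their IP address modulo the number of crawlers, the partitioning and its standard deviation can be sketched as follows; the choice of FNV-1a as hash function is an assumption for illustration:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// assign deterministically maps an IP address to one of n crawlers
// by hashing it (FNV-1a is an illustrative choice of hash function).
func assign(ip string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(ip))
	return int(h.Sum32() % uint32(n))
}

// spread partitions the addresses and returns the per-crawler counts
// together with the standard deviation of the distribution.
func spread(ips []string, n int) ([]int, float64) {
	counts := make([]int, n)
	for _, ip := range ips {
		counts[assign(ip, n)]++
	}
	mean := float64(len(ips)) / float64(n)
	var s float64
	for _, c := range counts {
		d := float64(c) - mean
		s += d * d
	}
	return counts, math.Sqrt(s / float64(n))
}

func main() {
	// Synthetic dataset of unique IPv4 addresses.
	var ips []string
	for i := 0; i < 100000; i++ {
		ips = append(ips, fmt.Sprintf("10.%d.%d.%d", i>>16&255, i>>8&255, i&255))
	}
	counts, sigma := spread(ips, 4)
	fmt.Println(counts, sigma) // counts should be close to 25000 each
}
```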
\begin{landscape}
\begin{figure}[H]
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{ip_part_c02.png}
% \caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{ip_part_c04.png}
% \caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
\end{subfigure}%
\hfill
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{ip_part_c06.png}
% \caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{ip_part_c10.png}
\end{subfigure}%
\caption{IP based partitioning}\label{fig:ipPart}
\end{figure}
\end{landscape}
%%{{{ fig:ipPartC02
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{ip_part_c02.png}
%\caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
%\begin{align*}
% n &= 2 \\
% \mu &= \frac{1595}{n} = 797.5 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(808 - 797.5)}^2 + {(787 - 797.5)}^2 \\
% &= 220.5 \\
% \sigma^2 &= \frac{s}{n} = 110.2 \\
% \sigma &= \sqrt{\sigma^2} = 10.5
%\end{align*}
%\end{figure}
%%}}}fig:ipPartC02
%%{{{ fig:ipPartC04
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{ip_part_c04.png}
%\caption{IP based partitioning for 4 crawlers}\label{fig:ipPartC04}
%\begin{align*}
% n &= 4 \\
% \mu &= \frac{1595}{n} = 398.8 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(403 - 398.8)}^2 + {(369 - 398.8)}^2 + {(405 - 398.8)}^2 \\
% &+ {(418 - 398.8)}^2 \\
% &= 1312.8 \\
% \sigma^2 &= \frac{s}{n} = 328.2 \\
% \sigma &= \sqrt{\sigma^2} = 18.1
%\end{align*}
%\end{figure}
%%}}}fig:ipPartC04
%%{{{ fig:ipPartC06
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{ip_part_c06.png}
%\caption{IP based partitioning for 6 crawlers}\label{fig:ipPartC06}
%\begin{align*}
% n &= 6 \\
% \mu &= \frac{1595}{n} = 265.8 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(258 - 265.8)}^2 + {(273 - 265.8)}^2 + {(257 - 265.8)}^2 \\
% &+ {(264 - 265.8)}^2 + {(293 - 265.8)}^2 + {(250 - 265.8)}^2 \\
% &= 1182.8 \\
% \sigma^2 &= \frac{s}{n} = 197.1 \\
% \sigma &= \sqrt{\sigma^2} = 14.0
%\end{align*}
%\end{figure}
%%}}}fig:ipPartC06
%%{{{ fig:ipPartC10
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{ip_part_c10.png}
%\caption{IP based partitioning for 10 crawlers}\label{fig:ipPartC10}
%\begin{align*}
% n &= 10 \\
% \mu &= \frac{1595}{n} = 159.5 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(140 - 159.5)}^2 + {(175 - 159.5)}^2 + {(186 - 159.5)}^2 \\
% &+ {(166 - 159.5)}^2 + {(159 - 159.5)}^2 + {(152 - 159.5)}^2 \\
% &+ {(172 - 159.5)}^2 + {(148 - 159.5)}^2 + {(151 - 159.5)}^2 \\
% &+ {(146 - 159.5)}^2 \\
% &= 1964.5 \\
% \sigma^2 &= \frac{s}{n} = 196.4 \\
% \sigma &= \sqrt{\sigma^2} = 14.0
%\end{align*}
%\end{figure}
%%}}}fig:ipPartC10
\begin{table}[H]
\centering
\begin{tabular}{rSSSS}
\(n\) & \(\mu\) & \(\sigma^{2}\) & \(\sigma\) & \(\frac{\sigma}{\mu}\) \\
\num{2} & 797.5 & 110.2 & 10.5 & 1.3\% \\
\num{4} & 398.8 & 328.2 & 18.1 & 4.5\% \\
\num{6} & 265.8 & 197.1 & 14.0 & 5.3\% \\
\num{10} & 159.5 & 196.4 & 14.0 & 8.8\% \\
\end{tabular}
\caption{Variance and standard deviation for IP-based partitioning on \num{1595} IP addresses}\label{tab:varSmall}
\end{table}
\Fref{tab:varSmall} shows that the deviation from the expected even distribution is within \SI{10}{\percent}.
Since the sample is not very big, according to the law of large numbers, we expect the deviation to become smaller as the sample grows.
Therefore, we simulate the partitioning on a bigger sample of \num{1000000} random IP addresses.
\begin{landscape}
\begin{figure}[H]
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{rand_ip_part_c02.png}
% \caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{rand_ip_part_c04.png}
% \caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
\end{subfigure}%
\hfill
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{rand_ip_part_c06.png}
% \caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{rand_ip_part_c10.png}
% \caption{IP based partitioning for 2 crawlers}\label{fig:ipPartC02}
\end{subfigure}%
\caption{IP based partitioning for crawlers on generated dataset}\label{fig:randIpPart}
\end{figure}
\end{landscape}
%%{{{ fig:randIpPartC02
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{rand_ip_part_c02.png}
%\caption{IP based partitioning for 2 crawlers on generated dataset}\label{fig:randIpPartC02}
%\begin{align*}
% n &= 2 \\
% \mu &= \frac{1000000}{n} = 500000 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(499322 - 500000)}^2 + {(500678 - 500000)}^2 \\
% &= 919368 \\
% \sigma^2 &= \frac{s}{n} = 459684 \\
% \sigma &= \sqrt{\sigma^2} = 678
%\end{align*}
%\end{figure}
%%}}}fig:randIpPartC02
%%{{{ fig:randIpPartC04
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{rand_ip_part_c04.png}
%\caption{IP based partitioning for 4 crawlers on generated dataset}\label{fig:randIpPartC04}
%\begin{align*}
% n &= 4 \\
% \mu &= \frac{1000000}{n} = 250000 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(249504 - 250000)}^2 + {(250451 - 250000)}^2 + {(249818 - 250000)}^2 \\
% &+ {(250227 - 250000)}^2 \\
% &= 534070 \\
% \sigma^2 &= \frac{s}{n} = 133517.5 \\
% \sigma &= \sqrt{\sigma^2} = 365.4
%\end{align*}
%\end{figure}
%%}}}fig:randIpPartC04
%%{{{ fig:randIpPartC06
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{rand_ip_part_c06.png}
%\caption{IP based partitioning for 6 crawlers on generated dataset}\label{fig:randIpPartC06}
%\begin{align*}
% n &= 6 \\
% \mu &= \frac{1000000}{n} = 166666.7 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(166430 - 166666.7)}^2 + {(166861 - 166666.7)}^2 + {(166269 - 166666.7)}^2 \\
% &+ {(166937 - 166666.7)}^2 + {(166623 - 166666.7)}^2 + {(166880 - 166666.7)}^2 \\
% &= 372413.3 \\
% \sigma^2 &= \frac{s}{n} = 62068.9 \\
% \sigma &= \sqrt{\sigma^2} = 249.1
%\end{align*}
%\end{figure}
%%}}}fig:randIpPartC06
%%{{{ fig:randIpPartC10
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{rand_ip_part_c10.png}
%\caption{IP based partitioning for 10 crawlers on generated dataset}\label{fig:randIpPartC10}
%\begin{align*}
% n &= 10 \\
% \mu &= \frac{1000000}{n} = 100000 \\
% s &= \sum\limits_{i=1}^{n} {(x_i - \mu)}^2 \\
% &= {(100424 - 100000)}^2 + {(99650 - 100000)}^2 + {(99307 - 100000)}^2 \\
% &+ {(100305 - 100000)}^2 + {(99403 - 100000)}^2 + {(100562 - 100000)}^2 \\
% &+ {(100277 - 100000)}^2 + {(99875 - 100000)}^2 + {(99911 - 100000)}^2 \\
% &+ {(100286 - 100000)}^2 \\
% &= 1729874 \\
% \sigma^2 &= \frac{s}{n} = 172987.4 \\
% \sigma &= \sqrt{\sigma^2} = 415.9
%\end{align*}
%\end{figure}
%%}}}fig:randIpPartC10
\begin{table}[H]
\centering
\begin{tabular}{rSSSS}
\(n\) & \(\mu\) & \(\sigma^{2}\) & \(\sigma\) & \(\frac{\sigma}{\mu}\) \\
\num{2} & 500000.0 & 459684.0 & 678.0 & 0.14\% \\
\num{4} & 250000.0 & 133517.5 & 365.4 & 0.15\% \\
\num{6} & 166666.7 & 62068.9 & 249.1 & 0.15\% \\
\num{10} & 100000.0 & 172987.4 & 415.9 & 0.42\% \\
\end{tabular}
\caption{Variance and standard deviation for IP-based partitioning on \num{1000000} IP addresses}\label{tab:varBig}
\end{table}
As expected, the work is still not perfectly distributed among the crawlers, but evenly enough for our use case.
The deviation for larger botnets is expected to stay within \SI{0.5}{\percent} of an even distribution.
This is good enough for balancing the tasks among workers.
%}}} eval load balancing
%{{{ eval redu requ freq
\subsection{Reduction of Request Frequency}
To evaluate the request frequency optimization described in \Fref{sec:stratRedReqFreq}, we crawl a simulated peer and check if the requests are evenly distributed and how big the deviation from the theoretically optimal result is.
To get more realistic results, the crawlers and simulated peer are running on different machines so they are not within the same LAN\@.
We use the same parameters as in the example above:
\begin{align*}
n &= 4 \\
l &= \SI{6}{\request\per\minute} \\
f &= \SI{24}{\request\per\minute} \\
o &= \SI{2.5}{\second}
\end{align*}
To recap, this is what the optimal timeline would look like:
\begin{center}
\begin{chronology}[10]{0}{60}{0.9\textwidth}
\event{0}{\(C_0\)}
\event{10}{\(C_0\)}
\event{20}{\(C_0\)}
\event{30}{\(C_0\)}
\event{40}{\(C_0\)}
\event{50}{\(C_0\)}
\event{60}{\(C_0\)}
\event{2.5}{\(C_1\)}
\event{12.5}{\(C_1\)}
\event{22.5}{\(C_1\)}
\event{32.5}{\(C_1\)}
\event{42.5}{\(C_1\)}
\event{52.5}{\(C_1\)}
\event{5}{\(C_2\)}
\event{15}{\(C_2\)}
\event{25}{\(C_2\)}
\event{35}{\(C_2\)}
\event{45}{\(C_2\)}
\event{55}{\(C_2\)}
\event{7.5}{\(C_3\)}
\event{17.5}{\(C_3\)}
\event{27.5}{\(C_3\)}
\event{37.5}{\(C_3\)}
\event{47.5}{\(C_3\)}
\event{57.5}{\(C_3\)}
\end{chronology}
\end{center}
The ideal distribution would be \SI{2.5}{\second} between every two events.
Due to network latency and load from crawling other peers, we expect the actual result to deviate from the optimal value over time.
With this experiment, we try to estimate the impact of the latency.
% \begin{figure}[H]
% \centering
% \includegraphics[width=1\linewidth]{time_devi.png}
% \caption{Deviation from the expected interval}\label{fig:timeDevi}
% \end{figure}
\begin{landscape}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{time_deviation/time_devi_c0.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{time_deviation/time_devi_c1.png}
\end{subfigure}%
\hfill
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{time_deviation/time_devi_c2.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{time_deviation/time_devi_c3.png}
\end{subfigure}%
\caption{Deviation from the expected interval per crawler}\label{fig:perCralwerDeviation}
\end{figure}
\end{landscape}
\begin{table}[H]
\centering
\begin{tabular}{rS}
\textbf{Crawler} & \textbf{Average Deviation} \\
c0 & \num{0.00032} \\
c1 & \num{0.00021} \\
c2 & \num{0.00031} \\
c3 & \num{0.00381} \\
\end{tabular}
\caption{Average deviation per crawler}\label{tab:perCralwerDeviation}
\end{table}
The monitored peer and crawler \emph{c0} are located in Falkenstein, Germany, \emph{c1} in Nuremberg, Germany, \emph{c2} in Helsinki, Finland, and \emph{c3} in Ashburn, USA, to have some geographic distribution.
For \emph{c0} to \emph{c2}, the average deviation stays below \SI{0.0004}{\second}, even with some outliers due to network latency or server load.
Crawler \emph{c3} is the farthest away from the monitored host, so its larger average deviation of about \SI{0.004}{\second} due to network latency is expected.
% In general it is below \SI{0.0007}{\second}, which is a surprisingly accurate result.
In real-world scenarios, crawlers will monitor more than a single peer and the scheduling is expected to be less accurate.
Still, the deviation will always stay below the scheduling offset \(o = \frac{1}{f}\) because upon exceeding it, a crawler is overtaken by the next one in line.
The impact of this deviation when crawling real-world botnets still has to be investigated; if it proves to be a problem, the tasks have to be rescheduled periodically to prevent it.
% \begin{figure}[H]
% \centering
% \includegraphics[width=1\linewidth]{time_devi_overtake.png}
% \end{figure}
%}}} eval redu requ freq
%{{{ eval creating edges
\subsection{Impact of Additional Edges on Graph Metrics}
%{{{ other sensors
\subsubsection{Use Other Known Sensors}
Connecting the known sensors and thereby effectively building a complete graph \(K_{\abs{C}}\) between them creates \(\abs{C} - 1\) outgoing edges per sensor.
In most cases, this won't be enough to reach the number of outgoing edges that would be needed to appear average.
Also, this does not help against the \ac{wcc} metric since this would create a bigger but still disconnected component.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{sensorbuster1.drawio.pdf}
\caption{\acp{wcc} for independent crawlers}\label{fig:sensorbuster1}
\end{subfigure}%
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{sensorbuster2.drawio.pdf}
\caption{\acp{wcc} for collaborated crawlers}\label{fig:sensorbuster2}
\end{subfigure}%
\caption{Differences in graph metrics}\label{fig:sensorbuster}
\end{figure}
Applying one iteration of PageRank with an initial rank of \(0.25\) to the example graphs in \Fref{fig:sensorbuster} results in:
\begin{table}[H]
\centering
\begin{tabular}{llllll}
Node & \(\deg^{+}\) & \(\deg^{-}\) & In \ac{wcc}? & PageRank & SensorRank \\
n0 & 0/0 & 4/4 & no & 0.75/0.5625 & 0.3125/0.2344 \\
n1 & 1/1 & 3/3 & no & 0.25/0.1875 & 0.0417/0.0313 \\
n2 & 2/2 & 2/2 & no & 0.5/0.375 & 0.3333/0.25 \\
c0 & 3/5 & 0/2 & yes (1/3) & 0.0/0.125 & 0.0/0.0104 \\
c1 & 1/3 & 0/2 & yes (1/3) & 0.0/0.125 & 0.0/0.0104 \\
c2 & 2/4 & 0/2 & yes (1/3) & 0.0/0.125 & 0.0/0.0104 \\
\end{tabular}
\caption{Values for metrics from~\Fref{fig:sensorbuster} (a/b)}\label{tab:metricsTable}
\end{table}
While this works for small networks, the crawlers would have to account for a significant number of peers in the network for this change to be noticeable.
The generated \(K_n\) needs to be at least as big as the smallest regular component in the botnet, which is not feasible.
Also, if detected, this would leak the information about all known sensors to the botmasters.
The limited scalability and the potential information leak---which might be used by botmasters to retaliate against the sensors or the whole monitoring operation---make this approach unusable in real-world scenarios.
%}}} other sensors
\subsubsection{Effectiveness against SensorBuster}
SensorBuster relies on the assumption that sensors don't have any outgoing edges, thereby creating a disconnected graph component.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{sensorbuster/sensor_without_outgoing.drawio.pdf}
\caption{Sensor without outgoing edge creates disconnected graph component}
\end{subfigure}%
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{sensorbuster/sensor_with_outgoing.drawio.pdf}
\caption{Single outgoing edge connects sensor back to the main component}\label{fig:sensorbusterWithOutgoing}
\end{subfigure}%
\end{figure}
\Fref{fig:sensorbusterWithOutgoing} shows how a single valid edge back into the network (from \emph{Sensor} to peer \num{3} in the example) renders the SensorBuster metric ineffective by making the sensor part of the main graph component.
For the \ac{wcc} metric, it is obvious that even a single edge back into the main network is enough to connect the sensor to the main graph and therefore defeat this metric.
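This reachability argument can be illustrated with a small sketch; the node names and graph representation are purely illustrative:

```go
package main

import "fmt"

// canReachMainNetwork reports whether, following directed edges, a path from
// the given sensor back to any regular peer exists. Without such a path the
// sensors stand out as a separate component (the SensorBuster observation).
func canReachMainNetwork(succ map[string][]string, sensors map[string]bool, start string) bool {
	visited := map[string]bool{start: true}
	stack := []string{start}
	for len(stack) > 0 {
		v := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, s := range succ[v] {
			if !sensors[s] {
				return true // reached a regular peer in the main network
			}
			if !visited[s] {
				visited[s] = true
				stack = append(stack, s)
			}
		}
	}
	return false
}

func main() {
	sensors := map[string]bool{"s0": true, "s1": true}
	// Sensors only return each other: no path back into the main network.
	succ := map[string][]string{"s0": {"s1"}, "s1": {"s0"}}
	fmt.Println(canReachMainNetwork(succ, sensors, "s0")) // false

	// A single edge to a regular peer (here p3) defeats the metric.
	succ["s0"] = append(succ["s0"], "p3")
	fmt.Println(canReachMainNetwork(succ, sensors, "s0")) // true
}
```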
\subsection{Reducing Incoming Edges to Reduce Page- and SensorRank}
In this section, we will evaluate how reducing a sensor's incoming edges impacts its PageRank and SensorRank values.
Before doing so, we will check the impact of the initial rank by calculating it with different initial values and comparing the value distribution of the result.
\begin{table}[H]
\centering
\begin{tabular}{lllll}
\textbf{Iteration} & \textbf{Avg. PR} & \textbf{Crawler PR} & \textbf{Avg. SR} & \textbf{Crawler SR} \\
\num{1} & \num{0.24854932} & \num{0.63277194} & \num{0.15393478} & \num{0.56545578} \\
\num{2} & \num{0.24854932} & \num{0.63277194} & \num{0.15393478} & \num{0.56545578} \\
\num{3} & \num{0.24501068} & \num{0.46486353} & \num{0.13810930} & \num{0.41540997} \\
\num{4} & \num{0.24501068} & \num{0.46486353} & \num{0.13810930} & \num{0.41540997} \\
\num{5} & \num{0.24233737} & \num{0.50602884} & \num{0.14101354} & \num{0.45219598} \\
\end{tabular}
\caption{Values for PageRank iterations with initial rank \(\forall v \in V : \text{PR}_0(v) = 0.25\)}\label{tab:pr_iter_table_25}
\end{table}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{0.25_1_sr.png}
\caption{Distribution after 1 iteration}\label{fig:dist_sr_25_1}
\end{subfigure}%
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{0.25_5_sr.png}
\caption{Distribution after 5 iterations}\label{fig:dist_sr_25_5}
\end{subfigure}%
\caption{SensorRank distribution with initial rank \(\forall v \in V : \text{PR}_0(v) = 0.25\)}\label{fig:dist_sr_25}
\end{figure}
\begin{table}[H]
\centering
\begin{tabular}{lllll}
\textbf{Iteration} & \textbf{Avg. PR} & \textbf{Crawler PR} & \textbf{Avg. SR} & \textbf{Crawler SR} \\
\num{1} & \num{0.49709865} & \num{1.26554389} & \num{0.30786955} & \num{1.13091156} \\
\num{2} & \num{0.49709865} & \num{1.26554389} & \num{0.30786955} & \num{1.13091156} \\
\num{3} & \num{0.49002136} & \num{0.92972707} & \num{0.27621861} & \num{0.83081993} \\
\num{4} & \num{0.49002136} & \num{0.92972707} & \num{0.27621861} & \num{0.83081993} \\
\num{5} & \num{0.48467474} & \num{1.01205767} & \num{0.28202708} & \num{0.90439196} \\
\end{tabular}
\caption{Values for PageRank iterations with initial rank \(\forall v \in V : \text{PR}_0(v) = 0.5\)}\label{tab:pr_iter_table_5}
\end{table}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{0.50_1_sr.png}
\caption{Distribution after 1 iteration}\label{fig:dist_sr_50_1}
\end{subfigure}%
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{0.50_5_sr.png}
\caption{Distribution after 5 iterations}\label{fig:dist_sr_50_5}
\end{subfigure}%
\caption{SensorRank distribution with initial rank \(\forall v \in V : \text{PR}_0(v) = 0.5\)}\label{fig:dist_sr_50}
\end{figure}
\begin{table}[H]
\centering
\begin{tabular}{lllll}
\textbf{Iteration} & \textbf{Avg. PR} & \textbf{Crawler PR} & \textbf{Avg. SR} & \textbf{Crawler SR} \\
\num{1} & \num{0.74564797} & \num{1.89831583} & \num{0.46180433} & \num{1.69636734} \\
\num{2} & \num{0.74564797} & \num{1.89831583} & \num{0.46180433} & \num{1.69636734} \\
\num{3} & \num{0.73503203} & \num{1.39459060} & \num{0.41432791} & \num{1.24622990} \\
\num{4} & \num{0.73503203} & \num{1.39459060} & \num{0.41432791} & \num{1.24622990} \\
\num{5} & \num{0.72701212} & \num{1.51808651} & \num{0.42304062} & \num{1.35658794} \\
\end{tabular}
\caption{Values for PageRank iterations with initial rank \(\forall v \in V : \text{PR}_0(v) = 0.75\)}\label{tab:pr_iter_table_75}
\end{table}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{0.75_1_sr.png}
\caption{Distribution after 1 iteration}\label{fig:dist_sr_75_1}
\end{subfigure}%
\begin{subfigure}[b]{.5\textwidth}
\centering
\includegraphics[width=1\linewidth]{0.75_5_sr.png}
\caption{Distribution after 5 iterations}\label{fig:dist_sr_75_5}
\end{subfigure}%
\caption{SensorRank distribution with initial rank \(\forall v \in V : \text{PR}_0(v) = 0.75\)}\label{fig:dist_sr_75}
\end{figure}
The distribution graphs in \Fref{fig:dist_sr_25}, \Fref{fig:dist_sr_50}, and \Fref{fig:dist_sr_75} show that the initial rank has no effect on the distribution, only on the actual numeric rank values and how far apart they are spread.
For all combinations of initial value and PageRank iterations, the rank for a well-known crawler is in the \nth{95} percentile, so for our use case---detecting sensors due to their high ranks---those parameters do not matter.
% On average, peers in the analyzed dataset have \num{223} successors over the whole week.
Looking at the data in smaller buckets of one hour each, the average number of successors per peer is \num{90}.
%%{{{ fig:avg_out_edges
%\begin{figure}[H]
%\centering
%\includegraphics[width=1\linewidth]{./avg_out_edges.png}
%\caption{Average outgoing edges per peer per hour}\label{fig:avg_out_edges}
%\end{figure}
%%}}}fig:avg_out_edges
We performed experiments in which the incoming edges of the known sensor are reduced by increasing factors to see when the sensor's rank approaches the overall average.
\begin{landscape}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/0/in_out.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/1/in_out.png}
\end{subfigure}%
\hfill
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/2/in_out.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/3/in_out.png}
\end{subfigure}%
\caption{In-degrees after removing edges}\label{fig:inDegreeFiltered}
\end{figure}
\end{landscape}
\begin{landscape}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/0/pr.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/1/pr.png}
\end{subfigure}%
\hfill
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/2/pr.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/3/pr.png}
\end{subfigure}%
\caption{PageRank after removing edges}\label{fig:prFiltered}
\end{figure}
\end{landscape}
\begin{landscape}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/0/sr.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/1/sr.png}
\end{subfigure}%
\hfill
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/2/sr.png}
\end{subfigure}%
\begin{subfigure}[b]{.6\textwidth}
\centering
\includegraphics[width=1\linewidth]{reduced_ranks/3/sr.png}
\end{subfigure}%
\caption{SensorRank after removing edges}\label{fig:srFiltered}
\end{figure}
\end{landscape}
The graphs with 0 removed edges show the situation on the ground truth without modifications.
We can see in \Fref{fig:prFiltered} and \Fref{fig:srFiltered}, that we have to reduce the incoming edges by \SI{20}{\percent} and \SI{30}{\percent} respectively to get average values for SensorRank and PageRank.
This also means that the number of incoming edges for a sensor must be about the same as the average number of incoming edges of a regular peer.
Depending on the protocol details of the botnet (\eg{} how many incoming edges are allowed per peer), this means that a large number of sensors is needed if we want to monitor the whole network.
%}}} eval creating edges
%}}} evaluation
%{{{ implementation
\clearpage{}
\section{Implementation}
Crawlers in \ac{bms} report to the backend using \ac{grpc}\footnote{\url{https://www.grpc.io}}.
Both the crawlers and the backend \ac{grpc} server are implemented in the Go\footnote{\url{https://go.dev/}} programming language.
To make use of existing know-how and to allow others to use the implementation in the future, the coordinator backend and crawler abstraction were also implemented in Go.
\Ac{bms} already has an existing abstraction for crawlers.
This implementation is highly optimized, but it is also tightly coupled and has grown over time.
The abstraction became leaky and extending it proved to be complicated.
A new crawler abstraction was created with testability, extensibility, and most features of the existing implementation in mind; it can be ported back to be used by the existing crawlers.
%{{{ fig:crawler_arch
\begin{figure}[h]
\centering
\includegraphics[width=1\linewidth]{architecture.drawio.pdf}
\caption{Architecture of the new crawler}\label{fig:crawler_arch}
\end{figure}
%}}}fig:crawler_arch
The new implementation consists of three main interfaces:
\begin{itemize}
\item \mintinline{go}{FindPeer}, to receive new crawl tasks from any source
\item \mintinline{go}{ReportPeer}, to report newly found peers
\item \mintinline{go}{Protocol}, the actual botnet protocol implementation used to ping a peer and request its peer list
\end{itemize}
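The following sketch illustrates the shape of these interfaces; the exact signatures and field names are assumptions for illustration, not the actual \ac{bms} code:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// PeerTask describes a single peer to crawl and how to crawl it.
// Field names are illustrative, not the actual BMS definitions.
type PeerTask struct {
	IP        net.IP
	Port      uint16
	Start     bool          // start or stop the operation
	StartTime time.Time     // when to start crawling
	StopTime  time.Time     // when to stop crawling
	Interval  time.Duration // how often to crawl the peer
}

// FindPeer produces new crawl tasks from any source
// (a file on disk, the gRPC coordinator, ...).
type FindPeer interface {
	FindPeer() ([]PeerTask, error)
}

// ReportPeer reports newly found peers (logged, batched, sent to BMS, ...).
type ReportPeer interface {
	ReportPeer(peers []PeerTask) error
}

// Protocol is the botnet protocol implementation used to ping a peer
// and request its peer list.
type Protocol interface {
	Ping(peer PeerTask) error
	PeerList(peer PeerTask) ([]PeerTask, error)
}

// staticFinder is a trivial FindPeer source backed by a fixed list.
type staticFinder []PeerTask

func (s staticFinder) FindPeer() ([]PeerTask, error) { return s, nil }

// CombinedFinder combines any number of FindPeer instances and returns
// the combined results of all its sources.
type CombinedFinder []FindPeer

func (c CombinedFinder) FindPeer() ([]PeerTask, error) {
	var all []PeerTask
	for _, f := range c {
		peers, err := f.FindPeer()
		if err != nil {
			return nil, err
		}
		all = append(all, peers...)
	}
	return all, nil
}

func main() {
	bootstrap := staticFinder{{IP: net.IPv4(10, 0, 0, 1), Port: 4711}}
	coordinator := staticFinder{{IP: net.IPv4(10, 0, 0, 2), Port: 4711}, {IP: net.IPv4(10, 0, 0, 3), Port: 4711}}
	peers, _ := CombinedFinder{bootstrap, coordinator}.FindPeer()
	fmt.Println(len(peers)) // 3
}
```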
Currently, there are two sources a \mintinline{go}{FindPeer} implementation can draw from: reading peers from a file on disk or requesting them from the \ac{grpc} \ac{bms} coordinator.
The \mintinline{go}{ExactlyOnceFinder} delegate can wrap another \mintinline{go}{FindPeer} instance and ensures the source is queried only once.
This implements the bootstrapping mechanism of the old crawler, where the list of bootstrap nodes is loaded from a text file once at startup.
\mintinline{go}{CombinedFinder} can combine any number of \mintinline{go}{FindPeer} instances and returns the combined results of querying all of them.
The \mintinline{go}{PeerTask} instances returned by \mintinline{go}{FindPeer} contain the IP address and port of the peer, whether the crawler should start or stop working on it, when to do so, and the interval at which the peer should be crawled.
For each task, a \mintinline{go}{CrawlPeer} and \mintinline{go}{PingPeer} worker is started or stopped as specified in the received \mintinline{go}{PeerTask}.
These tasks use the \mintinline{go}{ReportPeer} interface to report any new peer that is found.
Current report possibilities are \mintinline{go}{LoggingReport} to simply log new peers to get feedback from the crawler at runtime, and \mintinline{go}{BMSReport} which reports back to \ac{bms}.
\mintinline{go}{BatchedReport} wraps a \mintinline{go}{ReportPeer} instance and batches newly found peers up to a specified batch size before flushing and actually reporting them.
\mintinline{go}{AutoCommitReport} automatically flushes a delegated \mintinline{go}{ReportPeer} instance after a fixed amount of time; combined with \mintinline{go}{BatchedReport}, it ensures that batches are written regularly even if the batch limit has not been reached yet.
\mintinline{go}{CombinedReport} works analogously to \mintinline{go}{CombinedFinder} and combines many \mintinline{go}{ReportPeer} instances into one.
\mintinline{go}{PingPeer} and \mintinline{go}{CrawlPeer} use the implementation of the botnet \mintinline{go}{Protocol} to perform the actual crawling in predefined intervals, which can be overwritten on a per \mintinline{go}{PeerTask} basis.
The server-side part of the system consists of a \ac{grpc} server that handles the client requests, a scheduler that assigns new peers, and a \mintinline{go}{Strategy} interface that makes the policy for assigning tasks to crawlers pluggable.
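A \mintinline{go}{Strategy} might look like the following sketch; the method signature and the round-robin policy are assumptions for illustration, not the actual scheduler:

```go
package main

// Strategy decides which crawler is assigned which peers (the method
// signature is an assumption for illustration).
type Strategy interface {
	Assign(peers []string, crawlers []string) map[string][]string
}

// RoundRobin is a trivial Strategy that spreads peers evenly over the
// available crawlers.
type RoundRobin struct{}

func (RoundRobin) Assign(peers []string, crawlers []string) map[string][]string {
	assignment := make(map[string][]string, len(crawlers))
	for i, peer := range peers {
		crawler := crawlers[i%len(crawlers)]
		assignment[crawler] = append(assignment[crawler], peer)
	}
	return assignment
}
```

Keeping the policy behind an interface allows swapping in, for example, an IP-range-based partitioning without touching the \ac{grpc} server or the scheduler.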
%{{{ fig:bachend_arch
\begin{figure}[h]
\centering
\includegraphics[width=1\linewidth]{backend_architecture.drawio.pdf}
\caption{Architecture of the \ac{grpc} backend}\label{fig:bachend_arch}
\end{figure}
%}}}fig:bachend_arch
%}}} implementation
%{{{ conclusion
\clearpage{}
\section{Conclusion}
Collaborative monitoring of \ac{p2p} botnets allows circumventing some anti-monitoring efforts.
We were able to show that it also enables more efficient monitoring of larger botnets, since each peer needs to be visited by only one crawler.
The current concept of independent crawlers in \ac{bms} can also use multiple workers, but there is no way to ensure that a peer is not watched by multiple crawlers, which wastes resources.
A collaborative approach therefore helps to circumvent anti-monitoring and monitoring detection mechanisms and improves resource usage when monitoring large botnets.
Graph ranking algorithms, on the other hand, have proven hard to bypass without deploying large numbers of sensor nodes.
Luckily, most of the anti-monitoring and monitoring detection techniques discussed in this work are of an academic nature and have not yet been deployed in real-world botnets.
Further investigation and improvements in \ac{p2p} botnet monitoring are required to prevent a situation in which a botmaster implements these currently theoretical concepts and renders monitoring, as it is done today, ineffective.
%}}} conclusion
%{{{ further work
\clearpage{}
\section{Further Work}
Following this work, the existing crawlers should be rewritten using the new abstraction.
This might bring performance issues to light, which can be solved by investigating the optimizations of the old implementation and porting them to the new one.
Another way to expand on this work is automatically scaling the available crawlers up and down, depending on the botnet size and the number of concurrently online peers.
Doing so would allow a constant crawl interval for even highly volatile botnets.
Autoscaling features offered by many cloud-computing providers can be evaluated to automatically add or remove crawlers based on the monitoring load, a botnet's size, and the number of active peers.
This should also allow the creation of workers with new IP addresses in different geolocations in a fast, easy and automated way.
This also requires investigating hosting providers which allow botnet crawling by their terms of use.
The current backend implementation assumes an immutable set of crawlers.
For autoscaling to work, efficient reassignment of peers has to be implemented to account for added or removed workers.
Placing sensors at churned peers, or at peers with suspicious network activity (\eg{} those behind carrier-grade \acp{nat}), might itself offer another characteristic by which to flag sensors in a botnet.
The feasibility of this approach, and possible mitigations for the problem, should be investigated.
%}}} further work
%{{{ acknowledgments
\clearpage{}
\section*{Acknowledgments}
Finally, I would like to thank
\begin{itemize}
\item Prof.\ Dr.\ Christoph Skornia for being a helpful supervisor in this and many earlier works of mine
\item Leon Böck for offering the possibility to work on this research project, for regular feedback, and for his technical expertise
\item Valentin Sundermann for being available for insightful ad hoc discussions at any time of day for many years
\item Friends and family who pushed me into continuing this path
\end{itemize}
%}}} acknowledgments
% vim: set filetype=tex ts=2 sw=2 tw=0 et foldmethod=marker spell :