Many personal and professional workflows depend on the internet so heavily that they do not work offline, and the pandemic we are living through has made this dependency even stronger.
In 2021 there were around 10 billion internet-connected \ac{iot} devices, and this number is estimated to more than double to 25 billion by 2030~\cite{bib:statista_iot_2020}.
Many of these devices run on outdated software, don't receive regular updates, and don't follow general security best practices.
While in 2016 only \SI{77}{\percent} of German households had a broadband connection with a bandwidth of \SI{50}{\mega\bit\per\second} or more, in 2020 it was already \SI{95}{\percent} with more than \SI{50}{\mega\bit\per\second} and \SI{59}{\percent} with at least \SI{1000}{\mega\bit\per\second}~\cite{bib:statista_broadband_2021}.
As small, always-online devices, often without any direct user interaction, behind internet connections that are getting faster and faster, \ac{iot} devices are a desirable target for botnet operators.
In recent years, \ac{iot} botnets have been responsible for some of the biggest \ac{ddos} attacks ever recorded---creating up to \SI{1}{\tera\bit\per\second} of traffic~\cite{bib:ars_ddos_2016}.
These \ac{c2} servers can use any protocol from \ac{irc} over \ac{http} to Twitter~\cite{bib:pantic_covert_2015} as a communication channel with the infected hosts.
The abuse of infected systems includes several activities---\ac{ddos} attacks, banking fraud, proxies to hide the attacker's identity, sending of spam emails\dots{}
Analyzing and shutting down a centralized or decentralized botnet is comparatively easy since the central means of communication (the \ac{c2} IP addresses or domain names, Twitter handles, or \ac{irc} channels) can be extracted from the malicious binaries or determined by analyzing network traffic and can therefore be considered publicly known.
A coordinated operation with help from law enforcement, hosting providers, domain registrars, and platform providers could shut down or take over the operation by changing how requests are routed or simply shutting down the controlling servers/accounts.
To complicate take-down attempts, botnet operators came up with a number of ideas: \acp{dga} use pseudorandomly generated domain names to render simple domain-blacklist-based approaches ineffective~\cite{bib:antonakakis_dga_2012}, and fast-flux \ac{dns} assigns a large pool of IP addresses randomly to the \ac{c2} domains to prevent IP-based blacklisting~\cite{bib:nazario_as_2008}.
A number of botnet operations were shut down like this~\cite{bib:nadji_beheading_2013} and as the defenders upped their game, so did attackers---the concept of \ac{p2p} botnets emerged.
In a \ac{p2p} botnet, each node in the network knows a number of its neighbors and connects to those; each of these neighbors has a list of neighbors of its own, and so on.
The botmaster only needs to join the network to send new commands or receive stolen data.
Any of the nodes in \Fref{fig:p2p} could be the botmaster but they don't even have to be online all the time since the peers will stay connected autonomously.
In fact there have been arrests of operators of \ac{p2p} botnets but due to the autonomy offered by the distributed approach, the botnet keeps communicating~\cite{bib:netlab_mozi}.
This lack of a \ac{spof} makes \ac{p2p} botnets more resilient to take-down attempts since the communication is not stopped and botmasters can easily rejoin the network and send commands.
Bots in a \ac{p2p} botnet can be split into two distinct groups according to their reachability: peers that are not publicly reachable (\eg{} because they are behind a \ac{nat} router or firewall) and those that are publicly reachable, also known as \textit{superpeers}.
In contrast to centralized botnets with a fixed set of \ac{c2} servers, in a \ac{p2p} botnet, every superpeer might take the role of a \ac{c2} server, and \textit{non-superpeers} will connect to those superpeers when joining the network.
To enable peers to connect to unstructured botnets, the malware binaries include hardcoded lists of superpeers for the newly infected systems to connect to.
\(G\) is not required to be a connected graph but might consist of multiple disjoint components~\cite{bib:rossow_sok_2013}. Components consisting of peers that are infected by the same bot are considered part of the same graph.
This has some advantages in that it is not possible for botmasters to detect or prevent data collection of that kind, but it is not trivial to distinguish valid \ac{p2p} application traffic (\eg{} BitTorrent, Skype, cryptocurrencies, \ldots) from \ac{p2p} bots.
Like most botnet detection mechanisms, passive ones work by building communication graphs and finding tightly coupled subgraphs that might be indicative of a botnet~\cite{bib:botgrep2010}. An advantage of passive detection is that it is independent of protocol details, specific binaries, or the structure of the network (\ac{p2p} vs.\ centralized/decentralized)~\cite{bib:botminer2008}.
\item Large scale network analysis (hard to differentiate from legitimate \ac{p2p} traffic (\eg{} BitTorrent), hard to get data, knowledge of some known bots required)~\cite{bib:zhang_building_2014}
For active detection, a subset of the botnet protocol and behavior is reimplemented to take part in the network.
To do so, samples of the malware are reverse engineered to understand and recreate the protocol.
This partial implementation includes the communication part of the botnet but ignores the malicious functionality so as not to support or take part in illicit activity.
% The difference in behaviour from the reference implementation and conspicuous graph properties (\eg{} high \(\deg^{+}\) vs.\ low \(\deg^{-}\)) of these sensors allows botmasters to detect and block the sensor nodes.
There are two subtypes of active detection: \textit{sensors} wait to be contacted by other peers, while \textit{crawlers} actively query known bots and recursively ask for their neighbors~\cite{bib:karuppayah_sensorbuster_2017}.
Obviously crawlers can only detect superpeers and therefore only see a small subset of the network, while sensors are also contacted by peers in private networks and behind firewalls.
To accurately monitor a \ac{p2p} botnet, a hybrid approach of crawlers and sensors is required.
The constantly growing damage produced by botnets has many researchers and law enforcement agencies trying to shut down these operations~\cite{bib:nadji_beheading_2013, bib:nadji_still_2017, bib:dittrich_takeover_2012, bib:fbiTakedown2014}.
The monetary value of these botnets directly correlates with the amount of effort botmasters are willing to put into implementing defense mechanisms against take-down attempts.
Some of these countermeasures are explored by \citeauthor{bib:andriesse_reliable_2015} in \citetitle{bib:andriesse_reliable_2015} and include deterrence, which limits the number of allowed bots per IP address or subnet to 1; blacklisting, where known crawlers and sensors are blocked from communicating with other bots in the network (mostly IP based); disinformation, when fake bots are placed in the peer lists, which invalidates the data collected by crawlers; and active retaliation like \ac{ddos} attacks against sensors or crawlers~\cite{bib:andriesse_reliable_2015}.
A successful take-down of a \ac{p2p} botnet requires intricate knowledge of the network topology, protocol characteristics, and participating peers.
In this work we try to find ways to make the monitoring and information gathering phase more efficient and resilient to detection.
The implementation of the concepts of this work will be done as part of \ac{bms}\footnotemark, a monitoring platform for \ac{p2p} botnets described by \citeauthor{bib:bock_poster_2019} in \citetitle{bib:bock_poster_2019}.
\Ac{bms} is intended for a hybrid active approach of crawlers and sensors (reimplementations of a botnet's \ac{p2p} protocol that do not perform malicious actions) to collect live data from active botnets.
In an earlier project, we implemented different node ranking algorithms (among others \enquote{PageRank}~\cite{bib:page_pagerank_1998}) to detect sensor candidates in a botnet, as described in \citetitle{bib:karuppayah_sensorbuster_2017}.
Both ranking algorithms exploit the differences in \(\deg^+\) and \(\deg^-\) for sensors to weight the nodes.
The goal of this work is to complicate detection mechanisms like this for botmasters by centralizing the coordination of the system's crawlers and sensors, thereby reducing each monitoring node's rank for specific graph metrics.
The coordinated work distribution also helps in efficiently monitoring large botnets where one crawler is not enough to track all peers.
The changes should allow the current crawlers and sensors to use the new abstraction with as few changes as possible to the existing code.
The goal of this work is to show how cooperative monitoring of a \ac{p2p} botnet can help with the following problems:
\begin{itemize}
\item Impede detection of monitoring attempts by reducing the impact of aforementioned graph metrics
\item Circumvent anti-monitoring techniques
\item Make crawling more efficient
\end{itemize}
The final results should be as general as possible and not depend on any botnet's specific behaviour (except for the mentioned anti-monitoring techniques, which might be unique to some botnets), but we assume that every \ac{p2p} botnet has some way of determining a bot's neighbors.
The general idea for the implementation is to first report newly found nodes back to the \ac{bms} backend, where the graph of the known network is created and a fitting worker is selected to achieve the goal of the respective coordination strategy.
That worker will be responsible for monitoring the new node.
If it is not possible to select a specific sensor so that the monitoring activity stays inconspicuous, the coordinator can do a complete shuffle of all nodes between the sensors to restore the wanted graph properties or warn if more sensors are required to stay undetected.
The improved crawler system should allow new crawlers to register themselves and their capabilities (\eg{} bandwidth, geolocation), so the amount of work can be scaled accordingly between hosts.
\item[Request Tasks] Receive a batch of crawl tasks from the coordinator.
The tasks consist of the target peer, whether the worker should start or stop monitoring the peer, when the monitoring should start and stop, and at which frequency the peer should be contacted.
This assumption greatly simplifies the implementation due to the lack of changing state that has to be tracked while still exploring the described strategies.
A production-ready implementation of the described techniques can drop this assumption but might have to recalculate the work distribution once a crawler joins or leaves.
The protocol primitives described in \Fref{sec:protPrim} already allow for this to be implemented by first creating tasks with the \mintinline{go}{StopCrawling} flag set to true for all active tasks, then running the strategy again and creating the corresponding tasks to start crawling again.
Depending on a botnet's size, a single crawler is not enough to monitor all superpeers.
While it is possible to run multiple, uncoordinated crawlers, multiple crawlers can find and monitor the same peer, making the approach inefficient with regard to the computing resources at hand.
The load balancing strategy solves this problem by systematically splitting the crawl tasks into chunks and distributing them among the available crawlers.
\item Assuming IP addresses are evenly distributed and so are infections, take the IP address as a \SI{32}{\bit} integer modulo \(\abs{C}\). See~\Fref{sec:ipPart}.
It prevents unintentionally crawling the same peer with multiple crawlers and allows crawling of bigger botnets where the uncoordinated approach would reach its limit and could only be worked around by scaling up the machine where the crawler is executed.
This strategy distributes work evenly among crawlers by either naively assigning tasks to the crawlers rotationally or weighted according to their capabilities\todo{1 -- 2 sentences about naive rr?}.
To keep the distribution as even as possible, we keep track of the last crawler a task was assigned to and start with the next in line in the subsequent round of assignments.
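The rotational assignment with a remembered position can be sketched as follows. This is only a minimal illustration and not the \ac{bms} implementation: the type \mintinline{go}{roundRobin}, its method \mintinline{go}{assign}, and the use of plain strings for crawlers and tasks are assumptions made for the example.

```go
package main

import "fmt"

// roundRobin assigns tasks to crawlers in rotation and remembers where the
// previous round of assignments stopped (hypothetical type for illustration).
type roundRobin struct {
	crawlers []string
	next     int // index of the crawler that receives the next task
}

// assign distributes the tasks over the crawlers, starting with the crawler
// after the one that received the last task of the previous call.
func (r *roundRobin) assign(tasks []string) map[string][]string {
	out := make(map[string][]string)
	if len(r.crawlers) == 0 {
		return out
	}
	for _, t := range tasks {
		c := r.crawlers[r.next]
		out[c] = append(out[c], t)
		r.next = (r.next + 1) % len(r.crawlers)
	}
	return out
}

func main() {
	rr := &roundRobin{crawlers: []string{"c0", "c1", "c2"}}
	fmt.Println(rr.assign([]string{"p0", "p1", "p2", "p3"}))
	// The next round starts with c1, because c0 received the last task above.
	fmt.Println(rr.assign([]string{"p4"}))
}
```

Keeping the index \mintinline{go}{next} between calls is what prevents the first crawler in the list from receiving one extra task in every round.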
For the sake of simplicity, only the bandwidth will be considered as capability but it can be extended by any shared property between the crawlers, \eg{} available memory or processing power.
For a given crawler \(c_i \in C\) let \(cap(c_i)\) be the capability of the crawler.
The total available capability is \(B =\sum\limits_{c \in C} cap(c)\).
With \(G\) being the greatest common divisor of all the crawlers' capabilities, the weight of a crawler is \(W(c_i)=\frac{cap(c_i)}{G}\).
\(\frac{cap(c_i)}{B}\) gives us the fraction of the work a crawler is assigned.
% The set of target peers \(P = <p_0, p_1, \ldots, p_{n-1}>\), is partitioned into \(|C|\) subsets according to \(W(c_i)\) and each subset is assigned to its crawler \(c_i\).
% The mapping \mintinline{go}{gcd(C)} is the greatest common divisor of all peers in \mintinline{go}{C}, \(\text{maxWeight}(C) = \max \{ \forall c \in C : W(c) \}\).
The algorithm in \Fref{lst:wrr}\todo{page numbers for forward refs?} distributes the work according to the crawlers' capabilities.
To ensure better distribution, first every crawler is assigned one task, then, according to the capabilities, every crawler with a weight of 2 or more is assigned a task, and so on.\todo{better wording}
The set of crawlers \(\{a, b, c\}\) with the capabilities \(cap(a)=3\), \(cap(b)=2\), \(cap(c)=1\) would produce \(<a, b, c, a, b, a>\), allocating two and three times the work to crawlers \(b\) and \(a\) respectively.
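The round-wise assignment described above can be sketched in Go as follows. This is a minimal illustration under assumptions, not a copy of the referenced listing: the type \mintinline{go}{crawler} and the function \mintinline{go}{weightedSequence} are hypothetical names, and capabilities are assumed to be positive integers.

```go
package main

import "fmt"

// crawler is a hypothetical stand-in for a registered crawler.
type crawler struct {
	name       string
	capability int // e.g. bandwidth, assumed to be a positive integer
}

// gcd returns the greatest common divisor of a and b.
func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

// weightedSequence returns the order in which crawlers receive tasks:
// in round k, every crawler with a weight of at least k gets one task.
func weightedSequence(crawlers []crawler) []string {
	g := 0
	for _, c := range crawlers {
		g = gcd(g, c.capability)
	}
	if g == 0 {
		return nil // no crawlers or only zero capabilities
	}
	maxWeight := 0
	for _, c := range crawlers {
		if w := c.capability / g; w > maxWeight {
			maxWeight = w
		}
	}
	var seq []string
	for round := 1; round <= maxWeight; round++ {
		for _, c := range crawlers {
			if c.capability/g >= round {
				seq = append(seq, c.name)
			}
		}
	}
	return seq
}

func main() {
	// Reproduces the example sequence a, b, c, a, b, a from the text.
	fmt.Println(weightedSequence([]crawler{{"a", 3}, {"b", 2}, {"c", 1}}))
}
```

Interleaving the crawlers round by round, instead of emitting all tasks of one crawler in a row, spreads consecutive tasks over different crawlers.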
Calculating the hash of an IP address and distributing the work with regard to \(H(\text{IP})\mod\abs{C}\) creates roughly evenly sized buckets for each worker to handle.
For any hash function \(H\), this gives us the mapping \(m(i)= H(i)\mod\abs{C}\) to sort peers into buckets.
While the \ac{md5} hash function must be considered broken for cryptographic use~\cite{bib:stevensCollision}, it is faster to calculate than hash functions with longer output.
This strategy can also be weighted using the crawlers' capabilities by modifying the list of available workers so that a worker can appear multiple times according to its weight.
The weighting algorithm from \Fref{lst:wrr} is used to create the weighted multiset of crawlers \(C_W\) and the mapping changes to \(m(i)= H(i)\mod\abs{C_W}\).
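The mapping \(m(i)\) can be sketched in Go as follows. The sketch rests on assumptions that the text does not fix: the IP address is hashed in its string form, the first eight bytes of the \ac{md5} digest are interpreted as an unsigned integer, and the names \mintinline{go}{bucket} and \mintinline{go}{assign} are hypothetical.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// bucket maps an IP address to an index in [0, n) by hashing it with MD5 and
// reducing the first 8 bytes of the digest modulo n. n must be positive.
func bucket(ip string, n int) int {
	sum := md5.Sum([]byte(ip))
	return int(binary.BigEndian.Uint64(sum[:8]) % uint64(n))
}

// assign picks the crawler responsible for ip from a weighted list in which
// a crawler appears once per unit of weight.
func assign(ip string, weighted []string) string {
	return weighted[bucket(ip, len(weighted))]
}

func main() {
	// Weighted list for the capabilities 3, 2 and 1 of crawlers a, b and c.
	weighted := []string{"a", "b", "c", "a", "b", "a"}
	for _, ip := range []string{"192.0.2.1", "198.51.100.7", "203.0.113.42"} {
		fmt.Printf("%s -> %s\n", ip, assign(ip, weighted))
	}
}
```

Because the result depends only on the address and the list of crawlers, the same peer is always mapped to the same crawler as long as the list does not change.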
By exploiting the even distribution offered by hashing, the work of each crawler is also evenly distributed over all IP subnets, \ac{as} and geolocations.
This ensures neighboring peers (\eg{} in the same \ac{as}, geolocation or IP subnet) get visited by different crawlers.
It also allows us to get rid of the state in our strategy since we don't have to keep track of the last crawler we assigned a task to, making it easier to implement and reason about.
Using collaborative crawlers, an arbitrarily high crawl frequency can be achieved without being blacklisted.
Let \(L \in\mathbb{N}\) be the frequency limit at which a crawler will be blacklisted and \(F \in\mathbb{N}\) the crawl frequency that should be achieved.
The number of crawlers \(C\) required to achieve the frequency \(F\) without being blacklisted and the offset \(O\) between crawlers are defined as
Taking advantage of the \mintinline{go}{StartAt} field from the \mintinline{go}{PeerTask} returned by the \mintinline{go}{requestTasks} primitive above, the crawlers can be scheduled offset by \(O\) at a frequency \(L\) to ensure that the overall requests to each peer are evenly distributed over time.
Given a limit \(L =\SI{5}{\request\per100\second}\)\todo{better numbers for example?}, crawling a botnet at \(F =\SI{20}{\request\per100\second}\) requires \(C =\left\lceil\frac{\SI{20}{\request\per100\second}}{\SI{5}{\request\per100\second}}\right\rceil=4\) crawlers.
Those crawlers must be scheduled \(O =\frac{\SI{1}{\request}}{\SI{20}{\request\per100\second}}=\SI{5}{\second}\) apart at a frequency of \(L\) for an even request distribution.
As can be seen in~\Fref{fig:crawler_timeline}, each crawler \(C_0\) to \(C_3\) performs only \SI{5}{\request\per 100\second} while overall achieving \(\SI{20}{\request\per100\second}\).
Conversely, given a number of crawlers \(C\) and a request limit \(L\), the effective frequency \(F\) can be maximized to \(F = C \times L\) without hitting the limit \(L\) and being blocked.
Using the example from above with \(L =\SI{5}{\request\per100\second}\) but now only two crawlers \(C =2\), it is still possible to achieve an effective frequency of \(F =2\times\SI{5}{\request\per100\second}=\SI{10}{\request\per100\second}\) and \(O =\frac{\SI{1}{\request}}{\SI{10}{\request\per100\second}}=\SI{10}{\second}\):
While the effective frequency of the whole system is halved compared to~\Fref{fig:crawler_timeline}, it is still twice the limit \(L\) that applies to a single crawler.
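The arithmetic of both examples can be reproduced with a few lines of Go. This is a sketch under assumptions: frequencies are given as requests per window, the window length is given in seconds, and the function names are hypothetical.

```go
package main

import "fmt"

// crawlersNeeded returns the number of crawlers required to reach the
// frequency f if a single crawler may not exceed the limit l (ceil(f/l)).
func crawlersNeeded(f, l int) int {
	return (f + l - 1) / l
}

// maxFrequency returns the highest effective frequency that c crawlers can
// reach together without any of them exceeding the limit l.
func maxFrequency(c, l int) int {
	return c * l
}

// offset returns the spacing in seconds between two consecutive requests if
// f requests are spread evenly over a window of the given length.
func offset(f int, windowSeconds float64) float64 {
	return windowSeconds / float64(f)
}

// startTimes returns the start offset in seconds for each of c crawlers.
func startTimes(c, f int, windowSeconds float64) []float64 {
	times := make([]float64, c)
	for i := range times {
		times[i] = float64(i) * offset(f, windowSeconds)
	}
	return times
}

func main() {
	// First example: L = 5 and F = 20 requests per 100 seconds.
	c := crawlersNeeded(20, 5)
	fmt.Println(c, offset(20, 100), startTimes(c, 20, 100))
	// Second example: only two crawlers are available.
	f := maxFrequency(2, 5)
	fmt.Println(f, offset(f, 100), startTimes(2, f, 100))
}
```

The start times would be the values a coordinator writes into the \mintinline{go}{StartAt} field of the tasks, with every crawler then repeating its request at the frequency \(L\).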
Building a complete graph \(G_C = K_{\abs{C}}\) between the crawlers by making them return the other crawlers on peer list requests would still produce a disconnected component and while being bigger and maybe not as obvious at first glance, it is still easily detectable since there is no path from \(G_C\) back to the main network (see~\Fref{fig:sensorbuster2} and~\Fref{tab:metricsTable}).
With \(v \in V\), \(\text{succ}(v)\) being the set of successors of \(v\) and \(\text{pred}(v)\) being the set of predecessors of \(v\), PageRank is recursively defined as~\cite{bib:page_pagerank_1998}:
For the first iteration, the PageRank of all nodes is set to the same initial value. \citeauthor{bib:page_pagerank_1998} argue that when iterating often enough, any value can be chosen~\cite{bib:page_pagerank_1998}.
When PageRank is used to rank websites in search results, the dampingFactor describes the probability that a person following links on the web continues to do so.
For simplicity---and since it is not required to model human behaviour for automated crawling and ranking---a dampingFactor of \(1.0\) will be used, which simplifies the formula to
In our experiments on a snapshot of the Sality~\cite{bib:falliere_sality_2011} botnet obtained from \ac{bms} over the span of \daterange{2021-04-21}{2021-04-28}, even a single iteration was enough to get values distinct enough to detect sensors and crawlers.
The distribution graphs in \Fref{fig:dist_sr_25}, \Fref{fig:dist_sr_50} and \Fref{fig:dist_sr_75} show that the initial rank has no effect on the distribution, only on the actual numeric rank values.
For all combinations of initial value and PageRank iterations, the rank for a well-known crawler is in the \nth{95} percentile, so for our use case, those parameters do not matter.
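A single PageRank iteration with a dampingFactor of \(1.0\) can be sketched in Go as follows. The sketch assumes the common formulation in which every node passes its rank on to its successors in equal shares. The function name \mintinline{go}{pageRankStep} and the toy graph are hypothetical and not taken from the analyzed dataset.

```go
package main

import "fmt"

// pageRankStep performs one PageRank iteration with a damping factor of 1.0:
// every node splits its current rank evenly among its successors.
// succ maps each node to its successors, rank holds the current ranks.
func pageRankStep(succ map[string][]string, rank map[string]float64) map[string]float64 {
	next := make(map[string]float64, len(rank))
	for v := range rank {
		next[v] = 0
	}
	for p, targets := range succ {
		if len(targets) == 0 {
			continue // nodes without outgoing edges pass nothing on
		}
		share := rank[p] / float64(len(targets))
		for _, v := range targets {
			next[v] += share
		}
	}
	return next
}

func main() {
	// Toy graph: three peers in a ring that all know the sensor "s",
	// while "s" itself has no outgoing edges.
	succ := map[string][]string{
		"p0": {"p1", "s"},
		"p1": {"p2", "s"},
		"p2": {"p0", "s"},
		"s":  nil,
	}
	rank := map[string]float64{"p0": 1, "p1": 1, "p2": 1, "s": 1}
	// After one iteration, "s" holds the highest rank in this toy graph.
	fmt.Println(pageRankStep(succ, rank))
}
```

In the toy graph the node without outgoing edges collects rank from all peers but passes none on, which is the pattern that makes sensors and crawlers stand out.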
On average, peers in the analyzed dataset have \num{223} successors over the whole week.
Since crawlers never respond to peer list requests, they will always be detectable by the described approach but sensors might benefit from the following technique.
By responding to peer list requests with plausible data, one can make those metrics less suspicious, because it produces valid outgoing edges from the sensors.
Knowledge of only \num{90} peers leaving due to IP rotation would be enough to make a crawler look average in Sality\todo{repeat analysis, actual number}.
This number will differ between different botnets, depending on implementation details and size of the network\todo{upper limit for NL size as impl detail}.
Connecting the known sensors, effectively building a complete graph \(K_{\abs{C}}\) between them, creates \(\abs{C}-1\) outgoing edges per sensor.
In most cases this won't be enough to reach the number of edges that would be needed.
This also does not help against the \ac{wcc} metric, since it would create a bigger but still disconnected component.
By detecting when a peer has just left the system and combining this with knowledge about \acp{as}, peers that just left and belong to an \ac{as} with dynamic IP allocation (\eg{} many consumer broadband providers in the US and Europe) can be placed into the crawler's peer list.\todo{what is an AS}
If the timing of the churn event correlates with IP rotation in the \ac{as}, it can be assumed that the peer left because it was assigned a new IP address (not due to connectivity issues or going offline) and will not return using the same IP address.
It also helps with the PageRank and SensorRank metrics since the crawlers start to look like regular peers without actually supporting the network by relaying messages or propagating active peers.
Crawlers in \ac{bms} report to the backend using \acp{grpc}\footnote{\url{https://www.grpc.io}}.
Both crawlers and the backend \ac{grpc} server are implemented using the Go\footnote{\url{https://go.dev/}} programming language, so to make use of existing know-how and to allow others to use the implementation in the future, the coordinator backend and crawler abstraction were also implemented in Go.
\Ac{bms} already has an existing abstraction for crawlers.
This implementation is highly optimized but also tightly coupled, having grown over time.
The abstraction became leaky and extending it proved to be complicated.
A new crawler abstraction was created with testability, extensibility and most features of the existing implementation in mind, which can be ported back to be used by the existing crawlers.
This is used to implement the bootstrapping mechanism of the old crawler, where the list of bootstrap nodes is loaded from a text file once when the crawler is started.
The \mintinline{go}{PeerTask} instances returned by \mintinline{go}{FindPeer} contain the IP address and port of the peer, whether the crawler should start or stop the operation, when to start and stop crawling, and in which interval the peer should be crawled.
For each task, a \mintinline{go}{CrawlPeer} and \mintinline{go}{PingPeer} worker is started or stopped as specified in the received \mintinline{go}{PeerTask}.
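The following sketch shows how a task and its handling could look. Only the names \mintinline{go}{PeerTask}, \mintinline{go}{StopCrawling} and \mintinline{go}{StartAt} appear in this work. The remaining field names, the function \mintinline{go}{applyTask}, and the map that stands in for the running workers are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// PeerTask sketches the information carried by a task. The fields Addr,
// StopAt and Interval are hypothetical names.
type PeerTask struct {
	Addr         string        // IP address and port of the peer
	StopCrawling bool          // true: stop monitoring, false: start monitoring
	StartAt      time.Time     // when monitoring should begin
	StopAt       time.Time     // when monitoring should end
	Interval     time.Duration // time between two contacts of the peer
}

// applyTask updates the set of monitored peers according to one task.
// The map is a stand-in for starting and stopping the actual workers.
func applyTask(active map[string]PeerTask, t PeerTask) {
	if t.StopCrawling {
		delete(active, t.Addr)
		return
	}
	active[t.Addr] = t
}

func main() {
	active := make(map[string]PeerTask)
	applyTask(active, PeerTask{
		Addr:     "192.0.2.1:4000",
		StartAt:  time.Now(),
		Interval: 30 * time.Second,
	})
	fmt.Println(len(active)) // one peer is monitored
	applyTask(active, PeerTask{Addr: "192.0.2.1:4000", StopCrawling: true})
	fmt.Println(len(active)) // monitoring of the peer has stopped
}
```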
These tasks use the \mintinline{go}{ReportPeer} interface to report any new peer that is found.
Current report possibilities are \mintinline{go}{LoggingReport} to simply log new peers to get feedback from the crawler at runtime, and \mintinline{go}{BMSReport} which reports back to \ac{bms}.
\mintinline{go}{BatchedReport} wraps a delegate \mintinline{go}{ReportPeer} instance, batches newly found peers up to a specified batch size, and only then flushes and actually reports them.
\mintinline{go}{AutoCommitReport} will automatically flush a delegated \mintinline{go}{ReportPeer} instance after a fixed amount of time and is used in combination with \mintinline{go}{BatchedReport} to ensure the batches are written regularly, even if the batch limit is not reached yet.
\mintinline{go}{CombinedReport} works analogously to \mintinline{go}{CombinedFinder} and combines many \mintinline{go}{ReportPeer} instances into one.
\mintinline{go}{PingPeer} and \mintinline{go}{CrawlPeer} use the implementation of the botnet \mintinline{go}{Protocol} to perform the actual crawling in predefined intervals, which can be overridden on a per \mintinline{go}{PeerTask} basis.
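The delegation pattern behind these report types can be sketched as follows. The method set of \mintinline{go}{ReportPeer} is not given in this text, so the methods \mintinline{go}{Report} and \mintinline{go}{Flush}, the lower-case type \mintinline{go}{batchedReport}, and the in-memory \mintinline{go}{recorder} are assumptions made for illustration.

```go
package main

import "fmt"

// ReportPeer is a hypothetical version of the reporting interface.
type ReportPeer interface {
	Report(peer string) error
	Flush() error
}

// recorder keeps reported peers in memory and stands in for a real report
// target such as a logger or the backend.
type recorder struct{ peers []string }

func (r *recorder) Report(peer string) error {
	r.peers = append(r.peers, peer)
	return nil
}

func (r *recorder) Flush() error { return nil }

// batchedReport collects peers and forwards them to the delegate only once
// the batch size is reached or Flush is called explicitly.
type batchedReport struct {
	delegate ReportPeer
	size     int
	pending  []string
}

func (b *batchedReport) Report(peer string) error {
	b.pending = append(b.pending, peer)
	if len(b.pending) >= b.size {
		return b.Flush()
	}
	return nil
}

// Flush forwards all pending peers. On an error the pending list is kept,
// so a retry may report some peers twice.
func (b *batchedReport) Flush() error {
	for _, p := range b.pending {
		if err := b.delegate.Report(p); err != nil {
			return err
		}
	}
	b.pending = b.pending[:0]
	return b.delegate.Flush()
}

func main() {
	rec := &recorder{}
	b := &batchedReport{delegate: rec, size: 2}
	b.Report("192.0.2.1:4000")
	fmt.Println(len(rec.peers)) // nothing forwarded yet
	b.Report("198.51.100.7:4000")
	fmt.Println(len(rec.peers)) // batch size reached, both peers forwarded
}
```

A timer-based wrapper in the spirit of \mintinline{go}{AutoCommitReport} would only need to call \mintinline{go}{Flush} periodically, since the batching type implements the same interface as its delegate.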
The server-side part of the system consists of a \ac{grpc} server to handle the client requests, a scheduler to assign new peers, and a \mintinline{go}{Strategy} interface for modularity over how work is assigned to crawlers.
Collaborative monitoring of \ac{p2p} botnets allows circumventing some anti-monitoring efforts.
It also enables more effective monitoring systems for larger botnets, since each peer can be visited by only one crawler.
The current concept of independent crawlers in \ac{bms} can also use multiple workers, but there is no way to ensure a peer is not watched by multiple crawlers, which wastes resources.
This might bring some performance issues to light which can be solved by investigating the optimizations from the old implementation and applying them to the new one.
Another way to expand on this work is automatically scaling the available crawlers up and down, depending on the botnet size and the number of concurrently online peers.
Doing so would allow a constant crawl interval for even highly volatile botnets.
Placing churned peers or peers with suspicious network activity (those behind carrier-grade \acp{nat}) in the peer list might just offer another characteristic to flag sensors in a botnet.
This should be investigated and maybe there are ways to mitigate this problem.
Autoscaling features offered by many cloud-computing providers should be evaluated to automatically add or remove crawlers based on the monitoring load, a botnet's size and number of active peers.
This should also make it fast and easy to create workers with new IP addresses in different geolocations.