diff --git a/content.tex b/content.tex index c2b3b139..81ed96e0 100644 --- a/content.tex +++ b/content.tex @@ -21,9 +21,9 @@ In recent years, \ac{iot} botnets have been responsible for some of the biggest \section{Background} Botnets consist of infected computers, so called \textit{bots}, controlled by a \textit{botmaster}. -\textit{Centralized} and \textit{decentralized botnets} use one or more coordinating hosts called \textit{\ac{c2} servers} respectively\todo{wording}. +\textit{Centralized} and \textit{decentralized botnets} use one or more coordinating hosts, called \textit{\ac{c2} servers}, respectively. These \ac{c2} servers can use any protocol from \ac{irc} over \ac{http} to Twitter~\cite{bib:pantic_covert_2015} as communication channel with the infected hosts. -The abuse of infected systems includes several activities---\ac{ddos} attacks, banking fraud, proxies to hide the attacker's identity, sending of spam emails\dots{} +The abuse of infected systems includes several activities---\ac{ddos} attacks, banking fraud, proxies to hide the attacker's identity, sending of spam emails, just to name a few. Analyzing and shutting down a centralized or decentralized botnet is comparatively easy since the central means of communication (the \ac{c2} IP addresses or domain names, Twitter handles or \ac{irc} channels), can be extracted from the malicious binaries or determined by analyzing network traffic and can therefore be considered publicly known. @@ -50,7 +50,7 @@ To complicate take-down attempts, botnet operators came up with a number of idea A number of botnet operations were shut down like this~\cite{bib:nadji_beheading_2013} and as the defenders upped their game, so did attackers---the concept of \ac{p2p} botnets emerged. The idea is to build a distributed network without \acp{spof} in the form of \ac{c2} servers as shown in \Fref{fig:p2p}. 
-In a \ac{p2p} botnet, each node in the network knows a number of its neighbors and connects to those, each of these neighbors has a list of neighbors on its own, and so on. +In a \ac{p2p} botnet, each node in the network knows a number of its neighbors and connects to those. Each of these neighbors in turn has its own list of neighbors, and so on. The botmaster only needs to join the network to send new commands or receive stolen data. Any of the nodes in \Fref{fig:p2p} could be the botmaster but they don't even have to be online all the time since the peers will stay connected autonomously. In fact there have been arrests of operators of \ac{p2p} botnets but due to the autonomy offered by the distributed approach, the botnet keeps communicating~\cite{bib:netlab_mozi}. @@ -58,6 +58,8 @@ Especially worm-like botnets, where each peer tries to find and infect other sys This lack of a \ac{spof} makes \ac{p2p} botnets more resilient to take-down attempts since the communication is not stopped and botmasters can easily rejoin the network and send commands. +Successful take-downs of a \ac{p2p} botnet require intricate knowledge of the network topology, protocol characteristics and participating peers. +This knowledge can be obtained by monitoring peer activity in the botnet. Bots in a \ac{p2p} botnet can be split into two distinct groups according to their reachability: peers that are not publicly reachable (\eg{} because they are behind a \ac{nat} router or firewall) and those that are publicly reachable, also known as \textit{superpeers}. In contrast to centralized botnets with a fixed set of \ac{c2} servers, in a \ac{p2p} botnet, every superpeer might take the role of a \ac{c2} server and \textit{non-superpeers} will connect to those superpeers when joining the network.
@@ -178,7 +180,6 @@ The monetary value of these botnets directly correlates with the amount of effor Some of these countermeasures are explored by \citeauthor{bib:andriesse_reliable_2015} in \citetitle{bib:andriesse_reliable_2015} and include deterrence, which limits the number of allowed bots per IP address or subnet to 1; blacklisting, where known crawlers and sensors are blocked from communicating with other bots in the network (mostly IP based); disinformation, when fake bots are placed in the peer lists, which invalidates the data collected by crawlers; and active retaliation like \ac{ddos} attacks against sensors or crawlers~\cite{bib:andriesse_reliable_2015}. -Successful take-downs of a \ac{p2p} botnet requires intricate knowledge over the network topology, protocol characteristics and participating peers. In this work we try to find ways to make the monitoring and information gathering phase more efficient and resilient to detection. %}}} monitoring prevention @@ -309,13 +310,11 @@ While it is possible to run multiple, uncoordinated crawlers, multiple crawlers The load balancing strategy solves this problem by systematically splitting the crawl tasks into chunks and distributes them among the available crawlers. The following load balancing strategies will be investigated: -\begin{itemize} - \item Round Robin. See~\Fref{sec:rr} +\begin{description} + \item[Round Robin] Evenly distribute the peers between crawlers in the order they are found - \item Assuming IP addresses are evenly distributed and so are infections, take the IP address as an \SI{32}{\bit} integer modulo \(\abs{C}\). 
See~\Fref{sec:ipPart} - Problem: reassignment if a crawler joins or leaves -\end{itemize} -\todo{remove?} + \item[IP-based partitioning] Use the uniform distribution of cryptographic hash functions to assign peers to crawlers in a random but still evenly distributed manner +\end{description} Load balancing in itself does not help prevent the detection of crawlers but it allows better usage of available resources. It prevents unintentionally crawling the same peer with multiple crawlers and allows crawling of bigger botnets where the uncoordinated approach would reach its limit and could only be worked around by scaling up the machine where the crawler is executed. @@ -332,7 +331,7 @@ With \(G\) being the greatest common divisor of all the crawler's capabilities, \(\frac{cap(c_i)}{B}\) gives us the percentage of the work a crawler is assigned. % The set of target peers \(P = \), is partitioned into \(|C|\) subsets according to \(W(c_i)\) and each subset is assigned to its crawler \(c_i\). % The mapping \mintinline{go}{gcd(C)} is the greatest common divisor of all peers in \mintinline{go}{C}, \(\text{maxWeight}(C) = \max \{ \forall c \in C : W(c) \}\). -The algorithm in \Fref{lst:wrr}\todo{page numbers for forward refs?} distributes the work according to the crawler's capabilities. +The algorithm in \Fref{lst:wrr} distributes the work according to the crawlers' capabilities. \begin{listing} \begin{minted}{go} @@ -390,18 +389,18 @@ For the use case at hand, only the uniform distribution property is required so This strategy can also be weighted using the crawlers' capabilities by modifying the list of available workers so that a worker can appear multiple times according to its weight. The weighting algorithm from \Fref{lst:wrr} is used to create the weighted multiset of crawlers \(C_W\) and the mapping changes to \(m(i) = H(i) \mod \abs{C_W}\).
-\begin{figure}[H] - \centering - \includegraphics[width=1\linewidth]{./md5_ip_dist.png} - \caption{Distribution of the lowest byte of \ac{md5} hashes over IPv4}\label{fig:md5IPDist} -\end{figure} -\todo{remove this?} +% \begin{figure}[H] +% \centering +% \includegraphics[width=1\linewidth]{./md5_ip_dist.png} +% \caption{Distribution of the lowest byte of \ac{md5} hashes over IPv4}\label{fig:md5IPDist} +% \end{figure} +% \todo{remove this?} \ac{md5} returns a \SI{128}{\bit} hash value. The Go standard library includes helpers for arbitrarily sized integers\footnote{\url{https://pkg.go.dev/math/big\#Int}}. This helps us implement the mapping \(m\) from above. -By exploiting the even distribution offered by hashing, the work of each crawler is also evenly distributed over all IP subnets, \ac{as} and geolocations. +By exploiting the even distribution offered by hashing, the work of each crawler is also approximately evenly distributed over all IP subnets, \acp{as} and geolocations. This ensures neighboring peers (\eg{} in the same \ac{as}, geolocation or IP subnet) get visited by different crawlers. It also allows us to get rid of the state in our strategy since we don't have to keep track of the last crawler we assigned a task to, making it easier to implement and reason about.
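A minimal sketch of the mapping \(m(i) = H(i) \mod \abs{C}\) using \mintinline{go}{crypto/md5} and \mintinline{go}{math/big} could look as follows; the peer key format (IP and port as a string) is an assumption for illustration, not the format used by the crawler:

```go
package main

import (
	"crypto/md5"
	"fmt"
	"math/big"
)

// assign maps a peer to one of n crawlers by treating the 128 bit
// MD5 digest as one big integer and reducing it modulo n
func assign(peer string, n int64) int64 {
	digest := md5.Sum([]byte(peer))
	h := new(big.Int).SetBytes(digest[:])
	return new(big.Int).Mod(h, big.NewInt(n)).Int64()
}

func main() {
	for _, p := range []string{"198.51.100.7:4711", "203.0.113.42:4711"} {
		fmt.Printf("%s -> crawler %d\n", p, assign(p, 4))
	}
}
```

Because the assignment depends only on the hash of the peer itself, any crawler can recompute it locally without shared state, which is exactly the statelessness property mentioned above.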
@@ -433,79 +432,85 @@ With \(L \in \mathbb{N}\) being the frequency limit at which a crawler will be b The number of crawlers \(n\) required to achieve the frequency \(F\) without being blacklisted and the offset \(o\) between crawlers are defined as \begin{align*} - C &= \left\lceil \frac{F}{L} \right\rceil \\ - O &= \frac{\SI{1}{\request}}{F} + n &= \left\lceil \frac{F}{L} \right\rceil \\ + o &= \frac{\SI{1}{\request}}{F} \end{align*} Taking advantage of the \mintinline{go}{StartAt} field from the \mintinline{go}{PeerTask} returned by the \mintinline{go}{requestTasks} primitive above, the crawlers can be scheduled offset by \(o\) at a frequency \(L\) to ensure the overall requests to each peer are evenly distributed over time. -Given a limit \(L = \SI{5}{\request\per 100\second}\)\todo{better numbers for example?}, crawling a botnet at \(F = \SI{20}{\request\per 100\second}\) requires \(C = \left\lceil \frac{\SI{20}{\request\per 100\second}}{\SI{5}{\request\per 100\second}} \right\rceil = 4\) crawlers. -Those crawlers must be scheduled \(O = \frac{\SI{1}{\request}}{\SI{20}{\request\per 100\second}} = \SI{5}{\second}\) apart at a frequency of \(L\) for an even request distribution. +Given a limit \(L = \SI{6}{\request\per\minute}\), crawling a botnet at \(F = \SI{24}{\request\per\minute}\) requires \(n = \left\lceil \frac{\SI{24}{\request\per\minute}}{\SI{6}{\request\per\minute}} \right\rceil = 4\) crawlers. +Those crawlers must be scheduled \(o = \frac{\SI{1}{\request}}{\SI{24}{\request\per\minute}} = \SI{2.5}{\second}\) apart at a frequency of \(L\) for an even request distribution.
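The arithmetic of this example can be checked with a short program. The function names are hypothetical helpers introduced only to mirror the two formulas; the values are the \(L = \SI{6}{\request\per\minute}\), \(F = \SI{24}{\request\per\minute}\) from the example:

```go
package main

import (
	"fmt"
	"math"
)

// crawlersNeeded and offsetSeconds mirror the formulas for n and o above;
// frequencies are given in requests per minute
func crawlersNeeded(f, l float64) int { return int(math.Ceil(f / l)) }
func offsetSeconds(f float64) float64 { return 60.0 / f }

func main() {
	l, f := 6.0, 24.0 // the limit L and target frequency F from the example
	n, o := crawlersNeeded(f, l), offsetSeconds(f)
	fmt.Printf("n = %d crawlers, o = %.1fs offset\n", n, o) // n = 4, o = 2.5s
	for i := 0; i < n; i++ {
		fmt.Printf("C%d starts at %.1fs\n", i, float64(i)*o)
	}
}
```

The printed start times 0, 2.5, 5 and 7.5 seconds correspond to the staggered schedule shown in the timeline figure.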
%{{{ fig:crawler_timeline \begin{figure}[h] \centering -\begin{chronology}[10]{0}{100}{0.9\textwidth} +\begin{chronology}[10]{0}{60}{0.9\textwidth} \event{0}{\(C_0\)} + \event{10}{\(C_0\)} \event{20}{\(C_0\)} + \event{30}{\(C_0\)} \event{40}{\(C_0\)} + \event{50}{\(C_0\)} \event{60}{\(C_0\)} - \event{80}{\(C_0\)} - \event{100}{\(C_0\)} - \event{5}{\(C_1\)} - \event{25}{\(C_1\)} - \event{45}{\(C_1\)} - \event{65}{\(C_1\)} - \event{85}{\(C_1\)} + \event{2.5}{\(C_1\)} + \event{12.5}{\(C_1\)} + \event{22.5}{\(C_1\)} + \event{32.5}{\(C_1\)} + \event{42.5}{\(C_1\)} + \event{52.5}{\(C_1\)} - \event{10}{\(C_2\)} - \event{30}{\(C_2\)} - \event{50}{\(C_2\)} - \event{70}{\(C_2\)} - \event{90}{\(C_2\)} + \event{5}{\(C_2\)} + \event{15}{\(C_2\)} + \event{25}{\(C_2\)} + \event{35}{\(C_2\)} + \event{45}{\(C_2\)} + \event{55}{\(C_2\)} - \event{15}{\(C_3\)} - \event{35}{\(C_3\)} - \event{55}{\(C_3\)} - \event{75}{\(C_3\)} - \event{95}{\(C_3\)} + \event{7.5}{\(C_3\)} + \event{17.5}{\(C_3\)} + \event{27.5}{\(C_3\)} + \event{37.5}{\(C_3\)} + \event{47.5}{\(C_3\)} + \event{57.5}{\(C_3\)} \end{chronology} -\caption{Timeline of crawler events as seen from a peer when crawled by multiple crawlers}\label{fig:crawler_timeline} +\caption{Timeline of crawler events when optimized for effective frequency}\label{fig:crawlerTimelineEffective} \end{figure} %}}} fig:crawler_timeline -As can be seen in~\Fref{fig:crawler_timeline}, each crawler \(C_0\) to \(C_3\) performs only \SI{5}{\request\per 100\second} while overall achieving \(\SI{20}{\request\per 100\second}\). +As can be seen in~\Fref{fig:crawlerTimelineEffective}, each crawler \(C_0\) to \(C_3\) performs only \SI{6}{\request\per\minute} while overall achieving \(\SI{24}{\request\per\minute}\). -Vice versa given an amount of crawlers \(C\) and a request limit \(L\), the effective frequency \(F\) can be maximized to \(F = C \times L\) without hitting the limit \(L\) and being blocked.
+Vice versa, given a number of crawlers \(n\) and a request limit \(L\), the effective frequency \(F\) can be maximized to \(F = n \times L\) without hitting the limit \(L\) and being blocked. -Using the example from above with \(L = \SI{5}{\request\per 100\second}\) but now only two crawlers \(C = 2\), it is still possible to achieve an effective frequency of \(F = 2 \times \SI{5}{\request\per 100\second} = \SI{10}{\request\per 100\second}\) and \(O = \frac{\SI{1}{\request}}{\SI{10}{\request\per 100\second}} = \SI{10}{s}\): +Using the example from above with \(L = \SI{6}{\request\per\minute}\) but now only two crawlers \(n = 2\), it is still possible to achieve an effective frequency of \(F = 2 \times \SI{6}{\request\per\minute} = \SI{12}{\request\per\minute}\) and \(o = \frac{\SI{1}{\request}}{\SI{12}{\request\per\minute}} = \SI{5}{\second}\): %TODO: name %{{{ fig:crawler_timeline \begin{figure}[h] \centering -\begin{chronology}[10]{0}{100}{0.9\textwidth} +\begin{chronology}[10]{0}{60}{0.9\textwidth} \event{0}{\(C_0\)} + \event{10}{\(C_0\)} \event{20}{\(C_0\)} + \event{30}{\(C_0\)} \event{40}{\(C_0\)} + \event{50}{\(C_0\)} \event{60}{\(C_0\)} - \event{80}{\(C_0\)} - \event{100}{\(C_0\)} - \event{10}{\(C_1\)} - \event{30}{\(C_1\)} - \event{50}{\(C_1\)} - \event{70}{\(C_1\)} - \event{90}{\(C_1\)} + \event{5}{\(C_1\)} + \event{15}{\(C_1\)} + \event{25}{\(C_1\)} + \event{35}{\(C_1\)} + \event{45}{\(C_1\)} + \event{55}{\(C_1\)} \end{chronology} -% \caption{Timeline of crawler events as seen from a peer}\label{fig:crawler_timeline} +\caption{Timeline of crawler events when optimizing for a fixed number of crawlers}\label{fig:crawlerTimelineAmount} \end{figure} %}}} fig:crawler_timeline
+While the effective frequency of the whole system is halved compared to~\Fref{fig:crawlerTimelineEffective}, it is still twice the request limit \(L\) a single crawler could achieve without being blocked. %}}} frequency reduction diff --git a/report.pdf b/report.pdf index 13786ee7..ef1d722f 100644 Binary files a/report.pdf and b/report.pdf differ diff --git a/report.tex b/report.tex index 2f31ce87..088acb5b 100644 --- a/report.tex +++ b/report.tex @@ -116,6 +116,10 @@ headsepline, \graphicspath{{assets/}} % \setcounter{tocdepth}{2} +% makes biblatex happier to break URLs (for lowercase letters), see https://tex.stackexchange.com/a/134281 +\setcounter{biburllcpenalty}{1000} +% same as above, but for uppercase letters, slightly less happy to break there +\setcounter{biburlucpenalty}{2000} \begin{document}