diff --git a/abstract.tex b/abstract.tex
index 46f8ae90..1ee8d64d 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -1,7 +1,8 @@
 \begin{abstract}
 Botnets pose a huge risk to general internet infrastructure and services.
-Distributed \Acs*{p2p} topologies make it harder to detect and take those botnets offline.
+Distributed \Acs*{p2p} topologies make those botnets harder to detect and more resilient to take-down attempts.
 To take a \ac{p2p} botnet down, it has to be monitored to estimate its size and learn about the network topology.
+% Monitoring requires some kind of participation in the network to
 With the growing damage and monetary value produced by such botnets, ideas emerged on how to detect and prevent monitoring activity in the network.
 This work explores ways to make monitoring of fully distributed botnets more efficient, resilient, and harder to detect by using a collaborative, coordinated approach.
 Further, we show how the coordinated approach helps in circumventing anti-monitoring techniques deployed by botnets.
diff --git a/assets/time_deviation/time_devi_c0.png b/assets/time_deviation/time_devi_c0.png
index f9bf8f5e..31cf6946 100644
Binary files a/assets/time_deviation/time_devi_c0.png and b/assets/time_deviation/time_devi_c0.png differ
diff --git a/assets/time_deviation/time_devi_c1.png b/assets/time_deviation/time_devi_c1.png
index 19f126fe..a592123a 100644
Binary files a/assets/time_deviation/time_devi_c1.png and b/assets/time_deviation/time_devi_c1.png differ
diff --git a/assets/time_deviation/time_devi_c2.png b/assets/time_deviation/time_devi_c2.png
index ce3289a8..71fd5bb2 100644
Binary files a/assets/time_deviation/time_devi_c2.png and b/assets/time_deviation/time_devi_c2.png differ
diff --git a/assets/time_deviation/time_devi_c3.png b/assets/time_deviation/time_devi_c3.png
index 71a12ca9..41294d21 100644
Binary files a/assets/time_deviation/time_devi_c3.png and b/assets/time_deviation/time_devi_c3.png differ
diff --git a/codes/frequency_deriv/frequency_deriv.py b/codes/frequency_deriv/frequency_deriv.py
index dcb775be..e6ec703c 100644
--- a/codes/frequency_deriv/frequency_deriv.py
+++ b/codes/frequency_deriv/frequency_deriv.py
@@ -1,5 +1,6 @@
 #!/usr/bin/env python3
 
+import numpy as np
 import statistics
 from collections import defaultdict
 from typing import Dict
@@ -26,7 +27,8 @@ def plot_devi(data: Dict[datetime, str]):
     # c = 0
     per_diff = defaultdict(list)
     for prev, next in zip(sor, sor[1:]):
-        diff = abs(2.5 - (next[0].timestamp() - prev[0].timestamp()))
+        # diff = abs(2.5 - (next[0].timestamp() - prev[0].timestamp()))
+        diff = (next[0].timestamp() - prev[0].timestamp()) - 2.5
         diffs.append(diff)
         per_crawler[prev[1]].append(prev[0])
         per_diff[prev[1]].append(diff)
@@ -72,16 +74,20 @@ def plot_devi(data: Dict[datetime, str]):
         t = per_crawler[c]
         devi = []
         for pre, nex in zip(t, t[1:]):
-            devi.append(abs(10 - (nex.timestamp() - pre.timestamp())))
-        x = [10 * x for x in range(len(devi))]
+            # devi.append(abs(10 - (nex.timestamp() - pre.timestamp())))
+            devi.append((nex.timestamp() - pre.timestamp()) - 10)
+        x = np.array([10 * x for x in range(len(devi))])
+        devi = np.array(devi)
 
         fig, ax = plt.subplots()
         ax.scatter(x, devi, s=10)
+        m, b = np.polyfit(x, devi, 1)
+        plt.plot(x, m * x + b, color='red')
         ax.set_title(f'Time deviation for {c}')
         ax.set_xlabel('Time passed in seconds')
         ax.set_ylabel('Deviation in seconds')
         plt.savefig(f'./time_devi_{c}.png')
         plt.close()
-        print(f'{c}: {statistics.mean(devi)}')
+        print(f'{c} & \\num{{{statistics.mean(devi)}}} \\\\')
 # for ts in per_crawler[c]:
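
The switch from an absolute to a signed deviation and the added trend line are easy to verify in isolation. The following self-contained sketch (not part of the repository; it uses synthetic timestamps, and the names are ours) reproduces the same computation on generated data:

#!/usr/bin/env python3
# Standalone sketch of the signed-deviation and trend-line computation in
# frequency_deriv.py, run on synthetic timestamps instead of real crawl logs.
import numpy as np

EXPECTED = 10.0  # assumed interval between two crawl events of one crawler

rng = np.random.default_rng(42)
# nominally one event every EXPECTED seconds, plus small network jitter
timestamps = np.cumsum(np.full(100, EXPECTED) + rng.normal(0, 0.001, 100))

# the signed deviation keeps the direction of the error (late vs. early),
# which the previous abs() discarded
devi = np.diff(timestamps) - EXPECTED
x = EXPECTED * np.arange(len(devi))

# least-squares line through the deviations; a slope away from zero would
# indicate systematic drift rather than symmetric jitter
m, b = np.polyfit(x, devi, 1)
print(f'mean deviation: {devi.mean():.6f} s, drift: {m:.2e} s/s')

Keeping the sign is what makes the fitted slope meaningful: with absolute values, early and late events would both push the line upwards and mask any drift.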
diff --git a/codes/frequency_deriv/time_devi.png b/codes/frequency_deriv/time_devi.png
index d04129ae..7de84b21 100644
Binary files a/codes/frequency_deriv/time_devi.png and b/codes/frequency_deriv/time_devi.png differ
diff --git a/codes/frequency_deriv/time_devi_c0.png b/codes/frequency_deriv/time_devi_c0.png
index f9bf8f5e..31cf6946 100644
Binary files a/codes/frequency_deriv/time_devi_c0.png and b/codes/frequency_deriv/time_devi_c0.png differ
diff --git a/codes/frequency_deriv/time_devi_c1.png b/codes/frequency_deriv/time_devi_c1.png
index 19f126fe..a592123a 100644
Binary files a/codes/frequency_deriv/time_devi_c1.png and b/codes/frequency_deriv/time_devi_c1.png differ
diff --git a/codes/frequency_deriv/time_devi_c2.png b/codes/frequency_deriv/time_devi_c2.png
index ce3289a8..71fd5bb2 100644
Binary files a/codes/frequency_deriv/time_devi_c2.png and b/codes/frequency_deriv/time_devi_c2.png differ
diff --git a/codes/frequency_deriv/time_devi_c3.png b/codes/frequency_deriv/time_devi_c3.png
index 71a12ca9..41294d21 100644
Binary files a/codes/frequency_deriv/time_devi_c3.png and b/codes/frequency_deriv/time_devi_c3.png differ
diff --git a/codes/frequency_deriv/xxx.png b/codes/frequency_deriv/xxx.png
index 89bd44f0..721ab95f 100644
Binary files a/codes/frequency_deriv/xxx.png and b/codes/frequency_deriv/xxx.png differ
diff --git a/content.tex b/content.tex
index 1bc8fcc3..a0337d35 100644
--- a/content.tex
+++ b/content.tex
@@ -361,7 +361,7 @@ Load balancing allows scaling out, which can be more cost-effective.
 
 \subsubsection{Round Robin Distribution}\label{sec:rr}
 
-This strategy distributes work evenly among crawlers by either naively assigning tasks to the crawlers rotationally or weighted according to their capabilities\todo{1 -- 2 sentences about naive rr?}.
+This strategy distributes work evenly among crawlers by assigning tasks either naively in rotation or weighted according to the crawlers' capabilities.
 To keep the distribution as even as possible, we keep track of the last crawler a task was assigned to and start with the next in line in the subsequent round of assignments.
 For the sake of simplicity, only the bandwidth will be considered as a capability, but it can be extended by any shared property between the crawlers, \eg{} available memory or processing power.
 For a given crawler \(c_i \in C\) let \(cap(c_i)\) be the capability of the crawler.
@@ -551,13 +551,12 @@ While the effective frequency of the whole system is halved compared to~\Fref{fi
 %}}} frequency reduction
 
 %{{{ against graph metrics
-\subsection{Creating Edges for Crawlers and Sensors}
+\subsection{Creating and Reducing Edges for Sensors}
 
 \citetitle*{bib:karuppayah_sensorbuster_2017} describes different graph metrics to find sensors in \ac{p2p} botnets.
 These metrics depend on the uneven ratio between incoming and outgoing edges for crawlers.
 The \emph{SensorBuster} metric uses \acp{wcc} since naive sensors don't have any edges back to the main network in the graph.
-Building a complete graph \(G_C = K_{\abs{C}}\) between the sensors and crawlers by making them return the other known worker on peer list requests would still produce a disconnected component and while being bigger and maybe not as obvious at first glance, it is still easily detectable since there is no path from \(G_C\) back to the main network (see~\Fref{fig:sensorbuster2} and~\Fref{tab:metricsTable}).
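
To make the SensorBuster argument concrete, the toy sketch below (our example, not thesis code; node names and edges are invented, and it models the naive case where the observed graph has no edges from real bots to the sensors) shows that even a complete graph among the sensors leaves them with no path back into the main network:

#!/usr/bin/env python3
# Toy illustration of the SensorBuster/WCC argument described above.
import networkx as nx

g = nx.DiGraph()
# main network: bots referencing each other in peer list replies
g.add_edges_from([('b0', 'b1'), ('b1', 'b2'), ('b2', 'b0'),
                  ('b2', 'b3'), ('b3', 'b0')])
# complete graph K_|C| among the sensors: valid-looking outgoing edges,
# but none of them lead back into the main network
sensors = ['s0', 's1', 's2']
g.add_edges_from((u, v) for u in sensors for v in sensors if u != v)

print(nx.descendants(g, 's0'))  # only reaches the other sensors
for wcc in nx.weakly_connected_components(g):
    print(sorted(wcc))          # bots and sensors fall into separate components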
 
 With \(v \in V\), \(\text{succ}(v)\) being the set of successors of \(v\) and \(\text{pred}(v)\) being the set of predecessors of \(v\), \emph{PageRank} is recursively defined as~\cite{bib:page_pagerank_1998}:
@@ -605,17 +604,19 @@ The following candidates to place on the neighbor list will be investigated:
 % Knowledge of only \num{90} peers leaving due to IP rotation would be enough to make a crawler look average in Sality\todo{repeat analysis, actual number}.
 % This number will differ between different botnets, depending on implementation details and size of the network\todo{upper limit for NL size as impl detail}.
 
-\subsubsection{Other Sensors or Crawlers}
+% \subsubsection{Other Sensors or Crawlers}
 
-Returning all the other sensors when responding to peer list requests, thereby effectively creating a complete graph \(K_{\abs{C}}\) among the workers, creates valid outgoing edges.
+\textbf{Other Sensors:} Returning all the other sensors when responding to peer list requests, thereby effectively creating a complete graph \(K_{\abs{C}}\) among the workers, creates valid outgoing edges.
 The resulting graph will still form a \ac{wcc} with no edges back into the main network.
+Building a complete graph \(G_C = K_{\abs{C}}\) between the sensors by making them return the other known workers on peer list requests would still produce a disconnected component. While bigger and maybe not as obvious at first glance, it is still easily detectable, since there is no path from \(G_C\) back to the main network (see~\Fref{fig:sensorbuster2} and~\Fref{tab:metricsTable}).\todo{where?}
+
 %{{{ churned peers
-\subsubsection{Churned Peers After IP Rotation}
+% \subsubsection{Churned Peers After IP Rotation}
 
-Churn describes the dynamics of peer participation in \ac{p2p} systems, \eg{} join and leave events~\cite{bib:stutzbach_churn_2006}.\todo{übergang}
-Detecting if a peer just left the system, in combination with knowledge about \acp{as}, peers that just left and came from an \ac{as} with dynamic IP allocation (\eg{} many consumer broadband providers in the US and Europe), can be placed into the crawler's peer list.\todo{what is an AS}
+\textbf{Churned peers after IP rotation:} Churn describes the dynamics of peer participation in \ac{p2p} systems, \eg{} join and leave events~\cite{bib:stutzbach_churn_2006}.
+By detecting that a peer just left the system, in combination with knowledge about \acp{as}, peers that left an \ac{as} with dynamic IP allocation (\eg{} many consumer broadband providers in the US and Europe) can be placed into the crawler's peer list.
 If the timing of the churn event correlates with IP rotation in the \ac{as}, it can be assumed that the peer left due to being assigned a new IP address---not due to connectivity issues or going offline---and will not return using the same IP address.
 These peers, when placed in the peer list of the crawlers, will introduce paths back into the main network and defeat the \ac{wcc} metric.
 It also helps with the PageRank and SensorRank metrics since the crawlers start to look like regular peers without actually supporting the network by relaying messages or propagating active peers.
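
A small experiment along these lines (our sketch, not thesis code; the graph shape and the damping factor of 0.85 are assumptions) shows how a back-edge from a churned peer changes the PageRank picture for a sensor:

#!/usr/bin/env python3
# Sketch: PageRank of a sensor with and without an outgoing edge back into
# the main network (e.g. a churned peer placed in its peer list).
import networkx as nx

def toy_graph(with_back_edge: bool) -> nx.DiGraph:
    g = nx.DiGraph([('b0', 'b1'), ('b1', 'b2'), ('b2', 'b0'), ('b1', 'b0')])
    g.add_edges_from([('b0', 's'), ('b2', 's')])  # bots know the sensor
    if with_back_edge:
        g.add_edge('s', 'b1')  # churned peer gives the sensor a valid successor
    return g

for back in (False, True):
    pr = nx.pagerank(toy_graph(back), alpha=0.85)
    print(f'back edge: {back}, PR(s) = {pr["s"]:.3f}')

With the back-edge, the sensor redistributes its rank into the main network instead of acting as a dangling node, which is the effect the paragraph above relies on.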
@@ -623,15 +624,15 @@ It also helps with the PageRank and SensorRank metrics since the crawlers start
 %}}} churned peers
 
 %{{{ cg nat
-\subsubsection{Peers Behind Carrier-Grade \acs*{nat}}
-
-Some peers show behavior, where their IP address changes almost after every request.
+\textbf{Peers behind carrier-grade \acs{nat}:} Some peers show behavior where their IP address changes after almost every request.
 Those peers can be used as fake neighbors and create valid-looking outgoing edges for the sensor.
 
 %}}} cg nat
 
-\clearpage{}
-\todo{clearpage?}
+% \clearpage{}
+% \todo{clearpage?}
 
 In theory, it would be possible to detect churned peers or peers behind carrier-grade \acs{nat} without coordinating the sensors, but the coordination gives us a few advantages:
 \begin{itemize}
@@ -956,11 +957,11 @@ The ideal distribution would be \SI{2.5}{\second} between every two events.
 Due to network latency and load from crawling other peers, we expect the actual result to deviate from the optimal value over time.
 With this experiment, we try to estimate the impact of the latency.
 
-\begin{figure}[H]
-	\centering
-	\includegraphics[width=1\linewidth]{time_devi.png}
-	\caption{Deviation from the expected interval}\label{fig:timeDevi}
-\end{figure}
+% \begin{figure}[H]
+% 	\centering
+% 	\includegraphics[width=1\linewidth]{time_devi.png}
+% 	\caption{Deviation from the expected interval}\label{fig:timeDevi}
+% \end{figure}
 
 \begin{landscape}
 \begin{figure}[H]
@@ -986,21 +987,24 @@ With this experiment, we try to estimate the impact of the latency.
 \end{figure}
 \end{landscape}
 
-The deviation between crawl events per crawler is below \SI{0.01}{\second} most of the time, with occasional outliers due to network latency or server load.
-
 \begin{table}[H]
 \centering
-\begin{tabular}{rr}
+\begin{tabular}{rS}
 	\textbf{Crawler} & {\textbf{Average Deviation in \si{\second}}} \\
-	c0 & \num{0.0005927812081134085} \\
-	c1 & \num{0.0003700713297978895} \\
-	c2 & \num{0.0006121075253902246} \\
-	c3 & \num{0.0020807891511268814} \\
+	c0 & 0.0003166149207321755 \\
+	c1 & 0.0002065727194268201 \\
+	c2 & 0.0003075813840032066 \\
+	c3 & 0.0038056359425696364 \\
 \end{tabular}
 \caption{Average deviation per crawler}\label{tab:perCralwerDeviation}
 \end{table}
 
+To get some geographic distribution, the crawlers are deployed in different locations: \emph{c0} in Falkenstein, Germany, \emph{c1} in Nuremberg, Germany, \emph{c2} in Helsinki, Finland, and \emph{c3} in Ashburn, USA.
+
+The average deviation per crawler stays below \SI{0.004}{\second}, even with some outliers due to network latency or server load.
+Crawler \emph{c3} is the furthest away from the monitored host, so a larger deviation due to network latency is expected.
+
+% In general it is below \SI{0.0007}{\second}, which is a surprisingly accurate result.
 In real-world scenarios, crawlers will monitor more than a single peer and the scheduling is expected to be less accurate.
 Still, the deviation will always stay below the effective frequency \(f\), because after exceeding \(f\), a crawler is overtaken by the next in line.
 The impact of the deviation when crawling real-world botnets has to be investigated, and if it proves to be a problem, the tasks have to be rescheduled periodically to prevent this from happening.
diff --git a/report.pdf b/report.pdf
index caf666ce..299682e3 100644
Binary files a/report.pdf and b/report.pdf differ
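
The claim that the deviation stays bounded by the effective frequency \(f\) can be sanity-checked with a small simulation (our sketch, not part of the repository; the jitter magnitude is an assumption): with \(n\) crawlers firing in a fixed rotation, a delayed crawler is simply overtaken by the next in line, so the observed gaps cluster around \(f\).

#!/usr/bin/env python3
# Simulation sketch: n coordinated crawlers in a fixed rotation, each firing
# every n*f seconds at offset i*f, plus per-request jitter.
import numpy as np

n, f = 4, 2.5  # four crawlers, effective frequency of 2.5 s
rng = np.random.default_rng(1)

events = []
for i in range(n):
    t = i * f + np.arange(100) * n * f  # crawler i fires at i*f, i*f + n*f, ...
    events.append(t + rng.normal(0, 0.05, t.size))

gaps = np.diff(np.sort(np.concatenate(events)))
print(f'max observed gap: {gaps.max():.3f} s (effective frequency f = {f} s)')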