diff --git a/content.tex b/content.tex
index c9c7a8e4..9c132bae 100644
--- a/content.tex
+++ b/content.tex
@@ -384,14 +384,59 @@ One of those, \enquote{SensorBuster} uses \acp{wcc} since crawlers don't have an
-Building a complete graph \(G_C = K_{\abs{C}}\) between the crawlers by making them return the other crawlers on peer list requests would still produce a disconnected component and while being bigger and maybe not as obvious at first glance, it is still easily detectable since there is no path from \(G_C\) back to the main network (see~\autoref{fig:sensorbuster2} and~\autoref{fig:metrics_table}).
+Building a complete graph \(G_C = K_{\abs{C}}\) between the crawlers by making them return the other crawlers on peer list requests would still produce a disconnected component; while bigger and maybe not as obvious at first glance, it is still easily detectable since there is no path from \(G_C\) back to the main network (see~\autoref{fig:sensorbuster2} and~\autoref{fig:metrics_table}).
 
 \todo{rank? deg+ - deg-?}
-With \(v \in V\), \(\text{rank}(v)\), \(\text{succ}(v)\) being the set of successors of \(v\) and \(\text{pred}(v)\) being the set of predecessors of \(v\), PageRank is defined as~\cite{bib:page_pagerank_1998}:
+With \(v \in V\), \(\text{succ}(v)\) being the set of successors of \(v\) and \(\text{pred}(v)\) being the set of predecessors of \(v\), PageRank is defined recursively as~\cite{bib:page_pagerank_1998}:
 \[
-	\text{PR}(v) = \text{dampingFactor} \times \sum\limits_{p \in \text{pred}(v)} \frac{\text{rank}(p)}{\abs{\text{succ}(p)}} + \frac{1 - \text{dampingFactor}}{\abs{V}}
+	\text{PR}(v) = \text{dampingFactor} \times \sum\limits_{p \in \text{pred}(v)} \frac{\text{PR}(p)}{\abs{\text{succ}(p)}} + \frac{1 - \text{dampingFactor}}{\abs{V}}
 \]
+For the first iteration, the PageRank of all nodes is set to the same initial value; given enough iterations, any initial value can be chosen~\cite{bib:page_pagerank_1998}.\todo{how often? experiments!}
+In our experiments on a snapshot of the Sality botnet exported from \ac{bms} over the span of\todo{export timespan}, 3 iterations were enough to produce values distinct enough to detect sensors and crawlers.
+
+\begin{figure}[H]
+	\centering
+\begin{tabular}{lllll}
+	Iteration & Avg.\ PR & Crawler PR & Avg.\ SR & Crawler SR \\
+	1 & wat? & wut? & wit? & wot? \\
+	2 & wat? & wut? & wit? & wot? \\
+	3 & wat? & wut? & wit? & wot? \\
+	4 & wat? & wut? & wit? & wot? \\
+	5 & wat? & wut? & wit? & wot? \\
+\end{tabular}
+	\caption{Values for PageRank iterations with initial rank \(\forall v \in V : \text{PR}(v) = 0.25\)}\label{fig:pr_iter_table_025}
+\end{figure}
+\todo{proper table formatting}
+
+\begin{figure}[H]
+	\centering
+\begin{tabular}{lllll}
+	Iteration & Avg.\ PR & Crawler PR & Avg.\ SR & Crawler SR \\
+	1 & wat? & wut? & wit? & wot? \\
+	2 & wat? & wut? & wit? & wot? \\
+	3 & wat? & wut? & wit? & wot? \\
+	4 & wat? & wut? & wit? & wot? \\
+	5 & wat? & wut? & wit? & wot? \\
+\end{tabular}
+	\caption{Values for PageRank iterations with initial rank \(\forall v \in V : \text{PR}(v) = 0.5\)}\label{fig:pr_iter_table_05}
+\end{figure}
+\todo{proper table formatting}
+
+\begin{figure}[H]
+	\centering
+\begin{tabular}{lllll}
+	Iteration & Avg.\ PR & Crawler PR & Avg.\ SR & Crawler SR \\
+	1 & wat? & wut? & wit? & wot? \\
+	2 & wat? & wut? & wit? & wot? \\
+	3 & wat? & wut? & wit? & wot? \\
+	4 & wat? & wut? & wit? & wot? \\
+	5 & wat? & wut? & wit? & wot? \\
+\end{tabular}
+	\caption{Values for PageRank iterations with initial rank \(\forall v \in V : \text{PR}(v) = 0.75\)}\label{fig:pr_iter_table_075}
+\end{figure}
+\todo{proper table formatting}
+
-The dampingFactor describes the probability of a person visiting links on the web to continue doing so, when using PageRank to rank websites in search results.
+When using PageRank to rank websites in search results, the dampingFactor describes the probability that a person following links on the web continues doing so.
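The iterative computation defined by the formula above can be sketched in Go. This is a minimal illustration, not the \ac{bms} implementation; the graph representation, function names, and the tiny example graph are assumptions made for the sketch:

```go
package main

import "fmt"

// pagerank runs the PageRank formula from the text for a fixed number of
// iterations on a directed graph given as successor adjacency lists.
// All nodes start with the same initial rank, as described above.
func pagerank(succ map[string][]string, damping, initial float64, iterations int) map[string]float64 {
	// Derive predecessor lists from the successor lists.
	pred := map[string][]string{}
	for v := range succ {
		pred[v] = nil
	}
	for p, ss := range succ {
		for _, s := range ss {
			pred[s] = append(pred[s], p)
		}
	}
	pr := map[string]float64{}
	for v := range pred {
		pr[v] = initial
	}
	n := float64(len(pr))
	for i := 0; i < iterations; i++ {
		next := map[string]float64{}
		for v := range pr {
			sum := 0.0
			for _, p := range pred[v] {
				// Each predecessor distributes its rank evenly
				// over its successors.
				sum += pr[p] / float64(len(succ[p]))
			}
			next[v] = damping*sum + (1-damping)/n
		}
		pr = next
	}
	return pr
}

func main() {
	// Toy graph: two regular peers a and b knowing each other, plus a
	// sensor-like node s that never appears in anyone's successor list
	// as a source (deg+ = 0), so it only collects rank.
	g := map[string][]string{
		"a": {"b", "s"},
		"b": {"a", "s"},
		"s": {},
	}
	// Three iterations, dampingFactor 1.0, initial rank 0.25.
	fmt.Println(pagerank(g, 1.0, 0.25, 3))
}
```

With a dampingFactor of \(1.0\), rank flowing into the sink \(s\) is not redistributed, so after a few iterations the ranks of such nodes separate visibly from those of regular peers.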
-For simplicity, and since it is not required to model human behaviour for automated crawling and ranking, a dampingFactor of \(1.0\) will be used, which simplifies the formula to
+For simplicity---and since it is not required to model human behaviour for automated crawling and ranking---a dampingFactor of \(1.0\) will be used, which simplifies the formula to
 \[
-	\text{PR}(v) = \sum\limits_{p \in \text{pred}(v)} \frac{\text{rank}(p)}{\abs{\text{succ}(p)}}
+	\text{PR}(v) = \sum\limits_{p \in \text{pred}(v)} \frac{\text{PR}(p)}{\abs{\text{succ}(p)}}
@@ -424,20 +469,17 @@ Based on this, SensorRank is defined as
-Applying SensorRank PageRank once with an initial rank of \(0.25\) once on the example graphs above results in:
+Applying PageRank and SensorRank once with an initial rank of \(0.25\) on the example graphs above results in:
-\todo{pagerank, sensorrank calculations, proper example graphs}
+\todo{pagerank, sensorrank calculations, proper example graphs, proper table formatting}
 
 \begin{figure}[H]
 	\centering
-\begin{tabular}{|l|l|l|l|l|l|}
-	\hline
+\begin{tabular}{llllll}
 	Node & \(\deg^{+}\) & \(\deg^{-}\) & In \ac{wcc}? & PageRank & SensorRank \\
-	\hline\hline
 	n0 & 0/0 & 4/4 & no & 0.75/0.5625 & 0.3125/0.2344 \\
 	n1 & 1/1 & 3/3 & no & 0.25/0.1875 & 0.0417/0.0313 \\
 	n2 & 2/2 & 2/2 & no & 0.5/0.375 & 0.3333/0.25 \\
 	c0 & 3/5 & 0/2 & yes (1/3) & 0.0/0.125 & 0.0/0.0104 \\
 	c1 & 1/3 & 0/2 & yes (1/3) & 0.0/0.125 & 0.0/0.0104 \\
 	c2 & 2/4 & 0/2 & yes (1/3) & 0.0/0.125 & 0.0/0.0104 \\
-	\hline
 \end{tabular}
 	\caption{Values for metrics from~\autoref{fig:sensorbuster} (a/b)}\label{fig:metrics_table}
 \end{figure}
@@ -450,7 +492,7 @@ While this works for small networks, the crawlers must account for a significant
 
 Churn describes the dynamics of peer participation of \ac{p2p} systems, \eg{} join and leave events~\cite{bib:stutzbach_churn_2006}.
-Detecting if a peer just left the system, in combination with knowledge about \acp{as}, peers that just left and came from an \ac{as} with dynamic IP allocation (\eg{} many consumer broadband providers in the US and Europe), can be placed into the crawler's neighbourhood list.
+Combining the detection of a peer leaving the system with knowledge about \acp{as}, peers that just left and came from an \ac{as} with dynamic IP allocation (\eg{} many consumer broadband providers in the US and Europe) can be placed into the crawler's neighbourhood list.
-If the timing if the churn event correlates with IP rotation in the \ac{as}, it can be assumed, that the peer left due to being assigned a new IP address and not due to connectivity issues or going offline, and will not return using the same IP address.
+If the timing of the churn event correlates with IP rotation in the \ac{as}, it can be assumed that the peer left due to being assigned a new IP address---not due to connectivity issues or going offline---and will not return using the same IP address.
 These peers, when placed in the neighbourhood list of the crawlers, will introduce paths back into the main network and defeat the \ac{wcc} metric.
 It also helps with the PageRank and SensorRank metrics since the crawlers start to look like regular peers without actually supporting the network by relaying messages or propagating active peers.
@@ -505,6 +547,14 @@ Current report possibilities are \mintinline{go}{LoggingReport} to simply log ne
 
 %}}} implementation
 
+%{{{ further work
+\section{Further Work}
+
+Following this work, it should be possible to rewrite the existing crawlers to use the new abstraction.
+This might bring some performance issues to light, which can be solved by investigating the optimizations from the old implementation and applying them to the new one.
+
+%}}} further work
+
 %{{{ acknowledgments
 \section*{Acknowledgments}
diff --git a/report.pdf b/report.pdf
index 0f9f08bf..034cb7e6 100644
Binary files a/report.pdf and b/report.pdf differ
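As an aside, the \ac{wcc} membership used in the metrics above can be computed with a breadth-first search over the undirected version of the peer graph. A minimal sketch in Go, assuming the same successor-list graph representation as before (node names and the example graph are illustrative, not the \ac{bms} data):

```go
package main

import "fmt"

// weaklyConnectedComponents returns the weakly connected components of a
// directed graph by running BFS on its undirected version.
func weaklyConnectedComponents(succ map[string][]string) [][]string {
	// Build an undirected adjacency list: every directed edge is
	// traversable in both directions.
	adj := map[string][]string{}
	for v, ss := range succ {
		if _, ok := adj[v]; !ok {
			adj[v] = nil
		}
		for _, s := range ss {
			adj[v] = append(adj[v], s)
			adj[s] = append(adj[s], v)
		}
	}
	seen := map[string]bool{}
	var comps [][]string
	for v := range adj {
		if seen[v] {
			continue
		}
		// BFS from each unvisited node collects one component.
		comp := []string{}
		queue := []string{v}
		seen[v] = true
		for len(queue) > 0 {
			u := queue[0]
			queue = queue[1:]
			comp = append(comp, u)
			for _, w := range adj[u] {
				if !seen[w] {
					seen[w] = true
					queue = append(queue, w)
				}
			}
		}
		comps = append(comps, comp)
	}
	return comps
}

func main() {
	// A main network {n0, n1} and a crawler clique {c0, c1, c2} with no
	// edges between the two groups: without any path back into the main
	// network, the crawlers form their own component and stand out.
	g := map[string][]string{
		"n0": {"n1"},
		"n1": {"n0"},
		"c0": {"c1", "c2"},
		"c1": {"c0"},
		"c2": {"c0"},
	}
	fmt.Println(len(weaklyConnectedComponents(g)))
}
```

Adding even a single edge from a crawler component to a churned peer in the main network, as described above, merges the two components and defeats this check.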