This paper talks about detecting Internet filtering by analysing usage statistics such as the ones provided by The Tor Project. I used to follow Tor's existing censorship detector quite closely, but it tends to raise many alerts, especially for countries with low usage numbers, so I mostly ignore it at this point. That's why I was excited to see this paper and hoped that it would improve on the status quo.
Basically, the authors propose to use principal component analysis (PCA) to find events in countries whose usage time series deviates from those of all other countries. That has the advantage that sudden changes in all countries, such as the one caused by the botnet that used a hidden service as its C&C server, do not raise alerts. Then again, if there's a sudden change in all countries, we might want to learn about such an event anyway. I am typically somewhat reluctant to read theory-heavy papers, but the authors did a great job of explaining PCA and how they apply it. I wish other papers put as much effort into their writing.
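To make the idea concrete for myself, here is a minimal sketch of PCA-based residuals. It is not the authors' code: keeping only the first principal component, standardising each country's series, and the names `first_component` and `pca_residuals` are all my own assumptions. The shared worldwide pattern ends up in the first component, and whatever a country does on its own is left over as a residual.

```python
import math


def first_component(rows, iterations=200):
    """Approximate the dominant principal direction by power iteration.

    rows: one list per day, holding the standardised value of every country.
    """
    n = len(rows[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # w = X^T (X v), which is proportional to the covariance matrix times v
        proj = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(p * row[j] for p, row in zip(proj, rows)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_residuals(series):
    """series: dict mapping country -> list of daily user counts (equal length).

    Returns dict mapping country -> absolute residual per day after removing
    the part explained by the first principal component (an assumption; the
    paper may keep more components).
    """
    countries = sorted(series)
    days = len(series[countries[0]])
    standardised = {}
    for c in countries:
        xs = series[c]
        mean = sum(xs) / days
        sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / days) or 1.0
        standardised[c] = [(x - mean) / sd for x in xs]
    rows = [[standardised[c][t] for c in countries] for t in range(days)]
    v = first_component(rows)
    residuals = {c: [] for c in countries}
    for row in rows:
        score = sum(a * b for a, b in zip(row, v))
        for j, c in enumerate(countries):
            residuals[c].append(abs(row[j] - score * v[j]))
    return residuals
```

Feeding this a handful of countries that all follow the same pattern, plus one country whose usage collapses on a single day, should give that country by far the largest residual on that day.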
Unfortunately, their technique also seems to have issues with the high variance of countries that don't have a lot of Tor users. "In practice, we remove all those countries whose usage never rises above 1000 daily users", they write. Apparently, this removes 124 out of 251 time series, which is basically half the data. That's quite a lot. I wonder how many of the discarded countries have Internet filtering events hidden in their time series. For example, I think Ethiopia is not part of their data, but is documented to have filtered Tor.
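The filtering step as I read it from the quote could look like the sketch below. The function name `drop_low_usage` and the data layout (a dict from country code to daily user counts) are my assumptions, not something the paper specifies.

```python
def drop_low_usage(series, min_peak=1000):
    """Keep only countries whose daily user count rises above min_peak at least once.

    series: dict mapping country -> list of daily user counts (assumed layout).
    """
    return {
        country: counts
        for country, counts in series.items()
        if max(counts, default=0) > min_peak
    }
```

With a rule like this, a country that hovers at a few hundred users is dropped entirely, no matter how abruptly its usage changes.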
One important assumption of this paper is that patterns of Tor usage are consistent worldwide. This seems to hold so far, but I wonder whether it will also hold in the future. Imagine the botnet had only infected machines in certain countries, as malware sometimes does. That might have split Tor usage statistics into the categories "normal" and "infected". I wonder how their technique would have dealt with that.
An interesting part is when the authors apply their technique to archived Tor usage statistics and show the ten "most anomalous" time series based on the median residual score. Among the top ten are some of the usual suspects, including China, Iran, and Syria. Other countries are more surprising, such as South Africa, Bangladesh, and India. I wish the authors had included a plot showing the distribution of residuals: I wonder how large a gap there is between the top ten and all remaining residuals.
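The ranking, and the gap I am curious about, would be easy to compute once per-country residuals are available. The sketch below assumes a dict from country to a list of residuals; `rank_by_median_residual` and `gap_after_top` are hypothetical helper names of mine, and the numbers in any usage would have to come from the authors' data.

```python
from statistics import median


def rank_by_median_residual(residuals):
    """Return (country, median absolute residual) pairs, most anomalous first.

    residuals: dict mapping country -> list of residuals (assumed layout).
    """
    scores = {
        country: median([abs(r) for r in values])
        for country, values in residuals.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def gap_after_top(ranked, top=10):
    """Difference between the last score inside the top list and the first one outside it."""
    if len(ranked) <= top:
        return None
    return ranked[top - 1][1] - ranked[top][1]
```

A large value from `gap_after_top` would suggest that the top ten really stand apart; a small one would suggest that the cut-off at ten is fairly arbitrary.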
To evaluate their technique, the authors need ground truth. Since we don't have any, they inject their own anomalies and then check whether their technique can detect them. That sounds like a reasonable evaluation technique, but they only inject two anomalies, which does not seem like a lot to me. A larger number of more diverse anomalies would be more convincing.
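Scaling such an evaluation up to many injected anomalies does not take much code. The sketch below is only an illustration under my own assumptions: the anomaly is a multiplicative drop over a few days (I don't know what shape the authors injected), and `detector` is a hypothetical callable that takes the series dict and returns a set of `(country, day)` alerts.

```python
def inject_drop(counts, start, length, factor=0.5):
    """Return a copy of counts with days start..start+length-1 scaled by factor."""
    return [
        x * factor if start <= day < start + length else x
        for day, x in enumerate(counts)
    ]


def detection_rate(detector, series, cases):
    """Fraction of injected anomalies that the detector flags.

    detector: callable taking a series dict, returning a set of (country, day).
    series:   dict mapping country -> list of daily user counts.
    cases:    list of (country, start, length) describing each injection.
    """
    hits = 0
    for country, start, length in cases:
        modified = dict(series)  # shallow copy; only one country is replaced
        modified[country] = inject_drop(series[country], start, length)
        alerts = detector(modified)
        if any((country, day) in alerts for day in range(start, start + length)):
            hits += 1
    return hits / len(cases)
```

Varying `start`, `length`, and `factor` across many countries would give a detection rate per anomaly type rather than two yes/no answers.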
Time and practical use will show how helpful this technique is. I hope that the peer-reviewed version of this paper will come with an "Operational experience" section. The authors write that they will soon make their code available, which is promising.
Last updated: 2015-07-23