active probing

Late in 2011, a systems administrator noticed suspicious entries in his SSH log files. The payloads did not conform to the protocol—instead they were just long random-looking byte strings. Careful analysis of the log files revealed a pattern: IP addresses in China sent these strange payloads, and the triggering event was a genuine SSH login, by a real user, from a different Chinese IP address. The administrator concluded that the probes must be related to censorship by the Great Firewall of China (GFW) and moved on. His writeup of these events became the first public documentation of what we call active probing, a critical component in the real-time, versatile, and nation-scale traffic classification system commonly known as the “Great Firewall.”

Active probing is the most recent step in the ongoing arms race of Internet censorship. Users set up proxies to circumvent blocks; censors responded by identifying and blocking proxies through deep packet inspection (DPI); and circumventors in turn made proxy protocols more difficult to detect. Deprived of easy, passive protocol identification, the censor now goes straight to the source: after it sees a potentially suspicious connection, it interrogates the server directly. The censor acts like a user by issuing its own connections to the suspected proxy server, as illustrated in the diagram to the right. If the server responds using a prohibited protocol, the censor takes some blocking action, such as adding the server's IP address to a blacklist.

In this research project, we improve on existing knowledge and study the following aspects of the GFW:

  • We identify various probe types and their volume over time since their first appearance in our data in 2013.
  • Using network protocol fingerprinting techniques, we infer the physical structure of the probing system.
  • We localize the sensors that trigger active probes and show they are likely distinct from China's main censorship infrastructure, the GFW.

Our results show that the system operates in real time but regularly suspends for a short amount of time. It currently blocks at least five circumvention protocols and is upgraded regularly. We show that the system makes use of a vast number of IP addresses, provide evidence that all these IP addresses are controlled by a central system, and determine the location of the Great Firewall's sensors. We also publish our datasets and code to stimulate more research.

This material is based upon work supported in part by the National Science Foundation under grant nos. #1223717, #1518918, #1540066, and #1518882. This work was also supported in part by funding from the Open Technology Fund through the Freedom2Connect Foundation and from the US Department of State, Bureau of Democracy, Human Rights and Labor. The opinions in this work are those of the authors and do not necessarily reflect those of any funding agency or governmental organization.


paper

Our research paper was presented at the Internet Measurement Conference 2015 in Tokyo, Japan. We also presented our work at the 32nd Chaos Communication Congress in Hamburg, Germany.

Examining How the Great Firewall Discovers Hidden Circumvention Servers [pdf, bib, IMC slides, 32C3 slides]
Roya Ensafi, David Fifield, Philipp Winter, Nick Feamster, Nicholas Weaver, and Vern Paxson
In Proc. of: Internet Measurement Conference, ACM, 2015

  • Sybil dataset (181 MiB)
    SHA-1: 852ad06879d41b4614ad4e6f7658c371e16bcd27
    Repository: git clone https://github.com/NullHypothesis/active-probing-tools.git
    Contains a pcap file with active probes that were captured in a short time window.
  • Log dataset and code (69 MiB)
    SHA-1: c245bb3c2f4b080a32878c192ca39a0c82adbc9d
    Repository: git clone https://www.bamsoftware.com/git/active-probing.git
    Contains logs of active probes sent to application ports on a single server since 2013, and the programs used to extract and process them.
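After downloading a dataset, you may want to confirm that it matches the SHA-1 digest listed above. The following is a minimal sketch using Python's standard hashlib module. The digests are copied from this page; the dictionary keys and function names are our own illustrative choices, and the path you pass in is wherever you saved the download.

```python
import hashlib

# SHA-1 digests as published on this page. The keys are illustrative labels.
PUBLISHED_SHA1 = {
    "sybil": "852ad06879d41b4614ad4e6f7658c371e16bcd27",
    "log": "c245bb3c2f4b080a32878c192ca39a0c82adbc9d",
}


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, dataset):
    """Return True if the file's digest equals the published one."""
    return sha1_of_file(path) == PUBLISHED_SHA1[dataset]
```

Reading in chunks keeps memory use low for the larger archive.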

There are a few simple things you can do to check your own computer systems for evidence of active probing. Did you find something interesting? Let us know!

Check for traffic from the IP address 202.108.181.70.

The IP address 202.108.181.70 is disproportionately involved in active probing (sending half of all probes in one study), for reasons we do not understand.
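One way to run this check is sketched below. It assumes a log format in which the client address is the first whitespace-separated field, as in Apache's common log format; adjust the field position for other formats. The function name is illustrative.

```python
PROBER_IP = "202.108.181.70"


def lines_from_ip(lines, ip=PROBER_IP):
    """Yield log lines whose first whitespace-separated field equals `ip`."""
    for line in lines:
        fields = line.split(None, 1)
        if fields and fields[0] == ip:
            yield line
```

Comparing the whole field, rather than searching for the substring, avoids matching longer addresses that merely contain the same digits.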

Look for certain requests in web server logs.

The pattern POST /vpnsvc/connect.cgi indicates a SoftEther probe. The pattern GET /twitter.com indicates an AppSpot probe.
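A small sketch of how these two patterns could be matched in an access log follows. It assumes the request line appears in double quotes, as in Apache's common and combined log formats; the regular expression and function name are our own.

```python
import re

# Matches the start of a quoted request line such as "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)')

# Method and path pairs named on this page, mapped to the probe type.
PROBE_REQUESTS = {
    ("POST", "/vpnsvc/connect.cgi"): "SoftEther",
    ("GET", "/twitter.com"): "AppSpot",
}


def classify_request(log_line):
    """Return the probe type suggested by the request line, or None."""
    m = REQUEST_RE.search(log_line)
    if m is None:
        return None
    return PROBE_REQUESTS.get((m.group("method"), m.group("path")))
```

A match is only a hint: a genuine SoftEther client would produce the same POST request, so check the source address and timing before drawing conclusions.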

Look for web requests with an unexpected Host header.

An unexpected Host header, especially one pointing to a subdomain of appspot.com, is possible evidence of an AppSpot probe. Your web server may not log the Host header by default. In Apache, you can enable mod_log_forensic to see request headers.
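Once Host headers are being logged, a check along the following lines can separate them into expected and unexpected values. This is a sketch under the assumption that you can list the host names your server legitimately answers for; the function names and return labels are illustrative.

```python
def normalize_host(host):
    """Lower-case a Host value and drop an optional port and trailing dot."""
    name = host.strip().lower()
    if not name.startswith("["):      # leave bracketed IPv6 literals alone
        name = name.split(":", 1)[0]  # drop an optional :port suffix
    return name.rstrip(".")


def classify_host(host, expected_hosts):
    """Return 'expected', 'appspot', or 'unexpected' for a Host value."""
    name = normalize_host(host)
    if name in {normalize_host(h) for h in expected_hosts}:
        return "expected"
    if name.endswith(".appspot.com"):
        return "appspot"
    return "unexpected"
```

For example, `classify_host("webncsproxy12.appspot.com", {"example.com"})` returns `"appspot"`, which is the case most worth a closer look.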

Check for binary garbage in application logs.

The obfs2 and obfs3 protocols look like random binary noise by design. They tend to stand out in application logs. For example, here is an obfs2 probe seen in an Apache log:

192.0.2.1 - - [13/Jul/2015:05:56:50 -0600] "\xba\xf4\xf1gy\x9e\xe7O9..." 400 0 "-" "-"

Try grepping your logs for escaped bytes. (Be aware that there may be many false positives; for example, \x16\x03 usually just indicates a TLS connection to a non-TLS port.)

grep '\\x' application.log
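The same search can be done in Python while setting aside the TLS false positive mentioned above. This sketch assumes that the log escapes non-printable bytes as a literal backslash followed by x and two hex digits, and that the request field is double-quoted, as in the Apache example. The function names are our own.

```python
def has_escaped_bytes(line):
    """True if the line contains a literal backslash-x escape."""
    return "\\x" in line


def is_probable_tls(line):
    """True if the quoted request starts with the escaped bytes 16 03."""
    return '"\\x16\\x03' in line


def candidate_lines(lines):
    """Yield lines with escaped bytes that do not look like a TLS handshake."""
    for line in lines:
        if has_escaped_bytes(line) and not is_probable_tls(line):
            yield line
```

The remaining lines still need manual review, since many kinds of scanners send binary data to web servers.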

In the paper we describe a number of probe types that the GFW sends. Here are detailed probe payloads that we did not include in the paper for lack of space.

Tor

The Great Firewall probes for Tor servers using a TLS connection containing a single Tor VERSIONS cell (see Section 4.1 of the linked specification). The VERSIONS cell declares support for versions 1 and 2 of the Tor protocol. In hexadecimal, the payload is this:

00 00 07 00 04 00 01 00 02
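To see how these nine bytes encode the statement above, the sketch below decodes them as a variable-length Tor cell: a 2-byte circuit ID, a 1-byte command (7 for VERSIONS), a 2-byte payload length, and then a list of 2-byte version numbers, all big-endian. The function name is illustrative, and the layout is our reading of the Tor specification rather than code from the paper.

```python
import struct

VERSIONS_COMMAND = 7  # command value of a VERSIONS cell


def parse_versions_cell(data):
    """Return (circ_id, [versions]) for a raw VERSIONS cell."""
    if len(data) < 5:
        raise ValueError("cell too short for a header")
    circ_id, command, length = struct.unpack(">HBH", data[:5])
    if command != VERSIONS_COMMAND:
        raise ValueError("not a VERSIONS cell")
    payload = data[5:5 + length]
    if len(payload) != length or length % 2 != 0:
        raise ValueError("bad payload length")
    versions = list(struct.unpack(">%dH" % (length // 2), payload))
    return circ_id, versions


probe = bytes.fromhex("00 00 07 00 04 00 01 00 02")
print(parse_versions_cell(probe))  # (0, [1, 2])
```

The output shows circuit ID 0 and the two advertised link protocol versions, 1 and 2.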

The p0f TLS fingerprint of Tor probes is:

3.1:39,38,35,16,13,a,33,32,2f,5,ff:23:compr

Obfs2 & Obfs3

Apart from a few anomalies such as occasionally repeated payloads, the active probers' implementation of obfs2 and obfs3 complies with the protocol specification (obfs2 spec, obfs3 spec). Because the protocols appear random by design, no single probe sample characterizes them. For a better understanding of how they work, see a visual explanation of obfs2 and a visual explanation of obfs3.

SoftEther

SoftEther probes resemble the HTTPS-based client handshake of SoftEther VPN, a multi-protocol VPN client.

POST /vpnsvc/connect.cgi HTTP/1.1
Connection: Keep-Alive
Content-Length: 1972
Content-Type: image/jpeg

GIF89a...

The value of the Content-Length header may vary. In the official SoftEther protocol, the Content-Length reflects a random amount of padding following the fixed part of the body. The body of the SoftEther probe we saw also included random padding, but because we only recovered one example in full detail, we cannot say for sure whether the length varies.

Despite the Content-Type header, the POST body is a GIF image, 1,411 bytes in size, not a JPEG. In the SoftEther source code, the file is found in src/Cedar/Watermark.c. As an image, it looks like this:

An image of some kind of rodent, with the text "SoftEther VPN: (C) 2004 SoftEther Corporation. All Rights Reserved."
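If you capture request bodies, the mismatch between the declared Content-Type and the actual body can be checked by looking at the first few bytes. The sketch below recognizes only GIF and JPEG by their well-known magic numbers; the function names are our own.

```python
GIF_MAGICS = (b"GIF87a", b"GIF89a")
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_image_type(body):
    """Guess the media type from the leading bytes, or return None."""
    if body.startswith(GIF_MAGICS):
        return "image/gif"
    if body.startswith(JPEG_MAGIC):
        return "image/jpeg"
    return None


def content_type_mismatch(declared, body):
    """True if the body is a recognized image of a different declared type."""
    actual = sniff_image_type(body)
    declared_type = declared.split(";")[0].strip().lower()
    return actual is not None and actual != declared_type
```

A body that starts with GIF89a but is declared as image/jpeg, as in the probe above, makes `content_type_mismatch` return True.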

The HTTPS request differs from that of the official SoftEther client. In July 2014, the official client added a Host header that is not reflected in the active probes. The probe's p0f TLS fingerprint is:

3.1:39,38,35,16,13,a,33,32,2f,5,4,15,12,9,14,11,8,6,3::compr

This differs from that of the official client, which in version 4.15 had the fingerprint:

3.1:c014,c00a,39,38,88,87,c00f,c005,35,84,c012,c008,16,13,c00d,c003,a,c013,c009,33,32,9a,99,45,44,c00e,c004,2f,96,41,c011,c007,c00c,c002,5,4,15,12,9,ff:?0,b,a,f:compr
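The two fingerprints can be compared mechanically. The sketch below assumes the four colon-separated fields are, in order, the TLS version, the cipher suite list, the extension list, and a list of flags, which is our reading of the p0f TLS signature layout. The type and function names are illustrative.

```python
from typing import NamedTuple


class TlsFingerprint(NamedTuple):
    version: str
    ciphers: tuple
    extensions: tuple
    flags: tuple


def _split_list(field):
    """Split a comma-separated field, treating an empty field as no items."""
    return tuple(item for item in field.split(",") if item)


def parse_fingerprint(text):
    """Parse a fingerprint of the form version:ciphers:extensions:flags."""
    parts = text.strip().split(":")
    if len(parts) != 4:
        raise ValueError("expected four colon-separated fields")
    version, ciphers, extensions, flags = parts
    return TlsFingerprint(version, _split_list(ciphers),
                          _split_list(extensions), _split_list(flags))


PROBE = "3.1:39,38,35,16,13,a,33,32,2f,5,4,15,12,9,14,11,8,6,3::compr"
CLIENT = ("3.1:c014,c00a,39,38,88,87,c00f,c005,35,84,c012,c008,16,13,c00d,"
          "c003,a,c013,c009,33,32,9a,99,45,44,c00e,c004,2f,96,41,c011,c007,"
          "c00c,c002,5,4,15,12,9,ff:?0,b,a,f:compr")

probe = parse_fingerprint(PROBE)
client = parse_fingerprint(CLIENT)
only_in_probe = [c for c in probe.ciphers if c not in client.ciphers]
only_in_client = [c for c in client.ciphers if c not in probe.ciphers]
```

With these inputs, `only_in_probe` holds the last five entries of the probe's cipher list (14, 11, 8, 6, 3), and `probe.extensions` is empty while the official client's is not.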

AppSpot

The AppSpot probe type has taken on a few different forms. What they all have in common is a special Host: webncsproxyXX.appspot.com header, where XX is a two-digit number. We believe that this kind of request is intended to discover unknown Google servers that are capable of providing access to a proxy running on Google App Engine. The User-Agent string is fairly distinctive, reflecting a version of the Chromium web browser that was current for two weeks in April 2014. The User-Agent is faked, as the rest of the header does not match what that version of Chromium sends (for example, genuine Chromium would send Accept-Encoding: gzip).

Beginning on August 20, 2014, the AppSpot probe was a request for /:

GET / HTTP/1.1
Accept-Encoding: identity
Connection: close
Host: webncsproxyXX.appspot.com
Accept: */*
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36

Between September 4, 2014, and March 3, 2015, the probe requested /twitter.com instead. (Such a request would cause the webncsproxy app to display the twitter.com home page.)

GET /twitter.com HTTP/1.1
Accept-Encoding: identity
Connection: close
Host: webncsproxyXX.appspot.com
Accept: */*
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36

From March 3, 2015, onward, the probe changed back to requesting /:

GET / HTTP/1.1
Accept-Encoding: identity
Connection: close
Host: webncsproxyXX.appspot.com
Accept: */*
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36

Starting on July 6, 2015, the probes come in pairs, separated by a few seconds. The two probes in a pair do not come from the same IP address, and the numbers in their Host headers differ. The second probe has a shorter header.

GET / HTTP/1.1
Accept-Encoding: identity
Connection: close
Host: webncsproxyXX.appspot.com
Accept: */*
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36

GET / HTTP/1.1
Accept: */*
Content-Type: text/html
Proxy-Connection: Keep-Alive
Content-length: 0
Host: webncsproxyYY.appspot.com
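The pairing behavior can be looked for in your own records with a sketch like the following. It assumes you have already extracted a timestamp, source address, and Host value for each request. The 10-second window is our own guess at what "a few seconds" means, and the names are illustrative.

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple

# Host value of the form webncsproxyXX.appspot.com, XX being two digits.
HOST_RE = re.compile(r"^webncsproxy(\d{2})\.appspot\.com$", re.IGNORECASE)


class Probe(NamedTuple):
    time: datetime
    source_ip: str
    host: str


def find_probe_pairs(probes, max_gap=timedelta(seconds=10)):
    """Return consecutive AppSpot probes that look like a pair.

    A pair is two time-adjacent requests within `max_gap` that come from
    different source addresses and carry different Host numbers.
    """
    ordered = sorted(probes, key=lambda p: p.time)
    pairs = []
    for first, second in zip(ordered, ordered[1:]):
        m1 = HOST_RE.match(first.host)
        m2 = HOST_RE.match(second.host)
        if m1 is None or m2 is None:
            continue
        if second.time - first.time > max_gap:
            continue
        if first.source_ip == second.source_ip:
            continue
        if m1.group(1) == m2.group(1):
            continue
        pairs.append((first, second))
    return pairs
```

The function only compares neighbors in time order, so unrelated traffic arriving between the two probes should be filtered out before calling it.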

The p0f TLS fingerprint of the AppSpot probes is:

3.1:39,38,88,87,35,84,16,13,a,33,32,9a,99,45,44,2f,96,41,5,4,15,12,9,14,11,8,6,3,ff:23:compr

It differs markedly from the TLS fingerprint of the version of Chromium it purports to be:

3.2:c00a,c009,c013,c014,c007,c011,33,32,39,2f,35,a,5,4:?0,ff01,a,b,23,3374,10,7550,5,12:ver,rtime

If you have any questions or feedback, please get in touch with us!

Last updated: 2016-12-01