Implementing a Tor relay from scratch

The last few weeks I have been spending my free time on implementing a Tor OR (Onion Relay) in the Go language (golang.org). After a while I realized the language was not suited for the project, and stopped working on it. Today I am sharing what I learned, and will release the partially working code on github.com/tvdw/gotor.

In this post I will refer to the network protocol as ‘Tor’, the original implementation as ‘CTor’, and my implementation as ‘GoTor’, in an attempt to reduce confusion between the different versions. For consistency all network speeds are noted as a combined upload plus download, so divide by two where you think it is relevant.

Tor

Tor is a peer-to-peer network protocol that provides privacy to its users. It does so by encrypting the data several times, and sending it through a path (‘circuit’) of multiple servers before it reaches the final destination. Each layer of encryption can only be decrypted by the server it is intended for, so that any node in the circuit can only ever know its immediate neighbors. torproject.org has a better explanation of the network in case you’re interested.

The network has a specification document, but carefully following it will result in an implementation that does not work (this has now been corrected). This posed some challenges, but luckily the CTor source code is relatively easy to read.

The goal

For a while I have been running Tor nodes using CTor. In November 2014 those nodes relayed a total of about 400 terabytes, which is an average of roughly 1200 Mbit per second. Each machine has cheap hardware to push the costs down, but the lack of AES-NI instructions on the CPUs cause a significant slowdown. CTor’s implementation is essentially single-threaded, with only a very small amount of work being done in background threads, so the slowdown of having cheap hardware is immediately noticeable in throughput.

To maximize the amount of relayed data, it is normal to simply run multiple instances of the program, up to two per IP address. Running multiple instances of CTor on the same computer causes inefficiencies, such as needing more connections to other servers and trouble in balancing network throughput, so I set out to write a Tor relay in a way that uses all CPU cores. My final goal was to beat the Tor speed record, which was at roughly 200 megabytes per second.

The path

I started by just reading the specification documents, and quickly understood the basics of the network. I then decided on a language: for a while a colleague had been pushing me to write Go, so I picked Go. It took me about a day to pick up the basics and I got started.

First up: the TLS code. Instead of using OpenSSL, Go has its own TLS implementation called “crypto/tls“, apparently because agl__, one of the people working on the language, decided so. Lots of features are missing, from DH key exchange to extracting the ClientHello message. Next to that, a quick performance benchmark showed that this implementation was very slow compared to popular OpenSSL bindings for Go. So I went with OpenSSL. While I had to fork the code to add some extra bindings, it was mostly painless.

Then came the Tor handshake. There are four versions of the link protocol, and the specification requires implementing at least the first one, which is a real pain. Instead of that, I implemented the fourth handshake, and sent a mail to the Tor developers requesting to drop support for the first and second versions. They agreed, and hopefully this week I will be finishing the patch for that.

So up to this point everything was great. I had spent only a few days and even though both the Tor protocol and the Go language were new to me, I had made significant progress. My local TorBrowser was able to connect to the relay and I saw the commands it was sending while trying to establish a circuit to the next server.

As soon as I started implementing the circuit logic, the first implementation issues started appearing. Each connection had its own buffered channel (a queue) and a circuit would be a bridge between the channels, but when both ends of the circuit would be writing to the other end simultaneously, and both channels were full, the code would deadlock. I mitigated this by just increasing the buffer, and planned to refactor the entire system later on.

Fast-forward a bit to the moment I had written enough to have a nearly full implementation of the relay. Some minor features were still missing, but it was stable enough to put a node in the network, which I did. It very slowly started building consensus (it took more than a week to get any weight assigned at all) but data was flowing nicely and it sometimes hit a few megabit per second. A few fixes had to be done but I was happy.

To get the testing up to speed, I moved the configuration and keys from a node that had been running CTor for months, and had my node run it. Traffic quickly ramped up to 500 Mbit/s, which was already more than a single CTor node could handle, but it didn’t really grow much more than that.

memory-day

On top of that, memory usage grew very quickly every time I started the program. I had to restart the node twice a day or it would start swapping. The growth was about 6 megabytes per minute on average, which is a lot less than the node relayed, but definitely suggested that there was a memory leak somewhere.

I mentioned the fast relay on the Tor IRC, and a fellow relay operator TheCthulhu offered to help with some testing hardware. He set up a server with my relay and within a few days we had broken the Tor speed record with a nice 250 megabytes per second, effectively maxing out the network link. CPU usage was at a nice 60% across 12 cores. But his relay also suffered from the memory issues and had to be restarted every few days.

Meanwhile my own relay was still at only 500 Mbit/s, which was a quarter of the server’s capacity. After profiling the code and looking at some graphs, I realized that this was because of Go internals. I’ll talk more about this in a bit.

Then there was the overhead of using OpenSSL. During initial benchmarking it looked minimal, but it turns out Go assumes the worst of all C code and the language will refuse to run C code inside Go code. Instead, it will allocate an entirely new stack just for C calls, and tell the scheduler to create a new thread in case the C code hangs. If 80% of the CPU time spent in your program is on C calls, your Go program won’t hesitate to run 500 OS threads.

In the end I decided to drop the project. It was a fun learning process, but the Go language made the final product too slow. By sharing this blog post I hope that others will learn about these and maybe some day adjust the language.

Lessons learned

  • [cpu] Go cannot call C functions directly. Instead, with cgo it creates an entirely new stack and runs the C code from there, while assuming the code will block for locks or I/O. This causes a massive buildup of OS threads;
  • [cpu] To prevent the buildup of OS threads caused by cgo, you can lock concurrent calls using a lock or buffered channel. However, this causes a lot of futex system calls, slowing down the process;
  • [mem] For every OS thread, Go has its own memory allocation cache. This is good, as it avoids locks, but when combining this with all the threads spawned by cgo you will see quite a bit of memory overhead (this claim has been disputed in the comments, it’s possible I misread something in the Go source);
  • [mem] In Go’s I/O model, every connection must have its own goroutine with its own read buffer. Assuming normal TLS records of 16K and a goroutine stack size of 4K, this means that with only 10.000 active connections you will already be using 200 megabytes just because of these buffers;
  • [mem] Go’s default channel implementation uses an array and a mutex to control access. This means that a buffered queue with a capacity of 2000 slices will always be using 48KB, even if it’s empty. However, to be fair, this could be mitigated by using an alternate channel implementation that uses a linked list;

In the end, I concluded that while Go is an amazing language, it’s not that good if there is C code in performance critical loops, if you have to deal with lots of connections, or if you have a lot of data flowing between goroutines. For most projects these restrictions are not an issue at all, but in my case I hit all three.

Lots of thanks to everyone who helped me during this short project. It was a lot of fun, and I am looking forward to working on CTor.

Screen Shot 2015-01-24 at 22.12.29

KPN Experia Box v8 and PPPoE passthrough

A few weeks ago I got a new internet connection, after switching from Online to KPN (both are ISPs). It also came with a new modem, and the old one wouldn’t let me connect anymore. That was acceptable though, as the old modem (Speedtouch 564v6) was horrible.

Sadly, KPN doesn’t exactly have a reputation for delivering quality modems either, and the modem I got from them can’t be configured. With “can’t be configured” I mean that it doesn’t allow me to turn off the firewall, turn off the DHCP server, use UPnP, change the DNS servers, etc. It didn’t even have a telnet interface or other advanced configuration interface.

After spending a few hours with it (and after several configuration resets), I discovered that KPN edited the HTML of the web interface to hide a lot of elements that allow you to configure the modem. With Firebug I was able to recover some settings, including the ability to turn off the DHCP server and use my own instead.

But I wasn’t there yet. UPnP still didn’t work properly, and with manual port forwarding I couldn’t get some applications to work (probably because the firewall was still active).

Since the modem was already connected to my normal router, Apple’s Airport Extreme (nice stuff, but wouldn’t buy again) which has support for PPPoE, I decided to try using that. In theory if I managed to get that to work, I would completely bypass the parts of the Experia Box that make it so annoying. It would disable the firewall, DHCP server, it would let my own router handle UPnP, etc.

The web interface had a checkbox in the interface that said ‘PPPoE passthrough’. “Nice,” I thought. But after changing the settings of the router, it didn’t work. I hooked up my laptop (Mac) directly to the modem, but my laptop couldn’t connect either (the logs indicated that a timeout occurred after a connection was made). Maybe it was a Mac thing? I connected my Windows desktop as well, but the same thing happened there. Apparently ‘PPPoE passthrough’ was just a checkbox that didn’t do anything.

After several more hours of experimenting (I really needed this modem to work), I found a hidden web interface page that let me setup what I wanted in only a few clicks. Which is why I’m writing this article, as I don’t want anyone else to go through this mess.

To enable PPPoE passthrough on the KPN Experia Box v8, reset the modem first (hold the reset button until it restarts), then go to http://192.168.2.254/atmint.stm/3/1 and click Apply. Wait for the modem to restart and then configure your router (which should be the only device connected to your modem, preferably on LAN1). That’s all! The phone function still works, but the wireless router part of the modem should probably be turned off.

For Telfort users, the procedure is a lot more complicated, and I have not yet found the easiest way to do it. Read the comments for a few suggestions from readers.

iTunes Store? Never again.

Every now and then, I download some songs from the internet. I respect the authors of these songs, so I usually buy them from iTunes. But I also have the good habit of cleaning my computer every 2 to 3 months (windows re-install) so I did that a few days ago.

Basically, what you expect from a web-store like the iTunes store, is that it will automatically re-download the songs you had before you cleaned your PC. I mean, if Steam can do this with games of 5 GB, it shouldn’t be too much of a problem for Apple with songs of about 3 MB-. Sure, we are talking about a lot more songs than games, but I don’t think many people buy 5 GB off the iTunes store.

So, I logged into my iTunes this afternoon, after installing iTunes again, and I clicked “Check for available downloads”. It told me I already had all songs! Well, in my Purchased tab, I only saw one song, and that was a free song I downloaded from the iTunes Store because I wanted to test something. So, I checked the iTunes FAQ and it told me I would have to backup all my songs!!

Isn’t it ridiculous that I have to re-buy all my songs again? Is it really too much that I expect the iTunes Store to re-download my purchases after I re-install my computer? Steam does it. The EA download manager does it. Battle.net does it. And a way bigger company called Apple can’t do something as simple as this?

They force us to have DRM on the songs we purchase. They force us to backup everything we buy, but we can’t burn them on CDs if we want to listen to them. They can only be on five computers at the same time, and if you accidentally forget to deauthorize your computer before re-installing your PC, you can only have four computers at the same time. The iTunes Store really limits what you can do with the music.

I don’t think I’ll ever buy off the iTunes Store again. I will resume my old habits of downloading them from Usenet and Torrents, and if I respect the author well enough, I will make a donation to them. Sorry Apple folks, but it’s 2009 and this kind of trickery isn’t wise.

Problems with Skype

It can happen to everyone. You have a problem with Skype.

So, what you do: you go to www.skype.com and try finding a solution. Too bad, you are the first one. So, you open a support ticket.

A few days later, you get a reply from them. Basically, there’s an Indian guy trying to answer your questions by clicking a pre-defined answer.

Our conversation, attached below.

Continue reading Problems with Skype