When I Messed Up the Gymkhana Server

And how I fixed it 391 hours later: debugging an IP range overlap between a Docker network and the internal network.

Background

I am one of the new coordinators of the Programming Club, and I had been instructed by the outgoing coordinators to deploy a CTF.

What I had to do

To deploy any service, we need to configure NGINX on two machines (a minimal sketch of the proxy config follows this list):

  • The Students' Gymkhana VM: The actual service runs here. Configuring NGINX on this machine makes it available to anyone connected to the IITK network.
  • The Reverse Proxy VM: This machine has a public-facing IP. Configuring NGINX here forwards requests from the internet to our internal network.
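
On the Reverse Proxy VM, the relevant NGINX block looks roughly like the sketch below. The server name is the real one, but the upstream IP and port are placeholders, not the actual values:

    server {
        listen 80;
        server_name ctf.pclub.in;

        location / {
            # Placeholder internal IP/port of the service on the Gymkhana VM
            proxy_pass http://10.0.0.5:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }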

What I did

So I cloned CTFd, made some config changes, and used docker-compose to start the service. CTFd was up and working. Now I only needed to configure the Reverse Proxy VM to make it publicly accessible. However, I had not yet been granted access to the Reverse Proxy VM, so I sent a message to the outgoing coordinators requesting access.
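
For reference, the deployment boiled down to roughly the following (the repository URL is CTFd's public one; the exact config changes are omitted):

    git clone https://github.com/CTFd/CTFd.git
    cd CTFd
    # ...make config changes here...
    docker-compose up -d   # also auto-creates a network named ctfd_default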

The movie starts here

The next morning, at 5:32 AM IST, I received a message from Yash:

Why is search.pclub.in down?

I have a monitor up which says this went down 8:30 hours ago.

Student Search is one of the most popular projects of PClub. And to think that it went down because I had been toying around on the VM was truly dreadful.

Sherlocking :D

So I tried to do some forensics (the commands I ran are sketched after this list):

  • I checked the docker containers. They were running. Good.
  • I checked if the service was accessible through the internal network using the IP of the Gymkhana VM. It was. Great.
  • I checked NGINX logs. They did not have anything fishy. In fact, new requests over the internet were not being logged. Sad.
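
These checks map to roughly the following commands; the internal IP and port are placeholders:

    docker ps                                  # containers were Up
    curl -I http://10.0.0.5:8000/              # placeholder Gymkhana VM IP; responded fine
    sudo tail -f /var/log/nginx/access.log     # no new requests arriving from the internet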

Therefore, I concluded that everything was working up to the Gymkhana VM and that the Reverse Proxy VM was to blame. And since I didn’t even have access to that VM, it couldn’t have been my fault. I breathed a sigh of relief.

Later, Yugesh revealed that he could not ssh into the Reverse Proxy VM. This further substantiated the belief that the machine was down.

Daedalian mysteries lie ahead

So I mailed the CC authorities regarding the issue. The mail was forwarded through 4 nodes, and the reply I finally received was that the machine was working from inside IITK. Okay.

Then I tried pinging the Reverse Proxy VM after connecting to the IITK proxy. It worked. Great.

Then I sshed into the Gymkhana VM and tried to ping the Reverse Proxy VM. It said Destination Host Unreachable. Wait what?

Now this was going over my head (I’m just a noob SysAdmin :’( ). The fact that I was able to ping the Reverse Proxy VM from my own machine meant that the VM was alive. So I just decided to wait until Yugesh somehow found a way to let me inside that VM.

And it happened

2 weeks later, Yugesh found a machine that could ssh into the Reverse Proxy server, and created a user for me on the VM. Now the first task at hand for me was to get Student Search up again.

Forensics once more

  • First Step: Checking NGINX logs. The error log file was filled with upstream timed out (110: Connection timed out) errors. Sad.
  • Second Step: Restarting NGINX. I know it doesn’t make sense, but sometimes restarting things solves all the problems. :p It didn’t work this time though. :(
  • Third Step: Pinging the Gymkhana VM. No response. 100% packet loss. (The commands for these steps are sketched after this list.)
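
Roughly, those steps correspond to the following; the Gymkhana VM's IP is a placeholder:

    sudo tail -n 50 /var/log/nginx/error.log   # full of 'upstream timed out (110: Connection timed out)'
    sudo systemctl restart nginx               # the classic 'turn it off and on again'
    ping 10.0.0.5                              # placeholder Gymkhana VM IP; 100% packet loss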

So now I realized that both the Gymkhana and the Reverse Proxy VMs were up and working properly. I could ssh into both of them. But they couldn’t ping each other. This didn’t make sense. And so I asked Yash.

Insights from Samrat

Yash, as always, had an answer. He recalled facing this issue, and quickly explained it: a new Docker network had been created which took up the IP range of the Reverse Proxy VM.

So then I checked docker networks. There was a network with the name ctfd_default, whose subnet contained the IP address of the Reverse Proxy VM. This network was automatically created when I started the CTFd docker container using docker-compose.
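
This is easy to confirm; the subnet shown below is a placeholder, not the actual campus range:

    docker network ls                          # lists ctfd_default among the networks
    docker network inspect ctfd_default \
        --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
    # => 172.18.0.0/16 (placeholder) -- a range that happened to contain the proxy's IP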

So whenever the Reverse Proxy VM made a request to the Gymkhana VM, the request would arrive correctly. But since the IP of the Reverse Proxy VM was now contained in the Docker network's subnet, the Gymkhana VM would route the response into the Docker network, and it would never actually reach the Reverse Proxy VM. The Reverse Proxy VM would keep waiting, and hence the timeout.
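
The routing table on the Gymkhana VM shows why; every address and interface name below is a placeholder chosen only to illustrate the overlap:

    $ ip route
    default via 10.0.0.1 dev eth0
    172.18.0.0/16 dev br-xxxx proto kernel   # route added for the ctfd_default bridge
    # If the Reverse Proxy VM's IP is, say, 172.18.3.7, replies to it match the
    # bridge route above and get swallowed by the Docker network.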

Resolving the issue

The issue was resolved by deleting the offending network. I just did docker network rm ctfd_default, and everything fell back into place.
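
In commands (the proxy IP is again a placeholder):

    docker network rm ctfd_default   # works once the CTFd containers are stopped
    ping 172.18.3.7                  # placeholder Reverse Proxy VM IP; replies again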

There was another problem here though. In the future, any new Docker network would take that same subnet by default, unless specified otherwise. And specifying a different subnet in the docker-compose file for every project is a tedious task.
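
For the record, that per-project override would look something like this in each docker-compose.yml; the subnet is a placeholder that must avoid the campus ranges:

    networks:
      default:
        ipam:
          config:
            - subnet: 172.30.0.0/24   # placeholder subnet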

To fix this once and for all, I added a default-address-pools field in the /etc/docker/daemon.json file. Adding this field caused the Docker service to crash, because it requires Docker version >=18.09.1, while the Gymkhana VM had 17.09.0. :( Updating docker-ce fixed everything.
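
The daemon-wide fix looks like this in /etc/docker/daemon.json; the pool below is a placeholder and should be chosen to avoid the campus IP ranges:

    {
      "default-address-pools": [
        { "base": "172.30.0.0/16", "size": 24 }
      ]
    }

After editing the file, restart the daemon with sudo systemctl restart docker so that new networks get carved out of this pool.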

Results

Student Search is up and working. The CTF server is also up and working at https://ctf.pclub.in. I also learnt how to generate SSL certificates using certbot. And now perhaps I know a little more about Docker networks :p.
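
For what it’s worth, with the certbot NGINX plugin installed, issuing the certificate is a one-liner (the domain is the real one from above):

    sudo certbot --nginx -d ctf.pclub.in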

Also, Yash had “intended to write a nice blog post” about this “very unique to docker + internal networks issue debugging insight” that he had, but I stole his idea (with permission). I hope he doesn’t mind. XD

Priydarshi Singh
Software Engineer

Breaker · Builder · Engineer
