
Monday, 11 April 2016

Using TCP-LP with pycurl in Python

Intro


TCP-LP (low priority) is a TCP congestion control algorithm that is meant to be used by TCP connections that don't want to compete with other connections for bandwidth. Its goal is to use the idle bandwidth for file transfers. The details of TCP-LP are here.

With Linux's pluggable congestion control algorithms, it is possible to change both the default algorithm for the whole system and the one used per connection. For the latter, one generally needs to be root (a non-privileged process can only pick algorithms listed in net.ipv4.tcp_allowed_congestion_control, which does not include TCP-LP by default).

Note: Changing the CC algorithm only affects your own transmissions; you cannot alter the remote end's behavior. This means that the approach below only makes sense when you are going to upload data.
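
For a quick check of which algorithms your system currently offers and which one is the system-wide default, you can read the usual /proc entries. A minimal sketch (standard Linux paths; "lp" only shows up in the available list once the tcp_lp module is loaded):

[code language="python"]
# List the registered congestion control algorithms and the system-wide default.
with open('/proc/sys/net/ipv4/tcp_available_congestion_control') as f:
    print('available: ' + f.read().strip())

with open('/proc/sys/net/ipv4/tcp_congestion_control') as f:
    print('default: ' + f.read().strip())
[/code]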

PyCurl


Changing the CC algorithm is a matter of calling setsockopt on a socket. Doing this with pycurl can be a bit tricky: pycurl does support SOCKOPTFUNCTION, but only in newer versions. For older ones, one can exploit pycurl's OPENSOCKETFUNCTION instead.

The trick is done with this piece of code:

[code language="python"]
import pycurl
import socket

TCP_CONGESTION = 13  # from /usr/include/linux/tcp.h

def _getsock(family, socktype, protocol, addr):
    # Create the socket ourselves so we can set options on it
    # before libcurl starts using it.
    s = socket.socket(family, socktype, protocol)
    # Select the TCP-LP congestion control algorithm for this connection.
    # A bytes literal keeps this working on both Python 2 and 3.
    s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b'lp')
    return s

c = pycurl.Curl()
c.setopt(c.OPENSOCKETFUNCTION, _getsock)
c.setopt(c.URL, 'http://127.0.0.1/')
c.perform()
c.close()
[/code]

In the above, pycurl calls _getsock and expects it to return a socket. The function creates a new socket and then calls setsockopt with IPPROTO_TCP and TCP_CONGESTION (value 13 - see /usr/include/linux/tcp.h and /usr/include/netinet/tcp.h), attempting to set the algorithm to "lp", which is the TCP-LP congestion control algorithm.
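
If you want to double-check that the option actually took effect on a socket, you can read it back with getsockopt. A small sketch (it assumes the tcp_lp module is loaded and sufficient privileges; the 16-byte buffer matches the kernel's TCP_CA_NAME_MAX):

[code language="python"]
import socket

TCP_CONGESTION = 13  # from /usr/include/linux/tcp.h

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b'lp')

# getsockopt with a buffer length returns the raw, NUL-padded algorithm name.
name = s.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
print(name.rstrip(b'\x00'))  # 'lp' (b'lp' on Python 3) if it was applied
[/code]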

You most probably want to wrap the setsockopt call in a try/except block, as it may fail if "lp" is not available (the tcp_lp module needs to be loaded) or if the program doesn't run as root.
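
A minimal sketch of such a wrapper, keeping the older-style callback signature used above; on failure it simply falls back to the system default algorithm:

[code language="python"]
import socket

TCP_CONGESTION = 13  # from /usr/include/linux/tcp.h

def _getsock(family, socktype, protocol, addr):
    s = socket.socket(family, socktype, protocol)
    try:
        s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b'lp')
    except socket.error:
        # "lp" not available (tcp_lp not loaded) or not enough privileges;
        # keep the system default congestion control algorithm instead.
        pass
    return s
[/code]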

The _getsock function also depends on the pycurl version, as its arguments have changed over time. Consult the docs for the fine details.

Results


Here is an example of uploading two 500 MB files in parallel on an already busy production network, one with TCP-LP and the other with the default (TCP Cubic):

TCP-Cubic: 9.38 seconds
TCP-LP: 23.08 seconds

Same test, for 100MB files, again in parallel, on the same network:

TCP-Cubic: 3.14 seconds
TCP-LP: 5.38 seconds

Note: The above are random runs, presented to give an idea of the impact. For actual experimental results we would need multiple runs and would also have to monitor the background traffic.

Monday, 15 February 2016

Running an NTP server in a VM using KVM

The setup


We have a physical server, pA, running VMs using KVM. One of the VMs (vA) acts as an NTP server: pA gets its time from vA, and vA gets it from the Internet.

It's not a great idea to run an NTP server in a VM, but in this case there was need for it.

The problem


The NTP server frequently gets out of sync.

If you use Nagios, you may get errors like this:
SERVICE ALERT: pA;ntpd;CRITICAL;SOFT;4;NTP CRITICAL: Offset unknown

This happens both for the physical server and for other servers that fetch their time from vA.

The reason


There's some guessing involved here, but this should be pretty accurate:

The VM vA needs to correct its clock every now and then by slowing down or speeding it up via ntpd/adjtimex. As expected, this creates a small discrepancy between vA and pA, as the physical server now drifts out of sync and needs to correct its own time using vA as its reference.

Once pA attempts to correct its time, again by slowing down or speeding up its clock, this has a direct effect on vA, whose clock is now affected by pA's ongoing adjustment. This happens because KVM guests by default use kvmclock as their clock source (the source that ticks, not the source that returns the time of day).

This feedback sometimes causes pA's ntpd to drift even further out of sync, and may even make it consider its peers inaccurate and lose synchronization completely.

The problem gets even worse if you have two NTP servers (vA and vB) running on two different physical servers (pA and pB), because the amount of desync between the two is mostly random. Assuming that all your servers, including pA and pB, fetch their time from vA and vB, the discrepancy between the two will make clients mark at least one of them as wrong, since the stratum of vA and vB does not permit such a difference between their clocks.

You can see this by looking at the falsetick condition in ntpq's associations output:
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 33082  961a   yes   yes  none  sys.peer    sys_peer  1
  2 33083  911a   yes   yes  none falsetick    sys_peer  1

Overall, the problem is that the physical servers will try to fix their clocks, thus affecting the clocks of the NTP servers running in VMs under them.

The solution


The problem is with the VMs using the kvmclock source. You can see that using dmesg:
$ dmesg | grep clocksource
Switching to clocksource kvm-clock
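
You can also check the clock source in use at runtime through sysfs (assuming the standard sysfs layout); on an affected VM it will report kvm-clock:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock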

The way to disable this is to pass the "no-kvmclock" parameter to the kernel of your VMs. This alone will not always work though: the kernel (at least the CentOS kernels) may panic very early in the boot process, because it still tries to initialize the kvmclock even though it is not going to use it, and fails.

The solution is to pass two parameters to your VM kernels: "no-kvmclock no-kvmclock-vsyscall". The second one is a bit undocumented, but will do the trick.
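
As an example of how to pass them (assuming a CentOS 7 guest booting with GRUB2 in BIOS mode; adapt the paths for your distribution and bootloader), append the parameters to GRUB_CMDLINE_LINUX in /etc/default/grub, regenerate the configuration and reboot the VM:
$ grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX="... no-kvmclock no-kvmclock-vsyscall"
$ grub2-mkconfig -o /boot/grub2/grub.cfg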

After that you can verify it through dmesg:
$ dmesg | grep Switching
Switching to clocksource refined-jiffies
Switching to clocksource acpi_pm
Switching to clocksource tsc

Example


Below is the output of a server running in such an environment. In this case the first NTP server (vA) runs with the extra kernel parameters and the other (vB) runs without them. The clocks of the physical servers (pA and pB) were slowed down by hand using adjtimex in order to test the effect of the physical servers' clocks on the VM clocks. As you can see, this server is still in sync with vA and has a very large offset from vB. Note that this server is not a VM under pA or pB.
$ ntpq -nc peers
     remote           refid      st t when poll reach delay   offset  jitter
==============================================================================
*10.93.XXX.XXX   216.218.254.202  2 u   81  256  377 0.433  -87.076  20.341
 10.93.XXX.XXX   216.218.254.202  2 u  290  512  377 0.673  11487.6 9868.84

In other words, the first server, the one using the extra kernel parameters, kept its clock accurate while the second did not.