Sunday, 30 September 2012

IPsec, Racoon, setkey, Linux, Mikrotik, tunnel, transport and everything

It took me more than six months to sort out all the issues, so here are my experiences. Most of the trouble was because I didn't know things or didn't have them clear in my mind.

I wanted to have IPsec communication between a bunch of servers and a home network. I believe that this covers almost all (if not all) of the possible IPsec scenarios, so it's more complicated than it sounds. For obvious reasons I'm presenting a simplified version here, omitting all duplicates (i.e. multiple hosts with the same characteristics).

The network


We have the following nodes:

  • A network behind a DSL line (home network): a normal home DSL line with a non-static IP, behind NAT

  • A server (srv1) somewhere on the Internet with a static public IP address without NAT.

  • A server (srv2) in Amazon's EC2 which has an allocated public IP address but uses a local IP address and thus is behind NAT. Also, Amazon doesn't allow the ESP and AH protocols to be carried by IP packets inside their network.


We also have the following systems:

  • Home network: A bunch of Linux boxes on a private network plus a mikrotik router

  • srv1 and srv2: Squeeze Debian Linux


The home network uses IP addresses from the network 10.1.0.0/16. A secondary prefix (10.5.0.0/16) is allocated for IPsec addressing only. All home nodes have addresses from the 10.1.0.0/16. Some nodes (including the servers) have addresses from 10.5.0.0/16.

Apart from the above there's a custom CA setup which publishes certificates for all nodes.

The problem


Setup IPsec so that:

  • srv1 and srv2 can communicate with their public IP addresses with IPsec only

  • boxes on the home network can communicate both with srv1 and srv2 using IPsec


The setup


Since there is more than one box on the home network, the home network needs to be connected with tunnel-mode IPsec to srv1 and srv2. srv1 and srv2 need to be connected in transport mode between them in order to encrypt communication that uses their public IP addresses.

We have set up the DSL router to forward everything to the mikrotik box (a RouterBOARD). This is usually referred to as a DMZ. By doing that it's possible to avoid NAT traversal in IPsec (i.e. UDP encapsulation).

The solution


Mikrotik


In short, Mikrotik's IPsec works quite well and is easy to set up, assuming that everything is correct. It is however harder to debug than Racoon. Here's the setup:

  • Add an IP address from 10.5.0.0/16

  • Import the box's certificate into the certificate store; both the certificate and its key are needed

  • Import CA's and other boxes' certificates to the certificate storage. Make sure you use sensible names to be able to look them up later.

  • Create a new proposal as follows:

    • Name: short (or pick something else)

    • Lifetime: 00:10:00 - This is essential in order to allow quick recovery when the IP address changes or racoon is restarted.

    • Pick your favorite values for everything else



  • Add two peers, one for each server:

    • srv1 (static public IP, no NAT):

      • Address: The public IP of srv1

      • Port: 500

      • Auth method: rsa signature

      • Certificate: Pick the local certificate (mikrotik's)

      • Remote certificate: Pick the certificate of srv1

      • Exchange Mode: main

      • Select: Send Initial Contact

      • Nat Traversal: No

      • My ID User FQDN: Leave empty - isn't needed

      • Proposal check: Claim (remember not to use similar or stricter on remote end)

      • Generate policy: No

      • Lifetime: 08:00:00

      • DPD Interval/Max failures: I use 10/3 but it doesn't make a difference. See notes below



    • srv2 (static IP, public IP, with NAT): Use the same settings as with srv1

      • I didn't use NAT but it may be worth testing it.





  • You need to add two policies per peer, one for each local source IP address range (10.1.0.0/16 and 10.5.0.0/16), so you will end up with 4 policies:

    • Src Address: 10.1.0.0/16 or 10.5.0.0/16

    • Dst Address: srv1's or srv2's public IP address

    • Src/Dst Port: Empty

    • Protocol: all (255)

    • Action: Encrypt

    • Level: Unique - very important

    • IPsec protocols: ESP

    • Tunnel: Yes

    • SA Src address: 0.0.0.0

    • SA Dst address: srv1's or srv2's IPsec IP address (i.e. allocated addresses from the 10.5.0.0/16)

    • Proposal: short (or whatever name you picked for the proposal you created)



  • Create a script named "ping-servers" (System -> Scripts) as follows:
    {
        :local servers
        :local locals

        :set servers {"10.5.1.11";"10.5.1.12"}
        :set locals {"10.1.1.1";"10.5.1.1"}

        :foreach loc in=$locals do={
            :foreach srv in=$servers do={
                :put "ping $srv src-address=$loc count=1"
                /ping $srv src-address=$loc count=1
            }
        }
    }

    servers is the list of the servers' addresses from the 10.5.0.0/16 network and locals are the addresses local to the mikrotik box, one for each of the two networks.

  • Schedule the script to be executed every minute (System -> Scheduler). This will keep the policies active and also reactivate them if they go down.
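    For reference, the same schedule can also be created from the RouterOS CLI. This is a sketch; the exact syntax may differ slightly between RouterOS versions:

    ```
    /system scheduler add name=ping-servers interval=00:01:00 on-event=ping-servers
    ```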


srv1 (static public IP, no NAT)



  • Put the following in /etc/ipsec-tools.d/srv2.conf:
    spdadd srv1public srv2public[500] udp -P out none;
    spdadd srv2public srv1public[500] udp -P in none;
    spdadd srv1public srv2public[4500] udp -P out none;
    spdadd srv2public srv1public[4500] udp -P in none;
    spdadd srv1public srv2public 50 -P out none;
    spdadd srv2public srv1public 50 -P in none;
    spdadd srv1public srv2public 51 -P out none;
    spdadd srv2public srv1public 51 -P in none;

    spdadd srv1public srv2public any -P out ipsec
    esp/transport/srv1public[4500]-srv2public[4500]/require ;

    spdadd srv2public srv1public any -P in ipsec
    esp/transport/srv2public[4500]-srv1public[4500]/require ;


  • Put the following in /etc/ipsec-tools.d/srv2-priv.conf. Somehow it is required in order to establish the IPsec connection when it's triggered by srv2:
    spdadd srv1public srv2private[500] udp -P out none;
    spdadd srv2private srv1public[500] udp -P in none;
    spdadd srv1public srv2private[4500] udp -P out none;
    spdadd srv2private srv1public[4500] udp -P in none;
    spdadd srv1public srv2private 50 -P out none;
    spdadd srv2private srv1public 50 -P in none;
    spdadd srv1public srv2private 51 -P out none;
    spdadd srv2private srv1public 51 -P in none;

    spdadd srv1public srv2private any -P out ipsec
    esp/transport/srv1public[4500]-srv2private[4500]/require ;

    spdadd srv2private srv1public any -P in ipsec
    esp/transport/srv2private[4500]-srv1public[4500]/require ;


  • In the above, srv1public is the public static IP address of srv1, srv2public is the public static IP address of srv2 and srv2private is the private static IP address of srv2.
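    The policy files can be loaded and inspected by hand with setkey from ipsec-tools. This is a sketch; on Debian the setkey init script normally loads /etc/ipsec-tools.conf, so make sure that file includes everything under /etc/ipsec-tools.d/:

    ```
    # Load the policy files by hand
    setkey -f /etc/ipsec-tools.d/srv2.conf
    setkey -f /etc/ipsec-tools.d/srv2-priv.conf

    # Dump the installed policies (SPD) and the security associations (SAD)
    setkey -DP
    setkey -D
    ```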

  • Setup racoon.conf's section for srv2 and home as follows. Obviously you need to change to match your parameters:
    remote "srv2" {
        exchange_mode main,base;
        verify_identifier on;
        peers_identifier asn1dn "Common name of srv2's certificate";
        remote_address srv2public;
        verify_cert on;
        certificate_type x509 "srv1.crt" "srv1.key";
        ca_type x509 "cacert.pem";
        my_identifier asn1dn;
        lifetime time 24 hours;
        nat_traversal on;
        proposal {
            authentication_method rsasig;
            encryption_algorithm 3des;
            hash_algorithm md5;
            dh_group modp1024;
        }
        passive off;
        proposal_check obey;
        generate_policy off;
        dpd_delay 10;
        dpd_retry 10;
        dpd_maxfail 6;
        initial_contact on;
        ike_frag on;
    }


  • Setup racoon.conf's section for the home network as follows:
    remote "home" {
        exchange_mode main,base;
        verify_identifier on;
        peers_identifier asn1dn "Common name of mikrotik's certificate";
        verify_cert on;
        certificate_type x509 "srv1.crt" "srv1.key";
        ca_type x509 "cacert.pem";
        my_identifier asn1dn;
        nat_traversal off;
        proposal {
            authentication_method rsasig;
            encryption_algorithm 3des;
            hash_algorithm md5;
            dh_group modp1024;      # Group 2
        }
        passive on;
        proposal_check obey;
        generate_policy unique;
        dpd_delay 10;
        dpd_retry 10;
        dpd_maxfail 6;
        initial_contact on;
        ike_frag on;
    }


  • Notice the differences: passive should be on for the home network, since it's not possible to initiate a connection to it without a known remote address.

  • Notice generate_policy: it must be "unique" and not "on". Otherwise only one policy per remote endpoint will be generated, which will also cause problems when an SA goes bad.

  • Assign the additional address (the one from 10.5.0.0/16) to a loopback interface, not to a physical interface.

  • Add static routes for the two networks using the normal gateway and specifying the source IP address. Otherwise you will be using the tunnel with source addresses that are not routed via the tunnel and are not protected by IPsec, which will prevent anything from working on top of IPsec. Surprisingly, this will work occasionally when the traffic is initiated by the remote end, just because of the route cache. The config can be added to the loopback interface as follows:
    auto lo:1
    iface lo:1 inet static
    address     10.5.1.12
    netmask     255.255.255.255
    up ip route add 10.5.0.0/16 via <gw> src 10.5.1.12 || true
    up ip route add 10.1.0.0/16 via <gw> src 10.5.1.12 || true
    down ip route del 10.1.0.0/16 via <gw> src 10.5.1.12 || true
    down ip route del 10.5.0.0/16 via <gw> src 10.5.1.12 || true

    where 10.5.1.12 is the address from the 10.5.0.0/16 network for srv1 and gw is the normal gateway of the server.


srv2 (static private IP, static public IP, NAT)



  • Set up the /etc/ipsec-tools.d/*.conf files similarly to srv1's. You will need entries for both the private and the public address.

  • Set up racoon like srv1's, except for NAT: you will have to set nat_traversal to on for both srv1 and the home network.


The Hints / Lessons learned



  • Either test DPD (Dead Peer Detection) or don't use it at all. It didn't work for me at all.

  • You need to activate the policies from the home network's side proactively for both the IPsec networks (10.1.0.0/16 and 10.5.0.0/16). Otherwise it will be impossible for the remote ends to connect to local hosts. This is easily done by setting up a ping to run every minute. You need one ping per source IP address using -I.

  • You need to exclude ISAKMP traffic (UDP ports 500 and 4500) from the static IPsec policies. Otherwise outgoing traffic will be encrypted while incoming traffic will be dropped if not encrypted, which causes huge problems when one end goes down: the IPsec SA then has to expire on both ends (or be flushed) before things work again.

  • If you have firewall rules make sure that you allow ISAKMP traffic and IPsec traffic (protocols 50 (esp) and 51 (ah))

  • If you get errors saying that a policy is not available, then it is not available! I can't stress this enough. While trying to make IPsec work your brain will enter a bad state and start making mistakes. It's extremely easy to get static IPsec rules wrong. I've made all sorts of mistakes, including (but not limited to): using the wrong direction (in/out), using the address of another server, using tunnel instead of transport (and vice versa), not including the port numbers for esp-udp (UDP encapsulation) mode, not using the .conf extension for files under /etc/ipsec-tools.d/, etc. Here's an example of such an error:
    Sep 27 15:02:04 srvX racoon: ERROR: no policy found: A.B.C.D/32[0] E.F.G.H/32[0] proto=any dir=in
    Sep 27 15:02:04 srvX racoon: ERROR: failed to get proposal for responder.
    Sep 27 15:02:04 srvX racoon: [I.J.K.L] ERROR: failed to pre-process ph2 packet (side: 1, status: 1).


  • When testing a connection from host A, which has both the 10.1.1.1 and 10.5.1.1 addresses, to host B with address 10.5.1.2, you may not be able to ping from B to one of A's addresses. That's because only one of the IPsec policies is active. To activate both of them, use the -I parameter of ping:
    v13@hostA$ ping -I 10.1.1.1 10.5.1.2
    v13@hostA$ ping -I 10.5.1.1 10.5.1.2


  • Pay attention to routing. You need to use the proper source IP addresses.

Friday, 17 August 2012

DNSSEC key tag (keyid) and DS signature calculation in python

This one took me a considerable number of hours to figure out, so here it is.

While trying to automate DNS zone generation I had to calculate some of the values programmatically. Two of the auto-generated values had to do with DNSSEC entries: The key tag (or keyid) and the DS record's signatures.

The details of how these are calculated can be found in RFC 4034 (Appendix B describes the key tag calculation; section 5 together with RFC 4509 covers the DS digests).

For the calculations you need to provide the following:

  • For the key tag: flags, protocol, algorithm, public key

  • For the DS signatures: owner (the domain name), flags, protocol, algorithm, public key


Code


I used python for this but the approach is the same for other languages since the algorithms are the same.

So here it is:

[code lang="python"]
import struct
import hashlib
import base64

def calc_keyid(flags, protocol, algorithm, st):
    """
    @param flags The flags of the entry (256 or 257)
    @param protocol Should always be 3
    @param algorithm Should always be 5
    @param st The public key as listed in the DNSKEY record.
        Spaces are removed.
    @return The key tag
    """
    # Remove spaces and create the wire format
    st0 = st.replace(' ', '')
    st2 = struct.pack('!HBB', int(flags), int(protocol), int(algorithm))
    st2 += base64.b64decode(st0)

    # Calculate the tag
    cnt = 0
    for idx in xrange(len(st2)):
        s = struct.unpack('B', st2[idx])[0]
        if (idx % 2) == 0:
            cnt += s << 8
        else:
            cnt += s

    ret = ((cnt & 0xFFFF) + (cnt >> 16)) & 0xFFFF

    return ret

def calc_ds(owner, flags, protocol, algorithm, st):
    """
    @param owner The corresponding domain
    @param flags Usually it is 257 or something that indicates a KSK.
        It can be 256 though.
    @param protocol Should always be 3
    @param algorithm Should always be 5
    @param st The public key as listed in the DNSKEY record.
        Spaces are removed.
    @return A dictionary of hashes where the key is the hashing algorithm.
    """
    # Remove spaces and create the wire format
    st0 = st.replace(' ', '')
    st2 = struct.pack('!HBB', int(flags), int(protocol), int(algorithm))
    st2 += base64.b64decode(st0)

    # Ensure a trailing dot
    if owner[-1] == '.':
        owner2 = owner
    else:
        owner2 = owner + '.'

    # Create the name wire format
    owner3 = ''
    for i in owner2.split('.'):
        owner3 += struct.pack('B', len(i)) + i

    # Calculate the hashes
    st3 = owner3 + st2
    ret = {
        'sha1': hashlib.sha1(st3).hexdigest().upper(),
        'sha256': hashlib.sha256(st3).hexdigest().upper(),
    }

    return ret
[/code]

Data


The following were created by BIND's dnssec tools:

[code light="true"]
$ cat Ktest.hell.gr.+005+33630.key
; This is a zone-signing key, keyid 33630, for test.hell.gr.
; Created: 20101007114826 (Thu Oct  7 14:48:26 2010)
; Publish: 20101007114826 (Thu Oct  7 14:48:26 2010)
; Activate: 20101007114826 (Thu Oct  7 14:48:26 2010)
test.hell.gr. IN DNSKEY 256 3 5 AwEAAb+lTDjZCfq7D5N9cNd1ug30wLrbCXB9mVJJQGlQQHpiHHlMaLGG
sV2/j5+eojHp+WQUzNpOzrULF6msbEvUuV2gSEnpbueRV4twO8muGE+x
eUuseSoHh/aTpA8Z9SPubb01mduqqaUEN5Juz2Q4hF0dSUSJYlJPKhp6
NrOgoeyj

$ cat dsset-test.hell.gr.
test.hell.gr.        IN DS 33630 5 1 A2AD2648B353365631EBC9C70EDA1E0C04563FCC
test.hell.gr.        IN DS 33630 5 2 4177EAEC09A37178357871EBE3FB361CABB2861F12A1D51DDE18CBA2 439BB5C1
[/code]

Result


[code lang="python" light="true"]
>>> domain='test.hell.gr'
>>> flags=256
>>> protocol=3
>>> algorithm=5
>>> key='AwEAAb+...goeyj' # Truncated
>>> calc_keyid(flags, protocol, algorithm, key)
33630
>>> r=calc_ds(domain, flags, protocol, algorithm, key)
>>> r['sha1']
'A2AD2648B353365631EBC9C70EDA1E0C04563FCC'
>>> r['sha256']
'4177EAEC09A37178357871EBE3FB361CABB2861F12A1D51DDE18CBA2439BB5C1'
[/code]
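For what it's worth, here's the same calculation ported to Python 3, where the wire format has to be handled as bytes instead of str. This is a sketch adapted from the code above and verified only against the sample key from the Data section:

```python
import struct
import hashlib
import base64

def calc_keyid3(flags, protocol, algorithm, st):
    # DNSKEY RDATA wire format: flags, protocol, algorithm, then the raw key
    rdata = struct.pack('!HBB', flags, protocol, algorithm)
    rdata += base64.b64decode(st.replace(' ', ''))
    # 16-bit checksum as described in RFC 4034, Appendix B
    cnt = 0
    for idx, byte in enumerate(rdata):  # iterating bytes yields ints in py3
        cnt += (byte << 8) if idx % 2 == 0 else byte
    return ((cnt & 0xFFFF) + (cnt >> 16)) & 0xFFFF

def calc_ds3(owner, flags, protocol, algorithm, st):
    rdata = struct.pack('!HBB', flags, protocol, algorithm)
    rdata += base64.b64decode(st.replace(' ', ''))
    if not owner.endswith('.'):
        owner += '.'
    # Owner name in wire format: length-prefixed labels, root label last
    wire = b''
    for label in owner.split('.'):
        wire += struct.pack('B', len(label)) + label.encode('ascii')
    data = wire + rdata
    return {
        'sha1': hashlib.sha1(data).hexdigest().upper(),
        'sha256': hashlib.sha256(data).hexdigest().upper(),
    }

# The zone-signing key from the Data section above
key = ('AwEAAb+lTDjZCfq7D5N9cNd1ug30wLrbCXB9mVJJQGlQQHpiHHlMaLGG'
       'sV2/j5+eojHp+WQUzNpOzrULF6msbEvUuV2gSEnpbueRV4twO8muGE+x'
       'eUuseSoHh/aTpA8Z9SPubb01mduqqaUEN5Juz2Q4hF0dSUSJYlJPKhp6'
       'NrOgoeyj')
keyid = calc_keyid3(256, 3, 5, key)
ds = calc_ds3('test.hell.gr', 256, 3, 5, key)
print(keyid, ds['sha1'])
```

This reproduces the key tag and digests shown in the Result section.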

Legal


You can use the above under the MIT license. If it doesn't fit your needs let me know. My intention is to make this usable by anyone for any kind of use with no obligation.

Tuesday, 24 July 2012

rsync as root with rrsync and sudo

Here's how to rsync something to a remote host as root without allowing root logins and with directory restriction. I did that because I wanted to sync /srv across servers.

In general it will use rsync over ssh, sudo, rrsync and a remote non-root user. I assume that rsync will run from srv1 to srv2.

rrsync


First you will need the rrsync (or rrsync.pl) script at the server side; it's part of the rsync package. In Debian you can find it at /usr/share/doc/rsync/scripts/rrsync.gz. This script acts as the server side and restricts the destination directory (à la chroot).

In short the server side will run "rrsync /srv". Then the client side will do something like this:

[code light="true"]
# rsync /srv remote:/
[/code]

and / will be relative to /srv that was defined as a parameter to rrsync.

You can put rrsync under /usr/local/bin.

User on srv2


At the destination server we will need a user for the ssh session, so go and create a user named 'syncer' on srv2. I'd avoid the username 'rsync' as it may be needed for other purposes at some point.

sudo on srv2


The user on srv2 should be able to run rrsync with sudo, using sudo's -E parameter. -E is required in order to pass the check of the rrsync script, which looks for SSH_ORIGINAL_COMMAND in the environment. Feel free to make this stricter and allow only that environment variable if you like.

Sample sudoers entry (e.g. to be put in /etc/sudoers.d/syncer):

[code light="true"]
syncer    ALL=SETENV:NOPASSWD:/usr/local/bin/rrsync /srv
[/code]

Obviously we need the user to be able to run this without a password. SETENV allows passing the -E parameter to sudo.

SSH config


Next step is to allow root@srv1 to ssh as syncer@srv2 using public key. If you don't have a key pair generated for root@srv1 then go ahead and create it:

[code light="true"]
# ssh-keygen
[/code]

Then copy the contents of /root/.ssh/id_rsa.pub and paste them into syncer's authorized_keys file on srv2, which is most probably at /home/syncer/.ssh/authorized_keys. Create the directory and the file if they don't exist.

To make rrsync work and make things safer you need to use the command=".." parameter and you should use the from=".." parameter. So your authorized_keys file will look something like this:

[code light="true" wraplines="true"]
from="srv1",command="sudo -E /usr/local/bin/rrsync /srv" ssh-rsa AAAA......siW root@srv1
[/code]

Don't forget to ssh at least once from srv1 to srv2 by hand in order to accept srv2's key and let ssh have it in known_hosts.

Try it


Finally you are done and you can do the rsync:

[code light="true"]
# rsync --rsh=ssh -a --delete /srv syncer@srv2:/
[/code]

Monday, 23 July 2012

Vending machines and APIs

In our office we have a vending machine that offers free sodas. What the vending machine does is simple: keep the contents cool, wait for someone to request a soda and then serve it.

But then again, this is just a fridge. Why have a machine there to do what a fridge has been doing all this time? It would be simpler to put the sodas in a fridge. Everyone could just pick up what they like. Right? ... wrong...

One only needs to open one of the two fridges to see how this would end. At first things are put nicely in place. But a couple of days later the contents of a publicly accessible fridge look really, really bad. You may end up having open or spilled sodas in there, expired cans and even things that should never be in there. All of this can be caused by a small minority of the users, either on purpose or not. In the end you may have to dig through a pile of cans and other leftovers to find a coke zero, which would make the mess even worse. Requiring so much time to find something could also cause queues. The whole thing would become unusable after a couple of months. Even if there was a fridge access policy clearly stated on the front door, someone would ignore it in the end.

All of this justifies a vending machine that keeps things in place, always looks shiny and serves people well. Everyone is happy and in the end it costs less, since it doesn't need constant maintenance and doesn't break because of bad usage.

This is the same as data access. In theory you may allow direct access to the data, but it will always end in disaster. Your stored data may become a mess and may even be used to store other things. People will do things the way they see fit (and this guarantees that most of them will do it wrong). Even worse, when people are in a hurry and want to do something quick and dirty, they won't pay much attention to access or storage rules.

The solution is exactly the same: introduce an interface between the user and the contents. Either a vending machine that serves the contents of the fridge or an API that returns the desired data. Direct data access will only be granted to a few people and will only be needed to fix occasional issues.

But it gets even better!

Our vending machine doesn't have predefined locations for different kinds of sodas. Any soda can be stored anywhere and the user just has to find the proper row and column to get it. For example, one needs to type C2 on a keypad to get a soda from row C, column 2. Of course this could be changed and sodas could be put in predefined places. Then a different vending machine could accept a request for 'Pepsi Max', locate it and serve it. There are pros and cons to this: it will be easier to get something, it will allow the machine to know when a kind of soda has run out, etc. It also adds a lot of complexity and makes adding more sodas of a kind harder, since the machine will need to be reprogrammed.

Something similar happens with APIs: An API can either get a raw request and process it (e.g. insert these data to table) or get a higher level request (e.g. create a new user). The first requires less complexity and the second allows the API to perform clever tricks like keeping statistics, updating other tables, changing internal representation without changing the API, etc...

Obviously there's no 1-to-1 mapping but things are really similar: using a vending machine to serve sodas or an API to serve data will result in nice and clean stored contents and will last longer. Oh.. and users will be happier.

Sunday, 25 March 2012

Linux Containers: Easy LXC

Linux containers (a.k.a. LXC) rock. It's the ultimate way of having multiple Linux boxes with minimal requirements.

Here's how I do it under Debian (and the script I'm using):

Requirements


This guide is for Debian testing as of 25 March 2012. However, it should work in other cases as well.

The procedure creates a minimal installation which can then be fully customized by hand or with puppet. The procedure installs Debian under Debian but should be easy to change for other distributions as well (especially Ubuntu).

Packages


You will need to install:

  • lxc - The linux containers package

  • bridge-utils - For bridging network interfaces

  • uml-utilities - For tun/tap interfaces

  • cdebootstrap - For the bootstrapping of the virtual machines

  • puppet (optional) - for managing multiple machines


Networking


I prefer networking between lxc installations to be separate from my normal network. It is trivial however to bridge with the outside network as well.

Add the following to /etc/network/interfaces:

[code]
auto virtlxc
iface virtlxc inet manual
tunctl_user     root
up              ip link set virtlxc up
down            ip link set virtlxc down

auto brvirt
iface brvirt inet static
bridge_ports            virtlxc
bridge_maxwait          0
bridge_stp              off
address                 10.3.1.1
netmask                 255.255.255.0
dns-search              virt.local
[/code]

Then add the following to /etc/hosts:

[code]
10.3.1.1    deb0 deb0.virt deb0.virt.local
10.3.1.11    deb1 deb1.virt deb1.virt.local
10.3.1.12    deb2 deb2.virt deb2.virt.local
10.3.1.13    deb3 deb3.virt deb3.virt.local
10.3.1.14    deb4 deb4.virt deb4.virt.local
[/code]

Add as many entries as you like. There should be one entry per virtual machine. It doesn't matter if you have more entries than virtual machines since you may use them in the future. The first entry (deb0) is for the local machine.

Bring up the brvirt and virtlxc interfaces and keep reading (ifup virtlxc; ifup brvirt).

You may also want to run something like this to provide network access to the virtual machines (assuming that eth0 is the interface that connects you to the rest of the world):

[code]
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -I POSTROUTING -o eth0 -s 10.3.1.0/24 -j MASQUERADE
[/code]

Create the virtual machine


Get the following script and change the desired variables at the beginning as follows (assuming that you followed the network configuration):

  • SUITE: The Debian suite to use (e.g. squeeze)

  • MIRROR: A mirror to download Debian from. If you use approx like me then you want to use the local machine (i.e. 10.3.1.1)

  • VIRTUSER: A username you want to have created in the virtual machine. After that you may ssh as that user.

  • LOCALUSERS: A space separated list of local users whose ssh public keys will be put in VIRTUSER's authorized_keys file to allow ssh.

  • PUPPETMASTER: Leave it empty if you don't have a puppet master.

  • DNSSERVER: The DNS server to use. By default it is the local machine.


Each virtual machine should get a unique MAC address. The MAC addresses are auto-generated from the current y/m/d/H/M, so you should not create more than one virtual machine per minute. You're free to change this of course.
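The MAC generation can be tried by hand. This is the same trick the script below uses: "4a" is a fixed first octet (locally administered, unicast) and the remaining five octets encode the timestamp, which is why two containers created within the same minute would collide:

```shell
# Derive a MAC address from the current date and time (yy:mm:dd:HH:MM)
MAC=$(date "+4a:%y:%m:%d:%H:%M")
echo "$MAC"
```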

Now run the script at the end of the page and let it create a virtual machine:

[code]
./easylxc deb1
[/code]

The installation will happen under /var/lib/lxc (the default for lxc). You may visit that and fix things by hand if you e.g. manage to lock yourself out.

The virtual machine can be started with:

[code]
lxc-start -n deb4
- or -
lxc-start -n deb4 -d
[/code]

However, a bug/feature of rxvt may prevent that from succeeding. In that case you can run:

[code]
sudo lxc-start -n deb4
- or -
sudo lxc-start -n deb4 -d
[/code]

Inside the virtual machine you will be able to su to root by using "su" without a password. You will also be able to ssh as root (using the same ssh keys).

Hints'n'tips


I strongly suggest using approx and changing the MIRROR variable as needed. It will speed up the creation of many machines by orders of magnitude since there will be no network delays.
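For example, assuming approx's default port (9999) and a repository named "debian" configured in /etc/approx/approx.conf, MIRROR would point at the local machine:

```shell
# Fetch packages through the local approx cache instead of a remote mirror
MIRROR="http://10.3.1.1:9999/debian"
echo "$MIRROR"
```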

The script


[code lang="bash"]
#!/bin/bash

if [ -z "$1" ] ; then
    echo "Pass the name of the machine as the first parameter"
    exit 1
fi

# The name of the container to create. Also used as the hostname
NAME="$1"

# The name of the parent (local) machine without the domain
PARENTNAME="deb0"

# Distribution
SUITE="squeeze"

# The domain to be used by the virtual machines.
DOMAIN="virt.hell.gr"

# The network prefix (first 3 octets - it is assumed to be a /24 network)
NETPREFIX="10.3.1"

# Since we use approx, this is the approx server. If not, add a mirror.
MIRROR="http://ftp.debian.org/debian/"

# The gateway address for the virtual machine. This is most probably the
# address of the bridge interface.
GW="$NETPREFIX.1"

# The bridge interface to use for networking
BRIDGEIF="brvirt"

# The username of the user to create inside the container
VIRTUSER="v13"

# A list of local users that will have ssh access to the container
# They need to have a public key in the local machine
LOCALUSERS="v13 root"

# The puppet master. This must be the hostname of the master (not an IP addr).
# No puppet if this is empty.
PUPPETMASTER=""

# The DNS server to use.
DNSSERVER="$GW"

IPADDR2=$(getent hosts $NAME.$DOMAIN | awk '{print $1}')

if [ "x$IPADDR2" = "x169.254.1.1" ] ; then
    IPADDR2=""
fi

if [ -z "$IPADDR2" ] ; then
    echo "Could not resolve $NAME.$DOMAIN"
    exit 1
fi

IPADDR="$IPADDR2/24"

MAC=$(date "+4a:%y:%m:%d:%H:%M")

lxc-stop -n $NAME
lxc-destroy -n $NAME

export SUITE
export MIRROR

R0=/var/lib/lxc/$NAME
R=$R0/rootfs

mkdir $R0 $R

# Install base system
echo cdebootstrap -f standard $SUITE $R $MIRROR
cdebootstrap -f standard $SUITE $R $MIRROR

CFG=$R0/config

# Create config file
cat << _KOKO > $CFG
# Auto-generated by: $*
# at $(date)

## Container
lxc.utsname = $NAME
lxc.rootfs = $R
lxc.tty = 6
lxc.pts = 1024

## Network
lxc.network.type = veth
lxc.network.hwaddr = $MAC
lxc.network.link = $BRIDGEIF
lxc.network.veth.pair = veth-$NAME

## Capabilities
lxc.cap.drop = mac_admin
lxc.cap.drop = mac_override
lxc.cap.drop = sys_admin
lxc.cap.drop = sys_module

## Devices
# Allow all device
lxc.cgroup.devices.allow = a
# Deny all device
lxc.cgroup.devices.deny = a
# Allow to mknod all devices (but not using them)
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m

# /dev/console
lxc.cgroup.devices.allow = c 5:1 rwm
# /dev/fuse
lxc.cgroup.devices.allow = c 10:229 rwm
# /dev/null
lxc.cgroup.devices.allow = c 1:3 rwm
# /dev/ptmx
lxc.cgroup.devices.allow = c 5:2 rwm
# /dev/pts/*
lxc.cgroup.devices.allow = c 136:* rwm
# /dev/random
lxc.cgroup.devices.allow = c 1:8 rwm
# /dev/rtc
lxc.cgroup.devices.allow = c 254:0 rwm
# /dev/tty
lxc.cgroup.devices.allow = c 5:0 rwm
# /dev/urandom
lxc.cgroup.devices.allow = c 1:9 rwm
# /dev/zero
lxc.cgroup.devices.allow = c 1:5 rwm
# /dev/net/tun
lxc.cgroup.devices.allow = c 10:200 rwm

## Limits
#lxc.cgroup.cpu.shares = 1024
#lxc.cgroup.cpuset.cpus = 0
#lxc.cgroup.memory.limit_in_bytes = 256M
#lxc.cgroup.memory.memsw.limit_in_bytes = 1G

## Filesystem
lxc.mount.entry = proc $R/proc proc nodev,noexec,nosuid 0 0
lxc.mount.entry = sysfs $R/sys sysfs defaults,ro 0 0

_KOKO

# fix interfaces
T=$R/etc/network/interfaces
mv $T $T.orig
(
cat $T.orig \
| sed "s/^iface eth0.*$//"
echo "
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address $IPADDR2
netmask 255.255.255.0
gateway $GW
dns-nameservers $DNSSERVER
"
) > $T
rm $T.orig

# fix resolv.conf
T=$R/etc/resolv.conf
cat << _KOKO > $T
domain $DOMAIN
search $DOMAIN
nameserver $GW
_KOKO

# add info to hosts
T=$R/etc/hosts
echo "$IPADDR2 $NAME $NAME.$DOMAIN" >> $T
echo "$GW gw gw.$DOMAIN $PARENTNAME.$DOMAIN $PARENTNAME" >> $T

# set debian_chroot (for help)
echo "lxc-$NAME" >> $R/etc/debian_chroot

# create ttys
for i in $(seq 0 6) ; do
    mknod $R/dev/tty$i c 4 $i
done

run()
{
    echo chroot $R "$@"
    LC_ALL=C chroot $R "$@"
}

run2()
{
    ssh -o StrictHostKeyChecking=no $IPADDR2 "$@"
}

runmaster()
{
    ssh -o StrictHostKeyChecking=no $PUPPETMASTER "$@"
}

# Install locales
run apt-get -y install locales

# disable init scripts
DISABLED="bootlogd bootlogs checkfs.sh checkroot.sh halt hostname.sh \
hwclockfirst.sh hwclock.sh module-init-tools mountall.sh \
mountdevsubfs.sh mountkernfs.sh mountnfs.sh mountoverflowtmp procps \
reboot stop-bootlogd stop-bootlogd-single udev umountfs umountnfs.sh \
umountroot"
for dis in $DISABLED ; do
    run update-rc.d $dis disable
done

# disable rsyslog's kernel logging
run sed -i 's/^\(.*imklog.*\)$/#\1/' /etc/rsyslog.conf

# add user
run adduser --gecos $VIRTUSER --disabled-password $VIRTUSER
run adduser $VIRTUSER root

# fix sources.list
T=$R/etc/apt/sources.list
cat << _KOKO > $T
deb $MIRROR $SUITE main
_KOKO

# Install ssh
run apt-get update
run apt-get -y install openssh-server
run /etc/init.d/ssh stop

# Fix root and su
run passwd -l root
T=$R/etc/pam.d/su
mv $T $T.old
cat $T.old \
| sed 's/^# \(.*pam_wheel.so trust\)/\1/' \
> $T
rm $T.old

# Add ssh keys
T=$R/home/$VIRTUSER/.ssh/authorized_keys
T2=$R/root/.ssh/authorized_keys
mkdir $R/home/$VIRTUSER/.ssh $R/root/.ssh
for u in $LOCALUSERS ; do
    H=$(getent passwd $u | cut -f 6 -d : )
    cat $H/.ssh/id_rsa.pub >> $T
    cat $H/.ssh/id_rsa.pub >> $T2
done
chown $VIRTUSER.$VIRTUSER $R/home/$VIRTUSER/.ssh $T
chown root.root $R/root/.ssh $T2

# Start it
# Use sudo to bypass file descriptor problems
sudo lxc-start -n $NAME -d
sleep 1

if ! [ -z "$PUPPETMASTER" ] ; then
# Install packages
run2 apt-get -y install puppet

# Clear any existing certificate
runmaster puppet cert clean $NAME.$DOMAIN

# Fix puppet config
T=$R/etc/default/puppet
mv $T $T.old
cat $T.old \
| sed 's/START=no/START=yes/' \
| sed "s/DAEMON_OPTS=\"\"/DAEMON_OPTS=\"--server=$PUPPETMASTER --verbose\"/" \
> $T
rm -rf $T.old

run2 puppet agent --server=$PUPPETMASTER --no-daemonize --onetime

# sign the certificate
runmaster puppet cert --sign $NAME.$DOMAIN

run2 /etc/init.d/puppet start
fi

cat << _KOKO

LXC virtual box is ready!

Config file is at: $R0/config
Root fs is at: $R

Get a console with:
lxc-console -n $NAME

Stop it with:
lxc-stop -n $NAME

Start it with:
lxc-start -n $NAME -d

_KOKO
[/code]
Update: You can use the above code under the GPLv3 license.

Tuesday, 13 March 2012

TalkTalk traffic interception

Recently I was really annoyed by my ISP (TalkTalk in the UK).

In short: they are intercepting traffic and doing deep packet inspection without any warning or approval.

But wait, there's more: they monitor web traffic (read: the actual data) and, after intercepting an HTTP request, they replay it (yes... they re-issue the request themselves).

Here's an example:
78.149.130.80 - - [12/Mar/2012:22:47:23 +0100]
 "GET /korokokokokoLALALALA HTTP/1.1" 404 536 "-"
"Wget/1.13.4 (linux-gnu)"

62.24.252.133 - - [12/Mar/2012:22:47:55 +0100]
 "GET /korokokokokoLALALALA HTTP/1.0" 404 498
"http://<removed>/korokokokokoLALALALA"
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0;
.NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022;
.NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2)"

The first request was performed by me; it was followed by a second one 32 seconds later (for the unbeliever: this is 100% reproducible). The IP address of the offender is always the same (62.24.252.133).

Digging a bit on this I found these:

There aren't enough bad things I can say about this, but here are a couple of them:

  • They shouldn't be doing this

  • They are a bunch of $#&@%)&#$%

  • It effectively doubles the web traffic that originates from the TalkTalk network

  • It can be really harmful, since a replayed request can trigger a destructive action (e.g. when GET is used where POST should have been)

  • It is completely unethical

  • It is performed even when you have all of the security features disabled

  • Did I mention that they are a bunch of @$#)*&%*#$?


In any case, I started exploiting this a bit.

Fortunately I have a domain under my control and a couple of servers, so I took one that was already running Apache and added this rule:

[code lang="shell"]
# Match packets sent from port 80 to the offender with the SYN flag
# clear (i.e. anything after the handshake) and answer with a TCP reset.
iptables -I OUTPUT -d 62.24.252.133 -p tcp --sport 80 \
    -m tcp --tcp-flags SYN '' -j REJECT --reject-with tcp-reset
[/code]

After that I ran this from a PC at home:

[code lang="shell"]
# Spawn 10 background workers; each requests 100 unique URLs,
# giving 1000 distinct requests for the offender to replay.
for b in $(seq 0 9) ; do
    ( (
        for i in $(seq 22${b}00 22${b}99) ; do
            wget http://xxx.xxx.xxx/bad-talktalk-bad-bad-$i
        done
    ) & )
done
[/code]

The idea is to create an iptables rule at the server that matches outgoing TCP segments without the SYN flag (in other words: anything after the initial SYN and SYN+ACK of the handshake) and resets that connection. The result is that when that host (62.24.252.133) tries to re-fetch the page (i.e. replay the request):

  • It initiates the TCP connection by sending a SYN to the server, which is accepted

  • The server replies with a SYN+ACK, which the rule lets through (its SYN flag is set)

  • The offender receives the SYN+ACK and moves to the ESTABLISHED state, sending an ACK plus some data

  • The server receives the data and tries to respond with an ACK

  • The iptables rule takes effect, dropping that ACK and feeding a reset back to the server's own stack


After the above steps the server tears down the connection abnormally, while the offender is left with what it believes is a fully open connection and keeps trying to send the data.

The rationale behind this was to exhaust the offender's TCP ports by creating more than 60,000 connections.
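As a back-of-the-envelope check (the numbers below are the typical Linux defaults, not anything measured on the offender's box), the pool of ephemeral source ports a single IP can draw from is easy to compute; with the default range it's closer to 28,000, so reaching 60,000+ connections would need the full port space:

```shell
# Typical Linux default ephemeral port range (cf.
# /proc/sys/net/ipv4/ip_local_port_range); assumed, not measured.
low=32768
high=61000
echo $((high - low + 1))    # ports usable per source IP with the defaults
```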

The good news is that the theory works. The bad news is that the offender is very slow (most probably on purpose). The good news after that is that it keeps a backlog of the connections and tries to perform them all (or at least a big part of them).

My tests showed the offender trying to perform connections up to an hour after that.

Here are the results of iptables' accounting:
Chain OUTPUT (policy ACCEPT 75792 packets, 16M bytes)
pkts bytes target     prot opt in     out     source               destination
14600  584K REJECT     tcp  --  *      *       0.0.0.0/0            62.24.252.133

Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target     prot opt in   out     source               destination
28462 6088K ACCEPT     tcp  --  *      *       62.24.252.133        0.0.0.0/0

The first one is the OUTPUT rule. The second one is a rule I added for accounting purposes.

So we ended up with 14,600 resets sent and about 28,400 packets received from the offender (note: I only created about 3,000 connections).

The day is over but I have other plans as well:

  • Create a bunch (e.g. 100) of alternate DNS names for the server and perform the requests against them, pushing the offender to perform more requests in parallel

  • Write a Python program that operates at both the client and the server side, constructing fake TCP packets with predefined sequence numbers and fake IP addresses, so that the offender believes it has to follow more than one user

  • Study how intelligent the offender is. It may only inspect packet payloads for HTTP requests instead of fully reconstructing the TCP stream; in that case I'd only have to send a large number of data packets containing HTTP requests

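The first of these plans could be sketched as follows; the tt-N.example.org names are hypothetical placeholders for the real aliases, and the actual wget line is left commented out so the sketch only builds the target list:

```shell
# Build one unique URL per (hypothetical) DNS alias of the server.
urls=$(for i in $(seq 0 99) ; do
    echo "http://tt-$i.example.org/bad-talktalk-$i"
done)
echo "$urls" | wc -l    # number of distinct targets
# for u in $urls ; do wget -q -O /dev/null "$u" & done ; wait
```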


Friday, 27 January 2012

Quick fix for X.org screensaver bypass

This vulnerability is quite annoying if you lock your desktop at work or anywhere else.

In short, anyone can kill X.org's screensaver lock by simply pressing Ctrl-Alt-* or Ctrl-Alt-/ (both * and / need to be from the keypad).

A workaround that was posted suggests modifying system files. If you don't want to do that (like me, for various reasons), you can do it on the fly.

Put the following script in a file and make it run whenever you log in to your X session (e.g. by putting it in ~/.kde/Autostart/ if you're using KDE):

[code lang="shell"]
#!/bin/bash

# Dump the current keymap, strip the keypad-* (KPMU) and keypad-/ (KPDV)
# key definitions, and load the result back into the X server.
xkbcomp :0 - > /tmp/xkbcomp
cat /tmp/xkbcomp \
| sed -n '/key <KPMU> {/,/^ *}/ !p' \
| sed -n '/key <KPDV> {/,/^ *}/ !p' \
> /tmp/xkbcomp.new
xkbcomp /tmp/xkbcomp.new :0
[/code]

On each login, this will get rid of the offending xkb entries.
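For the curious, the sed -n '/start/,/end/ !p' idiom prints every line outside the start..end range, which effectively deletes the block. A toy demonstration on made-up input (the real input comes from xkbcomp):

```shell
# Feed a fake two-key snippet through the same filter used above;
# only the non-KPMU block should survive.
printf '%s\n' \
    'key <KPMU> {' '    [ XF86_Ungrab ]' '};' \
    'key <AB01> {' '    [ z ]' '};' \
| sed -n '/key <KPMU> {/,/^ *}/ !p'
```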

Friday, 6 January 2012

fix for radeon + opensource driver + kde effects = crash

The problem


Kwin crashes when OpenGL effects are enabled. It doesn't crash immediately, but it does crash after specific actions, so it's 100% reproducible; for example, when exiting the desktop-grid effect.

The situation


I'm using:

  • Radeon 4870 graphics card (RV770)

  • Kernel 3.1.5 (but seems irrelevant)

  • Open source ATI driver with KMS using Gallium

  • Xorg 1.11.2.902 (but happened with previous versions)

  • MESA 7.11.2

  • KDE 4.7.4 from Debian

  • DRM 2.4.29

  • xserver radeon driver 6.14.3


I'm not using the blur effect.

The solution


cd to ~/.kde/env/ (create it if it doesn't exist).

Then create a file named gl.sh (or any other name) with the following contents (execute permissions shouldn't be needed):

[code]
#!/bin/bash

export LIBGL_ALWAYS_INDIRECT=1
[/code]

The shebang line shouldn't be needed, as this file most probably gets sourced rather than executed, but it won't hurt.

The drawback


Every GL app you use will inherit LIBGL_ALWAYS_INDIRECT from the environment, which may cause problems. If you want to play (for example) a game, open a terminal and run:

[code]
unset LIBGL_ALWAYS_INDIRECT
nexuiz # or whichever opengl app you want to launch
[/code]
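To avoid unsetting the variable by hand every time, a small wrapper can strip it for a single program (the direct_gl name is mine, not something KDE provides):

```shell
# Run one command with LIBGL_ALWAYS_INDIRECT removed from its
# environment, leaving the rest of the session untouched.
direct_gl() {
    env -u LIBGL_ALWAYS_INDIRECT "$@"
}

# e.g.: direct_gl nexuiz
direct_gl sh -c 'echo "${LIBGL_ALWAYS_INDIRECT:-unset}"'    # prints "unset"
```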
Note: Firefox is one of the applications that may use GL.

Monday, 2 January 2012

Big nfs_inode_cache

The story


Boxes with various kernel versions were having weird free-memory problems. After examining the memory usage, the memory used by the processes didn't add up to the total memory actually in use.

Taking a look at /proc/meminfo we see something like this:

[code]
MemTotal:      8161544 kB
MemFree:        115676 kB
Buffers:          3900 kB
Cached:         200520 kB
SwapCached:      42336 kB
Active:         546824 kB
Inactive:       138336 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      8161544 kB
LowFree:        115676 kB
SwapTotal:     2096472 kB
SwapFree:       547480 kB
Dirty:            1020 kB
Writeback:           0 kB
AnonPages:      453480 kB
Mapped:          66928 kB
Slab:          7250176 kB
PageTables:      75408 kB
...
[/code]

Notice that Slab is about 7.2GB: almost the whole memory (8GB)!

Slab is the kernel memory and we can see where it is allocated by examining /proc/slabinfo. Here's an excerpt:

[code]
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_direct_cache       0      0    136   28    1 : tunables  120   60    8 : slabdata      0      0      0
nfs_write_data        62     63    832    9    2 : tunables   54   27    8 : slabdata      7      7      0
nfs_read_data        215    297    832    9    2 : tunables   54   27    8 : slabdata     33     33     54
nfs_inode_cache   5384386 5399040   1032    3    1 : tunables   24   12    8 : slabdata 1799680 1799680     40
nfs_page             534    750    128   30    1 : tunables  120   60    8 : slabdata     25     25    264
rpc_buffers            8      8   2048    2    1 : tunables   24   12    8 : slabdata      4      4      0
...
[/code]

Notice the nfs_inode_cache: about 5.4 million objects of 1032 bytes each, adding up to roughly 5.3GiB.
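The same arithmetic can be done for any cache straight from /proc/slabinfo (num_objs times objsize); here the one-liner is fed the sample nfs_inode_cache line from above instead of the live file, which also carries header lines that would need skipping:

```shell
# Multiply num_objs (field 3) by objsize (field 4) to get the bytes
# held by a cache; input is the sample line, not the live file.
echo 'nfs_inode_cache 5384386 5399040 1032 3 1 : tunables 24 12 8 : slabdata 1799680 1799680 40' \
| awk '{ printf "%s: %.1f MB\n", $1, $3 * $4 / 1024 / 1024 }'
# nfs_inode_cache: 5313.7 MB
```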

The workaround


Looking around a bit on the internet, this is most probably a bug. Fortunately there are two workarounds: a slow one and a fast one.

Slow workaround: log in to the box and run "sync", then leave it alone while the nfs_inode_cache memory goes down. It may take a couple of minutes before it starts dropping, there may be pauses in the process, and it can take more than an hour to free the memory.

Fast workaround: Login to that box and run:

[code]
# sync
# echo 2 > /proc/sys/vm/drop_caches
[/code]

I'm not sure why the slow one works, but it looks like it triggers a chain reaction that frees the memory. (As for the fast one: writing 2 to /proc/sys/vm/drop_caches asks the kernel to drop reclaimable dentries and inodes; 1 drops the page cache; 3 does both.)