Feb 9, 2013

Uninterruptible sleep (D state)

Sometimes you will see processes on your linux box that are in D state as shown by ps, top, htop or similar. D means uninterruptible sleep. As opposed to normal sleep, you can't do anything with these processes (i.e. kill them).
Additional information could be found on LWN.

Usually this means that process is stuck reading or writing something to disk or NFS and remote site does not answer. But not always. I'v stumbled upon processes that periodically jump into D state and back. These are processes under heavy network load. Strange...

While not been successful in finding out what is going on yet, I would like to share pointers how to dig deeper into these processes.

ps

Familiar tool ps can tell us in what function inside kernel our process is sleeping.

# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32

Adding some bash magic will help us dealing with process that is constantly jumping between D, S and R states:

# for x in $(seq 10000); do ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep D | grep laccess | awk '{print $7}' | grep -v "-"; done;
lock_sock_nested
lock_sock_nested
^C

Ok, so our process is sleeping in kernel function called lock_sock_nested.

sysrq

We can force linux kernel to dump backtraces for processes that are in D state. Information could be seen via dmesg utility or in /var/log/messages file.

echo w > /proc/sysrq-trigger

Again, with additional bash magic our command looks like this:

echo w > /proc/sysrq-trigger; dmesg -c | less;

This is how output looks like:

[688596.998782] SysRq : Show Blocked State
[688596.999054]   task                        PC stack   pid father
[688596.999101] laccessd      D 00000000ffffffff     0 26940  25632 0x00000000
[688596.999104]  ffff88090885fdd8 0000000000000086 ffff8800282d3700 ffff88125a5b2380
[688596.999107]  0000000000013700 ffff88090885ffd8 0000000000013700 ffff88090885ffd8
[688596.999109]  0000000000013700 0000000000013700 0000000000013700 0000000000013700
[688596.999111] Call Trace:
[688596.999113] Inexact backtrace:
[688596.999113] 
[688596.999122]  [<ffffffff810467de>] ? try_to_wake_up+0x1be/0x400
[688596.999128]  [<ffffffff8139cb13>] ? schedule_timeout+0x143/0x240
[688596.999133]  [<ffffffff81059a30>] ? process_timeout+0x0/0x10
[688596.999136]  [<ffffffff81135b23>] ? ep_poll+0x153/0x240
[688596.999139]  [<ffffffff81046a20>] ? default_wake_function+0x0/0x10
[688596.999141]  [<ffffffff81135ce4>] ? sys_epoll_wait+0xd4/0xe0
[688596.999146]  [<ffffffff81002f7b>] ? system_call_fastpath+0x16/0x1b