Sep 22, 2016

ssh_exchange_identification connection reset by peer problem

At Badoo we have a distributed system that uses ssh on startup to connect to each node and exchange some data.

Sometimes I've seen ssh fail with the error:

ssh_exchange_identification: read: Connection reset by peer

Usually, if you see this error, it means that the server didn't like your certificate or that you were banned from accessing it.

The best way to debug this problem is to add -vvv when running the ssh client to increase verbosity, and to do the same on the server.

If your Linux distribution uses systemd, then the best way to look into the sshd logs is the journalctl -f command, or more specifically journalctl -u sshd.service -f.

To increase sshd verbosity, edit /etc/ssh/sshd_config, add LogLevel DEBUG3, and then restart sshd with sudo systemctl restart sshd.service.
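On a systemd-based distribution the whole setup might look something like this (user@host is just a placeholder):

# on the server, after adding LogLevel DEBUG3 to /etc/ssh/sshd_config
sudo systemctl restart sshd.service
journalctl -u sshd.service -f

# on the client
ssh -vvv user@host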

Unfortunately, the logs didn't show anything useful.

But I had a hunch that it was somehow connected to the fact that our distributed system makes a lot of ssh connections very rapidly.

While browsing through the sshd_config man page I found a config variable that could be to blame. And I was right.

If you happen to have the same problem I did, try increasing the MaxStartups variable and restarting the sshd service.

I've increased it to 100:30:600 from the default 10:30:60, and the connection resets vanished.
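For reference, the three colon-separated values are start:rate:full: sshd starts randomly refusing new unauthenticated connections with a probability of rate/100 once there are start of them, and refuses all of them once there are full. The relevant line in /etc/ssh/sshd_config looks like this:

MaxStartups 100:30:600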

It is unfortunate that ssh and sshd logs gave me no clue about the problem.

Aug 12, 2015

Memory Mappings limit

We have a very strange memory corruption problem in one of our daemons. As one of the many approaches to finding the bug, we built this daemon without jemalloc, the memory allocator library that we use in all our projects, and pushed it to production.

A few hours after the release the monitoring team reported an unusual amount of errors returned from the system. It turned out that a few percent of the requests returned an "out of memory" error. It was strange because there was a lot of memory available in the system.

The output of the strace utility showed that the mprotect() call that we use to install a guard page for a coroutine's stack returns an error:

strace -emprotect -ttt -T -p 25754 -c

1438269965.801936 mmap(NULL, 69632, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f73baf9c000 <0.000092>
1438269965.802065 mprotect(0x7f73baf9c000, 4096, PROT_NONE) = -1 ENOMEM (Cannot allocate memory) <0.000005>

At first we thought that this could not happen because mmap returned without an error, and that something was wrong with the kernel we were using, especially because this particular machine has a very old kernel:

# uname -a
Linux cppbig31 3.0.82-0.7.99-default #1 SMP Thu Jun 27 13:19:18 UTC 2013 (6efde93) x86_64 x86_64 x86_64 GNU/Linux

But brief googling and the kernel source gave us a much simpler answer: the memory mappings limit. An mmap() call creates a mapping, and mprotect() can split one mapping into two. And we have a lot of these calls.
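To illustrate, here is a simplified sketch of a guard-page setup like ours (the sizes and names are made up for the example). Every stack allocated this way costs two entries in /proc/<pid>/maps, an inaccessible guard page plus the usable stack, so the mapping count climbs quickly:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate a coroutine stack with an inaccessible guard page at its bottom.
 * mmap() creates one mapping; mprotect() on the first page splits it in two. */
static void *alloc_stack_with_guard(size_t stack_size)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, stack_size + page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* This is the call that fails with ENOMEM once the process hits
     * vm.max_map_count: the split would create one mapping too many. */
    if (mprotect(p, page, PROT_NONE) != 0) {
        perror("mprotect");
        munmap(p, stack_size + page);
        return NULL;
    }
    return (char *)p + page; /* usable stack starts above the guard page */
}

int main(void)
{
    for (int i = 0; i < 100; i++)
        if (alloc_stack_with_guard(64 * 1024) == NULL)
            return 1;
    printf("inspect /proc/%d/maps\n", (int)getpid());
    pause(); /* keep the process around so the mappings can be inspected */
    return 0;
}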

root@cppbig31.ulan:~# cat /proc/25754/maps | wc -l
root@cppbig31.ulan:~# cat /proc/sys/vm/max_map_count

The libc memory allocator seems to be another source of mappings, because a similar daemon with jemalloc doesn't use nearly as many:

# cat /proc/5667/maps | wc -l

This is how one can increase the limit:

echo "655300" > /proc/sys/vm/max_map_count

Jun 19, 2014

perf and stack traces

I was wondering why perf record -g doesn't show proper stack traces for my programs in the production environment. First I thought that the kernel was too old, but after performing a few experiments I found out that this wasn't the case. The problem was that when you compile with optimizations (-O3), gcc automatically omits frame pointers, and it is not easy to unwind stack traces without frame pointers. But gcc can do it somehow. So I continued digging and stumbled upon an article where people bash one of the perf authors for not being able to unwind stacks without frame pointers.

If you read it thoroughly, you can find out that gcc uses DWARF debugging information for stack unwinding, but it's too slow for profilers.

So I wanted to know how much slower my programs would be if I included frame pointers. It seems that the performance losses are negligible:
I tested two MySQL builds, one built with ‘-O3 -g -fno-omit-frame-pointer’ and other with -fomit-frame-pointer instead – and performance difference was negligible. It was around 1% in sysbench tests, and slightly over 3% at tight-loop select benchmark(100000000,(select asin(5+5)+sin(5+5))); on a 2-cpu Opteron box.
So I will try including frame pointers from now on. To include them, use the -fno-omit-frame-pointer gcc parameter when building the executable.
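As a sketch, the whole workflow might look like this (prog.c and ./prog are placeholders):

gcc -O3 -g -fno-omit-frame-pointer -o prog prog.c
perf record -g ./prog
perf report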

If you are interested in what frame pointers are and how they work, I would recommend reading the answer to this stackoverflow question.

May 2, 2014

Who's calling?

Sometimes your program uses a lot of system time, let's say 90%. You fire up your favorite profiling tool and it tells you which system call it is. If you are experienced and maybe lucky, you can say straight away which part of your program is to blame. But it's not always so obvious.

GDB comes to the rescue. You can use 'catch syscall <syscall>' and it will break when that particular syscall is called. Then you can use 'bt' to find out where in your code the syscall is being called from.
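A sketch of such a session (the pid and the syscall name are arbitrary); continue runs until the catchpoint fires, and bt then shows the calling code:

$ gdb -p 12345
(gdb) catch syscall gettimeofday
(gdb) continue
(gdb) bt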

P.S. Don't use random() in a multi-threaded program; it serializes your threads on an internal lock. Use random_r() with per-thread state instead.
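A minimal sketch of per-thread state with random_r() (glibc-specific; the seed and buffer size are arbitrary):

#define _GNU_SOURCE          /* for random_r()/initstate_r() in glibc */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Each thread should own one of these instead of sharing the
     * global state that random() protects with a lock. */
    struct random_data rd = {0};   /* must be zeroed before initstate_r() */
    char statebuf[64];
    int32_t value;

    if (initstate_r(12345, statebuf, sizeof(statebuf), &rd) != 0)
        return 1;
    random_r(&rd, &value);
    printf("%d\n", (int)value);
    return 0;
}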

perf on latest linux kernel

I have been playing today with perf on the latest kernel available in Fedora (3.13). These are just some observations and thoughts.

perf trace

This tool is similar to strace, but almost without overhead. Contrary to strace, you can use perf trace to watch syscalls system-wide or syscalls generated by processes owned by a certain user.
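For example, something like this traces the syscalls of one running process (the pid is a placeholder):

sudo perf trace -p <pid>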

perf top

Function names are shown correctly and the -g parameter is now available. It means that you can get call traces in real time (without perf record).

perf timechart

This tool creates cool-looking diagrams of the system workload.
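Usage sketch: record the workload for a while, then turn the recording into an SVG diagram:

sudo perf timechart record -- sleep 10
sudo perf timechart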

perf sched

This tool shows you very verbose system scheduler statistics, including latencies for a particular process. You can see whether your server was unresponsive because of the scheduler.

I don't know what for, but perf sched can also replay a previously recorded system load. Maybe it is useful for scheduler development and optimization.
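A usage sketch: record scheduler events for a while, then look at the latency report (or replay the recorded load):

sudo perf sched record -- sleep 10
sudo perf sched latency
sudo perf sched replay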

perf mem

This tool collects memory accesses. For example, it can show you whether a memory access was served by the CPU cache or not.
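Usage sketch (./prog is a placeholder):

sudo perf mem record ./prog
sudo perf mem report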

perf lock

This tool presumably allows you to collect information about various locks in your process and in the kernel, but it's not working on the latest Fedora kernel because the kernel lacks a particular option:

$ sudo perf lock record -p 12727
tracepoint lock:lock_acquire is not enabled. Are CONFIG_LOCKDEP and CONFIG_LOCK_STAT enabled?

Feb 9, 2013

Uninterruptible sleep (D state)

Sometimes you will see processes on your Linux box that are in the D state as shown by ps, top, htop or similar tools. D means uninterruptible sleep. As opposed to normal sleep, you can't do anything with these processes (e.g. kill them).
Additional information can be found on LWN.

Usually this means that the process is stuck reading or writing something to disk or NFS and the remote side does not answer. But not always. I've stumbled upon processes that periodically jump into the D state and back. These are processes under heavy network load. Strange...

While I haven't been successful in finding out what is going on yet, I would like to share some pointers on how to dig deeper into these processes.


The familiar ps tool can tell us in what kernel function our process is sleeping.

# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32

Adding some bash magic will help us deal with a process that is constantly jumping between the D, S and R states:

# for x in $(seq 10000); do ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep D | grep laccess | awk '{print $7}' | grep -v "-"; done;

OK, so our process is sleeping in a kernel function called lock_sock_nested.


We can force the Linux kernel to dump backtraces of processes that are in the D state. The information can be seen via the dmesg utility or in the /var/log/messages file.

echo w > /proc/sysrq-trigger

Again, with additional bash magic our command looks like this:

echo w > /proc/sysrq-trigger; dmesg -c | less;

This is what the output looks like:

[688596.998782] SysRq : Show Blocked State
[688596.999054]   task                        PC stack   pid father
[688596.999101] laccessd      D 00000000ffffffff     0 26940  25632 0x00000000
[688596.999104]  ffff88090885fdd8 0000000000000086 ffff8800282d3700 ffff88125a5b2380
[688596.999107]  0000000000013700 ffff88090885ffd8 0000000000013700 ffff88090885ffd8
[688596.999109]  0000000000013700 0000000000013700 0000000000013700 0000000000013700
[688596.999111] Call Trace:
[688596.999113] Inexact backtrace:
[688596.999122]  [<ffffffff810467de>] ? try_to_wake_up+0x1be/0x400
[688596.999128]  [<ffffffff8139cb13>] ? schedule_timeout+0x143/0x240
[688596.999133]  [<ffffffff81059a30>] ? process_timeout+0x0/0x10
[688596.999136]  [<ffffffff81135b23>] ? ep_poll+0x153/0x240
[688596.999139]  [<ffffffff81046a20>] ? default_wake_function+0x0/0x10
[688596.999141]  [<ffffffff81135ce4>] ? sys_epoll_wait+0xd4/0xe0
[688596.999146]  [<ffffffff81002f7b>] ? system_call_fastpath+0x16/0x1b

Jun 30, 2012

Searching for the positions of set bits in a variable

In a daemon that I created at Badoo I need to search for the positions of set bits in a 32-bit variable. The simplest solution is to look at each bit, but there is a faster solution using the GCC built-in function __builtin_ctzl:
Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
So I wrote a test to compare the speed of these two approaches:
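The original test isn't reproduced here, but a minimal sketch of the two approaches might look like this (the function names are mine):

#include <stdint.h>
#include <stdio.h>

/* Naive approach: check every bit position. */
static void print_set_bits_naive(uint32_t x)
{
    for (int i = 0; i < 32; i++)
        if (x & (UINT32_C(1) << i))
            printf("%d ", i);
    printf("\n");
}

/* Faster approach: jump straight to the lowest set bit with
 * __builtin_ctzl(), then clear it and repeat. This variant never shifts
 * by 32, so it sidesteps the corner case described below. */
static void print_set_bits_ctz(uint32_t x)
{
    while (x) {
        printf("%d ", __builtin_ctzl(x));
        x &= x - 1;   /* clear the lowest set bit */
    }
    printf("\n");
}

int main(void)
{
    print_set_bits_naive(0x80000401u);   /* expected: 0 10 31 */
    print_set_bits_ctz(0x80000401u);
    return 0;
}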

Here is the result of testing on my VirtualBox:

While writing the test, I stumbled upon an issue with bit shifting. It turns out that if you right shift a 32-bit variable by 32 (the full width of the type), you get not 0 but the same number: shifting by the width of the type is undefined behavior in C, and on x86 the shift count is simply masked to the low 5 bits. So I have an additional if statement for the case when I have to shift by 32.