Jun 19, 2014

perf and stack traces

I was wondering why perf record -g doesn't show proper stack traces for my programs in the production environment. At first I thought the kernel was too old, but after a few experiments I found out that this wasn't the case. The problem was that when you compile with optimizations (-O3), gcc automatically omits frame pointers, and it is not easy to unwind stack traces without them. But gcc can do it somehow. So I kept digging and stumbled upon an article where people bash one of the perf authors for not being able to unwind stacks without frame pointers.

If you read it thoroughly, you will find that gcc uses DWARF debugging information for stack unwinding, but that approach is too slow for profilers.
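
To get a feel for unwinding that does not depend on frame pointers, here is a minimal sketch, assuming Linux with glibc. As far as I understand, on x86-64 glibc's backtrace() walks the stack through the unwinder in libgcc, which reads DWARF-style unwind tables (.eh_frame), so it should keep working in a binary built with -fomit-frame-pointer. The helper name count_stack_frames and the MAX_FRAMES limit are mine, chosen only for illustration.

```c
#include <execinfo.h>

#define MAX_FRAMES 64

/*
 * Hypothetical helper: returns how many return addresses backtrace()
 * could collect for the current call stack (at most MAX_FRAMES).
 * noinline keeps this function as a real frame on the stack.
 */
__attribute__((noinline)) int count_stack_frames(void)
{
    void *return_addresses[MAX_FRAMES];

    /* backtrace() fills the array and returns the number of entries stored. */
    return backtrace(return_addresses, MAX_FRAMES);
}
```

Calling count_stack_frames() from a few nested functions and printing the result is enough to see that the unwinder finds the callers. A profiler has to do this kind of table lookup on every sample, which is where the cost comes from.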

So I wanted to know how much slower my programs would be if I included frame pointers. It seems that the performance loss is negligible:
I tested two MySQL builds, one built with ‘-O3 -g -fno-omit-frame-pointer’ and other with -fomit-frame-pointer instead – and performance difference was negligible. It was around 1% in sysbench tests, and slightly over 3% at tight-loop select benchmark(100000000,(select asin(5+5)+sin(5+5))); on a 2-cpu Opteron box.
So I will try including frame pointers from now on. To include them, pass the -fno-omit-frame-pointer option to gcc when building the executable.

If you are interested in what frame pointers are and how they work, I would recommend reading the answer to this Stack Overflow question.
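
As a rough illustration of what a frame pointer gives a profiler, here is a minimal sketch, assuming x86-64 and GCC or Clang. On that platform the frame pointer register points at a small record on the stack: the caller's saved frame pointer, followed by the return address. The struct frame_record and the function frame_record_matches are names I made up for this example; the two builtins are compiler extensions.

```c
/* Layout of a frame record on x86-64 (an assumption of this sketch). */
struct frame_record {
    const struct frame_record *caller_frame; /* saved frame pointer of the caller */
    void *return_address;                    /* where execution resumes in the caller */
};

/*
 * Hypothetical helper: returns 1 if the return address stored next to the
 * frame pointer is the same one the compiler reports for this function.
 */
__attribute__((noinline)) int frame_record_matches(void)
{
    /* Using this builtin makes the compiler set up a frame pointer here. */
    const struct frame_record *fp = __builtin_frame_address(0);

    return fp->return_address == __builtin_return_address(0);
}
```

A profiler that relies on frame pointers reads return_address and then follows caller_frame to the next record, again and again, which costs just two memory loads per frame. That chain exists only if every function on the stack was built with frame pointers, so this sketch looks at its own frame and does not try to walk further.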

May 2, 2014

Who's calling?

Sometimes your program uses a lot of system time. Let's say 90%. You fire up your favorite profiling tool and it tells you which system call it is. If you are experienced and maybe lucky, you can say straight away which part of your program is to blame. But it's not always so obvious.

GDB comes to the rescue. You can use 'catch syscall <syscall>' and it will break whenever that particular syscall is called. Then you can use 'bt' to find where in your code the syscall is being called from.

P.S. Don't use random() in a multi-threaded program. Use random_r().
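
For the record, here is a minimal sketch of how random_r() can be used, assuming glibc. My understanding is that random() keeps one shared state behind a lock, so threads end up contending on it, while random_r() works on a state buffer supplied by the caller. The struct thread_rng and both helper functions are hypothetical names for this example; initstate_r() and random_r() are the actual glibc calls.

```c
#define _DEFAULT_SOURCE /* exposes random_r() and initstate_r() in glibc */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical per-thread generator. Each thread owns exactly one of these.
 * Do not copy the struct after initialization: data points into statebuf.
 */
struct thread_rng {
    struct random_data data;
    char statebuf[64];
};

/* Seeds the generator. Returns 0 on success, -1 on error. */
static int thread_rng_init(struct thread_rng *rng, unsigned int seed)
{
    /* glibc expects random_data to be zeroed before initstate_r(). */
    memset(rng, 0, sizeof(*rng));
    return initstate_r(seed, rng->statebuf, sizeof(rng->statebuf), &rng->data);
}

/* Returns the next pseudo-random number in the range 0..RAND_MAX. */
static int32_t thread_rng_next(struct thread_rng *rng)
{
    int32_t value = 0;

    random_r(&rng->data, &value);
    return value;
}
```

Each thread would call thread_rng_init() once with its own seed and then thread_rng_next() wherever it used to call random(). Since no state is shared, no locking is involved.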

perf on the latest Linux kernel

I have been playing today with perf on the latest kernel available in Fedora (3.13). These are just some observations and thoughts.

perf trace

This tool is similar to strace, but with almost no overhead. Unlike strace, perf trace can watch syscalls system-wide, or only those made by processes owned by a certain user.

perf top

Function names are shown correctly and the -g parameter is now available. This means you can get call traces in real time (without perf record).

perf timechart

This tool creates cool-looking diagrams of the system workload.

perf sched

This tool shows you very verbose scheduler statistics, including latencies for a particular process. You can see whether your server was unavailable because of the scheduler.

I don't know what for, but perf sched can also replay a previously recorded system load. Maybe it is useful for scheduler development and optimization.

perf mem

This tool collects memory requests. For example, it can show you whether a memory request was served from the CPU cache or not.

perf lock

This tool presumably allows you to collect information about various locks in your process and in the kernel, but it doesn't work on the latest Fedora kernel because that kernel is built without the required options:

$ sudo perf lock record -p 12727
tracepoint lock:lock_acquire is not enabled. Are CONFIG_LOCKDEP and CONFIG_LOCK_STAT enabled?