Discussion:
[sr-dev] [kamailio/kamailio] Random segfault or general protection crash on siptrace.so module (#2718)
Joshua Riffle
2021-04-28 17:49:12 UTC
Permalink
### Description
After upgrading from Kamailio 5.2.x, a high volume Kamailio 5.4.4 instance randomly crashes with either a general protection or segfault error message in `siptrace.so` during use of `sip_trace` function from one of its child processes (which cascades to the parent crashing). This appears to occur once about every 36 hours on average, but has not yet appeared to correspond with any particular event.

We are continuing to collect debug information and will be populating this ticket as more information becomes available. However, this issue has been observed.

Sip trace function is applied in this example snippet:
```
# ------- siptrace --------
modparam("siptrace", "hep_mode_on", 1)
modparam("siptrace", "hep_version", 3)
modparam("siptrace", "trace_to_database", 0)
modparam("siptrace", "trace_flag", 22)
modparam("siptrace", "trace_on", 1)

request_route {
#....
if ( is_method("INVITE") && !has_totag() ) {
# Only start sip_trace on initial INVITE
sip_trace("HEP_URL","$ci-MY_IP","d");
}
setflag(22);
#...
}
```


### Troubleshooting
We attempted packet collection with Homer v5 and Homer v7 and changed between HEP protocol v2 and v3.

#### Reproduction
We have not determined a means of reproducing this issue without simply letting the server run until a crash occurs. There are four almost identical servers all experiencing the same random crashing but not at the same time.

#### Debugging Data
Our next troubleshooting case will be to simply comment out the `sip_trace` function, but this effectively disables the `siptrace` module completely rather than addressing an underlying problem.

Core dumps are still in-progress for retrieval. Debug logs should also be more readily available soon. There will be delays since these are high volume production servers.

<!--
If you got a core dump, use gdb to extract troubleshooting data - full backtrace,
local variables and the list of the code at the issue location.

gdb /path/to/kamailio /path/to/corefile
bt full
info locals
list

If you are familiar with gdb, feel free to attach more of what you consider to
be relevant.
-->

#### Log Messages

All of them have randomly crashed with the following example log entry. Regardless of troubleshooting tactics to date:
```
kernel: traps: kamailio[7579] general protection ip:7fb1a64e2dbf sp:7ffc60f04180 error:0 in siptrace.so[7fb1a64b8000+4e000]
systemd: kamailio.service: main process exited, code=exited, status=1/FAILURE
systemd: Unit kamailio.service entered failed state.
systemd: kamailio.service failed.
```


#### SIP Traffic
To date, there is no corresponding SIP Traffic with the crash.

### Possible Solutions
To date, only disabling the `siptrace` module seems to be the solution.

### Additional Information

* **Kamailio Version** - output of `kamailio -v`

```
version: kamailio 5.4.4 (x86_64/linux) e16352
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: e16352
compiled on 15:56:46 Feb 15 2021 with gcc 4.8.5
```

* **Operating System**:

<!--
Details about the operating system, the type: Linux (e.g.,: Debian 8.4, Ubuntu 16.04, CentOS 7.1, ...), MacOS, xBSD, Solaris, ...;
Kernel details (output of `uname -a`)
-->

```
Linux <hostname> 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718
Daniel-Constantin Mierla
2021-04-28 18:45:28 UTC
Permalink
Getting `bt full` with gdb from all core dump files would be useful, otherwise is no lead at this moment where the issue can be.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-828693745
Joshua Riffle
2021-04-28 20:36:52 UTC
Permalink
Any hints on getting a core dump? I have been working my way through everything in this documentation on CentOS https://www.kamailio.org/wiki/tutorials/troubleshooting/coredumpfile and it appears to absolutely refuse everything ulimit, sysctl parameters, CORE_DUMP=yes, etc... I am sure I will find it eventually but in the meantime I am testing with "kill -ABRT"

ALERT: <core> [main.c:786]: handle_sigs(): core was not generated

I'll find it eventually I am sure...
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-828760513
Joshua Riffle
2021-04-28 21:07:02 UTC
Permalink
In case someone else runs into this issue -- it was far from transparent, but my best guess is it was caused by some combination of file permission issues. These are the various things on CentOS 7 that were tweaked to allow a core dump

The following combination of things were changed:

/etc/sysctl.d/99-sysctl.conf
```
kernel.core_pattern = /tmp/core.%e.%p.%h.%t
kernel.core_uses_pid=1
fs.suid_dumpable = 2
```

`sysctl --system` to reload kernel parameters

/etc/sysconfig/kamailio
```
DUMP_CORE=yes
```

May be required but unclear:
`ulimit -c unlimited`

Tested by running
`kill -ABRT <PID of Kamailio Child>`
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-828780347
Daniel-Constantin Mierla
2021-04-29 06:25:56 UTC
Permalink
@rampageai: thanks for the notes for CentOS, I added them to wiki:

* https://www.kamailio.org/wiki/tutorials/troubleshooting/coredumpfile#centos_core_dump_settings

The `ulimit` is useful if the core file is going to have a large size, it may be already set to a value large enough, however setting it to `unlimited` ensures the size won't hit any system limit.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-828972055
Joshua Riffle
2021-04-30 16:39:41 UTC
Permalink
Core dumps for parent and child using `bt full` in gdb.

NOTE: I am most definitely not a gdb ninja so I focused on running the full backtrace command and hoping the output of it provides some further input from anyone with additional knowledge.

PARENT:
```
(gdb) bt full
#0 0x00007efc7d7152c7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007efc7d7169b8 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x000000000041e300 in sig_alarm_abort ()
No symbol table info available.
#3 <signal handler called>
No symbol table info available.
#4 0x00007efc7d7c1ec7 in sched_yield () from /lib64/libc.so.6
No symbol table info available.
#5 0x00000000004f72ad in timer_del_safe ()
No symbol table info available.
#6 0x00007efc765f5799 in unlink_timers () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#7 0x00007efc765f65e5 in free_cell_helper () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#8 0x00007efc765f8dec in free_hash_table () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#9 0x00007efc766a6b51 in tm_shutdown () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#10 0x000000000058916f in destroy_modules ()
No symbol table info available.
#11 0x000000000041cd77 in cleanup ()
No symbol table info available.
#12 0x000000000041e652 in shutdown_children ()
No symbol table info available.
#13 0x00000000004216fa in handle_sigs ()
No symbol table info available.
#14 0x000000000042b9b6 in main_loop ()
No symbol table info available.
#15 0x0000000000433f34 in main ()
No symbol table info available.
```

CHILD:
```
#0 0x00007efc73a6adbf in msg_send_buffer ()
from /usr/lib64/kamailio/modules/siptrace.so
No symbol table info available.
#1 0x00007efc73a7371d in trace_send_hep2_duplicate ()
from /usr/lib64/kamailio/modules/siptrace.so
No symbol table info available.
#2 0x00007efc73a6dfa9 in trace_send_hep_duplicate ()
from /usr/lib64/kamailio/modules/siptrace.so
No symbol table info available.
#3 0x00007efc73a4c517 in sip_trace_store ()
from /usr/lib64/kamailio/modules/siptrace.so
No symbol table info available.
#4 0x00007efc73a5c760 in trace_onreply_out ()
from /usr/lib64/kamailio/modules/siptrace.so
No symbol table info available.
#5 0x00007efc766b1ec2 in run_trans_callbacks_internal ()
from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#6 0x00007efc766b2147 in run_trans_callbacks_off_params ()
from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#7 0x00007efc7664a1b2 in _reply_light () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#8 0x00007efc7664b1fb in _reply () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#9 0x00007efc766570aa in t_reply_str_unsafe ()
No symbol table info available.
#10 0x00007efc766cc03e in ki_t_reply () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#11 0x00007efc766cc8d4 in w_t_reply () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#12 0x000000000046617e in do_action ()
No symbol table info available.
#13 0x000000000047341b in run_actions ()
No symbol table info available.
#14 0x0000000000465fd7 in do_action ()
No symbol table info available.
#15 0x000000000047341b in run_actions ()
No symbol table info available.
#16 0x0000000000465fd7 in do_action ()
No symbol table info available.
#17 0x000000000047341b in run_actions ()
No symbol table info available.
#18 0x0000000000473b5f in run_top_route ()
No symbol table info available.
#19 0x00007efc7664e860 in run_failure_handlers ()
from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#20 0x00007efc76652edb in t_should_relay_response ()
from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#21 0x00007efc7665843d in relay_reply () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#22 0x00007efc7669cce3 in fake_reply () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#23 0x00007efc7669d150 in final_response_handler ()
from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#24 0x00007efc7669d211 in retr_buf_handler () from /usr/lib64/kamailio/modules/tm.so
No symbol table info available.
#25 0x00000000004f8c0e in slow_timer_main ()
No symbol table info available.
#26 0x000000000042a602 in main_loop ()
No symbol table info available.
#27 0x0000000000433f34 in main ()
No symbol table info available.
```
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-830218164
Daniel-Constantin Mierla
2021-04-30 17:20:25 UTC
Permalink
The one for parent is a side effect, due to the crash of the child process.

You have to install kamailio debug symbols package (kamailio-dbg or so) and then get again the `bt full`, it will contain more useful details. It should work with existing core files.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-830242018
Joshua Riffle
2021-04-30 20:18:55 UTC
Permalink
Acknowledged. I am learning the debug system better at least. I was able to install `kamailio-debuginfo` CentOS package from the repository, test by forcing a core dump, and prove that it does indeed carry more information in the full backtrace. I have to wait until one of my instances generates a random crash and core dump this following week, though, so it will be some time.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-830357761
Daniel-Constantin Mierla
2021-05-03 06:19:07 UTC
Permalink
You do not need to wait for new core dumps. As I wrote above, it should work with existing core files, after you installed debugging info package, just re-do the gdb commands and should get backtraces with more debugging details.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-831047254
Joshua Riffle
2021-05-03 14:22:30 UTC
Permalink
Thank you for pointing that out again. I did miss that. I attached the gdb logging file for the core dump of the child process (since it's more interesting) to this message. I really hope it's helpful.

[gdb-20210503.txt](https://github.com/kamailio/kamailio/files/6415372/gdb-20210503.txt)
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-831292323
Daniel-Constantin Mierla
2021-05-07 05:51:25 UTC
Permalink
I need further details from the core file, run the following commands in the gdb and paste the output here:

```
frame 0
list
p *dst
```
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-834086485
Joshua Riffle
2021-05-07 14:30:14 UTC
Permalink
Hi, here is what I got from running those commands in order:
```
(gdb) frame 0
#0 0x00007efc73a6adbf in msg_send_buffer (dst=0x7efc0a6064b8,
buf=0x7efc79a0e1e8 "\002\020\002\021\023\304\023\304\064\bɀ4\bɀ\330\037\214`\237\002\f", len=753, flags=1)
at ../../core/forward.h:220
220 if (unlikely((dst->send_sock==0) ||
(gdb) list
215 goto done;
216 }
217 #endif
218
219 if (likely(dst->proto==PROTO_UDP)){
220 if (unlikely((dst->send_sock==0) ||
221 (dst->send_sock->flags & SI_IS_MCAST))){
222 new_dst=*dst;
223 new_dst.send_sock=get_send_socket(0, &dst->to, dst->proto);
224 if (unlikely(new_dst.send_sock==0)){
(gdb) p *dst
$1 = {send_sock = 0x726177726f462d78, to = {s = {sa_family = 29540, sa_data = ": 70\r\nContent-"}, sin = {
sin_family = 29540, sin_port = 8250, sin_addr = {s_addr = 168636471}, sin_zero = "Content-"}, sin6 = {
sin6_family = 29540, sin6_port = 8250, sin6_flowinfo = 168636471, sin6_addr = {__in6_u = {
__u6_addr8 = "Content-Length: ", __u6_addr16 = {28483, 29806, 28261, 11636, 25932, 26478, 26740, 8250},
__u6_addr32 = {1953394499, 762605157, 1735288140, 540698740}}}, sin6_scope_id = 1426722096}, sas = {
ss_family = 29540,
__ss_padding = ": 70\r\nContent-Length: 0\r\nUser-Agent: Skyetel\r\n\r\n", '\000' <repeats 69 times>,
__ss_align = 0}}, id = 0, send_flags = {f = 0, blst_imask = 0}, proto = 1 '\001', proto_pad0 = 0 '\000',
proto_pad1 = 0}
```


--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-834456207
Joshua Riffle
2021-05-26 20:24:54 UTC
Permalink
Hi, any thoughts or updates on this issue? It looks roughly like there is garbled data making it into the networking.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/2718#issuecomment-849093295
Loading...