Understanding "corrupted size vs. prev_size" glibc error
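I have implemented a JNA bridge to FDK-AAC. The source code can be found here.
When benchmarking my code, I get hundreds of successful runs on the same input, and then occasionally a C-level crash kills the entire process and produces a core dump.
Looking at the core dump, the backtrace looks like this: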
#1 0x00007f3e92e00f5d in __GI_abort () at abort.c:90
#2 0x00007f3e92e4928d in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f3e92f70528 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007f3e92e5064a in malloc_printerr (action=<optimized out>, str=0x7f3e92f6cdee "corrupted size vs. prev_size", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5426
#4 0x00007f3e92e5304a in _int_free (av=0x7f3de0000020, p=<optimized out>, have_lock=0) at malloc.c:4337
#5 0x00007f3e92e5744e in __GI___libc_free (mem=<optimized out>) at malloc.c:3145
#6 0x00007f3e113921e9 in FDKfree (ptr=0x7f3de009df60) at libSYS/src/genericStds.cpp:233
#7 0x00007f3e1130d7d3 in Free_AacEncoder (p=0x7f3de0115740) at libAACenc/src/aacenc_lib.cpp:407
#8 0x00007f3e1130fbb3 in aacEncClose (phAacEncoder=0x7f3de0115740) at libAACenc/src/aacenc_lib.cpp:1395
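If I repeat the benchmark enough times, the crash reproduces with this same backtrace, but I'm struggling to understand what could cause this error. The memory at pointer 0x7f3de009df60 is allocated inside the C/C++ code, and I can guarantee that the instance being freed is the same one that was allocated. The benchmark is, of course, single-threaded.
After reading these:
security checks && internal functions
I still find it hard to understand: what real scenario (a bug, not an exploit) could lead to the error above? And why does it happen so rarely?
Current suspicion:
Running a detailed backtrace, I get this output: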
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
set = {__val = {4, 6378670679680, 645636045657660056, 90523359816, 139904561311072, 292199584, 139903730612120, 139903730611784, 139904561311088, 1460617926600, 47573685816, 4119199860131166208,
139904593745464, 139904553224483, 139904561311136, 288245657}}
pid = <optimized out>
tid = <optimized out>
#1 0x00007f3e92e00f5d in __GI_abort () at abort.c:90
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x7f3de026db10, sa_sigaction = 0x7f3de026db10}, sa_mask = {__val = {139903730540556, 19, 30064771092, 812522497172832284, 139903728706672, 1887866374039011357,
139900298780168, 3775732748407067896, 763430436865, 35180077121538, 4119199860131166208, 139904561311552, 139904553065676, 1, 139904561311584, 139904561312192}}, sa_flags = 4096,
sa_restorer = 0x14}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007f3e92e4928d in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f3e92f70528 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:181
ap = {{gp_offset = 40, fp_offset = 32574, overflow_arg_area = 0x7f3e11adf1d0, reg_save_area = 0x7f3e11adf160}}
fd = <optimized out>
list = <optimized out>
nlist = <optimized out>
cp = <optimized out>
written = <optimized out>
#3 0x00007f3e92e5064a in malloc_printerr (action=<optimized out>, str=0x7f3e92f6cdee "corrupted size vs. prev_size", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5426
buf = "00007f3de009e9f0"
cp = <optimized out>
ar_ptr = <optimized out>
ptr = <optimized out>
str = 0x7f3e92f6cdee "corrupted size vs. prev_size"
action = <optimized out>
#4 0x00007f3e92e5304a in _int_free (av=0x7f3de0000020, p=<optimized out>, have_lock=0) at malloc.c:4337
size = 2720
fb = <optimized out>
nextchunk = 0x7f3de009e9f0
nextsize = 736
nextinuse = <optimized out>
prevsize = <optimized out>
bck = <optimized out>
fwd = <optimized out>
errstr = 0x0
locked = <optimized out>
#5 0x00007f3e92e5744e in __GI___libc_free (mem=<optimized out>) at malloc.c:3145
ar_ptr = <optimized out>
p = <optimized out>
hook = <optimized out>
#6 0x00007f3e113921e9 in FDKfree (ptr=0x7f3de009df60) at libSYS/src/genericStds.cpp:233
No locals.
#7 0x00007f3e1130d7d3 in Free_AacEncoder (p=0x7f3de0115740) at libAACenc/src/aacenc_lib.cpp:407
No locals.
#8 0x00007f3e1130fbb3 in aacEncClose (phAacEncoder=0x7f3de0115740) at libAACenc/src/aacenc_lib.cpp:1395
hAacEncoder = 0x7f3de009df60
err = AACENC_OK
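- In frame #6, the pointer in question is 0x7f3de009df60.
- In frame #4, size is 2720, which is indeed the expected size of the structure being released.
- However, nextchunk is at 0x7f3de009e9f0, which is only 2704 bytes past the pointer being released.
- I can confirm this is always the case when the error reproduces.
- Is this a strong indication of the bug I'm facing?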
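One way to sanity-check the numbers in the list above is to redo the pointer arithmetic by hand. The sketch below assumes the usual 64-bit glibc layout, where the pointer handed to free() sits 16 bytes (two size_t fields) past the start of its chunk; HEADER_BYTES, chunk_of and next_chunk_of are names made up for this illustration. Under that assumption, a 2720-byte chunk puts nextchunk exactly 2704 bytes past the user pointer, so the 2704 vs. 2720 gap on its own may not indicate corruption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-bit glibc, where the chunk header in front of the user
 * pointer occupies 2 * sizeof(size_t) = 16 bytes. */
#define HEADER_BYTES 16u

/* Address of the chunk that owns the user pointer `mem`. */
static uint64_t chunk_of(uint64_t mem) {
    return mem - HEADER_BYTES;
}

/* Address of the chunk that follows, given this chunk's size
 * (header included, flag bits already masked off). */
static uint64_t next_chunk_of(uint64_t mem, uint64_t chunk_size) {
    assert(chunk_size % 16 == 0); /* chunk sizes are 16-byte multiples on 64-bit */
    return chunk_of(mem) + chunk_size;
}
```

Calling next_chunk_of(0x7f3de009df60, 2720) with the values from frames #6 and #4 gives 0x7f3de009e9f0, the nextchunk address from the trace.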
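OK, I've managed to solve this.
First, the actual cause of "corrupted size vs. prev_size" is quite simple: an out-of-bounds access in the code is overwriting the chunk control fields of the adjacent, following chunk. If you allocate x bytes for a pointer p but write more than x bytes through that pointer, you may get this error. It indicates that the size of a memory chunk no longer matches the value recorded in the next chunk's control structure (because that value was overwritten).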
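To make that mechanism concrete, here is a toy model of two adjacent chunks inside a plain byte array. It is only a sketch: toy_header, the chunk sizes and the function names are invented for illustration, and this is not glibc's real struct malloc_chunk or its actual checking code. It only mimics the idea that a chunk's size must agree with the prev_size stored in the header that follows it, and shows how a write past the payload breaks that agreement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a chunk header; NOT glibc's real struct malloc_chunk. */
typedef struct {
    size_t prev_size; /* size of the chunk located just before this one */
    size_t size;      /* size of this chunk, header included */
} toy_header;

#define ARENA_BYTES  128u
#define FIRST_CHUNK   48u /* hypothetical chunk sizes */
#define SECOND_CHUNK  80u

static toy_header read_header(const unsigned char *arena, size_t off) {
    toy_header h;
    memcpy(&h, arena + off, sizeof h);
    return h;
}

static void write_header(unsigned char *arena, size_t off, toy_header h) {
    memcpy(arena + off, &h, sizeof h);
}

/* Bytes the first chunk can hold before the next header begins. */
static size_t payload_capacity(void) {
    return FIRST_CHUNK - sizeof(toy_header);
}

/* The consistency idea behind the error message: a chunk's own size
 * must match the prev_size stored in the header right after it. */
static int sizes_consistent(const unsigned char *arena, size_t off) {
    toy_header cur  = read_header(arena, off);
    toy_header next = read_header(arena, off + cur.size);
    return cur.size == next.prev_size;
}

/* Lay out two adjacent chunks, write n bytes into the first payload,
 * and report whether the size/prev_size pair still agrees. */
static int check_after_write(size_t n) {
    unsigned char arena[ARENA_BYTES] = {0};
    toy_header first  = {0, FIRST_CHUNK};
    toy_header second = {FIRST_CHUNK, SECOND_CHUNK};
    write_header(arena, 0, first);
    write_header(arena, FIRST_CHUNK, second);
    assert(n <= ARENA_BYTES - sizeof(toy_header)); /* stay inside the toy arena */
    memset(arena + sizeof(toy_header), 0x41, n);   /* the "user" write */
    return sizes_consistent(arena, 0);
}
```

In this model, check_after_write(payload_capacity()) still passes, while writing even one byte more clobbers the neighbour's prev_size and the check fails. The real allocator only runs its check at certain points (for example when freeing and merging neighbouring chunks), which may be part of why the crash shows up long after the bad write, and only occasionally.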
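As for the cause of the out-of-bounds write: the structure mapping done in the Java/JNA layer implied #pragma-related padding/alignment different from what the dll/so was compiled with. This, in turn, caused data to be written beyond the allocated structure's boundary. Disabling that alignment made the problem go away (thousands of executions without a single crash!).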
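The effect of a packing mismatch can be shown in plain C. The sketch below declares the same fields twice, once with the compiler's natural alignment and once under #pragma pack(1). The struct and its field names are hypothetical and not taken from FDK-AAC or from my JNA mapping; the point is only that the two layouts disagree on size and field offsets, so code that fills in the larger layout over memory sized for the smaller one writes past the end of the allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct with the compiler's natural alignment
 * (padding inserted after `flag` and `channels` on typical ABIs). */
struct params_natural {
    uint8_t  flag;
    uint32_t bitrate;
    uint8_t  channels;
    uint64_t handle;
};

/* Same fields with padding disabled. */
#pragma pack(push, 1)
struct params_packed {
    uint8_t  flag;
    uint32_t bitrate;
    uint8_t  channels;
    uint64_t handle;
};
#pragma pack(pop)

/* Bytes that would land outside the buffer if one side allocates the
 * packed layout while the other side writes the natural layout. */
static size_t overrun_bytes(void) {
    return sizeof(struct params_natural) - sizeof(struct params_packed);
}
```

The packed layout is 14 bytes (1 + 4 + 1 + 8), while the natural layout is larger on common platforms, so overrun_bytes() is non-zero. The field offsets differ as well (bitrate sits at offset 1 when packed), so even reads and writes that stay in bounds would touch the wrong bytes.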