Background

A few nights ago, during a late-night game version update, the server cluster failed to start.

Investigation

The failed startup left a coredump file behind.

Use gdb to inspect the cause of the coredump:

gdb ../pf/main coredump_cds_2022-09-15_00\:25\:20.txt

Dump the stacks of all threads and find the thread that triggered the coredump:

thread apply all bt

Thread 6 (LWP 32566):
#0  WriteCoreDumpLimited (
    file_name=0x7ff8101470f8 "coredump_cds_2022-09-15_00\:25\:20.txt", max_length=1073741824)
    at src/coredumper.c:183
#1  0x00007ff87208bafe in sig_handler (sig=6, 
    si=0x7ff859d63bf0, unused=0x7ff859d63ac0)
    at console_linux.cpp:238
#2  <signal handler called>
#3  0x00007ff87164e2c7 in raise ()
from /lib64/libc.so.6
#4  0x00007ff87164f9b8 in abort ()
from /lib64/libc.so.6
#5  0x00007ff8716470e6 in __assert_fail_base ()
from /lib64/libc.so.6
#6  0x00007ff871647192 in __assert_fail ()
from /lib64/libc.so.6
#7  0x00007ff866f757e6 in ConnMgr::threadAllocateClientConn (this=0x7ff8671aa6e0 <__g_ConnMgr_singleton>, 
    szIP=..., uPort=7085, nCookies=25130)
    at ConnMgr.cpp:745

Inspecting the legacy code
```
map<int, ClientConnPtr>::iterator mapIter = m_FdClientConnMap.find(fd);
if( mapIter != m_FdClientConnMap.end())
{
    assert(0);
}
```
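- As the code shows, when a connection is created and the fd it was given already exists in our m_FdClientConnMap, the newly created socket connection is treated as invalid and the assertion fires.

Checking whether the legacy code is at fault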
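1. Reading the code showed that when sending a message from the Lua script layer fails, the corresponding socket connection is closed immediately.
2. The matching entry in m_FdClientConnMap, however, is not cleaned up until the next frame.
3. Linux reuses fd numbers, so if a new socket connection is created inside that window, its fd is still present in m_FdClientConnMap. The assertion then fires and the game server cluster fails to start.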
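The three steps above can be reproduced in isolation. The sketch below is not the real ConnMgr code: `newFdCollides`, the `eraseOnClose` flag and the `int` cookie standing in for `ClientConnPtr` are all hypothetical names made up for illustration. It relies on the documented behavior that `socket()` returns the lowest-numbered fd not currently open in the process, and it assumes a single-threaded process so nothing else grabs the fd in between.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <map>

// Hypothetical reproduction of the stale-fd window.
// Returns true when a freshly created socket gets an fd that is still
// present in the map, i.e. the condition that hit assert(0) in ConnMgr.cpp.
bool newFdCollides(bool eraseOnClose)
{
    std::map<int, int> fdClientConnMap;  // stand-in for m_FdClientConnMap

    // 1. An existing connection is registered under its fd.
    int oldFd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (oldFd < 0) return false;         // could not create a socket at all
    fdClientConnMap[oldFd] = 25130;      // the value is an arbitrary cookie

    // 2. Sending fails, so the socket is closed right away.
    ::close(oldFd);
    if (eraseOnClose) {
        // Possible fix (an assumption, not the author's confirmed change):
        // drop the map entry at the same moment the fd is closed.
        fdClientConnMap.erase(oldFd);
    }
    // Without eraseOnClose the entry would only be removed on the next frame.

    // 3. A new connection is created before that cleanup has run.
    //    The kernel hands out the lowest free fd, which is oldFd again.
    int newFd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (newFd < 0) return false;
    bool collides = fdClientConnMap.find(newFd) != fdClientConnMap.end();

    ::close(newFd);
    return collides;
}
```

Calling `newFdCollides(false)` models the deferred cleanup and reports a collision, while `newFdCollides(true)` models erasing the entry together with `close()` and reports none. In the real server the allocation runs on its own thread (the failing frame is `threadAllocateClientConn`), so an actual fix would presumably also need to guard the map against concurrent access; that part is not covered here.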
