Glusterfs: Getting a core dump on a customer setup without killing the process
Hello, Gluster Ants,
This short writeup explains how to capture a core dump on a system without killing a process.
Why do we need this?
Finding the root cause of an issue that occurred in a customer/production setup is a challenging task. Most of the time we cannot replicate, on our test setup, the environment and the scenario that lead to the issue. In such cases, we have to gather as much information as possible from the system where the problem occurred.
What information do we look for, and what is useful?
A core dump is very helpful for finding the root cause of an issue. A common approach is to add ASSERT() in the code at the places where we suspect something is wrong and to install the custom build on the affected setup. The problem is that ASSERT() kills the process in order to produce the core dump.
Is it a good idea to do ASSERT() on customer setup?
Remember that we are asking the customer for help. They are unlikely to agree to kill the process just so that we get a core dump to root-cause the issue. It affects the customer’s business, and nobody will agree to such a proposal.
What if we have a way to produce a core dump without a kill?
Yes, Glusterfs provides a way to do this. Gluster has a customized ASSERT(), i.e. GF_ASSERT(), which helps produce a core dump without killing the associated process. It also provides a script that can be run on the customer setup to produce the core dump without harming the running process. (This presumes GF_ASSERT() is already at the expected place in the build running on the customer setup. If not, we need to add GF_ASSERT() and install a custom build on that setup.)
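To make the idea of "a core dump from a live process" concrete, here is a minimal sketch. It is not how gfcore.py works internally (this post does not show that); it only uses gdb's standard gcore command to snapshot a running process. The helper name build_gcore_command and the core file path are my own, made up for illustration. Note that, unlike the GF_ASSERT() approach, this takes the snapshot at the moment you run it and not when the suspicious code path is hit, and the process is paused briefly while the core is written.

```python
import shlex


def build_gcore_command(pid, core_path):
    """Return the argv for a one-shot gdb session that writes a core file.

    Hypothetical helper for illustration only; it is not part of Gluster.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # exit after the commands below finish
        "-ex", f"gcore {core_path}",   # write a core file of the live process
        "-ex", "detach",               # detach so the process keeps running
    ]


if __name__ == "__main__":
    # 123673 is the example pid used in this post; the path is an assumption.
    print(shlex.join(build_gcore_command(123673, "/tmp/core.123673")))
```

Running the printed command needs gdb installed and enough privileges to attach to the target process.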
Is GF_ASSERT() newly introduced in Gluster code?
No. GF_ASSERT() was already in the codebase before this improvement. In a debug build, GF_ASSERT() kills the process and produces a core dump, but in a production build it used to just log the error and move on. What we changed is the implementation: now a production build can also give us a core dump, and the process is not killed. If a code path you care about is not covered by GF_ASSERT(), please add it as required.
Here are the steps to achieve the goal:
1. Add GF_ASSERT() in the Gluster code where you suspect something is going wrong. Ex: I am forcing an assertion by adding GF_ASSERT(0) in the fuse_write() path
3. Now, in the other terminal, run the gfcore.py script
# ./extras/debug/gfcore.py 123673 1 /tmp/
(123673 is the pid of the glusterfs process, obtained by running “ps -ef | grep gluster” in the previous step)
4. Hit the code path where you have introduced GF_ASSERT(). In my case, I have introduced it in the fuse_write() path, so I run “dd” on a file under the Gluster mount
5. Go to the terminal where gdb is running (step 3) and observe that the gdb process has terminated
6. Go to the directory where the core dump was produced (in my case, “/tmp”) and find the binary associated with the core dump using the file command
7. Access the core dump using gdb as shown below. (1st arg would be the core file name and 2nd arg is the output of the file command in the previous step)
8. Observe that the Gluster process is unaffected by checking its process state. (pid = 123673 is still running in the output below)
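If you want to script the checks from steps 6 and 8, something like the sketch below could work. The function names is_running and describe_core are my own and not part of Gluster. It assumes a Linux/Unix host with the file command installed.

```python
import os
import subprocess
import sys


def is_running(pid):
    """Return True if a process with this pid currently exists.

    Signal 0 performs only the existence/permission check; nothing is
    delivered to the target process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True


def describe_core(core_path):
    """Run file(1) on the core file and return what it reports (step 6)."""
    result = subprocess.run(
        ["file", core_path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Usage: python3 check.py <pid> [<core-file>]
    pid = int(sys.argv[1])
    print(f"pid {pid} running: {is_running(pid)}")
    if len(sys.argv) > 2:
        print(describe_core(sys.argv[2]))
```

Pass the pid you gave to gfcore.py (123673 in my case) and, optionally, the path of the core file found under /tmp.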
Yay! Isn’t it useful!! This is powered by Gluster. Thanks to Xavi Hernandez (email@example.com) for the idea. This will make life easier for many Gluster developers and maintainers.
Please suggest corrections, if any, and help improve this.
Keep contributing !!