[Gluster-devel] a bug when read files in a symbol-link directory
He Xiaobin
allreol at gmail.com
Sun Sep 6 03:38:24 UTC 2009
I use GlusterFS in a cluster system (configured as:
dht->afr->client->server->iothreads->locks->posix). After days of running it
is stable, but performance is poor (slower than NFS exported from only one
server). More importantly, I have run into a bug in the last few days. This is
really an emergency, so I need your help!
What is the bug? On this system I use mvapich+blcr for task checkpoint and
restore. I don't know how mvapich works internally, but I am sure it uses
GlusterFS in my case. When a task is checkpointed on GlusterFS, one ckpt file
is created for each process of the task. All the ckpt files are placed in a
directory called 1, and a symbolic link called 0 is created pointing to
directory 1. Here is an example: fortest is the username, .ckpt is the ckpt
file directory for this user, 1972 is the task id, 0 is the symbolic link, and
bt.C.64-19.ckpt is the ckpt file of the task's 19th process.
[fortest at gfsclient02 1972]$ pwd
/mnt/glusterfs/.ckpt/1972
[fortest at gfsclient02 1972]$ ll
total 132
lrwxrwxrwx 1 fortest fortest 31 Sep 4 17:09 0 -> /mnt/glusterfs/fortest/.ckpt/1972/1
drwx------ 2 fortest fortest 65536 Sep 4 20:06 1
[fortest at gfsclient02 1972]$ ls 1/
bt.C.64-0.ckpt   bt.C.64-21.ckpt  bt.C.64-33.ckpt  bt.C.64-45.ckpt  bt.C.64-57.ckpt
bt.C.64-10.ckpt  bt.C.64-22.ckpt  bt.C.64-34.ckpt  bt.C.64-46.ckpt  bt.C.64-58.ckpt
bt.C.64-11.ckpt  bt.C.64-23.ckpt  bt.C.64-35.ckpt  bt.C.64-47.ckpt  bt.C.64-59.ckpt
bt.C.64-12.ckpt  bt.C.64-24.ckpt  bt.C.64-36.ckpt  bt.C.64-48.ckpt  bt.C.64-5.ckpt
bt.C.64-13.ckpt  bt.C.64-25.ckpt  bt.C.64-37.ckpt  bt.C.64-49.ckpt  bt.C.64-60.ckpt
bt.C.64-14.ckpt  bt.C.64-26.ckpt  bt.C.64-38.ckpt  bt.C.64-4.ckpt   bt.C.64-61.ckpt
bt.C.64-15.ckpt  bt.C.64-27.ckpt  bt.C.64-39.ckpt  bt.C.64-50.ckpt  bt.C.64-62.ckpt
bt.C.64-16.ckpt  bt.C.64-28.ckpt  bt.C.64-3.ckpt   bt.C.64-51.ckpt  bt.C.64-63.ckpt
bt.C.64-17.ckpt  bt.C.64-29.ckpt  bt.C.64-40.ckpt  bt.C.64-52.ckpt  bt.C.64-6.ckpt
bt.C.64-18.ckpt  bt.C.64-2.ckpt   bt.C.64-41.ckpt  bt.C.64-53.ckpt  bt.C.64-7.ckpt
bt.C.64-19.ckpt  bt.C.64-30.ckpt  bt.C.64-42.ckpt  bt.C.64-54.ckpt  bt.C.64-8.ckpt
bt.C.64-1.ckpt   bt.C.64-31.ckpt  bt.C.64-43.ckpt  bt.C.64-55.ckpt  bt.C.64-9.ckpt
bt.C.64-20.ckpt  bt.C.64-32.ckpt  bt.C.64-44.ckpt  bt.C.64-56.ckpt
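For anyone who wants to recreate this layout without running mvapich, here is a minimal Python sketch. It is not what mvapich or blcr actually do; the function name make_ckpt_layout, its parameters, and the placeholder file contents are made up for illustration. It only builds a directory 1 holding one file per process and a symbolic link 0 with an absolute target, like the listing above.

```python
import os
import tempfile


def make_ckpt_layout(root, task_id="1972", nprocs=64, prefix="bt.C.64"):
    """Create <root>/<task_id>/1 with one file per process, plus a symlink 0 -> 1.

    Hypothetical helper: it mimics the directory layout only, not real
    checkpoint contents.
    """
    task_dir = os.path.join(root, task_id)
    real_dir = os.path.join(task_dir, "1")
    os.makedirs(real_dir, mode=0o700)
    for rank in range(nprocs):
        path = os.path.join(real_dir, f"{prefix}-{rank}.ckpt")
        with open(path, "wb") as f:
            f.write(b"placeholder checkpoint data\n")
    link = os.path.join(task_dir, "0")
    # The link target is an absolute path, as in the listing above.
    os.symlink(os.path.abspath(real_dir), link)
    return real_dir, link


if __name__ == "__main__":
    # Point `root` at a directory on the GlusterFS mount to test there;
    # a temporary local directory is used here so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        real_dir, link = make_ckpt_layout(root)
        print(os.readlink(link))
        print(len(os.listdir(link)))
```

On a healthy filesystem, listing the directory through 0 should show the same files as listing 1 directly.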
When the task needs to be restored, mvapich reads the ckpt files through 0
(the symbolic link) and restores the task. All of this works smoothly on NFS,
but on GlusterFS it outputs the following messages. Sometimes the restore
still finishes in the end, while other times it fails with almost the same
messages. I have verified that the files mvapich reports as missing are indeed
there. Another useful hint: the fewer GlusterFS clients take part in the task,
the less often this bug shows up during restore. Starting glusterfs without
direct-io does not help either.
OUTPUT OF THE TASK WHEN RESTORING:
19: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-19.ckpt: No such file or directory
20: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-20.ckpt: No such file or directory
srun: error: gfsclient10: task[19-20]: Exited with exit code 1
21: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-21.ckpt: No such file or directory
18: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-18.ckpt: No such file or directory
srun: error: gfsclient10: task21: Exited with exit code 1
srun: error: cn010: task18: Exited with exit code 1
17: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-17.ckpt: No such file or directory
srun: error: gfsclient10: task17: Exited with exit code 1
23: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-23.ckpt: No such file or directory
22: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-22.ckpt: No such file or directory
srun: error: gfsclient10: task23: Exited with exit code 1
srun: error: cn010: task[16,22]: Exited with exit code 1
16: Restart: path /mnt/glusterfs/fortest/.ckpt/1972/0/bt.C.64-16.ckpt: No such file or directory
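A rough way to look for the same symptom outside of mvapich is to stat every ckpt file through the symbolic link from several threads at once and report the names that fail. The Python sketch below is only an approximation under assumptions of mine: find_missing_via_link and its workers and rounds parameters are hypothetical, and threads in a single process on one client are not the same as many MPI processes on many clients, so a clean result here does not prove the bug is absent.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def find_missing_via_link(real_dir, link_dir, workers=8, rounds=10):
    """Return sorted names present in real_dir whose stat() through link_dir fails.

    Hypothetical helper: names are listed from the real directory, then each
    one is looked up through the symlinked path from a pool of threads.
    """
    names = sorted(os.listdir(real_dir))

    def check(name):
        try:
            os.stat(os.path.join(link_dir, name))
            return None
        except FileNotFoundError:
            # Same errno (ENOENT) as the "No such file or directory" messages.
            return name

    missing = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            missing.update(n for n in pool.map(check, names) if n is not None)
    return sorted(missing)


if __name__ == "__main__":
    # Paths taken from the example above; adjust them for another task.
    base = "/mnt/glusterfs/fortest/.ckpt/1972"
    print(find_missing_via_link(os.path.join(base, "1"), os.path.join(base, "0")))
```

An empty list means every file found in directory 1 was also reachable through 0 in every round.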
I added "debug/trace" and started glusterfs with "-L DEBUG", and got the
following logs when the ckpt files could not be found:
[2009-09-04 17:12:35] N [trace.c:1290:trace_readlink] tr0: 174536: (loc {path=/fortest/.ckpt/1972/0, ino=1380450540}, size=4096)
[2009-09-04 17:12:35] N [trace.c:484:trace_readlink_cbk] tr0: 174536: (op_ret=31, op_errno=0, buf=/mnt/glusterfs/fortest/.ckpt/1972/1)
[2009-09-04 17:12:35] E [fuse-bridge.c:987:fuse_readlink_cbk] glusterfs-fuse: 174536: /fortest/.ckpt/1972/0 => /mnt/glusterfs/fortest/.ckpt/1972/1 @ 1252055555
[2009-09-04 17:12:35] N [trace.c:1245:trace_lookup] tr0: 174537: (loc {path=/fortest/.ckpt/1972/1, ino=0})
[2009-09-04 17:12:35] N [trace.c:513:trace_lookup_cbk] tr0: 174508: (op_ret=0, ino=0, *buf {st_dev=2065, st_ino=7068450884, st_mode=40700, st_nlink=2, st_uid=1001, st_gid=1001, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=256})
[2009-09-04 17:12:35] E [fuse-bridge.c:255:fuse_loc_fill] glusterfs-fuse: inode_path failed for 8003256399/bt.C.64-22.ckpt @ 1252055555
[2009-09-04 17:12:35] W [fuse-bridge.c:436:fuse_lookup] glusterfs-fuse: 174539: LOOKUP 8003256399/bt.C.64-22.ckpt (fuse_loc_fill() failed)
[2009-09-04 17:12:35] N [trace.c:513:trace_lookup_cbk] tr0: 174522: (op_ret=0, ino=0, *buf {st_dev=2065, st_ino=7068450884, st_mode=40700, st_nlink=2, st_uid=1001, st_gid=1001, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=256})
[2009-09-04 17:12:35] E [fuse-bridge.c:255:fuse_loc_fill] glusterfs-fuse: inode_path failed for 8003256399/bt.C.64-16.ckpt @ 1252055555