[Gluster-users] Replicated Volume Crashed

Philip flips01 at googlemail.com
Tue Feb 26 11:15:51 UTC 2013


Hi,

I have a Gluster volume that consists of 22 bricks and includes a single
folder with 3.6 million files. Yesterday the volume crashed and became
completely unresponsive. I was forced to hard-reboot all Gluster servers
because they were so heavily overloaded that they could not execute a
reboot command issued from the shell. Each Gluster server has 12 CPU cores,
and every core was maxed out.

The server log looks like this: http://pastebin.com/rv0WQ14E
The client log looks like this: http://filebin.ca/YLiU6yxURKd/client-log.txt

The most interesting parts from the server log are:

[2013-02-25 19:48:01.672143] W [socket.c:195:__socket_rwv]
0-tcp.adata-server: writev failed (Connection reset by peer)
[2013-02-25 19:48:01.742239] I [server-helpers.c:474:do_fd_cleanup]
0-adata-server: fd cleanup on /files/random-file.dat
[2013-02-25 19:48:01.749871] E [server.c:176:server_submit_reply]
(-->/usr/lib/glusterfs/3.3.1/xlator/features/marker.so(marker_lookup_cbk+0x103)
[0x7ffdfce19e53]
(-->/usr/lib/glusterfs/3.3.1/xlator/debug/io-stats.so(io_stats_lookup_cbk+0xff)
[0x7ffdfcc00f9f]
(-->/usr/lib/glusterfs/3.3.1/xlator/protocol/server.so(server_lookup_cbk+0x39d)
[0x7ffdfc9e480d]))) 0-: Reply submission failed
[2013-02-25 19:48:01.805132] I [server-helpers.c:330:do_lock_table_cleanup]
0-adata-server: finodelk released on /files/random-file.dat
[2013-02-25 19:59:46.438680] W [socket.c:195:__socket_rwv]
0-tcp.adata-server: readv failed (Connection timed out)
[2013-02-25 20:03:01.825280] I [server-helpers.c:394:do_lock_table_cleanup]
0-adata-server: entrylk released on /files
[2013-02-25 20:03:01.825309] I [server-helpers.c:474:do_fd_cleanup]
0-adata-server: fd cleanup on /files

The most interesting parts from the client log are:

[2013-02-25 19:44:43.568372] W [client3_1-fops.c:5306:client3_1_finodelk]
0-adata-client-7:  (b468c831-5c04-4bbd-8b03-b94d7f937e55) remote_fd is -1.
EBADFD
[2013-02-25 19:44:43.568405] W [client3_1-fops.c:4098:client3_1_flush]
0-adata-client-7:  (b468c831-5c04-4bbd-8b03-b94d7f937e55) remote_fd is -1.
EBADFD
[2013-02-25 19:44:55.423638] I
[afr-self-heal-common.c:1189:afr_sh_missing_entry_call_impunge_recreate]
0-adata-replicate-8: no missing files - /files/random-file.dat. proceeding
to metadata check
[2013-02-25 19:44:59.816525] I
[afr-self-heal-common.c:1941:afr_sh_post_nb_entrylk_conflicting_sh_cbk]
0-adata-replicate-1: Non blocking entrylks failed.
[2013-02-25 19:44:59.816554] E
[afr-self-heal-common.c:2160:afr_self_heal_completion_cbk]
0-adata-replicate-1: background  data missing-entry gfid self-heal failed
on /files/random-file.dat
[2013-02-25 19:47:01.548989] I
[afr-self-heal-common.c:1941:afr_sh_post_nb_entrylk_conflicting_sh_cbk]
0-adata-replicate-8: Non blocking entrylks failed.
[2013-02-25 19:47:01.548996] E
[afr-self-heal-common.c:2160:afr_self_heal_completion_cbk]
0-adata-replicate-8: background  data missing-entry gfid self-heal failed
on /files/random-file.dat
[2013-02-25 19:56:20.623378] I [dht-layout.c:593:dht_layout_normalize]
0-adata-dht: found anomalies in /files. holes=1 overlaps=0
[2013-02-25 19:56:20.623399] W [dht-layout.c:186:dht_layout_search]
0-adata-dht: no subvolume for hash (value) = 2967408968
[2013-02-25 19:56:20.623404] W [dht-selfheal.c:882:dht_selfheal_directory]
0-adata-dht: 1 subvolumes have unrecoverable errors
[2013-02-25 19:56:20.623413] E [dht-common.c:1372:dht_lookup] 0-adata-dht:
Failed to get hashed subvol for /files/random-file.dat
[2013-02-25 19:57:02.780965] W [fuse-bridge.c:292:fuse_entry_cbk]
0-glusterfs-fuse: 3960898: LOOKUP() /files/random-file.dat => -1 (Invalid
argument)
[2013-02-25 19:57:02.781385] W [dht-layout.c:186:dht_layout_search]
0-adata-dht: no subvolume for hash (value) = 2967408968
[2013-02-25 20:00:32.715869] I [dht-layout.c:593:dht_layout_normalize]
0-adata-dht: found anomalies in /files. holes=2 overlaps=0
[2013-02-25 20:00:32.715886] W [dht-selfheal.c:882:dht_selfheal_directory]
0-adata-dht: 2 subvolumes have unrecoverable errors
[2013-02-25 20:00:47.566817] I [dht-common.c:543:dht_revalidate_cbk]
0-adata-dht: subvolume adata-replicate-9 for /files returned -1
(Input/output error)
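
To see which messages dominate a log like the one above, the entries could be tallied by severity and reporting function. This is only a rough sketch: it assumes every entry starts with a bracketed timestamp followed by a one-letter level and a [file:line:function] tag, as in the excerpt, and that wrapped lines belong to the preceding entry. The names join_wrapped and tally are made up for illustration.

```python
import re
from collections import Counter

# An entry starts with a bracketed timestamp such as [2013-02-25 19:44:59.816525].
ENTRY_START = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\]")
# Assumed layout: [timestamp] LEVEL [source.c:line:function] message
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]"
)


def join_wrapped(lines):
    """Merge continuation lines into the entry that precedes them."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def tally(lines):
    """Count entries per (level, function) pair; unparsable entries are skipped."""
    counts = Counter()
    for entry in join_wrapped(lines):
        match = ENTRY.match(entry)
        if match:
            counts[(match.group("level"), match.group("func"))] += 1
    return counts
```

Calling tally(open("client-log.txt")) and then most_common() on the result would list the noisiest code paths first; the file name here is just the one from the link above.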

What could be the reason for this? I saw that the folder containing the
3.6 million files was locked, so I am restructuring the directory layout
into a directory tree with approximately 100-500 files in each directory.
Would this prevent a lock on all files?
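
For illustration, here is a minimal sketch of the kind of layout I have in mind. It is not tested against the real data: the two-level scheme, the fan-out of 128 and the function names are assumptions chosen only so that 3.6 million files spread to a few hundred per directory.

```python
import hashlib
from pathlib import PurePosixPath

FANOUT = 128  # assumed number of subdirectories per level (128 * 128 leaf directories)


def sharded_path(filename: str, fanout: int = FANOUT) -> PurePosixPath:
    """Map a flat file name to files/<top>/<sub>/<filename> using a stable hash."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    top = value % fanout
    sub = (value // fanout) % fanout
    return PurePosixPath("files", f"{top:03d}", f"{sub:03d}", filename)


def files_per_directory(total_files: int, fanout: int = FANOUT) -> float:
    """Average number of files per leaf directory, assuming an even hash spread."""
    return total_files / (fanout * fanout)
```

With these assumed values, files_per_directory(3_600_000) comes to roughly 220, which is inside the 100-500 range. Because the path is derived from the file name alone, a client can locate a file without listing any directory.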