[Gluster-devel] 3.0.1

Gordan Bobic gordan at bobich.net
Wed Jan 27 15:03:24 UTC 2010


I'm not sure if this is one of the problems you are aware of or have found a 
cause for, but I just observed the permissions on home directories changing to 
777 again. I'm not sure yet what triggers it, but the fact that it happened 
again within 24 hours suggests that it arises regularly.

Gordan

Anand Avati wrote:
> Please hold back using 3.0.1. We found some issues and are making
> 3.0.2 very quickly. Apologies for all the inconvenience.
> 
> Avati
> 
> On Tue, Jan 26, 2010 at 6:30 PM, Gordan Bobic <gordan at bobich.net> wrote:
>> I upgraded to 3.0.1 last night and it still doesn't seem as stable as 2.0.9.
>> Things I have bumped into since the upgrade:
>>
>> 1) I've had unfsd lock up hard when exporting the volume; it couldn't be
>> "kill -9"-ed. This happened just after a spurious disconnect (see 2).
>>
>> 2) I'm seeing random disconnects/timeouts between the servers, which are on
>> the same switch (this was happening with 2.0.x as well, though, so I'm not
>> sure what's going on). This is where the file clobbering/corruption used to
>> occur that causes the contents of one open file to be replaced with the
>> contents of a different file. I HAVEN'T observed clobbering with 3.0.1 (yet,
>> at least - it wasn't a particularly frequent occurrence, but the chances of
>> it were high on shared libraries during a big yum update when glfs is
>> rootfs), but the disconnects still happen occasionally, usually under
>> heavy-ish load.
>>
>> My main concern here is that open file self-healing may cover up the
>> underlying bug that causes the clobbering, and possibly make it occur in
>> even more heisenbuggy ways.
>>
>> ssh sessions to both servers don't show any problems/disconnections/dropouts
>> at the same time as the glfs disconnects happen. Is there a setting that
>> controls how many heartbeat packets have to be lost before the disconnect is
>> initiated?
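>>
>> (If there is such a knob, I'd guess it's the ping-timeout option on the
>> protocol/client volumes - the "42 seconds" in the log below looks like its
>> default. A sketch of what I mean, assuming I've got the option name right,
>> with the host and volume names taken from my own setup:
>>
>> volume home3
>>   type protocol/client
>>   option transport-type tcp
>>   option remote-host 10.2.0.13
>>   option remote-subvolume home3
>>   # seconds without a ping reply before the client declares a disconnect
>>   option ping-timeout 42
>> end-volume
>>
>> though what I really want is fewer spurious disconnects, not a longer
>> timeout.)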
>>
>> This is the sort of thing I see in the logs:
>> [2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server:
>> 10.2.0.13:1010 disconnected
>> [2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server:
>> 10.2.0.13:1013 disconnected
>> [2010-01-26 07:36:56] N [server-helpers.c:849:server_connection_destroy]
>> server: destroyed connection of
>> thor.winterhearth.co.uk-11823-2010/01/26-05:29:32:239464-home2
>> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3:
>> forced unwinding frame type(1) op(SETATTR)
>> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3:
>> forced unwinding frame type(1) op(SETXATTR)
>> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3:
>> forced unwinding frame type(2) op(PING)
>> [2010-01-26 07:37:25] N [client-protocol.c:6973:notify] home3: disconnected
>> [2010-01-26 07:38:19] E [client-protocol.c:415:client_ping_timer_expired]
>> home3: Server 10.2.0.13:6997 has not responded in the last 42 seconds,
>> disconnecting.
>> [2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3:
>> forced unwinding frame type(2) op(SETVOLUME)
>> [2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3:
>> forced unwinding frame type(2) op(SETVOLUME)
>> [2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server:
>> accepted client from 10.2.0.13:1018
>> [2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server:
>> accepted client from 10.2.0.13:1017
>> [2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3:
>> Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
>> [2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3:
>> Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
>>
>>
>> 3) Something that started off as not being able to ssh in using public keys
>> turned out to be due to my home directory somehow acquiring 777 permissions.
>> I certainly didn't do it, so at a guess it's a file corruption issue,
>> possibly during an unclean shutdown. Further, I've found that the / directory
>> (I'm running glusterfs root on this cluster) had 777 permissions too, which
>> seems to have happened at the same time as the home directory. If sendmail
>> and ssh hadn't been failing to work properly because of this, it's possible
>> I wouldn't have noticed. It's potentially quite a concerning problem, even if
>> it is caused by an unclean shutdown (put it this way - I've never seen it
>> happen on any other file system).
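>>
>> (A quick way to spot this kind of damage - nothing glusterfs-specific, just
>> a sketch run against the root filesystem - is to look for directories that
>> have gone world-writable without having the sticky bit:
>>
>> find / -xdev -type d -perm -0002 ! -perm -1000
>>
>> which on a healthy system should come back empty or very nearly so.)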
>>
>> 4) This looks potentially a bit concerning:
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  5633 root      15   0 25.8g 119m 1532 S 36.7  3.0  36:25.42
>> /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null
>> --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol
>> /mnt/newroot
>>
>> This is the rootfs daemon. 25.8GB of virtual address space mapped? Surely
>> that can't be right, even if the resident size looks reasonably sane.
>>
>> Worse - it's growing by about 100MB/minute during heavy compiling on the
>> system. I've just tried to run the nvidia driver installer to see if that
>> old bug report I filed is still valid, and it doesn't seem to get anywhere
>> (it just makes glusterfsd and gcc use CPU time but never finishes - which is
>> certainly a different failure mode from 2.0.9, which at least finishes the
>> compile stage).
>>
>> The virtual memory bloat is rather reminiscent of the memory
>> fragmentation/leak problem that was fixed on the 2.0.x branch a while back,
>> which arose when shared libraries were on glusterfs: a bit of memory leaked
>> every time a shared library call was made. A regression, perhaps? Wasn't
>> there a memory consumption sanity check added to the test suite after this
>> was fixed last time?
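>>
>> (A rough sketch of how to put numbers on the growth instead of eyeballing
>> top - sample the rootfs client's address space once a minute, 5633 being
>> the PID from the listing above:
>>
>> # log VmSize/VmRSS of the rootfs glusterfs client every 60 seconds
>> while sleep 60; do
>>     printf '%s  ' "$(date '+%F %T')"
>>     grep -E 'VmSize|VmRSS' /proc/5633/status | tr -s '\n\t ' ' '
>>     echo
>> done
>>
>> If VmSize keeps climbing while VmRSS stays roughly flat, that would point
>> at the same fragmentation pattern as last time rather than a straight leak.)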
>>
>> Other glfs daemons are exhibiting similar behaviour:
>>
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  5633 root      15   0 26.1g 119m 1532 S  0.7  3.0  37:57.01
>> /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null
>> --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol
>> /mnt/newroot
>> 12037 root      15   0 24.8g  68m 1072 S  0.0  1.7   3:21.41
>> /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/shared.vol
>> /shared
>>
>> 11977 root      15   0 24.8g  67m 1092 S  0.7  1.7   3:59.11
>> /usr/sbin/glusterfs --log-level=NORMAL --disable-direct-io-mode
>> --volfile=/etc/glusterfs/home.vol /home
>>
>> 11915 root      15   0 24.9g  32m  972 S  0.0  0.8   0:21.65
>> /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/boot.vol
>> /boot
>>
>> The home, shared and boot volumes don't have any shared libraries on them,
>> and 24.9GB of virtual memory mapped for the /boot volume, which is backed by
>> a 250MB file system, also seems a bit excessive.
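>>
>> (It would also be interesting to see whether all that address space is a
>> huge number of small anonymous mappings, which is what the old
>> fragmentation bug looked like - something along these lines, using the
>> /boot client's PID from the listing above:
>>
>> # number of distinct mappings, and the total they add up to
>> wc -l /proc/11915/maps
>> pmap -x 11915 | tail -n 1
>>
>> A mapping count in the tens of thousands would point at fragmentation
>> rather than one oversized allocation.)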
>>
>> Gordan
>>
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
>> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>>





