[Gluster-devel] GlusterFS 3.3.0 and NFS client failures

Anand Avati anand.avati at gmail.com
Sat Mar 23 23:40:38 UTC 2013


Do you have any more details, like -

1) when the system was "hung", was the client still flushing data to the
NFS server? Any network activity (see the quick checks sketched below)? The
backtrace below only shows that the system had been waiting a long time for
a write to complete.

2) anything in the gluster nfs logs?

3) is it possible DHCP assigned a different IP while renewing lease?
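
For (1), something like this on the client should show whether traffic is
still moving (the interface name and server hostname below are just my
guesses at your setup):

  # per-op NFS RPC counters - run twice a minute apart and compare
  nfsstat -c
  # live NFS traffic to the server, if any
  tcpdump -ni eth0 host saturn and port 2049

For (2), a default install writes the gluster NFS log to
/var/log/glusterfs/nfs.log - though given your note below about logging
being compiled out, that file may not exist on your build.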

Avati

On Fri, Mar 22, 2013 at 3:04 AM, Ian Latter <ian.latter at midnightcode.org> wrote:

> Hello,
>
>
>   This is a problem that I've been chipping away at on and off for a
> while, and it's finally cost me one recording too many - I just want to
> get it cured - any help would be greatly appreciated.
>
>   I'm using the kernel NFS client on a number of Linux machines (four, I
> believe), to map back to two Gluster 3.3.0 shares.
>
>   I have seen Linux Mint and Ubuntu machines of various generations and
> configurations (one is 64-bit) hang intermittently on either one of the
> two Gluster shares on "access" (I can't say if it's writing or not - the
> log below is for a write).  But by far the most common failure example is
> my MythTV Backend server.  It has 5 tuners, each pulling down up to a
> gigabyte per hour directly to an NFS share from Gluster 3.3.0 with two
> local 3TB drives in a "distribute" volume.  It also re-parses each
> recording for ad filtering, so the share gets a good thrashing.  The Myth
> Backend box would fail (hang the system) once every 2-4 days.
>
> The backend server was also configuring its NIC via DHCP.  I had been
> using an MTU of 1460, so each DHCP event would produce this note in syslog;
>  [  12.248640] r8169: WARNING! Changing of MTU on this NIC may lead to
> frame reception errors!
>
> I changed the DHCP MTU to 1500 and didn't see an improvement.  So the
> last change I made was a hard-coded address and the default MTU (1500).
> The most recent trial saw a 13-day run time, which is well outside the
> norm, but it still borked (one test only - may have been lucky).
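>
> For reference, the hard-coded setup is roughly the following in
> /etc/network/interfaces (the addresses below are placeholders rather than
> my real ones);
>
>   auto eth0
>   iface eth0 inet static
>       address 192.168.1.50
>       netmask 255.255.255.0
>       gateway 192.168.1.1
>       mtu 1500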
>
> >> syslog burp;
> [1204800.908075] INFO: task mythbackend:21353 blocked for more than 120
> seconds.
> [1204800.908084] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [1204800.908091] mythbackend   D f6af9d28     0 21353      1 0x00000000
> [1204800.908107]  f6af9d38 00000086 00000002 f6af9d28 f6a4e580 c05d89e0
> c08c3700 c08c3700
> [1204800.908123]  bd3a4320 0004479c c08c3700 c08c3700 bd3a0e4e 0004479c
> 00000000 c08c3700
> [1204800.908138]  c08c3700 f6a4e580 00000001 f6a4e580 c2488700 f6af9d80
> f6af9d48 c05c6e51
> [1204800.908152] Call Trace:
> [1204800.908170]  [<c05c6e51>] io_schedule+0x61/0xa0
> [1204800.908180]  [<c01d9c4d>] sync_page+0x3d/0x50
> [1204800.908190]  [<c05c761d>] __wait_on_bit+0x4d/0x70
> [1204800.908197]  [<c01d9c10>] ? sync_page+0x0/0x50
> [1204800.908211]  [<c01d9e71>] wait_on_page_bit+0x91/0xa0
> [1204800.908221]  [<c0165e60>] ? wake_bit_function+0x0/0x50
> [1204800.908229]  [<c01da1f4>] filemap_fdatawait_range+0xd4/0x150
> [1204800.908239]  [<c01da3c7>] filemap_write_and_wait_range+0x77/0x80
> [1204800.908248]  [<c023aad4>] vfs_fsync_range+0x54/0x80
> [1204800.908257]  [<c023ab5e>] generic_write_sync+0x5e/0x80
> [1204800.908265]  [<c01dbda1>] generic_file_aio_write+0xa1/0xc0
> [1204800.908292]  [<fb0bc94f>] nfs_file_write+0x9f/0x200 [nfs]
> [1204800.908303]  [<c0218454>] do_sync_write+0xa4/0xe0
> [1204800.908314]  [<c032e626>] ? apparmor_file_permission+0x16/0x20
> [1204800.908324]  [<c0302a74>] ? security_file_permission+0x14/0x20
> [1204800.908333]  [<c02185d2>] ? rw_verify_area+0x62/0xd0
> [1204800.908342]  [<c02186e2>] vfs_write+0xa2/0x190
> [1204800.908350]  [<c02183b0>] ? do_sync_write+0x0/0xe0
> [1204800.908359]  [<c0218fa2>] sys_write+0x42/0x70
> [1204800.908367]  [<c05c90a4>] syscall_call+0x7/0xb
>
> This might suggest a hardware fault on the Myth Backend host (like the
> NIC), but I don't believe that to be the case because I've seen the same
> issue on other clients.  I suspect those failures are much rarer because
> the data volume on those clients pales in comparison to the Myth Backend
> workload (virtual guests, etc. - light work - months between failures,
> and it doesn't feel time-related).
>
> The only cure is a hard reset (of the host with the NFS client), as any FS
> operation on that share hangs - including df, ls, sync and umount - so the
> system fails to shut down.
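>
> Concretely, every one of these just blocks forever against the dead mount;
>
>   df /var/lib/mythtv/saturn_recordings
>   ls /var/lib/mythtv/saturn_recordings
>   sync
>   umount /var/lib/mythtv/saturn_recordings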
>
> The kernel on the Myth Backend host isn't new ..
>
> >> uname -a;
> Linux jupiter 2.6.35-22-generic #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC
> 2010 i686 GNU/Linux
>
> Is there a known good/bad version for the kernel/NFS client?  Am I under
> that bar?
>
>
> The GlusterFS NFS server is an embedded platform (Saturn) that has been
> running for 74 days;
>
> >> uptime output;
> 08:39:07 up 74 days, 22:16,  load average: 0.87, 0.94, 0.94
>
> It is a much more modern platform;
>
> >> uname -a;
> Linux (none) 3.2.14 #1 SMP Tue Apr 10 12:46:47 EST 2012 i686 GNU/Linux
>
> It has had one error in all of that time;
> >> dmesg output;
> Pid: 4845, comm: glusterfsd Not tainted 3.2.14 #1
> Call Trace:
>  [<c10512d0>] __rcu_pending+0x64/0x294
>  [<c1051640>] rcu_check_callbacks+0x87/0x98
>  [<c1034521>] update_process_times+0x2d/0x58
>  [<c1047bdf>] tick_periodic+0x63/0x65
>  [<c1047c2d>] tick_handle_periodic+0x17/0x5e
>  [<c1015ae9>] smp_apic_timer_interrupt+0x67/0x7a
>  [<c1b2a691>] apic_timer_interrupt+0x31/0x40
>
> .. this occurred months ago.
>
> Unfortunately, due to its embedded nature, there are no logs coming from
> this platform, only a looped buffer for syslog (and gluster doesn't seem
> to log to syslog).  In previous discussions here (months ago) you'll see
> where I was working to disable/remove logging from GlusterFS so that I
> could keep it alive in an embedded environment - this is the current run
> configuration.
>
> The Myth Backend host only mounts one of the two NFS shares, but I've seen
> the fault on hosts that only mount the other - so I'm reluctant to believe
> that it's a hardware failure at the drive level on the Saturn / Gluster
> server.
>
> The /etc/fstab entry for this share, on the Myth Backend host, is;
>
>   saturn:/recordings /var/lib/mythtv/saturn_recordings nfs
> nfsvers=3,rw,rsize=8192,wsize=8192,hard,intr,sync,dirsync,noac,noatime,nodev,nosuid
> 0  0
>
> When I softened this to async with soft failures (a config taken straight
> from the Gluster site/FAQ), it crashed out in a much shorter time-frame
> (less than a day; one test only - may have been unlucky);
>
>   saturn:/recordings /var/lib/mythtv/saturn_recordings nfs
> defaults,_netdev,nfsvers=3,proto=tcp 0  0
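>
> A middle ground I haven't tried yet would be to keep the hard mount but
> drop the sync/noac penalties - just a sketch, untested;
>
>   saturn:/recordings /var/lib/mythtv/saturn_recordings nfs
> nfsvers=3,proto=tcp,rw,hard,intr,rsize=8192,wsize=8192,noatime,nodev,nosuid
> 0  0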
>
>
> Other than the high-use Myth Backend host, I've failed to accurately nail
> down the trigger for this issue - which is making diagnostics painful (I
> like my TV too much to do more than reboot the failed box - and heaven
> forbid the dad who fails to record Peppa Pig!).
>
>
> Any thoughts?  Beyond enabling logs on the Saturn side ...
>
> Is it possible this is a bug that was fixed in later versions of Gluster?
>
> Appreciate being set straight ..
>
> Cheers,
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>