[Gluster-users] Glusterd seems to be ignoring that the underlying filesystem went missing
Joe Julian
joe at julianfamily.org
Fri Sep 23 16:31:19 UTC 2016
So my question is, how did you get from each of the bricks being killed,
"Sep 22 19:57:32 glu03 bricks-vol-video-asset-manager-brick1[45162]:
[2016-09-22 17:57:32.714461] M [MSGID: 113075]
[posix-helpers.c:1850:posix_health_check_thread_proc]
0-vol-video-asset-manager-posix: still alive! -> SIGTERM", to having
them running again?
Maybe there's a clue in the brick logs; have you looked at those?
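
If it helps, this is roughly where I'd look first (paths assume the stock
CentOS 7 packaging, so adjust them if your layout differs):

  # what glusterd currently thinks of the brick processes, with their PIDs
  gluster volume status vol-video-asset-manager

  # the per-brick log, named after the brick path
  less /var/log/glusterfs/bricks/bricks-vol-video-asset-manager-brick1.log

  # glusterd's own log, in case something restarted the bricks
  less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
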
On 09/23/2016 09:06 AM, Luca Gervasi wrote:
> Hi guys,
> I've got a strange problem involving this timeline (it matches the "Log
> fragment 1" excerpt below):
> 19:56:50: a disk is detached from the system. This disk actually backs
> the brick of volume V.
> 19:56:50: LVM sees the disk as unreachable and starts its maintenance
> procedures
> 19:56:50: LVM unmounts my thin-provisioned volumes
> 19:57:02: the health check on the affected bricks fails, moving those
> bricks to a down state
> 19:57:32: the XFS filesystems are unmounted
>
> At this point the brick filesystem is no longer mounted. The underlying
> directory is empty (the brick directory is missing too). My assumption
> was that gluster would stop itself under such conditions: it does not.
> Gluster slowly fills my entire root partition, recreating its full tree there.
>
> My only warning sign was the root disk, whose inode usage started climbing towards 100%.
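>
> (For what it's worth, inode usage was the only thing that eventually
> alerted me. A check as simple as the following, assuming the bricks are
> mounted under /bricks, would have flagged the condition much earlier:)
>
>   # inode usage of whatever currently backs the brick path
>   df -i /bricks/vol-homes
>   # a non-zero exit status means the brick filesystem is no longer mounted
>   mountpoint -q /bricks/vol-homes || echo "/bricks/vol-homes is not mounted"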
>
> I've read the release notes of every version after mine (3.7.14,
> 3.7.15) without finding any relevant fix, so at this point I'm fairly
> sure this is an undocumented bug.
> The servers are configured symmetrically.
>
> Could you please help me understand how to prevent gluster from
> continuing to write to an unmounted filesystem? Thanks.
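>
> (The only workaround I've come up with so far, and I have no idea whether
> it is the recommended approach, would be to make the empty mount-point
> directory immutable while nothing is mounted on it, so the brick process
> cannot recreate its tree on the root filesystem:)
>
>   # with the brick stopped and the XFS volume unmounted:
>   chattr +i /bricks/vol-homes
>   # the flag lives on the underlying directory and is hidden again as soon
>   # as the real filesystem is mounted on top of it
>   mount /bricks/vol-homes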
>
> I'm running a 3-node replica on 3 Azure VMs. This is the configuration:
>
> MD (yes, I use md to aggregate 4 disks into a single ~4 TB volume):
> /dev/md128:
> Version : 1.2
> Creation Time : Mon Aug 29 18:10:45 2016
> Raid Level : raid0
> Array Size : 4290248704 (4091.50 GiB 4393.21 GB)
> Raid Devices : 4
> Total Devices : 4
> Persistence : Superblock is persistent
>
> Update Time : Mon Aug 29 18:10:45 2016
> State : clean
> Active Devices : 4
> Working Devices : 4
> Failed Devices : 0
> Spare Devices : 0
>
> Chunk Size : 512K
>
> Name : 128
> UUID : d5c51214:43e48da9:49086616:c1371514
> Events : 0
>
> Number Major Minor RaidDevice State
> 0 8 80 0 active sync /dev/sdf
> 1 8 96 1 active sync /dev/sdg
> 2 8 112 2 active sync /dev/sdh
> 3 8 128 3 active sync /dev/sdi
>
> PV, VG, LV status
> PV  VG  Fmt  Attr  PSize  PFree  DevSize  PV UUID
> /dev/md127  VGdata  lvm2  a--  2.00t  2.00t  2.00t  Kxb6C0-FLIH-4rB1-DKyf-IQuR-bbPE-jm2mu0
> /dev/md128  gluster  lvm2  a--  4.00t  1.07t  4.00t  lDazuw-zBPf-Duis-ZDg1-3zfg-53Ba-2ZF34m
>
> VG  Attr  Ext  #PV  #LV  #SN  VSize  VFree  VG UUID  VProfile
> VGdata  wz--n-  4.00m  1  0  0  2.00t  2.00t  XI2V2X-hdxU-0Jrn-TN7f-GSEk-7aNs-GCdTtn
> gluster  wz--n-  4.00m  1  6  0  4.00t  1.07t  ztxX4f-vTgN-IKop-XePU-OwqW-T9k6-A6uDk0
>
> LV  VG  #Seg  Attr  LSize  Maj  Min  KMaj  KMin  Pool  Origin  Data%  Meta%  Move  Cpy%Sync  Log  Convert  LV UUID  LProfile
> apps-data  gluster  1  Vwi-aotz--  50.00g  -1  -1  253  12  thinpool  0.08  znUMbm-ax1N-R7aj-dxLc-gtif-WOvk-9QC8tq
> feed  gluster  1  Vwi-aotz--  100.00g  -1  -1  253  14  thinpool  0.08  hZ4Isk-dELG-lgFs-2hJ6-aYid-8VKg-3jJko9
> homes  gluster  1  Vwi-aotz--  1.46t  -1  -1  253  11  thinpool  58.58  salIPF-XvsA-kMnm-etjf-Uaqy-2vA9-9WHPkH
> search-data  gluster  1  Vwi-aotz--  100.00g  -1  -1  253  13  thinpool  16.41  Z5hoa3-yI8D-dk5Q-2jWH-N5R2-ge09-RSjPpQ
> thinpool  gluster  1  twi-aotz--  2.93t  -1  -1  253  9  29.85  60.00  oHTbgW-tiPh-yDfj-dNOm-vqsF-fBNH-o1izx2
> video-asset-manager  gluster  1  Vwi-aotz--  100.00g  -1  -1  253  15  thinpool  0.07  4dOXga-96Wa-u3mh-HMmE-iX1I-o7ov-dtJ8lZ
>
> Gluster volume configuration (all volumes use the exact same
> configuration, so listing them all would be redundant):
> Volume Name: vol-homes
> Type: Replicate
> Volume ID: 0c8fa62e-dd7e-429c-a19a-479404b5e9c6
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: glu01.prd.azr:/bricks/vol-homes/brick1
> Brick2: glu02.prd.azr:/bricks/vol-homes/brick1
> Brick3: glu03.prd.azr:/bricks/vol-homes/brick1
> Options Reconfigured:
> performance.readdir-ahead: on
> cluster.server-quorum-type: server
> nfs.disable: disable
> cluster.lookup-unhashed: auto
> performance.nfs.quick-read: on
> performance.nfs.read-ahead: on
> performance.cache-size: 4096MB
> cluster.self-heal-daemon: enable
> diagnostics.brick-log-level: ERROR
> diagnostics.client-log-level: ERROR
> nfs.rpc-auth-unix: off
> nfs.acl: off
> performance.nfs.io-cache: on
> performance.client-io-threads: on
> performance.nfs.stat-prefetch: on
> performance.nfs.io-threads: on
> diagnostics.latency-measurement: on
> diagnostics.count-fop-hits: on
> performance.md-cache-timeout: 1
> performance.cache-refresh-timeout: 1
> performance.io-thread-count: 16
> performance.high-prio-threads: 16
> performance.normal-prio-threads: 16
> performance.low-prio-threads: 16
> performance.least-prio-threads: 1
> cluster.server-quorum-ratio: 60
>
> fstab:
> /dev/gluster/homes /bricks/vol-homes xfs defaults,noatime,nobarrier,nofail 0 2
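>
> (Side note: glusterd currently has no explicit dependency on these mounts.
> If an ordering dependency at boot would help, I guess a systemd drop-in
> along these lines could be added, though I haven't tried it:)
>
>   # /etc/systemd/system/glusterd.service.d/brick-mounts.conf (untested)
>   [Unit]
>   RequiresMountsFor=/bricks/vol-homes /bricks/vol-apps-data /bricks/vol-search-data /bricks/vol-video-asset-manager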
>
> Software:
> CentOS Linux release 7.1.1503 (Core)
> glusterfs-api-3.7.13-1.el7.x86_64
> glusterfs-libs-3.7.13-1.el7.x86_64
> glusterfs-3.7.13-1.el7.x86_64
> glusterfs-fuse-3.7.13-1.el7.x86_64
> glusterfs-server-3.7.13-1.el7.x86_64
> glusterfs-client-xlators-3.7.13-1.el7.x86_64
> glusterfs-cli-3.7.13-1.el7.x86_64
>
>
> Log fragment 1:
> Sep 22 19:56:50 glu03 lvm[868]: WARNING: Device for PV
> lDazuw-zBPf-Duis-ZDg1-3zfg-53Ba-2ZF34m not found or rejected by a filter.
> Sep 22 19:56:50 glu03 lvm[868]: Cannot change VG gluster while PVs are
> missing.
> Sep 22 19:56:50 glu03 lvm[868]: Consider vgreduce --removemissing.
> Sep 22 19:56:50 glu03 lvm[868]: Failed to extend thin metadata
> gluster-thinpool-tpool.
> Sep 22 19:56:50 glu03 lvm[868]: Unmounting thin volume
> gluster-thinpool-tpool from /bricks/vol-homes.
> Sep 22 19:56:50 glu03 lvm[868]: Unmounting thin volume
> gluster-thinpool-tpool from /bricks/vol-search-data.
> Sep 22 19:56:50 glu03 lvm[868]: Unmounting thin volume
> gluster-thinpool-tpool from /bricks/vol-apps-data.
> Sep 22 19:56:50 glu03 lvm[868]: Unmounting thin volume
> gluster-thinpool-tpool from /bricks/vol-video-asset-manager.
> Sep 22 19:57:02 glu03 bricks-vol-video-asset-manager-brick1[45162]:
> [2016-09-22 17:57:02.713428] M [MSGID: 113075]
> [posix-helpers.c:1844:posix_health_check_thread_proc]
> 0-vol-video-asset-manager-posix: health-check failed, going down
> Sep 22 19:57:05 glu03 bricks-vol-apps-data-brick1[44536]: [2016-09-22
> 17:57:05.186146] M [MSGID: 113075]
> [posix-helpers.c:1844:posix_health_check_thread_proc]
> 0-vol-apps-data-posix: health-check failed, going down
> Sep 22 19:57:18 glu03 bricks-vol-search-data-brick1[40928]:
> [2016-09-22 17:57:18.674279] M [MSGID: 113075]
> [posix-helpers.c:1844:posix_health_check_thread_proc]
> 0-vol-search-data-posix: health-check failed, going down
> Sep 22 19:57:32 glu03 bricks-vol-video-asset-manager-brick1[45162]:
> [2016-09-22 17:57:32.714461] M [MSGID: 113075]
> [posix-helpers.c:1850:posix_health_check_thread_proc]
> 0-vol-video-asset-manager-posix: still alive! -> SIGTERM
> Sep 22 19:57:32 glu03 kernel: XFS (dm-15): Unmounting Filesystem
> Sep 22 19:57:35 glu03 bricks-vol-apps-data-brick1[44536]: [2016-09-22
> 17:57:35.186352] M [MSGID: 113075]
> [posix-helpers.c:1850:posix_health_check_thread_proc]
> 0-vol-apps-data-posix: still alive! -> SIGTERM
> Sep 22 19:57:35 glu03 kernel: XFS (dm-12): Unmounting Filesystem
> Sep 22 19:57:48 glu03 bricks-vol-search-data-brick1[40928]:
> [2016-09-22 17:57:48.674444] M [MSGID: 113075]
> [posix-helpers.c:1850:posix_health_check_thread_proc]
> 0-vol-search-data-posix: still alive! -> SIGTERM
> Sep 22 19:57:48 glu03 kernel: XFS (dm-13): Unmounting Filesystem
>
>