[Gluster-users] Issue with Pro active self healing for Erasure coding

Xavier Hernandez xhernandez at datalab.es
Fri Jun 26 07:41:30 UTC 2015


Could you file a bug for this?

I'll investigate the problem.

Xavi

On 06/26/2015 08:58 AM, Mohamed Pakkeer wrote:
> Hi Xavier
>
> We are facing the same I/O error after upgrading to gluster 3.7.2.
>
> Description of problem:
> =======================
> In a 3 x (4 + 2) = 18 distributed-disperse volume, there are
> input/output errors on some files on the FUSE mount after simulating
> the following scenario:
>
> 1.   Simulate a disk failure by killing the brick pid, then add the
> same disk back after formatting the drive
> 2.   Try to read the recovered/healed file after 2 bricks/nodes were
> brought down
>
> Version-Release number of selected component (if applicable):
> ==============================================================
>
> admin at node001:~$ sudo gluster --version
> glusterfs 3.7.2 built on Jun 19 2015 16:33:27
> Repository revision: git://git.gluster.com/glusterfs.git
> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
> GlusterFS comes with ABSOLUTELY NO WARRANTY.
> You may redistribute copies of GlusterFS under the terms of the GNU
> General Public License.
>
> Steps to Reproduce:
>
> 1. Create a 3 x (4 + 2) disperse volume across nodes
> 2. FUSE mount on the client and start creating files/directories with mkdir and rsync/dd
> 3. Simulate a disk failure by killing the pid of one brick on a node, then add the same disk back after formatting the drive
> 4. Start the volume with force
> 5. Self healing creates the file name with 0 bytes on the newly formatted drive
> 6. Wait for self healing to finish, but it never happens; the file stays at 0 bytes
> 7. Try to read the same file from the client; the 0-byte file now starts to recover and recovery completes. Get the md5sum of the file with all bricks up and the result is positive
> 8. Now bring down 2 of the nodes
> 9. Try to get the md5sum of the same recovered file; the client throws an I/O error (a minimal command sketch of these steps is given below)
>
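> A minimal command sketch of the steps above (the volume name, brick
> paths and test file are illustrative; the brick pid and device have to
> be looked up with "gluster volume status" on the affected node):
>
> # create and start the 3 x (4 + 2) disperse volume (on any server node)
> sudo gluster volume create vaulttest21 disperse 6 redundancy 2 \
>     10.1.2.{1..6}:/media/disk1 10.1.2.{1..6}:/media/disk2 \
>     10.1.2.{1..6}:/media/disk3 force
> sudo gluster volume start vaulttest21
>
> # on the client: FUSE mount and write some data
> sudo mount -t glusterfs 10.1.2.1:/vaulttest21 /mnt/gluster
> dd if=/dev/urandom of=/mnt/gluster/up1 bs=1M count=800
>
> # on node3: simulate the failure of disk2
> sudo gluster volume status vaulttest21    # note the pid of 10.1.2.3:/media/disk2
> sudo kill -9 <brick-pid>
> sudo umount /media/disk2
> sudo mkfs.xfs -f /dev/<disk2-device>      # reformat the brick filesystem
> sudo mount /media/disk2                   # assumes an fstab entry for the brick
> sudo gluster volume start vaulttest21 force
>
> # from the client: read back after healing, then again with 2 nodes down
> md5sum /mnt/gluster/up1
>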
> Screen shots
>
> admin at node001:~$ sudo gluster volume info
>
> Volume Name: vaulttest21
> Type: Distributed-Disperse
> Volume ID: ac6a374d-a0a2-405c-823d-0672fd92f0af
> Status: Started
> Number of Bricks: 3 x (4 + 2) = 18
> Transport-type: tcp
> Bricks:
> Brick1: 10.1.2.1:/media/disk1
> Brick2: 10.1.2.2:/media/disk1
> Brick3: 10.1.2.3:/media/disk1
> Brick4: 10.1.2.4:/media/disk1
> Brick5: 10.1.2.5:/media/disk1
> Brick6: 10.1.2.6:/media/disk1
> Brick7: 10.1.2.1:/media/disk2
> Brick8: 10.1.2.2:/media/disk2
> Brick9: 10.1.2.3:/media/disk2
> Brick10: 10.1.2.4:/media/disk2
> Brick11: 10.1.2.5:/media/disk2
> Brick12: 10.1.2.6:/media/disk2
> Brick13: 10.1.2.1:/media/disk3
> Brick14: 10.1.2.2:/media/disk3
> Brick15: 10.1.2.3:/media/disk3
> Brick16: 10.1.2.4:/media/disk3
> Brick17: 10.1.2.5:/media/disk3
> Brick18: 10.1.2.6:/media/disk3
> Options Reconfigured:
> performance.readdir-ahead: on
>
> *_After simulating the disk failure (node3, disk2) and adding the same
> disk back again after formatting the drive_*
>
> admin at node003:~$ date
>
> Thu Jun 25 *16:21:58* IST 2015
>
>
> admin at node003:~$ ls -l -h /media/disk2
>
> total 1.6G
>
> drwxr-xr-x 3 root root   22 Jun 25 16:18 1
>
> *-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*
>
> *-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*
>
> -rw-r--r-- 2 root root 797M Jun 25 16:03 up3
>
> -rw-r--r-- 2 root root 797M Jun 25 16:04 up4
>
> --
>
> admin at node003:~$ date
>
> Thu Jun 25 *16:25:09* IST 2015
>
>
> admin at node003:~$ ls -l -h  /media/disk2
>
> total 1.6G
>
> drwxr-xr-x 3 root root   22 Jun 25 16:18 1
>
> *-rw-r--r-- 2 root root    0 Jun 25 16:17 up1*
>
> *-rw-r--r-- 2 root root    0 Jun 25 16:17 up2*
>
> -rw-r--r-- 2 root root 797M Jun 25 16:03 up3
>
> -rw-r--r-- 2 root root 797M Jun 25 16:04 up4
>
>
> admin at node003:~$ date
>
> Thu Jun 25 *16:41:25* IST 2015
>
>
> admin at node003:~$  ls -l -h  /media/disk2
>
> total 1.6G
>
> drwxr-xr-x 3 root root   22 Jun 25 16:18 1
>
> -rw-r--r-- 2 root root    0 Jun 25 16:17 up1
>
> -rw-r--r-- 2 root root    0 Jun 25 16:17 up2
>
> -rw-r--r-- 2 root root 797M Jun 25 16:03 up3
>
> -rw-r--r-- 2 root root 797M Jun 25 16:04 up4
>
>
> *After waiting nearly 20 minutes, self healing has still not recovered
> the full data chunk. Then we try to read the file using md5sum:*
>
> root at mas03:/mnt/gluster# time md5sum up1
> 4650543ade404ed5a1171726e76f8b7c  up1
>
> real    1m58.010s
> user    0m6.243s
> sys     0m0.778s
>
> *The missing chunk starts growing after the read*
>
> admin at node003:~$ ls -l -h  /media/disk2
> total 2.6G
> drwxr-xr-x 3 root root   22 Jun 25 16:18 1
> -rw-r--r-- 2 root root 797M Jun 25 15:57 up1
> -rw-r--r-- 2 root root    0 Jun 25 16:17 up2
> -rw-r--r-- 2 root root 797M Jun 25 16:03 up3
> -rw-r--r-- 2 root root 797M Jun 25 16:04 up4
>
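> For reference, the heal state can also be checked, and a full heal
> triggered manually, with the standard heal commands (volume name as
> above):
>
> sudo gluster volume heal vaulttest21 info
> sudo gluster volume heal vaulttest21 full
>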
> *_To verify the healed file after nodes 5 & 6 were taken offline_*
>
> root at mas03:/mnt/gluster# time md5sum up1
> md5sum: up1: *Input/output error*
>
> The I/O error is still not resolved. Could you suggest if anything is
> wrong with our testing?
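>
> If it helps, the brick/process status at the moment of the error can be
> captured with:
>
> sudo gluster volume status vaulttest21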
>
>
> admin at node001:~$ sudo gluster volume get vaulttest21 all
> Option                                  Value
> ------                                  -----
> cluster.lookup-unhashed                 on
> cluster.lookup-optimize                 off
> cluster.min-free-disk                   10%
> cluster.min-free-inodes                 5%
> cluster.rebalance-stats                 off
> cluster.subvols-per-directory           (null)
> cluster.readdir-optimize                off
> cluster.rsync-hash-regex                (null)
> cluster.extra-hash-regex                (null)
> cluster.dht-xattr-name                  trusted.glusterfs.dht
> cluster.randomize-hash-range-by-gfid    off
> cluster.rebal-throttle                  normal
> cluster.local-volume-name               (null)
> cluster.weighted-rebalance              on
> cluster.entry-change-log                on
> cluster.read-subvolume                  (null)
> cluster.read-subvolume-index            -1
> cluster.read-hash-mode                  1
> cluster.background-self-heal-count      16
> cluster.metadata-self-heal              on
> cluster.data-self-heal                  on
> cluster.entry-self-heal                 on
> cluster.self-heal-daemon                on
> cluster.heal-timeout                    600
> cluster.self-heal-window-size           1
> cluster.data-change-log                 on
> cluster.metadata-change-log             on
> cluster.data-self-heal-algorithm        (null)
> cluster.eager-lock                      on
> cluster.quorum-type                     none
> cluster.quorum-count                    (null)
> cluster.choose-local                    true
> cluster.self-heal-readdir-size          1KB
> cluster.post-op-delay-secs              1
> cluster.ensure-durability               on
> cluster.consistent-metadata             no
> cluster.stripe-block-size               128KB
> cluster.stripe-coalesce                 true
> diagnostics.latency-measurement         off
> diagnostics.dump-fd-stats               off
> diagnostics.count-fop-hits              off
> diagnostics.brick-log-level             INFO
> diagnostics.client-log-level            INFO
> diagnostics.brick-sys-log-level         CRITICAL
> diagnostics.client-sys-log-level        CRITICAL
> diagnostics.brick-logger                (null)
> diagnostics.client-logger               (null)
> diagnostics.brick-log-format            (null)
> diagnostics.client-log-format           (null)
> diagnostics.brick-log-buf-size          5
> diagnostics.client-log-buf-size         5
> diagnostics.brick-log-flush-timeout     120
> diagnostics.client-log-flush-timeout    120
> performance.cache-max-file-size         0
> performance.cache-min-file-size         0
> performance.cache-refresh-timeout       1
> performance.cache-priority
> performance.cache-size                  32MB
> performance.io-thread-count             16
> performance.high-prio-threads           16
> performance.normal-prio-threads         16
> performance.low-prio-threads            16
> performance.least-prio-threads          1
> performance.enable-least-priority       on
> performance.least-rate-limit            0
> performance.cache-size                  128MB
> performance.flush-behind                on
> performance.nfs.flush-behind            on
> performance.write-behind-window-size    1MB
> performance.nfs.write-behind-window-size 1MB
> performance.strict-o-direct             off
> performance.nfs.strict-o-direct         off
> performance.strict-write-ordering       off
> performance.nfs.strict-write-ordering   off
> performance.lazy-open                   yes
> performance.read-after-open             no
> performance.read-ahead-page-count       4
> performance.md-cache-timeout            1
> features.encryption                     off
> encryption.master-key                   (null)
> encryption.data-key-size                256
> encryption.block-size                   4096
> network.frame-timeout                   1800
> network.ping-timeout                    42
> network.tcp-window-size                 (null)
> features.lock-heal                      off
> features.grace-timeout                  10
> network.remote-dio                      disable
> client.event-threads                    2
> network.ping-timeout                    42
> network.tcp-window-size                 (null)
> network.inode-lru-limit                 16384
> auth.allow                              *
> auth.reject                             (null)
> transport.keepalive                     (null)
> server.allow-insecure                   (null)
> server.root-squash                      off
> server.anonuid                          65534
> server.anongid                          65534
> server.statedump-path                   /var/run/gluster
> server.outstanding-rpc-limit            64
> features.lock-heal                      off
> features.grace-timeout                  (null)
> server.ssl                              (null)
> auth.ssl-allow                          *
> server.manage-gids                      off
> client.send-gids                        on
> server.gid-timeout                      300
> server.own-thread                       (null)
> server.event-threads                    2
> performance.write-behind                on
> performance.read-ahead                  on
> performance.readdir-ahead               on
> performance.io-cache                    on
> performance.quick-read                  on
> performance.open-behind                 on
> performance.stat-prefetch               on
> performance.client-io-threads           off
> performance.nfs.write-behind            on
> performance.nfs.read-ahead              off
> performance.nfs.io-cache                off
> performance.nfs.quick-read              off
> performance.nfs.stat-prefetch           off
> performance.nfs.io-threads              off
> performance.force-readdirp              true
> features.file-snapshot                  off
> features.uss                            off
> features.snapshot-directory             .snaps
> features.show-snapshot-directory        off
> network.compression                     off
> network.compression.window-size         -15
> network.compression.mem-level           8
> network.compression.min-size            0
> network.compression.compression-level   -1
> network.compression.debug               false
> features.limit-usage                    (null)
> features.quota-timeout                  0
> features.default-soft-limit             80%
> features.soft-timeout                   60
> features.hard-timeout                   5
> features.alert-time                     86400
> features.quota-deem-statfs              off
> geo-replication.indexing                off
> geo-replication.indexing                off
> geo-replication.ignore-pid-check        off
> geo-replication.ignore-pid-check        off
> features.quota                          off
> features.inode-quota                    off
> features.bitrot                         disable
> debug.trace                             off
> debug.log-history                       no
> debug.log-file                          no
> debug.exclude-ops                       (null)
> debug.include-ops                       (null)
> debug.error-gen                         off
> debug.error-failure                     (null)
> debug.error-number                      (null)
> debug.random-failure                    off
> debug.error-fops                        (null)
> nfs.enable-ino32                        no
> nfs.mem-factor                          15
> nfs.export-dirs                         on
> nfs.export-volumes                      on
> nfs.addr-namelookup                     off
> nfs.dynamic-volumes                     off
> nfs.register-with-portmap               on
> nfs.outstanding-rpc-limit               16
> nfs.port                                2049
> nfs.rpc-auth-unix                       on
> nfs.rpc-auth-null                       on
> nfs.rpc-auth-allow                      all
> nfs.rpc-auth-reject                     none
> nfs.ports-insecure                      off
> nfs.trusted-sync                        off
> nfs.trusted-write                       off
> nfs.volume-access                       read-write
> nfs.export-dir
> nfs.disable                             false
> nfs.nlm                                 on
> nfs.acl                                 on
> nfs.mount-udp                           off
> nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab
> nfs.rpc-statd                           /sbin/rpc.statd
> nfs.server-aux-gids                     off
> nfs.drc                                 off
> nfs.drc-size                            0x20000
> nfs.read-size                           (1 * 1048576ULL)
> nfs.write-size                          (1 * 1048576ULL)
> nfs.readdir-size                        (1 * 1048576ULL)
> nfs.exports-auth-enable                 (null)
> nfs.auth-refresh-interval-sec           (null)
> nfs.auth-cache-ttl-sec                  (null)
> features.read-only                      off
> features.worm                           off
> storage.linux-aio                       off
> storage.batch-fsync-mode                reverse-fsync
> storage.batch-fsync-delay-usec          0
> storage.owner-uid                       -1
> storage.owner-gid                       -1
> storage.node-uuid-pathinfo              off
> storage.health-check-interval           30
> storage.build-pgfid                     off
> storage.bd-aio                          off
> cluster.server-quorum-type              off
> cluster.server-quorum-ratio             0
> changelog.changelog                     off
> changelog.changelog-dir                 (null)
> changelog.encoding                      ascii
> changelog.rollover-time                 15
> changelog.fsync-interval                5
> changelog.changelog-barrier-timeout     120
> changelog.capture-del-path              off
> features.barrier                        disable
> features.barrier-timeout                120
> features.trash                          off
> features.trash-dir                      .trashcan
> features.trash-eliminate-path           (null)
> features.trash-max-filesize             5MB
> features.trash-internal-op              off
> cluster.enable-shared-storage           disable
> features.ctr-enabled                    off
> features.record-counters                off
> features.ctr_link_consistency           off
> locks.trace                             (null)
> cluster.disperse-self-heal-daemon       enable
> cluster.quorum-reads                    no
> client.bind-insecure                    (null)
> ganesha.enable                          off
> features.shard                          off
> features.shard-block-size               4MB
> features.scrub-throttle                 lazy
> features.scrub-freq                     biweekly
> features.expiry-time                    120
> features.cache-invalidation             off
> features.cache-invalidation-timeout     60
>
>
> Thanks & regards
> Backer
>
>
>
>
>
> On Mon, Jun 15, 2015 at 1:26 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
>     On 06/15/2015 09:25 AM, Mohamed Pakkeer wrote:
>
>         Hi Xavier,
>
>         When can we expect the 3.7.2 release for fixing the I/O error
>         which we discussed on this mail thread?
>
>
>     As per the latest meeting held last wednesday [1] it will be
>     released this week.
>
>     Xavi
>
>     [1]
>     http://meetbot.fedoraproject.org/gluster-meeting/2015-06-10/gluster-meeting.2015-06-10-12.01.html
>
>
>         Thanks
>         Backer
>
>         On Wed, May 27, 2015 at 8:02 PM, Xavier Hernandez
>         <xhernandez at datalab.es> wrote:
>
>              Hi again,
>
>              in today's gluster meeting [1] it has been decided that
>         3.7.1 will
>              be released urgently to solve a bug in glusterd. All fixes
>         planned
>              for 3.7.1 will be moved to 3.7.2 which will be released
>         soon after.
>
>              Xavi
>
>              [1]
>         http://meetbot.fedoraproject.org/gluster-meeting/2015-05-27/gluster-meeting.2015-05-27-12.01.html
>
>
>              On 05/27/2015 12:01 PM, Xavier Hernandez wrote:
>
>                  On 05/27/2015 11:26 AM, Mohamed Pakkeer wrote:
>
>                      Hi Xavier,
>
>                      Thanks for your reply. When can we expect the 3.7.1
>         release?
>
>
>                  AFAIK a beta of 3.7.1 will be released very soon.
>
>
>                      cheers
>                      Backer
>
>                      On Wed, May 27, 2015 at 1:22 PM, Xavier Hernandez
>                      <xhernandez at datalab.es> wrote:
>
>                           Hi,
>
>                           some Input/Output error issues have been
>         identified and
>                      fixed. These
>                           fixes will be available on 3.7.1.
>
>                           Xavi
>
>
>                           On 05/26/2015 10:15 AM, Mohamed Pakkeer wrote:
>
>                               Hi Glusterfs Experts,
>
>                               We are testing the glusterfs 3.7.0 tarball
>                               on our 10-node glusterfs cluster. Each node
>                               has 36 drives; please find the volume info
>                               below.
>
>                               Volume Name: vaulttest5
>                               Type: Distributed-Disperse
>                               Volume ID:
>         68e082a6-9819-4885-856c-1510cd201bd9
>                               Status: Started
>                               Number of Bricks: 36 x (8 + 2) = 360
>                               Transport-type: tcp
>                               Bricks:
>                               Brick1: 10.1.2.1:/media/disk1
>                               Brick2: 10.1.2.2:/media/disk1
>                               Brick3: 10.1.2.3:/media/disk1
>                               Brick4: 10.1.2.4:/media/disk1
>                               Brick5: 10.1.2.5:/media/disk1
>                               Brick6: 10.1.2.6:/media/disk1
>                               Brick7: 10.1.2.7:/media/disk1
>                               Brick8: 10.1.2.8:/media/disk1
>                               Brick9: 10.1.2.9:/media/disk1
>                               Brick10: 10.1.2.10:/media/disk1
>                               Brick11: 10.1.2.1:/media/disk2
>                               Brick12: 10.1.2.2:/media/disk2
>                               Brick13: 10.1.2.3:/media/disk2
>                               Brick14: 10.1.2.4:/media/disk2
>                               Brick15: 10.1.2.5:/media/disk2
>                               Brick16: 10.1.2.6:/media/disk2
>                               Brick17: 10.1.2.7:/media/disk2
>                               Brick18: 10.1.2.8:/media/disk2
>                               Brick19: 10.1.2.9:/media/disk2
>                               Brick20: 10.1.2.10:/media/disk2
>                               ...
>                               ....
>                               Brick351: 10.1.2.1:/media/disk36
>                               Brick352: 10.1.2.2:/media/disk36
>                               Brick353: 10.1.2.3:/media/disk36
>                               Brick354: 10.1.2.4:/media/disk36
>                               Brick355: 10.1.2.5:/media/disk36
>                               Brick356: 10.1.2.6:/media/disk36
>                               Brick357: 10.1.2.7:/media/disk36
>                               Brick358: 10.1.2.8:/media/disk36
>                               Brick359: 10.1.2.9:/media/disk36
>                               Brick360: 10.1.2.10:/media/disk36
>                               Options Reconfigured:
>                               performance.readdir-ahead: on
>
>                               We did some performance testing and
>                               simulated the proactive self healing for
>                               erasure coding. The disperse volume has
>                               been created across nodes.
>
>                               _*Description of problem*_
>
>                               I disconnected the *network of two nodes*
>                               and tried to write some video files, and
>                               *glusterfs wrote the video files on the
>                               remaining 8 nodes perfectly*. I tried to
>                               download the uploaded file and it was
>                               downloaded perfectly. Then I re-enabled the
>                               network of the two nodes; the pro active
>                               self healing mechanism worked and wrote the
>                               unavailable chunk of data to the recently
>                               enabled node from the other 8 nodes. But
>                               when I tried to download the same file, it
>                               showed an Input/output error. I couldn't
>                               download the file. I think there is an
>                               issue in pro active self healing.
>
>                               Also, we tried the simulation with a
>                               single-node network failure. We faced the
>                               same I/O error while downloading the file.
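>
>                               (One way to simulate this kind of network
>                               failure, assuming a dedicated storage
>                               interface such as eth1 on the node, is:
>
>                               sudo ip link set eth1 down
>
>                               and "sudo ip link set eth1 up" to restore
>                               connectivity afterwards; the interface
>                               name here is only an assumption.)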
>
>
>                               _Error while downloading file_
>
>                               root at master02:/home/admin# rsync -r --progress
>                               /mnt/gluster/file13_AN
>                               ./1/file13_AN-2
>
>                               sending incremental file list
>
>                               file13_AN
>
>                                   3,342,355,597 100% 4.87MB/s    0:10:54
>         (xfr#1,
>                      to-chk=0/1)
>
>                               rsync: read errors mapping
>         "/mnt/gluster/file13_AN":
>                               Input/output error (5)
>
>                               WARNING: file13_AN failed verification --
>         update
>                      discarded (will
>                               try again).
>
>                                  root at master02:/home/admin# cp
>         /mnt/gluster/file13_AN
>                               ./1/file13_AN-3
>
>                               cp: error reading ‘/mnt/gluster/file13_AN’:
>                      Input/output error
>
>                               cp: failed to extend ‘./1/file13_AN-3’:
>                               Input/output error
>
>
>                               We can't conclude the issue with glusterfs
>         3.7.0 or
>                      our glusterfs
>                               configuration.
>
>                               Any help would be greatly appreciated
>
>                               --
>                               Cheers
>                               Backer
>
>
>
>
>           _______________________________________________
>                               Gluster-users mailing list
>         Gluster-users at gluster.org
>         http://www.gluster.org/mailman/listinfo/gluster-users
>
>
>
>
>
>
>                  _______________________________________________
>                  Gluster-users mailing list
>         Gluster-users at gluster.org
>         http://www.gluster.org/mailman/listinfo/gluster-users
>
>
>
>
>
>
>
>
> --
> Thanks & Regards
> K.Mohamed Pakkeer
> Mobile- 0091-8754410114
>

