[Gluster-users] Input/Output Error when deleting folder

RASTELLI Alessandro alessandro.rastelli at skytv.it
Thu Jan 8 13:12:48 UTC 2015


Thanks a lot for the support and the exhaustive explanation, Xavier.
A.

-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez at datalab.es] 
Sent: Thursday, 8 January 2015 14:05
To: RASTELLI Alessandro
Cc: gluster-users at gluster.org
Subject: Re: [Gluster-users] Input/Output Error when deleting folder

Hi,

On 01/08/2015 09:51 AM, RASTELLI Alessandro wrote:
> Hi Xavi,
> now there are some files on nodes1-2-3 and others on nodes4-5, so I think I'm going to destroy and re-create the volume from scratch (I can afford it now).

If data is not needed, this is the best way to remove all problems. 
However, if you continue testing and reach this situation again, I would be very interested to know what operations you performed, with as many details about your workload as you can provide. Maybe there's a bug causing this problem.

>
> In your opinion, having 5 nodes with 10x 4TB disks each, what's the best way to dimension the bricks?
> Now we configured disperse, 2 bricks per node per volume (2x 4TB RAID-0 each); if I'm not wrong, we can afford losing 2 bricks (= an entire node). Would it be better to use distribute, with 1 brick per node (10x 4TB RAID-5 each)?
> Or do you have other suggestions?

The best configuration depends on your specific hardware characteristics and your needs or preferences.

The main factor is the MTBF/AFR of the disks (Mean Time Between Failures / Annualized Failure Rate).

The relationship between MTBF and AFR is given by (assuming the disks run uninterrupted for the whole year):

     AFR = 1 - exp(-8760 / MTBF)

AFR is the probability that a single disk fails in one year.
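
As an example, here is a quick Python sketch of that conversion (the MTBF value is only an illustrative assumption, not a figure from your hardware):

    import math

    def afr(mtbf_hours, hours_per_year=8760):
        # Probability that a single disk fails within one year of continuous use.
        return 1 - math.exp(-hours_per_year / mtbf_hours)

    print(afr(300000))   # ~0.029, i.e. roughly the 3% AFR assumed below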

In your environment, each server has 10 disks. If we assume an AFR of 3%, we can calculate some failure probabilities in different configurations (all probabilities are in one year):

Failure probability of a single disk:            3.00%
Failure probability of a RAID-0  with  5 disks: 14.13%
Failure probability of a RAID-0  with 10 disks: 26.26%
Failure probability of a RAID-5  with  5 disks:  0.85%
Failure probability of a RAID-5  with 10 disks:  3.45% *
Failure probability of a RAID-6  with  5 disks:  0.03%
Failure probability of a RAID-6  with 10 disks:  0.28%
Failure probability of a RAID-50 with 10 disks:  1.69%

* Note that a RAID-5 of 10 disks has a higher probability of failure than a single disk in this case.
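
If you want to reproduce these numbers, they come from a simple binomial model: a RAID-0 of n disks is lost if any disk fails, a RAID-5 if 2 or more fail, a RAID-6 if 3 or more fail, and a RAID-50 if any of its RAID-5 groups is lost. A Python sketch of that model (it ignores rebuild windows and correlated failures, just as the figures above do):

    from math import comb

    def p_at_least(k, n, p):
        # Probability that at least k of n disks fail within a year, given per-disk AFR p.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    p = 0.03
    print(p_at_least(1, 10, p))              # RAID-0, 10 disks   -> ~26.26%
    print(p_at_least(2, 10, p))              # RAID-5, 10 disks   -> ~3.45%
    print(p_at_least(3, 10, p))              # RAID-6, 10 disks   -> ~0.28%
    print(1 - (1 - p_at_least(2, 5, p))**2)  # RAID-50, 2x5 disks -> ~1.69%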

Once you have calculated the failure probability for your hardware configuration, this probability can be considered as the AFR of a single disk used as a brick for gluster.

Then you can calculate the failure probability of the gluster volume (assuming you have an AFR of 3.45% using a RAID-5 of 10 disks):

Failure probability of a Disperse  3:1:  0.35%
Failure probability of a Disperse  5:1:  1.11%
Failure probability of a Disperse  6:2:  0.08%
Failure probability of a Disperse 10:2:  0.41%
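
These follow from the same binomial model: a disperse N:R set survives up to R simultaneous brick failures, so it is lost when more than R bricks fail. Reusing the p_at_least helper from the sketch above, with p now being the per-brick AFR:

    p_brick = 0.0345                     # AFR of a brick built on a RAID-5 of 10 disks
    print(p_at_least(2, 3, p_brick))     # Disperse  3:1 -> ~0.35%
    print(p_at_least(2, 5, p_brick))     # Disperse  5:1 -> ~1.11%
    print(p_at_least(3, 6, p_brick))     # Disperse  6:2 -> ~0.08%
    print(p_at_least(3, 10, p_brick))    # Disperse 10:2 -> ~0.41%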

Gluster has the possibility of using Distribute. Distribute is similar to a RAID-0 (it combines multiple subvolumes into a single one), but if one subvolume fails, only data stored in that subvolume is lost (in a RAID-0, if a single disk fails, the entire RAID is lost).

This doesn't reduce the probability of failure, but it reduces the impact of that failure (it's much harder to lose all data):

Failure prob of a Distributed-Dispersed 2x3:1: 0.6956% (1 subvol)
                                               0.0012% (2 subvol)
Failure prob of a Distributed-Dispersed 2x5:1: 0.0428135% (1 subvol)
                                               0.0000046% (2 subvol)
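
The split between "1 subvol" and "2 subvol" is again basic probability: if each disperse subvolume fails with probability q during the year and failures are independent, exactly one of two subvolumes is lost with probability 2q(1-q) and both with probability q^2. A small sketch for the 2x3:1 case (using the ~0.35% per-subvolume figure from above; the exact percentages depend on the brick AFR assumed for each row):

    q = 0.0035                  # failure probability of one 3:1 disperse subvolume
    print(2 * q * (1 - q))      # exactly one of the two subvolumes lost -> ~0.70%
    print(q * q)                # both subvolumes lost                   -> ~0.0012%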

Of course all these numbers are only statistical. A batch of defective drives or servers can ruin any configuration.

You should also consider the time needed to rebuild a brick if a RAID fails. If you create a RAID-5 of 10 disks, for example, gluster will need to recover up to 36 TB of information (if the brick was full). Using smaller RAIDs reduces this amount of data.

If you use a single RAID to store multiple bricks, you will get multiple brick failures in case of a RAID or server failure. In any case, it's not recommended to have more than one brick of the same subvolume in the same server. It's better to use distribute in this case (a 10:2 configuration, where a single server failure causes 2 bricks to fail, is almost equivalent to a 5:1 configuration with respect to probabilities, especially if disks are configured in a RAID-0).

I wouldn't recommend using RAID-0 with gluster. Instead of creating a RAID-0 of 2 disks, it's better to create 2 bricks belonging to two different gluster subvolumes and use distribute.

Failure probability of one brick using RAID-0 of two disks: 5.91%
Failure probability of two bricks using two disks:          5.82% (1 subvol)
                                                            0.09% (2 subvol)
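
These come from the same arithmetic (reusing the p_at_least helper from the earlier sketch, with the assumed 3% per-disk AFR):

    p = 0.03
    print(p_at_least(1, 2, p))   # RAID-0 of 2 disks lost           -> ~5.91%
    print(2 * p * (1 - p))       # exactly one of the 2 bricks lost -> ~5.82%
    print(p * p)                 # both bricks lost                 -> ~0.09%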

RAID-5 or RAID-6 can be useful for a single disk failure because the disk can be recovered locally in the server without having to read data from other servers. Only a more critical failure will require gluster to rebuild brick contents. However, bigger RAIDs have greater failure probabilities (though they waste less physical disk space).

You must also consider the cost of growing a volume. Disperse and Replicate need to grow in multiples of the subvolume size. This means that if you create a 3:1 configuration you will need to add 3 new bricks if you want to get more space. If you start with a 10:2 configuration you will need to add 10 new bricks to get more space.

In your case I would recommend using two RAID-5 of 5 disks each, or a single RAID-6 of 10 disks, in each server. You can also opt to not use any RAID and have 10 independent disks in each server. I would also create relatively small bricks (for example 4TB each) and use a distributed-dispersed 5:1, with one brick of each subvolume in each server.

With this configuration, if you lose one RAID or an entire server, you will only lose, at most, one brick of each subvolume.

Probability of failure using RAID-6:            0.0076% (1 subvol)
Probability of failure using RAID-5 (5 disks):  0.07% (1 subvol)
Probability of failure without RAID:            0.85% (1 subvol)
Probability of failure using RAID-5 (10 disks): 1.11% (1 subvol)
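
These are the same p_at_least calculation applied to a 5:1 subvolume, with the per-brick AFR of each RAID option as input (they match the figures above up to rounding of the per-brick AFRs):

    print(p_at_least(2, 5, 0.0028))   # bricks on RAID-6 of 10 disks -> ~0.008%
    print(p_at_least(2, 5, 0.0085))   # bricks on RAID-5 of 5 disks  -> ~0.07%
    print(p_at_least(2, 5, 0.03))     # bricks without RAID          -> ~0.85%
    print(p_at_least(2, 5, 0.0345))   # bricks on RAID-5 of 10 disks -> ~1.11%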

Of course it's better to use RAID, but you also waste more space:

Available space using RAID-6:            128 TB
Available space using RAID-5 (5 disks):  128 TB
Available space using RAID-5 (10 disks): 144 TB
Available space without RAID:            160 TB
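
The capacity figures are just the raw capacity multiplied by the RAID data fraction and by the disperse data fraction (N-R)/N. A quick check in Python, assuming 5 servers x 10 disks x 4 TB = 200 TB raw:

    raw_tb = 5 * 10 * 4                      # 200 TB of raw disk space

    def usable(raid_data_fraction, n, r):
        # Usable space after RAID overhead and disperse N:R redundancy.
        return raw_tb * raid_data_fraction * (n - r) / n

    print(usable(8 / 10, 5, 1))   # RAID-6 of 10 disks, disperse 5:1 -> 128 TB
    print(usable(4 / 5,  5, 1))   # RAID-5 of 5 disks,  disperse 5:1 -> 128 TB
    print(usable(9 / 10, 5, 1))   # RAID-5 of 10 disks, disperse 5:1 -> 144 TB
    print(usable(1.0,    5, 1))   # no RAID,            disperse 5:1 -> 160 TB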

Using RAID, you will recover integrity faster when only one or two disks fail. But it will take more time when gluster has to recover more than one brick (all bricks contained in the failed RAID).

You can also use disperse with redundancy 2. In your case it should be a 5:2. This configuration is not considered optimal, but it's possible that with your workload it performs quite well (you should test it). 
With this configuration I wouldn't recommend any RAID, or at most a RAID-5 of 5 disks.

Probability of failure using RAID-6:            0.00002% (1 subvol)
Probability of failure using RAID-5 (5 disks):  0.00060% (1 subvol)
Probability of failure without RAID:            0.026% (1 subvol)
Probability of failure using RAID-5 (10 disks): 0.039% (1 subvol)
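
Same model with redundancy 2 (more than 2 of the 5 bricks in the subvolume have to fail), again reusing p_at_least with the per-brick AFRs:

    print(p_at_least(3, 5, 0.0028))   # bricks on RAID-6 of 10 disks -> ~0.00002%
    print(p_at_least(3, 5, 0.0085))   # bricks on RAID-5 of 5 disks  -> ~0.0006%
    print(p_at_least(3, 5, 0.03))     # bricks without RAID          -> ~0.026%
    print(p_at_least(3, 5, 0.0345))   # bricks on RAID-5 of 10 disks -> ~0.039%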

Available space using RAID-6:             96 TB
Available space using RAID-5 (5 disks):   96 TB
Available space using RAID-5 (10 disks): 108 TB
Available space without RAID:            120 TB
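
(For reference, the usable() sketch shown earlier reproduces these numbers when called with n=5 and r=2 instead of 5:1.)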

Hope this helps a little to decide the best configuration for you.

Xavi

>
> Thanks
> A.
>
> -----Original Message-----
> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
> Sent: Wednesday, 7 January 2015 18:14
> To: RASTELLI Alessandro
> Cc: gluster-users at gluster.org; CAZZANIGA Stefano; UBERTINI Gabriele; 
> TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
> Subject: Re: [Gluster-users] Input/Output Error when deleting folder
>
> If that file is missing only from gluster03-mi, and it has the same attributes in all remaining bricks, self-heal should recover it automatically.
>
> Are there differences in the extended attributes of the file on bricks that have it ?
>
> On 01/07/2015 05:22 PM, RASTELLI Alessandro wrote:
>> It worked... partially :)
>> Now I can access the folders again, but I can't delete them because there are a couple of files in them (which I don't need). The files exist only on nodes 1, 2, 4 and 5, but not on node 3:
>>
>> [root at gluster02-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218/Rec_218_1_part_14656.ts
>> getfattr: Removing leading '/' from absolute path names
>> # file: brick1/recorder/Rec218/Rec_218_1_part_14656.ts
>> trusted.ec.config=0x0000080a02000200
>> trusted.ec.size=0x0000000034400000
>> trusted.ec.version=0x0000000000001a20
>> trusted.gfid=0x8d5da5a1cd1949618a5b96657857ceb6
>>
>> [root at gluster03-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218/Rec_218_1_part_14656.ts
>> getfattr: /brick1/recorder/Rec218/Rec_218_1_part_14656.ts: No such file or directory
>>
>> How do I proceed?
>> Thanks
>>
>> -----Original Message-----
>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>> Sent: Wednesday, 7 January 2015 16:45
>> To: RASTELLI Alessandro
>> Cc: gluster-users at gluster.org; CAZZANIGA Stefano; UBERTINI Gabriele; 
>> TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
>> Subject: Re: [Gluster-users] Input/Output Error when deleting folder
>>
>> Sorry, the command should be:
>>
>>        setfattr -n trusted.ec.version -v 0x0000000000000001 <brick path>/Rec218
>>
>> On 01/07/2015 04:34 PM, RASTELLI Alessandro wrote:
>>> See my answers below:
>>> 1.
>>> [root at gluster03-mi ~]# ls -l /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
>>> ls: cannot access /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4: No such file or directory
>>> [root at gluster03-mi ~]# ls -l /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
>>> lrwxrwxrwx 1 root root 55 Dec 17 17:37 /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd -> ../../00/00/00000000-0000-0000-0000-000000000001/Rec218
>>> [root at gluster03-mi ~]# ls -l /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
>>> ls: cannot access /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4: No such file or directory
>>> [root at gluster03-mi ~]# ls -l /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
>>> lrwxrwxrwx 1 root root 55 Dec 17 17:37 /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd -> ../../00/00/00000000-0000-0000-0000-000000000001/Rec218
>>>
>>> 2.
>>> /Rec218 is supposed to be empty (or, I don't need to restore the files).
>>> I stopped the volume, but when executing the command I get an error:
>>> [root at gluster01-mi ~]# setfattr -n trusted.ec.version -v 0x1 /brick1/recorder/Rec218
>>> bad input encoding
>>>
>>> Regards
>>> A.
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>> Sent: Wednesday, 7 January 2015 16:08
>>> To: RASTELLI Alessandro
>>> Cc: gluster-users at gluster.org; CAZZANIGA Stefano; UBERTINI Gabriele; 
>>> TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
>>> Subject: Re: [Gluster-users] Input/Output Error when deleting folder
>>>
>>> I see two problems here:
>>>
>>> 1. Something very strange has happened on gluster03-mi. It contains the directory, but it's not the same one that exists on the other bricks (8 bricks have gfid a9d904af-0d9e-4018-acb2-881bd8b3c2e4, while that node has gfid bda849fc-a556-469e-ad84-ed074f2c1bcd).
>>>
>>> Whatever has happened here has affected both bricks of that node in the same way.
>>>
>>> What do these commands return on gluster03-mi:
>>>
>>> ls -l /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
>>> ls -l /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
>>>
>>> ls -l /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
>>> ls -l /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
>>>
>>> 2. It seems that node gluster04-mi has been stopped (or rebooted, or has failed) while an operation that modifies the directory contents was being executed, so it has lost an update and is out of sync (both bricks on the same server have missed one update, so it seems clear that it's not a brick problem but a server problem).
>>>
>>> The global result of all this is that you have 4 failed bricks on a configuration that only supports 2 failed bricks.
>>>
>>> BTW, having two or more bricks on the same server is not recommended because a single server failure causes multiple bricks to be lost. In this case a directory can be recovered, but if this happens to a file, it won't be 100% recoverable.
>>>
>>> Are there any files inside /Rec218 ?
>>>
>>> If you are going to delete the directory and all its contents, and the brick contents in gluster03-mi are the same as in the other servers, the following commands should be safe (otherwise let me know before doing anything):
>>>
>>> Before starting you must be sure that nothing is creating or deleting entries inside /Rec218. It would be even better if this could be done with volume stopped.
>>>
>>> On each brick (including gluster03-mi):
>>>         setfattr -n trusted.ec.version -v 0x1 <brick path>/Rec218
>>>
>>> On bricks in gluster03-mi:
>>>         setfattr -n trusted.gfid -v 0xa9d904af0d9e4018acb2881bd8b3c2e4 <brick path>/Rec218
>>>         setfattr -n trusted.glusterfs.dht -v 0x000000010000000000000000ffffffff <brick path>/Rec218
>>>
>>> On client:
>>>         check that the directory is accessible and its contents seem ok. If so:
>>>             rm -rf <mount point>/Rec218
>>>
>>> If you have a way to reproduce this situation, let me know.
>>>
>>> Xavi
>>>
>>> On 01/07/2015 03:31 PM, RASTELLI Alessandro wrote:
>>>> [root at gluster01-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick1/recorder/Rec218
>>>> trusted.ec.version=0x000000000000693a
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>>
>>>> [root at gluster01-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick2/recorder/Rec218
>>>> trusted.ec.version=0x000000000000693a
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>>
>>>>
>>>> [root at gluster02-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick1/recorder/Rec218
>>>> trusted.ec.version=0x000000000000693a
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>>
>>>> [root at gluster02-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick2/recorder/Rec218
>>>> trusted.ec.version=0x000000000000693a
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>>
>>>>
>>>> [root at gluster03-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick1/recorder/Rec218
>>>> trusted.gfid=0xbda849fca556469ead84ed074f2c1bcd
>>>>
>>>> [root at gluster03-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick2/recorder/Rec218
>>>> trusted.gfid=0xbda849fca556469ead84ed074f2c1bcd
>>>>
>>>>
>>>> [root at gluster04-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick1/recorder/Rec218
>>>> trusted.ec.version=0x0000000000006939
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>>
>>>> [root at gluster04-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick2/recorder/Rec218
>>>> trusted.ec.version=0x0000000000006939
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>>
>>>>
>>>> [root at gluster05-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick1/recorder/Rec218
>>>> trusted.ec.version=0x000000000000693a
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>>
>>>> [root at gluster05-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: brick2/recorder/Rec218
>>>> trusted.ec.version=0x000000000000693a
>>>> trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
>>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff

