[Gluster-devel] Fwd: Question about EC locking

Thu Feb 2 07:12:32 UTC 2017

On Fri, Jan 13, 2017 at 8:03 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:

> Hi,
>
> On 13/01/17 10:58, jayakrishnan mm wrote:
>
>> Hi Xavier,
>> I went through the source code. Some questions remain.
>>
>> 1. If two clients try to write to the same file, both writes should
>> succeed, even if they overlap. (Locks should ensure they happen in
>> sequence on the bricks.) From the source code:
>>          lock->flock.l_type = F_WRLCK;
>>          lock->flock.l_whence = SEEK_SET;
>>
>>             fop->flock.l_len += ec_adjust_offset(fop->xl->private,
>>                                                  &fop->flock.l_start, 1);
>>             fop->flock.l_len = ec_adjust_size(fop->xl->private,
>>                                               fop->flock.l_len, 1);
>> If flock.l_len is 0, the entire file is locked for writing.
>>
>> In my test case with 2 clients, I always get flock.l_len as 0, but I am
>> still able to write to the same file from both clients at the same
>> time.
>>
>
> How can you be sure you are really writing at the same time? Do you get
> partial writes from some of the clients?

I am not sure whether they happen simultaneously. I am using fio to generate
the writes.

>
>
>
>> If it is acquiring locks chunk by chunk, why am I always getting
>> l_len = 0?
>>
>
> EC doesn't acquire partial locks. The entire file is locked when a
> modification is needed. This makes it possible to reuse locks for future
> operations (eager locking).
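
In fcntl()-style range locks, a length of 0 means "up to the end of the
file", so l_start = 0 with l_len = 0 covers the whole file. The following is
a minimal sketch in C of why such a lock conflicts with every write range,
whatever its offset. struct byte_range and both helpers are hypothetical
names used only for illustration; this is not GlusterFS code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical byte range in fcntl() style: len == 0 means "to end of file". */
struct byte_range {
    uint64_t start;
    uint64_t len;
};

/* Exclusive end of a range; UINT64_MAX stands in for "end of file". */
static uint64_t range_end(struct byte_range r)
{
    return r.len == 0 ? UINT64_MAX : r.start + r.len;
}

/* Two ranges conflict when they share at least one byte. */
static bool ranges_overlap(struct byte_range a, struct byte_range b)
{
    return a.start < range_end(b) && b.start < range_end(a);
}
```

With whole = {0, 0}, ranges_overlap(whole, w) is true for any write range w,
while two small writes at distant offsets do not overlap each other.
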
>
>> Why am I not getting the actual write size and offset (for
>> flock.l_len and flock.l_start respectively) for each write FOP?
>> (In AFR, they are set to transaction.len and transaction.start
>> respectively, which in turn are the write length and offset for the
>> normal write case.)
>>
>
> Because an erasure code splits the data into smaller fragments for each
> brick, offsets and lengths need to be adjusted.
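
The adjustment can be pictured as aligning the range to whole stripes and
then dividing by the number of data fragments. The sketch below is only an
illustration of that arithmetic, assuming a stripe size that is a multiple
of the fragment count. struct ec_layout, adjust_offset() and adjust_size()
are hypothetical names and are not the actual ec_adjust_offset() and
ec_adjust_size() implementations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: one stripe is split evenly across `fragments` bricks. */
struct ec_layout {
    uint64_t stripe_size; /* bytes of user data per stripe */
    uint64_t fragments;   /* number of data fragments per stripe */
};

/* Round *offset down to a stripe boundary, convert it to a per-brick offset
 * and return the number of bytes skipped at the head of the stripe. */
static uint64_t adjust_offset(const struct ec_layout *l, uint64_t *offset)
{
    assert(l->fragments > 0 && l->stripe_size % l->fragments == 0);
    uint64_t head = *offset % l->stripe_size;
    *offset = (*offset - head) / l->fragments;
    return head;
}

/* Round size up to a whole number of stripes and convert to per-brick size. */
static uint64_t adjust_size(const struct ec_layout *l, uint64_t size)
{
    uint64_t stripes = (size + l->stripe_size - 1) / l->stripe_size;
    return stripes * l->stripe_size / l->fragments;
}
```

For example, with an assumed 2048-byte stripe over 4 fragments, a 1000-byte
write at offset 3000 becomes a 512-byte range at offset 512 on each brick.
With this arithmetic a length of 0 stays 0.
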
>
>
>> 2. As per the source code, a full file lock is also taken by the
>> self-heal daemon (shd).
>>
>> ec_heal_inodelk(heal, F_WRLCK, 1, 0, 0);
>> which means offset = 0 and size = 0 in the ec_heal_lock() function in
>> ec-heal.c:
>> flock.l_start = offset;
>> flock.l_len = size;
>> Does it mean that, for a single file, a write cannot happen
>> simultaneously with healing?
>>
>
> Correct. The heal procedure acts like an additional client. If a client and
> the heal process try to write at the same time, they must be serialized, like
> any other regular write. However, heal only takes the full lock for some
> critical operations. Regular self-heal of file contents is done by locking
> chunk by chunk.
>

I have a question about index heal and full heal.
As per the code, the index healer thread (ec_shd_index_healer) is created when
there is a child_up event or when there is a
TRANSLATOR_OP/GF_SHD_OP_HEAL_INDEX.
When does the second case arise?

The full heal thread (ec_shd_full_healer) is created only when a
TRANSLATOR_OP/GF_SHD_OP_HEAL_FULL arises. Does this happen only in the
replace-brick case?

Thanks & regards
JK

>
> Xavi
>
>
>> Correct me if I am wrong.
>>
>> Best Regards
>> JK
>>
>>
>>
>>
>>
>>
>> On Wed, Dec 14, 2016 at 12:07 PM, jayakrishnan mm
>> <jayakrishnan.mm at gmail.com> wrote:
>>
>>     Thanks Xavier, for making it clear.
>>     Regards
>>     JK
>>
>>
>>     On Dec 13, 2016 3:52 PM, "Xavier Hernandez"
>>     <xhernandez at datalab.es> wrote:
>>
>>         Hi JK,
>>
>>
>>         On 12/13/2016 08:34 AM, jayakrishnan mm wrote:
>>
>>             Dear Xavi,
>>
>>             How do I test the locks, for example the locks for the
>>             write fop? I have two independent clients, both trying to
>>             write to the same file.
>>
>>
>>             1. According to my understanding, both can write
>>             successfully if the offsets don't overlap. I mean, the
>>             WRITE FOP takes a chunk lock on the file. As long as the
>>             clients don't try to write to the same chunk, it should be
>>             OK. If no locks are present, it can lead to inconsistency.
>>
>>
>>         With locks, all writes will be fine as defined by POSIX (i.e. the
>>         final result will be equivalent to the sequential execution of
>>         both operations, though in an undefined order), even if they
>>         overlap. Without locks, there are chances that some bricks
>>         execute the operations in one order and the remaining bricks
>>         execute the same operations in the reverse order, causing data
>>         corruption.
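
The reordering problem can be shown with a small simulation. This is a
simplified sketch in C: each "brick" is modeled as a plain in-memory copy of
the file rather than an encoded fragment, and struct write_op and the helper
names are hypothetical.

```c
#include <string.h>

/* Hypothetical write request: copy `len` bytes of `data` at `offset`. */
struct write_op {
    size_t offset;
    size_t len;
    const char *data;
};

/* Apply one write to a brick's in-memory copy of the file. */
static void apply_write(char *brick, const struct write_op *op)
{
    memcpy(brick + op->offset, op->data, op->len);
}

/* Apply two writes to one brick in the given order. */
static void apply_in_order(char *brick, const struct write_op *first,
                           const struct write_op *second)
{
    apply_write(brick, first);
    apply_write(brick, second);
}
```

Starting from an 8-byte file "........", take A = "AAAAAA" at offset 0 and
B = "BBBBB" at offset 3. A brick that applies A then B ends with "AAABBBBB",
while a brick that applies B then A ends with "AAAAAABB". Both results are
valid on their own, but the bricks no longer agree with each other.
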
>>
>>
>>
>>
>>             2. Different FOPs can always run simultaneously (for
>>             example, WRITE and READ FOPs, or two READ FOPs).
>>
>>
>>         All fops can be executed concurrently. If there's any chance
>>         that two operations could interfere, locks are taken in the
>>         appropriate places. For example, reads cannot be merged with
>>         overlapping writes. Otherwise they could return inconsistent data.
>>
>>
>>
>>             3. A WRITE and some metadata FOP (like setattr) cannot
>>             happen together when locks are used, even though the
>>             chances of that are very low.
>>
>>
>>         As in 2, if there's any possible interference, the appropriate
>>         locks will be taken.
>>
>>         You can look at the code to see which locks are taken for each
>>         fop. See the corresponding ec_manager_<fop>() function, in the
>>         EC_STATE_LOCK switch case. There you will see calls to
>>         ec_lock_prepare_xxx() for each taken lock.
>>
>>         Xavi
>>
>>
>>             Please clarify.
>>
>>             Best regards
>>             JK
>>
>>
>>
>>             On Wed, Nov 30, 2016 at 5:49 PM, jayakrishnan mm
>>             <jayakrishnan.mm at gmail.com> wrote:
>>
>>                 Hi Xavier,
>>
>>                 Thank you very much for your explanation. This helped me
>>                 to understand more about locking in EC.
>>
>>                 Best Regards
>>                 JK
>>
>>
>>                 On Mon, Nov 28, 2016 at 4:17 PM, Xavier Hernandez
>>                 <xhernandez at datalab.es> wrote:
>>
>>                     Hi,
>>
>>                     On 11/28/2016 02:59 AM, jayakrishnan mm wrote:
>>
>>                         Hi Xavier,
>>
>>                         I notice that the EC xlator uses blocking locks.
>>                         Any specific reason for this?
>>
>>
>>                     In a distributed filesystem like Gluster, a
>>                     synchronization mechanism is a must to avoid data
>>                     corruption.
>>
>>
>>                         Do you think this will affect performance?
>>
>>
>>                     Of course the need for locks has a performance
>>                     impact, and we cannot avoid them if we want to
>>                     guarantee data integrity. However, some
>>                     optimizations have been applied, especially eager
>>                     locking, which allows a lock to be reused without
>>                     unlocking and locking again.
>>
>>
>>                         (In comparison, AFR first tries non-blocking
>>                         locks and, if that is not successful, then
>>                         tries blocking locks.)
>>
>>
>>                     EC also tries a non-blocking lock first.
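
The "non-blocking first, then blocking" pattern can be illustrated with
plain POSIX fcntl() locks on a local file. This is only an analogue in C of
the idea discussed here, not the inodelk calls that EC or AFR send to the
bricks; lock_whole_file() and unlock_whole_file() are hypothetical helper
names.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Take a whole-file write lock: try without blocking, then wait if busy.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
static int lock_whole_file(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                      /* 0 means "to end of file" */

    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 0;                      /* granted without waiting */
    if (errno != EAGAIN && errno != EACCES)
        return -1;                     /* a real error, not contention */

    return fcntl(fd, F_SETLKW, &fl);   /* contended: block until granted */
}

/* Release the whole-file lock taken above. */
static int unlock_whole_file(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    return fcntl(fd, F_SETLK, &fl);
}
```

The cheap attempt succeeds immediately in the uncontended case, and the
caller only pays the cost of waiting when another process already holds a
conflicting lock.
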
>>
>>
>>                         Also, why are two locks needed per FOP? One for
>>                         normal I/O and another for self-healing?
>>
>>
>>                     The only fop that currently needs two locks is
>>                     'rename', and only when the source and destination
>>                     directories are different. All other fops take at
>>                     most one lock.
>>
>>                     Best regards,
>>
>>                     Xavi
>>
>>
>>                         Best regards
>>                         JK
>>
>>
>>                         _______________________________________________
>>                         Gluster-devel mailing list
>>                         Gluster-devel at gluster.org
>>                         http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>