[Gluster-devel] [RFC] Zerofill FOP support for GlusterFS

Ric Wheeler ricwheeler at gmail.com
Tue Jul 16 10:41:13 UTC 2013


On 07/16/2013 06:21 AM, Aakash wrote:
> On 07/16/2013 02:25 PM, Niels de Vos wrote:
>> On Mon, Jul 15, 2013 at 01:17:54PM -0400, aakash at linux.vnet.ibm.com wrote:
>>> Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in the
>>> specified range. This fop will be useful when a whole file needs to be
>>> initialized with zeroes (for example, when provisioning zero-filled VM disk
>>> images or scrubbing VM disk images).
>>>
>>> A client/application can issue this fop to zero out a range. The Gluster
>>> server then zeroes out the required range of bytes, i.e. the zeroing is
>>> offloaded to the server. In the absence of this fop, the client/application
>>> has to repeatedly issue write (zero) fops to the server, which is very
>>> inefficient because of the overhead of the RPC calls and acknowledgements.
>>>
>>> WRITESAME is a SCSI T10 command that takes a block of data as input and
>>> writes the same data to other blocks. The write is handled completely within
>>> the storage and is hence known as an offload. Linux now supports the SCSI
>>> WRITESAME command, which is exposed to the user in the form of the
>>> BLKZEROOUT ioctl. The BD xlator can exploit the BLKZEROOUT ioctl to implement
>>> this fop. Zeroing operations can thus be offloaded completely to the storage
>>> device, making them highly efficient.
>> Just wondering (and I think it was mentioned earlier by Vijay already),
>> why not implement a WRITESAME fop and detect in the storage xlators if
>> the BLKZEROOUT ioctl() should be used in the case of writing zeroes?
> Thank you, Niels, for your comments.
>
> In Linux, we can exploit SCSI WRITESAME using the BLKZEROOUT ioctl. This
> ioctl issues WRITESAME with a zero-filled block as the input block, so Linux
> supports writing only zeroes using WRITESAME. Writing zeroes is also a very
> common operation during initialization and scrubbing of VM disk images. We
> have the BD xlator in GlusterFS for block devices, which can issue this
> ioctl. Hence, instead of a generic WRITESAME fop, we are adding a zerofill
> fop. I have a patch which makes use of this ioctl to implement zerofill in
> the BD xlator. I will be posting it soon.
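
For readers who have not used the ioctl discussed above, here is a minimal
sketch of zeroing a range on a block device with BLKZEROOUT. This is not the
BD xlator patch. The helper names range_is_sector_aligned and
zero_block_range are hypothetical, and the 512-byte alignment check reflects
my understanding of what the kernel enforces for this ioctl. The kernel may
also fall back to plain zero writes when the device does not support
WRITE SAME.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h> /* BLKZEROOUT */

/* Assumption: the kernel rejects ranges that are not multiples of 512 bytes. */
static int range_is_sector_aligned(uint64_t start, uint64_t len)
{
    return ((start & 511) == 0) && ((len & 511) == 0);
}

/*
 * Hypothetical helper: zero 'len' bytes starting at byte 'start' on the
 * block device open on 'fd'. Returns 0 on success, -1 with errno set
 * on failure.
 */
static int zero_block_range(int fd, uint64_t start, uint64_t len)
{
    uint64_t range[2];

    if (!range_is_sector_aligned(start, len)) {
        errno = EINVAL;
        return -1;
    }
    range[0] = start; /* byte offset on the device */
    range[1] = len;   /* number of bytes to zero */
    return ioctl(fd, BLKZEROOUT, range) == 0 ? 0 : -1;
}
```

A caller would open the block device read-write and pass its descriptor,
for example zero_block_range(fd, 0, 1048576) to zero the first 1 MiB.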

A lot of enterprise arrays do this in a clever way, but if you use WRITE_SAME 
against a physical SAS drive, it can be a very long running command...

ric

>> I'll try to keep an eye on the merging of this change. Whenever
>> that happens, we can send a patch to Wireshark so that the new fop gets
>> detected correctly.
>>
>> Thanks,
>> Niels
>>
>>> The fop takes two arguments, offset and size. It zeroes out 'size' bytes
>>> in an open file, starting at position 'offset'.
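
To make the offset/size semantics concrete, here is a minimal sketch of the
same range zeroing done with plain repeated writes on a local file
descriptor. It is not taken from the patch. The helper name
zero_range_by_writes and the 64 KiB chunk size are assumptions made for
illustration, and a production version would also want to retry on EINTR.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define ZERO_CHUNK 65536 /* arbitrary chunk size chosen for this sketch */

/*
 * Hypothetical helper: write 'size' zero bytes to 'fd' starting at
 * 'offset', one chunk at a time. Returns 0 on success, -1 on failure.
 */
static int zero_range_by_writes(int fd, off_t offset, size_t size)
{
    char *zeros;
    int ret = 0;

    assert(offset >= 0);
    zeros = calloc(1, ZERO_CHUNK); /* calloc returns zeroed memory */
    if (zeros == NULL)
        return -1;

    while (size > 0) {
        size_t want = size < ZERO_CHUNK ? size : ZERO_CHUNK;
        ssize_t done = pwrite(fd, zeros, want, offset);

        if (done <= 0) { /* error, or no progress */
            ret = -1;
            break;
        }
        offset += done; /* a short write just advances by what was written */
        size -= (size_t)done;
    }
    free(zeros);
    return ret;
}
```

Bytes outside [offset, offset + size) are left untouched, which is the
behaviour described for the fop.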
>>>
>>> This patch adds zerofill support to the following areas:
>>>
>>>          - libglusterfs
>>>          - io-stats
>>>          - performance/md-cache,open-behind
>>>          - quota
>>>          - cluster/afr,dht,stripe
>>>          - rpc/xdr
>>>          - protocol/client,server
>>>          - io-threads
>>>          - marker
>>>          - storage/posix
>>>          - libgfapi
>>>
>>> Client applications can exploit this fop by using glfs_zerofill, introduced in
>>> libgfapi. FUSE support for this fop has not been added, as there is no system
>>> call for this fop.
>>>
>>> TODO:
>>>       * Add zerofill support to trace xlator
>>>       * Expose zerofill capability as part of gluster volume info
>>>
>>> Here is a performance comparison of server-offloaded zerofill vs. zeroing
>>> out using repeated writes.
>>>
>>> [root at llmvm02 remote]# time ./offloaded aakash-test log 20
>>>
>>> real        3m34.155s
>>> user        0m0.018s
>>> sys        0m0.040s
>>> [root at llmvm02 remote]# time ./manually aakash-test log 20
>>>
>>> real        4m23.043s
>>> user        0m2.197s
>>> sys        0m14.457s
>>> [root at llmvm02 remote]# time ./offloaded aakash-test log 25;
>>>
>>> real        4m28.363s
>>> user        0m0.021s
>>> sys        0m0.025s
>>> [root at llmvm02 remote]# time ./manually aakash-test log 25
>>>
>>> real        5m34.278s
>>> user        0m2.957s
>>> sys        0m18.808s
>>>
>>> The argument 'log' is the file used for logging, and the third argument is
>>> the size in GB.
>>>
>>> As we can see, there is a performance improvement of around 20% with this
>>> fop. For block devices, with the use of the BLKZEROOUT ioctl, we can improve
>>> the performance even more.
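
The "around 20%" figure can be checked from the 'real' times quoted above.
A small sketch of the arithmetic follows; the helper names to_seconds and
percent_saved are mine, not from the test programs.

```c
#include <assert.h>

/* Convert a `time` reading of minutes and seconds into seconds. */
static double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Percentage of wall-clock time saved relative to the baseline run. */
static double percent_saved(double baseline, double improved)
{
    assert(baseline > 0.0);
    return (baseline - improved) / baseline * 100.0;
}
```

For the 20 GB run, percent_saved(to_seconds(4, 23.043), to_seconds(3, 34.155))
works out to roughly 18.6%, and for the 25 GB run,
percent_saved(to_seconds(5, 34.278), to_seconds(4, 28.363)) is roughly 19.7%.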
>>>
>>> The applications used for performance comparison can be found here:
>>>
>>> For manually writing zeros: 
>>> https://docs.google.com/file/d/0B4jeWncLrfS3LVNybW9lR2dPZkk/edit?usp=sharing
>>>
>>> For offloaded zeroing : 
>>> https://docs.google.com/file/d/0B4jeWncLrfS3LVNybW9lR2dPZkk/edit?usp=sharing
>>>
>>> Change-Id: I081159f5f7edde0ddb78169fb4c21c776ec91a18
>>> Signed-off-by: Aakash Lal Das <aakash at linux.vnet.ibm.com>
>>>
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at nongnu.org
>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel