[Gluster-devel] [RFC] Zerofill FOP support for GlusterFS

Aakash aakash at linux.vnet.ibm.com
Tue Jul 16 11:34:00 UTC 2013


On 07/16/2013 04:11 PM, Ric Wheeler wrote:
> On 07/16/2013 06:21 AM, Aakash wrote:
>> On 07/16/2013 02:25 PM, Niels de Vos wrote:
>>> On Mon, Jul 15, 2013 at 01:17:54PM -0400, aakash at linux.vnet.ibm.com 
>>> wrote:
>>>> Add support for a new ZEROFILL fop. Zerofill writes zeroes to a
>>>> file in the specified range. This fop will be useful when a whole
>>>> file needs to be initialized with zeroes (for example, when
>>>> provisioning zero-filled VM disk images or scrubbing VM disk
>>>> images).
>>>>
>>>> The client/application can issue this FOP for zeroing out. The
>>>> Gluster server will zero out the required range of bytes, i.e.
>>>> server-offloaded zeroing. In the absence of this fop, the
>>>> client/application has to repeatedly issue write (zero) fops to
>>>> the server, which is a very inefficient method because of the
>>>> overheads involved in RPC calls and acknowledgements.
>>>>
>>>> WRITESAME is a SCSI T10 command that takes a block of data as
>>>> input and writes the same data to other blocks; this write is
>>>> handled completely within the storage and hence is known as an
>>>> offload. Linux now has support for the SCSI WRITESAME command,
>>>> which is exposed to the user in the form of the BLKZEROOUT ioctl.
>>>> The BD xlator can exploit the BLKZEROOUT ioctl to implement this
>>>> fop. Thus zeroing-out operations can be completely offloaded to
>>>> the storage device, making them highly efficient.
>>> Just wondering (and I think it was mentioned earlier by Vijay already),
>>> why not implement a WRITESAME fop and detect in the storage xlators if
>>> the BLKZEROOUT ioctl() should be used in the case of writing zeroes?
>>    Thank you Niels for your comments.
>>
>>      In Linux, we can exploit SCSI WRITESAME using the BLKZEROOUT
>>      ioctl. This ioctl issues WRITESAME with a zero-filled block as
>>      the input block, so Linux supports writing only zeroes via
>>      WRITESAME. Also, writing zeroes is a very common operation
>>      during initialization and scrubbing of VM disk images. We have
>>      the BD xlator in GlusterFS for block devices, which can issue
>>      this ioctl. Hence, instead of a generic WRITESAME fop, we are
>>      adding a zerofill fop. I have a patch which makes use of this
>>      ioctl to implement zerofill in the BD xlator. I will be posting
>>      it soon.
>
> A lot of enterprise arrays do this in a clever way, but if you use
> WRITE_SAME against a physical SAS drive, it can be a very
> long-running command...
>
> ric
Thanks, Ric, for your comment.

I still feel that writing zeroes using this command will be faster than
the traditional way of writing zeroes. Since SAS is the latest storage
interface for Direct Attached Storage, should it not be faster? Please
enlighten me.

Thanks,
Aakash


>
>>>   I'll try to keep an eye open on the merging of this change. Whenever
>>> that happens, we can send a patch to Wireshark so that the new fop gets
>>> detected correctly.
>>>
>>> Thanks,
>>> Niels
>>>
>>>> The fop takes two arguments, offset and size. It zeroes out 'size'
>>>> bytes in an opened file starting from the 'offset' position.
>>>>
>>>> This patch adds zerofill support to the following areas:
>>>>
>>>>          - libglusterfs
>>>>          - io-stats
>>>>          - performance/md-cache,open-behind
>>>>          - quota
>>>>          - cluster/afr,dht,stripe
>>>>          - rpc/xdr
>>>>          - protocol/client,server
>>>>          - io-threads
>>>>          - marker
>>>>          - storage/posix
>>>>          - libgfapi
>>>>
>>>> Client applications can exploit this fop by using glfs_zerofill,
>>>> introduced in libgfapi. FUSE support for this fop has not been
>>>> added, as there is no system call for it.
>>>>
>>>> TODO :
>>>>       * Add zerofill support to trace xlator
>>>>       * Expose zerofill capability as part of gluster volume info
>>>>
>>>> Here is a performance comparison of server-offloaded zerofill vs
>>>> zeroing out using repeated writes.
>>>>
>>>> [root at llmvm02 remote]# time ./offloaded aakash-test log 20
>>>>
>>>> real        3m34.155s
>>>> user        0m0.018s
>>>> sys        0m0.040s
>>>> [root at llmvm02 remote]# time ./manually aakash-test log 20
>>>>
>>>> real        4m23.043s
>>>> user        0m2.197s
>>>> sys        0m14.457s
>>>> [root at llmvm02 remote]# time ./offloaded aakash-test log 25;
>>>>
>>>> real        4m28.363s
>>>> user        0m0.021s
>>>> sys        0m0.025s
>>>> [root at llmvm02 remote]# time ./manually aakash-test log 25
>>>>
>>>> real        5m34.278s
>>>> user        0m2.957s
>>>> sys        0m18.808s
>>>>
>>>> The argument 'log' is a file used for logging purposes, and the
>>>> third argument is the size in GB.
>>>>
>>>> As we can see, there is a performance improvement of around 20%
>>>> with this fop. For block devices, with the use of the BLKZEROOUT
>>>> ioctl, we can improve the performance even more.
>>>>
>>>> The applications used for performance comparison can be found here:
>>>>
>>>> For manually writing zeros: 
>>>> https://docs.google.com/file/d/0B4jeWncLrfS3LVNybW9lR2dPZkk/edit?usp=sharing
>>>>
>>>> For offloaded zeroing : 
>>>> https://docs.google.com/file/d/0B4jeWncLrfS3LVNybW9lR2dPZkk/edit?usp=sharing
>>>>
>>>> Change-Id: I081159f5f7edde0ddb78169fb4c21c776ec91a18
>>>> Signed-off-by: Aakash Lal Das <aakash at linux.vnet.ibm.com>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at nongnu.org
>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>>
>




