[GEDI] [PATCH v5 07/11] block: use int64_t instead of int in driver write_zeroes handlers

Sat Jun 5 13:50:54 UTC 2021

04.06.2021 23:09, Eric Blake wrote:
> On Wed, May 05, 2021 at 10:49:57AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> We are generally moving to int64_t for both offset and bytes parameters
>> on all io paths.
>>
>> Main motivation is realization of 64-bit write_zeroes operation for
>> fast zeroing large disk chunks, up to the whole disk.
>>
>> We chose signed type, to be consistent with off_t (which is signed) and
>> with possibility for signed return type (where negative value means
>> error).
>>
>> So, convert driver write_zeroes handlers bytes parameter to int64_t.
>>
>> The only caller of all updated function is bdrv_co_do_pwrite_zeroes().
>>
>> bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
>> callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
>> max_write_zeroes is limited to INT_MAX. So, updated functions all are
>> safe, the will not get "bytes" larger than before.
> 
> they
> 
>>
>> Still, let's look through all updated functions, and add assertions to
>> the ones which are actually unprepared to values larger than INT_MAX.
>> For these drivers also set explicit max_pwrite_zeroes limit.
>>
>> Let's go:
>>
>> backup-top: Calls backup_top_cbw() and bdrv_co_pwrite_zeroes, both
>>    have 64bit argument.
> 
> backup_top_cbw has uint64_t argument (at least on current qemu.git; I
> didn't spot if it was changed in the meantime earlier in this series
> or if I'm missing review of a prerequisite), but we're safe since the
> block layer does not pass in negative values.
> 
>>
>> blkdebug: calculations can't overflow, thanks to
>>    bdrv_check_qiov_request() in generic layer. rule_check() and
>>    bdrv_co_pwrite_zeroes() both have 64bit argument.
> 
> rule_check() is another function currently using uint64_t.
> 
>>
>> blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.
> 
> That, and struct BlkLogWritesFileReq, still use uint64_t.
> 
>>
>> blkreply, copy-on-read, filter-compress: pass to bdrv_co_pwrite_zeroes() which is OK
> 
> blkreplay
> 
>>
>> file-posix: both handler calls raw_do_pwrite_zeroes, which is updated.
>>    In raw_do_pwrite_zeroes() calculations are OK due to
>>    bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
>>    which is uint64_t.
> 
> We also have to check where that uint64_t gets handed;
> handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
> ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
> which takes off_t (and we compile to always have 64-bit off_t), as
> does handle_aiocb_write_zeroes_unmap.  All look safe.
> 
>>
>> gluster: bytes go to GlusterAIOCB::size which is int64_t and to
>>    glfs_zerofill_async works with off_t.
>>
>> iscsi: Aha, here we deal with iscsi_writesame16_task() that has
>>    uint32_t num_blocks argument and iscsi_writesame16_task() has
> 
> s/16/10/
> 
>>    uint16_t argument. Make comments, add assertions and clarify
>>    max_pwrite_zeroes calculation.
>>    iscsi_allocmap_() functions already has int64_t argument
>>    is_byte_request_lun_aligned is simple to update, do it.
>>
>> mirror_top: pass to bdrv_mirror_top_do_write which has uint16_t
> 
> s/16/64/
> 
>>    argument
>>
>> nbd: Aha, here we have protocol limitation, and NBDRequest::len is
>>    uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
>>    OK for now.
>>
>> nvme: Again, protocol limitation. And no inherent limit for
>>    write-zeroes at all. But from code that calculates cdw12 it's obvious
>>    that we do have limit and alignment. Let's clarify it. Also,
>>    obviously the code is not prepared to bytes=0. Let's handle this
> 
> to handle bytes=0
> 
>>    case too.
>>    trace events already 64bit
>>
>> qcow2: offset + bytes and alignment still works good (thanks to
>>    bdrv_check_qiov_request()), so tail calculation is OK
>>    qcow2_subcluster_zeroize() has 64bit argument, should be OK
>>    trace events updated
>>
>> qed: qed_co_request wants int nb_sectors. Also in code we have size_t
>>    used for request length which may be 32bit.. So, let's just keep
> 
> Double dot looks odd.
> 
>>    INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
>>    don't care.
> 
> Yeah, even when size_t is 64-bit, qed is not our high priority so
> sticking to 32-bit limit encourages people to switch away from qed ;)
> 
>>
>> raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
>>    64bit.
> 
> I probably already mentioned it in earlier reviews, but hopefully by
> the end of the series, we clean up raw_adjust_offset() to take
> int64_t* instead of uint64_t* to get rid of ugly casts.  Doesn't have
> to be done in this patch.
> 
>>
>> throttle: Both throttle_group_co_io_limits_intercept() and
>>    bdrv_co_pwrite_zeroes() are 64bit.
>>
>> vmdk: pass to vmdk_pwritev which is 64bit
>>
>> quorum: pass to quorum_co_pwritev() which is 64bit
>>
>> preallocated: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
> 
> File is named preallocate.c, the 'd' seems odd. Worth sorting this
> before qcow2 in the commit message?

I think yes

> 
>>    64bit.
>>
>> Hooray!
>>
>> At this point all block drivers are prepared to 64bit write-zero
>> requests or has explicitly set max_pwrite_zeroes.
> 
> are prepared to support 64bit write-zero requests, or have explicitly set
> 
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov at virtuozzo.com>
>> ---
> 
>> +++ b/block/iscsi.c
>> @@ -1205,15 +1205,16 @@ out_unlock:
>>   
>>   static int
>>   coroutine_fn iscsi_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
>> -                                    int bytes, BdrvRequestFlags flags)
>> +                                    int64_t bytes, BdrvRequestFlags flags)
>>   {
>>       IscsiLun *iscsilun = bs->opaque;
>>       struct IscsiTask iTask;
>>       uint64_t lba;
>> -    uint32_t nb_blocks;
>> +    uint64_t nb_blocks;
>>       bool use_16_for_ws = iscsilun->use_16_for_rw;
>>       int r = 0;
>>   
>> +
>>       if (!is_byte_request_lun_aligned(offset, bytes, iscsilun)) {
> 
> Why the added blank line here?

mistake

> 
>>           return -ENOTSUP;
>>       }
>> @@ -1250,11 +1251,21 @@ coroutine_fn iscsi_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
>>       iscsi_co_init_iscsitask(iscsilun, &iTask);
>>   retry:
>>       if (use_16_for_ws) {
>> +        /*
>> +         * iscsi_writesame16_task num_blocks argument is uint32_t. We rely here
>> +         * on our max_pwrite_zeroes limit.
>> +         */
>> +        assert(nb_blocks < UINT32_MAX);
>>           iTask.task = iscsi_writesame16_task(iscsilun->iscsi, iscsilun->lun, lba,
>>                                               iscsilun->zeroblock, iscsilun->block_size,
>>                                               nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
>>                                               0, 0, iscsi_co_generic_cb, &iTask);
>>       } else {
>> +        /*
>> +         * iscsi_writesame10_task num_blocks argument is uint16_t. We rely here
>> +         * on our max_pwrite_zeroes limit.
>> +         */
>> +        assert(nb_blocks < UINT16_MAX);
>>           iTask.task = iscsi_writesame10_task(iscsilun->iscsi, iscsilun->lun, lba,
>>                                               iscsilun->zeroblock, iscsilun->block_size,
>>                                               nb_blocks, 0, !!(flags & BDRV_REQ_MAY_UNMAP),
> 
> Thanks - these assertions and comments are indeed a lifesaver for
> future maintenance.
> 
>> @@ -2074,10 +2085,10 @@ static void iscsi_refresh_limits(BlockDriverState *bs, Error **errp)
>>           bs->bl.pdiscard_alignment = iscsilun->block_size;
>>       }
>>   
>> -    if (iscsilun->bl.max_ws_len < 0xffffffff / block_size) {
>> -        bs->bl.max_pwrite_zeroes =
>> -            iscsilun->bl.max_ws_len * iscsilun->block_size;
>> -    }
>> +    bs->bl.max_pwrite_zeroes =
>> +        MIN_NON_ZERO(iscsilun->bl.max_ws_len * iscsilun->block_size,
>> +                     max_xfer_len * iscsilun->block_size);
> 
> Works because max_xfer_len is 0xffff if we are stuck using
> writesame10, or 0xffffffff if we are able to use writesame16.
> 
>> +++ b/block/nvme.c
>> @@ -1266,19 +1266,29 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
>>   
>>   static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
>>                                                 int64_t offset,
>> -                                              int bytes,
>> +                                              int64_t bytes,
>>                                                 BdrvRequestFlags flags)
>>   {
>>       BDRVNVMeState *s = bs->opaque;
>>       NVMeQueuePair *ioq = s->queues[INDEX_IO(0)];
>>       NVMeRequest *req;
>> -
>> -    uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;
>> +    uint32_t cdw12;
>>   
>>       if (!s->supports_write_zeroes) {
>>           return -ENOTSUP;
>>       }
>>   
>> +    if (bytes == 0) {
>> +        return 0;
>> +    }
>> +
>> +    cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;
>> +    /*
>> +     * We should not loose information. pwrite_zeroes_alignment and
> 
> lose (this is a common English typo; "lose" rhymes with "use" and is
> opposite of "gain", while "loose" rhymes with "goose" and is opposite
> of "tighten")
> 
>> +++ b/block/qed.c
> 
>> @@ -1408,6 +1409,12 @@ static int coroutine_fn bdrv_qed_co_pwrite_zeroes(BlockDriverState *bs,
>>        */
>>       QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, NULL, bytes);
>>   
>> +    /*
>> +     * QED is not prepared for 62bit write-zero requests, so rely on
> 
> 64bit
> 
>> +     * max_pwrite_zeroes.
>> +     */
>> +    assert(bytes <= INT_MAX);
>> +
>>       /* Fall back if the request is not aligned */
>>       if (qed_offset_into_cluster(s, offset) ||
>>           qed_offset_into_cluster(s, bytes)) {
> 
> Reviewed-by: Eric Blake <eblake at redhat.com>
> 

Thanks!

-- 
Best regards,
Vladimir