[Bugs] [Bug 1705884] New: Image size as reported from the fuse mount is incorrect

bugzilla at redhat.com bugzilla at redhat.com
Fri May 3 06:44:21 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1705884

            Bug ID: 1705884
           Summary: Image size as reported from the fuse mount is
                    incorrect
           Product: GlusterFS
           Version: mainline
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: sharding
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
        QA Contact: bugs at gluster.org
                CC: bugs at gluster.org, kdhananj at redhat.com, pasik at iki.fi,
                    rhs-bugs at redhat.com, sabose at redhat.com,
                    sankarshan at redhat.com, sasundar at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1668001
            Blocks: 1667998
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1668001 +++

Description of problem:
-----------------------
The size of the VM image file as reported from the fuse mount is incorrect.
For a file of size 1 TB, the size on disk is reported as 8 ZB.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
upstream master

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. On the Gluster storage domain, create a preallocated disk image of size 1 TB
2. Check the size of the file after its creation has succeeded

Actual results:
---------------
The size on disk is reported as 8 ZB, though the file was created with a size of 1 TB

Expected results:
-----------------
The size of the file should match the size requested by the user


Additional info:
----------------
The volume in question is a replica 3 sharded volume.
[root@rhsqa-grafton10 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data
(arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable

--- Additional comment from SATHEESARAN on 2019-01-21 16:32:39 UTC ---

Size of the file as reported from the fuse mount:

[root@ ~]# ls -lsah
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b 
8.0Z -rw-rw----. 1 vdsm kvm 1.1T Jan 21 17:14
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b

[root@ ~]# du -shc
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E    
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E     total

Note that the disk image is preallocated with 1072GB of space

--- Additional comment from SATHEESARAN on 2019-04-01 19:25:15 UTC ---

(In reply to SATHEESARAN from comment #5)
> (In reply to Krutika Dhananjay from comment #3)
> > Also, do you still have the setup in this state? If so, I'd like to take
> > a look.
> > 
> > -Krutika
> 
> Hi Krutika,
> 
> The setup is no longer available. Let me recreate the issue and provide you
> the setup

This issue is very easily reproducible. Create a preallocated image on the
replicate volume with sharding enabled.
Use 'qemu-img' to create the VM image.

See the following test:
[root@ ~]# qemu-img create -f raw -o preallocation=falloc /mnt/test/vm1.img 1T
Formatting '/mnt/test/vm1.img', fmt=raw size=1099511627776
preallocation='falloc' 

[root@ ]# ls /mnt/test
vm1.img

[root@ ]# ls -lsah vm1.img 
8.0Z -rw-r--r--. 1 root root 1.0T Apr  2 00:45 vm1.img

--- Additional comment from Krutika Dhananjay on 2019-04-11 06:07:35 UTC ---

So I tried this locally and I am not hitting the issue -

[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img

[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img

Of course, I didn't go beyond 30G due to space constraints on my laptop.

If you could share your setup where you're hitting this bug, I'll take a look.

-Krutika

--- Additional comment from SATHEESARAN on 2019-05-02 05:21:01 UTC ---

(In reply to Krutika Dhananjay from comment #7)
> So I tried this locally and I am not hitting the issue -
> 
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
> Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img
> 
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
> Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img
> 
> Of course, I didn't go beyond 30G due to space constraints on my laptop.
> 
> If you could share your setup where you're hitting this bug, I'll take a
> look.
> 
> -Krutika

I can see this very consistently in two ways:

1. Create VM image >= 1TB
--------------------------
[root@rhsqa-grafton7 test]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc

[root@ ]# ls -lsah vm1.img 
10G -rw-r--r--. 1 root root 10G May  2 10:30 vm1.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm2.img 50G
Formatting 'vm2.img', fmt=raw size=53687091200 preallocation=falloc

[root@ ]# ls -lsah vm2.img 
50G -rw-r--r--. 1 root root 50G May  2 10:30 vm2.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm3.img 100G
Formatting 'vm3.img', fmt=raw size=107374182400 preallocation=falloc

[root@ ]# ls -lsah vm3.img 
100G -rw-r--r--. 1 root root 100G May  2 10:33 vm3.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm4.img 500G
Formatting 'vm4.img', fmt=raw size=536870912000 preallocation=falloc

[root@ ]# ls -lsah vm4.img 
500G -rw-r--r--. 1 root root 500G May  2 10:33 vm4.img

Once the size reaches 1 TB, you will see this issue:
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm6.img 1T
Formatting 'vm6.img', fmt=raw size=1099511627776 preallocation=falloc

[root@ ]# ls -lsah vm6.img 
8.0Z -rw-r--r--. 1 root root 1.0T May  2 10:35 vm6.img            <--------
size on disk is far larger than expected

2. Recreate the image with the same name
-----------------------------------------
Observe what happens when the image is created a second time with the same name:

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc

[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May  2 10:40 vm1.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 20G <--------
The same file name vm1.img is used
Formatting 'vm1.img', fmt=raw size=21474836480 preallocation=falloc

[root@ ]# ls -lsah vm1.img 
30G -rw-r--r--. 1 root root 20G May  2 10:40 vm1.img      <---------- size on
disk is 30G, though the file was created as 20G

I will provide a setup for the investigation.

--- Additional comment from SATHEESARAN on 2019-05-02 05:23:07 UTC ---

The setup details:
-------------------

rhsqa-grafton7.lab.eng.blr.redhat.com ( root/redhat )
volume: data ( replica 3, sharded )
The volume is currently mounted at: /mnt/test

Note: This is the RHVH installation.

@krutika, if you need more info, just ping me on IRC / Google Chat

--- Additional comment from Krutika Dhananjay on 2019-05-02 10:16:40 UTC ---

Found part of the issue.

It's just a case of integer overflow.
A 32-bit signed int is being used to store the delta between the post-stat and
pre-stat block counts.
The range of a 32-bit signed int is [-2,147,483,648, 2,147,483,647], whereas the
number of blocks allocated while creating a preallocated 1 TB file is
(1 TB / 512) = 2,147,483,648, which is just 1 more than INT_MAX (2,147,483,647),
so the value spills over into the negative half of the range and becomes
-2,147,483,648.
When this number is copied to an int64, sign extension fills the most-significant
32 bits with 1s, making the block count equal 554050781183 (or 0xffffffff80000000)
in magnitude.
That's the block count that gets set on the backend in the
trusted.glusterfs.shard.file-size xattr's block-count segment -

[root@rhsqa-grafton7 data]# getfattr -d -m . -e hex /gluster_bricks/data/data/vm3.img
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/data/data/vm3.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x3faffa7142b74e739f3a82b9359d33e6
trusted.gfid2path.6356251b968111ad=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f766d332e696d67 
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000100000000000000000000000000ffffffff800000000000000000000000
 <-- notice the "ffffffff80000000" in the block-count segment
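
For reference, the 32-byte xattr value above can be split into four big-endian
64-bit segments. The decoding below is only a sketch based on the segments
called out in this comment (the first segment matches the 1 TB file size, the
third is the block-count segment); it is not taken from the shard translator's
own struct definitions.

    /* Sketch only: split the trusted.glusterfs.shard.file-size value shown
     * above into four big-endian 64-bit segments. The field meanings assumed
     * here (first segment = file size in bytes, third = block count) follow
     * this comment, not the shard xlator's source. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Raw 32-byte xattr value copied from the getfattr output above. */
        const uint8_t v[32] = {
            0x00,0x00,0x01,0x00,0x00,0x00,0x00,0x00,  /* 0x0000010000000000 = 1 TB */
            0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
            0xff,0xff,0xff,0xff,0x80,0x00,0x00,0x00,  /* overflowed block count    */
            0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
        };

        for (int i = 0; i < 4; i++) {
            uint64_t seg = 0;
            for (int j = 0; j < 8; j++)               /* big-endian decode */
                seg = (seg << 8) | v[i * 8 + j];
            printf("segment %d: 0x%016llx (%llu)\n", i,
                   (unsigned long long)seg, (unsigned long long)seg);
        }
        return 0;
    }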

But ..

[root@rhsqa-grafton7 test]# stat vm3.img
  File: ‘vm3.img’
  Size: 1099511627776   Blocks: 18446744071562067968 IO Block: 131072 regular
file
Device: 29h/41d Inode: 11473626732659815398  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-05-02 14:11:11.693559069 +0530
Modify: 2019-05-02 14:12:38.245068328 +0530
Change: 2019-05-02 14:15:56.190546751 +0530
 Birth: -

stat shows the block count as 18446744071562067968, which is far bigger than
(554050781183 * 512).

In the response path, it turns out the block count further gets assigned to a
uint64 variable.
The same bit pattern, interpreted as uint64, becomes 18446744071562067968, and
18446744071562067968 * 512 is a whopping 8.0 zettabytes!
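
To make that arithmetic concrete, here is a minimal standalone sketch (not the
shard xlator code itself; the variable names are illustrative) showing how a
1 TB preallocation wraps the 32-bit signed delta, gets sign-extended to 64 bits,
and is finally reinterpreted as an unsigned block count:

    /* Illustration only -- not GlusterFS source. Reproduces the numbers above. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t prealloc_bytes   = 1ULL << 40;            /* 1 TB                 */
        uint64_t blocks_allocated = prealloc_bytes / 512;   /* 2,147,483,648 = 2^31 */

        /* The delta does not fit in a 32-bit signed int; on two's-complement
         * systems it wraps to INT_MIN (-2,147,483,648). */
        int32_t delta_blocks = (int32_t)blocks_allocated;

        /* Copying to a signed 64-bit value sign-extends it, so the upper
         * 32 bits become all ones: 0xffffffff80000000. */
        int64_t stored = delta_blocks;

        /* The response path treats the same bits as unsigned, yielding
         * 18446744071562067968 blocks of 512 bytes, i.e. about 8 ZiB. */
        uint64_t reported_blocks = (uint64_t)stored;
        double   reported_zib    = (double)reported_blocks * 512 /
                                   (1024.0 * 1024 * 1024 * 1024 * 1024 * 1024 * 1024);

        printf("delta_blocks    = %d\n", delta_blocks);
        printf("stored (hex)    = 0x%016llx\n", (unsigned long long)stored);
        printf("reported blocks = %llu\n", (unsigned long long)reported_blocks);
        printf("reported size   = %.1f ZiB\n", reported_zib);
        return 0;
    }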

This bug wasn't seen earlier because the old way of preallocating files never
used fallocate, so the original signed 32-bit int variable delta_blocks would
never exceed 131072.

Anyway, I'll soon be sending a fix for this.
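
For illustration only (this is an assumption about the general shape of such a
fix, not the actual patch): keeping the delta in a 64-bit signed type end to
end avoids the wrap-around, e.g.

    /* Hypothetical helper, not GlusterFS source: computing the block-count
     * delta in 64-bit signed arithmetic keeps 2^31 blocks (a 1 TB falloc)
     * well within range. */
    #include <stdint.h>

    static inline int64_t
    shard_delta_blocks(uint64_t post_blocks, uint64_t pre_blocks)
    {
        return (int64_t)post_blocks - (int64_t)pre_blocks;
    }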

Sas,

Do you have a single node with at least 1 TB of free space that you can lend me
where I can test the fix? The bug will only be hit when the image size is >=
1 TB.

-Krutika

--- Additional comment from Krutika Dhananjay on 2019-05-02 10:18:26 UTC ---

(In reply to Krutika Dhananjay from comment #10)
> Found part of the issue.

Sorry, this is not part of the issue but THE issue in its entirety. (That line is
from an older draft I'd composed which I forgot to change after root-causing the
bug.)



Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1667998
[Bug 1667998] Image size as reported from the fuse mount is incorrect
https://bugzilla.redhat.com/show_bug.cgi?id=1668001
[Bug 1668001] Image size as reported from the fuse mount is incorrect
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.

