[Bugs] [Bug 1705884] New: Image size as reported from the fuse mount is incorrect
bugzilla at redhat.com
Fri May 3 06:44:21 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1705884
Bug ID: 1705884
Summary: Image size as reported from the fuse mount is
incorrect
Product: GlusterFS
Version: mainline
Hardware: x86_64
OS: Linux
Status: NEW
Component: sharding
Severity: high
Assignee: bugs at gluster.org
Reporter: kdhananj at redhat.com
QA Contact: bugs at gluster.org
CC: bugs at gluster.org, kdhananj at redhat.com, pasik at iki.fi,
rhs-bugs at redhat.com, sabose at redhat.com,
sankarshan at redhat.com, sasundar at redhat.com,
storage-qa-internal at redhat.com
Depends On: 1668001
Blocks: 1667998
Target Milestone: ---
Classification: Community
+++ This bug was initially created as a clone of Bug #1668001 +++
Description of problem:
-----------------------
The size of the VM image file as reported from the fuse mount is incorrect.
For a file of size 1 TB, the size of the file on disk is reported as 8 ZB.
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
upstream master
How reproducible:
------------------
Always
Steps to Reproduce:
-------------------
1. On the Gluster storage domain, create a preallocated disk image of size
1TB
2. Check the size of the file after its creation has succeeded
Actual results:
---------------
The size of the file on disk is reported as 8 ZB, though the file was created with a size of 1TB
Expected results:
-----------------
The reported size of the file should match the size requested by the user
Additional info:
----------------
The volume in question is replica 3 with sharding enabled
[root@rhsqa-grafton10 ~]# gluster volume info data
Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data
(arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable
--- Additional comment from SATHEESARAN on 2019-01-21 16:32:39 UTC ---
Size of the file as reported from the fuse mount:
[root@ ~]# ls -lsah
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
8.0Z -rw-rw----. 1 vdsm kvm 1.1T Jan 21 17:14
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
[root@ ~]# du -shc
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E
/rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E total
Note that the disk image is preallocated with 1072GB of space
--- Additional comment from SATHEESARAN on 2019-04-01 19:25:15 UTC ---
(In reply to SATHEESARAN from comment #5)
> (In reply to Krutika Dhananjay from comment #3)
> > Also, do you still have the setup in this state? If so, can I'd like to take
> > a look.
> >
> > -Krutika
>
> Hi Krutika,
>
> The setup is no longer available. Let me recreate the issue and provide you
> the setup
This issue is very easily reproducible. Create a preallocated image on the
replicate volume with sharding enabled.
Use 'qemu-img' to create the VM image.
See the following test:
[root@ ~]# qemu-img create -f raw -o preallocation=falloc /mnt/test/vm1.img 1T
Formatting '/mnt/test/vm1.img', fmt=raw size=1099511627776 preallocation='falloc'
[root@ ]# ls /mnt/test
vm1.img
[root@ ]# ls -lsah vm1.img
8.0Z -rw-r--r--. 1 root root 1.0T Apr 2 00:45 vm1.img
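For reference, the same preallocation can be reproduced without qemu-img. A
minimal C sketch (illustrative only; it uses the same /mnt/test fuse mount and
vm1.img path as above, with error handling trimmed):

/* repro.c - preallocate 1 TiB on the fuse mount and print st_blocks.
   Build: gcc -D_GNU_SOURCE -o repro repro.c */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/test/vm1.img";   /* fuse mount of the sharded volume */
    off_t len = (off_t)1 << 40;               /* 1 TiB */
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0 || fallocate(fd, 0, 0, len) < 0) {
        perror("fallocate");
        return 1;
    }
    struct stat st;
    fstat(fd, &st);
    /* 1 TiB / 512 = 2147483648 blocks, one more than INT_MAX */
    printf("st_size=%lld st_blocks=%lld\n",
           (long long)st.st_size, (long long)st.st_blocks);
    close(fd);
    return 0;
}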
--- Additional comment from Krutika Dhananjay on 2019-04-11 06:07:35 UTC ---
So I tried this locally and I am not hitting the issue -
[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img
[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img
Of course, I didn't go beyond 30G due to space constraints on my laptop.
If you could share your setup where you're hitting this bug, I'll take a look.
-Krutika
--- Additional comment from SATHEESARAN on 2019-05-02 05:21:01 UTC ---
(In reply to Krutika Dhananjay from comment #7)
> So I tried this locally and I am not hitting the issue -
>
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
> Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img
>
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
> Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img
>
> Of course, I didn't go beyond 30G due to space constraints on my laptop.
>
> If you could share your setup where you're hitting this bug, I'll take a
> look.
>
> -Krutika
I could see this very consistently in two ways
1. Create a VM image >= 1TB
--------------------------
[root@rhsqa-grafton7 test]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:30 vm1.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm2.img 50G
Formatting 'vm2.img', fmt=raw size=53687091200 preallocation=falloc
[root@ ]# ls -lsah vm2.img
50G -rw-r--r--. 1 root root 50G May 2 10:30 vm2.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm3.img 100G
Formatting 'vm3.img', fmt=raw size=107374182400 preallocation=falloc
[root@ ]# ls -lsah vm3.img
100G -rw-r--r--. 1 root root 100G May 2 10:33 vm3.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm4.img 500G
Formatting 'vm4.img', fmt=raw size=536870912000 preallocation=falloc
[root@ ]# ls -lsah vm4.img
500G -rw-r--r--. 1 root root 500G May 2 10:33 vm4.img
Once the size reaches 1TB, you will see this issue:
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm6.img 1T
Formatting 'vm6.img', fmt=raw size=1099511627776 preallocation=falloc
[root@ ]# ls -lsah vm6.img
8.0Z -rw-r--r--. 1 root root 1.0T May 2 10:35 vm6.img <-------- size on disk is far larger than expected
2. Recreate the image with the same name
-----------------------------------------
Observe what happens when, the second time, the image is created with the same name
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:40 vm1.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 20G <-------- the same file name vm1.img is used
Formatting 'vm1.img', fmt=raw size=21474836480 preallocation=falloc
[root@ ]# ls -lsah vm1.img
30G -rw-r--r--. 1 root root 20G May 2 10:40 vm1.img <---------- size on disk is 30G, though the file was created with 20G
I will provide a setup for the investigation
--- Additional comment from SATHEESARAN on 2019-05-02 05:23:07 UTC ---
The setup details:
-------------------
rhsqa-grafton7.lab.eng.blr.redhat.com ( root/redhat )
volume: data ( replica 3, sharded )
The volume is currently mounted at: /mnt/test
Note: This is the RHVH installation.
@krutika, if you need more info, just ping me in IRC / google chat
--- Additional comment from Krutika Dhananjay on 2019-05-02 10:16:40 UTC ---
Found part of the issue.
It's just a case of integer overflow.
A 32-bit signed int is being used to store the delta between the post-stat and
pre-stat block-counts.
The range of a 32-bit signed int is [-2,147,483,648, 2,147,483,647], whereas the
number of blocks allocated while creating a preallocated 1TB file is
(1TB/512) = 2,147,483,648, which is just 1 more than INT_MAX (2,147,483,647).
The delta therefore spills over into the negative half of the scale and becomes
-2,147,483,648.
When this number is copied into an int64, sign extension fills the
most-significant 32 bits with 1s, making the stored block-count equal
0xffffffff80000000.
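A minimal, self-contained sketch of that overflow (illustrative only, not the
actual shard xlator code; the variable names just mirror the description above):

/* overflow.c - shows the block-count delta of a 1 TiB fallocate wrapping
   a signed 32-bit int and then being sign-extended into a 64-bit field. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t post_blocks = (1ULL << 40) / 512;   /* 2147483648 blocks after fallocate */
    uint64_t pre_blocks  = 0;                    /* empty file before */

    /* in practice the out-of-range value wraps to INT32_MIN */
    int32_t delta_blocks = (int32_t)(post_blocks - pre_blocks);
    printf("delta_blocks (int32) = %" PRId32 "\n", delta_blocks);   /* -2147483648 */

    int64_t stored = delta_blocks;               /* sign-extended on the copy */
    printf("stored (int64) = %" PRId64 " = 0x%016" PRIx64 "\n",
           stored, (uint64_t)stored);            /* 0xffffffff80000000 */
    return 0;
}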
That's the block-count that gets set on the backend, in the block-count segment
of the trusted.glusterfs.shard.file-size xattr -
[root@rhsqa-grafton7 data]# getfattr -d -m . -e hex /gluster_bricks/data/data/vm3.img
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/data/data/vm3.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x3faffa7142b74e739f3a82b9359d33e6
trusted.gfid2path.6356251b968111ad=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f766d332e696d67
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000100000000000000000000000000ffffffff800000000000000000000000
<-- notice the "ffffffff80000000" in the block-count segment
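For anyone decoding that xattr value by hand, here is a small sketch. The layout
(file size in the first 8 bytes, block count at byte offset 16, both big-endian)
is inferred from the dump above rather than taken from any format documentation,
so treat the offsets as an assumption:

/* decode_size_xattr.c - decode the trusted.glusterfs.shard.file-size value
   dumped above; offsets are inferred from the hex dump. */
#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char v[32] = {
        0x00,0x00,0x01,0x00,0x00,0x00,0x00,0x00,  /* file size (big-endian)   */
        0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
        0xff,0xff,0xff,0xff,0x80,0x00,0x00,0x00,  /* block count (big-endian) */
        0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    };
    uint64_t size, blocks;
    memcpy(&size, v, 8);
    memcpy(&blocks, v + 16, 8);
    printf("size   = %llu\n", (unsigned long long)be64toh(size));   /* 1099511627776 */
    printf("blocks = %llu\n", (unsigned long long)be64toh(blocks)); /* 18446744071562067968 */
    return 0;
}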
But ..
[root@rhsqa-grafton7 test]# stat vm3.img
File: ‘vm3.img’
Size: 1099511627776 Blocks: 18446744071562067968 IO Block: 131072 regular file
Device: 29h/41d Inode: 11473626732659815398 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-05-02 14:11:11.693559069 +0530
Modify: 2019-05-02 14:12:38.245068328 +0530
Change: 2019-05-02 14:15:56.190546751 +0530
Birth: -
stat shows the block-count as 18446744071562067968, which is far larger than the
2,147,483,648 blocks that were actually allocated.
In the response path, it turns out the block-count further gets assigned to a
uint64 number.
The same number (0xffffffff80000000), when expressed as uint64, becomes 18446744071562067968.
18446744071562067968 * 512 is a whopping 8.0 Zettabytes!
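Spelling out that last step (illustrative only; unsigned __int128, a GCC/Clang
extension, is used just to avoid a second overflow when multiplying by 512):

/* zettabytes.c - the negative int64 block count reinterpreted as uint64,
   and the resulting byte count in binary units (ls -sh shows it as 8.0Z). */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t  stored = -2147483648LL;            /* 0xffffffff80000000 */
    uint64_t blocks = (uint64_t)stored;         /* 18446744071562067968 */
    unsigned __int128 bytes = (unsigned __int128)blocks * 512;
    printf("blocks = %" PRIu64 "\n", blocks);
    printf("bytes  ~= %.1f ZiB\n", (double)bytes / 1180591620717411303424.0); /* 2^70 */
    return 0;
}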
This bug wasn't seen earlier because the earlier way of preallocating files
never used fallocate, so the original signed 32-bit int variable delta_blocks
would never exceed 131072.
Anyway, I'll be sending a fix for this soon.
Sas,
Do you have a single node with at least 1TB of free space that you can lend me,
where I can test the fix? The bug is only hit when the image size is 1TB or
larger.
-Krutika
--- Additional comment from Krutika Dhananjay on 2019-05-02 10:18:26 UTC ---
(In reply to Krutika Dhananjay from comment #10)
> Found part of the issue.
Sorry, this is not part of the issue but THE issue in its entirety. (That line is
from an older draft I'd composed, which I forgot to change after root-causing the bug.)
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1667998
[Bug 1667998] Image size as reported from the fuse mount is incorrect
https://bugzilla.redhat.com/show_bug.cgi?id=1668001
[Bug 1668001] Image size as reported from the fuse mount is incorrect