[Gluster-users] GFID attr is missing after adding large amounts of data
Christoph Schäbel
christoph.schaebel at dc-square.de
Fri Sep 1 08:20:25 UTC 2017
My answers inline.
> Am 01.09.2017 um 04:19 schrieb Ben Turner <bturner at redhat.com>:
>
> I re-added gluster-users to get some more eyes on this.
>
> ----- Original Message -----
>> From: "Christoph Schäbel" <christoph.schaebel at dc-square.de>
>> To: "Ben Turner" <bturner at redhat.com>
>> Sent: Wednesday, August 30, 2017 8:18:31 AM
>> Subject: Re: [Gluster-users] GFID attr is missing after adding large amounts of data
>>
>> Hello Ben,
>>
>> thank you for offering your help.
>>
>> Here are outputs from all the gluster commands I could think of.
>> Note that we had to remove the terabytes of data to keep the system
>> operational, because it is a live system.
>>
>> # gluster volume status
>>
>> Status of volume: gv0
>> Gluster process TCP Port RDMA Port Online Pid
>> ------------------------------------------------------------------------------
>> Brick 10.191.206.15:/mnt/brick1/gv0 49154 0 Y 2675
>> Brick 10.191.198.15:/mnt/brick1/gv0 49154 0 Y 2679
>> Self-heal Daemon on localhost N/A N/A Y 12309
>> Self-heal Daemon on 10.191.206.15 N/A N/A Y 2670
>>
>> Task Status of Volume gv0
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>
> OK, so your bricks are all online; you have two nodes with 1 brick per node.
Yes
>
>>
>> # gluster volume info
>>
>> Volume Name: gv0
>> Type: Replicate
>> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.191.206.15:/mnt/brick1/gv0
>> Brick2: 10.191.198.15:/mnt/brick1/gv0
>> Options Reconfigured:
>> transport.address-family: inet
>> performance.readdir-ahead: on
>> nfs.disable: on
>
> You are using a replicate volume with 2 copies of your data; it looks like you are using the defaults, as I don't see any tuning.
The only thing we tuned is network.ping-timeout, which we set to 10 seconds (if that is not the default anyway).
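For reference, this is roughly how we applied and verified it (a sketch; volume name gv0 as above):

# set the ping timeout to 10 seconds on the volume
gluster volume set gv0 network.ping-timeout 10
# read back the value that is actually in effect
gluster volume get gv0 network.ping-timeout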
>
>>
>> # gluster peer status
>>
>> Number of Peers: 1
>>
>> Hostname: 10.191.206.15
>> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
>> State: Peer in Cluster (Connected)
>>
>>
>> # gluster --version
>>
>> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
>> Repository revision: git://git.gluster.com/glusterfs.git
>> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> You may redistribute copies of GlusterFS under the terms of the GNU General
>> Public License.
>
> You are running Gluster 3.8, which is the latest upstream release marked stable.
>
>>
>> # df -h
>>
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/mapper/vg00-root 75G 5.7G 69G 8% /
>> devtmpfs 1.9G 0 1.9G 0% /dev
>> tmpfs 1.9G 0 1.9G 0% /dev/shm
>> tmpfs 1.9G 17M 1.9G 1% /run
>> tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup
>> /dev/sda1 477M 151M 297M 34% /boot
>> /dev/mapper/vg10-brick1 8.0T 700M 8.0T 1% /mnt/brick1
>> localhost:/gv0 8.0T 768M 8.0T 1% /mnt/glusterfs_client
>> tmpfs 380M 0 380M 0% /run/user/0
>>
>
> Your brick is:
>
> /dev/mapper/vg10-brick1 8.0T 700M 8.0T 1% /mnt/brick1
>
> The block device is 8TB. Can you tell me more about your brick? Is it a single disk or a RAID? If it's a RAID, can you tell me about the disks? I am interested in:
>
> -Size of disks
> -RAID type
> -Stripe size
> -RAID controller
Not sure about the disks, because they come from a large storage system (not the cheap NAS kind, but the really expensive rack kind), which VMware then uses to present a single volume to my virtual machine. I am pretty sure the storage system does some kind of RAID underneath, but I am not sure whether that has any effect on the "virtual" disk presented to my VM. To the VM the disk does not look like a RAID, as far as I can tell.
# lvdisplay
--- Logical volume ---
LV Path /dev/vg10/brick1
LV Name brick1
VG Name vg10
LV UUID OEvHEG-m5zc-2MQ1-3gNd-o2gh-q405-YWG02j
LV Write Access read/write
LV Creation host, time localhost, 2017-01-26 09:44:08 +0000
LV Status available
# open 1
LV Size 8.00 TiB
Current LE 2096890
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 8192
Block device 253:1
--- Logical volume ---
LV Path /dev/vg00/root
LV Name root
VG Name vg00
LV UUID 3uyF7l-Xhfa-6frx-qjsP-Iy0u-JdbQ-Me03AS
LV Write Access read/write
LV Creation host, time localhost, 2016-12-15 14:24:08 +0000
LV Status available
# open 1
LV Size 74.49 GiB
Current LE 19069
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 8192
Block device 253:0
# ssm list
-----------------------------------------------------------
Device Free Used Total Pool Mount point
-----------------------------------------------------------
/dev/fd0 4.00 KB
/dev/sda 80.00 GB PARTITIONED
/dev/sda1 500.00 MB /boot
/dev/sda2 20.00 MB 74.49 GB 74.51 GB vg00
/dev/sda3 5.00 GB SWAP
/dev/sdb 1.02 GB 8.00 TB 8.00 TB vg10
-----------------------------------------------------------
-------------------------------------------------
Pool Type Devices Free Used Total
-------------------------------------------------
vg00 lvm 1 20.00 MB 74.49 GB 74.51 GB
vg10 lvm 1 1.02 GB 8.00 TB 8.00 TB
-------------------------------------------------
------------------------------------------------------------------------------------
Volume Pool Volume size FS FS size Free Type Mount point
------------------------------------------------------------------------------------
/dev/vg00/root vg00 74.49 GB xfs 74.45 GB 69.36 GB linear /
/dev/vg10/brick1 vg10 8.00 TB xfs 8.00 TB 8.00 TB linear /mnt/brick1
/dev/sda1 500.00 MB ext4 500.00 MB 300.92 MB part /boot
------------------------------------------------------------------------------------
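In case it is useful, this is how I would check what the disk reports from inside the VM; a striped RAID would usually show up as a non-zero optimal I/O size (just a sanity check, device name sdb as in the ssm output above):

# block devices as the VM sees them
lsblk -o NAME,SIZE,TYPE,ROTA /dev/sdb
# an optimal_io_size of 0 means the device advertises no stripe geometry to the guest
cat /sys/block/sdb/queue/optimal_io_size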
>
> I also see:
>
> localhost:/gv0 8.0T 768M 8.0T 1% /mnt/glusterfs_client
>
> So you are mounting your volume on the local node; is this the mount you are writing data to?
Yes, this is the mount I am writing to.
>
>>
>>
>> The setup of the servers is done via shell script on CentOS 7 containing the
>> following commands:
>>
>> yum install -y centos-release-gluster
>> yum install -y glusterfs-server
>>
>> mkdir /mnt/brick1
>> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1
>
> I haven't used system-storage-manager before, do you know if it takes care of properly tuning your storage stack (if you have a RAID, that is)? If you don't have a RAID it's probably not that big of a deal; if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.
I am not sure if ssm does any tuning by default, but since there does not seem to be a RAID (at least for the VM) I don’t think tuning is necessary.
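A quick way to confirm that would be to look at the stripe values mkfs.xfs recorded when ssm created the filesystem (a sanity check only):

# sunit=0 and swidth=0 in the data section mean no RAID geometry was detected
xfs_info /mnt/brick1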
>
>>
>> echo "/dev/mapper/vg10-brick1 /mnt/brick1 xfs defaults 1 2" >>
>> /etc/fstab
>> mount -a && mount
>> mkdir /mnt/brick1/gv0
>>
>> gluster peer probe OTHER_SERVER_IP
>>
>> gluster pool list
>> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0
>> OTHER_SERVER_IP:/mnt/brick1/gv0
>> gluster volume start gv0
>> gluster volume info gv0
>> gluster volume set gv0 network.ping-timeout "10"
>> gluster volume info gv0
>>
>> # mount as client for archiving cronjob, is already in fstab
>> mount -a
>>
>> # mount via fuse-client
>> mkdir -p /mnt/glusterfs_client
>> echo "localhost:/gv0 /mnt/glusterfs_client glusterfs defaults,_netdev 0 0" >>
>> /etc/fstab
>> mount -a
>>
>>
>> We untar multiple files (around 1300 tar files), each around 2.7 GB in size.
>> The tar files are not compressed.
>> We untar the files with a shell script containing the following:
>>
>> #! /bin/bash
>> for f in *.tar; do tar xfP "$f"; done
>
> Your script looks good. I am not that familiar with the tar flag "P", but it looks like it means:
>
> -P, --absolute-names
> Don't strip leading slashes from file names when creating archives.
>
> I don't see anything strange here, everything looks OK.
>
>>
>> The script is run as user root; the processes glusterd, glusterfs and
>> glusterfsd also run as user root.
>>
>> Each tar file consists of a single folder with multiple folders and files in
>> it.
>> The folder tree looks like this (note that the "=" is part of the folder
>> name):
>>
>> 1498780800/
>> - timeframe_hour=1498780800/ (about 25 of these folders)
>> -- type=1/ (about 25 folders total)
>> --- data-x.gz.parquet (between 1 KB and 100 MB in size)
>> --- data-x.gz.parquet.crc (around 1 KB in size)
>> -- …
>> - ...
>>
>> Unfortunately I cannot share the file contents with you.
>
> That's no problem, I'll try to recreate this in the lab.
>
>>
>> We have not seen any other issues with glusterfs when untarring just a few of
>> those files. I just tried writing 100 GB with dd and did not see any issues
>> there; the file is replicated and the GFID attribute is set correctly on
>> both nodes.
>
> ACK. I do this all the time; if you saw an issue here I would be worried about your setup.
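For completeness, this is how we checked the GFID xattr on the bricks after the dd test (run as root directly on the brick path; the file name below is just an example):

# every file on a brick should carry a trusted.gfid xattr
getfattr -n trusted.gfid -e hex /mnt/brick1/gv0/testfile-100G
# dump all trusted.* xattrs if the GFID is missing
getfattr -d -m trusted -e hex /mnt/brick1/gv0/testfile-100G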
>
>>
>> We are not able to reproduce this in our lab environment, which is a clone
>> (actual cloned VMs) of the other system, but it only has around 1 TB of
>> storage.
>> Do you think this could be an issue with the number of files generated by
>> tar (over 1.5 million files)?
>> What I can say is that it is not an issue with inodes; I checked that when
>> all the files were unpacked on the live system.
>
> Hmm, I am not sure. It's strange that you can't repro this on your other config. In the lab I have a ton of space to work with, so I can run a ton of data in my repro.
>
>>
>> If you need anything else, let me know.
>
> Can you help clarify your reproducer so I can give it a go in the lab? From what I can tell you have:
>
> 1498780800/ <-- Just a string of numbers, this is the root dir of your tarball
> - timeframe_hour=1498780800/ (about 25 of these folders) <-- This is the second level dir of your tarball, there are ~25 of these dirs that mention a timeframe and an hour
> -- type=1/ (about 25 folders total) <-- This is the 3rd level of your tar, there are about 25 different type=$X dirs
> --- data-x.gz.parquet (between 1 KB and 100 MB in size) <-- This is your actual data. Is there just 1 pair of these files per dir, or multiple?
> --- data-x.gz.parquet.crc (around 1 KB in size) <-- This is a checksum for the above file?
>
> I have almost everything I need for my reproducer, can you answer the above questions about the data?
Yes, this is all correct. There is just one pair of files in the last level, and the *.crc file is a checksum file.
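In case it helps with the reproducer, here is a rough sketch of how a comparable data set could be generated synthetically (timestamps and file sizes are made up; random data stands in for the real parquet contents):

#! /bin/bash
# Build one tarball mimicking the layout above:
# 25 timeframe_hour dirs x 25 type dirs, one data/.crc pair per dir.
# File sizes are chosen so one tarball lands near the ~2.7 GB mentioned earlier.
base=1498780800
mkdir -p "$base"
for h in $(seq 0 24); do
  for t in $(seq 1 25); do
    d="$base/timeframe_hour=$((base + h * 3600))/type=$t"
    mkdir -p "$d"
    head -c $((RANDOM % 8 + 1))M /dev/urandom > "$d/data-x.gz.parquet"
    head -c 1K /dev/urandom > "$d/data-x.gz.parquet.crc"
  done
done
tar cf "$base.tar" "$base"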
Thank you for your help,
Christoph