[Gluster-devel] Fw: Re[2]: missing files

David F. Robinson david.robinson at corvidtec.com
Wed Feb 11 14:19:18 UTC 2015


I will forward the emails to Shyam and the devel list.


David  (Sent from mobile)

===============================
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310      [cell]
704.799.7974      [fax]
David.Robinson at corvidtec.com
http://www.corvidtechnologies.com

> On Feb 11, 2015, at 8:21 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
> 
> 
>> On 02/11/2015 06:49 PM, Pranith Kumar Karampuri wrote:
>> 
>>> On 02/11/2015 08:36 AM, Shyam wrote:
>>> Did some analysis with David today on this; here is a gist for the list:
>>> 
>>> 1) Volumes classified as slow (i.e. with a lot of pre-existing data) and fast (new volumes carved from the same backend file system that the slow bricks are on, with little or no data)
>>> 
>>> 2) We ran an strace of tar and also collected io-stats outputs from these volumes; both show that create and mkdir are slower on the slow volume as compared to the fast one. This seems to be the overall reason for the slowness.
>> Did you happen to do an strace of the brick when this happened? If not, David, can we get that information as well?
> It would be nice to compare the difference in syscalls of the bricks of the two volumes to see if there are any extra syscalls that are adding to the delay.
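> 
> (A minimal way to grab that, assuming the brick PIDs reported by 'gluster
> volume status':
> 
>     gluster volume status homegfs_bkp        # note the brick PIDs
>     strace -f -tt -T -o /tmp/homegfs_bkp_brick.strace -p <brick-pid>
>     # ... run the tar extraction on the mount, then Ctrl-C the strace ...
>     strace -c -f -p <brick-pid>              # a second run gives a per-syscall time summary
> 
> Doing the same against a test2brick brick and diffing the two should show any
> extra or unusually slow calls.)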
> 
> Pranith
>> 
>> Pranith
>>> 
>>> 3) The tarball extraction is to a new directory on the gluster mount, so all lookups etc. happen within this new name space on the volume
>>> 
>>> 4) Checked memory footprints of the slow bricks and fast bricks etc. nothing untoward noticed there
>>> 
>>> 5) Restarted the slow volume, just as a test case to do things from scratch; no improvement in performance.
>>> 
>>> Currently attempting to reproduce this on a local system to see if the same behavior is seen so that it becomes easier to debug etc.
>>> 
>>> Others on the list can chime in as they see fit.
>>> 
>>> Thanks,
>>> Shyam
>>> 
>>>> On 02/10/2015 09:58 AM, David F. Robinson wrote:
>>>> Forwarding to devel list as recommended by Justin...
>>>> 
>>>> David
>>>> 
>>>> 
>>>> ------ Forwarded Message ------
>>>> From: "David F. Robinson" <david.robinson at corvidtec.com>
>>>> To: "Justin Clift" <justin at gluster.org>
>>>> Sent: 2/10/2015 9:49:09 AM
>>>> Subject: Re[2]: [Gluster-devel] missing files
>>>> 
>>>> Bad news... I don't think it is the old linkto files. Bad because if
>>>> that was the issue, cleaning up all of the bad linkto files would have
>>>> fixed the issue. It seems like the system just gets slower as you add data.
>>>> 
>>>> First, I setup a new clean volume (test2brick) on the same system as the
>>>> old one (homegfs_bkp). See 'gluster v info' below. I ran my simple tar
>>>> extraction test on the new volume and it took 58-seconds to complete
>>>> (which, BTW, is 10-seconds faster than my old non-gluster system, so
>>>> kudos). The time on homegfs_bkp is 19-minutes.
>>>> 
>>>> Next, I copied 10-terabytes of data over to test2brick and re-ran the
>>>> test which then took 7-minutes. I created a test3brick and ran the test
>>>> and it took 53-seconds.
>>>> 
>>>> To confirm all of this, I deleted all of the data from test2brick and
>>>> re-ran the test. It took 51-seconds!!!
>>>> 
>>>> BTW, I also checked .glusterfs for stale linkto files (find . -type
>>>> f -size 0 -perm 1000 -exec ls -al {} \;). There are many, many thousands
>>>> of these types of files on the old volume and none on the new one, so I
>>>> don't think this is related to the performance issue.
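>>>> 
>>>> (Side note: any one of those zero-byte mode-1000 files can be confirmed as a
>>>> DHT linkto file by reading its xattr, e.g.
>>>> 
>>>>     getfattr -n trusted.glusterfs.dht.linkto -e text /data/brick01bkp/homegfs_bkp/<path>
>>>> 
>>>> which should print the subvolume the real file lives on. The <path> here is
>>>> just a placeholder.)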
>>>> 
>>>> Let me know how I should proceed. Send this to devel list? Pranith?
>>>> others? Thanks...
>>>> 
>>>> [root at gfs01bkp .glusterfs]# gluster volume info homegfs_bkp
>>>> Volume Name: homegfs_bkp
>>>> Type: Distribute
>>>> Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294
>>>> Status: Started
>>>> Number of Bricks: 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp
>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp
>>>> 
>>>> [root at gfs01bkp .glusterfs]# gluster volume info test2brick
>>>> Volume Name: test2brick
>>>> Type: Distribute
>>>> Volume ID: 123259b2-3c61-4277-a7e8-27c7ec15e550
>>>> Status: Started
>>>> Number of Bricks: 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test2brick
>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test2brick
>>>> 
>>>> [root at gfs01bkp glusterfs]# gluster volume info test3brick
>>>> Volume Name: test3brick
>>>> Type: Distribute
>>>> Volume ID: 9b1613fc-f7e5-4325-8f94-e3611a5c3701
>>>> Status: Started
>>>> Number of Bricks: 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test3brick
>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test3brick
>>>> 
>>>> 
>>>> From homegfs_bkp:
>>>> # find . -type f -size 0 -perm 1000 -exec ls -al {} \;
>>>> ---------T 2 gmathur pme_ics 0 Jan 9 16:59
>>>> ./00/16/00169a69-1a7a-44c9-b2d8-991671ee87c4
>>>> ---------T 3 jcowan users 0 Jan 9 17:51
>>>> ./00/16/0016a0a0-fd22-4fb5-b6fb-5d7f9024ab74
>>>> ---------T 2 morourke sbir 0 Jan 9 18:17
>>>> ./00/16/0016b36f-32fc-4f2c-accd-e36be2f6c602
>>>> ---------T 2 carpentr irl 0 Jan 9 18:52
>>>> ./00/16/00163faf-741c-4e40-8081-784786b3cc71
>>>> ---------T 3 601 raven 0 Jan 9 22:49
>>>> ./00/16/00163385-a332-4050-8104-1b1af6cd8249
>>>> ---------T 3 bangell sbir 0 Jan 9 22:56
>>>> ./00/16/00167803-0244-46de-8246-d9c382dd3083
>>>> ---------T 2 morourke sbir 0 Jan 9 23:17
>>>> ./00/16/00167bc5-fc56-42ee-9e3f-1e238f3828f4
>>>> ---------T 3 morourke sbir 0 Jan 9 23:34
>>>> ./00/16/0016a71e-89cf-4a86-9575-49c7e9d216c6
>>>> ---------T 2 gmathur users 0 Jan 9 23:47
>>>> ./00/16/00168aa2-d069-4a77-8790-e36431324ca5
>>>> ---------T 2 bangell users 0 Jan 22 09:24
>>>> ./00/16/0016e720-a190-4e43-962f-aa3e4216e5f5
>>>> ---------T 2 root root 0 Jan 22 09:26
>>>> ./00/16/00169e95-64b7-455c-82dc-d9940ee7fe43
>>>> ---------T 2 dfrobins users 0 Jan 22 09:27
>>>> ./00/16/00161b04-1612-4fba-99a4-2a2b54062fdb
>>>> ---------T 2 mdick users 0 Jan 22 09:27
>>>> ./00/16/0016ba60-310a-4bee-968a-36eb290e8c9e
>>>> ---------T 2 dfrobins users 0 Jan 22 09:43
>>>> ./00/16/00160315-1533-4290-8c1a-72e2fbb1962a
>>>> From test2brick:
>>>> find . -type f -size 0 -perm 1000 -exec ls -al {} \;
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ------ Original Message ------
>>>> From: "Justin Clift" <justin at gluster.org>
>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>
>>>> Sent: 2/9/2015 11:33:54 PM
>>>> Subject: Re: [Gluster-devel] missing files
>>>> 
>>>>> Interesting. (I'm 1/2 asleep atm and really need sleep soon, so take this
>>>>> with a grain of salt... ;>)
>>>>> 
>>>>> As a curiosity question, does the homegfs_bkp volume have a bunch of
>>>>> outdated metadata still in it? e.g. left-over extended attributes or
>>>>> something?
>>>>> 
>>>>> I remember a question you asked earlier... today/yesterday about old
>>>>> extended attribute entries and whether they hang around forever. I don't
>>>>> know the answer to that, but if the old volume still has 1000's (or more)
>>>>> of entries around, perhaps there's some lookup problem that's killing
>>>>> lookup times for file operations.
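>>>>> 
>>>>> (If you want to eyeball that, dumping the xattrs on the brick roots of the
>>>>> slow vs. fast volumes should show any stale entries, e.g.
>>>>> 
>>>>>     getfattr -d -m . -e hex /data/brick01bkp/homegfs_bkp
>>>>>     getfattr -d -m . -e hex /data/brick01bkp/test2brick
>>>>> 
>>>>> I'm guessing here though, so treat it as a sanity check rather than a
>>>>> diagnosis.)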
>>>>> 
>>>>> On a side note, I can probably set up my test lab stuff here again
>>>>> tomorrow and try this stuff out myself to see if I can replicate the
>>>>> problem (if that could potentially be useful?).
>>>>> 
>>>>> + Justin
>>>>> 
>>>>> 
>>>>> 
>>>>> On 9 Feb 2015, at 22:56, David F. Robinson
>>>>> <david.robinson at corvidtec.com> wrote:
>>>>>> Justin,
>>>>>> 
>>>>>> Hoping you can help point this to the right people once again. Maybe
>>>>>> all of these issues are related.
>>>>>> 
>>>>>> You can look at the email traffic below, but the summary is that I
>>>>>> was working with Ben to figure out why my GFS system was 20x slower
>>>>>> than my old storage system. During my tracing of this issue, I
>>>>>> determined that if I create a new volume on my storage system, this
>>>>>> slowness goes away. So, either it is faster because it doesn't have
>>>>>> any data on this new volume (I hope this isn't the case), or the older
>>>>>> partitions somehow became corrupted during the upgrades, or they have some
>>>>>> deprecated parameters set that slow them down.
>>>>>> 
>>>>>> Very strange and hoping you can once again help... Thanks in advance...
>>>>>> 
>>>>>> David
>>>>>> 
>>>>>> 
>>>>>> ------ Forwarded Message ------
>>>>>> From: "David F. Robinson" <david.robinson at corvidtec.com>
>>>>>> To: "Benjamin Turner" <bennyturns at gmail.com>
>>>>>> Sent: 2/9/2015 5:52:00 PM
>>>>>> Subject: Re[5]: [Gluster-devel] missing files
>>>>>> 
>>>>>> Ben,
>>>>>> 
>>>>>> I cleared the logs and rebooted the machine. Same issue. homegfs_bkp
>>>>>> takes 19-minutes and test2brick (the new volume) takes 1-minute.
>>>>>> 
>>>>>> Is it possible that some old parameters are still set for
>>>>>> homegfs_bkp that are no longer in use? I tried a gluster volume reset
>>>>>> for homegfs_bkp, but it didn't have any effect.
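>>>>>> 
>>>>>> (For reference, the reset and a check of what is still set would look
>>>>>> something like this; 'force' is needed to clear the protected options,
>>>>>> if I'm reading the docs right:
>>>>>> 
>>>>>>     gluster volume reset homegfs_bkp force
>>>>>>     gluster volume info homegfs_bkp | grep -A 20 'Options Reconfigured'
>>>>>> )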
>>>>>> 
>>>>>> I have attached the full logs.
>>>>>> 
>>>>>> David
>>>>>> 
>>>>>> 
>>>>>> ------ Original Message ------
>>>>>> From: "David F. Robinson" <david.robinson at corvidtec.com>
>>>>>> To: "Benjamin Turner" <bennyturns at gmail.com>
>>>>>> Sent: 2/9/2015 5:39:18 PM
>>>>>> Subject: Re[4]: [Gluster-devel] missing files
>>>>>> 
>>>>>>> Ben,
>>>>>>> 
>>>>>>> I have traced this out to a point where I can rule out many issues.
>>>>>>> I was hoping you could help me from here.
>>>>>>> I went with the "tar -xPf boost.tar" as my test case, which on my
>>>>>>> old storage system took about 1-minute to extract. On my backup
>>>>>>> system and my primary storage (both gluster), it takes roughly
>>>>>>> 19-minutes.
>>>>>>> 
>>>>>>> First step was to create a new storage system (striped RAID, two
>>>>>>> sets of 3-drives). All was good here with a gluster extraction time
>>>>>>> of 1-minute. I then went to my backup system and created another
>>>>>>> partition using only one of the two bricks on that system. Still
>>>>>>> 1-minute. I went to a two brick setup and it stayed at 1-minute.
>>>>>>> 
>>>>>>> At this point, I have recreated using the same parameters on a
>>>>>>> test2brick volume that should be identical to my homegfs_bkp volume.
>>>>>>> Everything is the same including how I mounted the volume. The only
>>>>>>> difference is that homegfs_bkp has 30-TB of data and the
>>>>>>> test2brick is blank. I didn't think that performance would be
>>>>>>> affected by putting data on the volume.
>>>>>>> 
>>>>>>> Can you help? Do you have any suggestions? Do you think upgrading
>>>>>>> gluster from 3.5 to 3.6.1 to 3.6.2 somehow messed up homegfs_bkp?
>>>>>>> My layout is shown below. These should give identical speeds.
>>>>>>> 
>>>>>>> [root at gfs01bkp test2brick]# gluster volume info homegfs_bkp
>>>>>>> Volume Name: homegfs_bkp
>>>>>>> Type: Distribute
>>>>>>> Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 2
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp
>>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp
>>>>>>> [root at gfs01bkp test2brick]# gluster volume info test2brick
>>>>>>> 
>>>>>>> Volume Name: test2brick
>>>>>>> Type: Distribute
>>>>>>> Volume ID: 123259b2-3c61-4277-a7e8-27c7ec15e550
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 2
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test2brick
>>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test2brick
>>>>>>> 
>>>>>>> 
>>>>>>> [root at gfs01bkp brick02bkp]# mount | grep test2brick
>>>>>>> gfsib01bkp.corvidtec.com:/test2brick.tcp on /test2brick type
>>>>>>> fuse.glusterfs (rw,allow_other,max_read=131072)
>>>>>>> [root at gfs01bkp brick02bkp]# mount | grep homegfs_bkp
>>>>>>> gfsib01bkp.corvidtec.com:/homegfs_bkp.tcp on /backup/homegfs type
>>>>>>> fuse.glusterfs (rw,allow_other,max_read=131072)
>>>>>>> 
>>>>>>> [root at gfs01bkp brick02bkp]# df -h
>>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>>> /dev/mapper/vg00-lv_root 20G 1.7G 18G 9% /
>>>>>>> tmpfs 16G 0 16G 0% /dev/shm
>>>>>>> /dev/md126p1 1008M 110M 848M 12% /boot
>>>>>>> /dev/mapper/vg00-lv_opt 5.0G 220M 4.5G 5% /opt
>>>>>>> /dev/mapper/vg00-lv_tmp 5.0G 139M 4.6G 3% /tmp
>>>>>>> /dev/mapper/vg00-lv_usr 20G 2.7G 17G 15% /usr
>>>>>>> /dev/mapper/vg00-lv_var 40G 4.4G 34G 12% /var
>>>>>>> /dev/mapper/vg01-lvol1 88T 22T 67T 25% /data/brick01bkp
>>>>>>> /dev/mapper/vg02-lvol1 88T 22T 67T 25% /data/brick02bkp
>>>>>>> gfsib01bkp.corvidtec.com:/homegfs_bkp.tcp 175T 43T 133T 25%
>>>>>>> /backup/homegfs
>>>>>>> gfsib01bkp.corvidtec.com:/test2brick.tcp 175T 43T 133T 25% /test2brick
>>>>>>> 
>>>>>>> 
>>>>>>> ------ Original Message ------
>>>>>>> From: "Benjamin Turner" <bennyturns at gmail.com>
>>>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>
>>>>>>> Sent: 2/6/2015 12:52:58 PM
>>>>>>> Subject: Re: Re[2]: [Gluster-devel] missing files
>>>>>>> 
>>>>>>>> Hi David. Let's start with the basics and go from there. IIRC you
>>>>>>>> are using LVM with thick provisioning, so let's verify the following:
>>>>>>>> 
>>>>>>>> 1. You have everything properly aligned for your RAID stripe size,
>>>>>>>> etc. I have attached the script we package with RHS that I am in
>>>>>>>> the process of updating. I want to double-check that you created the PV
>>>>>>>> / VG / LV with the proper variables. Have a look at the create_pv,
>>>>>>>> create_vg, and create_lv(old) functions. You will need to know the
>>>>>>>> stripe size of your RAID and the number of stripe elements (data
>>>>>>>> disks, not hot spares). Also make sure you run mkfs.xfs with:
>>>>>>>> 
>>>>>>>> echo "mkfs -t xfs -f -K -i size=$inode_size -d
>>>>>>>> sw=$stripe_elements,su=$stripesize -n size=$fs_block_size
>>>>>>>> /dev/$vgname/$lvname"
>>>>>>>> 
>>>>>>>> We use 512-byte inodes because some workloads use more than the default
>>>>>>>> inode size and you don't want xattrs spilling outside the inode.
>>>>>>>> 
>>>>>>>> 2. Are you running RHEL or Centos? If so I would recommend
>>>>>>>> tuned_profile=rhs-high-throughput. If you don't have that tuned
>>>>>>>> profile I'll get you everything it sets.
>>>>>>>> 
>>>>>>>> 3. For small files we recommend the following:
>>>>>>>> 
>>>>>>>> # RAID related variables.
>>>>>>>> # stripesize - RAID controller stripe unit size
>>>>>>>> # stripe_elements - the number of data disks
>>>>>>>> # The --dataalignment option is used while creating the physical volume
>>>>>>>> # to align I/O at the LVM layer
>>>>>>>> # dataalign -
>>>>>>>> # RAID6 is recommended when the workload has predominantly larger
>>>>>>>> # files, i.e. not in kilobytes.
>>>>>>>> # For RAID6 with 12 disks and 128K stripe element size.
>>>>>>>> stripesize=128k
>>>>>>>> stripe_elements=10
>>>>>>>> dataalign=1280k
>>>>>>>> 
>>>>>>>> # RAID10 is recommended when the workload has predominantly smaller
>>>>>>>> # files, i.e. in kilobytes.
>>>>>>>> # For RAID10 with 12 disks and 256K stripe element size, uncomment
>>>>>>>> # the lines below.
>>>>>>>> # stripesize=256k
>>>>>>>> # stripe_elements=6
>>>>>>>> # dataalign=1536k
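>>>>>>>> 
>>>>>>>> As a concrete sketch with the RAID6 numbers above (device and VG/LV names
>>>>>>>> are placeholders, not your actual layout):
>>>>>>>> 
>>>>>>>>     pvcreate --dataalignment 1280k /dev/sdb
>>>>>>>>     vgcreate vg_bricks /dev/sdb
>>>>>>>>     lvcreate -l 100%FREE -n lv_brick01 vg_bricks
>>>>>>>>     mkfs -t xfs -f -K -i size=512 -d su=128k,sw=10 -n size=8192 \
>>>>>>>>         /dev/vg_bricks/lv_brick01
>>>>>>>> 
>>>>>>>> (-n size=8192 is the directory block size our script defaults to, if memory
>>>>>>>> serves; substitute whatever $fs_block_size is set to in your copy.)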
>>>>>>>> 
>>>>>>>> 4. Jumbo frames everywhere! Check out the effect of jumbo frames,
>>>>>>>> make sure they are set up properly on your switch, and add
>>>>>>>> MTU=9000 to your ifcfg files (unless you have it already):
>>>>>>>> 
>>>>>>>> 
>>>>>>>> https://rhsummit.files.wordpress.com/2013/07/england_th_0450_rhs_perf_practices-4_neependra.pdf 
>>>>>>>> (see the jumbo frames section here, the whole thing is a good read)
>>>>>>>> 
>>>>>>>> https://rhsummit.files.wordpress.com/2014/04/bengland_h_1100_rhs_performance.pdf 
>>>>>>>> (this is updated for 2014)
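>>>>>>>> 
>>>>>>>> A quick way to apply and verify the MTU, assuming ifcfg-style network
>>>>>>>> scripts on both ends (<iface> and <peer> are placeholders for your
>>>>>>>> interface name and a peer host):
>>>>>>>> 
>>>>>>>>     echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-<iface>
>>>>>>>>     ifdown <iface> && ifup <iface>
>>>>>>>>     ip link show <iface> | grep mtu
>>>>>>>>     ping -M do -s 8972 <peer>   # 8972 bytes payload + 28 bytes headers = 9000,
>>>>>>>>                                 # so this only succeeds if the whole path
>>>>>>>>                                 # passes jumbo frames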
>>>>>>>> 
>>>>>>>> 5. There is a smallfile enhancement that just landed in master
>>>>>>>> that is showing me a 60% improvement in writes. This is called
>>>>>>>> multi threaded epoll and it is looking VERY promising WRT smallfile
>>>>>>>> performance. Here is a summary:
>>>>>>>> 
>>>>>>>> Hi all. I see a lot of discussion on $subject and I wanted to take
>>>>>>>> a minute to talk about it and what we can do to test / observe the
>>>>>>>> effects of it. Let's start with a bit of background:
>>>>>>>> 
>>>>>>>> **Background**
>>>>>>>> 
>>>>>>>> -Currently epoll is single threaded on both clients and servers.
>>>>>>>>   *This leads to a "hot thread" which consumes 100% of a CPU core.
>>>>>>>>   *This can be observed by running BenE's smallfile benchmark to
>>>>>>>> create files, running top (on both clients and servers), and
>>>>>>>> pressing H to show threads.
>>>>>>>>   *You will be able to see a single glusterfs thread eating 100%
>>>>>>>> of the CPU:
>>>>>>>> 
>>>>>>>>  2871 root 20 0 746m 24m 3004 S 100.0 0.1 14:35.89 glusterfsd
>>>>>>>>  4522 root 20 0 747m 24m 3004 S 5.3 0.1 0:02.25 glusterfsd
>>>>>>>>  4507 root 20 0 747m 24m 3004 S 5.0 0.1 0:05.91 glusterfsd
>>>>>>>> 21200 root 20 0 747m 24m 3004 S 4.6 0.1 0:21.16 glusterfsd
>>>>>>>> 
>>>>>>>> -Single threaded epoll is a bottleneck for high IOP / low metadata
>>>>>>>> workloads (think smallfile). With single threaded epoll we are CPU
>>>>>>>> bound by the single thread pegging out a CPU.
>>>>>>>> 
>>>>>>>> So the proposed solution to this problem is to make epoll multi
>>>>>>>> threaded on both servers and clients. Here is a link to the
>>>>>>>> upstream proposal:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf#multi-thread-epoll 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Status: [ http://review.gluster.org/#/c/3842/ based on Anand
>>>>>>>> Avati's patch ]
>>>>>>>> 
>>>>>>>> Why: remove single-thread-per-brick barrier to higher CPU
>>>>>>>> utilization by servers
>>>>>>>> 
>>>>>>>> Use case: multi-client and multi-thread applications
>>>>>>>> 
>>>>>>>> Improvement: measured 40% with 2 epoll threads and 100% with 4
>>>>>>>> epoll threads for small file creates to an SSD
>>>>>>>> 
>>>>>>>> Disadvantage: conflicts with support for SSL sockets, may require
>>>>>>>> significant code change to support both.
>>>>>>>> 
>>>>>>>> Note: this enhancement also helps high-IOPS applications such as
>>>>>>>> databases and virtualization which are not metadata-intensive. This
>>>>>>>> has been measured already using a Fusion I/O SSD performing random
>>>>>>>> reads and writes -- it was necessary to define multiple bricks per
>>>>>>>> SSD device to get Gluster to the same order of magnitude IOPS as a
>>>>>>>> local filesystem. But this workaround is problematic for users,
>>>>>>>> because storage space is not properly measured when there are
>>>>>>>> multiple bricks on the same filesystem.
>>>>>>>> 
>>>>>>>> Multi threaded epoll is part of a larger page that talks about
>>>>>>>> smallfile performance enhancements, proposed and happening.
>>>>>>>> 
>>>>>>>> Goal: if successful, throughput bottleneck should be either the
>>>>>>>> network or the brick filesystem!
>>>>>>>> What it doesn't do: multi-thread-epoll does not solve the
>>>>>>>> excessive-round-trip protocol problems that Gluster has.
>>>>>>>> What it should do: allow Gluster to exploit the mostly untapped
>>>>>>>> CPU resources on the Gluster servers and clients.
>>>>>>>> How it does it: allow multiple threads to read protocol messages
>>>>>>>> and process them at the same time.
>>>>>>>> How to observe: multi-thread-epoll should be configurable (how to
>>>>>>>> configure? gluster command?), with thread count 1 it should be the same
>>>>>>>> as RHS 3.0, with thread count 2-4 it should show significantly more
>>>>>>>> CPU utilization (threads visible with "top -H"), resulting in
>>>>>>>> higher throughput.
>>>>>>>> 
>>>>>>>> **How to observe**
>>>>>>>> 
>>>>>>>> Here are the commands needed to setup an environment to test in on
>>>>>>>> RHS 3.0.3:
>>>>>>>> rpm -e glusterfs-api glusterfs glusterfs-libs glusterfs-fuse
>>>>>>>> glusterfs-geo-replication glusterfs-rdma glusterfs-server
>>>>>>>> glusterfs-cli gluster-nagios-common samba-glusterfs vdsm-gluster
>>>>>>>> --nodeps
>>>>>>>> rhn_register
>>>>>>>> yum groupinstall "Development tools"
>>>>>>>> git clone https://github.com/gluster/glusterfs.git
>>>>>>>> git branch test
>>>>>>>> git checkout test
>>>>>>>> git fetch http://review.gluster.org/glusterfs
>>>>>>>> refs/changes/42/3842/17 && git cherry-pick FETCH_HEAD
>>>>>>>> git fetch http://review.gluster.org/glusterfs
>>>>>>>> refs/changes/88/9488/2 && git cherry-pick FETCH_HEAD
>>>>>>>> yum install openssl openssl-devel
>>>>>>>> wget
>>>>>>>> ftp://fr2.rpmfind.net/linux/epel/6/x86_64/cmockery2-1.3.8-2.el6.x86_64.rpm 
>>>>>>>> 
>>>>>>>> wget
>>>>>>>> ftp://fr2.rpmfind.net/linux/epel/6/x86_64/cmockery2-devel-1.3.8-2.el6.x86_64.rpm 
>>>>>>>> 
>>>>>>>> yum install cmockery2-1.3.8-2.el6.x86_64.rpm
>>>>>>>> cmockery2-devel-1.3.8-2.el6.x86_64.rpm libxml2-devel
>>>>>>>> ./autogen.sh
>>>>>>>> ./configure
>>>>>>>> make
>>>>>>>> make install
>>>>>>>> 
>>>>>>>> Verify you are using the upstream with:
>>>>>>>> 
>>>>>>>> # gluster --version
>>>>>>>> 
>>>>>>>> To enable multi-threaded epoll, run the following commands:
>>>>>>>> 
>>>>>>>> From the patch:
>>>>>>>>         { .key = "client.event-threads",
>>>>>>>>           .voltype = "protocol/client",
>>>>>>>>           .op_version = GD_OP_VERSION_3_7_0,
>>>>>>>>         },
>>>>>>>>         { .key = "server.event-threads",
>>>>>>>>           .voltype = "protocol/server",
>>>>>>>>           .op_version = GD_OP_VERSION_3_7_0,
>>>>>>>>         },
>>>>>>>> 
>>>>>>>> # gluster v set <volname> server.event-threads 4
>>>>>>>> # gluster v set <volname> client.event-threads 4
>>>>>>>> 
>>>>>>>> Also grab smallfile:
>>>>>>>> 
>>>>>>>> https://github.com/bengland2/smallfile
>>>>>>>> 
>>>>>>>> After git cloning smallfile, run:
>>>>>>>> 
>>>>>>>> python /small-files/smallfile/smallfile_cli.py --operation create
>>>>>>>> --threads 8 --file-size 64 --files 10000 --top /gluster-mount
>>>>>>>> --pause 1000 --host-set "client1 client2"
>>>>>>>> 
>>>>>>>> Again we will be looking at top + show threads (press H). With 4
>>>>>>>> threads on both clients and servers you should see something
>>>>>>>> similar to (this isn't exact, I copied and pasted):
>>>>>>>> 
>>>>>>>>  2871 root 20 0 746m 24m 3004 S 35.0 0.1 14:35.89 glusterfsd
>>>>>>>>  2872 root 20 0 746m 24m 3004 S 51.0 0.1 14:35.89 glusterfsd
>>>>>>>>  2873 root 20 0 746m 24m 3004 S 43.0 0.1 14:35.89 glusterfsd
>>>>>>>>  2874 root 20 0 746m 24m 3004 S 65.0 0.1 14:35.89 glusterfsd
>>>>>>>>  4522 root 20 0 747m 24m 3004 S 5.3 0.1 0:02.25 glusterfsd
>>>>>>>>  4507 root 20 0 747m 24m 3004 S 5.0 0.1 0:05.91 glusterfsd
>>>>>>>> 21200 root 20 0 747m 24m 3004 S 4.6 0.1 0:21.16 glusterfsd
>>>>>>>> 
>>>>>>>> If you have a test env I would be interested to see how multi-threaded
>>>>>>>> epoll performs, but I am 100% sure it's not ready for
>>>>>>>> production yet. RH will be supporting it with our 3.0.4 (the next
>>>>>>>> one) release unless we find show-stopping bugs. My testing looks
>>>>>>>> very promising though.
>>>>>>>> 
>>>>>>>> Smallfile performance enhancements are one of the key focuses for
>>>>>>>> our 3.1 release this summer; we are working very hard to improve
>>>>>>>> this, as it is the use case for the majority of people.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Feb 6, 2015 at 11:59 AM, David F. Robinson
>>>>>>>> <david.robinson at corvidtec.com> wrote:
>>>>>>>> Ben,
>>>>>>>> 
>>>>>>>> I was hoping you might be able to help with two performance
>>>>>>>> questions. I was doing some testing of my rsync where I am backing
>>>>>>>> up my primary gluster system (distributed + replicated) to my
>>>>>>>> backup gluster system (distributed). I tried three tests where I
>>>>>>>> rsynced from one of my primary systems (gfsib02b) to my backup
>>>>>>>> machine. The test directory contains roughly 5500 files, most of
>>>>>>>> which are small. The script I ran is shown below, which repeats the
>>>>>>>> tests 3x for each section to check variability in timing.
>>>>>>>> 
>>>>>>>> 1) Writing to the local disk is drastically faster than writing to
>>>>>>>> gluster. So, my writes to the backup gluster system are what is
>>>>>>>> slowing me down, which makes sense.
>>>>>>>> 2) When I write to the backup gluster system (/backup/homegfs),
>>>>>>>> the timing goes from 35-seconds to 1-minute-40-seconds. The question here
>>>>>>>> is whether you could recommend any settings for this volume that
>>>>>>>> would improve performance for small file writes? I have included
>>>>>>>> the output of 'gluster volume info" below.
>>>>>>>> 3) When I did the same tests on the Source_bkp volume, it was
>>>>>>>> almost 3x as slow as the homegfs_bkp volume. However, these are
>>>>>>>> just different volumes on the same storage system. The volume
>>>>>>>> parameters are identical (see below). The performance of these two
>>>>>>>> should be identical. Any idea why they wouldn't be? And any
>>>>>>>> suggestions for how to fix this? The only thing that I see
>>>>>>>> different between the two is the order of the "Options
>>>>>>>> reconfigured" section. I assume order of options doesn't matter.
>>>>>>>> 
>>>>>>>> Backup to local hard disk (no gluster writes)
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /temp1
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /temp2
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /temp3
>>>>>>>> 
>>>>>>>>         real 0m35.579s
>>>>>>>>         user 0m31.290s
>>>>>>>>         sys 0m12.282s
>>>>>>>> 
>>>>>>>>         real 0m38.035s
>>>>>>>>         user 0m31.622s
>>>>>>>>         sys 0m10.907s
>>>>>>>>         real 0m38.313s
>>>>>>>>         user 0m31.458s
>>>>>>>>         sys 0m10.891s
>>>>>>>> Backup to gluster backup system on volume homegfs_bkp
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /backup/homegfs/temp1
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /backup/homegfs/temp2
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /backup/homegfs/temp3
>>>>>>>> 
>>>>>>>>         real 1m42.026s
>>>>>>>>         user 0m32.604s
>>>>>>>>         sys 0m9.967s
>>>>>>>> 
>>>>>>>>         real 1m45.480s
>>>>>>>>         user 0m32.577s
>>>>>>>>         sys 0m11.994s
>>>>>>>> 
>>>>>>>>         real 1m40.436s
>>>>>>>>         user 0m32.521s
>>>>>>>>         sys 0m11.240s
>>>>>>>> 
>>>>>>>> Backup to gluster backup system on volume Source_bkp
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /backup/Source/temp1
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /backup/Source/temp2
>>>>>>>>  time /usr/local/bin/rsync -av --numeric-ids --delete
>>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
>>>>>>>> gfsib02b:/homegfs/test /backup/Source/temp3
>>>>>>>> 
>>>>>>>>         real 3m30.491s
>>>>>>>>         user 0m32.676s
>>>>>>>>         sys 0m10.776s
>>>>>>>> 
>>>>>>>>         real 3m26.076s
>>>>>>>>         user 0m32.588s
>>>>>>>>         sys 0m11.048s
>>>>>>>>         real 3m7.460s
>>>>>>>>         user 0m32.763s
>>>>>>>>         sys 0m11.687s
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Volume Name: Source_bkp
>>>>>>>> Type: Distribute
>>>>>>>> Volume ID: 1d4c210d-a731-4d39-a0c5-ea0546592c1d
>>>>>>>> Status: Started
>>>>>>>> Number of Bricks: 2
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/Source_bkp
>>>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/Source_bkp
>>>>>>>> Options Reconfigured:
>>>>>>>> performance.cache-size: 128MB
>>>>>>>> performance.io-thread-count: 32
>>>>>>>> server.allow-insecure: on
>>>>>>>> network.ping-timeout: 10
>>>>>>>> storage.owner-gid: 100
>>>>>>>> performance.write-behind-window-size: 128MB
>>>>>>>> server.manage-gids: on
>>>>>>>> changelog.rollover-time: 15
>>>>>>>> changelog.fsync-interval: 3
>>>>>>>> 
>>>>>>>> Volume Name: homegfs_bkp
>>>>>>>> Type: Distribute
>>>>>>>> Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294
>>>>>>>> Status: Started
>>>>>>>> Number of Bricks: 2
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp
>>>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp
>>>>>>>> Options Reconfigured:
>>>>>>>> storage.owner-gid: 100
>>>>>>>> performance.io-thread-count: 32
>>>>>>>> server.allow-insecure: on
>>>>>>>> network.ping-timeout: 10
>>>>>>>> performance.cache-size: 128MB
>>>>>>>> performance.write-behind-window-size: 128MB
>>>>>>>> server.manage-gids: on
>>>>>>>> changelog.rollover-time: 15
>>>>>>>> changelog.fsync-interval: 3
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ------ Original Message ------
>>>>>>>> From: "Benjamin Turner" <bennyturns at gmail.com>
>>>>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>
>>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>;
>>>>>>>> "gluster-users at gluster.org" <gluster-users at gluster.org>
>>>>>>>> Sent: 2/3/2015 7:12:34 PM
>>>>>>>> Subject: Re: [Gluster-devel] missing files
>>>>>>>> 
>>>>>>>>> It sounds to me like the files were only copied to one replica,
>>>>>>>>> weren't there for the initial ls, which triggered a
>>>>>>>>> self-heal, and were there for the last ls because they were
>>>>>>>>> healed. Is there any chance that one of the replicas was down
>>>>>>>>> during the rsync? It could be that you lost a brick during the copy or
>>>>>>>>> something like that. To confirm, I would look for disconnects in
>>>>>>>>> the brick logs as well as check glustershd.log to verify that the
>>>>>>>>> missing files were actually healed.
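>>>>>>>>> Something along these lines should surface both (default log paths; the
>>>>>>>>> grep patterns are approximate):
>>>>>>>>> 
>>>>>>>>>     grep -i disconnect /var/log/glusterfs/bricks/*.log
>>>>>>>>>     grep -i heal /var/log/glusterfs/glustershd.log
>>>>>>>>>     gluster volume heal <volname> info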
>>>>>>>>> 
>>>>>>>>> -b
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
>>>>>>>>> <david.robinson at corvidtec.com> wrote:
>>>>>>>>> I rsync'd 20-TB over to my gluster system and noticed that I had
>>>>>>>>> some directories missing even though the rsync completed normally.
>>>>>>>>> The rsync logs showed that the missing files were transferred.
>>>>>>>>> 
>>>>>>>>> I went to the bricks and did an 'ls -al
>>>>>>>>> /data/brick*/homegfs/dir/*' and the files were on the bricks. After I
>>>>>>>>> did this 'ls', the files then showed up on the FUSE mounts.
>>>>>>>>> 
>>>>>>>>> 1) Why are the files hidden on the fuse mount?
>>>>>>>>> 2) Why does the ls make them show up on the FUSE mount?
>>>>>>>>> 3) How can I prevent this from happening again?
>>>>>>>>> 
>>>>>>>>> Note, I also mounted the gluster volume using NFS and saw the
>>>>>>>>> same behavior. The files/directories were not shown until I did
>>>>>>>>> the "ls" on the bricks.
>>>>>>>>> 
>>>>>>>>> David
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ===============================
>>>>>>>>> David F. Robinson, Ph.D.
>>>>>>>>> President - Corvid Technologies
>>>>>>>>> 704.799.6944 x101 [office]
>>>>>>>>> 704.252.1310 [cell]
>>>>>>>>> 704.799.7974 [fax]
>>>>>>>>> David.Robinson at corvidtec.com
>>>>>>>>> http://www.corvidtechnologies.com/
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Gluster-devel mailing list
>>>>>>>>> Gluster-devel at gluster.org
>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>> <glusterfs.tgz>
>>>>> 
>>>>> -- 
>>>>> GlusterFS - http://www.gluster.org
>>>>> 
>>>>> An open source, distributed file system scaling to several
>>>>> petabytes, and handling thousands of clients.
>>>>> 
>>>>> My personal twitter: twitter.com/realjustinclift
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>> 
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
> 

