[Gluster-users] reading from local replica?

Brian Ericson bericson at ptc.com
Tue Jun 9 15:40:51 UTC 2015


On 06/09/2015 09:21 AM, Ted Miller wrote:
> On 6/8/2015 5:55 PM, Brian Ericson wrote:
>> Am I misunderstanding
>> cluster.read-subvolume/cluster.read-subvolume-index?
>>
>> I have two regions, "A" and "B", with server "a" in region A and
>> server "b" in region B.  I have clients in both regions.
>> Intra-region communication is fast, but the pipe between the regions
>> is terrible.  I'd like to minimize inter-region communication to as
>> close to glusterfs write operations only and have reads go to the
>> server in the region the client is running in.
>>
>> I have created a replica volume as:
>> gluster volume create gv0 replica 2 a:/data/brick1/gv0
>> b:/data/brick1/gv0 force
>>
>> As a baseline, if I use scp to copy from the brick directly, I get --
>> for a 100M file -- times of about 6s if the client scps from the
>> server in the same region and anywhere from 3 to 5 minutes if the
>> client scps from the server in the other region.
>>
>> I was under the impression (from something I read but can't now find)
>> that glusterfs automatically picks the fastest replica, but that has
>> not been my experience; glusterfs seems to generally prefer the server
>> in the other region over the "local" one, with times usually in excess
>> of 4 minutes.
>>
>> I've also tried having clients mount the volume using the "xlator"
>> options cluster.read-subvolume and cluster.read-subvolume-index, but
>> neither seems to have any impact.  Here are sample mount commands to
>> show what I'm attempting:
>>
>> mount -t glusterfs -o
>> xlator-option=cluster.read-subvolume=gv0-client-<0 or 1> a:/gv0
>> /mnt/glusterfs
>> mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or
>> 1> a:/gv0 /mnt/glusterfs
>>
>> Am I misunderstanding how glusterfs works, particularly when trying to
>> "read locally"?  Is it possible to configure glusterfs to use a local
>> replica (or the "fastest replica") for reads?
> I am not a developer, nor intimately familiar with the insides of
> glusterfs, but here is how I understand that glusterfs-fuse file reads
> work.
> First, all replica bricks are read, to make sure they are consistent.
> (If not, gluster tries to make them consistent before proceeding).
> After consistency is established, then the actual read occurs from the
> brick with the shortest response time.  I don't know when or how the
> response time is measured, but it seems to work for most people most of
> the time.  (If the client is on one of the brick hosts, it will almost
> always read from the local brick.)
>
> If the file reads involve a lot of small files, the consistency check
> may be what is killing your response times, rather than the read of the
> file itself.  Over a fast LAN, the consistency checks can take many
> times the actual read time of the file.
>
> Hopefully others will chime in with more information, but if you can
> supply more information about what you are reading, that will help too.
> Are you reading entire files, or just reading in a lot of "snippets" or
> what?
>
> Ted Miller
> Elkhart, IN, USA
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
Thanks for the response!  Your understanding matches mine after reading 
documentation and various posts -- this should just work, right?

My test consists of reading a 100M file which has been replicated to 
both regions by glusterfs.  The specific command looks similar to:
time /bin/cp -f /mnt/glusterfs/one_hundred_mb_file /tmp

To avoid local reads, I'm invoking the "cp" on separate hosts in each 
region.  I umount & mount /mnt/glusterfs prior to running the timed copy 
to avoid measuring a read from the (client-)local cache.  The direct-scp 
timings show that same-region reads take under 10s and between-region 
reads take minutes.
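For reference, each timed run looks roughly like this (a sketch; the
volume, server, and file names match the setup above, and the remount
is what forces a cold read):

```shell
# Rough per-run benchmark sketch (assumes the gv0 volume and mount above).
# Remount first so the timing measures a cold read, not the client cache.
umount /mnt/glusterfs
mount -t glusterfs a:/gv0 /mnt/glusterfs

# Time a cold copy of the 100M test file off the volume.
time /bin/cp -f /mnt/glusterfs/one_hundred_mb_file /tmp
```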

Almost universally, the first timed "cp" of a 100M file takes minutes. 
This is true for clients in both regions and regardless of how I mount 
the volume (with/without read-subvolume/read-subvolume-index). 
Occasionally, however (maybe once in every 20 first reads), glusterfs 
will surprise me with fast times (reads of ~5-20s), which align with what 
I'd expect if it were going to a same-region glusterfs replica. I have 
never, however, seen this repeated:  if a 100M file copies in under 20s 
and I immediately follow it up with a copy of another 100M file, the 
second file will always take many minutes.

It appears that cluster.read-subvolume and cluster.read-subvolume-index 
have no impact when passed as part of the client's mount command.  I 
note that if I set this at the volume level (gluster volume set gv0 
cluster.read-subvolume gv0-client-0), the impact is immediate: those 
lucky clients on the "right side" of the divide get fast times, while 
those on the "other side" get poor times.  Again, however, I see no 
impact trying to override this as part of the mount command on the client.
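One way I can think of to check whether the mount-time option ever
reaches the client graph is to grep the fuse client volfile glusterd
generates on a server (the exact path and filename vary by gluster
version, so treat this as a guess at the layout):

```shell
# On one of the servers: look for the option in the generated fuse
# client volfile (path/name is version-dependent -- an assumption).
grep -B2 -A2 read-subvolume /var/lib/glusterd/vols/gv0/*fuse*.vol
```

If the volume-level "gluster volume set" shows up there but the
mount-time override never does, that would explain the no-op behavior.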

So, maybe passing these options as part of the mount command doesn't 
work/is a no-op, but what I don't understand is why -- given that there 
is no measure by which glusterfs should ever conclude that the replica 
in the "other" region is faster than the replica in the "same" region. 
In fact, it appears as though glusterfs is *preferring* the slower replica.
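One thing I still want to rule out: whether xlator-option needs the
option scoped to the replicate translator instance rather than the bare
cluster.* name. Something like the following (the translator name
gv0-replicate-0 is my guess from the usual volfile naming convention,
not something I've confirmed):

```shell
# Hypothetical: scope the option to the replicate translator instance.
# "gv0-replicate-0" is the conventional AFR translator name for volume
# gv0 -- unverified, check the generated volfile for the real name.
mount -t glusterfs \
  -o xlator-option=gv0-replicate-0.read-subvolume=gv0-client-0 \
  a:/gv0 /mnt/glusterfs
```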

