- http://www.ibm.com/developerworks/wikis/display/hpccentral/GPFS+Tuning+Parameters


GPFS Tuning Parameters

This section describes some of the configuration parameters available in GPFS. Included are some notes on how they may affect performance. 
These are GPFS configuration parameters that can be set cluster wide, on a specific node, or on sets of nodes.
To view the configuration parameters that have been changed from the defaults, run:

mmlsconfig

To view the active value of any of these parameters you can run

mmfsadm dump config

To change any of these parameters, use mmchconfig. For example, to change the pagepool setting on all nodes:

mmchconfig pagepool=256M

Some options take effect immediately when the -i or -I flag is passed to mmchconfig; others take effect only after GPFS is restarted on the node. Use -i to make the change permanent and affect the running GPFS daemon immediately. Use -I to affect the GPFS daemon only (the value reverts to the saved settings on restart). Refer to the current GPFS Documentation for details.

In addition some parameters have a section called Tuning Guidelines. These are general guidelines that can be used to determine a starting point for tuning a parameter. 

leaseRecoveryWait



The leaseRecoveryWait parameter defines how long the FS manager of a filesystem will wait after the last known lease expiration of any failed nodes before running recovery. A failed node cannot reconnect to the cluster before recovery is finished. The leaseRecoveryWait parameter value is in seconds and the default is 35.
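For example, to check the value currently in effect and change it cluster wide (the value 60 below is purely illustrative, not a recommendation; see the caution that follows before lowering it):

  # show the setting the daemon is currently using
  mmfsadm dump config | grep -i leaseRecoveryWait

  # example value only
  mmchconfig leaseRecoveryWait=60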

Making this value smaller increases the risk that there may be IO in flight from the failing node to the disk/controller when recovery starts running. This may result in out of order IOs between the FS manager and the dying node.

In most cases where a node is expelled from the cluster there is either a problem with the network or the node is running out of resources, for example paging heavily. If an application running on a node is paging the machine to death or overrunning network capacity, GPFS may not get a chance to contact the Cluster Manager node to renew its lease within the timeout period.

GPFSCmdPortRange



When GPFS administration commands are executed they may use one or more TCP/IP ports to complete the command. For example when using standard ssh an admin command opens a connection using port 22. In addition to the remote shell or file copy command ports there are additional ports that are opened to pass data to and from GPFS daemons. By default GPFS uses one of the ephemeral ports to complete these connections.

In some environments you may want to limit the range of ports used by GPFS administration commands. You can control the ports used by the remote shell and file copy commands by using different tools or by configuring these tools to use different ports. The ports used by the GPFS daemon for administrative command execution can be defined using the GPFS configuration parameter GPFSCmdPortRange.

mmchconfig GPFSCmdPortRange=lowport-highport

This allows you to limit the ports used for GPFS administration mm* command execution. You need enough ports to support all of the concurrent commands from a node so you should define 20 or more ports for this purpose. Example:

mmchconfig GPFSCmdPortRange=30000-30100

minMissedPingTimeout



The minMissedPingTimeout and maxMissedPingTimeout parameters set limits on the calculation of missedPingTimeout (MPT), which is the allowable time for pings to fail from the Cluster Manager (CM) to a node that has not renewed its lease. The default MPT is leaseRecoveryWait-5 seconds. The CM will wait MPT seconds after the lease has expired before declaring a node out of the cluster. The minMissedPingTimeout and maxMissedPingTimeout values are in seconds and the defaults are 3 and 60, respectively. If these values are changed, only GPFS on the quorum nodes (from which the CM is elected) needs to be recycled for the change to take effect.
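As an illustration, to tolerate network interruptions of up to roughly 60 seconds you might set minMissedPingTimeout and then recycle GPFS on the quorum nodes (the value and the node names below are examples only):

  mmchconfig minMissedPingTimeout=60
  # recycle GPFS only on the quorum nodes so the new limit takes effect
  mmshutdown -N quorum1,quorum2,quorum3
  mmstartup -N quorum1,quorum2,quorum3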

This can be used to ride out events such as a central network switch failover (or other network glitches) that may take longer than leaseRecoveryWait. It may prevent false node-down conditions, but it will extend the time for node recovery to finish, which may block other nodes from making progress if the failing node held tokens for many shared files.

Just as in the case of leaseRecoveryWait, in most cases where a node is expelled from the cluster there is either a problem with the network or the node is running out of resources, for example paging heavily. If an application running on a node is paging the machine to death or overrunning network capacity, GPFS may not get a chance to contact the Cluster Manager node to renew its lease within the timeout period.

maxMissedPingTimeout



See minMissedPingTimeout.

maxReceiverThreads



The maxReceiverThreads parameter is the number of threads used to handle incoming TCP packets. These threads gather the packets until there are enough bytes for the incoming RPC (or RPC reply) to be handled. For some simple RPCs the receiver thread handles the message immediately; otherwise it hands the message off to handler threads.

maxReceiverThreads defaults to the number of CPUs in the node up to 16. It can be configured higher if necessary up to 128 for very large clusters.
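For example, on a very large cluster you might raise it explicitly (128 is the documented upper bound, not a recommended value):

  mmchconfig maxReceiverThreads=128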

pagepool



The pagepool parameter determines the size of the GPFS file data block cache. Unlike local file systems that use the operating system page cache to cache file data, GPFS allocates its own cache called the pagepool. The GPFS pagepool is used to cache user file data and file system metadata. The default pagepool size of 64MB is too small for many applications, so this is a good place to start looking for performance improvement. In release 3.5 the default is 1GB for new installs; when upgrading, the old setting is kept.

Along with file data the pagepool supplies memory for various types of buffers like prefetch and write behind.

For Sequential IO

The default pagepool size may be sufficient for sequential IO workloads; however, a value of 256MB is known to work well in many cases. To change the pagepool size, use the mmchconfig command. For example, to change the pagepool size to 256MB on all nodes in the cluster, execute the mmchconfig command:

    mmchconfig pagepool=256M [-i]

If the file system blocksize is larger than the default (256K), the pagepool size should be scaled accordingly. For example, if 1M blocksize is used, the default 64M pagepool should be increased by 4 times to 256M. This allows the same number of buffers to be cached.

Random IO

The default pagepool size will likely not be sufficient for Random IO or workloads involving a large number of small files. In some cases allocating 4GB, 8GB or more memory can improve workload performance.
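For example, to give a larger pagepool only to the nodes running such a workload (the 4G value and the dbnodes node class name are illustrative):

  mmchconfig pagepool=4G -N dbnodes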

Random Direct IO

For database applications that use Direct IO, the pagepool is not used for any user data. Its main purpose in this case is for system metadata and caching the indirect blocks of the database files.

NSD servers

Assuming no applications or Filesystem Manager services are running on the NSD servers, the pagepool is only used transiently by the NSD worker threads to gather data from client nodes and write the data to disk. The NSD server does not cache any of the data. Each NSD worker needs just one pagepool buffer per operation, and that buffer can potentially be as large as the largest blocksize of any file system the disks belong to. With the default NSD configuration there will be 3 NSD worker threads per LUN (nsdThreadsPerDisk) that the node services, so the amount of memory needed in the pagepool will be 3*#LUNs*maxBlockSize. The target amount of pagepool space for NSD workers is controlled by nsdBufSpace, which defaults to 30%, so the pagepool should be large enough that 30% of it provides enough buffers.
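As a worked example using those defaults: an NSD server serving 24 LUNs with a maximum file system blocksize of 4MiB needs roughly 3 * 24 * 4MiB = 288MiB of buffer space for NSD workers; since nsdBufSpace defaults to 30% of the pagepool, that suggests a pagepool of at least 288MiB / 0.30, or about 1GiB (the LUN count, blocksize and node class name below are illustrative):

  mmchconfig pagepool=1G -N nsdservers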

32 Bit operating systems

On 32-bit operating systems the pagepool is limited by the GPFS daemon's address space. This means that it cannot exceed 4GB in size and is often much smaller due to other limitations.



opensslLibName



To initialize multi-cluster communications GPFS uses OpenSSL. When initializing OpenSSL, GPFS looks for these SSL libraries: libssl.so:libssl.so.0:libssl.so.4 (as of GPFS 3.4.0.4). If you are using a newer version of OpenSSL the file name may not match one in the list (for example libssl.so.6). You can use the opensslLibName parameter to tell GPFS to look for the newer version instead.

mmchconfig opensslLibName="libssl.so.6"



readReplicaPolicy



Options: default, local

Default
By default, when data is replicated GPFS spreads the reads over all of the available failure groups. This configuration is typically best when the nodes running GPFS have equal access to both copies of the data.

Local
A value of local has two effects on reading data in a replicated storage pool. Data is read from:

  1. A local block device
  2. A "local" NSD Server

A local block device means that the path to the disk is through a block special device, for example a /dev/sd* device on Linux or a /dev/hdisk device on AIX. GPFS makes no further determination, so if the disks at two sites are connected over a long distance fiber connection, GPFS cannot distinguish which copy is local. To use this option, connect the sites using the NSD protocol over TCP/IP or InfiniBand Verbs (Linux only).

Further, GPFS uses the subnets configuration setting to determine which NSD servers are "local" to an NSD client. For NSD clients to benefit from "local" read access, the NSD servers supporting the local disks need to be on the same subnet as the NSD clients accessing the data, and that subnet needs to be defined using the subnets configuration parameter. This parameter is useful when GPFS replication is used to mirror data across sites and there are NSD clients in the cluster, because it keeps read requests from being sent over the WAN.
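A minimal sketch of the pieces involved for a two-site configuration (the subnet address below is an example; it should be the subnet shared by the local NSD servers and clients):

  # prefer the local copy of replicated data
  mmchconfig readReplicaPolicy=local
  # define the site-local subnet so GPFS can tell which NSD servers are "local"
  mmchconfig subnets="10.10.1.0"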



seqDiscardThreshold



The seqDiscardThreshold parameter affects what happens when GPFS detects a sequential read (or write) access pattern and has to decide what to do with the pagepool buffer after it is consumed (or flushed by write-behind threads). Discarding the buffer is the highest performing option when a very large file is read (or written) sequentially. The default value is 1MB (1048576 bytes), which means that if a file larger than 1MB is read sequentially, GPFS does not keep its data in cache after consumption. There are some instances where large files are reread often by multiple processes, in data analytics for example. In such cases you can improve performance by increasing seqDiscardThreshold to be larger than the set of files you would like to cache; this tells GPFS to attempt to keep as much data in cache as possible for files below that threshold. The value of seqDiscardThreshold is a file size in bytes.
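For example, to keep sequentially accessed files of up to 1GiB in cache after they are read (the value is illustrative and is given in bytes; the pagepool and maxBufferDescs must also be large enough to hold such files):

  mmchconfig seqDiscardThreshold=1073741824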

Tuning Guidelines

  • Increase this value if you want to cache sequentially read or written files that are larger than 1MB.
  • Make sure there are enough buffer descriptors to cache the file data. (See maxBufferDescs )

sharedMemLimit



The sharedMemLimit parameter allows you to increase the amount of memory available to store various GPFS structures, including the inode cache and tokens. When the value of sharedMemLimit is set to 0, GPFS automatically determines a value to use. The default value varies by platform; in GPFS 3.4 the default on Linux and Windows is 256MB. In GPFS 3.4 on Windows, sharedMemLimit can only be used to decrease the size of the shared segment. To determine whether increasing sharedMemLimit may help, you can use the mmfsadm dump fs command. For example, if you run mmfsadm dump fs and see that you are not getting the desired levels of maxFilesToCache (aka fileCacheLimit) or maxStatCache (aka statCacheLimit), you can try increasing sharedMemLimit.

# mmfsadm dump fs | head -8

Filesystem dump:
  UMALLOC limits:
    bufferDescLimit       4096 desired     4096
    fileCacheLimit        5000 desired    75000
    statCacheLimit       20000 desired    80000
    diskAddrBuffLimit     4000 desired     4000

The sharedMemLimit parameter is set in bytes.

As of release 3.4 the largest sharedMemLimit on Windows is 256M. On Linux and AIX the largest setting is 256G on 64 bit architectures and 2047M on 32 bit architectures. Using larger values may not work on some platforms/GPFS code versions. The actual sharedMemLimit on Linux may be reduced to a percentage of the kernel vmalloc space limit.
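For example, if the desired fileCacheLimit or statCacheLimit shown above is not being reached, you might try a larger shared segment (the 2GiB value is illustrative; the parameter is set in bytes and the platform limits above still apply):

  mmchconfig sharedMemLimit=2147483648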

socketMaxListenConnections



The parameter socketMaxListenConnections sets the number of TCP/IP sockets that the daemon can listen on in parallel. This tunable was introduced in 3.4.0.7 specifically for large clusters, where a burst of incoming connections to a manager node from a large number of client nodes can exceed the listen() backlog and time out. To be effective, the Linux tunable /proc/sys/net/core/somaxconn must also be raised from its default of 128. The effective value is the smaller of the GPFS tunable and the kernel tunable.

Default
Versions prior to 3.4.0.7 are fixed at 128. The default remains 128. The Linux kernel tunable also defaults to 128.

Tuning Guidelines
For clusters under 1000 nodes tuning this value should not be required. For larger clusters it should be set to approximately the number of nodes in the GPFS cluster. 
Example
mmchconfig socketMaxListenConnections=1500
echo 1500 > /proc/sys/net/core/somaxconn
(or)
sysctl -w net.core.somaxconn=1500

socketRcvBufferSize



The parameter socketRcvBufferSize sets the size of the TCP/IP receive buffer used for NSD data communication. This parameter is in bytes.

socketSndBufferSize



The parameter socketSndBufferSize sets the size of the TCP/IP send buffer used for NSD data communication. This parameter is in bytes.
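For example, to set both buffers to 2MiB on the NSD servers (the value and the nsdservers node class name are illustrative; both parameters are in bytes):

  mmchconfig socketRcvBufferSize=2097152 -N nsdservers
  mmchconfig socketSndBufferSize=2097152 -N nsdservers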

maxMBpS



The maxMBpS option is an indicator of the maximum throughput in megabytes per second that can be submitted by GPFS into or out of a single node. It is not a hard limit; rather, the maxMBpS value is a hint GPFS uses to calculate how much I/O can effectively be done for sequential prefetch and write-behind operations. In GPFS 3.3 the default maxMBpS value is 150, and in GPFS 3.5 it defaults to 2048. The maximum value is 100,000.

The maxMBpS value should be adjusted on each node to match the IO throughput the node is expected to support. For example, you should adjust maxMBpS for nodes that are directly attached to storage. A good rule of thumb is to set maxMBpS to twice the IO throughput required of a system. For example, if a system has two 4Gbit HBAs (400MB/sec per HBA), maxMBpS should be set to 1600. If the maxMBpS value is set too low, sequential IO performance may be reduced.
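Continuing the example above, a node with two 4Gbit HBAs (roughly 800MB/sec combined) would be set to twice that figure (ionodes is an example node class name for the storage-attached nodes):

  mmchconfig maxMBpS=1600 -N ionodes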

This setting is not used by NSD servers. It is only used for application nodes doing sequential access to files. 

maxFilesToCache



The maxFilesToCache parameter controls how many files each node can cache. Each cached file requires memory for the inode and a token (lock).

In addition to this parameter, the maxStatCache parameter controls how many files are partially cached; the default value of maxStatCache is 4 * maxFilesToCache, so maxFilesToCache effectively controls about five times that number of tokens per node, multiplied by the number of nodes in the cluster. The token managers for a given file system have to keep token state for all nodes in the cluster and for nodes in remote clusters that mount the file systems. This should be considered when setting this value.

One thing to keep in mind is that on a large cluster, a change in the value of maxFilesToCache is greatly magnified. Increasing maxFilesToCache from the default of 1000 by a factor of 2 in a cluster with 200 nodes will increase the number of tokens a server needs to store by approximately 2,000,000. Therefore, on large clusters it is recommended that if only a subset of nodes needs to have many open files, maxFilesToCache be increased only on those nodes. Nodes that may need an increased value for maxFilesToCache include login nodes, NFS/CIFS exporters, email servers and other file servers. For systems where applications use a large number of files, of any size, increasing the value of maxFilesToCache may prove beneficial; this is particularly true for systems where a large number of small files are accessed.

The increased value should be large enough to handle the number of concurrently open files plus allow caching of recently used files. You can use mmpmon (see monitoring) to measure the number of files opened and closed on a GPFS file system. Changing the value of maxFilesToCache affects the amount of memory used on the node. The amount of memory required for inodes and control data structures can be calculated as maxFilesToCache × 2.5 KB, where 2.5 KB = 2 KB + 512 bytes for an inode. Valid values of maxFilesToCache range from 1 to 100,000,000.
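For example, to raise the value only on the login and file serving nodes rather than cluster wide (the value and the loginnodes node class name are illustrative):

  mmchconfig maxFilesToCache=30000 -N loginnodes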

The size of the GPFS shared segment can limit the maximum setting of maxFilesToCache.  See sharedMemLimit for details.

Note: prior to release 3.5 the default maxFilesToCache and maxStatCache were 1000 and 4000. As of release 3.5, the default values are 4000 and 1000. If you change the maxFilesToCache value but not the maxStatCache value, then maxStatCache will default to 4 * maxFilesToCache.

Tuning Guidelines:

  • The increased value should be large enough to handle the number of concurrently open files plus allow caching of recently used files.
  • Increasing maxFilesToCache can improve the performance of user interactive operations like running ls.  
  • As a rule, the total of ((maxFilesToCache + maxStatCache) * nodes) should not exceed (600,000 * (tokenMemLimit/256M) * (the number of manager nodes - 1)), where the total accounts for the fact that different nodes may have different values of maxFilesToCache.

maxStatCache



The maxStatCache parameter sets aside additional pageable memory to cache attributes of files that are not currently in the regular file cache. This is useful to improve the performance of both the system and GPFS stat() calls for applications with a working set that does not fit in the regular file cache. The memory occupied by the stat cache can be calculated as: maxStatCache × 176 bytes
Valid values of maxStatCache range from 0 to 10,000,000.

For systems where applications test the existence of files, or the properties of files, without actually opening them (as backup applications do), increasing the value for maxStatCache may prove beneficial. The default value is: 4 × maxFilesToCache
On systems where maxFilesToCache is greatly increased, it is recommended that this value be set manually to something less than 4 * maxFilesToCache. For example, if you set maxFilesToCache to 30,000 you may want to set maxStatCache to 30,000 as well. On compute nodes this can usually be set much lower, since they typically have only a few active files in use for any one job.
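For example, to match the maxFilesToCache value from the paragraph above on the same set of nodes (both the value and the node class name are illustrative):

  mmchconfig maxStatCache=30000 -N loginnodes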

Note: prior to release 3.5 the default maxFilesToCache and maxStatCache were 1000 and 4000. As of release 3.5, the default values are 4000 and 1000. If you change the maxFilesToCache value but not the maxStatCache value, then maxStatCache will default to 4 * maxFilesToCache.

The size of the GPFS shared segment can limit the maximum setting of maxStatCache.  See sharedMemLimit for details. 

maxBufferDescs



The value of maxBufferDescs defaults to 10 * maxFilesToCache, up to pagepool size / 16K. When caching small files it does not need to be more than a small multiple of maxFilesToCache, since only OpenFile objects (not stat cache objects) can cache data blocks.

If an application needs to cache very large files you can tune maxBufferDescs to ensure there are enough to cache large files.  To see the current value use the mmfsadm command:

#mmfsadm dump fs

Filesystem dump:
  UMALLOC limits:
    bufferDescLimit      10000 desired    10000
    fileCacheLimit        1000 desired     1000
    statCacheLimit        4000 desired     4000
    diskAddrBuffLimit      800 desired      800

In this case there are 10,000 buffer descriptors configured. If you have a 1MiB file system blocksize and want to cache a 20GiB file, you will not have enough buffer descriptors. To cache a 20GiB file, increase maxBufferDescs to at least 20,480 (20GiB / 1MiB = 20,480). It is not exactly a one-to-one mapping, so a value of 32k may be appropriate.

mmchconfig maxBufferDescs=32k

nfsPrefetchStrategy



The parameter nfsPrefetchStrategy tells GPFS to optimize prefetching for NFS-style file access patterns. It defines a window, in number of blocks around the current position, that is treated as "fuzzy sequential" access. This can improve performance when big files are read sequentially, because kernel scheduling causes some of the read requests to reach GPFS out of order, so they do not look "strictly sequential". If the file system blocksize is small relative to the read request sizes, making this value bigger provides a bigger window of blocks. The default is 0.
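For example, on the nodes that export the file system over NFS (nfsnodes is an example node class name):

  mmchconfig nfsPrefetchStrategy=1 -N nfsnodes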

Tuning Guidelines

  • Setting nfsPrefetchStrategy to 1 can improve sequential read performance when large files are accessed using NFS. 

nsdMaxWorkerThreads



  
The parameter nsdMaxWorkerThreads sets the maximum number of NSD threads on an NSD server that will be concurrently transferring data with NSD clients. The default is 32, with a minimum of 8. The maximum value is constrained by the requirement that worker1Threads + prefetchThreads + nsdMaxWorkerThreads be less than 1500 on 64-bit architectures. The default works well in many clusters, but in some cases, for example on large clusters, it may help to increase nsdMaxWorkerThreads. Scale this with the number of LUNs, not the number of clients; these threads are needed to manage flow control on the network between the clients and the servers.
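For example, an NSD server presenting 64 LUNs might be given a few threads per LUN (the value and node class name below are an illustrative starting point, not a recommendation):

  mmchconfig nsdMaxWorkerThreads=192 -N nsdservers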

numaMemoryInterleave



  
On Linux, setting numaMemoryInterleave to yes starts mmfsd with numactl --interleave=all. Enabling this parameter may improve the performance of GPFS on NUMA systems, for example systems based on an Intel Nehalem processor.
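To enable it (the setting affects how mmfsd is started, so it takes effect when GPFS is restarted on the node):

  mmchconfig numaMemoryInterleave=yes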

prefetchPct



"prefetchPct" defaults to 20% of pagepool. GPFS uses this as a guideline which limits how much pagepool space will be used for prefetch or writebehind buffers in the case of active sequential streams. The default works well for many applications. On the other hand, if the workload is mostly sequential (video serving/ingest) with very little caching of small files or random IO, then this number should be increased up to its 60% maximum, so that each stream can have more buffers available for prefetch and write behind operations. 

prefetchThreads



Tuning Guidelines

  • You usually don't need prefetchThreads to be more than twice the number of LUNs available to the node (see the example below); any more than that typically do nothing but wait in queues. The maximum value is constrained by the requirement that worker1Threads + prefetchThreads + nsdMaxWorkerThreads be less than 1500 on 64-bit architectures.
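For example, a node with 48 LUNs might use the following, simply applying the two-per-LUN rule above (the value is illustrative):

  mmchconfig prefetchThreads=96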

Logfile



"Logfile" size should be larger for high metadata rate systems to prevent more glitches when the log has to wrap. Can be as large as 16MB on large blocksize file systems. To set this parameter use the --L flag on mmcrfs. 

verbsLibName

To initialize IB RDMA, GPFS looks for a file called libverbs.so. If the file name on your system is different, libverbs.so.1.0 for example, you can change this parameter to match.

Example:
  mmchconfig verbsLibName=libverbs.so.1.0



verbsrdmasperconnection


This is the maximum number of RDMAs that can be outstanding on any single RDMA connection. The default value is 8.

Tuning Guidelines

  • In testing the default was more than enough on SDR. All performance testing of the parameters was done on OFED 1.1 IB SDR. 

verbsrdmaspernode


This is the maximum number of RDMAs that can be outstanding from the node. The default value is 0, which means use the built-in default of 32.

Tuning Guidelines

  • In testing the default was more than enough to keep adapters busy on SDR. All performance testing of the parameters was done on OFED 1.1 IB SDR. 

worker1Threads



The worker1Threads parameter represents the total number of concurrent application requests that can be processed at one time. This includes metadata operations such as file stat(), open and close requests, as well as data operations. The worker1Threads parameter can be reduced without restarting the GPFS daemon; increasing the value requires a restart of the GPFS daemon.
To determine whether you have a sufficient number of worker1Threads configured, you can use the mmfsadm dump mb command.

# mmfsadm dump mb | grep Worker1
  Worker1Threads: max 48 current limit 48 in use 0 waiting 0   PageDecl: max 131072 in use 0

Using the mmfsadm command you can see how many threads are "in use" and how many application requests are "waiting" for a worker1 thread.
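For example, for a database workload doing AIO/DIO against a file system with 64 LUNs, the two-per-LUN starting point from the guidelines below would be (the value and node class name are illustrative; remember that increasing worker1Threads requires a daemon restart):

  mmchconfig worker1Threads=128 -N dbnodes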

Tuning Guidelines

  • The default is good for most workloads.
  • You may want to increase worker1Threads if your application uses many threads and does Asynchronous IO (AIO) or Direct IO (DIO). In these cases the worker1 threads perform the IO operations. A good place to start is to set worker1Threads to approximately 2 times the number of LUNs in the file system so GPFS can keep the disks busy with parallel requests. The maximum value is constrained by the requirement that worker1Threads + prefetchThreads + nsdMaxWorkerThreads be less than 1500 on 64-bit architectures.
  • Do not use excessive values of worker1Threads.


worker3Threads



The worker3Threads parameter specifies the number of threads to use for inode prefetch. A value of zero disables inode prefetch. The default is 8.
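For example, to disable inode prefetch entirely:

  mmchconfig worker3Threads=0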

Tuning Guidelines

  • The default is good for most workloads.


writebehindThreshold



The writebehindThreshold parameter determines at what point GPFS starts flushing newly written data out of the pagepool for a file. Increasing this value can increase how many newly created files are kept in cache. This can be useful, for example, if your workload contains temporary files that are smaller than writebehindThreshold and are deleted before they would ever be flushed from cache. By default GPFS uses the pagepool to buffer IO for best performance, but once the data is written the buffers are cleaned; increasing this value tells GPFS to try to keep the data in the pagepool as long as practical instead of immediately cleaning the buffers. The value is the maximum file size to keep in cache and is specified in bytes; the default is 512K (524288 bytes). If the value is too large, there may be too many dirty buffers for the sync thread to flush at the next sync interval, causing a surge in disk IO. Keeping it small ensures a smooth flow of dirty data to disk.
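For example, to keep newly written files of up to 4MiB in the pagepool rather than flushing them immediately (the value is illustrative and is specified in bytes):

  mmchconfig writebehindThreshold=4194304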

Tuning Guidelines

  • The default is good for most workloads.
  • Increase this value if your workload would benefit from not immediately flushing newly written files larger than 512K.



