- http://www.ibm.com/developerworks/wikis/pages/viewpage.action?pageId=119080484


Something isn't working, where do I start?
Added by ScottGPFS, last edited by hraval on Nov 28, 2012

This information has moved


Contents

  • Common
    • A node failed and I had to rebuild it from scratch. How do I add it back into the cluster?
    • I just created a new file system with pools and when I try to write a file I receive a no space error?
    • I successfully created the NSD's but now GPFS does not see them
    • Something seems slow or appears to hang
  • AIX
  • Linux
    • GPFS fails to start and reports "no such file or directory" for libssl library in the mmfs.log file
    • The Kernel module will not build.
  • Windows
    • How can I verify the connection to the Active Directory server is working from a windows node?
    • The GPFS installation fails (ERROR: encountered while installing driver package 'C:\Windows\SUA\usr\lpp\mmfs\driver\mmbus\mmbus.inf')
    • When attempting to add a windows node mmaddnode fails (A remote host refused an attempted connect operation.)

 Common

A server failed and I had to rebuild it from scratch. How do I add it back into the cluster?

How to recover a failed GPFS node.

It was not an NSD server:
If the node is not an NSD Server then the easiest way to recover is to remove the node from the cluster and add it back in.

  1. Remove the node from the cluster (the node must not respond to ping for this to work)
    mmdelnode -N failednode
  2. Add it back in using mmaddnode
    mmaddnode -N failednode
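
The ping requirement in step 1 can be checked before running mmdelnode. The following is a minimal sketch, not part of GPFS: the function name delete_failed_node and the PING_CMD and RUN variables are hypothetical, added only so the logic can be tried as a dry run (RUN=echo) without touching a cluster.

```shell
# Hypothetical helper, not shipped with GPFS.
# PING_CMD and RUN are illustrative overrides; set RUN=echo for a dry run.
delete_failed_node() {
    node="$1"
    # Step 1 only works while the node is unreachable, so stop if it answers.
    if ${PING_CMD:-ping} -c 1 "$node" >/dev/null 2>&1; then
        echo "refusing: $node still answers ping" >&2
        return 1
    fi
    ${RUN:-} mmdelnode -N "$node"
}
```

Once the rebuilt node is back on the network with GPFS installed, step 2 (mmaddnode -N failednode) is run as shown above.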

It was an NSD Server:
If the node is an NSD server, you cannot remove it from the cluster without reconfiguring the NSD server definitions for the disks. To recover the node without reconfiguring the NSD server definitions:

  1. Reinstall the operating system and GPFS
  2. Get a copy of the mmsdrfs file from the primary cluster configuration server. (You can get it from any node, but the primary configuration server has the most up-to-date copy.)
    scp PrimaryClusterConfigNode:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs
  3. Make sure the cluster configuration information is up to date
    mmchcluster -p LATEST
  4. At this point you should be able to start the node
    mmstartup -N failednode

 I just created a new file system with pools and when I try to write a file I receive a no space error?

If you just created a new file system and cannot create a file, you may have storage pools but no placement policy. If your system pool is metadata only, which is fine, then that pool has metadata space but no data space. With no policy installed, the default rule places everything in the system storage pool, so file data has nowhere to go. You can check the policy configuration by running mmlspolicy.

[root@perf7-c4-int64]#  mmlspolicy gpfs1
No policy file was installed  for file system 'gpfs1'.

If it says "No policy file was installed", you need a policy. A simple policy can be a single rule, something like this:

RULE 'default' set POOL 'satapool'

This policy will send all file data to the storage pool named satapool. Place that text in a file (policy.txt), then install the policy:

mmchpolicy gpfs1  policy.txt
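
The check-then-install sequence above can be sketched as two small shell functions. The names needs_policy and write_default_policy are hypothetical helpers, and the sketch assumes mmlspolicy reports a missing policy with the exact message shown earlier.

```shell
# needs_policy: reads mmlspolicy output on stdin and succeeds when it
# contains the "no policy" message quoted above.
needs_policy() {
    grep -q "No policy file was installed"
}

# write_default_policy POOL FILE: writes the one-rule policy shown above,
# sending all file data to POOL.
write_default_policy() {
    printf "RULE 'default' set POOL '%s'\n" "$1" > "$2"
}

# Usage on a cluster node (gpfs1 and satapool are the names from this example;
# 2>&1 is there in case the message is written to stderr):
#   if mmlspolicy gpfs1 2>&1 | needs_policy; then
#       write_default_policy satapool policy.txt
#       mmchpolicy gpfs1 policy.txt
#   fi
```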

I successfully created the NSD's but now GPFS does not see them  

Sometimes you create an NSD using the mmcrnsd command and it completes successfully, but then mmlsnsd -X, for example, says the devices are not found.

# mmlsnsd -X

Disk name  NSD volume ID     Device  Devtype  Node name  Remarks
---------------------------------------------------------------------------------------------------
nsd1       1E05D0374B7053C4  -       -        node1      (not found) server node
nsd2       1E05D0384B7053C4  -       -        node2      (not found) server node

Cause: Unknown device name

This can happen when the device name is not one that GPFS scans by default. In this case the device name was /dev/fioa; GPFS looks for names such as /dev/sd* but not /dev/fio*. The mmcrnsd command reads the device name from the NSD descriptor you provide, so it succeeds. The GPFS daemon, however, finds devices by searching the list it discovered at startup, or after you ran the mmnsddiscover command if the devices were added after the daemon started. You therefore need to tell GPFS about the new device name, which you can do with the nsddevices user exit. For information and an example of how to use the nsddevices user exit, see Device Naming.
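
As a rough illustration of what such a user exit does, the sketch below lists /dev/fio* devices from /proc/partitions-style input. It is modeled on the sample that ships with GPFS (/usr/lpp/mmfs/samples/nsddevices.sample). The function name list_fio_devices is hypothetical, and the "generic" device type and the return-code convention in the comments are assumptions to verify against that sample before use.

```shell
# list_fio_devices: reads /proc/partitions-style text on stdin and prints
# "<name> generic" for every device whose name starts with "fio".
# The "<name> <type>" output format is assumed from the shipped sample.
list_fio_devices() {
    awk '$4 ~ /^fio/ { print $4, "generic" }'
}

# In an actual /var/mmfs/etc/nsddevices script this would be roughly:
#   list_fio_devices < /proc/partitions
#   return 1   # assumed: non-zero lets GPFS continue its own discovery as well
```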

Something seems slow or appears to hang


If file system access seems slow or GPFS seems to be hanging, the place to start investigating is what GPFS calls "waiters." Waiters are operations that are taking longer than some threshold; the reporting threshold is different for each type of operation. Some waiters are normal and indicate a healthy system, while others can tell you where a problem lies. To see the waiters:
When running GPFS 3.4, you can use the mmdiag command:

mmdiag --waiters

When running GPFS 3.3 or earlier:

mmfsadm dump waiters
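
Because short waiters are normal, it can help to filter the list down to the long ones. The following is a minimal sketch with a hypothetical function name, long_waiters. It assumes each waiter line contains the word "waiting" followed by the wait time in seconds; check that against the output on your release.

```shell
# long_waiters SECONDS: reads waiter lines on stdin and prints those whose
# wait time (the number right after the word "waiting") exceeds SECONDS.
long_waiters() {
    awk -v limit="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == "waiting" && ($(i + 1) + 0) > (limit + 0)) { print; break }
    }'
}

# Usage on a cluster node:
#   mmdiag --waiters | long_waiters 10
```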

For more information see:


Linux 

GPFS fails to start and reports "no such file or directory" for libssl library in the mmfs.log file


 
This message may occur if the right library is not specified by the opensslLibName configuration parameter, which defaults to a list of common libssl library names: libssl.so:libssl.so.0:libssl.so.4. If the installed libssl library is not in the default list, you need to specify it through the opensslLibName configuration parameter.

mmchconfig opensslLibName="libssl.so.0.9.8e"

An alternative is to create a symbolic link that points a library name in the default list to the installed library.

ln -s  libssl.so.0.9.8e libssl.so

Another alternative is to install the OpenSSL development package, which should also create a "libssl.so" symlink.

On SLES11 or later:
zypper install libopenssl-devel
On RHEL5.4 or later:
yum install openssl-devel
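
To see whether any of the default names is present before changing anything, a small check like the one below can be used. This is a sketch only: find_default_libssl is a hypothetical helper, and the library directory you pass in (for example /usr/lib64) is an assumption that depends on the distribution.

```shell
# find_default_libssl DIR: prints the first name from the default
# opensslLibName list that exists in DIR; fails if none is present.
find_default_libssl() {
    dir="$1"
    for name in libssl.so libssl.so.0 libssl.so.4; do
        if [ -e "$dir/$name" ]; then
            echo "$name"
            return 0
        fi
    done
    return 1
}

# Usage:
#   find_default_libssl /usr/lib64 || echo "set opensslLibName or add a symlink"
```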
 

The Kernel module will not build


If make Autoconfig or make World fails and you are running a Red Hat-based Linux distribution with GPFS 3.4.0.4 or later, you can try telling Autoconfig to treat the distribution as Red Hat by setting the LINUX_DISTRIBUTION flag. This allows you to build the GPFS portability layer on CentOS, for example.

make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
  

Windows

How can I verify the connection to the Active Directory server is working from a windows node?

To verify that the Active Directory connection is working from a Windows node, you can look up a user account using the mmfsadm command. For example, to verify that the root account is accessible:

mmfsadm test adlookup "cn=root"

The GPFS installation fails (ERROR: encountered while installing driver package 'C:\Windows\SUA\usr\lpp\mmfs\driver\mmbus\mmbus.inf')


Symptom:

The GPFS 3.3 installer on Windows Server 2008 fails and reports that the install came to a premature end. When you look in the install logs in %SystemRoot%\SUA\var\adm\ras, you see an error similar to the following:

DIFXAPP: ERROR: encountered while installing driver package 'C:\Windows\SUA\usr\lpp\mmfs\driver\mmbus\mmbus.inf'
DIFXAPP: ERROR: InstallDriverPackages failed with error 0x5

Resolution:

This problem can occur when the user's directory (e.g. C:\Users\root) does not permit the SYSTEM user to create temporary files during installation. If you are installing GPFS as root and root has been configured to support passwordless-ssh, then root's home directory will probably not allow SYSTEM write access.

Some known ways to fix this include:

  1. Install GPFS as Administrator
  2. Temporarily give SYSTEM write access to the user's home directory:
    $ cd ~
    $ ls -l -d .
    drwxr-x---  1 root  +SYSTEM  8192 Jan 26 15:04 .
    $ chmod g+w .
    $ ls -l -d .
    drwxrwx---  1 root  +SYSTEM  8192 Jan 26 15:04 .
    $
    $ #  INSTALL GPFS
    $
    $ chmod g-w .
    
  3. Delete the profile of the user account attempting to do the install. 

When attempting to add a windows node mmaddnode fails (A remote host refused an attempted connect operation.)

Symptom:

When you attempt to add a new Windows node to an existing AIX or Linux cluster, you receive the following error:

nodea.ibm.com: A remote host refused an attempted connect  operation.
nodea.ibm.com: A remote host refused an attempted  connect operation.

Resolution:

There are a few things that can cause this:
  1. SSH is not configured properly for passwordless access. To test this, try running ssh from every node to every node using:
    • The short name (nodea)
    • The fully qualified name (nodea.ibm.com)
    • The IP address (10.1.1.0)
  2. The existing cluster is using rsh instead of ssh. Windows does not support rsh; you must use ssh. To fix this, reconfigure the existing cluster for ssh using the mmchcluster command.

mmchcluster -r /bin/ssh -R /bin/scp
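
The ssh test from item 1 can be scripted so that it fails instead of prompting for a password. This is a minimal sketch: check_ssh_names is a hypothetical helper, and SSH_CMD is an illustrative override that exists only so the logic can be exercised without real hosts.

```shell
# check_ssh_names NAME...: tries a non-interactive ssh to each name and
# reports which ones fail. Returns non-zero if any name failed.
check_ssh_names() {
    failed=0
    for name in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for a password.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$name" true </dev/null >/dev/null 2>&1; then
            echo "ok   $name"
        else
            echo "FAIL $name"
            failed=1
        fi
    done
    return "$failed"
}

# Usage, repeated on every node in the cluster:
#   check_ssh_names nodea nodea.ibm.com 10.1.1.0
```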


