The recfgct command is a script that reconfigures the RSCT subsystems, including reassigning the node ID; it essentially resets all cluster configuration on a node to defaults.
Resolving the problem
The “recfgct” command is an undocumented command that removes the node ID from an RSCT node, essentially wiping out any domain information the node had. The command is not in the default PATH, so you need to know where to find it in order to use it. Its default location is:
/usr/sbin/rsct/install/bin/
recfgct is a shell script that accepts two flags:
[root@testsrv]/root# /usr/sbin/rsct/install/bin/recfgct -n    # resets the node id file (default option)
[root@testsrv]/root# /usr/sbin/rsct/install/bin/recfgct -s    # saves the node id file
Care must be taken when using this command as it will also reset the files /var/ct/cfg/ctrmc.acls and /var/ct/cfg/ctsec_map.global which are used for cluster security. TSAMP Support recommends backing up these files before using this command.
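For example, a minimal backup sketch (the .orig names are arbitrary):

# cp -p /var/ct/cfg/ctrmc.acls /var/ct/cfg/ctrmc.acls.orig
# cp -p /var/ct/cfg/ctsec_map.global /var/ct/cfg/ctsec_map.global.orig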
After cloning with an alternate disk install, improper hostname resolution may cause HMC RSCT errors.
Dynamic LPAR relies on Service Focal Point and RSCT. This means that TCP/IP, including the hostname on the HMC, must be configured. The /etc/hosts files in the HMC and all LPARs must have entries for all entities with the hostname appearing first in any list of aliases.
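For illustration, /etc/hosts entries of the required form, with the hostname first and any aliases after it (the addresses and names below are hypothetical):

10.10.3.130      hmc1.example.com      hmc1
192.168.222.166  monitor3.example.com  monitor3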
A problem with DLPAR can exist after using the altinst_rootvg cloning procedure to create the second LPAR. The files /etc/ct_node_id and /var/ct/cfg/ct_node_id have the same contents on both LPARs, so RSCT may detect only one LPAR to manage. Development is aware of this and is working on a circumvention and correction. In the meantime, a workaround is to run the following two commands after the first boot of the new LPAR:
[root@testsrv]/root# /usr/sbin/rsct/install/bin/uncfgct -n
[root@testsrv]/root# /usr/sbin/rsct/install/bin/cfgct
A subsequent reboot will cause the partitions to be synchronized from an RSCT perspective. The duplicate /etc/ct_node_id condition is also likely to cause problems in an HACMP scenario between cloned LPARs. Cloning an LPAR with a mksysb tape/DVD will likely cause the same problem. NIM installs avoid this problem because a unique /etc/ct_node_id file is created.
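To check for the duplicate condition, display the node ID on each cloned LPAR and compare; identical values confirm the problem (the files are the ones named above):

# cat /etc/ct_node_id
# cat /var/ct/cfg/ct_node_id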
Test the connection between LPAR and HMC:
Depending on the OS/RSCT version, the resource class to query is IBM.MCP (newer releases) or IBM.ManagementServer (older releases):
[root@testsrv]/root# lsrsrc IBM.MCP
Resource Persistent Attributes for IBM.MCP
resource 1:
        MNName            = "192.168.222.166"
        NodeID            = 11973552969676986359
        KeyToken          = "fsm-pureflex"
        IPAddresses       = {"10.10.3.130"}
        ConnectivityNames = {"192.168.222.166"}
        HMCName           = "7955-01M*069582B"
        HMCIPAddr         = "10.10.3.130"
        HMCAddIPs         = "10.10.3.130"
        HMCAddIPv6s       = "fdd8:ae8f:c01f:0:6eae:8bff:fe7c:43f2,fdb1:b77c:fa7e:0:6eae:8bff:fe7c:43f2,fe80::6eae:8bff:fe7c:43f2"
        ActivePeerDomain  = ""
        NodeNameList      = {"monitor3"}
[root@testsrv]/root# lsrsrc "IBM.ManagementServer" Resource Persistent Attributes for IBM.ManagementServer resource 1: Name = "10.10.0.3" Hostname = "10.10.0.3" ManagerType = "HMC" LocalHostname = "10.19.10.81" ClusterTM = "9078-160" ClusterSNum = "" ActivePeerDomain = "" NodeNameList = {"testsrv"} resource 2: Name = "10.9.0.3" Hostname = "10.9.0.3" ManagerType = "HMC" LocalHostname = "10.19.10.81" ClusterTM = "9078-160" ClusterSNum = "" ActivePeerDomain = "" NodeNameList = {"testsrv"}
For reference, use -z to stop the daemons, and -A to add the subsystem for startup and start it:
# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
Also check the RMC state on the HMC; it must be active:
hscroot@luhmc1:~> lssyscfg -r lpar -m P55A-9133-55A-SN06C1B4G -F name,lpar_id,state,rmc_state,rmc_ipaddr,os_version,dlpar_mem_capable
tsmaixbeta,11,Not Activated,inactive,,Unknown,0
Oracle1,10,Not Activated,inactive,192.168.222.158,AIX 7.1 7100-03-01-1341,0
monitor2,9,Running,active,192.168.222.157,AIX 7.1 7100-03-01-1341,1
repmon,7,Not Activated,inactive,,Unknown,0
monitor,6,Running,active,192.168.222.153,AIX 7.1 7100-03-04-1441,1
aixpowersc,5,Not Activated,inactive,192.168.222.155,AIX 7.1 7100-03-01-1341,0
Management Domain Status: Management Control Points
Indicates, in a management domain, that a communication problem has been discovered and the RMC daemon has suspended communications with the RMC daemon on the specified node. This typically results from a network configuration problem in which small heartbeat packets can be exchanged between the two RMC daemons but larger data packets cannot, usually because of a difference in the MTU sizes of the nodes' network adapters.
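The status above comes from the RSCT management domain status display; to view it on the node, run the rmcdomainstatus utility shipped with RSCT:

# /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc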
Action plan: add a line to the RMC startup script, as shown below.
Reduce the RMC data packet size in the RMC startup script, /usr/sbin/rsct/bin/rmcd_start, by adding the -S option to limit the RMC packet size.
On VIOS:
# cd /usr/sbin/rsct/bin/
# cp rmcd_start rmcd_start.orig
# vi rmcd_start
(at the end of the script, find the following lines)
# now start the RMC daemon in the current process so it is the child of the
# SRC subsystem daemon
SOPT="-S 4500"     <---- add this line
After saving the rmcd_start script, recycle RMC:
# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
# /usr/sbin/rsct/bin/rmcctrl -p
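To confirm the change took effect, verify that the ctrmc subsystem is active and that the rmcd process restarted with the new option (a quick sanity check; rmcctrl -p enables remote client connections):

# lssrc -s ctrmc
# ps -ef | grep rmcd | grep -v grep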