====== GPFS health ======
+ | |||
+ | mmhealth thresholds (+GUI +REST) monitors the usage: | ||
+ | <cli prompt='$'> | ||
+ | $ mmhealth thresholds list | ||
+ | ### Threshold Rules ### | ||
+ | rule_name metric error warn direction filterBy groupBy sensitivity | ||
+ | ---------------------------------------------------------------------------------------------------------------------------- | ||
+ | InodeCapUtil_Rule Fileset_inode 90.0 80.0 high gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name 300 | ||
+ | DataCapUtil_Rule DataPool_capUtil 97.0 90.0 high gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name 300 | ||
+ | .. | ||
+ | </cli> | ||
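As a rough sketch of what such a rule evaluates: InodeCapUtil_Rule compares the percentage of allocated inodes in use against the warn (80.0) and error (90.0) levels, direction high. The shell illustration below uses hypothetical inode counts and is not GPFS code:

```shell
# Illustration only: classify inode usage against the warn/error levels
# of InodeCapUtil_Rule above (warn=80.0, error=90.0, direction=high).
level() {  # usage: level <usedInodes> <allocatedInodes>
  pct=$(( $1 * 100 / $2 ))            # integer percent of inodes in use
  if   [ "$pct" -ge 90 ]; then echo error
  elif [ "$pct" -ge 80 ]; then echo warn
  else echo ok; fi
}
level 45000 50000   # 90% used -> error
level 41000 50000   # 82% used -> warn
```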
List all events
Clear ALL messages on the Web (GUI) interface
<cli prompt='#'>
[root@prscale-a-01 ~]# /usr/lpp/mmfs/gui/cli/lshealth --reset
</cli>

Already resolved errors sometimes continue to be displayed in mmhealth and the GUI.
To remove them (and the annoying TIPS state):
<cli prompt='#'>
mmdsh -N <NODE or all> mmsysmonc clearDB
mmdsh -N <NODE or all> mmsysmoncontrol restart
mmhealth event hide <EventName>
</cli>
</cli>
More compact, machine-readable output:
<cli prompt='#'>
[root@prscale-a-01 ~]# mmhealth cluster show -Y
mmhealth:Summary:HEADER:version:reserved:reserved:component:entityname:total:failed:degraded:healthy:other:
mmhealth:Summary:0:1:::NODE:NODE:3:0:0:0:3:
mmhealth:Summary:0:1:::GPFS:GPFS:3:0:0:0:3:
mmhealth:Summary:0:1:::NETWORK:NETWORK:3:0:0:3:0:
mmhealth:Summary:0:1:::FILESYSTEM:FILESYSTEM:3:0:0:3:0:
mmhealth:Summary:0:1:::DISK:DISK:43:0:0:43:0:
mmhealth:Summary:0:1:::CES:CES:2:0:0:2:0:
mmhealth:Summary:0:1:::CESIP:CESIP:1:0:0:1:0:
mmhealth:Summary:0:1:::CLOUDGATEWAY:CLOUDGATEWAY:2:0:0:2:0:
mmhealth:Summary:0:1:::FILESYSMGR:FILESYSMGR:2:0:0:2:0:
mmhealth:Summary:0:1:::GUI:GUI:2:0:0:2:0:
mmhealth:Summary:0:1:::PERFMON:PERFMON:3:0:0:3:0:
mmhealth:Summary:0:1:::THRESHOLD:THRESHOLD:3:0:0:3:0:
</cli>
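The -Y summary is easy to post-process. A small awk sketch (field positions read off the HEADER line above, sample lines copied from this output) that prints every component whose healthy count differs from its total:

```shell
# Flag components that are not fully healthy (total != healthy).
# Fields per the HEADER line: 7=component 9=total 10=failed 11=degraded 12=healthy 13=other
sample='mmhealth:Summary:0:1:::NODE:NODE:3:0:0:0:3:
mmhealth:Summary:0:1:::NETWORK:NETWORK:3:0:0:3:0:
mmhealth:Summary:0:1:::DISK:DISK:43:0:0:43:0:'
echo "$sample" | awk -F: '$3!="HEADER" && $9!=$12 {print $7}'   # prints: NODE
```

On a live cluster, replace the sample with `mmhealth cluster show -Y | awk -F: '...'`.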
+ | |||
+ | Per node | ||
+ | <cli prompt='#'> | ||
+ | [root@prscale-a-01 ~]# mmhealth node show | ||
+ | |||
+ | Node name: prscale-a-01 | ||
+ | Node status: TIPS | ||
+ | Status Change: 2 days ago | ||
+ | |||
+ | Component Status Status Change Reasons | ||
+ | ---------------------------------------------------------------------------------------------------------------------------------- | ||
+ | GPFS TIPS 2 days ago callhome_not_enabled, gpfs_maxfilestocache_small, total_memory_small | ||
+ | NETWORK HEALTHY 4 days ago - | ||
+ | FILESYSTEM HEALTHY 4 days ago - | ||
+ | DISK HEALTHY 4 days ago - | ||
+ | CES HEALTHY 4 days ago - | ||
+ | CESIP HEALTHY 4 days ago - | ||
+ | CLOUDGATEWAY HEALTHY 4 days ago - | ||
+ | FILESYSMGR HEALTHY 4 days ago - | ||
+ | GUI HEALTHY 4 days ago - | ||
+ | PERFMON HEALTHY 4 days ago - | ||
+ | THRESHOLD HEALTHY 4 days ago - | ||
+ | [root@prscale-a-01 ~]# mmhealth node show -y | ||
+ | Option with missing argument | ||
+ | |||
+ | Additional messages: | ||
+ | invalid option: -y | ||
+ | [root@prscale-a-01 ~]# mmhealth node show -Y | ||
+ | mmhealth:Event:HEADER:version:reserved:reserved:node:component:entityname:entitytype:event:arguments:activesince:identifier:ishidden: | ||
+ | mmhealth:State:HEADER:version:reserved:reserved:node:component:entityname:entitytype:status:laststatuschange: | ||
+ | mmhealth:State:0:1:::prscale-a-01:NODE:prscale-a-01:NODE:TIPS:2021-10-03 03%3A55%3A43.152676 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:CES:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A50%3A34.180949 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:BLOCK:prscale-a-01:NODE:DISABLED:2021-10-01 09%3A23%3A59.486199 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:NFS:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A50%3A34.180949 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:AUTH_OBJ:prscale-a-01:NODE:DISABLED:2021-10-01 09%3A24%3A17.174618 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:CESNETWORK:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A49%3A03.938605 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:CESNETWORK:ens192:NIC:HEALTHY:2021-10-01 09%3A24%3A05.346180 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:OBJECT:prscale-a-01:NODE:DISABLED:2021-10-01 09%3A24%3A05.421280 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:SMB:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A48%3A18.924751 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:AUTH:prscale-a-01:NODE:DISABLED:2021-10-01 09%3A24%3A02.017860 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:HDFS_NAMENODE:prscale-a-01:NODE:DISABLED:2021-10-01 09%3A24%3A17.295829 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:CLOUDGATEWAY:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A25%3A08.337661 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:CLOUDGATEWAY:tct_tiering1-vault_backup_01:TCT_SERVICE:HEALTHY:2021-10-01 09%3A24%3A33.013963 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:CLOUDGATEWAY:vault_backup_01/:TCT_CSAP:HEALTHY:2021-10-01 09%3A24%3A33.023072 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:CESIP:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A35%3A32.969821 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A52%3A49.707704 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_A_CES_0001:NSD:HEALTHY:2021-10-01 09%3A52%3A49.715357 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_M_A_0001:NSD:HEALTHY:2021-10-01 09%3A24%3A14.576509 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_M_A_0002:NSD:HEALTHY:2021-10-01 09%3A24%3A14.596580 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_M_A_0003:NSD:HEALTHY:2021-10-01 09%3A24%3A14.618590 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0001:NSD:HEALTHY:2021-10-01 09%3A24%3A14.637984 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_M_A_0004:NSD:HEALTHY:2021-10-01 09%3A24%3A14.656978 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0002:NSD:HEALTHY:2021-10-01 09%3A24%3A14.673522 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0003:NSD:HEALTHY:2021-10-01 09%3A24%3A14.692509 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0004:NSD:HEALTHY:2021-10-01 09%3A24%3A14.711926 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0005:NSD:HEALTHY:2021-10-01 09%3A24%3A14.742578 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0006:NSD:HEALTHY:2021-10-01 09%3A24%3A14.761914 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0007:NSD:HEALTHY:2021-10-01 09%3A24%3A14.788854 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0008:NSD:HEALTHY:2021-10-01 09%3A24%3A14.808564 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0009:NSD:HEALTHY:2021-10-01 09%3A24%3A14.830882 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0010:NSD:HEALTHY:2021-10-01 09%3A24%3A14.852833 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0011:NSD:HEALTHY:2021-10-01 09%3A24%3A14.876191 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_D_A_0012:NSD:HEALTHY:2021-10-01 09%3A24%3A14.888040 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS03_NSD_M_A_0005:NSD:HEALTHY:2021-10-01 09%3A24%3A14.915370 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS03_NSD_M_A_0006:NSD:HEALTHY:2021-10-01 09%3A24%3A14.931229 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS03_NSD_M_A_0007:NSD:HEALTHY:2021-10-01 09%3A24%3A14.942819 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS03_NSD_M_A_0008:NSD:HEALTHY:2021-10-01 09%3A24%3A14.956576 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS03_NSD_M_A_0009:NSD:HEALTHY:2021-10-01 09%3A24%3A14.970095 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS03_NSD_M_A_0010:NSD:HEALTHY:2021-10-01 09%3A24%3A14.993296 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:DISK:GPFS_NSD_A_D1_0001:NSD:HEALTHY:2021-10-01 09%3A24%3A15.019350 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:GUI:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A53%3A32.099884 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:THRESHOLD:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A24%3A01.090924 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:THRESHOLD:MemFree_Rule:THRESHOLD_RULE:HEALTHY:2021-10-01 09%3A24%3A32.160416 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:THRESHOLD:SMBConnPerNode_Rule:THRESHOLD_RULE:HEALTHY:2021-10-01 09%3A29%3A32.474178 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:THRESHOLD:active_thresh_monitor:THRESHOLD_MONITOR:HEALTHY:2021-10-01 09%3A35%3A32.982353 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:THRESHOLD:SMBConnTotal_Rule:THRESHOLD_RULE:HEALTHY:2021-10-01 09%3A38%3A48.166426 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:PERFMON:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A24%3A17.345148 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:FILESYSTEM:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A48%3A17.866719 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:FILESYSTEM:cesSharedRootlv:FILESYSTEM:HEALTHY:2021-10-01 09%3A48%3A17.879658 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:FILESYSTEM:gpfs01lv:FILESYSTEM:HEALTHY:2021-10-01 09%3A48%3A17.893977 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:FILESYSTEM:gpfs02lv:FILESYSTEM:HEALTHY:2021-10-01 09%3A24%3A17.503234 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:GPFS:prscale-a-01:NODE:TIPS:2021-10-03 03%3A55%3A43.146453 CEST: | ||
+ | mmhealth:Event:0:1:::prscale-a-01:GPFS:prscale-a-01:NODE:callhome_not_enabled::2021-10-01 09%3A24%3A16.102766 CEST::no: | ||
+ | mmhealth:Event:0:1:::prscale-a-01:GPFS:prscale-a-01:NODE:gpfs_maxfilestocache_small::2021-10-01 09%3A24%3A16.139998 CEST::no: | ||
+ | mmhealth:Event:0:1:::prscale-a-01:GPFS:prscale-a-01:NODE:total_memory_small::2021-10-01 09%3A24%3A16.169878 CEST::no: | ||
+ | mmhealth:State:0:1:::prscale-a-01:NETWORK:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A24%3A17.308930 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:NETWORK:ens192:NIC:HEALTHY:2021-10-01 09%3A24%3A17.323203 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:FILESYSMGR:prscale-a-01:NODE:HEALTHY:2021-10-01 09%3A24%3A19.114252 CEST: | ||
+ | mmhealth:State:0:1:::prscale-a-01:FILESYSMGR:gpfs02lv:FILESYSTEMMGMT:HEALTHY:2021-10-01 09%3A24%3A19.137516 CEST: | ||
+ | </cli> | ||
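Note that -Y output percent-encodes ':' inside field values (the %3A in the timestamps), so that ':' stays a safe field separator. Decoding is a one-liner:

```shell
# Decode %3A back to ':' in a -Y timestamp field (example value from above).
ts='2021-10-03 03%3A55%3A43.152676 CEST'
decoded=$(printf '%s\n' "$ts" | sed 's/%3A/:/g')
echo "$decoded"   # 2021-10-03 03:55:43.152676 CEST
```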
+ | |||
+ | Check Protocols components | ||
+ | <cli prompt='#'> | ||
+ | [root@prscale-a-01 ~]# mmces state show -a -Y | ||
+ | mmces:stateShow:HEADER:version:reserved:reserved:NODE:AUTH:BLOCK:NETWORK:HDFS_NAMENODE:AUTH_OBJ:NFS:OBJ:SMB:CES: | ||
+ | mmces:stateShow:0:1:::prscale-a-01:DISABLED:DISABLED:HEALTHY:DISABLED:DISABLED:HEALTHY:DISABLED:HEALTHY:HEALTHY: | ||
+ | mmces:stateShow:0:1:::prscale-b-01:DISABLED:DISABLED:HEALTHY:DISABLED:DISABLED:HEALTHY:DISABLED:HEALTHY:HEALTHY: | ||
+ | </cli> | ||
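The same post-processing idea works here: read the service names from the HEADER line and report any service that is neither HEALTHY nor DISABLED. A sketch; the second data line is hypothetical (NFS set to DEGRADED) purely for illustration:

```shell
# Report, per node, any CES service that is neither HEALTHY nor DISABLED.
# Second data line is hypothetical (NFS=DEGRADED) to show a hit.
sample='mmces:stateShow:HEADER:version:reserved:reserved:NODE:AUTH:BLOCK:NETWORK:HDFS_NAMENODE:AUTH_OBJ:NFS:OBJ:SMB:CES:
mmces:stateShow:0:1:::prscale-a-01:DISABLED:DISABLED:HEALTHY:DISABLED:DISABLED:HEALTHY:DISABLED:HEALTHY:HEALTHY:
mmces:stateShow:0:1:::prscale-b-01:DISABLED:DISABLED:HEALTHY:DISABLED:DISABLED:DEGRADED:DISABLED:HEALTHY:HEALTHY:'
echo "$sample" | awk -F: '
  $3=="HEADER" {for(i=8;i<=16;i++) name[i]=$i; next}
  {for(i=8;i<=16;i++) if($i!="HEALTHY" && $i!="DISABLED") print $7": "name[i]"="$i}'
# prints: prscale-b-01: NFS=DEGRADED
```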
+ | |||
+ | Check particular event | ||
+ | <cli prompt='#'> | ||
+ | [root@prscale-a-02 ~]# mmhealth cluster show | ||
+ | |||
+ | Component Total Failed Degraded Healthy Other | ||
+ | ------------------------------------------------------------------------------------- | ||
+ | NODE 3 0 0 0 3 | ||
+ | GPFS 3 0 0 0 3 | ||
+ | NETWORK 3 0 0 3 0 | ||
+ | FILESYSTEM 3 0 0 3 0 | ||
+ | DISK 31 0 0 31 0 | ||
+ | CES 2 0 0 2 0 | ||
+ | CESIP 1 0 0 1 0 | ||
+ | FILESYSMGR 1 0 0 1 0 | ||
+ | GUI 3 0 1 2 0 | ||
+ | PERFMON 3 0 0 3 0 | ||
+ | THRESHOLD 3 0 0 3 0 | ||
+ | </cli> | ||
+ | |||
+ | <code> | ||
+ | mmhealth cluster show [ NODE | GPFS | NETWORK [ UserDefinedSubComponent ] | ||
+ | | FILESYSTEM [UserDefinedSubComponent ]| DISK [UserDefinedSubComponent ] | ||
+ | | CES |AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | ||
+ | | HADOOP |CLOUDGATEWAY | GUI | PERFMON | THRESHOLD | ||
+ | | AFM [UserDefinedSubComponent] ] | ||
+ | [-Y] [--verbose] | ||
+ | </code> | ||
+ | |||
+ | <cli prompt='#'> | ||
+ | [root@prscale-a-02 ~]# mmhealth cluster show GUI | ||
+ | |||
+ | Component Node Status Reasons | ||
+ | ------------------------------------------------------------------------------------------ | ||
+ | GUI prscale-q-b-01 HEALTHY - | ||
+ | GUI prscale-a-02 HEALTHY - | ||
+ | GUI prscale-a-01 DEGRADED gui_refresh_task_failed | ||
+ | </cli> | ||
+ | |||
+ | <code> | ||
+ | [root@prscale-a-01 ~]# /usr/lpp/mmfs/gui/cli/runtask help --debug | ||
+ | [AFM_FILESET_STATE, AFM_NODE_MAPPING, ALTER_HOST_NAME, CALLBACK, CALLHOME, CALLHOME_STATUS, CAPACITY_LICENSE, CES_ADDRESS, CES_STATE, CES_SERVICE_STATE, CES_USER_AUTH_SERVICE, CLUSTER_CONFIG, CONNECTION_STATUS, DAEMON_CONFIGURATION, DF, DISK_USAGE, DISKS, FILESETS, FILESYSTEM_MOUNT, FILESYSTEMS, FILE_AUDIT_LOG_CONFIG, GUI_CONFIG_CHECK, GPFS_JOBS, DIGEST_NOTIFICATION_TASK, HEALTH_STATES, HEALTH_TRIGGERED, HOST_STATES, HOST_STATES_CLIENTS, INODES, KEYSTORE, LOG_REMOVER, MASTER_GUI_ELECTION, MOUNT_CONFIG, NFS_EXPORTS, NFS_EXPORTS_DEFAULTS, NFS_SERVICE, NODE_LICENSE, NODECLASS, OBJECT_STORAGE_POLICY, OS_DETECT, PM_MONITOR, PM_SENSORS, PM_TOPOLOGY, POLICIES, QUOTA, QUOTA_DEFAULTS, QUOTA_ID_RESOLVE, QUOTA_MAIL, RDMA_INTERFACES, REMOTE_CONFIG, REMOTE_CLUSTER, REMOTE_FILESETS, REMOTE_GPFS_CONFIG, REMOTE_HEALTH_STATES, SMB_GLOBALS, SMB_SHARES, SNAPSHOTS, SNAPSHOTS_FS_USAGE, SNAPSHOT_MANAGER, SQL_STATISTICS, STATE_MAIL, STORAGE_POOL, SYSTEMUTIL_DF, TCT_ACCOUNT, TCT_CLOUD_SERVICE, TCT_NODECLASS, THRESHOLDS, WATCHFOLDER, WATCHFOLDER_STATUS, TASK_CHAIN] | ||
+ | </code> | ||
+ | |||
+ | <cli prompt='#'> | ||
+ | [root@prscale-a-01 ~]# /usr/lpp/mmfs/gui/cli/runtask CLUSTER_CONFIG --debug | ||
+ | debug: locale=en_US | ||
+ | debug: Running 'mmsdrquery 'sdrq_cluster_info' all ' on node localhost | ||
+ | debug: Running 'mmsdrquery 'sdrq_nsd_info' all ' on node localhost | ||
+ | debug: Running 'mmlscluster -Y ' on node localhost | ||
+ | debug: Running 'mmsdrquery 'sdrq_node_info' all ' on node localhost | ||
+ | debug: Running 'mmlsnodeclass 'GUI_MGMT_SERVERS' -Y ' on node localhost | ||
+ | debug: Running 'mmlsnodeclass 'GUI_SERVERS' -Y ' on node localhost | ||
+ | EFSSG1000I The command completed successfully. | ||
+ | </cli> | ||
<cli prompt='#'>
[root@prscale-a-01 ~]# mmhealth event show gui_refresh_task_failed
Event Name:   gui_refresh_task_failed
Event ID:     998254
Description:  One or more GUI refresh tasks failed. This could mean that data in the GUI is outdated.
Cause:        There can be several reasons.
User Action:  1.) Check if there is additional information available by executing '/usr/lpp/mmfs/gui/cli/lstasklog [taskname]'. 2.) Run the specified task manually on the CLI by executing '/usr/lpp/mmfs/gui/cli/runtask [taskname] --debug'. 3.) Check the GUI logs under /var/log/cnlog/mgtsrv. 4.) Contact IBM Support if this error persists or occurs more often.
Severity:     WARNING
State:        DEGRADED
[root@prscale-a-01 ~]# mmhealth event resolve 998254
The specified event gui_refresh_task_failed is not manually resolvable.
</cli>
<code>