https://www.linkedin.com/pulse/aix-live-update-cookbook-christian-sonnemans
http://gibsonnet.net/blog/cgaix/html/Installing_an_ifix_with_AIX_Live_Update.html
https://developer.ibm.com/articles/au-aix-db2-blu/
Doc reference
https://docplayer.net/185234528-Chris-s-best-practices-for-aix-live-update-for-2020.html
Introducing IBM AIX 7.2.1 Live Update http://www.ibm.com/developerworks/aix/library/au-aix7.2.1-liveupdate-trs/index.html?ca=drs-&ce=ism0070&ct=is&cmp=ibmsocial&cm=h&cr=crossbrand&ccy=us
The Live Update feature is intended for applying interim fixes that contain kernel changes or kernel extension changes that would otherwise require a reboot. The interim fix might contain other files (for example, commands and libraries), and the Live Update feature does not change anything about the way these files are applied. For example, a shared library will be modified on the file system, but any running processes continue to use the old version of the library. Therefore, applications that require a library fix must be stopped and restarted to load the new version of the library after the fix is applied. In AIX® Version 7.2 with the 7200-01 Technology Level, or later, you can use the genld -u command to list the processes that are using the old version of any shared libraries or other objects that are updated. You can use the list that is displayed by the genld -u command to identify the processes that must be stopped and restarted to load the updated objects.
Requirements:
Security restrictions: Live Update is not supported when the following Trusted Execution policies are enabled:
TEP=ON, TLP=ON, CHKSHLIB=ON, STOP_UNTRUSTD=ON, TSD_FILES_LOCK=ON
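A quick check (a sketch) that none of these policies are enabled before you start:

trustchk -p | egrep "TEP|TLP|CHKSHLIB|STOP_UNTRUSTD|TSD_FILES_LOCK"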
Which kernel module or old library is still in use? (AIX 7.2 and later only)
[root@nim]/root# oslevel -s
7200-02-02-1810
[root@nim]/root# genld -u
Proc_pid: 15466758      Proc_name: sshd
Proc_pid: 16253288      Proc_name: dsmc
Proc_pid: 16646408      Proc_name: sshd
[root@nim]/root# stopsrc -s sshd
0513-044 The sshd Subsystem was requested to stop.
[root@nim]/root# startsrc -s sshd
0513-059 The sshd Subsystem has been started. Subsystem PID is 8782122.
[root@nim]/root# genld -u
Proc_pid: 15466758      Proc_name: sshd
Proc_pid: 16253288      Proc_name: dsmc
Proc_pid: 16646408      Proc_name: sshd
[root@nim]/root# ps -ef | grep dsm
    root  6488440 15401388   1 11:51:14  pts/6  0:00 grep dsm
    root 15532328        1   0   Nov 13      - 115:39 /usr/tivoli/tsm/client/ba/bin/dsmc sched
    root 16253288 13173056   0   Oct 13  pts/0 18:11 dsmc i
[root@nim]/root# kill 15532328
[root@nim]/root# genld -u
Proc_pid: 15466758      Proc_name: sshd
Proc_pid: 16253288      Proc_name: dsmc
Proc_pid: 16646408      Proc_name: sshd
[root@nim]/root# ps -ef | grep 15466758
    root 13173056 15466758   0   Oct 13  pts/0  0:01 -ksh93
    root 15466758        1   0   Oct 13      -  0:00 sshd: root@pts/0
    root 17564108 15401388   1 11:52:31  pts/6  0:00 grep 15466758
[root@nim]/root# kill 15466758
[root@nim]/root# ps -ef | grep 16646408
    root 13173058 15401388   1 11:52:58  pts/6  0:00 grep 16646408
    root 16515370 16646408   0   Oct 16  pts/2  0:01 -ksh93
    root 16646408        1   0   Oct 16      -  0:00 sshd: root@pts/2
[root@nim]/root# kill 16646408
[root@nim]/root# genld -u
[root@nim]/root#
Using Live Update on a single server requires a connection to the HMC for each LPAR!
Check whether your application is supported with Live Update. Currently Oracle versions 12 through 19 are supported.
Prerequisite: the ARTEX (AIX Runtime Expert) packages
[root@laboh]/root# lslpp -Lc | egrep "live|hmc|pvc"
bos.sysmgt.pvc
bos.sysmgt.hmc
bos.liveupdate.rte
Create a user liveupdate on the HMC with the role hmcclientliveupdate.
Register the HMC on the LPAR:
[root@laboh]/root# hmcauth -u liveupdate -p XXXXXXXXXXXXX -a hmc01
[root@laboh]/root# hmcauth -l
Address  : 1.1.1.1
User name: liveupdate
Port     : 12443
You have to add a minimum of two disks with the same size as rootvg. In case of a mirrored rootvg, create twice as many disks. A rootvg mirrored with three copies is not supported!
Based on the template (/var/adm/ras/liveupdate/lvupdate.template), create the file /var/adm/ras/liveupdate/lvupdate.data.
[root@laboh]/root# cat /var/adm/ras/liveupdate/lvupdate.data
general:
        kext_check =
disks:
        nhdisk =
        mhdisk =
        tohdisk =
        tshdisk =
hmc:
        lpar_id =
        management_console =
        user =
Example
[root@laboh]/root# lspv
hdisk0          00fa343b8f050a76                    rootvg          active
hdisk2          00fa343b8f01b1e2                    None
hdisk10         00c2fcc8dfe7288e                    None
hdisk11         00c2fcc8dfe729fd                    None
hdisk1          00fa343b8f050ea7                    None
[root@laboh]/root# lsmpio -q
Device      Vendor Id  Product Id       Size       Volume Name
---------------------------------------------------------------------------------
hdisk0      IBM        2145             60.00GiB   a_laboh_bA0
hdisk2      IBM        2145             60.00GiB   a_laboh_bB0
hdisk10     IBM        2145             60.00GiB   a_laboh_bA0_ALT
hdisk11     IBM        2145             60.00GiB   a_laboh_bB0_ALT
hdisk1      IBM        2145             35.00GiB   a_laboh_dA0
I've made a clone on hdisk11 to be able to roll back.
[root@laboh]/root# cat /var/adm/ras/liveupdate/lvupdate.data
general:
        kext_check = yes
disks:
        nhdisk = hdisk2
        mhdisk = hdisk10
        tohdisk =
        tshdisk =
hmc:
        lpar_id =
        management_console = hmc01
        user = liveupdate
[root@laboh]/root# geninstall -k -p
...
LPAR: laboh
HMC: 1.1.1.1
user: liveupdate
...
+-----------------------------------------------------------------------------+
Live Update Preview Summary...
+-----------------------------------------------------------------------------+
The live update preview succeeded.
*******************************************************************************
End of Live Update PREVIEW: No Live Update operation has actually occurred.
*******************************************************************************
If you have efixes (interim fixes) installed, you have to remove them.
[root@laboh]/var# cat /proc/version
Aug 10 2022 13:47:51 IJ41685m3a stjohnr
@(#) _kdb_buildinfo unix_64 Aug 10 2022 13:47:51 IJ41685m3a stjohnr
[root@laboh]/var# genld -u
[root@laboh]/var# geninstall -k -d /mnt update_all
[root@laboh]/var# mount nim:/export/aix72/7200-05-05-2246_full /mnt
[root@laboh]/var# cd
[root@laboh]/root# geninstall -k -d /mnt update_all
# /usr/sbin/emgr -P
To remove the given EFIX, execute the following command:
# /usr/sbin/emgr -r -L <EFIX label>
For more information on EFIX management please see the emgr man page and documentation.
install_all_updates: Checking for recommended maintenance level 7200-05.
install_all_updates: Executing /usr/bin/oslevel -rf, Result = 7200-05
install_all_updates: Verification completed.
install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Result = FAILURE
geninstall: The live update will not proceed because of an install failure.
Please check your logs in /var/adm/ras/liveupdate/logs.
Please check /var/adm/ras/emgr.log for interim fix status.
[root@laboh]/root# genld -u
[root@laboh]/root# /usr/sbin/emgr -P
PACKAGE                  INSTALLER   LABEL
======================== =========== ==========
bos.rte.security         installp    IJ36810s3a
bos.mp64                 installp    IJ41685m3a
openssl.base             installp    1022104a
bos.net.tcp.bind_utils   installp    IJ40615m4b
bos.net.tcp.bind         installp    IJ40615m4b
Removing the efixes and rechecking:
[root@laboh]/root# emgr -P
There is no efix data on this system.
[root@laboh]/root# genld -u
Proc_pid: 7274980       Proc_name: sshd
Proc_pid: 8519944       Proc_name: sshd
Proc_pid: 11796918      Proc_name: sshd
Before Live Update:
[root@laboh]/root# uname -L
26 laboh
[root@laboh]/var# cat /proc/version
Aug 10 2022 13:47:51 IJ41685m3a stjohnr
@(#) _kdb_buildinfo unix_64 Aug 10 2022 13:47:51 IJ41685m3a stjohnr
[root@laboh]/var# genld -u
[root@laboh]/var# oslevel -s
7200-05-03-2148
Starting the AIX update with Live Update:
[root@laboh]/root# geninstall -Y -k -d /mnt update_all
install_all_updates: Initializing system parameters.
install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Checking for updated install utilities on media.
install_all_updates: Processing media.
Moving workload to surrogate LPAR.
............
Blackout Time started.
Blackout Time end.
Workload is running on surrogate LPAR.
....................................................
Shutting down the Original LPAR.
............................The live update operation succeeded.

Broadcast message from root@laboh (pts/1) at 15:46:25 ...
Live AIX update completed.

File /etc/inittab has been modified.
One or more of the files listed in /etc/check_config.files have changed.
See /var/adm/ras/config.diff for details.
After Live Update:
[root@laboh]/root# uname -L
26 laboh
[root@laboh]/root# cat /proc/version
May  4 2022 12:25:32 2218B_72X
@(#) _kdb_buildinfo unix_64 May  4 2022 12:25:32 2218B_72X
[root@laboh]/root# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
12295E0B   0207154623 I S LVUPDATE       Live AIX update completed successfully
CB4A951F   0207154223 I S SRC            SOFTWARE PROGRAM ERROR
9DBCFDEE   0207154223 T O errdemon       ERROR LOGGING TURNED ON
9A74C7AB   0207152323 I S LVUPDATE       Live AIX update started
[root@laboh]/root# oslevel -s
7200-05-04-2220
[root@laboh]/root# genld -u
Proc_pid: 3473882       Proc_name: srcmstr
Proc_pid: 4981166       Proc_name: portmap
Proc_pid: 5636522       Proc_name: syslogd
Proc_pid: 5833188       Proc_name: xntpd
Proc_pid: 6029758       Proc_name: biod
Proc_pid: 6095348       Proc_name: secldapclntd
Proc_pid: 6357476       Proc_name: ucybsmgr
Proc_pid: 6554102       Proc_name: cupsd
Proc_pid: 6750684       Proc_name: rpc.lockd
Proc_pid: 7012624       Proc_name: zabbix_agentd
Proc_pid: 7078156       Proc_name: ucybsmgr
...
After restarting multiple services, everything is OK (I just needed to kill -9 the srcmstr process).
[root@laboh]/root# lspv
hdisk0          00fa343b8f050a76                    rootvg          active
hdisk2          00fa343b8f01b1e2                    lvup_rootvg
hdisk10         00c2fcc8dfe7288e                    None
hdisk11         00c2fcc8dfe729fd                    altinst_rootvg
hdisk1          00fa343b8f050ea7                    None
The main logging for Live Update can be found on the LPAR itself in the following path:
/var/adm/ras/liveupdate/logs/lvupdlog
If any service issue occurs with the Live Update operation, the snap -U command collects all the required information for the support team.
Attempt to clean up partition after a failed Live Update operation:
clvupdate [-e] [-o] [-u]
clvupdate [-e] [-n] [-u]
clvupdate -d
clvupdate -l
clvupdate -r
clvupdate -v
Flags:
-d   Remove surrogate boot disks only
-e   Ignore Live Update state
-l   Unlock Live Update lock only
-n   Run LVUP_ERROR phase scripts
-o   Force original shutdown
-r   Reset Live Update state and status only
-u   Ignore Live Update status
-v   Remove volume group from previous Live Update
Live Update Cookbook By Christian Sonnemans posted Tue March 30, 2021 05:24 AM
Live Update NIM Cookbook
AIX Live Kernel Update (Live Update) has been available for AIX 7.2 for several years now, and from a customer perspective it is a very good feature. It can help system administrators avoid working after hours to patch their systems. As a customer myself and an AIX admin, I made a concentrated effort to enable the Live Update feature in my environment. So, I've written this to help other customers do the same using my "Live Update cookbook". This article guides you through a TL upgrade, invoking it from a NIM master or manually from the LPAR itself. The same steps can be used for updating to a SP or for a single ifix that requires a kernel update (and reboot).
Purpose: Set up a working NIM environment for Live Update.
Steps to take: Set up a NIM or local LPAR environment in the right order. Use the checklist for the LPAR and take the right steps to make a Live Update successful. Pitfalls and workarounds, logging and tracing.
Minimum requirements for a NIM environment: In this section I describe how to set up your NIM environment and show some sample configurations. I've included some other helpful links on this topic at the end of this blog, under the references section. We start by configuring and defining all the required NIM objects and communication between the NIM master, HMC, CEC and VIO servers (VIOS).
# lslpp -l | grep dsm
dsm.core
dsm.dsh
Create the user liveupdate on the HMC and save its password in a secure place. Please refer to the example below.
Create a password keystore for storing the password that belongs to the user that can log on to the HMC; see the following example:
# /usr/bin/dpasswd -f /export/nim/hmc_liveupdate_passwd -U liveupdate
[root@nim]/root# /usr/bin/dpasswd -f /export/nim/hmc_liveupdate_passwd -U liveupdate
Password file is /export/nim/hmc_liveupdate_passwd
Warning! Password will be echoed to the screen as it is typed.
On the NIM master, define an HMC object with the following NIM command. See the example below.
Syntax example:
# nim -o define -t hmc -a passwd_file=EncryptedPasswordFilePath -a if1=InterfaceDescription -a net_definition=DefinitionName HMCName
More detailed example:
# nim -o define -t hmc -a if1="find_net <hmc-name> 0" -a net_definition="ent 255.255.255.0" -a passwd_file=/export/nim/hmc_liveupdate_passwd HMCname
[root@nim]/root# nim -o define -t ent -a net_addr="1.1.0.0" -a snm="255.255.255.0" -a routing1="default 1.1.0.1" vlan4
[root@nim]/root# nim -o define -t hmc -a if1="find_net hmc01 0" -a net_definition="ent 255.255.255.0" -a passwd_file=/export/nim/hmc_liveupdate_passwd hmc01
check the NIM object:
Tip: to learn more about your existing NIM networks use the following commands:
[root@nim]/root# lsnim -l vlan4
vlan4:
   class      = networks
   type       = ent
   Nstate     = ready for use
   prev_state = ready for use
   net_addr   = x.X.X.X
   snm        = x.X.X.X
   routing1   = default x.X.X.X
[root@nim]/root# lsnim -l hmc01
hmc01:
   class       = management
   type        = hmc
   passwd_file = /export/nim/hmc_liveupdate_passwd
   mgmt_port   = 12443
   if1         = vlan4 hmc01 0
   Cstate      = ready for a NIM operation
   prev_state  =
   Mstate      = currently running
Exchange ssh keys between the HMC and NIM master:
# dkeyexch -f /export/nim/hmc_liveupdate_passwd -I hmc -H <hmc_name>
[root@nim]/root# dkeyexch -f /export/nim/hmc_liveupdate_passwd -I hmc -H hmc01
OpenSSH_8.1p1, OpenSSL 1.0.2u 20 Dec 2019
Exception in thread "main" java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1023)
        at com.ibm.sysmgt.dsm.dkeyexch.dkeyexch.main(dkeyexch.java:639)
Test of key exchange:
[root@nim]/root# nimquery -a hmc=hmc01
name=xxxxx,type_model=9105-42A,serial_num=78xxxx,ipaddr=10.254.0.3,sp_type=ebmc
....
Workaround, since we hit the error above: add the root public key from the NIM master manually on the HMC.
hscroot@hmc01:~> mkauthkeys -u liveupdate -a 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxP+ZXcjHvCfE4gOv6mSRetw== root@nim'
This is also useful if you would like to create VIOS backups via a central NIM server. A pitfall for VIOS backups is that you have to open all the NIM port ranges if you use a VIOS firewall. If you don't use a VIOS firewall, there is no need to perform this step. Example:
# viosecure -firewall allow -port <portnumber> -address <nim_server>
# viosecure -firewall allow -port <portnumber> -address <nim_server> -remote
The ports are: 1019, 1020, 1021, 1022, 1023, 1058, 1059, 3901, 3902, 5813.
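A small sketch for opening all of those ports in one go (run as padmin on the VIOS; <nim_server> is the address of your NIM master):

for PORT in 1019 1020 1021 1022 1023 1058 1059 3901 3902 5813
do
    # allow NIM traffic from the NIM master on this port, local and remote
    viosecure -firewall allow -port ${PORT} -address <nim_server>
    viosecure -firewall allow -port ${PORT} -address <nim_server> -remote
done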
An easy way to add all servers into NIM: use nimquery with -d to define all CECs:
[root@nim]/root# nimquery -a hmc=hmc01 -d
name=xxxx,type_model=9105-42A,serial_num=xxxx
...
Or manually
Example:
nim -o define -t cec -a hw_type=9009 -a hw_model=42A -a hw_serial=<number> -a mgmt_source=hmc-name cec-name
[root@nim]/root# nim -o define -t cec -a hw_type=9080 -a hw_model=M9S -a hw_serial=xxxxx -a mgmt_source=hmc01 p1020
check with command:
[root@nim]/root# lsnim -t cec
9105-42A_xxxxxxx   management   cec
...
Register a NIM object type VIOS
nim -o define -t vios -a if1="<network-name> <vio-name> 0 entX" -a mgmt_source="<cec-name>" -a identity=<LPAR-id> <vio-name>
More detailed example:
nim -o define -t vios -a if1="adminnetwork vio1 0 ent7" -a mgmt_source="p9-S924-testserver" -a identity=1 <vio-name>
[root@nim]/root# nim -o define -t vios -a if1="vlan4 viodevb1 0 ent1" -a mgmt_source="9080-M9S_xxxx" -a identity=3 viodev1
[root@nim]/root# nim -o define -t vios -a if1="vlan4 viodevb2 0 ent1" -a mgmt_source="9080-M9S_xxxx" -a identity=4 viodev2
Register a NIM object type “client”
Each VIOS must be registered as a NIM client (to the NIM master); this can be done in one of two ways.
First method to do this is with the default padmin user:
$ remote_management -interface <interface> <nimserver>
The disadvantage of this method is that if you have more than one network interface on the VIOS you cannot specify the desired interface.
The second method gives you more control by allowing the administrator to choose the desired FQDN: Using oem_setup_env, run the following command as root on the VIOS:
$ oem_setup_env
# niminit -a master_port=1058 -a master=<FQDN-name-nimmaster> -a name=<vios-FQDN-name>
[root@viodev1]/home/padmin# niminit -a master_port=1058 -a master=nim -a connect=nimsh -a name=viodevb1
nimsh:2:wait:/usr/bin/startsrc -g nimclient >/dev/console 2>&1
0513-059 The nimsh Subsystem has been started. Subsystem PID is 31260968.
Also, in the oem_setup_env environment you can now check the file /etc/niminfo to ensure the NIM definitions for the client and the master are correct:
cat /etc/niminfo
Final check on your NIM server if you can access via NIM your vio servers with:
nim -o lslpp <vio-server name>
Also, on the NIM server check if all the management resources are defined and available with:
lsnim -c management
[root@nim]/root# lsnim -c management
hmc01              management   hmc
9105-42A_xxxxxxx   management   cec
viodev1            management   vios
viodev2            management   vios
and for each object
lsnim -l <management_object>
It is necessary to update the NIM client object with a valid management profile.
This can be done via the command below:
nim -o change -a identity=<LPAR_ID> -a mgmt_source=<CEC> <lpar_name>
More detailed example:
[root@nim]/root# nim -o change -a connect=nimsh -a platform=chrp -a netboot_kernel=64 -a identity=26 -a mgmt_source=9080-M9S_xxxx laboh
[root@nim]/root# lsnim -l laboh
laboh:
   class          = machines
   type           = standalone
   connect        = nimsh
   platform       = chrp
   netboot_kernel = 64
   if1            = vlan4 laboh 6E5AFxxxxx ent0
   cable_type1    = N/A
   mgmt_profile1  = hmc01 26 9080-M9S_xxxxx
   Cstate         = ready for a NIM operation
   prev_state     = ready for a NIM operation
   Mstate         = currently running
   cpuid          = 00C2FCxxxxx
This is a very tedious job if you have a lot of LPARs, so we created a script for this that gathers the right information from the HMC and then updates the extra management profile for every LPAR.
See example script-1 below.
Each LPAR will need two additional (unused) disks before the surrogate LPAR can be created. These "spare" disks must be available on the LPAR before you can run Live Update. Also, on the NIM master, there must be an object that holds the correct information about those disks (i.e. the Live Update data configuration file).
It's useful to create a liveupdate_data NIM object for every LPAR if you would like to run NIM Live Updates in parallel.
nim -o define -t live_update_data -a server=master -a location=/export/nim/liveupdate/liveupdate_data_laboh liveupdate_data_laboh
Below is an example of a liveupdate_data_<lpar_name> file. Note: the path is also just an example; it can be any path where you store your NIM objects.
[root@nim]/root# cat /export/nim/liveupdate/liveupdate_data_laboh
general:
        kext_check = yes
disks:
        nhdisk = hdisk0
        mhdisk = hdisk10
        tohdisk =
        tshdisk =
hmc:
        lpar_id =
        management_console = hmca01
        user = liveupdate
This is again a tedious job, so you can create a script for it as well; the script that I created is customer specific and therefore not published.
Depending on what you want to update with Live Update, you need to prepare your lpp_source for an SP or TL upgrade.
SP or TL lpp_source: The steps for creating a valid lpp_source from base images can be found on the IBM support page: How to create a spot and lpp_source from an ISO image. There are several ways to create your lpp_source, but that is beyond the scope of this blog. I would, however, like to give one example that can be used on the NIM master:
1. Download the right TL/SP levels from Fix Central.
2. Put extra software (for example, for the expansion pack) into the same source/download directory.
3. Run the example below with your correct paths:
nim -o define -t lpp_source -a server=master \
    -a location="/install/nim/lpp_source" \
    -a source=<path_downloaded_files_from_fixcentral> <lpp_source_name>
ifix lpp_source: I would also like to mention the possibility of using Live Update for ifixes. Some ifixes include kernel fixes and would normally require a reboot. We can use Live Update for these as well, and no reboot is needed! The update source is also an lpp_source, but this source contains only ifixes. You can put multiple ifixes in one directory and then define it as an lpp_source.
NIM example:
nim -o define -t lpp_source -a server=master -a location="/install/nim/lpp_source/<ifixes-dir>" <lpp_source_ifixes>
To check if your ifix is Live Update capable, use the emgr command as in the following example:
emgr -d -e IJ28227s2a.200922.epkg.Z -v2 | grep "LU CAPABLE:"
LU CAPABLE: yes
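If you have a whole directory of ifixes, a small sketch to check them all at once (the directory name is just an example):

cd /install/nim/lpp_source/<ifixes-dir>
for EPKG in *.epkg.Z
do
    # print the ifix file name and its Live Update capability flag
    printf "%s: " ${EPKG}
    emgr -d -e ${EPKG} -v2 | grep "LU CAPABLE:"
done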
After this step you can do a final check that you have all the required resources:
lsnim | egrep "hmc|cec|vios|live"
After this your NIM environment is ready for Live Update from your NIM master.
Sample script-1
#!/bin/ksh93
LPAR=$1
ping -c3 ${LPAR} >/dev/null 2>&1
if [ $? -ne 0 ]
then
        printf "could not ping host or wrong name <LPARname> \n"
        exit 1
fi
# Gather the serial number and partition id from the LPAR itself
serialnumber=$(ssh ${LPAR} prtconf | grep "Machine Serial Number" | awk '{ print $4 }')
LPAR_id=$(ssh ${LPAR} lparstat -i | grep "Partition Number" | awk '{ print $4 }')
# Get the right managed system name from the HMC
mgmt_sys=$(ssh user@hmc "lssyscfg -r sys -F name |grep ${serialnumber}")
printf "${LPAR} has LPAR id: ${LPAR_id} and is currently running on mgmt_system: ${mgmt_sys} \n"
nim -o change -a identity=${LPAR_id} -a mgmt_source=${mgmt_sys} ${LPAR}
lsnim -l ${LPAR}
exit 0
Before the actual Live Update starts, there are several requirements you must meet. Also, users that are using TE (Trusted Execution) must perform some extra steps, and the same goes for users that are using ipsec_v4 or ipsec_v6 (IP filtering). In this section I describe those extra steps. In our case we created a pre-script that runs these checks for us and puts the LPAR into the right state. Because this script is very customer specific, I describe the common steps here.
Users that are using TE must first disable it during Live Update. On the LPAR you run:
trustchk -p te=off
Note: as a special case, users that store the TE and/or RBAC databases on a read-only LDAP server must modify /etc/nscontrol.conf so that during the update only the local /etc/security/tsd/tsd.dat and /etc/security/<rbacfiles> files are modified. This is not Live Update specific, but should be done with any update to another SP or TL level. Example for the tsd.dat file:
chsec -f /etc/nscontrol.conf -s tsddat -a secorder=files
Make a backup of your IP filter rules and stop IP filtering on the LPAR.
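A minimal sketch of this step, assuming the standard AIX IP Security commands (expfilt to export the rule set, mkfilt -d to deactivate it):

mkdir -p /tmp/ipsec_backup
# export the IPv4 filter rules to a backup directory
expfilt -v 4 -f /tmp/ipsec_backup
# deactivate the IPv4 filter rules
mkfilt -v 4 -d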
After this remove the devices with:
rmdev -dl ipsec_v4 -R
rmdev -dl ipsec_v6 -R
Stop the processes that can cause Live Update to fail. We discovered several processes that caused Live Update to fail. One was WLM (Workload Manager); we found out that this was started via the inittab even though we do not use it. Therefore we shut it down and also disable it in /etc/inittab (if you do not disable this entry in the inittab, the surrogate will execute it and start WLM again). On the LPAR use:
/usr/sbin/wlmcntrl -o
chitab "rcnetwlm:23456789:off:/etc/rc.netwlm start > /dev/console 2>&1 # disable netwlm"
Alternatively, if you do not use WLM, you can of course simply remove it from the /etc/inittab with:
# rmitab rcnetwlm
Other processes that we stop before we start the Live Update are: backup agents and monitoring scripts such as nimon, nmon, lpar2rrd, etc. The reason for this: you do not want a backup starting during the upgrade phase, and monitoring can cause open files in /proc, which Live Update does not like. See also the restrictions on the IBM page: https://www.ibm.com/support/knowledgecenter/ssw_aix_72/install/lvupdate_detail_restrict.html
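A quick way to spot such processes before you start (a sketch; adjust the pattern to your own backup and monitoring agents):

ps -ef | egrep "nmon|nimon|lpar2rrd|dsmc" | grep -v grep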
Remove all ifixes on the LPAR if present. This is normal practice before an upgrade to a higher TL or SP and, without Live Update, usually requires a reboot if the ifixes contain kernel patches.
In this case we only remove the ifixes and do not reboot; otherwise it would not be a Live Update.
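A minimal sketch for removing all installed efixes in one go (assuming emgr -P lists them with the label in the third column, as shown earlier):

# remove every installed efix, without a reboot
for LABEL in $(emgr -P | awk 'NR>2 && NF>=3 {print $3}')
do
    emgr -r -L ${LABEL}
done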
If the following filesets are installed on the LPAR, they must be removed, otherwise Live Update will fail. This can be found in the requirements for Live Update: https://www.ibm.com/support/knowledgecenter/ssw_aix_72/install/lvupdate_limitations.html
It concerns the filesets bos.mls.cfg and security.pkcs11
geninstall -u -I "g -J -w" -Z -f filelist, where the file list contains the mentioned filesets
or
installp -ugw bos.mls.cfg security.pkcs11
Step 6: This step is also TL02SP2 specific and is solved in TL03 (see pitfall number 4). Copy a modified version of c_cust_Live Update to /usr/lpp/bos.sysmgt/nim/methods/ on the LPAR. The modification made was:
vi /usr/lpp/bos.sysmgt/nim/methods/c_cust_Live Update
Around line 892, change:
geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k"
To:
geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k -Y"
The Live Update will then accept the -Y (agree to license) flag on the LPAR.
Another solution is to install the right APAR for this; see the list below:
IJ02913 for level: 7200-01-05-1845
IJ08917 for level: 7200-02-03-1845
IJ05198 for TL level: 7200-03
Of course it's always better to start with LKU at the latest available level. But the goal of LKU is updating to a higher level; this is the reason why I describe these pitfalls and mention the solution for those levels.
This step was also only needed on TL02 SP2; see pitfall number 5 for more detail. Run the following once on the NIM master and also on every LPAR:
lvupdateRegScript -r -n AUTOFS_UMOUNT -d orig -P LVUP_CHECK
We discovered that if there are programs running on the LPAR that have open files in /proc, Live Update will fail during the blackout period (see pitfall 1). You can prevent this by checking the LPAR with lsof | grep proc. If you get output, you should stop the executables that are causing it, such as truss or debugging programs.
Check if there is enough space in /var and the root filesystem, /. You need a minimum of 100 MB of free space, but for safety we verify that we have at least 500 MB of free space in those filesystems.
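A small sketch of that check (on AIX, df -m reports the free space in MB in the third column):

for FS in / /var
do
    FREE=$(df -m ${FS} | awk 'NR==2 {print int($3)}')
    [ ${FREE} -lt 500 ] && echo "WARNING: only ${FREE} MB free in ${FS}"
done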
Before running Live Update, I recommend creating a backup of your system. You have several options for this: NIM, mksysb, or alt_disk_copy. I prefer to use alt_disk_copy because it only takes one reboot to fall back to the old rootvg.
Make a copy of your current rootvg on a separate disk with the alt_disk_copy tool.
The steps are very simple and straightforward, and this is a real time saver if something goes terribly wrong!
Steps are:
Run cfgmgr -v after you have added the extra disk via zoning or storage pools.
In case you use TE, run trustchk -p te=off.
Run the command:
alt_disk_copy -B -d <destination-disk>
You can add the -B option to prevent changing the bootlist.
Check / modify your boot list back to the original disk:
bootlist -m normal <original_rootvg_disk>
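If the update does go terribly wrong, a minimal rollback sketch (assuming the alt_disk_copy clone is on <alt_disk>):

# point the bootlist at the alternate rootvg clone and reboot from it
bootlist -m normal <alt_disk>
shutdown -Fr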
Check the profile for the LPAR on the HMC. The minimum amount of memory for the active profile must be 2048 MB (2 GB). Be aware that the LPAR must be booted with this profile; it cannot be changed afterwards. This is actually an AIX 7.2 requirement: you need a minimum of 2 GB to run the OS.
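If you want to verify this from the HMC command line, a hedged sketch (assuming the lshwres memory attributes curr_min_mem and curr_mem):

lshwres -r mem -m <managed_system> --level lpar --filter "lpar_names=<lpar_name>" -F curr_min_mem,curr_mem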
Let’s update without doing a reboot!
Now that we have done a lot of checks and verifications, it's time for the actual goal: updating the TL/SP level without a reboot. In this section we describe several ways to initiate Live Update with NIM. My advice: if you are new to this technology, first do some test runs in a test environment, and to speed up your tests, first make an alt_disk_copy on a new disk. That way you can roll back with a single reboot; see step 10 in the previous section.
Live Update with NIM: For a preview you can run the following examples with your valid lpp_source and liveupdate_data NIM resources.
Dry run: note the -p flag for preview.
nim -o cust -a lpp_source=72000402_lpp_source -a installp_flags="-p" -a fixes=update_all -a live_update=yes -a live_update_data=liveupdate_data_<LPAR-name> <LPAR-name>
Real update with Live Update without the -p flag:
nim -o cust -a lpp_source=72000402_lpp_source -a fixes=update_all -a live_update=yes -a live_update_data=liveupdate_data_<LPAR-name> <LPAR-name>
When starting this command via a script on your NIM server or via the command line, you can monitor the standard output on your NIM master server. For the logging on the LPAR, see the chapter on logging.
Testing Live Update on an LPAR without using NIM: For testing purposes, you can start Live Update manually on the LPAR itself without using NIM. If you did not apply any new updates or ifixes, you can use this method as a kind of "dry run" to see if it will create a surrogate and move your LPAR to a new one (monitor the LPAR id). Be aware: this is not a real dry run; it does a Live Update to a new LPAR, but it is a good test of whether your workload can survive a Live Update action.
Make sure that you first authenticate against the HMC where you created the liveupdate user.
Command example below:
hmcauth -u <liveupdate-user> -p <password> -a <hostname-hmc>
check with:
hmcauth -l
output example:
Address  : www.xxx.yyy.zzz
User name: liveupdate
Port     : 12443
On the LPAR, run the command below to start the Live Update. First check in /var/adm/ras/liveupdate on the LPAR that the two free disks are specified in the file lvupdate.data.
geninstall -kp    <-- for preview first
geninstall -k     <-- for the real update
During the Live Update you can monitor the process on the NIM master or, if started on the LPAR, on the LPAR itself. On the HMC you can partly monitor the process, such as creating the donor and cleaning up the old LPAR after the process is done. See below a screenshot of the HMC during the Live Update action: [HMC progress]
The picture above shows that the original LPAR is renamed to _Live Update0 and the surrogate is created with the original name.
Apply an ifix with Live Update via the NIM master: This is roughly the same method as a TL or SP upgrade, and for NIM you need the same steps. See also the first part on how to define an ifix NIM lpp_source. The minimum requirements are also the same: at least two free disks on the LPAR, the liveupdate_data file specified in NIM, connection and authentication to the HMC, IP filtering disabled, etc.
Apply the ifix preview with the following example:
nim -o cust -a live_update=yes -a lpp_source=Live Update-ifixes -a installp_flags="-p" -a live_update_data=liveupdate_data_<lpar_name> -a filesets=1022100a.210112.epkg.Z <lpar_name>
Leaving out the installp_flags="-p" (preview) attribute will apply the ifix with Live Update.
Tip: You can find detailed logging for the ifix install on the LPAR in the following location:
/var/adm/ras/emgr.log
Workaround: separate the upgrade from the Live Update actions.
Workaround for pitfall 3 (PFC daemon waits too long to start):
The workaround for this problem is a good example of mitigating problems that can arise during the upgrade of a TL or SP level. This workaround separates the upgrade from the actual Live Update actions, and in this case it makes it possible to start required daemons before the Live Update actions start. See the example below: the first part is a "normal" NIM update, after which all updates are committed. The second part is the actual Live Update, but now there are no packages left to apply, so only the Live Update actions are performed.
nim -o cust -a lpp_source=72000402_lpp_source -a fixes=update_all -a accept_licenses=yes <lpar_name>
# now commit updates and start pfcdaemon
ssh <lpar_name> installp -c ALL
ssh <lpar_name> startsrc -s pfcdaemon
nim -o cust -a lpp_source=72000402_lpp_source -a fixes=update_all -a live_update=yes -a live_update_data=liveupdate_data_<lpar_name> <lpar_name>
Fallback option for pitfall 6: The same workaround as used above can also serve as a kind of fallback option for the chicken-and-egg problem described in pitfall 6, when your workload fails to move to the surrogate LPAR (a failure during the blackout). The normal behavior without this workaround is a rollback to the original state. With this workaround the actual upgrade is already done before the Live Update action starts, so now you have a choice:
1. Stop your workload and make another attempt to move to the upgraded surrogate LPAR. That is, repeat the Live Update action without the workload; this is no longer a real Live Update, because the goal was to keep your workload running.
OR
2. Because you have to shut down your workload anyway (so it is no longer a real Live Update), you can of course also simply reboot your LPAR. After this your upgrade is done, but not via Live Update.
Logging: The main logging for Live Update can be found on the LPAR itself in the path /var/adm/ras/liveupdate/logs. In this path you can find the file lvupdlog, or this file with a timestamp. The log file can be cryptic to read, but searching for keywords like ERROR (all in caps) can help you find a clue about what went wrong in case of a failure.
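A small sketch for a first pass through the log (the exact log file name may carry a timestamp):

cd /var/adm/ras/liveupdate/logs
ls -ltr                               # the newest lvupdlog* file belongs to the last run
grep -n "ERROR" lvupdlog | tail -20   # show the last errors with their line numbers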
When you must send logs to IBM for analysis, it is good to gather the following logs. Make a tar file of everything in the path /var/adm/ras/liveupdate. Example:
cd /var/adm/ras/liveupdate
tar cvf liveupdate_logs.tar *
Also create a snap -ac and send this information to IBM as well.
Pitfalls:
1: Open files in /proc:
Message during blackout: 1020-311 The state of the LPAR is currently non-checkpointable. Checkpoint should be retried later. [02.229.0730]
Problem: programs such as truss or a debugger have files open in /proc. In the log file:
OLVUPDMCR 1610089744.160 DEBUG mcr_fd_srm_dev - Calling lvup_chckpnt_status(LIVE_UPDATE_TRY)
KLVUPD 1610089744 DEBUG - 6750576/24904167 DENY Blocking entry: procfs open, count: 9
OLVUPDMCR 1610089754.160 ERROR mcr_fd_srm_dev - 1020-311 The state of the LPAR is currently non-
OLVUPDMCR 1610089754.160 ERROR mcr_sync - Action LU_FREEZE failed in SRM with status 6
2: WLM manager running or entry in inittab:
rcnetwlm:23456789:wait:/etc/rc.netwlm start > /dev/console 2>&1 # Start netwlm
Change it to:
chitab "rcnetwlm:23456789:off:/etc/rc.netwlm start > /dev/console 2>&1 # disable netwlm"
and run:
/usr/sbin/wlmcntrl -o
3: PFC daemon waits too long to start. The script for this daemon waits too long (TL02SP2) to start: 5 minutes. Message in the logging:
1430-134 Notification script '/opt/triton/daemon-stop' failed with exit status 1
Cause: this happens because the pfc daemon is not started or takes too long to start.
Workaround: first apply all updates, then start the pfcdaemon again before the Live Update starts with:
startsrc -s pfcdaemon
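A quick verification (sketch) that the daemon is really active before kicking off the Live Update:

lssrc -s pfcdaemon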
4: Using Live Update on TL02-SP2, you get a strange license error during preview. You are hitting IJ08917: license errors when running Live Update through NIM. This is solved in 7200-02-03. On the NIM client LPAR, edit the following file and add the -Y (agree to license) flag:
# vi /usr/lpp/bos.sysmgt/nim/methods/c_cust_Live Update
Around line 892
Change:
geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k"
To:
geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k -Y"
Or apply the following ifix:
IJ02913 for level: 7200-01-05-1845
IJ08917 for level: 7200-02-03-1845
IJ05198 for TL level: 7200-03
5: Using Live Update on TL02-SP2, it fails after the message notifying applications of the impending Live Update:
1430-134 Notification script '/usr/lpp/nfs/autofs/autofs_Live Update orig' failed with exit status 127
….1430-025 An error occurred while notifying applications.
1430-045 The Live Update operation failed. Initiating log capture.
Solution: run the following once on the NIM master and also on every LPAR:
lvupdateRegScript -r -n AUTOFS_UMOUNT -d orig -P LVUP_CHECK
This is fixed in:
IJ04531 - 7200-02-02-1832
IJ04532 - 7200-03-00-1837
IV89963 - 7200-01-02-1717
6: In the blackout window you get several messages like the ones below.
The Live Update logs will show the following error against the library that is causing Live Update to fail.
mcr: cke_errno 52 cke_Sy_error 23349 cke_Xtnd_error 534 cke_buf /usr/lib/libC.a(shr_64.o)
In the NIM output you get:
Blackout Time started.
1020-007 System call mcrk_sys_spl_getfd error (50). [03.156.0064]
1020-007 System call mcrk_sys_spl_getfd error (50). [03.156.0064]
1020-007 System call MCR_GET_SHLIB_ASYNCINFO error (52). A file, file system or message queue is no longer available. [02.239.0142]
……..1430-030 An error occurred while moving the workload.
1430-045 The Live Update operation failed.
This is not quite a pitfall but more a bug that you are hitting. Problem description: the current kernel cannot properly release the shared libraries for non-root users.
Workaround: No real workaround.
Option 1: shutdown your workload and restart the Live Update operation. But in this way, it’s not a real Live Update anymore!
Option 2: Request an ifix for the level you are updating to. If you plan on updating to 7200-05-01, you need an ifix for 7200-05-01. But with this option you have a chicken-and-egg problem, because the ifix is only active (loaded) in the new kernel. In other words, it is only fixed after a reboot or after a Live Update has successfully completed (option 1).
Levels that suffer from this bug (impacted levels) are:
Service pack: 7200-05-01-2038 (bos.mp64 7.2.5.1), you need IJ30267 (the official APAR fix is now in 7200-05-02)
Technology Level: 7200-05-00-2037 (bos.mp64 7.2.5.1), you need IJ30267
Service pack: 7200-04-02-2028 (bos.mp64 7.2.4.6), you need IJ29492
Service pack: 7200-04-01-1939 (bos.mp64 7.2.4.2), you need IJ29492
Technology Level: 7200-04-00-1937 (bos.mp64 7.2.4.2), you need IJ29492
Because this fix is for a shared library, the only way for your application to pick up the latest library code is to stop your application and restart it using the updated code, or to reboot your system.
Going forward though, you do not need to stop your applications if you have the ifix installed and update to the next update of AIX.
7: The program /usr/bin/domainname fails if no NIS domainname is defined.
This pitfall only happens when you have installed the NIS fileset:
# lslpp -w /usr/bin/domainname
File                       Fileset                Type
/usr/bin/domainname        bos.net.nis.client     File
NIS is an old way of resolving hostnames, also known as Yellow Pages. My advice is to remove this fileset because it's old and insecure, unless you still use it. Another workaround is to replace the executable with a script that does nothing but exit with a 0 return code.
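A sketch of that second workaround, assuming you keep a backup of the original binary:

# keep the original binary and replace it with a no-op script
mv /usr/bin/domainname /usr/bin/domainname.orig
cat > /usr/bin/domainname <<'EOF'
#!/bin/ksh
exit 0
EOF
chmod 555 /usr/bin/domainname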
8: In the blackout window you get the following misleading message:
1020-154 Resource /old/dev/nvram is not checkpointable (pid: 19005844) (name: lscfg_chrp). Query returned 22 Invalid argument [02.229.0462]
1020-089 Internal error detected. [02.335.0458]
However, investigating the logs, you also see messages like:
mcr: cke_errno 52 cke_Sy_error 23349 cke_Xtnd_error 534 cke_buf /usr/lib/libC.a(shr_64.o)
Digging further into the logs shows that the connection with the HMC(s) is lost.
[RPP#rpp::ManagedSystem::display] #### MANAGED SYSTEM #################
[RPP#rpp::ManagedSystem::display] ## NAME : p9-S924-zwl-1-7835130
[RPP#rpp::ManagedSystem::display] ## OP STATUS : OperationError::OP_PARAM_ERROR
[RPP#rpp::ManagedSystem::display] ## RUUID : d1197d99-bb08-38ea-b447-76f6073f4210
[RPP#rpp::ManagedSystem::display] ## MTMS : 9009-42A*7835130
[RPP#rpp::ManagedSystem::display] ## HMC_RELS : hmc-NAME
[RPP#rpp::ManagedSystem::display] ## MEM / PU : Values stored are not considered valid
[RPP#rpp::ManagedSystem::display] ##– ONOFF COD POOL
[RPP#rpp::ManagedSystem::display] ## MEM : Values stored are not considered valid
[RPP#rpp::ManagedSystem::display] ## PROC : Values stored are not considered valid
[RPP#rpp::ManagedSystem::display] ## EPCOD : No pool related to this managed system
[RPP#rpp::ManagedSystem::display] ##– RESOURCE REQUEST
And:
[RPP#rpp::ManagementConsole::rcToError] (hmc-zw-2): librpp error: Operation encountered an error that can be parsed. Check for REST or HSCL errors
[RPP#rpp::ManagementConsole::extractError] (hmc-zw-2): Content of error: hmcGetJsonQuick: – Could not retrieve </rest/api/uom/PowerEnterprisePool/quick/All>: <[HSCL350B The user does not have the appropriate authority. ]>.
[RPP#rpp::ManagementConsole::parseHsclError] (hmc-zw-2): Found HSCL error code: 350B
[RPP#rpp::ManagementConsole::getAllEntitiesAttrs] (hmc-zw-2): Encountered an error related to the management console. Invalidate it
[RPP#rpp::ManagedSystem::refreshResources] (p9-S924-zwl-1-7835130): librpp error: No valid management console to execute the request
[RPP#rpp::ManagedSystem::refreshOnOffResources] (p9-S924-zwl-1-7835130): librpp error: No valid management console to execute the request
<ReasonCode>Unknown internal error.</ReasonCode>
[2021-11-04][16:03:24]libhmc_http.c:getHmcErrorMessage:1395: HMC error msg: [[HSCL350B The user does not have the appropriate authority.
[2021-11-04][16:03:24]libhmc_http.c:getHmcErrorMessage:1395: HMC error msg: [[HSCL350B The user does not have the appropriate authority.
Solution for this problem:
Make sure that the connection between the LPAR and the HMC is in a working state; if it is not, follow this procedure.
First check the current status:
/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
Management Domain Status: Management Control Points
I A 0x7edb4439af21a36f 0002 xxx.xxx.xxx.xxx (ip address) I A 0x94580b842086855a 0001 xxx.xxx.xxx.xxx (ip address)
And on the HMC cli:
lspartition -dlpar
<#55> Partition:<13*9009-42A*7xxxxxxx, lparname, lpar-ip-address>
Active:<1>, OS:<AIX, 7.2, 7200-04-02-2028>, DCaps:<0xc2c5f>, CmdCaps:<0x4000003b, 0x3b>, PinnedMem:<6559>
If not then you can reconfigure the RMC daemon on the failing LPAR with:
First reconfigure RSCT:
/usr/sbin/rsct/bin/rmcctrl -z # (Stop)
/usr/sbin/rsct/install/bin/uncfgct -n #(Unconfigure)
/usr/sbin/rsct/install/bin/recfgct #(Reconfigure)
The output shows something like this:
2610-656 Could not obtain information about the cluster: cu_obtain_cluster_info:Invalid current cluster pointer.
fopen(NODEDEF FILE=/var/ct/IW/cfg/nodedef.cfg) fail (errno=2)
0513-071 The ctcas Subsystem has been added.
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 3277166.
Allow remote connections again:
/usr/sbin/rsct/bin/rmcctrl -p #(allow remote connections)
Check again the status:
/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
Finally run a second attempt for LKU.