AIX upgrade without reboot, zero downtime (AIX Live update)

AIX live update

The Live Update feature is intended for applying interim fixes that contain kernel changes or kernel extension changes that would otherwise require a reboot. The interim fix might contain other files (for example, commands and libraries), and the Live Update feature does not change anything about the way these files are applied. For example, a shared library will be modified on the file system, but any running process continues to use the old version of the library. Therefore, applications that require a library fix must be stopped and restarted to load the new version of the library after the fix is applied. In AIX® Version 7.2 with the 7200-01 Technology Level, or later, you can use the genld -u command to list the processes that are using the old version of any shared libraries or other objects that were updated. You can use the list displayed by the genld -u command to identify the processes that must be stopped and restarted to load the updated objects.

Requirements (a quick verification sketch follows the list):

  • minimum AIX level is 7.2 Service Pack 1
  • the LPAR must be managed by either a Hardware Management Console (HMC, using the hmcauth command) or IBM Power Virtualization Center (PowerVC, using pvcauth)
  • packages: the bos.liveupdate, dsm.core and dsm.dsh filesets must be installed to use the Live Update feature with NIM
  • packages not supported: cas.agent (must be removed)
  • the LPAR's memory must be greater than 2 GB
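A quick way to verify these prerequisites on the LPAR; a sketch, it only reads state, so interpret the output against the list above:

# required filesets for Live Update (the dsm.* filesets are only needed for NIM-driven updates)
lslpp -L bos.liveupdate.rte dsm.core dsm.dsh

# cas.agent must NOT be installed; remove it if the command finds it
lslpp -L cas.agent

# memory size in MB as seen by the OS (must be greater than 2048)
lsattr -El mem0 -a goodsize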

Security restrictions (a pre-check sketch follows the list):

  • The Live Update operation is not supported when a process is using Kerberos authentication.
  • The Live Update feature does not support PowerSC™ Trusted Logging.
  • The Live Update feature is not supported if any of the following security profiles are active: high-level security (HLS), medium-level security (MLS), Sarbanes-Oxley (SOX) - Control Objectives for Information and Related Technology (COBIT), payment card industry (PCI) (any version), database, or Department of Defense (DoD) (any version).
  • The Live Update feature is not supported when audit is enabled for a stopped workload partition (WPAR).
  • The Live Update feature does not support Public-Key Cryptography Standards # 11 (PKCS11). The security.pkcs11 fileset cannot be installed.
  • The Live Update feature is not supported when the Trusted Execution option is turned on (TE=ON) and if any update must be applied. If only interim fixes are applied and the Trusted Execution option is turned on, the following Trusted Execution options in the trustchk command are not supported:
  TEP=ON
  TLP=ON
  CHKSHLIB=ON and STOP_UNTRUSTD=ON
  TSD_FILES_LOCK=ON
  • Active WPARs must be stopped before the Live Update operation.
  • A process that has a file from the /proc file system open can cause the Live Update operation to fail.
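
Several of these restrictions can be probed up front. A minimal pre-check sketch (read-only commands; note that lsof is not part of base AIX, it is available from the AIX Toolbox):

# Trusted Execution related policies (TE, TEP, TLP, CHKSHLIB, STOP_UNTRUSTD, TSD_FILES_LOCK)
trustchk -p

# security.pkcs11 must not be installed
lslpp -L security.pkcs11

# list WPARs; active ones must be stopped first
lswpar

# processes holding /proc files open can make the operation fail
lsof | grep /proc
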
AIX: check old programs in use

Which kernel module or old library is still in use? (AIX 7.2 and higher only)

[root@nim]/root# oslevel -s
7200-02-02-1810
[root@nim]/root# genld -u
Proc_pid: 15466758  Proc_name: sshd

Proc_pid: 16253288  Proc_name: dsmc

Proc_pid: 16646408  Proc_name: sshd

[root@nim]/root# stopsrc -s sshd
0513-044 The sshd Subsystem was requested to stop.
[root@nim]/root# startsrc -s sshd
0513-059 The sshd Subsystem has been started. Subsystem PID is 8782122.
[root@nim]/root# genld -u
Proc_pid: 15466758  Proc_name: sshd

Proc_pid: 16253288  Proc_name: dsmc

Proc_pid: 16646408  Proc_name: sshd

[root@nim]/root# ps -ef | grep dsm
    root  6488440 15401388   1 11:51:14  pts/6  0:00 grep dsm
    root 15532328        1   0   Nov 13      - 115:39 /usr/tivoli/tsm/client/ba/bin/dsmc sched
    root 16253288 13173056   0   Oct 13  pts/0 18:11 dsmc i
[root@nim]/root# kill 15532328
[root@nim]/root# genld -u
Proc_pid: 15466758  Proc_name: sshd

Proc_pid: 16253288  Proc_name: dsmc

Proc_pid: 16646408  Proc_name: sshd

[root@nim]/root# ps -ef | grep 15466758
    root 13173056 15466758   0   Oct 13  pts/0  0:01 -ksh93
    root 15466758        1   0   Oct 13      -  0:00 sshd: root@pts/0
    root 17564108 15401388   1 11:52:31  pts/6  0:00 grep 15466758
[root@nim]/root# kill 15466758
[root@nim]/root# ps -ef | grep 16646408
    root 13173058 15401388   1 11:52:58  pts/6  0:00 grep 16646408
    root 16515370 16646408   0   Oct 16  pts/2  0:01 -ksh93
    root 16646408        1   0   Oct 16      -  0:00 sshd: root@pts/2
[root@nim]/root# kill 16646408
[root@nim]/root# genld -u
[root@nim]/root#

TEST1

AIX Liveupdate without NIM

Using Live Update on a single server requires a connection to the HMC for each LPAR!

FIXME Check if your application is supported with Live Update. Currently Oracle v12 to v19 are supported.

Step 1: install live update packages

Prerequisite: artex (AIX Runtime Expert) packages

[root@laboh]/root# lslpp -Lc | egrep "live|hmc|pvc"
bos.sysmgt.pvc         
bos.sysmgt.hmc     
bos.liveupdate.rte         
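
If one of these is missing, it can be installed from your AIX media; a hedged example, assuming the base media is mounted on /mnt:

# -a apply, -g pull in requisites, -X extend filesystems, -Y accept licenses
installp -agXY -d /mnt bos.liveupdate bos.sysmgt.hmc bos.sysmgt.pvc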

Step 2: Establish a connection with HMC

Create a user liveupdate on the HMC with the role hmcclientliveupdate.

Register the HMC on the LPAR:

[root@laboh]/root# hmcauth -u liveupdate -p XXXXXXXXXXXXX -a hmc01
[root@laboh]/root# hmcauth -l
Address  : 1.1.1.1
User name: liveupdate
Port     : 12443

Step 3: Prepare LPAR for liveupdate

You have to add a minimum of 2 disks with the same size as rootvg. In case of a mirrored rootvg, create twice as many disks. A rootvg mirrored with 3 copies is not supported!

Based on the template (/var/adm/ras/liveupdate/lvupdate.template), create a file /var/adm/ras/liveupdate/lvupdate.data:

[root@laboh]/root# cat /var/adm/ras/liveupdate/lvupdate.data
general:
        kext_check =

disks:
        nhdisk  =
        mhdisk  =
        tohdisk =
        tshdisk =

hmc:
        lpar_id  =
        management_console =
        user =
  • nhdisk = <disk1,disk2,…> The names of disks used to make a copy of the original rootvg, which will be used to boot the surrogate.
  • mhdisk = <disk1,disk2,…> The names of disks used to temporarily mirror rootvg on the original LPAR.
  • tohdisk = <disk1,disk2,…> The names of disks used as temporary storage for the original.
  • tshdisk = <disk1,disk2,…> The names of disks used as temporary storage for the surrogate.
  • kext_check = <yes | no> Blank defaults to yes. If no, the Live Update operation is attempted regardless of whether all the loaded kernel extensions are determined to be safe.

Example

[root@laboh]/root# lspv
hdisk0          00fa343b8f050a76                    rootvg          active
hdisk2          00fa343b8f01b1e2                    None
hdisk10         00c2fcc8dfe7288e                    None
hdisk11         00c2fcc8dfe729fd                    None
hdisk1          00fa343b8f050ea7                    None

[root@laboh]/root# lsmpio -q
Device           Vendor Id  Product Id       Size       Volume Name
---------------------------------------------------------------------------------
hdisk0           IBM        2145               60.00GiB a_laboh_bA0
hdisk2           IBM        2145               60.00GiB a_laboh_bB0
hdisk10          IBM        2145               60.00GiB a_laboh_bA0_ALT
hdisk11          IBM        2145               60.00GiB a_laboh_bB0_ALT
hdisk1           IBM        2145               35.00GiB a_laboh_dA0

I made a clone on hdisk11 to be able to roll back.

[root@laboh]/root# cat /var/adm/ras/liveupdate/lvupdate.data
general:
        kext_check = yes

disks:
        nhdisk  = hdisk2
        mhdisk  = hdisk10
        tohdisk =
        tshdisk =

hmc:
        lpar_id  =
        management_console = hmc01
        user = liveupdate

Step 4: Validate the environment

[root@laboh]/root# geninstall -k -p
...
  LPAR: laboh
  HMC: 1.1.1.1
  user: liveupdate
...
+-----------------------------------------------------------------------------+
                    Live Update Preview Summary...
+-----------------------------------------------------------------------------+
The live update preview succeeded.

*******************************************************************************
End of Live Update PREVIEW:  No Live Update operation has actually occurred.
*******************************************************************************

FIXME If you have efixes installed, you have to remove them first

[root@laboh]/var# cat /proc/version
Aug 10 2022
13:47:51
IJ41685m3a stjohnr
@(#) _kdb_buildinfo unix_64 Aug 10 2022 13:47:51 IJ41685m3a stjohnr
[root@laboh]/var# genld -u
[root@laboh]/var# mount nim:/export/aix72/7200-05-05-2246_full /mnt
[root@laboh]/var# cd
[root@laboh]/root# geninstall -k -d /mnt update_all
  # /usr/sbin/emgr -P

  To remove the given EFIX, execute the following command:

  # /usr/sbin/emgr -r -L <EFIX label>

  For more information on EFIX management please see the emgr man page
  and documentation.


install_all_updates: Checking for recommended maintenance level 7200-05.
install_all_updates: Executing /usr/bin/oslevel -rf, Result = 7200-05
install_all_updates: Verification completed.
install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Result = FAILURE
geninstall:  The live update will not proceed because of an install failure.
Please check your logs in /var/adm/ras/liveupdate/logs.
Please check /var/adm/ras/emgr.log for interim fix status.
[root@laboh]/root# genld -u
[root@laboh]/root# /usr/sbin/emgr -P

PACKAGE                                                  INSTALLER   LABEL
======================================================== =========== ==========
bos.rte.security                                         installp    IJ36810s3a
bos.mp64                                                 installp    IJ41685m3a
openssl.base                                             installp    1022104a
bos.net.tcp.bind_utils                                   installp    IJ40615m4b
bos.net.tcp.bind                                         installp    IJ40615m4b

After removing the efixes, recheck:

[root@laboh]/root# emgr -P
There is no efix data on this system.
[root@laboh]/root# genld -u
Proc_pid: 7274980   Proc_name: sshd
Proc_pid: 8519944   Proc_name: sshd
Proc_pid: 11796918  Proc_name: sshd

Before liveupdate:

[root@laboh]/root# uname -L
26 laboh
[root@laboh]/var# cat /proc/version
Aug 10 2022
13:47:51
IJ41685m3a stjohnr
@(#) _kdb_buildinfo unix_64 Aug 10 2022 13:47:51 IJ41685m3a stjohnr
[root@laboh]/var# genld -u
[root@laboh]/var# oslevel -s
7200-05-03-2148

Step 5: Starting AIX update with liveupdate

Start the update (-k enables Live Update, -Y agrees to the licenses):

[root@laboh]/root# geninstall -Y -k -d /mnt update_all

install_all_updates: Initializing system parameters.
install_all_updates: Log file is /var/adm/ras/install_all_updates.log
install_all_updates: Checking for updated install utilities on media.
install_all_updates: Processing media.
Moving workload to surrogate LPAR.
............
        Blackout Time started.

        Blackout Time end.

Workload is running on surrogate LPAR.
....................................................
Shutting down the Original LPAR.
............................The live update operation succeeded.


Broadcast message from root@laboh (pts/1) at 15:46:25 ...

Live AIX update completed.


File /etc/inittab has been modified.

One or more of the files listed in /etc/check_config.files have changed.
        See /var/adm/ras/config.diff for details.

Step 6: Post-check

After liveupdate

[root@laboh]/root# uname -L
26 laboh
[root@laboh]/root# cat /proc/version
May  4 2022
12:25:32
2218B_72X
@(#) _kdb_buildinfo unix_64 May  4 2022 12:25:32 2218B_72X
[root@laboh]/root# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
12295E0B   0207154623 I S LVUPDATE       Live AIX update completed successfully
CB4A951F   0207154223 I S SRC            SOFTWARE PROGRAM ERROR
9DBCFDEE   0207154223 T O errdemon       ERROR LOGGING TURNED ON
9A74C7AB   0207152323 I S LVUPDATE       Live AIX update started

[root@laboh]/root# oslevel -s
7200-05-04-2220
[root@laboh]/root# genld -u
Proc_pid: 3473882   Proc_name: srcmstr
Proc_pid: 4981166   Proc_name: portmap
Proc_pid: 5636522   Proc_name: syslogd
Proc_pid: 5833188   Proc_name: xntpd
Proc_pid: 6029758   Proc_name: biod
Proc_pid: 6095348   Proc_name: secldapclntd
Proc_pid: 6357476   Proc_name: ucybsmgr
Proc_pid: 6554102   Proc_name: cupsd
Proc_pid: 6750684   Proc_name: rpc.lockd
Proc_pid: 7012624   Proc_name: zabbix_agentd
Proc_pid: 7078156   Proc_name: ucybsmgr
...

After restarting multiple services, everything is OK (I just needed to kill -9 the srcmstr process; init respawns it).

[root@laboh]/root# lspv
hdisk0          00fa343b8f050a76                    rootvg          active
hdisk2          00fa343b8f01b1e2                    lvup_rootvg
hdisk10         00c2fcc8dfe7288e                    None
hdisk11         00c2fcc8dfe729fd                    altinst_rootvg
hdisk1          00fa343b8f050ea7                    None

Restrictions

  • physical or virtual tapes, as well as optical devices, are not supported
  • rootvg with 3 copies is not supported
  • LPARs with an LV backed from the VIOS are not supported
  • Trusted Execution is not supported

logs

The main logging for Live Update can be found on the LPAR itself in the following path:

/var/adm/ras/liveupdate/logs/lvupdlog

If any service issue occurs with the Live Update operation, the snap -U command collects all the required information for the support team.
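
For a first triage after a failure, searching the main log for errors is usually enough; a sketch:

# list the logs (newest last), then search the main log for errors
ls -ltr /var/adm/ras/liveupdate/logs
grep ERROR /var/adm/ras/liveupdate/logs/lvupdlog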

cleanup

Attempt to clean up the partition after a failed Live Update operation (a usage example follows the flag list):

clvupdate [-e] [-o] [-u]
clvupdate [-e] [-n] [-u]
clvupdate -d
clvupdate -l
clvupdate -r
clvupdate -v

Flags:

  -d Remove surrogate boot disks only
  -e Ignore Live Update state
  -l Unlock Live Update lock only
  -n Run LVUP_ERROR phase scripts
  -o Force original shutdown
  -r Reset Live Update state and status only
  -u Ignore Live Update status
  -v Remove volume group from previous Live Update
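
For example, after a failed operation that left the Live Update state locked, a cleanup attempt could look like this (a hedged sketch; the flag combinations are taken from the synopsis above):

# ignore the recorded Live Update state and status, force shutdown of the leftover original LPAR
clvupdate -e -o -u

# then remove the volume group left over from the previous Live Update
clvupdate -v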


TEST2

AIX Liveupdate with NIM

Live Update Cookbook, by Christian Sonnemans, posted Tue March 30, 2021 05:24 AM

Live Update NIM Cookbook

AIX Live Kernel Update (Live Update) has been available for AIX 7.2 for several years now, and from a customer perspective it is a very good feature. It can help system administrators avoid working after hours to patch their systems. As a customer myself and an AIX admin, I made a concentrated effort to enable the Live Update feature in my environment. So, I have written this "Live Update cookbook" to help other customers do the same. This article guides you through a TL upgrade, invoking it from a NIM master or manually from the LPAR itself. The same steps can be used for updating to an SP, or for a single ifix that requires a kernel update (and reboot).

Purpose: set up a working NIM environment for Live Update.

Steps to take: set up a NIM or local LPAR environment in the right order, work through a checklist for the LPAR and take the right steps to make a Live Update successful, and review pitfalls and workarounds, logging and tracing.

Minimum requirements for a NIM environment: in this section I describe how to set up your NIM environment and show some sample configurations. I have included some other helpful links on this topic at the end of this blog, under the references section. We start by configuring and defining all the required NIM objects and the communication between the NIM master, HMC, CEC and VIO servers (VIOS).

First steps on the NIM master

Step 1: filesets that must be present

# lslpp -l|grep dsm
dsm.core
dsm.dsh

Step 2: Communication from NIM master to HMC

Create the user liveupdate on the HMC and save its password in a secure place.

Then create a password keystore for storing the password of the user that can log on to the HMC. See the following example:

# /usr/bin/dpasswd -f /export/nim/hmc_liveupdate_passwd -U liveupdate
[root@nim]/root# /usr/bin/dpasswd -f /export/nim/hmc_liveupdate_passwd -U liveupdate
Password file is /export/nim/hmc_liveupdate_passwd
Warning! Password will be echoed to the screen as it is typed.

On the NIM master, define an HMC object with the following NIM command.

Syntax example:

# nim -o define -t hmc -a passwd_file=EncryptedPasswordFilePath -a if1=InterfaceDescription -a net_definition=DefinitionName HMCName

More detailed example:

# nim -o define -t hmc -a if1="find_net <hmc-name> 0" -a net_definition="ent 255.255.255.0" -a passwd_file=/export/nim/hmc_liveupdate_passwd HMCname
[root@nim]/root# nim -o define -t ent -a net_addr="1.1.0.0" -a snm="255.255.255.0" -a routing1="default 1.1.0.1" vlan4
[root@nim]/root# nim -o define -t hmc -a if1="find_net hmc01 0" -a net_definition="ent 255.255.255.0" -a passwd_file=/export/nim/hmc_liveupdate_passwd hmc01

Check the NIM objects.

Tip: to learn more about your existing NIM networks, use the following commands:

[root@nim]/root# lsnim -l vlan4
vlan4:
   class      = networks
   type       = ent
   Nstate     = ready for use
   prev_state = ready for use
   net_addr   = x.X.X.X
   snm        = x.X.X.X
   routing1   = default x.X.X.X
   
[root@nim]/root# lsnim -l hmc01
hmc01:
   class       = management
   type        = hmc
   passwd_file = /export/nim/hmc_liveupdate_passwd
   mgmt_port   = 12443
   if1         = vlan4 hmc01 0
   Cstate      = ready for a NIM operation
   prev_state  =
   Mstate      = currently running

Exchange SSH keys between the HMC and the NIM master:

# dkeyexch -f /export/nim/hmc_liveupdate_passwd -I hmc -H <hmc_name>
[root@nim]/root# dkeyexch -f /export/nim/hmc_liveupdate_passwd -I hmc -H hmc01
OpenSSH_8.1p1, OpenSSL 1.0.2u  20 Dec 2019
Exception in thread "main" java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1023)
        at com.ibm.sysmgt.dsm.dkeyexch.dkeyexch.main(dkeyexch.java:639)

Workaround: because dkeyexch failed with the Java exception above, add the NIM master's root public key for the liveupdate user directly on the HMC:

hscroot@hmc01:~> mkauthkeys -u liveupdate -a 'ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxP+ZXcjHvCfE4gOv6mSRetw== root@nim'

Test the key exchange:

[root@nim]/root# nimquery -a hmc=hmc01
name=xxxxx,type_model=9105-42A,serial_num=78xxxx,ipaddr=10.254.0.3,sp_type=ebmc
....

Steps 3 and 4 are required if you use NPIV for storage paths

They are also useful if you want to create VIOS backups via a central NIM server. A pitfall for VIOS backups is that you have to open all the NIM port ranges if you use the VIOS firewall. If you do not use a VIOS firewall, there is no need to perform this step. Example:

# viosecure -firewall allow -port <portnumber> -address <nim_server>
# viosecure -firewall allow -port <portnumber> -address <nim_server> -remote

The ports are: 1019, 1020, 1021, 1022, 1023, 1058, 1059, 3901, 3902, 5813
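
A sketch that opens all of those ports in one go, run as padmin on each VIOS (the NIM master address 1.1.0.10 is an assumption; if the restricted shell refuses the loop, run the commands individually):

for port in 1019 1020 1021 1022 1023 1058 1059 3901 3902 5813
do
    viosecure -firewall allow -port $port -address 1.1.0.10
    viosecure -firewall allow -port $port -address 1.1.0.10 -remote
done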

Step 3: Create a NIM CEC object for each managed system

An easy way to add all servers into NIM is to use nimquery with -d to define all CECs:

[root@nim]/root#  nimquery -a hmc=hmc01 -d
name=xxxx,type_model=9105-42A,serial_num=xxxx
...

Or define it manually. Example:

nim -o define -t cec -a hw_type=9009 -a hw_model=42A -a hw_serial=<number> -a mgmt_source=hmc-name cec-name
[root@nim]/root# nim -o define -t cec -a hw_type=9080 -a hw_model=M9S -a hw_serial=xxxxx -a mgmt_source=hmc01 p1020

Check with the command:

[root@nim]/root#  lsnim -t cec
9105-42A_xxxxxxx     management       cec
...

Step 4: Create a VIOS NIM object for each managed system

Register a NIM object of type VIOS:

nim -o define -t vios -a if1="<network-name> <vio-name> 0 entX" -a mgmt_source="<cec-name>" -a identity=<LPAR-id> <vio-name>

More detailed example:

nim -o define -t vios -a if1="adminnetwork vio1 0 ent7" -a mgmt_source="p9-S924-testserver" -a identity=1 <vio-name>
[root@nim]/root# nim -o define -t vios -a if1="vlan4 viodevb1 0 ent1" -a mgmt_source="9080-M9S_xxxx" -a identity=3 viodev1
[root@nim]/root# nim -o define -t vios -a if1="vlan4 viodevb2 0 ent1" -a mgmt_source="9080-M9S_xxxx" -a identity=4 viodev2

Register a NIM object of type "client"

Each VIOS must be registered as a NIM client (to the NIM master). This can be done in one of two ways.

The first method uses the default padmin user:

$ remote_management -interface <interface> <nimserver>

The disadvantage of this method is that if you have more than one network interface on the VIOS you cannot specify the desired interface.

The second method gives you more control by allowing the administrator to choose the desired FQDN. Using oem_setup_env, run the following command as root on the VIOS:

$ oem_setup_env
# niminit -a master_port=1058 -a master=<FQDN-name-nimmaster> -a name=<vios-FQDN-name>
[root@viodev1]/home/padmin# niminit -a master_port=1058 -a master=nim -a connect=nimsh -a name=viodevb1
nimsh:2:wait:/usr/bin/startsrc -g nimclient >/dev/console 2>&1
0513-059 The nimsh Subsystem has been started. Subsystem PID is 31260968.

Also, in the oem_setup_env environment you can now check the file /etc/niminfo to ensure the NIM definitions for the client and the master are correct:

cat /etc/niminfo

As a final check, verify on your NIM server that you can access your VIO servers via NIM:

nim -o lslpp <vio-server name>

Also, on the NIM server, check that all the management resources are defined and available with:

lsnim -c management
[root@nim]/root# lsnim -c management
hmc01                management       hmc
9105-42A_xxxxxxx     management       cec

viodev1             management       vios
viodev2             management       vios

and for each object

lsnim -l <management_object>

Step 5: For each NIM client (LPAR) that you wish to update using Live Update via NIM

The NIM client object must be updated with a valid management profile.

This can be done via the command below:

nim -o change -a identity=<LPAR_ID> -a mgmt_source=<CEC> <lpar_name>

More detailed example:

[root@nim]/root# nim -o change -a connect=nimsh -a platform=chrp -a netboot_kernel=64 -a identity=26 -a mgmt_source=9080-M9S_xxxx laboh
[root@nim]/root# lsnim -l laboh
laboh:
   class          = machines
   type           = standalone
   connect        = nimsh
   platform       = chrp
   netboot_kernel = 64
   if1            = vlan4 laboh  6E5AFxxxxx ent0
   cable_type1    = N/A
   mgmt_profile1  = hmc01 26 9080-M9S_xxxxx
   Cstate         = ready for a NIM operation
   prev_state     = ready for a NIM operation
   Mstate         = currently running
   cpuid          = 00C2FCxxxxx

This is a very boring job if you have a lot of LPARs, so we created a script that gathers the right information from the HMC and then updates the extra management profile for every LPAR.

See example script-1 below.

Step 6: For Live Update to function

Each LPAR will need two additional (unused) disks before the surrogate LPAR can be created. These "spare" disks must be available on the LPAR before you can run Live Update. Also, on the NIM master there must be an object that holds the correct information about those disks (i.e. the Live Update data configuration file).

It is useful to create a liveupdate_data NIM object for every LPAR if you want to run NIM Live Updates in parallel.

  nim -o define -t live_update_data -a server=master -a location=/export/nim/liveupdate/liveupdate_data_laboh liveupdate_data_laboh

Below is an example of a liveupdate_data_<lpar_name> file. Note: the path is also an example; it can be any path where you store your NIM objects.

[root@nim]/root# cat /export/nim/liveupdate/liveupdate_data_laboh
general:
        kext_check = yes

disks:
        nhdisk  = hdisk0
        mhdisk  = hdisk10
        tohdisk =
        tshdisk =

hmc:
        lpar_id  =
        management_console = hmc01
        user = liveupdate

This is again a boring job, so you can create a script for it as well. The script that I created is customer specific and therefore not published; a generic sketch follows.
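
A minimal sketch of such a generator (not the author's script; it assumes every LPAR uses the same spare disk names and the same HMC, which you will want to parameterize):

#!/bin/ksh93
# generate a liveupdate_data file per LPAR and define the matching NIM object
DIR=/export/nim/liveupdate

for LPAR in "$@"
do
    FILE=${DIR}/liveupdate_data_${LPAR}
    # disk names and HMC below are assumptions, adapt per LPAR
    cat > ${FILE} <<EOF
general:
        kext_check = yes

disks:
        nhdisk  = hdisk2
        mhdisk  = hdisk10
        tohdisk =
        tshdisk =

hmc:
        lpar_id  =
        management_console = hmc01
        user = liveupdate
EOF
    nim -o define -t live_update_data -a server=master \
        -a location=${FILE} liveupdate_data_${LPAR}
done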

Step 7: Create your NIM lpp_source

Depending on what you want to update with Live Update, you need to prepare your lpp_source for an SP or TL upgrade.

SP or TL lpp_source: the steps for creating a valid lpp_source from base images can be found on the IBM support page: How to create a spot and lpp_source from an ISO image. There are several ways to create your lpp_source, but that is beyond the scope of this blog. Still, here is one example that can be used on the NIM master:

  • download the right TL/SP levels from Fix Central
  • put extra software (for example, for the expansion pack) into the same source/download directory
  • run the example below with your own paths:

nim -o define -t lpp_source -a server=master \
    -a location="/install/nim/lpp_source" \
    -a source=<path_downloaded_files_from_fixcentral> <lpp_source_name>

ifix lpp_source: I would also like to mention the possibility of using Live Update for ifixes. Some ifixes include kernel fixes and would normally require a reboot. We can use Live Update for these too, and no reboot is needed! The update source is again an lpp_source, but one that contains only ifixes. You can put multiple ifixes in one directory and then define it as an lpp_source.

NIM example:

nim -o define -t lpp_source -a server=master -a location="/install/nim/lpp_source/<ifixes-dir>" <lpp_source_ifixes>

To check whether your ifix is Live Update capable, use the emgr command as in the following example:

emgr -d -e IJ28227s2a.200922.epkg.Z -v2 | grep "LU CAPABLE:"
LU CAPABLE: yes

After step 7, you can do a final check that you have all the required resources:

lsnim | egrep "hmc|cec|vios|live"

After this, your NIM environment is ready for Live Update from your NIM master.

Sample script-1

#!/bin/ksh93

LPAR=$1

ping -c3 ${LPAR} >/dev/null 2>&1
if [ $? -ne 0 ]
then
    printf "could not ping host or wrong name: ${LPAR}\n"
    exit 1
fi

serialnumber=$(ssh ${LPAR} prtconf | grep "Machine Serial Number" | awk '{ print $4 }')
LPAR_id=$(ssh ${LPAR} lparstat -i | grep "Partition Number" | awk '{ print $4 }')

# Get the right info from the HMC
mgmt_sys=$(ssh user@hmc "lssyscfg -r sys -F name | grep ${serialnumber}")

printf "${LPAR} has LPAR id: ${LPAR_id} and is currently running on mgmt_system: ${mgmt_sys}\n"

nim -o change -a identity=${LPAR_id} -a mgmt_source=${mgmt_sys} ${LPAR}

lsnim -l ${LPAR}

exit 0
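
Usage is then simply (assuming the script is saved as add_mgmt_profile.ksh, a name chosen here):

./add_mgmt_profile.ksh laboh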

Preparing the LPAR before we kick off the actual Live Update

Before the actual Live Update starts, there are several requirements you must meet. Also, users of TE (Trusted Execution) must perform some extra steps, and the same goes for users of ipsec_v4 or ipsec_v6 (IP filters). In this section I describe those extra steps. In our case we created a pre-script that runs these checks for us and puts the LPAR into the right state. Because that script is very customer specific, I describe the common steps here.

Step 1:

Users of TE must disable it for the duration of the Live Update. On the LPAR, run:

trustchk -p te=off

Note: as a special case, users that store the TE and/or RBAC databases on a read-only LDAP server must modify /etc/nscontrol.conf so that during the update only the local /etc/security/tsd/tsd.dat and /etc/security/<rbacfiles> are modified. This is not Live Update specific, but should be done with any update to another SP or TL level. Example for the tsd.dat file:

chsec -f /etc/nscontrol.conf -s tsddat -a secorder=files

Step 2:

Make a backup of your IP filter rules and stop the IP filters on the LPAR.
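
To back up the active rules, the AIX filter export/import tools can be used; a sketch (the backup directory and the exact flags are assumptions, check the expfilt/impfilt man pages):

mkdir /tmp/ipsec_rules
# export the IPv4 and IPv6 filter rules
expfilt -f /tmp/ipsec_rules -v 4
expfilt -f /tmp/ipsec_rules -v 6
# after the Live Update, re-import them with: impfilt -f /tmp/ipsec_rules -v 4 (and -v 6)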

After this, remove the devices with:

rmdev -dl ipsec_v4 -R
rmdev -dl ipsec_v6 -R

Step 3:

Stop the processes that can cause Live Update to fail. We discovered several processes that caused Live Update to fail. One is WLM (Workload Manager); we found out that it was started via the inittab even though we do not use it. Therefore we shut it down and also disable it in /etc/inittab (if you do not disable this entry in the inittab, the surrogate will execute it and start WLM again). On the LPAR, use:

/usr/sbin/wlmcntrl -o
chitab "rcnetwlm:23456789:off:/etc/rc.netwlm start > /dev/console 2>&1 # disable netwlm"

Alternatively, if you do not use WLM, you can of course simply remove it from /etc/inittab with:

# rmitab rcnetwlm

Other processes that we stop before the Live Update are backup agents and monitoring scripts, such as nimon, nmon, lpar2rrd, etc. (a stop-list sketch follows). The reason: you do not want a backup to start during the upgrade phase, and monitoring can leave open files in /proc, which Live Update does not like. See also the restrictions on the IBM page: https://www.ibm.com/support/knowledgecenter/ssw_aix_72/install/lvupdate_detail_restrict.html
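
In practice this becomes a short stop list in the pre-script; a sketch (the subsystem names are site-specific examples, not a fixed list):

# stop backup and monitoring agents managed by the SRC before the Live Update starts
for subsys in zabbix_agentd nimon
do
    stopsrc -s ${subsys} 2>/dev/null
done

# list any nmon/lpar2rrd collectors started outside the SRC, then stop them manually
ps -ef | egrep "nmon|lpar2rrd" | grep -v grep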

Step 4:

Remove all ifixes on the LPAR if present. This is normal to do before an upgrade to a higher TL or SP and, without Live Update, most of the time requires a reboot if the ifixes contain kernel patches.

In this case we only remove the ifixes and do not reboot; otherwise it would not be a Live Update.

Step 5:

If the following filesets are installed on the LPAR, they must be removed, otherwise Live Update will fail. This can be found in the requirements for Live Update: https://www.ibm.com/support/knowledgecenter/ssw_aix_72/install/lvupdate_limitations.html

It concerns the filesets bos.mls.cfg and security.pkcs11:

geninstall -u -I "g -J -w" -Z -f filelist    (where the file list contains the mentioned filesets)

or

installp -ugw bos.mls.cfg security.pkcs11

Step 6: This step is TL02 SP2 specific and is solved in TL03 (see pitfall number 4). Copy a modified version of c_cust_liveupdate to /usr/lpp/bos.sysmgt/nim/methods/ on the LPAR. The modification made was:

vi /usr/lpp/bos.sysmgt/nim/methods/c_cust_liveupdate

Around line 892, change:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k"

To:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k -Y"

Live Update will then accept the -Y (agree to license) flag on the LPAR.

Another solution is to install the right APAR, see the list below:

IJ02913 for level 7200-01-05-1845
IJ08917 for level 7200-02-03-1845
IJ05198 for TL 7200-03

Of course it is always better to start with LKU at the latest available level. But the goal of LKU is updating to a higher level; this is the reason why I describe these pitfalls and mention the solution for those levels.

Step 7:

This step was also only needed on TL02 SP2; see pitfall number 5 for more detail. Run once on the NIM master and also on every LPAR:

lvupdateRegScript -r -n AUTOFS_UMOUNT -d orig -P LVUP_CHECK

Step 8:

We discovered that if programs running on the LPAR have open files in /proc, Live Update will fail during the blackout period (see pitfall 1). You can prevent this by checking the LPAR with lsof | grep /proc. If there is output, you should stop the executables that are causing it, such as truss or debugging programs.

Step 9:

Check that there is enough space in /var and the root filesystem /. You need a minimum of 100 MB free space, but for safety we verify that we have at least 500 MB free in those filesystems.
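
A small check for this (the 500 MB threshold is the author's safety margin):

# warn when / or /var has less than 500 MB free
for fs in / /var
do
    free=$(df -m ${fs} | awk 'NR==2 {print int($3)}')
    [ ${free} -lt 500 ] && printf "WARNING: only ${free} MB free in ${fs}\n"
done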

Step 10:

Before running Live Update, I recommend creating a backup of your system. You have several options for this: NIM, mksysb, alt_disk_copy. I prefer alt_disk_copy because it takes only one reboot to fall back to the old rootvg.

Make a copy of your current rootvg on a separate disk with the tool alt_disk_copy.

The steps are very simple and straightforward, and this is a real time saver if something goes terribly wrong!

Steps are:

Run cfgmgr -v after you have added the extra disk via zoning or storage pools.

If you use TE, run trustchk -p te=off

Run the command:

alt_disk_copy -B -d <destination-disk>

You can add the -B option to prevent changing the bootlist.

Check / modify your boot list back to the original disk:

bootlist -m normal <original_rootvg_disk>

Step 11:

Check the profile for the LPAR on the HMC. The minimum amount of memory for the active profile must be 2048 MB (2 GB). Be aware that the LPAR must be booted with this profile; it cannot be changed afterwards. This is actually an AIX 7.2 requirement: you need a minimum of 2 GB to run the OS.

Let’s update without doing a reboot!

Now that we have done a lot of checks and verifications, it is time for the actual goal: updating the TL / SP level without a reboot. In this section we describe several ways to initiate Live Update with NIM. My advice: if you are new to this technology, first do some test runs in a test environment, and to speed up your tests, first make an alt_disk_copy on a spare disk. That way you can roll back with one reboot, see step 10 in the previous section.

Live Update with NIM: for a preview, you can run the following examples with your valid lpp_source and liveupdate_data NIM resources.

Dry run: note the -p installp flag for preview.

nim -o cust -a lpp_source=72000402_lpp_source -a installp_flags="-p" -a fixes=update_all -a live_update=yes -a live_update_data=liveupdate_data_<LPAR-name> <LPAR-name>

Real update with Live Update without the -p flag:

nim -o cust -a lpp_source=72000402_lpp_source -a fixes=update_all -a live_update=yes -a live_update_data=liveupdate_data_<LPAR-name> <LPAR-name>

When starting this command via a script on your NIM server or via the command line, you can monitor the standard output on your NIM master server. For the logging on the LPAR, see the chapter on logging.

Testing Live Update on an LPAR without using NIM: for testing purposes, you can start Live Update manually on the LPAR itself without using NIM. If you do not apply any new updates or ifixes, you can use this method as a kind of "dry run" to see if it will create a surrogate and move your LPAR to a new one (monitor the LPAR id). Be aware: this is not a real dry run, it does a Live Update to a new LPAR, but it is a good test of whether your workload can survive a Live Update action.

Make sure that you first authenticate against the HMC where you created the liveupdate user.

Command example below:

hmcauth -u <liveupdate-user> -p <password> -a <hostname-hmc>

check with:

hmcauth -l

output example:

Address  : www.xxx.yyy.zzz
User name: liveupdate
Port     : 12443

On the LPAR, run the command below to start the Live Update. First check that the two free disks are specified in /var/adm/ras/liveupdate/lvupdate.data.

geninstall -kp <-- for preview first
geninstall -k   <-- for the real update

During the Live Update you can monitor the process on the NIM master, or, if started on the LPAR, on the LPAR itself. On the HMC you can partly monitor the process, such as the creation of the surrogate and the cleanup of the old LPAR after the process is done. See below a screenshot of the HMC during the Live Update action: [HMC progress]

The picture above shows that the original LPAR is renamed with the suffix _liveupdate0 and the surrogate is created with the original name.

Apply an ifix with Live Update via the NIM master: this is roughly the same method as a TL or SP upgrade, and NIM needs the same steps. See the first part for how to define an ifix NIM lpp_source. The minimum requirements are also the same: a minimum of two free disks on the LPAR, the liveupdate data file specified in NIM, connection and authentication to the HMC, IP filtering disabled, etc.

Apply the ifix preview with the following example:

nim -o cust -a live_update=yes -a lpp_source=<lpp_source_ifixes> -a installp_flags="-p" -a live_update_data=liveupdate_data_<lpar_name> -a filesets=1022100a.210112.epkg.Z <lpar_name>

Leaving out the installp_flags="-p" (preview) attribute will actually apply the ifix with Live Update.

Tip: You can find detailed logging for the ifix install on the LPAR in the following location:

/var/adm/ras/emgr.log

Workaround: separate the upgrade from the Live Update actions

Workaround for pitfall 3 (PFC daemon waits too long to start): this workaround is a good example of mitigating problems that can arise during the upgrade of a TL or SP level. It separates the upgrade from the actual Live Update actions, and in this case makes it possible to start required daemons before the Live Update actions begin. See the example below: the first part is a "normal" NIM update, after which all updates are committed. The second part is the actual Live Update, but now with no packages left to apply, so only the Live Update actions are performed.

nim -o cust -a lpp_source=72000402_lpp_source -a fixes=update_all -a accept_licenses=yes <lpar_name>

# now commit updates and start pfcdaemon

ssh <lpar_name> installp -c ALL
ssh <lpar_name> startsrc -s pfcdaemon
nim -o cust -a lpp_source=72000402_lpp_source -a fixes=update_all -a live_update=yes -a live_update_data=liveupdate_data_<lpar_name> <lpar_name>

Fallback option for pitfall 6: the same workaround used above can also serve as a kind of fallback option for the chicken-and-egg problem described in pitfall 6, when your workload fails to move to the surrogate LPAR (a failure during backout). The normal behavior without this workaround is a rollback to the original state. With this workaround, the actual upgrade is already done before the Live Update action starts. So now you have a choice:

  • stop your workload and make another attempt to move to the upgraded surrogate LPAR, i.e. repeat the Live Update action without workload. This is no longer a real Live Update, because the goal was to keep your workload running. OR
  • because you have to shut down your workload anyway (so it is no longer a real Live Update), you can of course also just reboot your LPAR. After this your upgrade is done, but not via Live Update.

Logging: the main logging for Live Update can be found on the LPAR itself in the path /var/adm/ras/liveupdate/logs. In this path you find the file lvupdlog, or this file with a timestamp. The log file can be cryptic to read, but searching for keywords like ERROR (all in caps) can help you find a clue about what went wrong in case of a failure.

When you must send logs to IBM for analysis, it is good to gather the following: make a tar file of everything in the path /var/adm/ras/liveupdate, for example:

cd /var/adm/ras/liveupdate
tar cvf liveupdate_logs.tar *

Also make a snap -ac and send this info to IBM as well.

Pitfalls

1: Open files in /proc

Message during blackout: 1020-311 The state of the LPAR is currently non-checkpointable. Checkpoint should be retried later. [02.229.0730]

Problem: programs such as truss or a debugger have files open in /proc. In the log file:

OLVUPDMCR 1610089744.160 DEBUG mcr_fd_srm_dev - Calling lvup_chckpnt_status(LIVE_UPDATE_TRY)
KLVUPD 1610089744 DEBUG - 6750576/24904167 DENY Blocking entry: procfs open, count: 9
OLVUPDMCR 1610089754.160 ERROR mcr_fd_srm_dev - 1020-311 The state of the LPAR is currently non-
OLVUPDMCR 1610089754.160 ERROR mcr_sync - Action LU_FREEZE failed in SRM with status 6

2: WLM manager running, or this entry in the inittab:

rcnetwlm:23456789:wait:/etc/rc.netwlm start > /dev/console 2>&1 # Start netwlm

Change it to:

chitab "rcnetwlm:23456789:off:/etc/rc.netwlm start > /dev/console 2>&1 # disable netwlm"

and run:

/usr/sbin/wlmcntrl -o

3: PFC daemon waits too long to start. The script for this daemon waits too long to start (5 minutes, on TL02 SP2). Message in the logging:

1430-134 Notification script '/opt/triton/daemon-stop' failed with exit status 1

Cause: this happens because the pfc daemon is not started or takes too long to start.

Workaround: first apply all updates, then start the pfcdaemon again before the Live Update starts, with:

startsrc -s pfcdaemon

4: Using Live Update on TL02-SP2, a strange license error appears during preview. You are hitting IJ08917 (license errors when running Live Update through NIM). This is solved in 7200-02-03. On the NIM client LPAR, edit the following file and add the -Y (agree to license) flag:

# vi /usr/lpp/bos.sysmgt/nim/methods/c_cust_liveupdate

Around line 892

Change:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k"

To:

geninstall_cmd="${geninstall_cmd} ${VERBOSE:+-D} -k -Y"

Or apply the following ifix:

IJ02913 for level: 7200-01-05-1845
IJ08917 for level: 7200-02-03-1845
IJ05198 for TL level: 7200-03

5: Using Live Update on TL02-SP2, after the message notifying applications of the impending Live Update:

1430-134 Notification script '/usr/lpp/nfs/autofs/autofs_liveupdate orig' failed with exit status 127
….1430-025 An error occurred while notifying applications.
1430-045 The Live Update operation failed. Initiating log capture.

Solution: run once on the NIM master and also on every LPAR:

lvupdateRegScript -r -n AUTOFS_UMOUNT -d orig -P LVUP_CHECK

This is fixed in:

IJ04531 - 7200-02-02-1832
IJ04532 - 7200-03-00-1837
IV89963 - 7200-01-02-1717

6: In the blackout window you get several messages like the ones below.

The Live Update logs show the following error against the library that is causing Live Update to fail:

mcr: cke_errno 52 cke_Sy_error 23349 cke_Xtnd_error 534 cke_buf /usr/lib/libC.a(shr_64.o)

In the NIM output you get:

Blackout Time started.
1020-007 System call mcrk_sys_spl_getfd error (50). [03.156.0064]
1020-007 System call mcrk_sys_spl_getfd error (50). [03.156.0064]
1020-007 System call MCR_GET_SHLIB_ASYNCINFO error (52). A file, file system or message queue is no longer available. [02.239.0142]
……..1430-030 An error occurred while moving the workload.
1430-045 The Live Update operation failed.

This is not really a pitfall but a bug that you are hitting. Problem description: the current kernel cannot properly release the shared libraries for non-root users.

Workaround: No real workaround.

Option 1: shut down your workload and restart the Live Update operation. But this way, it is not a real Live Update anymore!

Option 2: request an ifix for the level you are updating to. If you plan on updating to 7200-05-01, you need an ifix for 7200-05-01. But with this option you have a chicken-and-egg problem, because the ifix is only active (loaded) in the new kernel. In other words, it is only fixed after a reboot, or after a Live Update has successfully completed (option 1).

Levels that suffer from this bug (impacted levels) are:

  • Service pack 7200-05-01-2038 (bos.mp64 7.2.5.1): you need IJ30267 (the official APAR fix is now in 7200-05-02)
  • Technology Level 7200-05-00-2037 (bos.mp64 7.2.5.1): you need IJ30267
  • Service pack 7200-04-02-2028 (bos.mp64 7.2.4.6): you need IJ29492
  • Service pack 7200-04-01-1939 (bos.mp64 7.2.4.2): you need IJ29492
  • Technology Level 7200-04-00-1937 (bos.mp64 7.2.4.2): you need IJ29492

Because this fix is for a shared library, the only way for your application to use the fixed library code is to stop the application and restart it with the updated code, or to reboot your system.

Going forward, though, you do not need to stop your applications if you have the ifix installed and update to the next update of AIX.

7: The program /usr/bin/domainname fails if no NIS domainname is defined.

This pitfall only happens when you have the NIS fileset installed:

lslpp -w /usr/bin/domainname
File                      Fileset              Type
---------------------------------------------------
/usr/bin/domainname       bos.net.nis.client   File

NIS is an old way of resolving hostnames, also known as Yellow Pages. My advice is to remove this fileset because it is old and insecure, unless you still use it. Another workaround is to replace the executable with a script that does nothing but exit with return code 0 (a sketch follows).
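
A sketch of that stub replacement (keep the original so you can restore it after the update):

# preserve the original binary, then install a no-op stub
mv /usr/bin/domainname /usr/bin/domainname.orig
printf '#!/bin/ksh93\nexit 0\n' > /usr/bin/domainname
chmod 755 /usr/bin/domainname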

8: In the blackout window you get the following misleading message:

1020-154 Resource /old/dev/nvram is not checkpointable (pid: 19005844) (name: lscfg_chrp). Query returned 22 Invalid argument [02.229.0462]

      1020-089 Internal error detected. [02.335.0458]

However, investigating the logs, you also see messages like:

mcr: cke_errno 52 cke_Sy_error 23349 cke_Xtnd_error 534 cke_buf /usr/lib/libC.a(shr_64.o)

Digging further into the logs shows that the connection with the HMC(s) is lost.

[RPP#rpp::ManagedSystem::display] #### MANAGED SYSTEM #################

[RPP#rpp::ManagedSystem::display] ## NAME : p9-S924-zwl-1-7835130

[RPP#rpp::ManagedSystem::display] ## OP STATUS : OperationError::OP_PARAM_ERROR

[RPP#rpp::ManagedSystem::display] ## RUUID : d1197d99-bb08-38ea-b447-76f6073f4210

[RPP#rpp::ManagedSystem::display] ## MTMS : 9009-42A*7835130

[RPP#rpp::ManagedSystem::display] ## HMC_RELS : hmc-NAME

[RPP#rpp::ManagedSystem::display] ## MEM / PU : Values stored are not considered valid

[RPP#rpp::ManagedSystem::display] ##– ONOFF COD POOL

[RPP#rpp::ManagedSystem::display] ## MEM : Values stored are not considered valid

[RPP#rpp::ManagedSystem::display] ## PROC : Values stored are not considered valid

[RPP#rpp::ManagedSystem::display] ## EPCOD : No pool related to this managed system

[RPP#rpp::ManagedSystem::display] ##– RESOURCE REQUEST

And:

[RPP#rpp::ManagementConsole::rcToError] (hmc-zw-2): librpp error: Operation encountered an error that can be parsed. Check for REST or HSCL errors

[RPP#rpp::ManagementConsole::extractError] (hmc-zw-2): Content of error: hmcGetJsonQuick: – Could not retrieve </rest/api/uom/PowerEnterprisePool/quick/All>: <[HSCL350B The user does not have the appropriate authority. ]>.

[RPP#rpp::ManagementConsole::parseHsclError] (hmc-zw-2): Found HSCL error code: 350B

[RPP#rpp::ManagementConsole::getAllEntitiesAttrs] (hmc-zw-2): Encountered an error related to the management console. Invalidate it

[RPP#rpp::ManagedSystem::refreshResources] (p9-S924-zwl-1-7835130): librpp error: No valid management console to execute the request

[RPP#rpp::ManagedSystem::refreshOnOffResources] (p9-S924-zwl-1-7835130): librpp error: No valid management console to execute the request

  <ReasonCode>Unknown internal error.</ReasonCode>

[2021-11-04][16:03:24]libhmc_http.c:getHmcErrorMessage:1395: HMC error msg: [[HSCL350B The user does not have the appropriate authority.

[2021-11-04][16:03:24]libhmc_http.c:getHmcErrorMessage:1395: HMC error msg: [[HSCL350B The user does not have the appropriate authority.

Solution for this problem:

Make sure that the connection between the LPAR and the HMC is in a working state. If it is not, follow this procedure.

First check the current status:

/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

Management Domain Status: Management Control Points

I A  0x7edb4439af21a36f  0002  xxx.xxx.xxx.xxx (ip address)
I A  0x94580b842086855a  0001  xxx.xxx.xxx.xxx (ip address)

And on the HMC CLI:

lspartition -dlpar

<#55> Partition:<13*9009-42A*7xxxxxxx, lparname, lpar-ip-address>

     Active:<1>, OS:<AIX, 7.2, 7200-04-02-2028>, DCaps:<0xc2c5f>, CmdCaps:<0x4000003b, 0x3b>, PinnedMem:<6559>

If not, you can reconfigure the RMC daemon on the failing LPAR.

First reconfigure RSCT:

/usr/sbin/rsct/bin/rmcctrl -z # (Stop)

/usr/sbin/rsct/install/bin/uncfgct -n #(Unconfigure)

/usr/sbin/rsct/install/bin/recfgct #(Reconfigure)

The output shows something like this:

2610-656 Could not obtain information about the cluster: cu_obtain_cluster_info:Invalid current cluster pointer. fopen(NODEDEF FILE=/var/ct/IW/cfg/nodedef.cfg) fail (errno=2)

0513-071 The ctcas Subsystem has been added.

0513-071 The ctrmc Subsystem has been added.

0513-059 The ctrmc Subsystem has been started. Subsystem PID is 3277166.

Allow remote connections again:

/usr/sbin/rsct/bin/rmcctrl -p #(allow remote connections)

Check the status again:

/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

Finally, run a second attempt of the Live Update (LKU).
