Differences

This shows you the differences between two versions of the page.

--- tsm:replace_volume [2021/01/01 21:25]
127.0.0.1 external edit
+++ tsm:replace_volume [2021/11/25 17:59] (current)
manu
@@ Line 1: / Line 1: @@
 ====== TSM: Replace a damaged volume ======
-===== In a primary stgpool =====
+===== Recovering from a lost or damaged FILE volume or tape in a TSM non-deduplicated storage pool =====
+==== In a primary stgpool ====
   * First determine which volume has errors
-<code>
+<cli prompt='>'>
- TSM> select volume_name,write_errors,read_errors,access,STGPOOL_NAME from volumes where error_state<>'No' and ( WRITE_ERRORS>0 or READ_ERRORS>0 )
+Protect> select volume_name,write_errors,read_errors,access,STGPOOL_NAME from volumes where error_state<>'No' and ( WRITE_ERRORS>0 or READ_ERRORS>0 )
-</code>
+</cli>
   * Put the volume access in destroyed status
-<code>
+<cli prompt='>'>
- TSM> update vol <volume_name> access=destroy
+Protect> update vol <volume_name> access=destroy
-</code>
+</cli>
   * Checkout the volume from lib inventory
-<code>
+<cli prompt='>'>
- TSM> checkout libvol <library_name> <volume_name>
+Protect> checkout libvol <library_name> <volume_name>
-</code>
+</cli>
   * Find the tapes needed to restore the defective volume
-<code>
+<cli prompt='>'>
- TSM> restore vol <volume_name> preview=yes
+Protect> restore vol <volume_name> preview=yes
-</code>
+</cli>
   * Check in the activity log that all required tapes are available in the library and that enough scratch volumes are available, then start the restore
-<code>
+<cli prompt='>'>
- TSM> restore vol <volume_name>
+Protect> restore vol <volume_name>
-</code>
+</cli>
   * Delete the old defective volume
-<code>
+<cli prompt='>'>
- TSM> delete vol <volume_name> discard=yes
+Protect> delete vol <volume_name> discard=yes
-</code>
+</cli>
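The primary-pool steps above can be collected into one ordered command list. The following Python sketch is only an illustration: the function name `primary_volume_recovery_commands` and the sample volume and library names are hypothetical, and the code only formats the command strings shown above without contacting a server.

```python
def primary_volume_recovery_commands(volume_name, library_name):
    """Return the admin commands from the steps above, in order.

    Hypothetical helper: it only formats strings and never talks to a server.
    """
    return [
        # Flag the volume as destroyed.
        f"update vol {volume_name} access=destroy",
        # Remove the volume from the library inventory.
        f"checkout libvol {library_name} {volume_name}",
        # Preview first, to see which volumes the restore needs.
        f"restore vol {volume_name} preview=yes",
        # Run the actual restore once the required volumes are available.
        f"restore vol {volume_name}",
        # Finally delete the damaged volume and whatever is left on it.
        f"delete vol {volume_name} discard=yes",
    ]


if __name__ == "__main__":
    # "A00012L5" and "LIB1" are made-up names for illustration.
    for command in primary_volume_recovery_commands("A00012L5", "LIB1"):
        print(command)
```

Keeping the commands in a list makes the order explicit; the activity-log check between the preview and the real restore remains a manual step.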
-===== In a primary stgpool using replicated data =====
+==== In a primary stgpool using replicated data ====
 On the source server, do the following:
@@ Line 48: / Line 50: @@
     For example, to run the full node replication process and recover damaged files for client nodes in the PAYROLL group, issue the following command:
-    replicate node payroll recoverdamaged=yes
+<cli prompt='>'>
+Protect> replicate node payroll recoverdamaged=yes
+</cli>
     To run the node replication process only to recover damaged files, specify the name of the node or node group, and the RECOVERDAMAGED parameter with a value of ONLY.
     For example, to recover damaged files for client nodes in the PAYROLL group without running the full node replication process, issue the following command:
+<cli prompt='>'>
-    replicate node payroll recoverdamaged=only
+Protect> replicate node payroll recoverdamaged=only
+</cli>
-===== In a secondary stgpool =====
+==== In a secondary stgpool ====
   * First delete the volume and check it out of the library
-<code>
+<cli prompt='>'>
- TSM> delete vol <volume_name> discard=yes
+Protect> delete vol <volume_name> discard=yes
- TSM> checkout libvol <library_name> <volume_name> remove=bulk
+Protect> checkout libvol <library_name> <volume_name> remove=bulk
-</code>
+</cli>
   * Restore the volume from primary stgpool
-<code>
+<cli prompt='>'>
- TSM> backup stgpool <primary_stgpool> <secondary_stgpool>
+Protect> backup stgpool <primary_stgpool> <secondary_stgpool>
-</code>
+</cli>
+===== Recovering from a lost or damaged FILE volume in a TSM deduplicated storage pool =====
+**NOT FOR CONTAINER STORAGE POOLS**
+The high-level summary of the process is as follows:
+  * Part 1: Recover as much of the data on the damaged/lost volume(s) as possible.
+  * Part 2: Remove any remaining irrecoverable data from the damaged/lost volume(s).
+  * Part 3: Recover any data that was referencing irrecoverable data that was removed.
+  * Part 4: Remove any remaining data that cannot be recovered.
+==== Part 1 ====
+Recovering data using a copy storage pool (all supported levels) and/or a node replication target (7.1.1.0+ only):
+1. For any file volume that no longer exists and cannot be mounted (for example, the file volume was unexpectedly removed from the underlying filesystem), update the missing volume(s) to DESTROYED status:
+    IMPORTANT: You must identify and update all known missing volumes at this step.
+<cli prompt='>'>
+Protect> UPDATE VOLUME <volume name> ACCESS=DESTROYED
+</cli>
+2. For any file volume that still exists and is mountable but may contain damaged objects (for example, some objects on that volume cannot be accessed during read operations), update the volume(s) to READONLY and audit them:
+    IMPORTANT: You must identify, update and audit all known damaged volumes at this step.
+<cli prompt='>'>
+Protect> UPDATE VOLUME <volume name> ACCESS=READONLY
+Protect> AUDIT VOLUME <volume name>
+</cli>
+3. Manually initiate client backups, or wait for all normally scheduled clients to run a complete backup cycle, in an attempt to recover any damaged data that may still exist on the client filesystems.
+4. Regardless of whether a copy storage pool exists, issue the RESTORE STGPOOL command against the storage pool containing the damaged or lost volumes (and wait for the process to end):
+<cli prompt='>'>
+Protect> RESTORE STGPOOL <stgpool name> PREVIEW=NO MAXPROCESS=<n>
+</cli>
+    This process also starts a silent background re-linker process that attempts to locate a valid copy of each damaged chunk elsewhere in the pool and relink the data. This can reduce the overall scope of the damage, which is why the command is recommended regardless of whether a copy storage pool exists.
+5. If the data is (or might be) replicated, attempt to use node replication to recover the affected data on the missing or damaged volume(s) by issuing the following command and waiting for the process to end (monitor the process on the source and target servers):
+<cli prompt='>'>
+Protect> REPLICATE NODE * RECOVERDAMAGED=ONLY WAIT=YES
+</cli>
+6. For any file volume(s) identified in step 2 above (READONLY), attempt to move any existing valid data from those volumes to other new volumes in the same storage pool (and wait for the process to end). Do not move the data to a different storage pool, and do not issue this command for missing volume(s) identified in step 1:
+<cli prompt='>'>
+Protect> MOVE DATA <volume name>
+</cli>
+7. Issue the following commands to determine whether any objects or referenced deduplicated base chunks remain on any of the volumes identified in steps 1 (DESTROYED) or 2 (READONLY) above:
+<cli prompt='>'>
+Protect> QUERY CONTENT <volume name> FOLLOWLINKS=NO
+</cli>
+    If this command lists objects, you have experienced irrecoverable backup data loss. The list of files returned is a list of irrecoverable objects and their owners (node names). If there are no objects listed or the volume can no longer be found, every object on this volume was recovered successfully using either a copy storage pool or node replication.
+<cli prompt='>'>
+Protect> QUERY CONTENT <volume name> FOLLOWLINKS=JUSTLINKS
+</cli>
+    If this command lists objects, then objects stored on other volumes need to be recovered due to damaged deduplicated base chunks on this volume. The list of files returned is a list of affected objects on other volumes that will need to be recovered.
+    If neither command returned objects, recovery is complete and nothing further is required. If one or more of these QUERY CONTENT commands returned objects, then continue with the remaining steps within this document.
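The decision at the end of Part 1 can be sketched in Python. This is an illustration under assumptions: the function name `part1_outcome` and its two list arguments are hypothetical stand-ins for the rows returned by the two QUERY CONTENT commands, and nothing here queries a server.

```python
def part1_outcome(objects_on_volume, objects_linking_to_volume):
    """Classify the results of the two QUERY CONTENT checks.

    objects_on_volume: rows listed with FOLLOWLINKS=NO
    objects_linking_to_volume: rows listed with FOLLOWLINKS=JUSTLINKS
    Returns ("complete" | "continue", list of findings).
    """
    findings = []
    if objects_on_volume:
        # Objects still on the volume could not be recovered.
        findings.append("irrecoverable objects remain on the volume")
    if objects_linking_to_volume:
        # Objects elsewhere depend on damaged base chunks on this volume.
        findings.append("objects on other volumes reference damaged chunks")
    status = "continue" if findings else "complete"
    return status, findings


if __name__ == "__main__":
    # Made-up rows for illustration only.
    print(part1_outcome([], []))
    print(part1_outcome(["NODE_A /home/report.txt"], []))
```

Only when both lists are empty does the sketch report "complete"; any listed object means the remaining parts of the procedure apply.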
+==== Part 2 ====
+Removing data that cannot be recovered from the missing or damaged volumes:
+8. For any file volume(s) identified in step 2 (READONLY) above, ensure that all unreadable data remains marked as damaged by initiating an audit (and wait for the process to end):
+<cli prompt='>'>
+Protect> AUDIT VOLUME <volume name> FIX=YES
+</cli>
+9. For any file volume(s) identified in step 2 (READONLY) above, attempt to move any remaining valid data from those volumes to other volumes in the same storage pool and wait for the process to end (do not move the data to a different storage pool):
+<cli prompt='>'>
+Protect> MOVE DATA <volume name>
+</cli>
+    IMPORTANT: Ensure that the MOVE DATA processes end with success. If they end with failure, review the activity log to determine why they failed. If the processes failed because of a resource contention issue (for example, a lock conflict), re-issue the command at a later time. Otherwise, stop and contact IBM support for further review of the failure.
+10. For any file volume(s) identified in either step 1 (DESTROYED) or 2 (READONLY) above, remove the volume(s) and their remaining irrecoverable data from the Tivoli Storage Manager inventory using the following command (and wait for the process to end):
+    WARNING: Before proceeding, determine if any ANR4895E (invalid links) errors have recently occurred (QUERY ACTLOG SEARCH=ANR4895E BEGIND=TODAY-45). If so, there is existing damage in this storage pool that needs to be addressed BEFORE proceeding. To address these errors before continuing, begin by executing the dedupAuditTool.pl script for the ANR4895E symptom as outlined in the following TechNote: Auditing and repairing a deduplicated file storage pool. Once the script results are available, IBM Support should be contacted for further recovery steps. Once recovery of the existing damage has completed, then continue with the remainder of this TechNote.
+    IMPORTANT: Be absolutely sure that the previous steps in this TechNote have been followed correctly before deleting the volume. Failure to do so may cause unintended data loss or corruption of the deduplication catalog.
+    NOTE: Deleting the volume will remove any of the remaining and irrecoverable objects on that volume. Record the object names and owners (node name) before deleting the volume and attempt to back them up from the owning node again later if possible. This information was previously collected with the first QUERY CONTENT command in step 7 above.
+<cli prompt='>'>
+Protect> DELETE VOLUME <volume name> DISCARDDATA=YES
+</cli>
+    This command will create "invalid links" for the objects referencing the data on this volume. If this command returns that the volume no longer exists (ANR2401E), then recovery completed at an earlier step, but you should still continue with the remaining steps in this document.
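A minimal Python sketch of the pre-deletion check described in the WARNING above. The helper names and the sample log lines are hypothetical. The sketch assumes the QUERY ACTLOG output has already been captured as a list of text lines, and it only builds the command string rather than running it.

```python
def find_invalid_link_errors(actlog_lines, message_id="ANR4895E"):
    """Return the captured activity-log lines that mention message_id."""
    return [line for line in actlog_lines if message_id in line]


def delete_volume_command(volume_name, actlog_lines):
    """Build the DELETE VOLUME command, refusing if ANR4895E was logged.

    Hypothetical guard: existing invalid-link damage has to be
    addressed before a volume is deleted.
    """
    errors = find_invalid_link_errors(actlog_lines)
    if errors:
        raise RuntimeError(
            f"{len(errors)} ANR4895E message(s) found; resolve them first"
        )
    # DISCARDDATA is the spelled-out form of the parameter.
    return f"DELETE VOLUME {volume_name} DISCARDDATA=YES"


if __name__ == "__main__":
    # Made-up log line and volume path for illustration only.
    clean_log = ["ANR0985I Process 5 for VALIDATE EXTENTS completed"]
    print(delete_volume_command("/tsm/file/00000A1B.BFS", clean_log))
```

Raising an exception instead of returning a flag makes it harder to skip the check by accident in a larger script.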
+==== Part 3 ====
+Recovering objects referencing damaged data ("invalid links"):
+11. Scan and validate the deduplicated storage pool to determine if deleting the volume invalidated any links to base data:
+<cli prompt='>'>
+Protect> VALIDATE EXTENTS <deduplicated stgpool> ACTION=MARKDAMAGED PREVIEW=NO
+</cli>
+12. Review the activity log to determine the results of the above step. The results will look similar to the following:
+/09/2015 09:18:12   ****  VALIDATE EXTENTS CURRENT TOTALS FOR dedup ****
+/09/2015 09:18:12   Validate Extents: Total invalid          :      0
+/09/2015 09:18:12   Validate Extents: Total deleted          :      0
+/09/2015 09:18:12   Validate Extents: Total damaged (in pool):      0
+/09/2015 09:18:12   Validate Extents: Total damaged (not in pool):  0
+/09/2015 09:18:12   ANR0985I Process 5 for VALIDATE EXTENTS running in the
+                          BACKGROUND completed with completion state SUCCESS at
+:18:12 AM.
+    If zero objects were returned as invalid or damaged by the VALIDATE EXTENTS command, then recovery and clean-up is complete. If more than 0 objects were returned as either invalid or damaged, then continue with the remaining steps within this document.
+13. If a copy storage pool exists, attempt to restore any affected data on the missing or damaged volume(s) by issuing the following command (and wait for the process to end):
+<cli prompt='>'>
+Protect> RESTORE STGPOOL <stgpool name> PREVIEW=NO MAXPROCESS=<n>
+</cli>
+14. If the data is (or might be) replicated, attempt to use node replication to recover the affected data on the missing or damaged volume(s) by issuing the following command and waiting for the process to end (monitor the process on the source and target servers):
+<cli prompt='>'>
+Protect> REPLICATE NODE * RECOVERDAMAGED=ONLY WAIT=YES
+</cli>
+15. Repeat steps 11 and 12 to verify that no further issues are reported after the RESTORE STGPOOL or REPLICATE NODE RECOVERDAMAGED recovery attempt. If no further objects are invalid or damaged, recovery is complete. If problems are still reported, continue with Part 4.
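The check of the VALIDATE EXTENTS totals shown above can be sketched in Python. This is an illustration, not an official parser: the function names are hypothetical, and the regular expression assumes the "Validate Extents: Total <label>: <count>" layout of the sample activity-log output in this section.

```python
import re

# Matches e.g. "Validate Extents: Total damaged (in pool):      0"
TOTAL_LINE = re.compile(
    r"Validate Extents: Total (?P<label>.+?)\s*:\s*(?P<count>\d+)\s*$"
)


def parse_validate_extents_totals(log_text):
    """Extract the 'Total ...' counters from captured activity-log text."""
    totals = {}
    for line in log_text.splitlines():
        match = TOTAL_LINE.search(line)
        if match:
            totals[match.group("label")] = int(match.group("count"))
    return totals


def needs_further_recovery(totals):
    """True if any 'invalid' or 'damaged' counter is greater than zero."""
    return any(
        count > 0
        for label, count in totals.items()
        if label.startswith(("invalid", "damaged"))
    )


if __name__ == "__main__":
    sample = (
        "Validate Extents: Total invalid          :      0\n"
        "Validate Extents: Total deleted          :      0\n"
        "Validate Extents: Total damaged (in pool):      0\n"
        "Validate Extents: Total damaged (not in pool):  0\n"
    )
    totals = parse_validate_extents_totals(sample)
    print(totals, needs_further_recovery(totals))
```

Only the invalid and damaged counters drive the decision, matching the rule stated after the sample output; the deleted counter is parsed but ignored.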
+==== Part 4 ====
+Removing any remaining deduplicated data that cannot be recovered:
+16. Review the following TechNote to download and start a dedupAuditTool.pl scan using the INVALIDATED_LINKS symptom code: Auditing and repairing a deduplicated file storage pool. Once the script has started, contact IBM support for further assistance in resolving the remaining damage that could not be recovered. The script output, once complete, will be reviewed by the support team and further instructions will be provided to you.
