Rescue for a Mission-critical Oracle Database

    Nov 7, 2017 9:14:00 PM by Nassyam Basha

    Introduction

    This use case article describes the recovery of a production Oracle database without the availability of the required critical archive log files after bulk and unexpected changes to the database had been performed. Unfortunately, there were no flashback technologies enabled to revert the changes and we had to initiate disaster recovery. In this article, we will see how the database can be recovered using a well-known (but undocumented) parameter and what decisions to make when the business is affected, along with a few best practices to avoid such situations.

     

    Series of events

    • Received request from a customer to restore their production database with the latest full backup and point-in-time recovery, anytime between 09/30/2017 11:30 PM and 10/01/2017 00:10 AM, because after that time there were bulk changes in the database which caused loss of data.
    • In such cases, the first recovery option to try is to perform flashback, but these features were not enabled and hence this method was ruled out.
    • There was no standby database with a delay so that we could failover the standby to primary. This database had no standby available for it.
    • Incremental SCN could be performed after restoring the old backup as a new auxiliary instance in the development server and again, you would need to perform a refresh from a new database to the production database. [Could be considered as last option]
    • Then the final option would be performing a fresh restore of the database from the latest backups.
    • We reviewed the backup: Found once-weekly expdp backup and last backup ran on 09/27/2017
      • This was not a feasible option because there would be three days of data loss with this backup.
    • We reviewed the backup: Found daily full RMAN backup, and the last backup ran on 09/30/2017
      • RMAN Backup start time: 09/30/17 23:30
      • RMAN Backup end time: 09/30/17 23:38

    • Hence, we decided to use the latest backup completed on 09/30/17 (highlighted above)
    • Database blackout was performed to avoid monitoring of the target.
    • We took a fresh backup using expdp before destroying the database. (We were aware that this database backup was not useful for our solution, however we saved it in case it might be required for future purposes)  - This is one notable step every DBA should consider.
    • Dropped the database.
    • Restored control file from a recent backup.
    • Restored database as well.

    At this point, we needed to perform the recovery using the archive logs generated during the RMAN backup to synchronize the SCN for all the data files and control files.

    • Recovery using archive log files

    We started recovery from SQL Plus to start to apply the available archive logs up to the timestamp of 10/01/2017 00:10. 

    SQL> alter database recover automatic using backup controlfile until cancel;
    alter database recover automatic using backup controlfile until cancel
    *

    Nevertheless, the recover command during the first required sequence itself failed and was unable to identify or open the file. During the physical review of files, there were no archive log sequences available and no backups of archive logs found. Usually, the archive log backups are included along with the database backup, or the archive log backups are scheduled separately as per the requirement or the amount of redo generation.

    Now we will walk through the backup strategy for the configured production database.

    RMAN> run {execute script 

    The crontab job script calls the catalog script below. 

    RMAN> print script MSDB_custom_rman_disk_db_bkp;
    printing stored script: MSDB_custom_rman_disk_db_bkp
    {sql 'alter system archive log current';
    allocate channel d1 type disk format 'G:rman01MSDBdatabaseMSDB_full_%U';
    setlimit channel d1 maxopenfiles 1;
    backup filesperset=32 full database;
    release channel d1;
    sql 'alter system archive log current';
    }

    Initially, the script “MSDB_custom_rman_disk_db_bkp” will perform a full backup and then immediately the deletion of archive logs follows. The deletion script does not keep even a single archive log backup whenever this command runs: “delete noprompt archivelog until time 'sysdate';”.

    That means after the full backup all the archives deleted; the output from the backup log file is shown below.

    sql statement: alter system archive log current
    allocated channel: ORA_DISK_1
    channel ORA_DISK_1: SID=197 device type=DISK
    deleted archived log
    archived log file name=H:APPORACLEFLASH_RECOVERY_AREAMSDBARCHIVELOG17_09_30O1_MF_1_6283_DX0R9SK9_.ARC RECID=6222 STAMP=956100604
    deleted archived log
    archived log file name=H:APPORACLEFLASH_RECOVERY_AREAMSDBARCHIVELOG17_09_30O1_MF_1_6284_DX0RTHF0_.ARC RECID=6223 STAMP=956101135
    Deleted 2 objects
    Recovery Manager complete.

    Therefore, there are no archive logs available nor is there any backup configured for archive logs.

     

    Rescue Plan – Recovery Options

    Now, what are the options left for us?

    1. Restore the bad backup, which took before database drop? – No use, as there were already bulk changes done.
    2. Use expdp of 27th September, with three days of data loss? – Three days of loss was not acceptable.

     

    Critical and unsafe strategy

    Apart from the above options, there is one more option left that is considered as a last resort, and there is no guarantee that the database can be recovered or usable. This procedure is widely used in scenarios like the one I have described. Personally, from the first time I tested this undocumented technique, back in the year 2011, to now, so far based on my experience I have been able to open the database in a functioning mode is 40 – 50% of the time. An undocumented parameter plays a key role in performing fake recovery with undocumented procedures.

    Undocumented parameter: “_ALLOW_RESETLOGS_CORRUPTION = TRUE”, which we have to edit in pfile or update in spfile before opening the database. The actual and complete steps are as follows.

    1. Add the parameter “_ALLOW_RESETLOGS_CORRUPTION = TRUE” in pfile.
    2. SQL> startup mount
    3. SQL> alter database open resetlogs;
    4. Unset parameter “_ALLOW_RESETLOGS_CORRUPTION”
    5. Bounce the database
    6. Export and reimport into a fresh database

    Disadvantages

    1.  If the database is unable to open with reset logs then the database will not be usable even for recovery and we have to perform a fresh full restore.
      1. Example: Error 600 happened during database open, shutting down the database
    2. Even though the database opened, we may experience many errors related to accessing SQL or PL/SQL objects and there is an official bug for this.
      1. Bug 3517065: ORA-600 [4146] AFTER OPENING DATABASE WITH ALLOW_RESETLOGS_CORRUPTION

     

    Final Call – What is next?

    Now coming to the point, Is the undocumented procedure a good practice in critical production databases?

    Or…

    Are there any possibilities with the available backups? This entire restore and recover task became overhead with small mistakes with the configuration, which made things miserable. After all, I had made the decision to use this procedure, as there was no other option left. I was also aware that this procedure might work or fail with internal errors. However, I made my final call based on the SLA – Recovery Time Objective (RTO) and the criticality of the database.

    1. Edited the PFILE and started the instance

    Starting up:
    Oracle Database 11g Release 11.2.0.3.0 - 64bit Production.
    Windows NT Version V6.1 Service Pack 1
    CPU                 : 4 - type 8664, 4 Physical Cores
    Process Affinity    : 0x0x0000000000000000
    Memory (Avail/Total): Ph:5573M/8191M, Ph+PgF:9830M/16381M
    VM name             : VMWare Version (6)
    Using parameter settings in server-side pfile H:APPORACLEPRODUCT.2.0DBHOME_2DATABASEINITMSDB.ORA
    System parameters with non-default values:
      processes                = 150
      event                    = ""
      streams_pool_size        = 64M
      memory_target            = 6000M
      control_files            = "H:APPORACLEORADATAMSDBCONTROL01.CTL"
      control_files            = "H:APPORACLEFLASH_RECOVERY_AREAMSDBCONTROL02.CTL"
      db_block_size            = 8192
      compatible               = "11.2.0.0.0"
      cluster_database         = FALSE
      db_recovery_file_dest    = "H:apporacleflash_recovery_area"
      db_recovery_file_dest_size= 84G
      _allow_resetlogs_corruption= TRUE
      undo_tablespace          = "UNDOTBS1"
      undo_retention           = 3000
      remote_login_passwordfile= "EXCLUSIVE"
      audit_sys_operations     = TRUE
      db_domain                = ""
      smtp_out_server          = "SMTP.oracle-ckpt.com"
      audit_file_dest          = "H:APPORACLEADMINMSDBADUMP"
      audit_trail              = "DB"
      audit_trail              = "EXTENDED"
      db_name             = "MSDB"
      open_cursors        = 300

     

    2. Database opened with Restlogs

    SQL> startup mount
    SQL> alter database open resetlogs;
     
    Mon Oct 02 13:32:06 2017                                       
    ALTER DATABASE   MOUNT
    Successful mount of redo thread 1, with mount id 950819542
    Database mounted in Exclusive Mode
    Lost write protection disabled
    Completed: ALTER DATABASE   MOUNT
    Mon Oct 02 13:32:20 2017
    alter database open resetlogs
    RESETLOGS is being done without consistancy checks. This may result
    in a corrupted database. The database should be recreated.
    RESETLOGS after incomplete recovery UNTIL CHANGE 2588214675543
     
    SQL> select name,open_mode,resetlogs_change#,resetlogs_time from v$database;
     
    NAME      OPEN_MODE            RESETLOGS_CHANGE# RESETLOGS_TIME
    --------- -------------------- ----------------- --------------------
    MSDB       READ WRITE               2588214675544 02-OCT-2017 13:32:20
     
    RMAN> list incarnation;
     
    using target database control file instead of recovery catalog
     
    List of Database Incarnations
    DB Key  Inc Key DB Name  DB ID       STATUS  Reset SCN     Reset Time
    ------- ------- -------- ----------- ---     ----------    ----------
    1       1       MSDB      777731056   PARENT  1            29-AUG-11
    2       2       MSDB      777731056   PARENT  539256       14-MAY-12
    3       3       MSDB      777731056   CURRENT 2588214675544  02-OCT-17

    After the database opened using the fake recovery, it is necessary to recreate the database by exporting full back and re-importing.

     

    Corrective action plan (CAP)

    1. Correct the RMAN backup configuration including validating the backup.

    RMAN> print script MSDB_custom_rman_disk_db_bkp;

    printing stored script: 

    2. Review the RMAN backup configuration regularly at least once in a month

    3. Perform Drill tests to restore using RMAN backup

    4. Enable Flashback database with at least few hours of retention, so that performing a full restore task can be escaped.

    5. DR instance should be available all time for the critical database so that with the previous backup and then with incremental roll forward this task can be achieved quickly.

     

    Conclusion

    In this practice, we had to make bold decisions because of minor mistakes in not setting up the RMAN backup configuration properly. Hence, it is a best practice to review backup configurations at least once monthly and highly recommended to perform DR test with the backups on different servers, to ensure the production database is safe in all circumstances and easily RTO achievable.

    Tags: Oracle

    Nassyam Basha

    Written by Nassyam Basha

    Oracle DBA on 9i/10g/11g/12c with RAC 10g/11g on Linux/UNIX and Windows platforms including exposure on dBase, Foxpro, ORACLE 8i with forms & reports and always in front row to work on challenging tasks. I'm an Oracle 11g Certified Master and Oracle ACE Director. Specialties: ORACLE 8i/9i/10g/11g/12c RAC 10g/11g/12c, Strong Exposure on Data Guard. Grid Control/EM 11.1/12c/13c Exadata, Oracle Cloud Co-Author of Oracle DataGuard 11gR2 Beginners Guide[PACKT] Author of OTN articles Member of AIOUG, Speaker @OTN Tour, AIOUG Tech Day and SANGAM14/15/16, TROUG, OUGF Oracle Guru certified by My Oracle Support Community Frequent Contributer in OTN: https://community.oracle.com/people/CKPT , oracle-lists and etc.. Member of Customer advisory board(MOSC) Website: www.oracle-ckpt.com Member of OraWorld Team: Facebook Page: https://www.facebook.com/oraworldteam Tweet @oraworld_team www.oraworld-team.com Certifications: Oracle 11g Certified Master. Oracle Exadata Database Machine Certified Implementation Specialist Oracle 11g Certified Professional. Oracle 10g Certified Professional. Oracle Partner Certified Specialist