When moving a database from one system to another, one of my favorite migration methods is Data Guard. It has its restrictions - the destination must be the same platform (with the exceptions noted in MOS note #413484.1) and the same version unless you're performing an upgrade, and a block-for-block copy of the database has to be acceptable. Using Data Guard allows most of the heavy lifting to be done before the actual cutover, and it generally provides an easy backout procedure if there are issues.
I was in the process of creating a standby database on a customer's shiny new Exadata X8M to get them off their aging Exadata X5-2, when we hit an issue after enabling the Data Guard configuration. The database in question began on the X5-2 running 11.2.0.4, then was upgraded to 12.1.0.2, followed by 12.2.0.1, and finally 19c. The upgrades themselves had been rather smooth, thanks to excellent application vendor documentation for recommended patches and parameter settings, as well as the various upgrade notes found through MOS note #888828.1.
We cloned the database using our normal procedure - copy the password file from the primary to the standby, start the instance in nomount mode, and clone using RMAN. After the database had been cloned, we added the database to the broker configuration, and saw an error.
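For reference, the clone and broker setup themselves were nothing exotic; a rough sketch of the commands involved looks something like the following (the TNS aliases ac4dg_prim and ac6dg_stby are placeholders for this post, not the actual entries we used):

[oracle@enkx4db03c01 ~]$ rman target sys@ac4dg_prim auxiliary sys@ac6dg_stby
RMAN> duplicate target database for standby from active database nofilenamecheck;

[oracle@enkx4db03c01 ~]$ dgmgrl sys@ac4dg_prim
DGMGRL> create configuration acolvin_dg as primary database is ac4dg connect identifier is ac4dg_prim;
DGMGRL> add database ac6dg as connect identifier is ac6dg_stby maintained as physical;
DGMGRL> enable configuration;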
I have recreated the environment with new hostnames and database names to protect client details - in the examples here, database ac4dg is on the system to be retired, and ac6dg represents the new platform. When we enabled the Data Guard broker, we immediately saw issues:
DGMGRL> show configuration;

Configuration - acolvin_dg

  Protection Mode: MaxPerformance
  Members:
  ac4dg - Primary database
    Error: ORA-16778: redo transport error for one or more members

    ac6dg - Physical standby database
      Warning: ORA-16854: apply lag could not be determined

Fast-Start Failover:  Disabled

Configuration Status:
ERROR   (status updated 27 seconds ago)
SQL> select thread#, sequence#, archived, applied from v$archived_log order by 1,2;

   THREAD#  SEQUENCE# ARC APPLIED
---------- ---------- --- ---------
         1          7 YES NO
         1          8 YES NO
         1          9 YES NO
         1         10 YES NO
         1         11 YES NO
         1         12 YES NO
         1         13 YES NO
         1         14 YES NO

8 rows selected.
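Before going to the alert logs, a couple of quick checks can help narrow down a transport error like ORA-16778: on the primary, the archive destination views will usually surface the underlying error, and the broker can report per-member detail. A sketch of those checks (output omitted here):

SQL> select dest_id, status, error from v$archive_dest where status <> 'INACTIVE';

DGMGRL> show database ac6dg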
Checking the alert log on the primary database, we saw the following logged on instance 2:
2021-04-15T10:11:33.500586-05:00
TT00 (PID:106020): Error 1033 received logging on to the standby
2021-04-15T10:16:38.640301-05:00
TT00 (PID:106020): Error 1033 received logging on to the standby
2021-04-15T10:21:43.804389-05:00
TT00 (PID:106020): Error 1033 received logging on to the standby
2021-04-15T10:26:48.956976-05:00
TT00 (PID:106020): Error 1033 received logging on to the standby
2021-04-15T10:31:54.141234-05:00
TT00 (PID:106020): Error 1033 received logging on to the standby
We also saw issues being logged in the standby alert log on instance 2:
2021-04-14T15:12:08.200276-05:00
PR00 (PID:23854): Error 1017 received logging on to the standby
PR00 (PID:23854): -------------------------------------------------------------------------
PR00 (PID:23854): Check that the source and target databases are using a password file
PR00 (PID:23854): and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
PR00 (PID:23854): and that the SYS password is same in the password files,
PR00 (PID:23854): returning error ORA-16191
PR00 (PID:23854): -------------------------------------------------------------------------
PR00 (PID:23854): FAL: Error 16191 connecting to ac4dg for fetching gap sequence
2021-04-14T15:12:08.211820-05:00
Errors in file /u01/app/oracle/diag/rdbms/ac6dg/ac6dg2/trace/ac6dg2_pr00_23854.trc:
ORA-16191: Primary log shipping client not logged on standby
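The message text itself is a decent checklist. A quick way to verify the parameter it mentions, and to manually test the same SYSDBA connection that the redo transport and FAL processes are making (the TNS alias below is a placeholder), is something like:

SQL> show parameter remote_login_passwordfile

$ sqlplus sys@ac4dg_prim as sysdba

If the password files don't match on the two sides, this manual connection fails the same way the FAL client does.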
The logs gave some pretty specific messages, so I began to investigate. My standby creation process was the same as I'd used in the past - copy the password file from the primary to the standby, then after the standby database is cloned, move it to ASM. This was when I remembered that the primary database did not use a shared password file, but a file for each instance located in $ORACLE_HOME/dbs. This database had been originally created before RAC databases supported a shared password file, and the password file had never been updated to a single shared file on ASM.
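Incidentally, a quick way to tell whether clusterware knows about a shared password file is srvctl - when no shared file has been registered, the Password file attribute comes back empty and each instance reads its own copy from $ORACLE_HOME/dbs:

[oracle@enkx4db03c01 ~]$ srvctl config database -db ac4dg | grep Password
Password file: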
Sure enough, when I compared the password files on the primary, the md5sums differed between the two nodes. When you're running RAC with per-instance password files, changing the SYS password from sqlplus only updates the password file on the node you're connected to. I verified this in my lab by comparing the md5sums of the password files before and after changing the SYS password:
[oracle@enkx4db03c01 ~]$ dcli -l oracle -g dbs_group 'md5sum /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapw*'
enkx4db03c01: 74af939a4301ff6a0be9c2a31a1777c9 /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg1
enkx4db04c01: 74af939a4301ff6a0be9c2a31a1777c9 /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg2
[oracle@enkx4db03c01 ~]$ dcli -l oracle -g dbs_group 'ls -al /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapw*'
enkx4db03c01: -rw-r----- 1 oracle oinstall 2048 Apr 14 14:00 /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg1
enkx4db04c01: -rw-r----- 1 oracle oinstall 2048 Apr 14 14:00 /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg2
-- change password --
[oracle@enkx4db03c01 ~]$ dcli -l oracle -g dbs_group 'md5sum /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapw*'
enkx4db03c01: 50ba6684c119fbc25960167813c63bae /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg1
enkx4db04c01: 74af939a4301ff6a0be9c2a31a1777c9 /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg2
[oracle@enkx4db03c01 ~]$ dcli -l oracle -g dbs_group 'ls -al /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapw*'
enkx4db03c01: -rw-r----- 1 oracle oinstall 2048 Apr 14 14:11 /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg1
enkx4db04c01: -rw-r----- 1 oracle oinstall 2048 Apr 14 14:00 /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg2
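The "-- change password --" step in the middle was just an ordinary SYS password change run from node 1, along these lines (the password is a placeholder, of course):

SQL> alter user sys identified by "placeholder_only";

User altered.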
The first order of business was to get the password file moved to a shared location. We could easily do that with asmcmd's pwcopy command:
ASMCMD> pwcopy --dbuniquename ac4dg /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg1 +DATAC1/ac4dg/orapwac4dg
copying /u01/app/oracle/product/19.0.0.0/dbhome_1/dbs/orapwac4dg1 -> +DATAC1/ac4dg/orapwac4dg
ASMCMD> ls -l +DATAC1/ac4dg/orapwac4dg
Type      Redund  Striped  Time             Sys  Name
PASSWORD  HIGH    COARSE   APR 15 10:00:00  N    orapwac4dg => +DATAC1/ac4dg/PASSWORD/pwdac4dg.351.1069929821
[oracle@enkx4db03c01 ~]$ srvctl config database -db ac4dg | egrep 'unique|Password'
Database unique name: ac4dg
Password file: +DATAC1/ac4dg/orapwac4dg
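Because pwcopy was run with the --dbuniquename option, asmcmd also updated the password file location on the database resource in clusterware, which is why srvctl already reports the new path. If you copy the file some other way, the equivalent manual registration would look something like this:

[oracle@enkx4db03c01 ~]$ srvctl modify database -db ac4dg -pwfile +DATAC1/ac4dg/orapwac4dg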
That part was simple, but would anything else be required? I initially thought we might need to perform a rolling restart of the instances after moving the password file to a shared location, but it was actually much easier - we just needed to bounce the Data Guard broker on the primary:
SQL> alter system set dg_broker_start=false sid='*' scope=both;
System altered.
SQL> alter system set dg_broker_start=true sid='*' scope=both;
System altered.
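For what it's worth, an alternative that avoids touching the parameter would likely have been to cycle redo transport from DGMGRL instead, something along these lines:

DGMGRL> edit database ac4dg set state='TRANSPORT-OFF';
DGMGRL> edit database ac4dg set state='TRANSPORT-ON';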
Once we bounced the broker on the primary database, log shipping picked up and the logs were applied:
DGMGRL> show configuration;

Configuration - acolvin_dg

  Protection Mode: MaxPerformance
  Members:
  ac4dg - Primary database
    ac6dg - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 0 seconds ago)
SQL> select thread#, sequence#, archived, applied from v$archived_log order by 1,2;

   THREAD#  SEQUENCE# ARC APPLIED
---------- ---------- --- ---------
         1          7 YES YES
         1          8 YES YES
         1          9 YES YES
         1         10 YES YES
         1         11 YES YES
         1         12 YES YES
         1         13 YES YES
         1         14 YES YES
         2          7 YES YES
         2          8 YES YES
         2          9 YES YES

11 rows selected.
Overall, the lesson I took away from this is that shared password files are more than just a cool feature in RAC - they should be treated as a necessity. It's also very easy to keep relying on older functionality and methods through numerous upgrades, but sometimes you're better off taking advantage of the new features once the upgrade is done.