
Oracle Grid Infrastructure Cluster Startup Troubleshooting Guide


Cluster Startup Sequence

The operating system initiates the ohasd process, which subsequently launches agents responsible for starting core daemons such as gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, and evmd. The crsd daemon then utilizes agents to bring up user-defined resources like databases, SCAN listeners, and local listeners.
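
To see how far a node has progressed in this sequence, you can list the daemon processes directly (a quick sanity check; the process names below are the Linux defaults and may vary by platform and version):

ps -ef | egrep 'ohasd|gipcd|mdnsd|gpnpd|octssd|ocssd|crsd|evmd' | grep -v grep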

For a comprehensive breakdown of the Grid Infrastructure cluster boot order, refer to Note 1053147.1.

Verifying Cluster State

Inspecting Cluster and Daemon Status

Execute the following commands to assess the health of the cluster services:

$GI_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GI_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       cluster_node1            Started
ora.crsd
      1        ONLINE  ONLINE       cluster_node1
ora.cssd
      1        ONLINE  ONLINE       cluster_node1
ora.cssdmonitor
      1        ONLINE  ONLINE       cluster_node1
ora.ctssd
      1        ONLINE  ONLINE       cluster_node1            OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       cluster_node1
ora.drivers.acfs
      1        ONLINE  ONLINE       cluster_node1
ora.evmd
      1        ONLINE  ONLINE       cluster_node1
ora.gipcd
      1        ONLINE  ONLINE       cluster_node1
ora.gpnpd
      1        ONLINE  ONLINE       cluster_node1
ora.mdnsd
      1        ONLINE  ONLINE       cluster_node1

In versions 11.2.0.2 and newer, two additional resources appear:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       cluster_node1
ora.crf
      1        ONLINE  ONLINE       cluster_node1

On non-Exadata systems running 11.2.0.3 or higher, ora.diskmon may report offline:

ora.diskmon
      1        OFFLINE OFFLINE      cluster_node1

Versions 12c and later include the ora.storage resource:

ora.storage
      1        ONLINE  ONLINE       cluster_node1            STABLE

Restarting Offline Daemons:

If specific daemons report offline, attempt to start them manually:

$GI_HOME/bin/crsctl start res ora.crsd -init
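
To check a single daemon before and after the restart attempt, query its state individually (substitute the resource you are working on):

$GI_HOME/bin/crsctl stat res ora.crsd -init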

Issue 1: OHASD Failure to Start

The ohasd.bin process is critical: it directly or indirectly starts every other cluster process. If ohasd.bin is not running, resource checks return CRS-4639 (Could not contact Oracle High Availability Services). Attempting to start it while it is already running triggers CRS-4640. A startup failure typically yields:

CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.

Automatic startup depends on several configuration factors:

1. Operating System Run Level Configuration

The OS must reach the appropriate run level before CRS can start. Determine the required run level via:

grep init.ohasd /etc/inittab
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

Note: Oracle Linux 6/Red Hat 6 utilize /etc/init/oracle-ohasd.conf instead of inittab, though the init script remains compatible. Oracle Linux 7/Red Hat 7 employ systemd (e.g., /etc/systemd/system/oracle-ohasd.service).

The example above indicates CRS requires run level 3 or 5. Verify the current OS run level:

who -r

2. Execution of "init.ohasd run"

On Linux/Unix, init (PID 1) spawns the init.ohasd run process based on /etc/inittab. If this fails, ohasd.bin cannot launch:

ps -ef | grep init.ohasd | grep -v grep
root      3421     1  0 19:20 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run

If startup scripts (e.g., S98gcstartup in rcN.d) hang, init may never execute /etc/init.d/init.ohasd run. Errors like CRS-0715: Oracle High Availability Service has timed out waiting for init.ohasd to be started may appear.

Temporary Workaround:

cd <path-to-init.ohasd>
nohup ./init.ohasd run &

3. Clusterware Auto-Start Configuration

Auto-start is enabled by default. Verify and enable it using:

$GI_HOME/bin/crsctl enable crs
$GI_HOME/bin/crsctl config crs

If OS logs contain:

Feb 29 16:20:36 cluster_node1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 cluster_node1 logger: Could not access /var/opt/oracle/scls_scr/cluster_node1/root/ohasdstr

This indicates the configuration file is missing or inaccessible, often due to manual changes or incorrect patching tools.

4. Syslogd and Init Script Execution

If the OS hangs on other Snn scripts, S96ohasd may not execute. Check OS logs for:

Jan 20 20:46:51 cluster_node1 logger: Oracle HA daemon is enabled for autostart.

Missing entries may also indicate syslogd (/usr/sbin/syslogd) is not fully running. To debug, modify the script to touch a timestamp file:

# Modification in S96ohasd
case `$CAT $AUTOSTARTFILE` in
  enable*)
    /bin/touch /tmp/ohasd.start."`date`"
    $LOGERR "Oracle HA daemon is enabled for autostart."

If no /tmp/ohasd.start.<timestamp> file is created, the OS is stuck on prior scripts. If the file is created but the log entry is missing, syslogd is the issue. A temporary fix involves adding a sleep delay before the logging call:

case `$CAT $AUTOSTARTFILE` in
  enable*)
    /bin/sleep 120
    $LOGERR "Oracle HA daemon is enabled for autostart."

5. GRID_HOME Filesystem Availability

Ensure the filesystem hosting GRID_HOME is mounted when S96ohasd executes. Logs should show:

Jan 20 20:46:51 cluster_node1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 cluster_node1 logger: exec /u01/app/19c/grid/perl/bin/perl -I/u01/app/19c/grid/perl/lib /u01/app/19c/grid/bin/crswrapexece.pl ...

Missing the execution line suggests the filesystem was not mounted in time.
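
A quick way to confirm the mount (assuming GRID_HOME is /u01/app/19c/grid as in the log above; adjust for your layout):

df -h /u01/app/19c/grid      # the filesystem must be mounted
grep u01 /etc/fstab          # and configured to mount automatically at boot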

6. Oracle Local Registry (OLR) Integrity

Verify OLR accessibility:

ls -l $GI_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 cluster_node1.olr

Corruption or permission issues generate errors in ohasd.log:

2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /u01/app/19c/grid/cdata/cluster_node1.olr, errno=2
...
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR

Restore from backup using ocrconfig -local -restore. Backups reside in $GI_HOME/cdata/$HOST/backup_$TIME_STAMP.olr.
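
Before restoring, the corruption can usually be confirmed with ocrcheck, which validates the local registry (run as root; output format varies by version):

$GI_HOME/bin/ocrcheck -local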

7. Network Socket File Access

ohasd.bin requires access to socket files. Permission errors appear as:

2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

In Grid Infrastructure, socket files should be owned by root. In Oracle Restart, they belong to grid.

8. Log Directory Accessibility

Ensure log directories exist and have correct permissions. Errors look like:

Feb 20 10:47:08 cluster_node1 OHASD[9566]: OHASD exiting; Directory /u01/app/19c/grid/log/cluster_node1/ohasd not found.
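
To verify the directory exists with sane ownership (the path matches the error above; on 12c and later much of the clusterware logging moves under the ADR base, so adjust accordingly):

ls -ld $GI_HOME/log/`hostname -s`/ohasd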

9. SUSE Linux Specifics

On SUSE systems, ohasd may fail post-reboot. Refer to Note 1325718.1.

10. Process Hangs (Bug 11834289)

If ohasd.bin is running but logs are stagnant, and truss shows repeated close() errors on invalid file descriptors:

15058/1:         0.1995 close(2147483646)                                       Err#9 EBADF

This indicates Bug 11834289, fixed in 11.2.0.3+. Symptoms include CRS-5802: Unable to start the agent process.

11. OLR Corruption Symptoms

If crsctl check crs shows only CRS-4638 and crsctl stat res -p -init returns nothing, OLR is likely corrupted. Refer to Note 1193643.1.

12. EL7/OL7 Specific Issues

For Enterprise Linux 7, ensure Patch 25606616 is applied. Installation failures during root.sh often relate to ohasd startup issues (Note 1959008.1).

13. Log Analysis

Always review $GI_HOME/log/<hostname>/ohasd/ohasd.log and ohasdOUT.log for detailed failure reasons.
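
A quick scan for fatal entries often points at the failing subsystem (a minimal example; tune the patterns to taste):

grep -iE 'panic|error|fail' $GI_HOME/log/`hostname -s`/ohasd/ohasd.log | tail -20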

Issue 2: OHASD Agents Failure

OHASD.BIN spawns four agents/monitors:

  • oraagent: Starts ora.asm, ora.evmd, ora.gipcd, etc.
  • orarootagent: Starts ora.crsd, ora.ctssd, ora.diskmon, etc.
  • cssdagent / cssdmonitor: Starts ora.cssd and ora.cssdmonitor.

Failure here prevents cluster operation.

1. Permission Issues

Incorrect ownership on agent logs or binaries causes startup failures. Example log:

2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /u01/app/19c/grid/bin/orarootagent ... no exe permission

Ensure post-patch scripts like rootcrs.pl -patch are executed.

2. Binary Corruption

Damaged agent binaries (oraagent.bin, etc.) prevent resource startup:

2011-05-03 12:03:17.491: [    AGFW][1117866336] Created alert : (:CRSAGF00130:) :  Failed to start the agent /u01/app/19c/grid/bin/orarootagent_grid

Compare binaries with a healthy node and restore if necessary.
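
One simple comparison is to checksum the agent binaries on both the failing and a healthy node and compare the output (a sketch; any checksum utility works):

# run on each node, then compare the results
cksum $GI_HOME/bin/oraagent.bin $GI_HOME/bin/orarootagent.bin
ls -l $GI_HOME/bin/oraagent.bin $GI_HOME/bin/orarootagent.bin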

Issue 3: OCSSD.BIN Startup Failure

ocssd.bin requires several conditions to be met before it can start:

1. GPnP Profile Accessibility

gpnpd must be running to serve the profile. Success logs:

2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: ... Profile verified.

Failure logs:

2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: ... Result: (13) CLSGPNP_NO_DAEMON.

2. Voting Disk Accessibility

ocssd.bin reads Voting Disk info from the GPnP profile. If inaccessible:

2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found

If voting files are being modified, start in exclusive mode:

$GI_HOME/bin/crsctl start res ora.cssd -init -env "CSSD_MODE=-X"

Non-ASM voting disks require specific permissions:

-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1
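
Once ocssd.bin is up (including in exclusive mode), the configured voting files can be listed to confirm they are visible:

$GI_HOME/bin/crsctl query css votedisk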

3. Network and DNS

Network binding failures appear in logs:

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet

Private network connectivity issues cause heartbeats to fail:

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, cluster_node1, has a disk HB, but no network HB

Verify network configuration per Note 1054902.1.
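
As a first pass, list the interfaces registered with the cluster and confirm the private interconnect answers (oifcfg ships with GI; substitute a peer node's private address):

$GI_HOME/bin/oifcfg getif
ping -c 3 <peer-private-ip>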

4. Third-Party Clusterware

If using vendor clusterware, ensure it starts before CRS. Verify with:

$GI_HOME/bin/lsnodes -n
cluster_node1    1

Failures manifest as skgxncin failed in logs.

5. Version Mismatch

Running crsctl from an incorrect GRID_HOME causes version assertions:

2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0)

Issue 4: CRSD.BIN Startup Failure

If ora.crsd is INTERMEDIATE, communication with the master crsd.bin may be failing. Killing the master crsd.bin process on the active node forces a re-election.

1. OCSSD Dependency

crsd.log will show CSS not ready:

2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS.

2. OCR Accessibility

If OCR is on ASM, ora.asm must be up. Errors include:

ORA-15077: could not locate ASM instance serving a required diskgroup
...
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26

For file-based OCR, check permissions:

-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr

Permission changes on the grid user can cause ORA-01031: insufficient privileges.

3. PID File Inconsistency

Check $GI_HOME/crs/init/<hostname>.pid. If the PID points to a wrong process (e.g., iscsid):

cat /u01/app/19.0.0.0/grid/crs/init/edwrac1.pid
21508
ps -ef | grep 21508
root      21508      1  1  2024 ?      2-00:21:50 /u01/app/19.0.0.0/grid/bin/crsd.bin reboot

If the PID file is stale or points elsewhere, remove it and restart:

# > $GI_HOME/crs/init/<hostname>.pid
# $GI_HOME/bin/crsctl stop res ora.crsd -init
# $GI_HOME/bin/crsctl start res ora.crsd -init

4. Network Resolution

Network failures prevent OCR initialization:

2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations

5. Binary Permissions

Ensure crsd.bin has correct ownership:

$ ls -l /u01/app/19.0.0.0/grid/bin/crsd.bin
-rwxr----- 1 root oinstall 390616 Apr 12  2024 /u01/app/19.0.0.0/grid/bin/crsd.bin

Issue 5: GPNPD.BIN Startup Failure

1. DNS Resolution

Failures to connect to peers indicate DNS issues:

2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnp_connect: ... Failed to connect to call url "tcp://node2:9393"

Verify ping and firewall settings between nodes.
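
Basic connectivity checks from the failing node (hostname taken from the log above; any resolver tool will do):

ping -c 2 node2
nslookup node2      # or: getent hosts node2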

2. Known Bugs

Bug 10105195 can block dispatch threads. Fixed in 11.2.0.2 PSU2 and 11.2.0.3+.

Issue 6: Other Daemon Failures

Common causes include:

  1. Log Permissions: Incorrect ownership on log paths prevents logging and startup.
  2. Socket Permissions: Errors like Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD)) indicate socket file issues.
  3. OLR Corruption: Daemons like ctssd may abort if OLR versions are invalid.

OLR Restoration:

Backup OLR:

$GI_HOME/bin/ocrconfig -local -manualbackup

Check backups:

$GI_HOME/bin/ocrconfig -local -showbackup

Restore (ensure GI is down):

# <GI_HOME>/bin/crsctl stop crs -f
# <GI_HOME>/bin/ocrconfig -local -restore <olr-backup>
# <GI_HOME>/bin/crsctl start crs

If patch levels mismatch after restore, run:

<GI_HOME>/crs/install/rootcrs.sh -prepatch
<GI_HOME>/crs/install/rootcrs.sh -postpatch
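
To confirm the patch levels agree afterwards, 12c and later offer these crsctl queries (a hedged check; not available in 11.2):

$GI_HOME/bin/crsctl query crs softwarepatch
$GI_HOME/bin/crsctl query crs activeversion -f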

Issue 7: CRSD Agents Failure

CRSD.BIN spawns orarootagent and oraagent for user resources. Failures often stem from log permissions or Bug 11834289.

Issue 8: HAIP Failure

Resource ora.cluster_interconnect.haip may fail automatically. Consult Note 1210883.1.

General Prerequisites

Network and DNS

Cluster startup relies heavily on network functionality. Validate resolution and connectivity per Note 1054902.1.

Log File Permissions

Correct ownership of $GI_HOME/log subdirectories is vital. In a Grid Infrastructure environment (node cluster_node1, owner grid):

drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 cluster_node1
    drwxrwxrwt 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd

Ensure recursive permissions match a healthy node.

Network Socket Files

Socket files reside in /tmp/.oracle, /var/tmp/.oracle, or /usr/tmp/.oracle. Incorrect permissions cause clsclisten: Permission denied errors.

Resolution (see the sketch after these steps):

  1. Stop GI as root.
  2. Remove socket files in the .oracle directory.
  3. Restart GI.
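
Putting the steps together (a sketch for Linux, run as root; the socket directory varies by platform as noted above):

# <GI_HOME>/bin/crsctl stop crs -f
# rm -f /var/tmp/.oracle/*
# <GI_HOME>/bin/crsctl start crs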

Example healthy socket directory:

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD

Diagnostic Collection

If issues persist, collect diagnostics on all nodes as root:

$GI_HOME/bin/diagcollection.sh

Analyze the generated .gz archives for deeper insights.
