Oracle Grid Infrastructure Cluster Startup Troubleshooting Guide
Cluster Startup Sequence
The operating system initiates the ohasd process, which subsequently launches agents responsible for starting core daemons such as gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, and evmd. The crsd daemon then utilizes agents to bring up user-defined resources like databases, SCAN listeners, and local listeners.
For a comprehensive breakdown of the Grid Infrastructure cluster boot order, refer to Note 1053147.1.
Verifying Cluster State
Inspecting Cluster and Daemon Status
Execute the following commands to assess the health of the cluster services:
$GI_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
$GI_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE ONLINE cluster_node1 Started
ora.crsd
1 ONLINE ONLINE cluster_node1
ora.cssd
1 ONLINE ONLINE cluster_node1
ora.cssdmonitor
1 ONLINE ONLINE cluster_node1
ora.ctssd
1 ONLINE ONLINE cluster_node1 OBSERVER
ora.diskmon
1 ONLINE ONLINE cluster_node1
ora.drivers.acfs
1 ONLINE ONLINE cluster_node1
ora.evmd
1 ONLINE ONLINE cluster_node1
ora.gipcd
1 ONLINE ONLINE cluster_node1
ora.gpnpd
1 ONLINE ONLINE cluster_node1
ora.mdnsd
1 ONLINE ONLINE cluster_node1
In versions 11.2.0.2 and newer, two additional resources appear:
ora.cluster_interconnect.haip
1 ONLINE ONLINE cluster_node1
ora.crf
1 ONLINE ONLINE cluster_node1
On non-Exadata systems running 11.2.0.3 or higher, ora.diskmon may report offline:
ora.diskmon
1 OFFLINE OFFLINE cluster_node1
Versions 12c and later include the ora.storage resource:
ora.storage
1 ONLINE ONLINE cluster_node1 STABLE
Restarting Offline Daemons
If specific daemons report offline, attempt to start them manually:
$GI_HOME/bin/crsctl start res ora.crsd -init
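When several init resources are down, scanning the status output by eye gets tedious. The helper below is an illustrative sketch, not an Oracle tool: it parses output in the format of crsctl stat res -t -init, finds resources whose state line reads OFFLINE, and prints the corresponding restart commands. A small sample stands in for live cluster output.

```shell
# Illustrative helper, not an Oracle tool: scan "crsctl stat res -t -init"
# style output and print a restart command for every resource whose state
# line shows OFFLINE. A sample stands in for live cluster output here.
list_offline_init_resources() {
  awk '
    /^ora\./         { res = $1; next }   # resource name line
    res && /OFFLINE/ { print res }        # its state line shows OFFLINE
    { res = "" }
  '
}

sample='ora.crsd
      1        ONLINE  ONLINE       cluster_node1
ora.diskmon
      1        OFFLINE OFFLINE      cluster_node1'

printf '%s\n' "$sample" | list_offline_init_resources |
while read -r res; do
  echo "\$GI_HOME/bin/crsctl start res $res -init"
done
```

For the sample above this prints a restart command for ora.diskmon only.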
Issue 1: OHASD Failure to Start
The ohasd.bin process is critical; it directly or indirectly starts all other cluster processes. If ohasd.bin is not running, resource checks return CRS-4639 (Could not contact Oracle High Availability Services). Attempting to start it while already running triggers CRS-4640. A startup failure typically yields:
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
Automatic startup depends on several configuration factors:
1. Operating System Run Level Configuration
The OS must reach the appropriate run level before CRS initiation. Determine the required run level via:
grep init.ohasd /etc/inittab
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6/Red Hat 6 utilize /etc/init/oracle-ohasd.conf instead of inittab, though the init script remains compatible. Oracle Linux 7/Red Hat 7 employ systemd (e.g., /etc/systemd/system/oracle-ohasd.service).
The example above indicates CRS requires run level 3 or 5. Verify the current OS run level:
who -r
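The required run levels are encoded in the second colon-delimited field of the inittab entry ("35" means levels 3 and 5). As a quick sanity check, that field can be extracted and compared with the who -r output; the sketch below uses the example entry quoted in this guide.

```shell
# Sketch: extract the run levels from the init.ohasd inittab entry
# (second colon-separated field) for comparison with "who -r".
# The line below is the example entry quoted in this guide.
inittab_line='h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null'

required=$(printf '%s' "$inittab_line" | cut -d: -f2)
echo "init.ohasd runs at run levels: $required"   # "35" = levels 3 and 5

# Compare against the live system with:  who -r
```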
2. Execution of "init.ohasd run"
On Linux/Unix, init (PID 1) spawns the init.ohasd run process based on /etc/inittab. If this fails, ohasd.bin cannot launch:
ps -ef | grep init.ohasd | grep -v grep
root 3421 1 0 19:20 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run
If startup scripts (e.g., S98gcstartup in rcn.d) hang, init may never execute /etc/init.d/init.ohasd run. Errors like CRS-0715: Oracle High Availability Service has timed out waiting for init.ohasd to be started may appear.
Temporary Workaround:
cd <path-to-init.ohasd>
nohup ./init.ohasd run &
3. Clusterware Auto-Start Configuration
Auto-start is enabled by default. Verify and enable it using:
$GI_HOME/bin/crsctl enable crs
$GI_HOME/bin/crsctl config crs
If OS logs contain:
Feb 29 16:20:36 cluster_node1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 cluster_node1 logger: Could not access /var/opt/oracle/scls_scr/cluster_node1/root/ohasdstr
This indicates the configuration file is missing or inaccessible, often due to manual changes or incorrect patching tools.
4. Syslogd and Init Script Execution
If the OS hangs on other Snn scripts, S96ohasd may not execute. Check OS logs for:
Jan 20 20:46:51 cluster_node1 logger: Oracle HA daemon is enabled for autostart.
Missing entries may also indicate syslogd (/usr/sbin/syslogd) is not fully running. To debug, modify the script to touch a timestamp file:
# Modification in S96ohasd
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/touch /tmp/ohasd.start."`date`"
$LOGERR "Oracle HA daemon is enabled for autostart."
If the file /tmp/ohasd.start.timestamp is not created, the OS is stuck on prior scripts. If created but logs are missing, syslogd is the issue. A temporary fix involves adding a sleep delay:
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/sleep 120
$LOGERR "Oracle HA daemon is enabled for autostart."
5. GRID_HOME Filesystem Availability
Ensure the filesystem hosting GRID_HOME is mounted when S96ohasd executes. Logs should show:
Jan 20 20:46:51 cluster_node1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 cluster_node1 logger: exec /u01/app/19c/grid/perl/bin/perl -I/u01/app/19c/grid/perl/lib /u01/app/19c/grid/bin/crswrapexece.pl ...
Missing the execution line suggests the filesystem was not mounted in time.
6. Oracle Local Registry (OLR) Integrity
Verify OLR accessibility:
ls -l $GI_HOME/cdata/*.olr
-rw------- 1 root oinstall 272756736 Feb 2 18:20 cluster_node1.olr
Corruption or permission issues generate errors in ohasd.log:
2010-01-24 22:59:10.472: [ OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /u01/app/19c/grid/cdata/cluster_node1.olr, errno=2
...
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR
Restore from backup using ocrconfig -local -restore. Backups reside in $GI_HOME/cdata/$HOST/backup_$TIME_STAMP.olr.
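Before restoring, identify the newest backup in the cdata directory. The snippet below is an illustrative sketch: a temporary directory holding dummy backup files stands in for $GI_HOME/cdata/$HOST, and the newest file is selected by modification time.

```shell
# Sketch: pick the most recent OLR backup by modification time, for use
# with "ocrconfig -local -restore". A temp directory with dummy backup
# files stands in for $GI_HOME/cdata/$HOST here.
backupdir=$(mktemp -d)
touch -d '2 days ago' "$backupdir/backup_20240101_120000.olr"
touch                 "$backupdir/backup_20240103_120000.olr"

latest=$(ls -1t "$backupdir"/backup_*.olr | head -1)
echo "latest OLR backup: $latest"
rm -rf "$backupdir"
```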
7. Network Socket File Access
ohasd.bin requires access to socket files. Permission errors appear as:
2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))
In Grid Infrastructure, socket files should be owned by root. In Oracle Restart, they belong to grid.
8. Log Directory Accessibility
Ensure log directories exist and have correct permissions. Errors look like:
Feb 20 10:47:08 cluster_node1 OHASD[9566]: OHASD exiting; Directory /u01/app/19c/grid/log/cluster_node1/ohasd not found.
9. SUSE Linux Specifics
On SUSE systems, ohasd may fail post-reboot. Refer to Note 1325718.1.
10. Process Hangs (Bug 11834289)
If ohasd.bin is running but its logs are stagnant, and truss shows repeated close() errors on invalid file descriptors:
15058/1: 0.1995 close(2147483646) Err#9 EBADF
This indicates Bug 11834289, fixed in 11.2.0.3+. Symptoms include CRS-5802: Unable to start the agent process.
11. OLR Corruption Symptoms
If crsctl check crs shows only CRS-4638 and crsctl stat res -p -init returns nothing, OLR is likely corrupted. Refer to Note 1193643.1.
12. EL7/OL7 Specific Issues
For Enterprise Linux 7, ensure Patch 25606616 is applied. Installation failures during root.sh often relate to ohasd startup issues (Note 1959008.1).
13. Log Analysis
Always review $GI_HOME/log/<hostname>/ohasd/ohasd.log and ohasdOUT.log for detailed failure reasons.
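A first pass over ohasd.log usually means pulling the PANIC markers, CRS- error codes, and "failed" messages from the tail of the file. The grep pattern below is a sketch; a temporary file holding two log lines quoted earlier in this guide stands in for the real ohasd.log.

```shell
# Sketch: surface the usual failure markers (PANIC, CRS-nnnn, "failed")
# from the tail of ohasd.log. A temp file with two log lines quoted in
# this guide stands in for the real log.
log=$(mktemp)
cat > "$log" <<'EOF'
2010-01-24 22:59:10.472: [ OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR
EOF

tail -200 "$log" | grep -E 'PANIC|CRS-[0-9]+|failed'
rm -f "$log"
```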
Issue 2: OHASD Agents Failure
OHASD.BIN spawns four agents/monitors:
- oraagent: starts ora.asm, ora.evmd, ora.gipcd, etc.
- orarootagent: starts ora.crsd, ora.ctssd, ora.diskmon, etc.
- cssdagent/cssdmonitor: starts ora.cssd and ora.cssdmonitor.
Failure here prevents cluster operation.
1. Permission Issues
Incorrect ownership on agent logs or binaries causes startup failures. Example log:
2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /u01/app/19c/grid/bin/orarootagent ... no exe permission
Ensure post-patch scripts like rootcrs.pl -patch are executed.
2. Binary Corruption
Damaged agent binaries (oraagent.bin, etc.) prevent resource startup:
2011-05-03 12:03:17.491: [ AGFW][1117866336] Created alert : (:CRSAGF00130:) : Failed to start the agent /u01/app/19c/grid/bin/orarootagent_grid
Compare binaries with a healthy node and restore if necessary.
Issue 3: OCSSD.BIN Startup Failure
ocssd.bin requires several conditions to be met before it can start:
1. GPnP Profile Accessibility
gpnpd must be running to serve the profile. Success logs:
2010-02-02 18:00:16.263: [ GPnP][408926240]clsgpnp_profileVerifyForCall: ... Profile verified.
Failure logs:
2010-02-03 22:26:17.057: [ GPnP][3852126240]clsgpnp_getProfileEx: ... Result: (13) CLSGPNP_NO_DAEMON.
2. Voting Disk Accessibility
ocssd.bin reads Voting Disk info from the GPnP profile. If inaccessible:
2010-02-03 22:37:22.227: [ CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
If voting files are being modified, start in exclusive mode:
$GI_HOME/bin/crsctl start res ora.cssd -init -env "CSSD_MODE=-X"
Non-ASM voting disks require specific permissions:
-rw-r----- 1 ogrid oinstall 21004288 Feb 4 09:13 votedisk1
3. Network and DNS
Network binding failures appear in logs:
2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet
Private network connectivity issues cause heartbeats to fail:
2010-09-20 11:52:54.014: [ CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, cluster_node1, has a disk HB, but no network HB
Verify network configuration per Note 1054902.1.
4. Third-Party Clusterware
If using vendor clusterware, ensure it starts before CRS. Verify with:
$GI_HOME/bin/lsnodes -n
cluster_node1 1
Failures manifest as skgxncin failed in logs.
5. Version Mismatch
Running crsctl from an incorrect GRID_HOME causes version assertions:
2012-11-14 10:21:44.014: [ CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0)
Issue 4: CRSD.BIN Startup Failure
If ora.crsd is INTERMEDIATE, communication with the master crsd.bin may be failing. Kill the master process on the active node to force re-election.
1. OCSSD Dependency
crsd.log will show CSS not ready:
2010-02-03 22:37:51.639: [ CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS.
2. OCR Accessibility
If OCR is on ASM, ora.asm must be up. Errors include:
ORA-15077: could not locate ASM instance serving a required diskgroup
...
2010-02-03 22:22:55.190: [ CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26
For file-based OCR, check permissions:
-rw-r----- 1 root oinstall 272756736 Feb 3 23:24 ocr
Permission changes on the grid user can cause ORA-01031: insufficient privileges.
3. PID File Inconsistency
Check $GI_HOME/crs/init/<hostname>.pid. If the PID points to a wrong process (e.g., iscsid):
cat /u01/app/19.0.0.0/grid/crs/init/edwrac1.pid
21508
ps -ef | grep 21508
root 21508 1 1 2024 ? 2-00:21:50 /u01/app/19.0.0.0/grid/bin/crsd.bin reboot
If the PID file is stale or points elsewhere, remove it and restart:
# > $GI_HOME/crs/init/<hostname>.pid
# $GI_HOME/bin/crsctl stop res ora.crsd -init
# $GI_HOME/bin/crsctl start res ora.crsd -init
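The stale-PID check itself can be scripted: read the PID from the file and confirm the process it points at really is crsd.bin. The sketch below is Linux-specific (it reads /proc) and, for demonstration, records the current shell's own PID in a temporary file, so the check reports a stale entry.

```shell
# Sketch (Linux): confirm the PID in the crsd PID file belongs to
# crsd.bin. A temp file holding this shell's own PID stands in for
# $GI_HOME/crs/init/<hostname>.pid, so the check reports it as stale.
pidfile=$(mktemp)
echo $$ > "$pidfile"

pid=$(cat "$pidfile")
comm=$(cat /proc/"$pid"/comm 2>/dev/null)
if [ "$comm" = "crsd.bin" ]; then
  echo "PID $pid is crsd.bin: PID file looks valid"
else
  echo "PID $pid is '${comm:-gone}': PID file is stale"
fi
rm -f "$pidfile"
```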
4. Network Resolution
Network failures prevent OCR initialization:
2010-02-03 23:34:28.434: [ CRSOCR][2235814832] OCR context init failure. Error: PROC-44: Error in network address and interface operations
5. Binary Permissions
Ensure crsd.bin has correct ownership:
$ ls -l /u01/app/19.0.0.0/grid/bin/crsd.bin
-rwxr----- 1 root oinstall 390616 Apr 12 2024 /u01/app/19.0.0.0/grid/bin/crsd.bin
Issue 5: GPNPD.BIN Startup Failure
1. DNS Resolution
Failures to connect to peers indicate DNS issues:
2010-05-13 12:48:11.541: [ GPnP][1171126592]clsgpnp_connect: ... Failed to connect to call url "tcp://node2:9393"
Verify ping and firewall settings between nodes.
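A quick way to rule out name resolution before digging into gpnpd internals is a getent lookup of the peer name taken from the failing call url. In the sketch below, localhost stands in for the real remote node name.

```shell
# Sketch: check that the peer node name from the failing call url
# resolves at all. "localhost" stands in for the real peer here.
peer=localhost
if getent hosts "$peer" >/dev/null 2>&1; then
  echo "$peer resolves"
else
  echo "$peer does not resolve: check DNS, /etc/hosts, and firewall rules"
fi
```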
2. Known Bugs
Bug 10105195 can block dispatch threads. Fixed in 11.2.0.2 PSU2 and 11.2.0.3+.
Issue 6: Other Daemon Failures
Common causes include:
- Log Permissions: Incorrect ownership on log paths prevents logging and startup.
- Socket Permissions: Errors like Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD)) indicate socket file issues.
- OLR Corruption: Daemons like ctssd may abort if OLR versions are invalid.
OLR Restoration:
Backup OLR:
$GI_HOME/bin/ocrconfig -local -manualbackup
Check backups:
$GI_HOME/bin/ocrconfig -local -showbackup
Restore (ensure GI is down):
# <GI_HOME>/bin/crsctl stop crs -f
# <GI_HOME>/bin/ocrconfig -local -restore <olr-backup>
# <GI_HOME>/bin/crsctl start crs
If patch levels mismatch after restore, run:
<GI_HOME>/crs/install/rootcrs.sh -prepatch
<GI_HOME>/crs/install/rootcrs.sh -postpatch
Issue 7: CRSD Agents Failure
CRSD.BIN spawns orarootagent and oraagent for user resources. Failures often stem from log permissions or Bug 11834289.
Issue 8: HAIP Failure
Resource ora.cluster_interconnect.haip may fail automatically. Consult Note 1210883.1.
General Prerequisites
Network and DNS
Cluster startup relies heavily on network functionality. Validate resolution and connectivity per Note 1054902.1.
Log File Permissions
Correct ownership of $GI_HOME/log subdirectories is vital. In a Grid Infrastructure environment (node cluster_node1, owner grid):
drwxrwxr-x 5 grid oinstall 4096 Dec 6 09:20 log
drwxr-xr-t 17 root oinstall 4096 Dec 6 09:22 cluster_node1
drwxrwxrwt 4 root oinstall 4096 Dec 6 09:20 agent
drwxrwxrwt 7 root oinstall 4096 Jan 26 18:15 crsd
drwxrwxr-t 6 root oinstall 4096 Dec 6 09:24 ohasd
Ensure recursive permissions match a healthy node.
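Comparing permissions recursively by eye is error-prone; a sortable manifest of mode and ownership can be generated on both nodes and compared with diff. The sketch below uses GNU stat, with a temporary tree standing in for $GI_HOME/log.

```shell
# Sketch (GNU stat): emit a sortable mode/owner manifest of a log tree
# so two nodes can be compared with diff. A temp tree stands in for
# $GI_HOME/log here.
logdir=$(mktemp -d)
mkdir -p "$logdir/ohasd" "$logdir/crsd"

( cd "$logdir" && find . -exec stat -c '%A %U:%G %n' {} \; | sort )
rm -rf "$logdir"
# Run the same pipeline on a healthy node and diff the two manifests.
```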
Network Socket Files
Socket files reside in /tmp/.oracle, /var/tmp/.oracle, or /usr/tmp/.oracle. Incorrect permissions cause clsclisten: Permission denied errors.
Resolution:
- Stop GI as root.
- Remove socket files in the .oracle directory.
- Restart GI.
Example healthy socket directory:
drwxrwxrwt 2 root oinstall 4096 Feb 2 21:25 .oracle
srwxrwxrwx 1 grid oinstall 0 Feb 2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root root 0 Feb 2 18:00 srac1DBG_OHASD
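Permissions in the socket directory can also be audited mechanically: any entry that is not accessible to all users is suspect. The snippet below is an illustrative sketch using a temporary directory of regular files in place of /tmp/.oracle (real entries are sockets, typically mode srwxrwxrwx).

```shell
# Sketch: flag entries in a socket directory that are not world
# read/writable. A temp directory with regular files stands in for
# /tmp/.oracle (whose entries are sockets, typically srwxrwxrwx).
sockdir=$(mktemp -d)
touch "$sockdir/sDBG_OHASD" "$sockdir/sDBG_CSSD"
chmod 777 "$sockdir/sDBG_OHASD"
chmod 640 "$sockdir/sDBG_CSSD"     # simulated bad permissions

find "$sockdir" -type f ! -perm -0666 -exec basename {} \;
rm -rf "$sockdir"
```

Only the entry with restricted permissions is reported.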
Diagnostic Collection
If issues persist, collect diagnostics on all nodes as root:
$GI_HOME/bin/diagcollection.sh
Analyze the generated .gz archives for deeper insights.