Diagnosing Spring Microservice Freezes: Connection Pool Exhaustion Analysis
Intermittent unresponsiveness in Spring Cloud services, particularly when downstream Feign clients begin reporting timeout failures, often triggers immediate investigation of HTTP client configurations. Extending connectTimeout and readTimeout values in the Feign and Ribbon settings frequently proves insufficient when the root cause is database connection starvation:
feign:
  client:
    config:
      default:
        connectTimeout: 30000
        readTimeout: 30000

ribbon:
  ConnectTimeout: 30000
  ReadTimeout: 30000
When services become completely unresponsive without emitting error logs, and administrative interfaces such as Swagger UI fail to load, the issue typically indicates thread exhaustion rather than network latency. Capturing the JVM thread state reveals the actual blocking condition:
jstack -l <process-id> > stack-trace.log
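When exec access to the container is unavailable, an equivalent dump can be captured in-process through the standard ThreadMXBean API. A minimal sketch follows; the ThreadDumper class name is illustrative, not part of the original setup:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {

    /** Renders every live thread with its full stack, similar to jstack -l output. */
    public static String dumpAllThreads() {
        ThreadMXBean threadMx = ManagementFactory.getThreadMXBean();
        StringBuilder dump = new StringBuilder();
        // true/true also reports owned monitors and ownable synchronizers (the -l detail)
        for (ThreadInfo info : threadMx.dumpAllThreads(true, true)) {
            dump.append('"').append(info.getThreadName()).append('"')
                .append(" id=").append(info.getThreadId())
                .append(" state=").append(info.getThreadState())
                .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
            dump.append('\n');
        }
        return dump.toString();
    }
}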
Analysis of the dump exposes multiple HTTP worker threads suspended indefinitely in a WAITING state, parked at the connection acquisition layer:
"undertow-worker-15" #487 daemon prio=5 os_prio=0 tid=0x00007f1b8c12a800 nid=0x7a3f waiting on condition [0x00007f1b6d4fc000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.9/Native Method)
- parking to wait for <0x00000000e8c44710> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.9/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.9/AbstractQueuedSynchronizer.java:2081)
at com.alibaba.druid.pool.DruidDataSource.takeLast(DruidDataSource.java:2175)
at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1672)
at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1409)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1389)
...
at org.mybatis.spring.SqlSessionTemplate.selectList(SqlSessionTemplate.java:194)
This stack pattern indicates that all available connections in the Druid pool remain checked out, with threads queuing indefinitely for resources that never return. Druid version 1.1.22 contains a synchronization defect in which connections, under specific race conditions, fail to recycle properly: the pool never reclaims them even though the underlying physical connections have been closed.
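Pool exhaustion can also be confirmed at runtime, without another thread dump, by sampling the pool's own counters. The sketch below assumes the application exposes its DruidDataSource as a Spring bean and has scheduling enabled via @EnableScheduling; the component name, interval, and thresholds are illustrative:

import com.alibaba.druid.pool.DruidDataSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class PoolExhaustionProbe {

    private static final Logger log = LoggerFactory.getLogger(PoolExhaustionProbe.class);

    private final DruidDataSource dataSource;

    public PoolExhaustionProbe(DruidDataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Every 30s, log how close the pool is to exhaustion.
    @Scheduled(fixedDelay = 30_000)
    public void sample() {
        int active = dataSource.getActiveCount();      // connections currently checked out
        int idle = dataSource.getPoolingCount();       // connections sitting in the pool
        int waiting = dataSource.getWaitThreadCount(); // threads parked in takeLast()
        log.info("druid pool: active={}/{} idle={} waiting={}",
                active, dataSource.getMaxActive(), idle, waiting);
        if (waiting > 0 && active >= dataSource.getMaxActive()) {
            log.warn("druid pool exhausted: {} threads blocked on connection acquisition", waiting);
        }
    }
}

A sustained waiting count above zero while active equals max-active matches the thread-dump signature above.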
Remediation requires upgrading to Druid 1.2.5 or newer, coupled with explicit pool lifecycle management:
spring:
  datasource:
    type: com.alibaba.druid.pool.DruidDataSource
    driver-class-name: org.postgresql.Driver
    url: jdbc:postgresql://db-cluster:5432/transaction_db
    username: ${DB_USERNAME}
    password: ${DB_PASSWORD}
    druid:
      # Pool sizing
      initial-size: 5
      min-idle: 10
      max-active: 25
      max-wait: 60000
      # Validation: with test-while-idle, connections idle longer than the
      # eviction interval are validated on borrow without per-borrow overhead
      validation-query: SELECT 1
      test-while-idle: true
      test-on-borrow: false
      # Idle connection eviction
      time-between-eviction-runs-millis: 60000
      min-evictable-idle-time-millis: 300000
      max-evictable-idle-time-millis: 600000
      # Abandoned connection reclamation (timeout in seconds)
      remove-abandoned: true
      remove-abandoned-timeout: 180
The upgrade resolves the race condition in connection recycling, while abandoned-connection detection and aggressive eviction ensure that leaked connections are forcibly reclaimed and returned to the pool within predictable timeframes.
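Abandoned-connection reclamation is a safety net, not a substitute for correct checkout discipline in application code. The typical leak the settings above guard against is a connection acquired without try-with-resources, which strands it on every exception path. A minimal sketch of the correct pattern; the DAO class, table, and query are illustrative, not from the original codebase:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class TransactionLookupDao {

    private final DataSource dataSource;

    public TransactionLookupDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public boolean exists(long transactionId) throws SQLException {
        // try-with-resources returns the connection to the pool on every path,
        // including exceptions; without it, a thrown SQLException strands the
        // connection until remove-abandoned reclaims it 180 seconds later.
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT 1 FROM transactions WHERE id = ?")) {
            ps.setLong(1, transactionId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}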