Debugging Unexpected LLM Invocations in Dify's Knowledge Retrieval Node
Context
A Dify instance was deployed locally on a GPU server, integrated with Xinference hosting the THUDM/glm-4-9b-chat model. A RAG workflow was built using Dify’s default task flow template and a local knowledge base.
During chat execution, the knowledge retrieval node failed with an error: Model 'gpt-3.5-turbo' not found. Notably:
- Knowledge retrieval should not require an LLM at all — it’s expected to be vector/keyword/BM25-based.
- No gpt-3.5-turbo model was registered or available in the environment.
This discrepancy prompted a source-level investigation to trace where and why this model was being resolved.
Root Cause Analysis
The retrieval logic resides in api/core/rag/datasource/retrieval_service.py, supporting four modes: keyword, vector, full-text (BM25), and hybrid. None of these perform LLM inference directly.
Keyword Retrieval
Implemented via Jieba in api/core/rag/datasource/keyword/jieba/jieba.py:
def search(self, query: str, **kwargs: Any) -> List[Document]:
    keyword_table = self._get_dataset_keyword_table()
    top_k = kwargs.get('top_k', 4)
    indices = self._retrieve_ids_by_query(keyword_table, query, top_k)

    documents = []
    for idx in indices:
        segment = db.session.query(DocumentSegment).filter(
            DocumentSegment.dataset_id == self.dataset.id,
            DocumentSegment.index_node_id == idx
        ).first()
        if segment:
            documents.append(Document(
                page_content=segment.content,
                metadata={
                    "doc_id": idx,
                    "doc_hash": segment.index_node_hash,
                    "document_id": segment.document_id,
                    "dataset_id": segment.dataset_id
                }
            ))
    return documents
Purely deterministic keyword matching — no LLM involvement.
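For reference, _retrieve_ids_by_query amounts to a keyword-table intersection: Jieba extracts keywords from the query, and segments are ranked by how many of those keywords appear in their precomputed keyword entries. A minimal, self-contained sketch of that idea (the function name, table layout, and ranking here are illustrative, not the exact Dify internals):

from collections import Counter

# Illustrative stand-in for the keyword-table lookup: keyword_table maps a
# keyword to the set of segment index_node_ids that contain it.
def retrieve_ids_by_query(keyword_table: dict, query_keywords: list, top_k: int) -> list:
    hits = Counter()
    for kw in query_keywords:
        for node_id in keyword_table.get(kw, set()):
            hits[node_id] += 1  # one point per matched keyword
    return [node_id for node_id, _ in hits.most_common(top_k)]

# Toy example: two query keywords, three indexed segments
table = {"retrieval": {"seg-1", "seg-2"}, "milvus": {"seg-2", "seg-3"}}
print(retrieve_ids_by_query(table, ["retrieval", "milvus"], top_k=2))  # ['seg-2', ...]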
Vector Retrieval
Delegates to vector DB clients (e.g., Milvus):
def search_by_vector(self, query_vector: List[float], **kwargs: Any) -> List[Document]:
    results = self._client.search(
        collection_name=self._collection_name,
        data=[query_vector],
        limit=kwargs.get('top_k', 4),
        output_fields=[Field.CONTENT_KEY.value, Field.METADATA_KEY.value]
    )

    docs = []
    for r in results[0]:
        meta = r['entity'].get(Field.METADATA_KEY.value, {})
        meta['score'] = r['distance']
        if r['distance'] > kwargs.get('score_threshold', 0.0):
            docs.append(Document(
                page_content=r['entity'].get(Field.CONTENT_KEY.value, ""),
                metadata=meta
            ))
    return docs
No language model used — only embedding similarity scoring.
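The only model in this path is the embedding model that turns the query string into query_vector before search_by_vector is invoked. A minimal sketch of that calling flow, where embedding_model and vector_store are illustrative stand-ins for Dify's embedding-model instance and vector store wrapper:

# Illustrative calling flow for the vector path; names are stand-ins, not
# Dify's exact objects. The embedding model produces the query vector and
# no LLM generates anything.
def retrieve_by_vector(vector_store, embedding_model, query: str, top_k: int = 4):
    query_vector = embedding_model.embed_query(query)  # embedding call only
    return vector_store.search_by_vector(
        query_vector,
        top_k=top_k,
        score_threshold=0.0,  # keep all hits above the default threshold
    )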
Full-Text (BM25) Retrieval
For Qdrant, this is implemented as a filtered scroll with a MatchText condition:
def search_by_full_text(self, query: str, **kwargs: Any) -> List[Document]:
    from qdrant_client.http import models

    filter_expr = models.Filter(must=[
        models.FieldCondition(key="group_id", match=models.MatchValue(value=self._group_id)),
        models.FieldCondition(key="page_content", match=models.MatchText(text=query))
    ])
    response = self._client.scroll(
        collection_name=self._collection_name,
        scroll_filter=filter_expr,
        limit=kwargs.get('top_k', 2),
        with_payload=True,
        with_vectors=True
    )
    return [
        self._document_from_scored_point(r, Field.CONTENT_KEY.value, Field.METADATA_KEY.value)
        for r in response[0] if r
    ]
Again, purely database-native — no LLM.
Multi-Dataset Routing Logic
The real LLM dependency lies not in retrieval per se, but in routing across multiple datasets. In api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py, two strategies are supported:
- Single-dataset routing (N-to-1): Uses an LLM to classify user intent and select the single most relevant dataset based on its description (a sketch of this routing step follows the list).
- Multi-dataset retrieval (multi-retrieval): Queries all configured datasets in parallel and re-ranks results — zero LLM involvement.
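To make the LLM dependency concrete, the N-to-1 router essentially shows the model the dataset descriptions and asks it to pick one. The following is a hedged sketch of that step; the prompt wording and helper name are illustrative, not Dify's actual code:

# Illustrative N-to-1 routing prompt: the LLM resolved by _fetch_model_config
# (below) is asked to choose the single most relevant dataset by description.
def build_router_prompt(query: str, datasets: dict) -> str:
    lines = [f"{i}. {name}: {desc}" for i, (name, desc) in enumerate(datasets.items(), start=1)]
    return (
        "Pick the knowledge base most relevant to the user question. "
        "Answer with its number only.\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {query}"
    )

datasets = {
    "gpu-ops": "GPU server operations and driver troubleshooting notes",
    "hr-policies": "Internal HR policies and leave procedures",
}
prompt = build_router_prompt("How do I recover from a CUDA OOM?", datasets)
# The configured LLM answers with a dataset number; only that dataset is then queried.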
The issue arises exclusively in the N-to-1 path. Examining the model resolution logic:
def _fetch_model_config(self, node_data: KnowledgeRetrievalNodeData):
    model_name = node_data.single_retrieval_config.model.name
    provider_name = node_data.single_retrieval_config.model.provider

    model_manager = ModelManager()
    return model_manager.get_model_instance(
        tenant_id=self.tenant_id,
        model_type=ModelType.LLM,
        provider=provider_name,
        model=model_name
    )
This confirms that the LLM is sourced strictly from the node’s configuration — specifically node_data.single_retrieval_config.model.
Inspecting the frontend-provided node config revealed:
"single_retrieval_config": {
"model": {
"name": "gpt-3.5-turbo",
"provider": "openai"
}
}
That value wasn’t derived dynamically — it was baked into the default workflow template.
Further inspection confirmed that the UI’s N-to-1 configuration panel explicitly defaults to OpenAI’s gpt-3.5-turbo, even when no OpenAI provider is configured. The /api/workspaces/current/default-model endpoint correctly returns glm-4-9b-chat, but the template ignores it and hardcodes gpt-3.5-turbo for the router.
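As a sanity check, the workspace default can be queried directly. A minimal sketch against the endpoint named above; the host, auth token, and the model_type query parameter are assumptions about this local deployment, not confirmed Dify API details:

import requests

# Hedged check of the workspace default LLM for a local Dify deployment.
resp = requests.get(
    "http://localhost/api/workspaces/current/default-model",
    headers={"Authorization": "Bearer <console-session-token>"},  # placeholder token
    params={"model_type": "llm"},  # assumed query parameter
)
print(resp.json())  # expected to name glm-4-9b-chat for this deployment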
Resolution Path
The fix is manual: edit the knowledge retrieval node and either switch to multi-retrieval mode or explicitly assign glm-4-9b-chat under the N-to-1 model selector. However, the root friction stems from template design — the out-of-the-box workflow assumes cloud-hosted OpenAI models, making local deployments fragile without explicit configuration overrides.
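After the manual fix, the node configuration should point the router at the local model instead. A sketch of the corrected snippet; the exact provider string depends on how Xinference is registered in the workspace, so "xinference" below is an assumption:

"single_retrieval_config": {
    "model": {
        "name": "glm-4-9b-chat",
        "provider": "xinference"
    }
}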