Elasticsearch Core Operations and Architecture
Document Storage and Indexing
Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It functions as a document-oriented store, meaning the fundamental unit of data is a document rather than a row in a relational table. By default, every field within a document is indexed and can be searched.
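Storing data in Elasticsearch is called indexing. A document is placed within an index, categorized by a type, and assigned a unique identifier. For instance, the path /tech_corp/staff/101 breaks down into the index tech_corp, the type staff, and the document ID 101. Mapping types belong to older Elasticsearch releases: they were deprecated in 6.x and removed in later versions, where the same document would live at /tech_corp/_doc/101.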
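To make the path breakdown concrete, here is a minimal Python sketch that splits such a path into its three parts. The `DocumentPath` type and the `parse_document_path` helper are hypothetical names used only for illustration; they are not part of any Elasticsearch client.

```python
from typing import NamedTuple


class DocumentPath(NamedTuple):
    index: str
    doc_type: str
    doc_id: str


def parse_document_path(path: str) -> DocumentPath:
    """Split '/index/type/id' into its three components."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /index/type/id, got {path!r}")
    return DocumentPath(*parts)


location = parse_document_path("/tech_corp/staff/101")
print(location.index, location.doc_type, location.doc_id)
```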
CRUD Operations
Create:
PUT /tech_corp/staff/101
{
"given_name": "Alice",
"family_name": "Johnson",
"years_of_exp": 5,
"bio": "Enthusiastic about deep sea diving",
"hobbies": ["photography", "travel"]
}
Read:
GET /tech_corp/staff/101
The response includes metadata such as _index, _type, _id, _version, and the original document inside the _source field.
Delete:
DELETE /tech_corp/staff/101
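Update: Documents are replaced rather than modified in place. Issuing another PUT request to the same URL overwrites the existing document and increments its _version.
Check Existence: Use the HEAD method (HEAD /tech_corp/staff/101) to determine whether a document exists without fetching the body. The response status is 200 if it exists and 404 if it does not.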
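The operations above map directly onto HTTP verbs. The sketch below builds, but does not send, the corresponding requests using only Python's standard library. The base URL `http://localhost:9200` is an assumption (a local node on the default port), and `build_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:9200"  # assumption: local node, default HTTP port
DOC_URL = f"{BASE_URL}/tech_corp/staff/101"


def build_request(method: str, url: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (without sending) a request for one document operation."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


alice = {
    "given_name": "Alice",
    "family_name": "Johnson",
    "years_of_exp": 5,
    "bio": "Enthusiastic about deep sea diving",
    "hobbies": ["photography", "travel"],
}

operations = {
    "create or overwrite": build_request("PUT", DOC_URL, alice),
    "read": build_request("GET", DOC_URL),
    "check existence": build_request("HEAD", DOC_URL),
    "delete": build_request("DELETE", DOC_URL),
}

for name, request in operations.items():
    print(f"{name}: {request.get_method()} {request.full_url}")
```

Passing one of these objects to `urllib.request.urlopen` would send it to a running node. Note that `urlopen` raises `HTTPError` on a 404, which is how a missing document would surface for the HEAD request.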
Data Retrieval and Search Operations
To retrieve multiple documents, a search request is used.
GET /tech_corp/staff/_search
The response contains key metrics: `took` (time in milliseconds), `timed_out`, `_shards` (details of shard participation), and `hits`. Within `hits`, `total` indicates the overall count of matching documents, while the `hits` array holds the actual results (capped at 10 by default). Each result includes a `_score` representing relevance.
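As a rough illustration of how these fields fit together, the following Python sketch pulls the key metrics out of a parsed response. The `sample_response` dictionary is a hypothetical, hand-written example rather than real output, and `summarize_search_response` is an illustrative helper. It also tolerates newer releases, which report `hits.total` as an object with a `value` field instead of a plain integer.

```python
def summarize_search_response(response: dict) -> dict:
    """Extract the headline metrics from a parsed search response."""
    hits = response["hits"]
    total = hits["total"]
    # Older releases return an integer; newer ones return {"value": N, "relation": ...}.
    if isinstance(total, dict):
        total = total["value"]
    results = hits["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "failed_shards": response["_shards"]["failed"],
        "total_matches": total,
        "returned": len(results),
        "best_score": max((hit["_score"] for hit in results), default=None),
    }


# Hypothetical response body, for illustration only.
sample_response = {
    "took": 4,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_index": "tech_corp", "_type": "staff", "_id": "101",
             "_score": 1.0, "_source": {"family_name": "Johnson"}},
            {"_index": "tech_corp", "_type": "staff", "_id": "102",
             "_score": 0.5, "_source": {"family_name": "Lee"}},
        ],
    },
}

print(summarize_search_response(sample_response))
```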
### Query Types
**Expression Search (Match):**
```http
GET /tech_corp/staff/_search
{
"query": {
"match": {
"family_name": "Johnson"
}
}
}
```
Full-Text Search: Analyzes the search string and finds documents containing any of the individual terms, ranked by relevance.
GET /tech_corp/staff/_search
{
"query": {
"match": {
"bio": "deep sea"
}
}
}
Phrase Search: Requires the exact sequence of words to match.
GET /tech_corp/staff/_search
{
"query": {
"match_phrase": {
"bio": "deep sea diving"
}
}
}
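The difference between the `match` and `match_phrase` queries above can be sketched with a toy model in Python. This is a simplification under stated assumptions: the `analyze` function only lowercases and splits on whitespace, whereas a real analyzer does considerably more, and relevance scoring is not modelled at all. The second bio is a made-up document added for contrast.

```python
def analyze(text: str) -> list:
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def matches_any_term(field_text: str, query_text: str) -> bool:
    """Model of a match query: true if any query term occurs in the field."""
    field_terms = set(analyze(field_text))
    return any(term in field_terms for term in analyze(query_text))


def matches_phrase(field_text: str, query_text: str) -> bool:
    """Model of a match_phrase query: the terms must appear adjacent and in order."""
    field_terms = analyze(field_text)
    query_terms = analyze(query_text)
    n = len(query_terms)
    if n == 0:
        return False
    return any(
        field_terms[i:i + n] == query_terms
        for i in range(len(field_terms) - n + 1)
    )


bios = [
    "Enthusiastic about deep sea diving",        # from the example document
    "Enjoys diving into deep learning papers",   # hypothetical second document
]

for bio in bios:
    print(bio, "|", matches_any_term(bio, "deep sea"), "|", matches_phrase(bio, "deep sea diving"))
```

In this model both bios satisfy the term-level search for "deep sea" because each contains "deep", but only the first contains the exact phrase.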
Highlighting: Adds a highlight section to each hit in which the matching terms are wrapped in tags (<em> by default).
GET /tech_corp/staff/_search
{
"query": {
"match": {
"bio": "deep sea diving"
}
},
"highlight": {
"fields": {
"bio": {}
}
}
}
Search Scope and Pagination
- Empty Search: `GET /_search` searches across all indices.
- Multi-Index/Multi-Type: `GET /idx1,idx2/type1,type2/_search` limits the scope to the listed indices and types.
- Pagination: `GET /_search?size=5&from=10` skips the first 10 results and returns the next 5.
- Lite Search: Passes query parameters in the URL, for example `GET /tech_corp/staff/_search?q=family_name:Johnson`.
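The `from` and `size` values follow directly from a page number and a page size. Here is a small Python sketch of that arithmetic; `page_params` is a hypothetical helper, and 1-based page numbering is an assumption of this example.

```python
from urllib.parse import urlencode


def page_params(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into size/from search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must both be at least 1")
    return {"size": page_size, "from": (page - 1) * page_size}


# Page 3 with 5 results per page skips the first 10 results.
params = page_params(page=3, page_size=5)
print("/_search?" + urlencode(params))
```

With `page=3` and `page_size=5` this reproduces the `size=5&from=10` request shown in the list above.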
Advanced Querying Techniques
Structured search deals with data that has an inherent format, such as dates or numbers, often using exact matches. Filters are preferred for these operations because they bypass the scoring mechanism and are cacheable.
Term Query with Constant Score
GET /commerce/items/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"cost": 50
}
}
}
}
}
Internal Filter Execution:
- Identify Matching Documents: The `term` query locates the value in the inverted index and retrieves the matching document IDs.
- Build Bitsets: A bitset (an array of 1s and 0s) is created to represent which documents contain the term. Elasticsearch uses "roaring bitmaps" for efficient encoding.
- Iterate Bitsets: When multiple filters are present, the system iterates through the bitsets to find the intersection. Sparse bitsets are processed first to quickly eliminate non-matching documents.
- Increment Usage Counters: Elasticsearch tracks filter usage. A filter is cached in memory if it has been used more than a few times within the last 256 queries. Caching is skipped for small segments (fewer than 10,000 documents or less than 3% of the total index size) to save resources.
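The bitset steps above can be modelled in a few lines of Python. This is a deliberately simplified sketch: it uses plain integers as bitsets rather than roaring bitmaps, it ignores caching entirely, and the document IDs and the second filter are hypothetical.

```python
def build_bitset(matching_doc_ids: list) -> int:
    """Set bit i for every matching document ID i."""
    bits = 0
    for doc_id in matching_doc_ids:
        bits |= 1 << doc_id
    return bits


def intersect(bitsets: list) -> int:
    """AND the bitsets together, starting with the sparsest one."""
    if not bitsets:
        raise ValueError("at least one bitset is required")
    ordered = sorted(bitsets, key=lambda b: bin(b).count("1"))
    result = ordered[0]
    for bitset in ordered[1:]:
        result &= bitset
        if result == 0:  # nothing can match any more, so stop early
            break
    return result


def doc_ids(bitset: int) -> list:
    """List the document IDs whose bit is set."""
    return [i for i in range(bitset.bit_length()) if (bitset >> i) & 1]


cost_is_50 = build_bitset([1, 4, 6])           # hypothetical matches for the term filter
in_stock = build_bitset([0, 1, 2, 4, 5, 7])    # hypothetical second filter

print(doc_ids(intersect([in_stock, cost_is_50])))
```

Only documents 1 and 4 have their bit set in both bitsets, so they are the ones that survive the intersection.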
Boolean Filters
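Combines multiple conditions using must (AND), should (OR), and must_not (NOT). Note that when a must clause is present, should clauses are optional by default and only influence the score. Set minimum_should_match to 1 to require that at least one of them matches.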
GET /commerce/items/_search
{
"query": {
"bool": {
"must": [
{ "term": { "category": "electronics" } }
],
"should": [
{ "term": { "brand": "BrandA" } },
{ "term": { "brand": "BrandB" } }
],
"must_not": [
{ "term": { "discontinued": true } }
]
}
}
}
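To show how the three clause types combine, here is a toy in-memory evaluator in Python. It is a sketch under stated assumptions, not how Elasticsearch executes queries: it supports only `term` clauses, ignores scoring and `filter` clauses, and the sample items are hypothetical. It does model one detail that is easy to miss: `should` clauses become optional once a `must` clause is present, unless `minimum_should_match` says otherwise.

```python
def term_matches(doc: dict, clause: dict) -> bool:
    """Evaluate a single {"term": {field: value}} clause against a document."""
    field, value = next(iter(clause["term"].items()))
    return doc.get(field) == value


def bool_matches(doc: dict, bool_query: dict) -> bool:
    """Evaluate must / should / must_not for one document (term clauses only)."""
    must = bool_query.get("must", [])
    should = bool_query.get("should", [])
    must_not = bool_query.get("must_not", [])
    # With a must clause present, should clauses are optional by default.
    default_minimum = 0 if must else (1 if should else 0)
    minimum = bool_query.get("minimum_should_match", default_minimum)

    if not all(term_matches(doc, clause) for clause in must):
        return False
    if any(term_matches(doc, clause) for clause in must_not):
        return False
    return sum(term_matches(doc, clause) for clause in should) >= minimum


bool_query = {
    "must": [{"term": {"category": "electronics"}}],
    "should": [{"term": {"brand": "BrandA"}}, {"term": {"brand": "BrandB"}}],
    "must_not": [{"term": {"discontinued": True}}],
}

# Hypothetical catalogue entries.
items = [
    {"name": "tv", "category": "electronics", "brand": "BrandA", "discontinued": False},
    {"name": "radio", "category": "electronics", "brand": "BrandC", "discontinued": False},
    {"name": "camera", "category": "electronics", "brand": "BrandB", "discontinued": True},
    {"name": "sofa", "category": "furniture", "brand": "BrandA", "discontinued": False},
]

print([item["name"] for item in items if bool_matches(item, bool_query)])
strict_query = dict(bool_query, minimum_should_match=1)
print([item["name"] for item in items if bool_matches(item, strict_query)])
```

In this model the radio matches the original query even though its brand is neither BrandA nor BrandB, and it drops out once `minimum_should_match` is set to 1.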
Data Aggregations
Aggregations provide analytics over your data, categorized into Buckets and Metrics.
- Buckets: Collections of documents meeting a criterion (similar to `GROUP BY`).
- Metrics: Statistical calculations over documents in a bucket (like `COUNT`, `SUM`).
Basic Aggregation
GET /dealership/sales/_search
{
"size": 0,
"aggs": {
"top_paints": {
"terms": {
"field": "paint"
}
}
}
}
Setting `size` to 0 excludes document hits from the response, improving performance. The `aggs` block defines the aggregation name (`top_paints`) and the bucket logic (`terms` on the `paint` field). The response returns an array of buckets, each with a `key` (the paint color) and `doc_count`.
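Conceptually, a `terms` aggregation groups documents by a field value and counts each group. The Python sketch below imitates that over an in-memory list. The sales records are hypothetical, and `terms_aggregation` is an illustrative stand-in that says nothing about how Elasticsearch computes buckets across shards.

```python
from collections import Counter


def terms_aggregation(docs: list, field: str, size: int = 10) -> list:
    """Group documents by a field and return the most frequent values as buckets."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common(size)]


# Hypothetical sales documents.
sales = [
    {"paint": "red", "price": 10000},
    {"paint": "red", "price": 20000},
    {"paint": "blue", "price": 15000},
    {"paint": "green", "price": 30000},
    {"paint": "red", "price": 80000},
    {"paint": "blue", "price": 25000},
]

for bucket in terms_aggregation(sales, "paint"):
    print(bucket["key"], bucket["doc_count"])
```

Each returned dictionary mirrors the shape of a bucket in the response: a `key` and a `doc_count`, ordered from the most to the least frequent value.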
### Aggregation with Query
```http
GET /dealership/sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"paint_distribution": {
"terms": {
"field": "paint"
}
}
}
}
```
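Aggregations run over the documents matched by the query, so match_all here aggregates every document in the index. Filters can also be applied alongside aggregations to restrict the dataset before the aggregation logic is executed.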
Distributed Architecture
Elasticsearch is designed for horizontal scalability, meaning it expands by adding more nodes rather than upgrading a single machine. It abstracts distributed system complexities through automated operations:
- Documents are distributed across shards (containers for data).
- Shards are balanced across cluster nodes to distribute indexing and search load.
- Each shard can have replicas for redundancy and failover.
- Requests are routed to the appropriate nodes automatically.
- New nodes are seamlessly integrated, triggering automatic shard reallocation.
A running instance is a node. Nodes sharing the same cluster.name form a cluster. One node is elected as the master, responsible for cluster-wide management (like adding/removing indices). The master does not handle document-level operations, preventing bottlenecks. Any node can route requests to the correct data node.
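Routing a document to a shard generally comes down to hashing a routing value (the document ID by default) and taking the remainder modulo the number of primary shards. The Python sketch below shows that general shape only. It is not Elasticsearch's actual implementation: `zlib.crc32` is used as a stand-in hash, and `route_to_shard` is a hypothetical helper.

```python
import zlib


def route_to_shard(routing_key: str, number_of_primary_shards: int) -> int:
    """Pick a shard as hash(routing_key) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash chosen for this sketch only.
    """
    if number_of_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


for doc_id in ["101", "102", "103", "104"]:
    print(f"document {doc_id} -> shard {route_to_shard(doc_id, 3)}")
```

Because the result depends on the primary shard count, the same ID could land on a different shard if that count changed, which is one reason the number of primary shards is chosen when the index is created.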
Cluster Health
GET /_cluster/health
The status field indicates cluster state:
- Green: All primary and replica shards are active.
- Yellow: All primary shards are active, but some replicas are not.
- Red: One or more primary shards are inactive.
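The three status rules can be written as a short decision function. The Python sketch below is a simplified model that assumes each shard copy is either `STARTED` or `UNASSIGNED`; `cluster_status` and `make_shards` are hypothetical helpers, not part of any client library.

```python
def make_shards(primary_states: list, replica_states: list) -> list:
    """Build a flat list of shard copies from their states."""
    shards = [{"primary": True, "state": state} for state in primary_states]
    shards += [{"primary": False, "state": state} for state in replica_states]
    return shards


def cluster_status(shards: list) -> str:
    """Derive green / yellow / red from the state of every shard copy."""
    primaries_active = all(s["state"] == "STARTED" for s in shards if s["primary"])
    replicas_active = all(s["state"] == "STARTED" for s in shards if not s["primary"])
    if not primaries_active:
        return "red"
    if not replicas_active:
        return "yellow"
    return "green"


# A single-node cluster: primaries are active, replicas have nowhere to go.
single_node = make_shards(["STARTED"] * 3, ["UNASSIGNED"] * 3)
print(cluster_status(single_node))
```

A single-node cluster is a common way to end up in the yellow state, since a replica is not placed on the same node as its primary.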
Index Configuration
When creating an index, you can specify the number of shards and replicas:
PUT /weblogs
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
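The two settings multiply: every primary shard gets the configured number of replica copies. The Python sketch below works through that arithmetic for the `weblogs` index above; `shard_layout` is a hypothetical helper written for this example.

```python
def shard_layout(number_of_shards: int, number_of_replicas: int) -> dict:
    """Compute how many shard copies an index needs in total."""
    return {
        "primary_shards": number_of_shards,
        "replica_shards": number_of_shards * number_of_replicas,
        "total_shards": number_of_shards * (1 + number_of_replicas),
        # A replica is never placed on the same node as its primary.
        "min_nodes_for_green": 1 + number_of_replicas,
    }


print(shard_layout(number_of_shards=3, number_of_replicas=1))
```

For 3 primary shards with 1 replica each, that gives 6 shard copies in total and at least 2 nodes before every replica can be assigned.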