Elasticsearch Core Operations and Architecture

Document Storage and Indexing

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It functions as a document-oriented store: the fundamental unit of data is a JSON document rather than a set of relational rows and columns. By default, every field in a document is indexed and becomes searchable in near real time.

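Storing data in Elasticsearch is called indexing. A document is placed within an index, categorized by a type, and assigned a unique identifier. For instance, the path /tech_corp/staff/101 breaks down into the index tech_corp, the type staff, and the document ID 101. Note that mapping types were deprecated in Elasticsearch 6.x and removed in later releases, where the same document would be addressed as /tech_corp/_doc/101; the examples here use the older typed paths.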
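As a minimal sketch of that breakdown, the following Python helper splits a typed document path into its three parts. The names `DocumentPath` and `parse_document_path` are illustrative only and are not part of any Elasticsearch client.

```python
from typing import NamedTuple


class DocumentPath(NamedTuple):
    index: str
    doc_type: str
    doc_id: str


def parse_document_path(path: str) -> DocumentPath:
    """Split '/index/type/id' into its three components."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected /index/type/id, got {path!r}")
    return DocumentPath(*parts)


print(parse_document_path("/tech_corp/staff/101"))
# DocumentPath(index='tech_corp', doc_type='staff', doc_id='101')
```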

CRUD Operations

Create:

PUT /tech_corp/staff/101
{
  "given_name": "Alice",
  "family_name": "Johnson",
  "years_of_exp": 5,
  "bio": "Enthusiastic about deep sea diving",
  "hobbies": ["photography", "travel"]
}

Read:

GET /tech_corp/staff/101

The response includes metadata such as _index, _type, _id, _version, and the original document inside the _source field.

Delete:

DELETE /tech_corp/staff/101

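Update: Issue another PUT request to the same URL. This replaces the entire stored document and increments its _version.

Check Existence: Use the HEAD method to determine whether a document exists without fetching the body. The response status is 200 if the document exists and 404 if it does not.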
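The four operations differ only in HTTP method and body, which the following Python sketch makes explicit using the standard library. It only builds the requests and does not send them. `BASE_URL` and the helper name `build_request` are assumptions for illustration; adjust the address and add authentication for a real cluster.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:9200"  # assumption: a local node without security enabled


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for a document endpoint."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


path = "/tech_corp/staff/101"
steps = {
    "create or replace": build_request("PUT", path, {"given_name": "Alice", "years_of_exp": 5}),
    "read": build_request("GET", path),
    "exists": build_request("HEAD", path),
    "delete": build_request("DELETE", path),
}
for step, req in steps.items():
    print(f"{step}: {req.get_method()} {req.full_url}")
```

To execute one of these, pass it to `urllib.request.urlopen`. A HEAD or GET for a missing document surfaces as a `urllib.error.HTTPError` with code 404.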

Data Retrieval and Search Operations

To retrieve multiple documents, a search request is used.

GET /tech_corp/staff/_search


The response contains key metrics: `took` (time in milliseconds), `timed_out`, `_shards` (details of shard participation), and `hits`. Within `hits`, `total` indicates the overall count of matching documents, while the `hits` array holds the actual results (capped at 10 by default). Each result includes a `_score` representing relevance.
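To make the response layout concrete, here is a small Python sketch that pulls those fields out of a parsed response. The `sample_response` dictionary is a hypothetical, trimmed response written for illustration, and `summarize_search_response` is an illustrative helper name. The sketch accepts `hits.total` either as a plain integer (older releases) or as an object with a `value` key (7.x and later).

```python
def summarize_search_response(response: dict) -> dict:
    """Extract the headline fields from a parsed _search response."""
    hits = response["hits"]
    total = hits["total"]
    # Older releases return an integer; newer ones return {"value": ..., "relation": ...}.
    if isinstance(total, dict):
        total = total["value"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": total,
        "returned": len(hits["hits"]),
        "top_score": max((hit["_score"] for hit in hits["hits"]), default=None),
    }


sample_response = {  # hypothetical response, trimmed for illustration
    "took": 4,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "tech_corp",
                "_type": "staff",
                "_id": "101",
                "_score": 1.0,
                "_source": {"given_name": "Alice", "family_name": "Johnson"},
            }
        ],
    },
}
print(summarize_search_response(sample_response))
```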

### Query Types

**Expression Search (Match):**
```http
GET /tech_corp/staff/_search
{
  "query": {
    "match": {
      "family_name": "Johnson"
    }
  }
}
```

Full-Text Search: Analyzes the search string and finds documents containing the individual terms.

GET /tech_corp/staff/_search
{
  "query": {
    "match": {
      "bio": "deep sea"
    }
  }
}

Phrase Search: Requires the exact sequence of words to match.

GET /tech_corp/staff/_search
{
  "query": {
    "match_phrase": {
      "bio": "deep sea diving"
    }
  }
}
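The difference between matching individual terms and matching a phrase can be sketched in a few lines of Python. This is a toy model under stated assumptions: `analyze` only lowercases and splits on whitespace, which is far simpler than a real Lucene analyzer, and the functions ignore scoring entirely. All names are illustrative.

```python
from typing import List


def analyze(text: str) -> List[str]:
    """Toy analyzer (assumption): lowercase, then split on whitespace."""
    return text.lower().split()


def matches_any_term(field_text: str, query: str) -> bool:
    """Like a match query with the default OR operator: any shared term is a hit."""
    return bool(set(analyze(field_text)) & set(analyze(query)))


def matches_phrase(field_text: str, query: str) -> bool:
    """Like match_phrase: the query terms must appear adjacent and in order."""
    tokens, phrase = analyze(field_text), analyze(query)
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


bio = "Enthusiastic about deep sea diving"
other = "Loves the sea and deep caves"  # hypothetical second document

print(matches_any_term(other, "deep sea"))    # True: both terms occur somewhere
print(matches_phrase(other, "deep sea"))      # False: the terms are not adjacent in that order
print(matches_phrase(bio, "deep sea diving")) # True
```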

Highlighting: Wraps matching terms in tags within the response.

GET /tech_corp/staff/_search
{
  "query": {
    "match": {
      "bio": "deep sea diving"
    }
  },
  "highlight": {
    "fields": {
      "bio": {}
    }
  }
}
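As a rough illustration of what the highlighted fragment looks like, the Python sketch below wraps each matching word in tags. It is a simplification, assuming whitespace tokenization and no fragmenting of long fields; the `<em>` default mirrors the tag Elasticsearch uses when no custom `pre_tags` or `post_tags` are given. The function name is illustrative.

```python
def highlight(text: str, query: str, pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    """Wrap every word of `text` that also appears in `query` (case-insensitive)."""
    terms = set(query.lower().split())
    return " ".join(
        f"{pre_tag}{word}{post_tag}" if word.lower() in terms else word
        for word in text.split()
    )


print(highlight("Enthusiastic about deep sea diving", "deep sea diving"))
# Enthusiastic about <em>deep</em> <em>sea</em> <em>diving</em>
```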

Search Scope and Pagination

  • Empty Search: GET /_search searches across all indices.
  • Multi-Index/Multi-Type: GET /idx1,idx2/type1,type2/_search limits the scope.
  • Pagination: GET /_search?size=5&from=10 skips the first 10 results and returns 5.
  • Lite Search: Passes query parameters in the URL: GET /tech_corp/staff/_search?q=family_name:Johnson
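The `from` and `size` values for a given page follow from simple arithmetic, sketched here in Python. The helper name `page_params` and the 1-based page numbering are assumptions made for the example.

```python
from urllib.parse import urlencode


def page_params(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into Elasticsearch from/size parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


params = page_params(page=3, page_size=5)
print(params)                           # {'from': 10, 'size': 5}
print("/_search?" + urlencode(params))  # /_search?from=10&size=5
```

Deep pages are expensive because every shard has to collect `from + size` candidates before the results are merged, so large offsets are better avoided where possible.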

Advanced Querying Techniques

Structured search deals with data that has an inherent format, such as dates or numbers, often using exact matches. Filters are preferred for these operations because they bypass the scoring mechanism and are cacheable.

Term Query with Constant Score

GET /commerce/items/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "cost": 50
        }
      }
    }
  }
}

Internal Filter Execution:

  1. Identify Matching Documents: The term query locates the value in the inverted index and retrieves the matching document IDs.
  2. Build Bitsets: A bitset (an array of 1s and 0s) is created to represent which documents contain the term. Elasticsearch uses "roaring bitmaps" for efficient encoding.
  3. Iterate Bitsets: When multiple filters are present, the system iterates through the bitsets to find the intersection. Sparse bitsets are processed first to quickly eliminate non-matching documents.
  4. Increment Usage Counters: Elasticsearch tracks how often each filter is used, per index. A filter that has been used more than a few times within the last 256 queries is cached in memory. Caching is skipped for small segments (fewer than 10,000 documents, or less than 3% of the total index size), since these are fast to search and are soon merged away.
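Steps 2 and 3 can be sketched in Python by using plain integers as bitsets, where bit i is set when document i matches. This is only an illustration of the idea: it does not use roaring bitmaps, and the document IDs and function names are made up for the example.

```python
from typing import Iterable, List


def bitset_from_ids(doc_ids: Iterable[int]) -> int:
    """Build a bitset in which bit i is 1 when document i matches the filter."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def intersect(bitsets: List[int]) -> int:
    """AND the bitsets together, starting with the sparsest one."""
    if not bitsets:
        raise ValueError("at least one bitset is required")
    ordered = sorted(bitsets, key=lambda b: bin(b).count("1"))
    result = ordered[0]
    for bits in ordered[1:]:
        result &= bits
        if result == 0:  # nothing can match any more, stop early
            break
    return result


def ids_from_bitset(bits: int) -> List[int]:
    """List the document IDs whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


cost_is_50 = bitset_from_ids([1, 4, 7])               # sparse filter
is_electronics = bitset_from_ids([0, 1, 2, 4, 5, 8])  # denser filter
print(ids_from_bitset(intersect([is_electronics, cost_is_50])))  # [1, 4]
```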

Boolean Filters

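Combines multiple conditions using must (AND), should (OR), and must_not (NOT). When a must clause is present, should clauses are optional by default and only influence the score, so the example sets minimum_should_match to require at least one of them.

GET /commerce/items/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "category": "electronics" } }
      ],
      "should": [
        { "term": { "brand": "BrandA" } },
        { "term": { "brand": "BrandB" } }
      ],
      "minimum_should_match": 1,
      "must_not": [
        { "term": { "discontinued": true } }
      ]
    }
  }
}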
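The clause semantics can be modeled on in-memory documents with a short Python sketch. It ignores scoring and treats each clause as a yes/no predicate; the `items` list is hypothetical data, and `term` and `bool_filter` are illustrative names. The `minimum_should_match` parameter mirrors the Elasticsearch setting of the same name: with the default of 0 used here, should clauses do not exclude anything.

```python
from typing import Callable, Iterable, List

Clause = Callable[[dict], bool]


def term(field: str, value) -> Clause:
    """Exact-match predicate on a single field."""
    return lambda doc: doc.get(field) == value


def bool_filter(docs: Iterable[dict], must: List[Clause] = (), should: List[Clause] = (),
                must_not: List[Clause] = (), minimum_should_match: int = 0) -> List[dict]:
    """Keep documents that pass every must, no must_not, and enough should clauses."""
    kept = []
    for doc in docs:
        if not all(clause(doc) for clause in must):
            continue
        if any(clause(doc) for clause in must_not):
            continue
        if sum(1 for clause in should if clause(doc)) < minimum_should_match:
            continue
        kept.append(doc)
    return kept


items = [  # hypothetical catalogue
    {"id": 1, "category": "electronics", "brand": "BrandA", "discontinued": False},
    {"id": 2, "category": "electronics", "brand": "BrandC", "discontinued": False},
    {"id": 3, "category": "electronics", "brand": "BrandB", "discontinued": True},
    {"id": 4, "category": "furniture", "brand": "BrandA", "discontinued": False},
]
clauses = dict(
    must=[term("category", "electronics")],
    should=[term("brand", "BrandA"), term("brand", "BrandB")],
    must_not=[term("discontinued", True)],
)
print([d["id"] for d in bool_filter(items, **clauses)])                          # [1, 2]
print([d["id"] for d in bool_filter(items, minimum_should_match=1, **clauses)])  # [1]
```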

Data Aggregations

Aggregations provide analytics over your data, categorized into Buckets and Metrics.

  • Buckets: Collections of documents meeting a criterion (similar to GROUP BY).
  • Metrics: Statistical calculations over documents in a bucket (like COUNT, SUM).

Basic Aggregation

GET /dealership/sales/_search
{
  "size": 0,
  "aggs": {
    "top_paints": {
      "terms": {
        "field": "paint"
      }
    }
  }
}

Setting `size` to 0 excludes document hits from the response, so only the aggregation results are returned, which improves performance. The `aggs` block defines the aggregation name (`top_paints`) and the bucket logic (`terms` on the `paint` field). The response returns an array of buckets, each with a `key` (the paint color) and `doc_count`.
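What a `terms` bucket computes can be approximated locally with a counter, as in this Python sketch. The `sales` list is made-up data and `terms_aggregation` is an illustrative name; the default of 10 buckets ordered by descending `doc_count` follows the default behavior of the `terms` aggregation.

```python
from collections import Counter
from typing import Iterable, List


def terms_aggregation(docs: Iterable[dict], field: str, size: int = 10) -> List[dict]:
    """Group documents by a field value and count them, largest bucket first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common(size)]


sales = [  # hypothetical documents
    {"paint": "red"}, {"paint": "blue"}, {"paint": "red"},
    {"paint": "green"}, {"paint": "red"}, {"paint": "blue"},
]
for bucket in terms_aggregation(sales, "paint"):
    print(bucket)
# {'key': 'red', 'doc_count': 3}
# {'key': 'blue', 'doc_count': 2}
# {'key': 'green', 'doc_count': 1}
```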

### Aggregation with Query

```http
GET /dealership/sales/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "paint_distribution": {
      "terms": {
        "field": "paint"
      }
    }
  }
}
```

Filters can also be applied alongside aggregations to restrict the dataset before the aggregation logic is executed.

Distributed Architecture

Elasticsearch is designed for horizontal scalability, meaning it expands by adding more nodes rather than upgrading a single machine. It abstracts distributed system complexities through automated operations:

  • Documents are distributed across shards (containers for data).
  • Shards are balanced across cluster nodes to distribute indexing and search load.
  • Each shard can have replicas for redundancy and failover.
  • Requests are routed to the appropriate nodes automatically.
  • New nodes are seamlessly integrated, triggering automatic shard reallocation.

A running instance is a node. Nodes sharing the same cluster.name form a cluster. One node is elected as the master, responsible for cluster-wide management (like adding/removing indices). The master does not handle document-level operations, preventing bottlenecks. Any node can route requests to the correct data node.
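Routing a document to a shard comes down to hashing a routing value (the document ID by default) and taking the remainder modulo the number of primary shards. The Python sketch below illustrates that formula. It uses `zlib.crc32` purely as a stand-in hash, which is an assumption of this example; Elasticsearch uses its own hash function, so the shard numbers printed here will not match a real cluster.

```python
import zlib
from collections import Counter


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards, with crc32 as a stand-in hash."""
    if number_of_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


# The same ID always lands on the same shard, so any node can compute where to look.
print(pick_shard("101", 3) == pick_shard("101", 3))  # True

# Spread of 1,000 hypothetical document IDs over 3 primary shards.
print(Counter(pick_shard(str(doc_id), 3) for doc_id in range(1000)))
```

Because the result depends on the number of primary shards, changing that number would send existing IDs to different shards, which is why it is fixed when the index is created.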

Cluster Health

GET /_cluster/health

The status field indicates cluster state:

  • Green: All primary and replica shards are active.
  • Yellow: All primary shards are active, but some replicas are not.
  • Red: One or more primary shards are inactive.
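The three rules translate directly into a small decision function. In this Python sketch each shard copy is modeled as an `(is_primary, is_active)` pair, which is a simplification chosen for the example rather than the shape of the actual health API response.

```python
from typing import Iterable, Tuple


def cluster_status(shards: Iterable[Tuple[bool, bool]]) -> str:
    """Derive green/yellow/red from (is_primary, is_active) pairs."""
    shards = list(shards)
    if any(is_primary and not is_active for is_primary, is_active in shards):
        return "red"     # at least one primary shard is inactive
    if any(not is_active for _, is_active in shards):
        return "yellow"  # all primaries are active, but some replica is not
    return "green"       # every primary and replica is active


print(cluster_status([(True, True), (False, True)]))    # green
print(cluster_status([(True, True), (False, False)]))   # yellow
print(cluster_status([(True, False), (False, False)]))  # red
```

A single-node cluster whose indices request one replica typically reports yellow, because a replica is never allocated to the same node as its primary.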

Index Configuration

When creating an index, you can specify the number of shards and replicas:

PUT /weblogs
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
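The two settings together determine how many shard copies the cluster has to place. A quick Python sketch of the arithmetic, with an illustrative helper name:

```python
import json


def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Each primary shard exists once, plus once per configured replica."""
    return number_of_shards * (1 + number_of_replicas)


settings = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}
print(json.dumps(settings, indent=2))  # request body for PUT /weblogs
print(total_shard_copies(**settings["settings"]))  # 6
```

With 3 primaries and 1 replica each, the index consists of 6 shard copies. The replica count can be changed on a live index, whereas the number of primary shards is set at creation time.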
