Understanding Big Data: Core Concepts and Technology Stack
Defining Big Data
Big data refers to datasets that cannot be captured, managed, or processed with conventional software tools within a reasonable time frame. It describes information assets that require new processing paradigms to yield stronger decision-making power, insight discovery, and process optimization.
The definition of big data centers on four fundamental characteristics, commonly known as the 4Vs:
Volume — Scale of Data
Modern technology has dramatically enhanced humanity's capacity to collect information, leading to exponential growth in data generation. For instance, major technology companies process hundreds of petabytes daily, with total data volumes reaching exabyte scales.
Velocity — Processing Speed
This characteristic relates to the frequency at which events such as sales, transactions, and measurements occur. During peak shopping events, payment processing can reach over 256,000 transactions per second, with real-time data handling exceeding 472 million records per second.
Variety — Data Source Diversity
Contemporary data processing must accommodate diverse sources including relational databases, NoSQL systems, flat files, XML documents, machine logs, images, audio, and video. New data formats and sources emerge continuously.
Veracity — Data Quality
Factors such as hardware and software failures, application bugs, and human errors can compromise data accuracy. Big data processing requires analyzing and filtering biased, falsified, or anomalous data to prevent inaccurate results.
Core Technologies for Big Data Processing
Hadoop
Apache Hadoop provides a distributed system infrastructure that enables users to develop distributed applications without requiring detailed knowledge of underlying distributed mechanisms. The framework leverages cluster resources for high-speed computation and storage.
Hadoop implements a distributed file system called HDFS (Hadoop Distributed File System). HDFS features high fault tolerance and operates on cost-effective hardware while providing high-throughput access to application data, making it suitable for applications with extremely large datasets. HDFS relaxes POSIX requirements to enable streaming access to filesystem data.
The core components of Hadoop are HDFS and MapReduce. HDFS provides storage for massive data volumes, while MapReduce handles computational processing.
Apache Spark
Spark is a fast, general-purpose computing engine designed for large-scale data processing. Developed at UC Berkeley's AMPLab, Spark is a parallel framework in the spirit of Hadoop MapReduce but with significant improvements: it can keep intermediate job outputs in memory, eliminating the need to read and write HDFS between stages, which makes it better suited to data mining and machine learning algorithms that require iterative processing.
Spark operates as an in-memory cluster computing environment; unlike Hadoop MapReduce, it distributes working datasets across cluster memory, enabling interactive queries and optimizing iterative workloads. Implemented in Scala, Spark integrates tightly with Scala's collection operations, allowing developers to manipulate distributed datasets much as they would local collections.
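To illustrate the in-memory model, here is a minimal word-count sketch in Scala; the local master setting and the input path `input.txt` are assumptions for experimentation, not part of any particular deployment. The cached RDD is reused by two actions without recomputing from disk:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a cluster the master is set at submit time
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Build an RDD of (word, count) pairs and keep it in memory
    val counts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()                              // intermediate result stays in memory

    // Two actions reuse the cached RDD without rereading the source data
    println(s"Distinct words: ${counts.count()}")
    counts.sortBy(-_._2).take(10).foreach(println)

    spark.stop()
  }
}
```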
Technology Categories
File Storage Systems
- Hadoop HDFS
- Tachyon
- KosmosFS
Batch Processing
- Hadoop MapReduce
- Apache Spark
Stream Processing
- Apache Storm
- Spark Streaming
- S4
- Heron
- Apache Flink
Key-Value and NoSQL Databases
- HBase
- Redis
- MongoDB
Resource Management
- YARN
- Apache Mesos
Log Collection
- Apache Flume
- Logstash
- Kibana
Message Systems
- Apache Kafka
- RabbitMQ
Query and Analysis Engines
- Apache Hive
- Impala
- Apache Pig
- Presto
- Phoenix
- Spark SQL
- Apache Drill
Distributed Coordination
- Apache Zookeeper
OLAP Analysis
- Apache Kylin
- Druid
Cluster Management
- Ambari
- Ganglia
- Nagios
- Cloudera Manager
Machine Learning
- Apache Mahout
- Spark MLlib
Data Synchronization
- Apache Sqoop
Workflow Scheduling
- Apache Oozie
Essential Technologies for Beginners
HDFS Architecture
HDFS (Hadoop Distributed File System) serves as the foundational storage layer in the Hadoop ecosystem. It is designed with high fault tolerance for operation on commodity hardware and provides high-throughput access to application data, making it ideal for applications with large datasets.
Key Components:
| Component | Role |
|---|---|
| Client | System interface that calls HDFS API, interacts with NameNode for metadata, and communicates with DataNodes for read/write operations |
| NameNode | Single point of management that handles metadata, responds to client metadata queries, and assigns storage locations |
| DataNode | Manages data block storage and replication, executes block read/write operations |
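To make the client/NameNode/DataNode division concrete, the following sketch shows a client writing and reading a file through the Hadoop `FileSystem` API. The NameNode address and file path are assumptions; in practice `fs.defaultFS` is picked up from the cluster's `core-site.xml`:

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical NameNode address; normally loaded from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode:8020")
    val fs = FileSystem.get(conf)

    // Write: the client obtains block placements from the NameNode, then streams to DataNodes
    val path = new Path("/user/demo/hello.txt")
    val out = fs.create(path, true)
    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8))
    out.close()

    // Read: metadata comes from the NameNode, data blocks from the DataNodes
    val in = fs.open(path)
    val content = scala.io.Source.fromInputStream(in).mkString
    in.close()
    println(content)

    fs.close()
  }
}
```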
MapReduce Model
MapReduce represents a computational model for processing large-scale data. The framework divides applications into Map and Reduce phases: Map operations process independent elements of a dataset to generate key-value intermediate results, while Reduce operations combine values associated with the same key to produce final outputs. This division suits distributed parallel processing across clusters of computers.
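The word-count pattern below illustrates this model with plain Scala collections. It is only a conceptual sketch of the Map, shuffle, and Reduce phases, not the Hadoop MapReduce API; in Hadoop, the grouping step is performed by the framework across the cluster:

```scala
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do or not to do")

    // Map phase: each input record independently emits (key, value) pairs
    val mapped: Seq[(String, Int)] =
      lines.flatMap(line => line.split(" ").map(word => (word, 1)))

    // Shuffle: group intermediate pairs by key (handled by the framework in Hadoop)
    val grouped: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

    // Reduce phase: combine all values sharing a key into the final result
    val reduced: Map[String, Int] =
      grouped.map { case (word, ones) => (word, ones.sum) }

    reduced.foreach(println)
  }
}
```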
YARN Framework
YARN (Yet Another Resource Negotiator) serves as Hadoop's resource management system. Beyond MapReduce, the Hadoop ecosystem accommodates multiple applications processing data stored in HDFS. The resource management layer ensures multiple applications and jobs can execute simultaneously while guaranteeing each framework receives required resources.
Spark Streaming
Spark Streaming processes real-time data streams with high throughput and fault tolerance. It supports multiple data sources, including Kafka, Flume, Twitter, and TCP sockets, and enables complex operations such as map, reduce, and join. Results can be saved to external filesystems, databases, or real-time dashboards.
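A minimal Spark Streaming sketch follows, counting words over 5-second batches. The source is assumed to be a local TCP socket on port 9999 (for example, fed by `nc -lk 9999`); Kafka or Flume sources follow the same pattern:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive data, one to process it
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: a TCP socket on localhost
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()   // emit each batch's counts to the console

    ssc.start()
    ssc.awaitTermination()
  }
}
```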
Spark SQL
Spark SQL is one of the most prominent SQL engines in the Hadoop ecosystem, using Spark as its computational framework. Spark's fundamental data structure is the RDD (Resilient Distributed Dataset), a read-only, partitioned collection distributed across cluster nodes. Unlike MapReduce, which must write intermediate results to disk, Spark keeps data in memory, significantly improving performance for iterative operations.
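The small sketch below loads data into a DataFrame, registers a temporary view, and queries it with SQL; the file `people.json` and its fields are hypothetical. The same API can also query Hive tables when Hive support is enabled on the session:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical JSON input; each line is a record such as {"name":"Ann","age":34}
    val people = spark.read.json("people.json")

    // Expose the DataFrame to SQL and run a query; execution happens on Spark, in memory
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
    adults.show()

    spark.stop()
  }
}
```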
Hive
Hive provides a data warehouse infrastructure built on Hadoop, mapping structured data files to database tables and offering SQL-like query capabilities. It translates SQL statements into MapReduce jobs, enabling users with SQL knowledge to perform MapReduce statistics without developing specialized applications. This makes Hive particularly suitable for data warehouse analytics.
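One common way to reach Hive programmatically is over JDBC to a HiveServer2 instance, as in the sketch below. The host, port, credentials, and the `sales` table are assumptions; behind the scenes Hive compiles the SQL-like statement into MapReduce jobs:

```scala
import java.sql.DriverManager

object HiveQueryExample {
  def main(args: Array[String]): Unit = {
    // HiveServer2 JDBC endpoint; host and port are hypothetical
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // Hypothetical table; Hive translates this query into MapReduce jobs
    val rs = stmt.executeQuery(
      "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")
    while (rs.next()) {
      println(s"${rs.getString("category")}\t${rs.getLong("cnt")}")
    }

    rs.close(); stmt.close(); conn.close()
  }
}
```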
Impala
Impala operates as a Massively Parallel Processing (MPP) query engine on Hadoop, delivering high-performance, low-latency SQL queries against Hadoop cluster data with HDFS as the underlying storage. Its rapid query response enables the kind of interactive exploration and iterative analysis that batch-oriented SQL-on-Hadoop technologies cannot provide.
Impala excels in query execution speed, returning most query results within seconds or minutes compared to hours required by equivalent Hive queries. The engine defaults to Parquet columnar storage, which proves efficient for large queries in typical data warehouse scenarios.
HBase
HBase functions as a distributed storage system for structured data. Unlike relational databases, HBase accommodates unstructured data and uses a column-oriented rather than row-oriented model. As a scalable, reliable, high-performance, distributed database with dynamic schema support, HBase implements BigTable's data model: an enhanced sparse sorted map where keys consist of row keys, column keys, and timestamps.
HBase provides random real-time read/write access to massive datasets, with data processed through MapReduce, combining storage and parallel computation effectively.
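The sketch below shows random writes and reads with the HBase client API. The table `user`, the column family `info`, and the ZooKeeper address are assumptions, and the table is presumed to already exist:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseExample {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    // Hypothetical ZooKeeper quorum used to locate the cluster
    conf.set("hbase.zookeeper.quorum", "localhost")

    val connection = ConnectionFactory.createConnection(conf)
    // Assumes a table 'user' with column family 'info' already exists
    val table = connection.getTable(TableName.valueOf("user"))

    // Write: a cell is addressed by (row key, column family, column qualifier, timestamp)
    val put = new Put(Bytes.toBytes("row-1001"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
    table.put(put)

    // Random real-time read of the same row
    val result = table.get(new Get(Bytes.toBytes("row-1001")))
    val name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
    println(s"name = $name")

    table.close()
    connection.close()
  }
}
```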
Apache Kylin
Apache Kylin is an open-source distributed analysis engine providing SQL query interfaces and multidimensional OLAP capabilities on Hadoop. Originally developed by eBay and contributed to the open-source community, Kylin can query massive Hive tables in sub-second time.
Flume
Apache Flume offers a reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. The system supports various data sources in logging applications and provides simple data processing before writing to customizable data sinks.
Recommended Learning Path
For practitioners beginning their big data journey, focusing on the following technologies provides a solid foundation:
- HDFS and MapReduce for understanding distributed storage and processing fundamentals
- YARN for resource management concepts
- Hive for SQL-based data warehousing
- Spark for in-memory processing and advanced analytics
- HBase for NoSQL data access patterns
Mastering these core technologies enables effective navigation of the broader big data ecosystem and supports specialization in domain-specific tools based on use case requirements.