Introduction to NoSQL Systems
The term NoSQL stands for "Not Only SQL". It means that, traditional relational database systems (RDBMS) based on SQL are effective in many cases, they are not suitable for all data management needs.
Many organizations today deal with massive, fast-growing datasets, that may not fit well within the rigid structure of traditional SQL-based systems, especially for applications handling and storing vast, diverse, and rapidly changing data.
Two key reasons why NoSQL systems are used in such scenarios are:
Traditional SQL systems offer too many features (like rich querying, strict concurrency control, and complex schema management) that some applications do not need.
The relational model requires predefined schemas, which may be too restrictive for applications dealing with semi-structured or rapidly evolving data.
These applications is not suitable for a traditional relational system and typically needs multiple types of databases and data storage systems.
It often involve unstructured or semi-structured data, requiring more flexible, scalable, and high-performance data storage and processing approaches, thus motivating the rise of NoSQL systems.
Characteristics of NoSQL Systems
NoSQL systems are primarily distributed databses, schema-flexible, and highly scalable, with a focus on, semi-structured data, availability, data replication, and performance rather than on strict immediate data consistency, structured data storage and complex querying.
Their characteristics can be grouped into two categories:
- Characteristics related to distributed architecture
- Characteristics related to data models and query languages
Characteristics Related to Distributed Databases
Scalability
NoSQL systems are designed for horizontal scalability, meaning that system capacity can be increased by adding more machines (nodes). As data volumes grow, the system continues operating while redistributing existing data across new nodes without downtime or interrupting system operations. This is essential for modern applications that need to scale dynamically.
Availability, Replication, and Eventual Consistency
Continuous availability is a core feature for the applications that use NOSQL.
To ensure this:
Data is replicated across multiple nodes in a transparent manner. If one node fails, others can still serve requests.
Replication improves data availability and Read performance improves because reads can be distributed across replicated data nodes (replicas).
Write operations become more complex since update must apply to all copies, especially if consistency is required across replicas. To manage the overhead of updating multiple copies, especially for write operations, NoSQL systems adopt eventual consistency rather than strict serializable consistency.
This means that distributed copies of data may be temporarily inconsistent but all replicas are guaranteed to converge to the same value eventually. This relaxes strict consistency to gain better performance and availability.
Replication Models
Two common replication approaches are:
Master-Slave Replication: One node acts as the master. All writes are directed to the master and later propagated to slaves (eventual consistency). Reads can be directed to either, but only the master always holds the latest data.
Master-Master Replication: Any node can accept writes. However, since updates can happen simultaneously on different nodes, temporary inconsistencies may occur. Conflict resolution becomes necessary in such systems.
Sharding (Horizontal Partitioning)
Sharding of the file records is often employed in NOSQL systems. This divides documents or records into disjoint partitions (shards) across multiple nodes, distributing the access load and improving performance.
Sharding combined with replication helps improve both load balancing and data availability.
Large datasets are sharded, divided horizontally across multiple nodes. Each shard holds a subset of the data, and shards can also be replicated for load balancing, redundancy and availability. This technique enables thousands of users to access the data concurrently without bottlenecks.
High-Performance Data Access
Most NoSQL systems optimize for fast access to individual records using a key. This is done using:
Hashing: A hash function maps each key to a storage location.
Range Partitioning: Keys are assigned to a location based on value ranges (e.g., values between K₁ and K₂ go to a particular shard).
The majority of accesses to an object will be by providing the key value rather than by using complex query conditions. The object key is similar to the concept of object id.
In hashing, a hash function h(K) is applied to the key K, and the location of the object with key K is determined by the value of h(K).
In range partitioning, the location is determined via a range of key values; for example, location i would hold the objects whose key values K are in the range Kimin ≤ K ≤ Kimax.
These techniques allow efficient lookups based on keys rather than complex query conditions.
Characteristics Related to Data Models and Query Languages
Schema Flexibility
A significant flexibility of many NoSQL systems is that they do not require a predefined schema. They typically store semi-structured, self-describing data, like JSON (JavaScript Object Notation), XML (Extensible Markup Language) where schema information is mixed with data values, and individual data items of the same type may have different sets of attributes.
This allows for frequent schema changes common in fast-evolving web applications.
As there may not be a schema, Applications accessing the data are responsible for enforcing any necessary constraints or structure, which increases flexibility, especially when data evolves frequently or varies in format.
Simplified Query Interfaces
NoSQL systems typically do not offer a full-fledged query language like SQL. Instead, they provide:
Basic operations through APIs to read and write objects accomplished by calling the appropriate operations for data access (e.g., CRUD operations: Create, Read, Update, Delete).
Limited support for complex operations like joins in query language itself, which are often handled manually in application logic.
This design choice favors performance and scalability over querying power, aligning with use cases that retrieve specific objects using a unique key rather than performing analytical queries. So they do not require a powerful query language.
Versioning
Some NoSQL databases support versioning of data, where multiple versions of a data item are stored with timestamps.
This allows:
Historical tracking of changes
Conflict resolution in distributed updates
Auditing and rollback capabilities
Versioning is particularly useful in systems where concurrent writes may occur or where data immutability is required.
The CAP Theorem
In the context of distributed database systems (DDBS), maintaining traditional ACID properties—Atomicity, Consistency, Isolation, and Durability—is particularly challenging when data is replicated across multiple nodes.
The CAP theorem, introduced by Eric Brewer in 2000 and formally proven later, addresses the trade-offs faced in such systems. The acronym CAP stands for:
(C) Consistency among replicated copies
(A) Availability of system for read and write
(P) Partition Tolerance during a network fault
The CAP theorem states that a distributed system with data replication cannot simultaneously guarantee all three properties. Instead, a system can guarantee only two of the three at any given time.
1. Consistency
In the CAP context, consistency refers to data consistency across all replicas of the same data in a distributed system. It means that every read receives the most recent write, or an error, regardless of which node is accessed. This is similar to the idea of single-copy consistency, where all clients see a unified, up-to-date view of data.
2. Availability
Availability means that the system guarantees that every request (read or write) receives a response, even if some nodes are not functioning or are unreachable. The response may not be the most recent value, but it will not fail silently or timeout indefinitely.
3. Partition Tolerance
Partition tolerance means that the system continues to operate despite failures or disruptions in communication between nodes. When the system is partitioned (e.g., due to network failure), the nodes may become isolated into groups, but the system as a whole must still function and respond to requests.
Implication of the CAP Theorem
According to the CAP theorem, in the presence of a network partition (which is inevitable in distributed systems), a choice must be made between Consistency and Availability:
A CP system prioritizes Consistency and Partition tolerance. It may sacrifice availability when a partition occurs (e.g., MongoDB).
An AP system prioritizes Availability and Partition tolerance, but may allow temporary inconsistencies (e.g., Amazon’s DynamoDB).
A CA system is theoretically desirable but not practical in distributed systems where partitions can occur.
As a result, NoSQL systems often favor AP, relaxing strict consistency in favor of eventual consistency, where all replicas will eventually converge to the same value.