Distributed Database
A distributed computing system is composed of multiple processing sites or nodes that are connected via a computer network. These nodes work together to perform assigned tasks, allowing a large, complex problem to be divided into smaller sub-problems. These sub-problems are solved in parallel across the system, improving efficiency and performance. Nodes are autonomous but cooperate to provide the required functionality.
Introduction to Distributed Databases
A Distributed Database (DDB) is a collection of logically related databases that are physically spread across multiple locations and connected by a computer network.
The software that manages this system is a Distributed Database Management System (DDBMS). The primary goal of a DDBMS is to make the distribution transparent, giving users the illusion that they are interacting with a single, centralized database.
A Distributed Database (DDB) arises from the integration of two main technologies:
- Database systems: enabling data storage and management.
- Distributed systems: enabling communication, coordination, and query processing across multiple geographically separated computers.
Core Characteristics of a DDB
For a system to be considered a distributed database, it must have:
Networked Database Nodes: Multiple computers (sites or nodes) interconnected through an underlying network, capable of sending data and commands between sites.
Logical Interrelation: The data stored across different nodes is logically connected to form a single, coherent system.
Heterogeneity: The nodes can differ in hardware, operating systems, or even the database models they use; uniformity (homogeneity) is not required.
Transparency in DDB: Hiding the Complexity
A key feature of a DDBMS is transparency, which shields users from the implementation and operational details of the distributed system.
1. Distribution (Data Organization) Transparency: Users do not need to know where data is physically stored or how the underlying network operates.
This includes:
Location Transparency: The command used to perform a task is independent of both the data's physical location and the node at which the command is issued.
Naming Transparency: Named objects (tables, rows, etc.) can be accessed by their names alone, without specifying their physical locations.
2. Replication Transparency: The system may store multiple copies (replicas) of data on different nodes for performance and availability. Users are completely unaware of these copies and interact with the data as if there is only a single instance.
3. Fragmentation Transparency: A large table may be split into smaller pieces (fragments) and stored across different nodes. Users are unaware of this fragmentation and can query the table as a whole.
The two main types are:
Horizontal Fragmentation (Sharding): Rows of a table are divided among different nodes.
Vertical Fragmentation: Columns of a table are divided among different nodes.
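A minimal sketch of the two fragmentation styles, using an in-memory table of Python dicts (the `employees` table, its columns, and the region values are hypothetical, not taken from any particular DBMS):

```python
# Hypothetical table for illustration only.
employees = [
    {"id": 1, "name": "Ada",   "region": "EU", "salary": 70000},
    {"id": 2, "name": "Grace", "region": "US", "salary": 80000},
    {"id": 3, "name": "Alan",  "region": "EU", "salary": 65000},
]

# Horizontal fragmentation (sharding): rows are split by a predicate,
# e.g. one fragment per region, each stored on a different node.
fragment_eu = [row for row in employees if row["region"] == "EU"]
fragment_us = [row for row in employees if row["region"] == "US"]

# Vertical fragmentation: columns are split into fragments; the key
# column ("id") is repeated in every fragment so the original table
# can be reconstructed by a join.
fragment_public  = [{"id": r["id"], "name": r["name"]} for r in employees]
fragment_private = [{"id": r["id"], "salary": r["salary"]} for r in employees]
```

Fragmentation transparency means the DDBMS, not the user, performs this splitting and the recombination at query time.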
4. Design and Execution Transparency
Design Transparency: Users are not concerned with how the distributed system is architected.
Execution Transparency: Users do not need to know where or how transactions are executed.
Failures, Errors, and Faults:
We can directly relate reliability and availability of the database to the faults, errors, and failures associated with it.
Fault: The root cause of an error (e.g., a hardware issue).
Error: A system state that may cause failure.
Failure: When the system deviates from its expected behavior, so that operations no longer execute correctly.
Role of the Recovery Manager:
Managing concurrency and recovery in DDBs is more complex than in centralized systems due to the presence of multiple data copies, potential failures of individual sites, and failures in communication links (including network partitioning).
A DDBMS recovery manager addresses:
Transaction failures
Hardware failures: Loss of memory or storage contents.
Network failures: Includes message loss, corruption, or out-of-order arrival at destination.
A reliable DDBMS tolerates failure of underlying components and continues to process user requests as long as data consistency is maintained.
Fundamental Goals of a Distributed System
Distributed systems are designed with several key properties in mind to handle the challenges of operating across a network.
High Availability and Reliability
DDBs aim for high availability and reliability, two of the most commonly cited potential advantages of distributed databases.
Availability is the probability that the system is operational and accessible to users at any given time.
Reliability is the probability that the system is functioning correctly and without failure.
These are achieved through:
Fault Tolerance: The system is designed to anticipate and handle faults (e.g., hardware issues, network errors) before they cause a total system failure, and to tolerate network partitioning, continuing operation even if communication links between groups of nodes fail.
Fault Isolation: Faults are confined to their site of origin, so the failure of some sites does not bring down the rest of the system.
Replication: By storing copies of data on multiple nodes, the system can continue to operate even if some nodes fail. If one node goes down, requests can be redirected to another node that holds a replica, keeping the data accessible.
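The effect of replication on availability can be sketched with a simple probabilistic model, assuming node failures are independent and each node is up with probability `a` (an idealized assumption for illustration, not a property of any particular DDBMS):

```python
def replicated_availability(a: float, n: int) -> float:
    """Probability that at least one of n independent replicas,
    each available with probability a, is reachable."""
    return 1 - (1 - a) ** n

# With nodes that are each up 99% of the time:
single = replicated_availability(0.99, 1)  # ~0.99
triple = replicated_availability(0.99, 3)  # ~0.999999
```

Even this toy model shows why replication is central to availability: three modestly reliable replicas already push the chance of total data unavailability down to roughly one in a million.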
Scalability
Scalability is the system's ability to handle increasing load by adding resources, while continuing to operate without performance degradation or interruption.
Horizontal Scalability (Scaling Out): Adding more nodes (computers) to the system to distribute the load. This is the most common approach for large-scale systems.
Vertical Scalability (Scaling Up): Increasing the capacity of individual nodes (e.g., adding more CPU, RAM, or storage).
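A minimal sketch of scaling out with hash partitioning, one common way to spread data and load across added nodes (the node names and key format here are hypothetical; production systems often use consistent hashing instead, so that adding a node moves less data):

```python
import zlib

def node_for(key: str, nodes: list[str]) -> str:
    """Assign a key to a node with simple modulo hash placement.
    Deterministic: the same key always maps to the same node."""
    # crc32 is used here only as a stable, portable hash for the demo.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

three_nodes = ["node-a", "node-b", "node-c"]
owner = node_for("customer:42", three_nodes)
```

Note the trade-off this sketch exposes: changing `len(nodes)` reassigns most keys, which is exactly the data-movement problem that consistent hashing schemes are designed to reduce.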
Partition Tolerance
Partition Tolerance is the system's ability to continue operating even when a network failure ("partition") occurs, which might prevent some nodes from communicating with others. The subnetworks can operate independently until the connection is restored.
Advantages of Distributed Databases
1. Improved Ease and Flexibility of Application Development
DDBs facilitate the development and maintenance of applications, especially across geographically distributed sites of an organization, primarily due to the transparency of data distribution and control.
2. Increased Availability
This is a major benefit, as DDBs are designed to provide continuous accessibility even in the face of failures.
Fault Isolation: Failures are isolated to their site of origin, meaning that if one or more individual sites fail, the DDBMS can continue to operate with its remaining running sites. Only the data and software located at the failed site become inaccessible.
Data Replication: Availability is further enhanced by replicating data and software at more than one site. If data at a failed site has been replicated elsewhere, users may remain unaffected.
Network Partitioning Tolerance: The system's ability to survive network partitioning (where communication links fail, breaking the network into isolated groups of nodes) also contributes to high availability: some of the data may be unreachable, but users can still access other parts of the database.
3. Improved Performance
DDBs can significantly enhance performance through various mechanisms:
Data Localization: Reduces access latency by placing data closer to where it is most frequently needed. A distributed DBMS reduces contention for CPU and I/O services and minimizes access delays, especially over wide area networks.
Smaller Local Databases: When a large database is distributed, each site typically manages a smaller, local database. This often leads to better performance for local queries and transactions.
Parallelism: DDBs support both interquery parallelism (executing multiple queries concurrently at different sites) and intraquery parallelism (breaking a single query into subqueries that execute in parallel across multiple sites), both of which contribute to improved performance.
Reduced Data Transfer Costs: Distributed query optimization algorithms often aim to minimize the amount of data transferred over the network, which can be a significant cost factor, especially in Wide Area Networks (WANs). Techniques like semijoin are used to reduce the number of tuples in a relation before transferring it, thereby minimizing data transfer.
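The semijoin idea can be sketched in plain Python over in-memory tables (the customer/order tables, the two sites, and all column names are hypothetical, chosen only to illustrate the technique):

```python
# Site A holds customers; site B holds orders. We want their join
# at site A without shipping the whole orders table over the network.
customers_at_a = [{"cid": 1, "city": "Oslo"}, {"cid": 2, "city": "Pune"}]
orders_at_b = [
    {"oid": 10, "cid": 1, "total": 50},
    {"oid": 11, "cid": 3, "total": 75},
    {"oid": 12, "cid": 2, "total": 20},
]

# Step 1: site A projects just the join column and sends it to site B.
cids_needed = {c["cid"] for c in customers_at_a}

# Step 2 (the semijoin): site B ships back only the matching tuples,
# reducing the relation before transfer.
orders_to_ship = [o for o in orders_at_b if o["cid"] in cids_needed]

# Step 3: the final join runs at site A over far fewer tuples.
result = [
    {**c, **o}
    for c in customers_at_a
    for o in orders_to_ship
    if c["cid"] == o["cid"]
]
```

Here only two of the three order tuples cross the network; on real WAN-scale tables, this reduction is where the cost saving comes from.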
4. Easier Expansion (Scalability)
DDBs provide an architecture that makes it much easier to expand the system by adding more data, increasing database sizes, or incorporating more nodes, compared to centralized systems. This horizontal scalability is particularly important for NoSQL systems and big data applications that experience continuous data growth.