Distributed Database Concepts
A distributed computing system is composed of multiple processing sites or nodes that are connected via a computer network. These nodes work together to perform assigned tasks, allowing a large, complex problem to be divided into smaller subproblems. These subproblems are solved in parallel across the system, improving efficiency and performance. Nodes are autonomous but cooperate to provide the required functionality.
A Distributed Database (DDB) arises from the integration of two main technologies:
Database systems: Manage structured data.
Distributed systems: Enable communication and coordination across multiple computers.
Distributed Database
A Distributed Database (DDB) is a collection of multiple logically interrelated databases distributed across multiple physical locations connected via a network.
A Distributed Database Management System (DDBMS) is the software system responsible for managing this distributed database. It hides the complexity of data distribution from users, providing distribution transparency.
Key Characteristics of a Distributed Database:
For a database to be called distributed, the following minimum conditions should be satisfied:
Networked Database Nodes: There must be multiple computers (sites or nodes) connected through an underlying network capable of sending data and commands between sites.
Logical Interrelation: The data stored across database nodes must be logically related (connected or integrated) to form a single system view.
Heterogeneity Tolerance: Connected Nodes can differ in hardware, software, or data models; uniformity is not required (Absence of Homogeneity).
Types of Transparency in Distributed Databases
Transparency refers to hiding the complexity of the distributed system from the user. In a DDBMS, multiple layers of transparency exist:
1. Data Organization Transparency
Also known as distribution transparency, it ensures that users do not need to know:
Where data is physically stored.
How data is retrieved across the network.
Freedom for the user from the operational details of the network and the placement of the data in the distributed system.
Subtypes:
Location Transparency: The command used to perform a task is independent of the data's physical location and location of then node where command was issued.
Naming Transparency: Objects (tables, rows, etc.) can be accessed using names without specifying their physical locations.
2. Replication Transparency
Multiple copies of same data may exist across different nodes/sites to improve performance and availability. Replication transparency hides this fact from users, who interact with data as if only one copy exists.
3. Fragmentation Transparency
Data may be split into fragments:
Horizontal Fragmentation: Divides rows of a table(relation) among multiple nodes (a.k.a. sharding).
Vertical Fragmentation: Divides columns of a table across nodes.
Users are unaware of this fragmentation and access data as if it were whole.
4. Design and Execution Transparency
Design Transparency: Users are not concerned with how the distributed system is architected.
Execution Transparency: Users do not need to know where or how transactions are executed.
Availability and Reliability
Reliability and availability are two of the most common potential advantages cited for distributed databases.
Reliability: Probability that the system is functioning correctly at a given point in time.
Availability: Probability that the system remains operational (continuously available) during a given time interval.
In practice, availability is often used to refer to both concepts.
Failures, Errors, and Faults:
We can directly relate reliability and availability of the database to the faults, errors, and failures associated with it.
Fault: The root cause of an error(e.g., hardware issue).
Error: A system state that may cause failure.
Failure: When the system deviates from expected behavior for correct execution of operations.
Fault Tolerance:
Distributed databases implement fault-tolerant mechanisms. It recognizes that faults will occur, and it designs mechanisms that can detect and handle faults before they lead to system failures.
Role of the Recovery Manager:
A DDBMS recovery manager addresses:
Transaction failures
Hardware failures: Loss of memory or storage contents.
Network failures: Includes message loss, corruption, or out-of-order arrival at destination.
A reliable DDBMS tolerates failure of underlying components and continues to process user requests as long as data consistency is maintained.
Scalability and Partition Tolerance
Scalability
Is the ability of a system to grow(expand) its capacity and handle increased load while continuing to operate without performance degradation or interruption.
Types of Scalability:
Horizontal Scalability: Adding more nodes to the distributed system. Distributing some of the data and processing loads from existing nodes to the new nodes.
Vertical Scalability: Enhancing individual nodes in the system with more memory, CPU, or storage.
Partition Tolerance
As the system expands its number of nodes, it is possible that the network, may partition into groups of nodes. The nodes within each partition are still connected by a subnetwork, but communication among the partitions is lost.
Partition Tolerance is the system's ability to continue functioning even when the network is partitioned (i.e., some nodes cannot communicate with others). Subnetworks can still operate independently until full connectivity is restored.
Autonomy in Distributed Databases
Autonomy describes how independently each database or node in a distributed system can operate. High autonomy supports flexibility and localized control of nodes.
Types of Autonomy:
Design Autonomy: Each site can choose its own data models and transaction management strategies.
Communication Autonomy: Each site can control how and when it shares data with others.
Execution Autonomy: Local users can perform operations independently of other nodes.
Advantages of Distributed Databases
Improved Application Development Flexibility
- Applications can be developed and maintained at different distributed sites with ease due to transparency of data distributed and control.
Increased Availability
Faults are often isolated (localized) to site of origin, other parts of the database system (network) remain operational.
Replication of Data and software at more than one site ensures continuity even if a site fails.
The system can continue to function during network partitioning. Some of the data may be unreachable, but users may still be able to access other parts of the database.
Enhanced Performance
Data is kept closer to where it is needed. This Data localization reduces access latency.
Large database distributed to Smaller local databases result in faster local queries and transactions.
Each site has fewer transactions executing thus reduce contention for resources.
Parallelism is achieved executing multiple queries concurrent at different distributed sites as subqueries. (interquery and intraquery parallelism)
Ease of Expansion (Scalability)
- Adding more data, nodes, or increasing database size is simpler and more cost-effective compared to centralized systems.
Total Transparency vs. Autonomy
Total Transparency gives users a global view of the distributed system as if it were a centralized database. All complexities of distribution are hidden.
Autonomy, in contrast, allows individual nodes to manage themselves. These two features must be balanced to optimize user control and system efficiency.