Distributed Database Design Techniques

In a distributed database system (DDBMS), the design process revolves around three core techniques: data fragmentation, data replication, and data allocation. These strategies determine how data is logically divided, copied, and physically placed across the various nodes (sites) in a network.

The goal is to improve performance, increase availability, and ensure the system can scale effectively. All metadata about these design choices is stored in a global directory (or catalog), which the DDBMS uses to locate data and process queries.

These techniques serve the following purposes:

  • Breaking the database into manageable logical units called fragments.

  • Enhancing availability and reliability through data replication.

  • Strategically allocating data fragments or replicas to appropriate sites.

Data Fragmentation

Data fragmentation is the process of breaking down the database, typically a table (relation), into smaller logical units called fragments. This allows for more granular control over data placement at various nodes (processing sites) across a computer network.

A good fragmentation schema must satisfy two key properties:

  1. Completeness: All the data from the original table must be present in the fragments.

  2. Reconstructability: It must be possible to reconstruct the original table from its fragments using relational operators like UNION and JOIN.

1. Horizontal Fragmentation (Sharding)

This technique divides a table into subsets of rows (tuples). Each fragment, often called a shard, contains rows that satisfy a specific condition. This is the most common form of fragmentation.

  • Rows are grouped based on the values of one or more attributes, much like a WHERE clause in SQL. Each of these fragments can then be stored at different sites.

  • Completeness and Disjointness: A set of horizontal fragments is considered complete if their conditions cover all tuples in the original relation (every tuple satisfies at least one condition). It is disjoint if no tuple satisfies conditions for more than one fragment.

  • Reconstruction: The original table is reconstructed using the UNION operator on its horizontal fragments.

Example: Suppose we partition the EMPLOYEE relation by department number:

  • Fragment 1: Dno = 5 (stored at Site A)
  • Fragment 2: Dno = 4 (stored at Site B)
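The partitioning and UNION-based reconstruction above can be sketched with a toy in-memory EMPLOYEE relation (the rows and site names are illustrative, not from an actual DDBMS):

```python
# Hypothetical EMPLOYEE relation as a list of tuples (dicts).
EMPLOYEE = [
    {"Ssn": "111", "Name": "Alice", "Dno": 5},
    {"Ssn": "222", "Name": "Bob",   "Dno": 4},
    {"Ssn": "333", "Name": "Carol", "Dno": 5},
]

# Each horizontal fragment is a selection (like a SQL WHERE clause).
fragment_site_a = [t for t in EMPLOYEE if t["Dno"] == 5]  # stored at Site A
fragment_site_b = [t for t in EMPLOYEE if t["Dno"] == 4]  # stored at Site B

# Reconstruction: the UNION of the horizontal fragments.
reconstructed = fragment_site_a + fragment_site_b
assert sorted(reconstructed, key=lambda t: t["Ssn"]) == \
       sorted(EMPLOYEE,      key=lambda t: t["Ssn"])
```

Because every tuple satisfies exactly one condition (Dno = 5 or Dno = 4 here), the fragments are both complete and disjoint.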

Common Sharding Methods:

  • Range Partitioning: Rows are distributed based on a range of key values (e.g., customers with last names A-M go to Shard 1, N-Z to Shard 2).

  • Hash Partitioning: A hash function is applied to a shard key to determine which shard a row belongs to, ensuring a more random data distribution.

Purpose: Sharding distributes the load of accessing file records to multiple nodes, improving performance via load balancing and horizontal scalability.
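The two sharding methods can be sketched as simple routing functions (the key names and shard counts are made up for illustration):

```python
import hashlib

def hash_shard(key: str, num_shards: int) -> int:
    # Hash partitioning: a stable hash of the shard key picks the shard,
    # spreading rows roughly evenly across shards.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(last_name: str) -> int:
    # Range partitioning: last names A-M go to shard 0, N-Z to shard 1.
    return 0 if last_name[0].upper() <= "M" else 1
```

A stable (non-random) hash matters here: every node must route the same key to the same shard, which is why a fixed digest is used rather than Python's built-in `hash()` (randomized per process).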

2. Vertical Fragmentation

Vertical fragmentation divides a relation "vertically" into subsets of columns (attributes); each vertical fragment retains only a subset of the attributes.

Requirement for Reconstruction: To preserve the ability to reconstruct the original table, every vertical fragment must include the primary key (e.g., Ssn) or another unique identifier.

The original table is reconstructed with a FULL OUTER JOIN of the vertical fragments on the primary key.

  • Purpose: This technique is useful when different sites only need specific attributes of a relation, reducing data transfer and storage at those sites.

Example: Fragmenting the EMPLOYEE relation:

  • Fragment A: (Ssn, Name, Bdate, Address) (Personal Info)

  • Fragment B: (Ssn, Salary, Dno) (Work Info)
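A minimal sketch of this vertical split and its key-based reconstruction, again on a toy in-memory relation (column values are illustrative):

```python
EMPLOYEE = [
    {"Ssn": "111", "Name": "Alice", "Bdate": "1990-01-01",
     "Address": "12 Oak St", "Salary": 50000, "Dno": 5},
]

# Both fragments keep the primary key Ssn so the table can be rebuilt.
personal_cols = ("Ssn", "Name", "Bdate", "Address")
work_cols     = ("Ssn", "Salary", "Dno")

fragment_a = [{c: t[c] for c in personal_cols} for t in EMPLOYEE]
fragment_b = [{c: t[c] for c in work_cols}     for t in EMPLOYEE]

# Reconstruction: join the fragments on the shared primary key.
work_by_ssn  = {t["Ssn"]: t for t in fragment_b}
reconstructed = [{**a, **work_by_ssn[a["Ssn"]]} for a in fragment_a]
assert reconstructed == EMPLOYEE
```

Dropping `Ssn` from either fragment would make the join impossible, which is exactly why the reconstruction requirement forces every vertical fragment to carry the key.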

3. Mixed (Hybrid) Fragmentation

A combination of horizontal and vertical fragmentation. A table can first be split into horizontal fragments (shards), and then one or more of those shards can be further divided into vertical fragments.

  • Allows more flexible partitioning depending on both row and column-based needs.

  • The original relation can be reconstructed by applying UNION and OUTER UNION (or OUTER JOIN) operations in the appropriate order.
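The two-step split and the JOIN-then-UNION reconstruction can be sketched as follows (a toy relation with illustrative values, not an actual DDBMS API):

```python
EMPLOYEE = [
    {"Ssn": "111", "Name": "Alice", "Salary": 50000, "Dno": 5},
    {"Ssn": "222", "Name": "Bob",   "Salary": 40000, "Dno": 4},
]

# Step 1: horizontal split by department.
shard5 = [t for t in EMPLOYEE if t["Dno"] == 5]
shard4 = [t for t in EMPLOYEE if t["Dno"] == 4]

# Step 2: vertically split shard5; the key Ssn stays in both pieces.
shard5_personal = [{k: t[k] for k in ("Ssn", "Name")}          for t in shard5]
shard5_work     = [{k: t[k] for k in ("Ssn", "Salary", "Dno")} for t in shard5]

# Reconstruction: JOIN the vertical pieces on Ssn, then UNION with shard4.
work = {t["Ssn"]: t for t in shard5_work}
rebuilt5      = [{**p, **work[p["Ssn"]]} for p in shard5_personal]
reconstructed = rebuilt5 + shard4
assert sorted(reconstructed, key=lambda t: t["Ssn"]) == \
       sorted(EMPLOYEE,      key=lambda t: t["Ssn"])
```

The order of operations matters: the vertical pieces of a shard must be joined back together before that shard can be unioned with the others.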


Fragmentation Schema

A fragmentation schema defines the complete set of fragments into which the database is divided. It must satisfy:

  • Completeness: Every tuple and attribute in the database appears in at least one fragment.

  • Reconstructability: The original database can be recreated from the fragments using OUTER JOINs, UNIONs, or OUTER UNIONs.

Once the database is fragmented, an allocation schema then describes how these fragments (or their replicas) are assigned to specific nodes (sites) in the distributed system.

Data Replication

Data replication is the process of creating and maintaining multiple copies (replicas) of data fragments on different nodes. The primary goals are to improve data availability (if one node fails, the data is still accessible from another) and read performance (queries can be processed by the nearest node).

Replication Strategies

  1. Full Replication: Every fragment is replicated at every site in the network. This maximizes availability for reads, since any query can be answered locally, but it makes updates very slow: every change must be propagated to every copy, which also complicates concurrency control and recovery.

  2. Partial Replication: This is the most common strategy. Only some fragments are replicated, often those that are in high demand. The number of replicas for each fragment can vary.

  3. No Replication: Each fragment is stored at exactly one site. This minimizes storage costs and makes updates fast, but it offers the lowest availability and resilience.

A replication schema outlines which fragments are replicated and at which sites.
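A replication schema can be pictured as a mapping from each fragment to the set of sites holding a copy (the fragment and site names below are purely illustrative):

```python
# Hypothetical replication schema: fragment name -> sites holding a replica.
replication_schema = {
    "EMP_DNO5": {"SiteA", "SiteC"},  # partial replication: hot fragment, 2 copies
    "EMP_DNO4": {"SiteB"},           # no replication: a single copy
}

def sites_holding(fragment: str) -> set:
    # Where can a read of this fragment be served from?
    return replication_schema.get(fragment, set())

def is_replicated(fragment: str) -> bool:
    # A fragment is replicated when more than one site holds it.
    return len(sites_holding(fragment)) > 1
```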

Data Allocation

Data allocation is the final step where each fragment and its replicas are assigned to a specific physical node in the distributed system. This is defined by an allocation schema.

The decision of where to allocate data is critical and depends on several factors:

  • Performance: Placing data closer to the users who access it most frequently to reduce query latency.

  • Availability: Distributing replicas across different geographic locations or failure domains to protect against outages.

  • Transaction Workload: Analyzing the frequency of reads and writes at different sites.

  • Network Costs: Minimizing data transfer across expensive or slow network links.

An allocation schema maps each fragment to one or more sites of the DBMS. When a fragment is stored at multiple sites, it is said to be replicated.
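As a toy illustration of the performance and workload factors above, one simple (and deliberately naive) heuristic assigns each fragment to the site that reads it most often; the access counts here are made-up example data, not a real workload trace:

```python
# Hypothetical read counts per fragment per site, e.g. from workload analysis.
read_frequency = {
    "EMP_DNO5": {"SiteA": 120, "SiteB": 10},
    "EMP_DNO4": {"SiteA": 5,   "SiteB": 90},
}

# Naive allocation: place each fragment at the site with the most reads.
allocation_schema = {
    frag: max(sites, key=sites.get) for frag, sites in read_frequency.items()
}
# → {"EMP_DNO5": "SiteA", "EMP_DNO4": "SiteB"}
```

A production allocator would also weigh write frequency, replica counts, network costs, and failure domains, but the principle of placing data near its heaviest readers is the same.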