Database Design
Introduction to Database Design
Database design is the process of structuring data into a well-organized schema of tables (relations) to ensure data is stored efficiently, without redundancy, and can be accessed accurately.
The main goals of relational database design are:
Information preservation: A stable data structure where no data is lost during design transformations.
Minimal redundancy: Avoid storing the same information more than once to reduce storage and avoid inconsistency during updates.
While there are two main design approaches bottom-up (synthesis) and top-down (analysis), the top-down approach is far more common.
It starts with real-world groupings of information (like an invoice or a registration form) having related attributes together and progressively refines them into well-structured tables.
Informal Design Guidelines for Relation Schema
In relational database design, informal design guidelines are a set of principles used to evaluate the quality of a relation schema before applying more formal normalization techniques. These guidelines help database designers create schemas that are easy to understand, minimize redundancy, reduce the presence of NULL values, and prevent the generation of spurious data.
Guideline 1: Clear Semantics (One Theme per Table)
A well-designed relation schema has a real-world meaning that is easy to explain and unambiguous. If the purpose of a table cannot be described in one simple sentence, it may be poorly designed.
Attributes within a single relation should relate to one entity type or one relationship type. When a relation mixes attributes from multiple entity or relationship types, it can lead to semantic ambiguities.
Each table should represent a single entity type or relationship. Avoid mixing attributes from different concepts into one table.
Bad: A table mixing employee details, department details, and project assignments.
EMP_DEPT(Ssn, Ename, Bdate, Address, Dnumber, Dname, Dmgr_ssn)
Good: Separate tables for
EMPLOYEES
,DEPARTMENTS
, andPROJECTS
.
Guideline 2: Reduce Redundancy and Avoid Anomalies
The design should aim to minimize the redundant storage of the same information and reduce the need for multiple updates to maintain consistency across data copies.
Storing the same information repeatedly is inefficient and dangerous. It leads to update anomalies:
Insertion Anomaly: It might be impossible to insert information about an entity without also inserting information about another related entity which is unrelated data. (e.g., Can't add a new department until at least one employee is assigned to it).
Deletion Anomaly: Deleting a tuple might unintentionally remove other crucial but unrelated information.. (e.g., If the last employee of a department is deleted from
EMP_DEPT
, the entire department's information could be lost).Modification Anomaly: Updating a piece of information that is stored redundantly in multiple tuples requires updating all occurrences to maintain consistency. (e.g., If a department changes its name, you must update the record of every employee in that department).
Guideline 3: Minimize NULL Values
Database designers should avoid placing attributes in a base relation whose values may frequently be NULL. If NULLs are unavoidable, they should be reserved for exceptional cases rather than applying to a majority of tuples.
- Best Practice: If a set of attributes is frequently
NULL
for many records, consider placing them in a separate table.
NULL values can indicate different things, can lead to ambiguity and complicate queries. Comparisons and aggregation operations (like COUNT
, SUM
, AVG
) behave differently with NULL values, and joins involving NULLs can also produce unexpected results, making queries harder to write and understand.
Guideline 4: Prevent Spurious Tuples
When decomposing a relation into smaller relations, the decomposition must be "lossless" or "nonadditive". This ensures that when the decomposed relations are joined back together, no "spurious tuples" (invalid or erroneous information) are generated.
Avoid relations that contain matching attributes that are not (foreign key, primary key) combinations because joining on such attributes may produce spurious tuples.
A "lossy" decomposition can lead to incorrect data being retrieved.
If a relation EMP_PROJ
(Ssn
, Pnumber
, Hours
, Ename
, Pname
, Plocation
) is decomposed into EMP_LOCS
(Ename
, Plocation
) and EMP_PROJ1
(Ssn
, Pnumber
, Hours
, Pname
, Plocation
),
Joining EMP_LOCS
and EMP_PROJ1
by using attribute Plocation
for joining, which is neither a primary key nor a foreign key in either of the decomposed relations, leading to misinformation and produce additional tuples that were not in the original EMP_PROJ
relation.
Functional Dependencies (FDs): The Formal Tool
Functional dependencies (FDs) serve as a formal tool to analyze and improve the quality of relation schemas. They specify constraints on the relationships between two sets of attributes within a database schema.
A functional dependency, denoted
X → Y
, states that the value(s) of attribute setX
uniquely determine the value(s) of attribute setY
.
If two rows in a table have the same values for X
, they must have the same values for Y
. In the dependency X → Y
, X
is the determinant and Y
is the dependent.
Values of the attributes in X
uniquely determine the values of the attributes in Y
. We also say that Y
is functionally dependent on X
.
Examples:
Ssn → Ename
(A Social Security number determines one employee name).Pnumber → {Pname, Plocation}
(A project number determines one project name and one location).{Ssn, Pnumber} → Hours
(The combination of an employee and a project determines the hours worked).
Superkey Relationship: A set of attributes
K
is a superkey forR
if and only ifK → R
holds onR
. This meansK
uniquely identifies an entire tuple.Trivial Functional Dependencies: An FD
X → Y
is considered trivial ifY
is a subset ofX
(Y ⊆ X
). Such FDs are always true in any relation and do not convey significant information about the data semantics.A → A
andAB → A
are trivial.
Functional dependencies are crucial because they allow database designers to:
Specify Constraints: They formally define consistency rules that data in a database must adhere to.
Identify Redundancy: They are the primary tool for detecting undesirable properties in relational schemas, such as data redundancy and update anomalies (insertion, deletion, modification anomalies).
Guide Normalization: FDs are the foundation for defining various normal forms (1NF, 2NF, 3NF, BCNF) and for the normalization process, which aims to decompose problematic relations into "better" ones that minimize redundancy and anomalies.