Normalization: The Step-by-Step Refinement Process
In relational database design, normalization is a systematic process used to create a set of relation schemas that minimizes unnecessary redundancy and ensures data integrity, making information retrieval easier.
This process involves applying a series of Normal Form tests to relation schemas and, if necessary, decomposing them into smaller, "better" schemas that adhere to specific normal forms.
A superkey is a set of one or more attributes that uniquely identifies a tuple in a relation.
A candidate key is a minimal superkey, meaning no proper subset of its attributes can uniquely identify tuples on its own.
The primary key is one of the candidate keys chosen by the database designer as the principal means of identifying tuples within a relation.
A prime attribute is any attribute that is a member of at least one candidate key of a relation. Attributes not part of any candidate key are called nonprime attributes.
First Normal Form (1NF)
A relation is in 1NF if all its attribute values are atomic. This is the baseline for any relational table, meaning no attribute can hold multiple values (like a list or set) in a single row.
First Normal Form (1NF) states that the domain of an attribute must include only atomic (simple, indivisible) values, and the value of any attribute in a tuple must be a single value from that domain. This essentially disallows multivalued attributes, composite attributes, and nested relations.
Example (Not in 1NF):
Student(SID, Name, Phones)
(101, 'Alice', {9876543210, 9123456780})
In 1NF:
Student(SID, Name, Phone)
(101, 'Alice', 9876543210)
(101, 'Alice', 9123456780)
- Normalization to 1NF: To bring a relation into 1NF, multivalued attributes are typically removed and placed into a separate relation along with the primary key of the original relation.
Second Normal Form (2NF): Eliminating Partial Dependencies
A relation is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on the entire primary key. This means that no nonprime attribute can be functionally dependent on only a part of a composite primary key.
2NF targets partial dependencies. This is a major source of data redundancy and update anomalies. If the primary key contains a single attribute, the 2NF test does not apply.
Example (Violates 2NF):
EMP_PROJ(Ssn, Pnumber, Hours, Ename, Pname, Plocation)
Primary Key:
{Ssn, Pnumber}
Partial Dependencies:
Ssn → Ename
(Ename depends only onSsn
, not the full key).Pnumber → {Pname, Plocation}
(Project details depend only onPnumber
).
Solution (Decomposition):
EMPLOYEE(Ssn, Ename)
PROJECT(Pnumber, Pname, Plocation)
WORKS_ON(Ssn, Pnumber, Hours)
Normalization to 2NF: If a relation is not in 2NF, it is decomposed into new relations. Each new relation will contain a partial key from the original primary key and the nonprime attributes that are fully functionally dependent on that partial key. A relation with the original primary key and attributes fully dependent on it is are retained.
Third Normal Form (3NF): Eliminating Transitive Dependencies
A relation schema R
is in 3NF if it satisfies 2NF and no nonprime attribute of R
is transitively dependent on the primary key.
A transitive dependency occurs when a non-prime attribute indirectly depends on the primary key through another non-prime attribute, rather than directly on the primary key. Which causes further redundancy and update anomalies.
A simple way to think of it is: A → B
and B → C
, where A
is the key and B
is not.
Example (Violates 3NF):
EMP_DEPT(Ssn, Ename, Dnumber, Dname, Dmgr_ssn)
Primary Key:
Ssn
Transitive Dependency:
Ssn → Dnumber → {Dname, Dmgr_ssn}
. The department name and manager depend on the department number, which is not a key.
Solution (Decomposition):
EMPLOYEE(Ssn, Ename, Dnumber)
Dnumber
is now foreign keyDEPARTMENT(Dnumber, Dname, Dmgr_ssn)
Normalization to 3NF: If a relation is not in 3NF, it is decomposed. The attributes involved in the transitive dependency are moved to a new relation, with the intermediate attribute (B
) acting as the primary key of the new relation and a foreign key in the original relation.
Normal Forms Based on Primary Keys
Normal Form | Condition | Remedy (Normalization) |
---|---|---|
1NF | No multivalued or nested attributes. All attributes are atomic. | Split multivalued attributes into separate rows or tables |
2NF | No partial dependencies (non-key attribute depending on part of composite key) | Decompose relation to separate partial dependencies |
3NF | No transitive dependencies (non-key attribute depending on another non-key) | Decompose relation to eliminate transitive dependencies |
A Note on Denormalization
Sometimes, for performance reasons, designers may choose to denormalize a database. This is a deliberate process of taking a highly normalized schema and combining tables (violating some normal forms) to reduce the number of joins required for a query. This is a trade-off: it increases redundancy to improve query speed.
Boyce-Codd Normal Form (BCNF)
Boyce-Codd Normal Form (BCNF) is a database normal form that is considered a stricter version of Third Normal Form (3NF). While every table in BCNF is also in 3NF, a table in 3NF is not necessarily in BCNF. BCNF was developed to handle certain types of anomalies that 3NF does not address.
The Core Rule of BCNF
A relation schema is in BCNF if for every one of its non-trivial functional dependencies,
X → Y
, the determinantX
is a superkey.
In simpler terms, for any rule in the table where a set of columns X
determines another set of columns Y
, X
must be capable of uniquely identifying every row in the table (i.e., it must be a superkey).
BCNF addresses a specific and subtle type of data redundancy that can still exist in a 3NF table. This occurs when:
A table has multiple, overlapping candidate keys.
A non-key attribute (or a part of a candidate key) determines another part of a candidate key.
Example:
Consider a table TEACH
that tracks student course enrollments taught by a single instructor per course.
Relation:
ENROLLMENT(StudentID, Course, Instructor)
Functional Dependencies:
{StudentID, Course} → Instructor
(A student in a course has one instructor).Instructor → Course
(Each instructor teaches only one course).
The candidate key for TEACH
is {Student, Course}
. TEACH
is in 3NF but not in BCNF because Instructor → Course
violates BCNF (Instructor is not a superkey)
From these dependencies, we can identify two candidate keys: {StudentID, Course}
and {StudentID, Instructor}
.
This table is in 3NF, but it still has redundancy. The fact that 'Professor Turing' teaches 'CS101' will be repeated for every single student enrolled in that course.
The problem lies with the dependency Instructor → Course
. This violates BCNF because the determinant, Instructor
, is not a superkey on its own.
Achieving BCNF: The Decomposition Algorithm
If a table is not in BCNF, it must be decomposed. The process is straightforward:
Find a Violation: Identify a functional dependency
X → Y
that violates BCNF (whereX
is not a superkey).Decompose the Table: Split the original table into two new tables:
Table 1: Contains the attributes from the violating dependency (
X
andY
).Table 2: Contains
X
and all the other attributes from the original table that were not inY
.
Repeat: Continue this process until all resulting tables are in BCNF.
Applying this to our ENROLLMENT
example:
Violation: The dependency
Instructor → Course
violates BCNF. Here,X
isInstructor
andY
isCourse
.Decomposition:
Table 1:
(Instructor, Course)
- This table stores which instructor teaches which course.Table 2:
(StudentID, Instructor)
- This table tracks which student is taught by which instructor.
The new tables, (Instructor, Course)
and (StudentID, Instructor)
, are now both in BCNF and the data redundancy is eliminated.
Applying the BCNF decomposition algorithm:
- Choose the violating FD:
Instructor → Course
(X = {Instructor}
,Y = {Course}
). - Decompose
TEACH
into:R1 = (X ∪ Y) = (Instructor, Course)
R2 = (TEACH − (Y − X)) = (Student, Course, Instructor) − (Course − Instructor) = (Student, Instructor)
(sinceCourse
is not inInstructor
,Y-X
isCourse
).
- The resulting schemas are
TEACH1
(Instructor
,Course
) andTEACH2
(Instructor
,Student
). Both of these relations are in BCNF. This decomposition is lossless but not dependency-preserving for the original FD1.
Important Trade-Offs of BCNF Decomposition
Decomposing a schema to achieve BCNF has important properties:
Lossless Join Property (Guaranteed): The BCNF decomposition algorithm guarantees that you can join the new tables back together to get the original data without creating any spurious (false) rows. This is a critical requirement.
Dependency Preservation (Not Guaranteed): BCNF decomposition does not always preserve all the original functional dependencies. In our example, the original dependency
{StudentID, Course} → Instructor
can no longer be checked in a single table; it can only be verified by joining the two new tables. This is a known trade-off that designers must sometimes accept to achieve the higher normal form.