Normalization: The Step-by-Step Refinement Process

In relational database design, normalization is a systematic process used to create a set of relation schemas that minimizes unnecessary redundancy and ensures data integrity, making information retrieval easier.

This process involves applying a series of Normal Form tests to relation schemas and, if necessary, decomposing them into smaller, "better" schemas that adhere to specific normal forms.

A superkey is a set of one or more attributes that uniquely identifies a tuple in a relation.
A candidate key is a minimal superkey, meaning no proper subset of its attributes can uniquely identify tuples on its own.
The primary key is one of the candidate keys chosen by the database designer as the principal means of identifying tuples within a relation.
A prime attribute is any attribute that is a member of at least one candidate key of a relation. Attributes not part of any candidate key are called nonprime attributes.

First Normal Form (1NF)

A relation is in 1NF if all its attribute values are atomic. This is the baseline for any relational table, meaning no attribute can hold multiple values (like a list or set) in a single row.

First Normal Form (1NF) states that the domain of an attribute must include only atomic (simple, indivisible) values, and the value of any attribute in a tuple must be a single value from that domain. This essentially disallows multivalued attributes, composite attributes, and nested relations.

Example (Not in 1NF):

plaintext

Student(SID, Name, Phones)
(101, 'Alice', {9876543210, 9123456780})

In 1NF:

plaintext

Student(SID, Name, Phone)
(101, 'Alice', 9876543210)
(101, 'Alice', 9123456780)

Normalization to 1NF: To bring a relation into 1NF, multivalued attributes are typically removed and placed into a separate relation along with the primary key of the original relation.

Second Normal Form (2NF): Eliminating Partial Dependencies

A relation is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on the entire primary key. This means that no nonprime attribute can be functionally dependent on only a part of a composite primary key.

2NF targets partial dependencies. This is a major source of data redundancy and update anomalies. If the primary key contains a single attribute, the 2NF test does not apply.

Example (Violates 2NF):
- EMP_PROJ(Ssn, Pnumber, Hours, Ename, Pname, Plocation)
- Primary Key: {Ssn, Pnumber}
- Partial Dependencies:
  - Ssn → Ename (Ename depends only on Ssn, not the full key).
  - Pnumber → {Pname, Plocation} (Project details depend only on Pnumber).
Solution (Decomposition):
1. EMPLOYEE(Ssn, Ename)
2. PROJECT(Pnumber, Pname, Plocation)
3. WORKS_ON(Ssn, Pnumber, Hours)
Normalization to 2NF: If a relation is not in 2NF, it is decomposed into new relations. Each new relation will contain a partial key from the original primary key and the nonprime attributes that are fully functionally dependent on that partial key. A relation with the original primary key and attributes fully dependent on it is are retained.

Third Normal Form (3NF): Eliminating Transitive Dependencies

A relation schema R is in 3NF if it satisfies 2NF and no nonprime attribute of R is transitively dependent on the primary key.

A transitive dependency occurs when a non-prime attribute indirectly depends on the primary key through another non-prime attribute, rather than directly on the primary key. Which causes further redundancy and update anomalies.

A simple way to think of it is: A → B and B → C, where A is the key and B is not.

Example (Violates 3NF):
- EMP_DEPT(Ssn, Ename, Dnumber, Dname, Dmgr_ssn)
- Primary Key: Ssn
- Transitive Dependency: Ssn → Dnumber → {Dname, Dmgr_ssn}. The department name and manager depend on the department number, which is not a key.
Solution (Decomposition):
1. EMPLOYEE(Ssn, Ename, Dnumber) Dnumber is now foreign key
2. DEPARTMENT(Dnumber, Dname, Dmgr_ssn)

Normalization to 3NF: If a relation is not in 3NF, it is decomposed. The attributes involved in the transitive dependency are moved to a new relation, with the intermediate attribute (B) acting as the primary key of the new relation and a foreign key in the original relation.

Normal Forms Based on Primary Keys

Normal Form	Condition	Remedy (Normalization)
1NF	No multivalued or nested attributes. All attributes are atomic.	Split multivalued attributes into separate rows or tables
2NF	No partial dependencies (non-key attribute depending on part of composite key)	Decompose relation to separate partial dependencies
3NF	No transitive dependencies (non-key attribute depending on another non-key)	Decompose relation to eliminate transitive dependencies

A Note on Denormalization

Sometimes, for performance reasons, designers may choose to denormalize a database. This is a deliberate process of taking a highly normalized schema and combining tables (violating some normal forms) to reduce the number of joins required for a query. This is a trade-off: it increases redundancy to improve query speed.

Boyce-Codd Normal Form (BCNF)

Boyce-Codd Normal Form (BCNF) is a database normal form that is considered a stricter version of Third Normal Form (3NF). While every table in BCNF is also in 3NF, a table in 3NF is not necessarily in BCNF. BCNF was developed to handle certain types of anomalies that 3NF does not address.

The Core Rule of BCNF

A relation schema is in BCNF if for every one of its non-trivial functional dependencies, X → Y, the determinant X is a superkey.

In simpler terms, for any rule in the table where a set of columns X determines another set of columns Y, X must be capable of uniquely identifying every row in the table (i.e., it must be a superkey).

BCNF addresses a specific and subtle type of data redundancy that can still exist in a 3NF table. This occurs when:

A table has multiple, overlapping candidate keys.
A non-key attribute (or a part of a candidate key) determines another part of a candidate key.

Example:

Consider a table TEACH that tracks student course enrollments taught by a single instructor per course.

Relation: ENROLLMENT(StudentID, Course, Instructor)
Functional Dependencies:
- {StudentID, Course} → Instructor (A student in a course has one instructor).
- Instructor → Course (Each instructor teaches only one course).

The candidate key for TEACH is {Student, Course}. TEACH is in 3NF but not in BCNF because Instructor → Course violates BCNF (Instructor is not a superkey)

From these dependencies, we can identify two candidate keys: {StudentID, Course} and {StudentID, Instructor}.

This table is in 3NF, but it still has redundancy. The fact that 'Professor Turing' teaches 'CS101' will be repeated for every single student enrolled in that course.

The problem lies with the dependency Instructor → Course. This violates BCNF because the determinant, Instructor, is not a superkey on its own.

Achieving BCNF: The Decomposition Algorithm

If a table is not in BCNF, it must be decomposed. The process is straightforward:

Find a Violation: Identify a functional dependency X → Y that violates BCNF (where X is not a superkey).
Decompose the Table: Split the original table into two new tables:
- Table 1: Contains the attributes from the violating dependency (X and Y).
- Table 2: Contains X and all the other attributes from the original table that were not in Y.
Repeat: Continue this process until all resulting tables are in BCNF.

Applying this to our ENROLLMENT example:

Violation: The dependency Instructor → Course violates BCNF. Here, X is Instructor and Y is Course.
Decomposition:
- Table 1: (Instructor, Course) - This table stores which instructor teaches which course.
- Table 2: (StudentID, Instructor) - This table tracks which student is taught by which instructor.

The new tables, (Instructor, Course) and (StudentID, Instructor), are now both in BCNF and the data redundancy is eliminated.

Applying the BCNF decomposition algorithm:

Choose the violating FD: Instructor → Course (X = {Instructor}, Y = {Course}).
Decompose TEACH into:
- R1 = (X ∪ Y) = (Instructor, Course)
- R2 = (TEACH − (Y − X)) = (Student, Course, Instructor) − (Course − Instructor) = (Student, Instructor) (since Course is not in Instructor, Y-X is Course).
The resulting schemas are TEACH1 (Instructor, Course) and TEACH2 (Instructor, Student). Both of these relations are in BCNF. This decomposition is lossless but not dependency-preserving for the original FD1.

Important Trade-Offs of BCNF Decomposition

Decomposing a schema to achieve BCNF has important properties:

Lossless Join Property (Guaranteed): The BCNF decomposition algorithm guarantees that you can join the new tables back together to get the original data without creating any spurious (false) rows. This is a critical requirement.
Dependency Preservation (Not Guaranteed): BCNF decomposition does not always preserve all the original functional dependencies. In our example, the original dependency {StudentID, Course} → Instructor can no longer be checked in a single table; it can only be verified by joining the two new tables. This is a known trade-off that designers must sometimes accept to achieve the higher normal form.

Numbers

Searching & Matrix

Sorting

Normalization: The Step-by-Step Refinement Process ​

First Normal Form (1NF) ​

Second Normal Form (2NF): Eliminating Partial Dependencies ​

Third Normal Form (3NF): Eliminating Transitive Dependencies ​

Normal Forms Based on Primary Keys ​

A Note on Denormalization ​

Boyce-Codd Normal Form (BCNF) ​

The Core Rule of BCNF ​

Achieving BCNF: The Decomposition Algorithm ​

Important Trade-Offs of BCNF Decomposition ​