Saturday, October 29, 2011

Database Normalization: What to test for First Normal Form?

Designing a relational database to minimize data redundancy (and therefore maximize data integrity) is called normalization. View my video on Normalization and First Normal Form or read on...

The concept of data normalization was introduced by Edgar Codd, right in the years after he invented the concept of the relational model of storing data. There are various degrees of normalization 1NF (First Normal Form), 2NF, 3NF and so on. Each degree of normalization is stricter than the previous one e.g. if a table is in 3NF then it is automatically in 1NF and 2NF. In this article, I will explain the First Normal Form and what to test for it. Articles on testing the other normal forms will follow.

Years ago, when I used to work as a database developer, I remembered 1NF as each table having only relevant columns, a primary key and no repeating groups. It is still useful to remember 1NF like this because most definitions of 1NF agree to this. Let us now see examples of few tables that are not in 1NF.

a. Having columns of other entities
Customer (CustomerID, FirstName, LastName, Address, SupportContactName)
Here the Support Contact does not belong to the Customer entity. Such a design gives rise to data modification and querying problems. For example, if a Customer row is deleted, then the Support Contact data is also deleted. Also, it is not possible to store available Support Contacts that are not yet assigned to any customer.

b. Having no primary key
Employee (FirstName, LastName, JobTitle)
Here it is possible for two or more employees having the same column values implying duplicate rows. There are update/ delete problems with such a table design. Which row to update if only one of the employees is promoted? How to delete just one row if one of the employee leaves the company? Also, each query written on this table needs to have the logic that includes duplicate rows.

c. Having a repeating group
Customer (CustomerID, FirstName, LastName, Address, PhoneNumber1, PhoneNumber2, PhoneNumber3)
With such a table, there are many problems. For example, many customers may have only one phone number so the space for the other two phone numbers is wasted. Each query on this table fetching phone number values needs to include three columns, impacting performance. This table cannot accommodate a customer who has four phone numbers.

Another example of a repeating group is the table
Customer (CustomerID, FirstName, LastName, Address, PhoneNumbers)
Here it is possible to store any number of phone numbers within the limit of the column width. If there is more than one phone number, this list of phone numbers is comma separated. The problem is that space for a potentially large list of phone numbers is reserved for every row and wasted. Each query on this table fetching phone number(s) has to implement the logic to understand one phone number or a list of phone numbers. Frankly, the PhoneNumbers column is not a single column but really a group of columns for the Customer table. Therefore, the table is not even relational.

In order to convert the Customer table to 1NF, it is necessary to design two tables:
Customer (CustomerID, FirstName, LastName, Address)
CustomerPhoneNumber (CustomerID, PhoneNumber)
Note that now a customer can have any number of phone numbers (zero, one or more than one). Also, the Customer data is stored in the Customer table and PhoneNumber data in the CustomerPhoneNumber table. Customer table has CustomerID as the primary key. CustomerPhoneNumber has CustomerID and PhoneNumber as the primary key. No space is wasted. There are no data modification problems e.g. deleting a phone number does not require deleting a customer. No query needs to include multiple columns for the phone number values, speeding performance. No query needs to implement the logic of understanding a single phone number as well as extract individual phone numbers from a list.

With the above understanding in mind, here the tests that should be applied to check 1NF on every table in the database:
1. Is each column an attribute of the table's entity?
2. Is there a primary key?
3. Are there no duplicate rows?
4. Are there no repeating columns?
5. Are there no multiple values in any column of any row?

Want to learn more?
See more explanation with multiple examples in my video on Normalization and First Normal Form