If you've been working with databases for a while, chances are you've
heard the term normalization. Perhaps someone's asked you "Is that
database normalized?" or "Is that in BCNF?" All too often, the reply is
"Uh, yeah." Normalization is often brushed aside as a luxury that only
academics have time for. However, knowing the principles of
normalization and applying them to your daily database design tasks
really isn't all that complicated and it could drastically improve the
performance of your DBMS.
In this article, we'll introduce the concept of normalization and take a brief look at the most common normal forms. Future articles will provide in-depth explorations of the normalization process.
Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical business requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.
The first rule dictates that we must not duplicate data within the same row of a table. Within the database community, this concept is referred to as the atomicity of a table. Tables that comply with this rule are said to be atomic. Let’s explore this principle with a classic example – a table within a human resources database that stores the manager-subordinate relationship. For the purposes of our example, we’ll impose the business rule that each manager may have one or more subordinates while each subordinate may have only one manager.
Intuitively, when creating a list or spreadsheet to track this information, we might create a table with the following fields:
At this point, a second bright idea usually occurs to database novices: We don’t want to have more than one column and we want to allow for a flexible amount of data storage. Let’s try something like this:
This solution is closer, but it also falls short of the mark. The subordinates column is still duplicative and non-atomic. What happens when we need to add or remove a subordinate? We need to read and write the entire contents of the table. That’s not a big deal in this situation, but what if one manager had one hundred employees? Also, it complicates the process of selecting data from the database in future queries.
Here’s a table that satisfies the first rule of 1NF:
Now, what about the second rule: identify each row with a unique column or set of columns (the primary key)? You might take a look at the table above and suggest the use of the subordinate column as a primary key. In fact, the subordinate column is a good candidate for a primary key due to the fact that our business rules specified that each subordinate may have only one manager. However, the data that we’ve chosen to store in our table makes this a less than ideal solution. What happens if we hire another employee named Jim? How do we store his manager-subordinate relationship in the database?
It’s best to use a truly unique identifier (such as an employee ID) as a primary key. Our final table would look like this:
Over the past month, we've looked at several aspects of normalizing a database table. First, we discussed the basic principles of database normalization. Last time, we explored the basic requirements laid down by the first normal form (1NF). Now, let's continue our journey and cover the principles of second normal form (2NF).
Recall the general requirements of 2NF:
Let's look at an example. Imagine an online store that maintains customer information in a database. They might have a single table called Customers with the following elements:
In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let's call it ZIPs) might have the following fields:
Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table:
Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there.
What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes:
In this article, we'll introduce the concept of normalization and take a brief look at the most common normal forms. Future articles will provide in-depth explorations of the normalization process.
The Normal Forms
The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen and won't be discussed in this article.Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical business requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.
First Normal Form (1NF)
First Normal Form (1NF) sets the very basic rules for an organized database:- Eliminate duplicative columns from the same table.
- Create separate tables for each group of related data and identify each row with a unique column (the primary key).
The first rule dictates that we must not duplicate data within the same row of a table. Within the database community, this concept is referred to as the atomicity of a table. Tables that comply with this rule are said to be atomic. Let’s explore this principle with a classic example – a table within a human resources database that stores the manager-subordinate relationship. For the purposes of our example, we’ll impose the business rule that each manager may have one or more subordinates while each subordinate may have only one manager.
Intuitively, when creating a list or spreadsheet to track this information, we might create a table with the following fields:
- Manager
- Subordinate1
- Subordinate2
- Subordinate3
- Subordinate4
At this point, a second bright idea usually occurs to database novices: We don’t want to have more than one column and we want to allow for a flexible amount of data storage. Let’s try something like this:
- Manager
- Subordinates
This solution is closer, but it also falls short of the mark. The subordinates column is still duplicative and non-atomic. What happens when we need to add or remove a subordinate? We need to read and write the entire contents of the table. That’s not a big deal in this situation, but what if one manager had one hundred employees? Also, it complicates the process of selecting data from the database in future queries.
Here’s a table that satisfies the first rule of 1NF:
- Manager
- Subordinate
Now, what about the second rule: identify each row with a unique column or set of columns (the primary key)? You might take a look at the table above and suggest the use of the subordinate column as a primary key. In fact, the subordinate column is a good candidate for a primary key due to the fact that our business rules specified that each subordinate may have only one manager. However, the data that we’ve chosen to store in our table makes this a less than ideal solution. What happens if we hire another employee named Jim? How do we store his manager-subordinate relationship in the database?
It’s best to use a truly unique identifier (such as an employee ID) as a primary key. Our final table would look like this:
- Manager ID
- Subordinate ID
Second Normal Form (2NF)
Over the past month, we've looked at several aspects of normalizing a database table. First, we discussed the basic principles of database normalization. Last time, we explored the basic requirements laid down by the first normal form (1NF). Now, let's continue our journey and cover the principles of second normal form (2NF).
Recall the general requirements of 2NF:
- Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
- Create relationships between these new tables and their predecessors through the use of foreign keys.
Let's look at an example. Imagine an online store that maintains customer information in a database. They might have a single table called Customers with the following elements:
- CustNum
- FirstName
- LastName
- Address
- City
- State
- ZIP
In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let's call it ZIPs) might have the following fields:
- ZIP
- City
- State
Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table:
- CustNum
- FirstName
- LastName
- Address
- ZIP
Third Normal Form (3NF)
There are two basic requirements for a database to be in third normal form:- Already meet the requirements of both 1NF and 2NF
- Remove columns that are not fully dependent upon the primary key.
- Order Number
- Customer Number
- Unit Price
- Quantity
- Total
Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there.
What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes:
- Order Number
- Customer Number
- Unit Price
- Quantity
SELECT OrderNumber, Total
FROM WidgetOrders
We can now use the following query: SELECT OrderNumber, UnitPrice * Quantity AS Total
FROM WidgetOrders
to achieve the same results without violating normalization rules.Boyce-Codd Normal Form (BCNF)
Definition: A
relation is in Boyce-Codd Normal Form (BCNF) if every determinant is a
candidate key. (See the links in the box at right for definitions of
determinant and candidate key.)
Also Known As: BCNF
Examples:
Conisder
a database table that stores employee information and has the
attributes employee_id, first_name, last_name, title. In this table,
the field employee_id determines first_name and last_name. Similarly,
the tuple (first_name, last_name) determines employee_id.Fourth Normal Form (4NF)
Finally, fourth normal form (4NF) has one additional requirement:- Meet all the requirements of the third normal form.
- A relation is in 4NF if it has no multi-valued dependencies.
Multivalued Dependency
Definition: Multivalued dependencies occur when the presence of one or more rows in a table implies the presence of one or more other rows in that same table.
For more detail, read What is a Database Dependency?
For more detail, read What is a Database Dependency?
Examples:
For
example, imagine a car company that manufactures many models of car,
but always makes both red and blue colors of each model. If you have a
table that contains the model name, color and year of each car the
company manufactures, there is a multivalued dependency in that table.
If there is a row for a certain model name and year in blue, there must
also be a similar row corresponding to the red version of that same
car.
No comments:
Post a Comment