A Junk Blog: Database Normalization

Database Design Process

There are some process that must be done to design a database :

· Gathering user needs / business

· Developing the ER model based on user needs / business

· Converting ER Model to set the relation (table)

· Normalization of relations, for the anomaly

· Implementation goes to database by make table for each relationship already most normalization

Database Normalization

Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Usually only up to the level of 3NF or BCNF because already sufficient to generate the tables of good quality.

First Normal Form

First Normal Form (1NF) sets the very basic rules for an organized database:

Eliminate duplicative columns from the same table.
Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

Second Normal Form

Second normal form (2NF) further addresses the concept of removing duplicative data:

Meet all the requirements of the first normal form.
Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
Create relationships between these new tables and their predecessors through the use of foreign keys.

Third normal form

Third normal form (3NF) goes one large step further:

Meet all the requirements of the second normal form.
Remove columns that are not dependent upon the primary key.

Fourth Normal Form

Finally, fourth normal form (4NF) has one additional requirement:

Meet all the requirements of the third normal form.
A relation is in 4NF if it has no multi-valued dependencies.

Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.

Normalization must done beacuse:

· Need to optimizing the structures of table

· It can increasing speed

· The income data is the same

· It is more efficient in the use of storage media

· It is able to reduce redundancy

· Need to Avoid anomalies (insertion anomalies, deletion anomalies, update anomalies)

One table is said well (efficient) or normal if accomplish 3 criterions as follows:

· If there is decomposition (parsing) table, therefore the decomposition shall be secured safe (Lossless Join Decomposition). It’s mean, after that table is untied / at decomposition becomes new tables, that new tables can result original table equally exactly

· Its preserve dependency functional at the moment data change (Dependency Preservation).

· Don't breach Boyce-Code Normal Form (BCNF)

If the third criterion (BCNF) can't be accomplished, therefore at least that table not breach the third Normal Form (3rd Normal Form/ 3NF).

Normalization is required because of the redundancy relations, not relations of "good". Why?

The main reason is the possibility of "update anomalies" (when the insert, delete, update) because can impact on the data inconsistencies
The reason is other waste storage space (hard disk)

An update anomaly. Employee 519 is shown as having different addresses on different records.

An insertion anomaly. Until the new faculty member, Dr. Newsome, is assigned to teach at least one course, his details cannot be recorded.

A deletion anomaly. All information about Dr. Giddens is lost when he temporarily ceases to be assigned to any courses.

How to Handle anomaly?

1. Anomaly in the handle by programming language used to create the application database. Designer should note this anomaly and to tell a programmer.

2. Anomaly does not handle the system, but submitted to the operator to be careful in making modifications, inserts and deletes. It is human error risk.

3. Anomalies be avoided, with ways to secure the normalization.

Denormalization

Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for Online Analytical Processing (OLAP). OLTP Applications are characterized by a high volume of small transactions such as updating a sales record at a super market checkout counter. The expectation is that each transaction will leave the database in a consistent state. By contrast, databases intended for OLAP operations are primarily "read mostly" databases. OLAP applications tend to extract historical data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate business intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a consistent state. The normalized alternative to the star schema is the snowflake schema. It has never been proven that this denormalization itself provides any increase in performance, or if the concurrent removal of data constraints is what increases the performance. In many cases, the need for denormalization has waned as computers and RDBMS software have become more powerful, but since data volumes have generally increased along with hardware and software performance, OLAP databases often still use denormalized schemas.

Denormalization is also used to improve performance on smaller computers as in computerized cash-registers and mobile devices, since these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform (such as Palm), or no changes are to be made to the data and a swift response is crucial.

Breaking Rules : When to Denormalize

Sometimes it's necessary to break the rules of normalization and create a database that is deliberately less normal than it otherwise could be. You'll usually do this for performance reasons or because the users of the database demand it. While this won't get you any points with database design purists, ultimately you have to deliver a solution that satisfies your users. If you do break the rules, however, and decide to denormalize you database, it's important that you follow these guidelines:

· Break the rules deliberately; have a good reason for denormalizing.

· Be fully aware of the tradeoffs this decision entails.

· Thoroughly document this decision.

· Create the necessary application adjustments to avoid anomalies.

This last point is worth elaborating on. In most cases, when you denormalize, you will be required to create additional application code to avoid insertion, update, and deletion anomalies that a more normalized design would avoid. For example, if you decide to store a calculation in a table, you'll need to create extra event procedure code and attach it to the appropriate event properties of forms that are used to update the data on which the calculation is based.

If you're considering denormalizing for performance reasons, don't always assume that the denormalized approach is the best. Instead, I suggest you first fully normalize the database (to Third Normal Form or higher) and then denormalize only if it becomes necessary for reasons of performance.

If you're considering denormalizing because your users think they need it, investigate why. Often they will be concerned about simplifying data entry, which you can usually accomplish by basing forms on queries while keeping your base tables fully normalized.

Here are several scenarios where you might choose to break the rules of normalization:

· You decide to store an indexed computed column, Soundex, in tblCustomer to improve query performance, in violation of 3NF (because Soundex is dependent on LastName). The Soundex column contains the sound-alike code for the LastName column. It's an indexed column (with duplicates allowed) and is calculated using a user-defined function. If you wish to perform searches on the Soundex column with any but the smallest tables, you'll find a significant performance advantage to storing the Soundex column in the table and indexing this computed column. You'd likely use an event procedure attached to a form to perform the Soundex calculation and store the result in the Soundex column. To avoid update anomalies, you'll want to ensure that this column cannot be updated by the user and that it is updated every time LastName changes.

· In order to improve report performance, you decide to create a column named TotalOrderCost that contains a sum of the cost of each order item in tblOrder. This violates 2NF because TotalOrderCost is dependent on the primary key of tblOrderDetail, not on tblOrder's primary key. TotalOrderCost is calculated on a form by summing the column TotalCost for each item. Since you often create reports that need to include the total order cost, but not the cost of individual items, you've broken 2NF to avoid having to join these two tables every time this report needs to be generated. As in the last example, you have to be careful to avoid update anomalies. Whenever a record in tblOrderDetail is inserted, updated, or deleted, you will need to update tblOrder, or the information stored there will be erroneous.

· You decide to include a column, SalesPerson, in the tblInvoice table, even though SalesId is also included in tblInvoice. This violates 3NF because the two non-key columns are mutually dependent, but it significantly improves the performance of certain commonly run reports. Once again, this is done to avoid a join to the tblEmployee table, but introduces redundancies and adds the risk of update anomalies.

Functional dependency

Attribute B has a functional dependency on attribute A (i.e., A → B) if, for each value of attribute A, there is exactly one value of attribute B. If value of A is repeating in tuples then value of B will also repeat. In our example, Employee Address has a functional dependency on Employee ID, because a particular Employee ID value corresponds to one and only one Employee Address value. (Note that the reverse need not be true: several employees could live at the same address and therefore one Employee Address value could correspond to more than one Employee ID. Employee ID is therefore not functionally dependent on Employee Address.) An attribute may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to determine the extent to which a design is normalized without understanding what functional dependencies apply to the attributes within its tables; understanding this, in turn, requires knowledge of the problem domain. For example, an Employer may require certain employees to split their time between two locations, such as New York City and London, and therefore want to allow Employees to have more than one Employee Address. In this case, Employee Address would no longer be functionally dependent on Employee ID.

Another way to look at the above is by reviewing basic mathematical functions:

Let F(x) be a mathematical function of one independent variable. The independent variable is analogous to the attribute A. The dependent variable (or the dependent attribute using the lingo above), and hence the term functional dependency, is the value of F(A); A is an independent attribute. As we know, mathematical functions can have only one output. Notationally speaking, it is common to express this relationship in mathematics as F(A) = B; or, B → F(A).

There are also functions of more than one independent variable—commonly, this is referred to as multivariable functions. This idea represents an attribute being functionally dependent on a combination of attributes. Hence, F(x,y,z) contains three independent variables, or independent attributes, and one dependent attribute, namely, F(x,y,z). In multivariable functions, there can only be one output, or one dependent variable, or attribute.

Trivial functional dependency

A trivial functional dependency is a functional dependency of an attribute on a superset of itself. {Employee ID, Employee Address} → {Employee Address} is trivial, as is {Employee Address} → {Employee Address}.

Full functional dependency

An attribute is fully functionally dependent on a set of attributes X if it is

· functionally dependent on X, and

· not functionally dependent on any proper subset of X. {Employee Address} has a functional dependency on {Employee ID, Skill}, but not a full functional dependency, because it is also dependent on {Employee ID}.

Transitive dependency

A transitive dependency is an indirect functional dependency, one in which X→Z only by virtue of X→Y and Y→Z.

Multivalued dependency

A multivalued dependency is a constraint according to which the presence of certain rows in a table implies the presence of certain other rows.

Join dependency

A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T.

Superkey

A superkey is an attribute or set of attributes that uniquely identifies rows within a table; in other words, two distinct rows are always guaranteed to have distinct superkeys. {Employee ID, Employee Address, Skill} would be a superkey for the "Employees' Skills" table; {Employee ID, Skill} would also be a superkey.

Candidate key

A candidate key is a minimal superkey, that is, a superkey for which we can say that no proper subset of it is also a superkey. {Employee Id, Skill} would be a candidate key for the "Employees' Skills" table.

Non-prime attribute

A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a non-prime attribute in the "Employees' Skills" table.

Primary key

Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible unique keys. A primary key is a key which the database designer has designated for this purpose.

First Normal Form (1NF)

Eliminate repeating groups from the same table
Aggregate similiar data in separate tables and identify each row with an unique identifier

In simple language if we were to say each attribute of the relation would be atomic in nature for 1NF. Look at the example below to understand better.

Second Normal Form (2NF)

Moving forward lets take a look at the rules that goven 2NF. We get a step even more closer to remove duplicate records.

Remove data that apply to multiple rows and place them in a separate table
Relate the above table with foreign keys

Consider the below example to understand the same.

Third Normal Form (1NF)

This is the most preferred normalization technique followed for most of the database.

Eliminate all fields that donot depend on the Primary key

Values in a record that are not part of that record's key do not belong in the table. In general, any time the contents of a group of fields may apply to more than a single record in the table, consider placing those fields in a separate table.

Note: All these normalization are cummulative in nature. I re-iterate this point.

There are 4NF otherwise called as Boyce-Codd normal form (BCNF). I wouldnot deal much into this form as it becomes far beyond practicle limits to have such a requirement. The rule is, we are in BCNF if and only if every determinant is a candidate key.

Reffrences :

1. ER Ngurah Agus Sanjaya. Slide Part 6 - NORMALISASI.

2. Fundamentals of Relational Database Design_By Paul Litwin (http://www.deeptraining.com/litwin/dbdesign/FundamentalsOfRelationalDatabaseDesign.aspx)

3. Normalizig Techniques_extremeexperts (http://www.extremeexperts.com/sql/articles/Normalizing.aspx)

4. Database Normalization Basics_By Mike Chapple (http://databases.about.com/od/specificproducts/a/normalization.htm)

5. Answers.com (http://www.answers.com/topic/database-normalization)