Class Models are based on the Relational Model portion of Set Theory. However, this is not the same as the Relational Data Model (RDM) that one sees applied in Data Models for relational databases. Unfortunately the syntax for many elements employed in a UML Class Model and those employed in the Entity Relationship Diagrams used for Data Modeling are virtually identical, which tends to cause a lot of confusion. A Class Model is not a Data Model! An OO Class Model represents a far more abstract view of the Relational Model than a Data Model. (I discuss a number of the relevant differences in the category Persistence in OO Applications.)
However, because data and referential integrity are just as important to a running OO application as they are to a relational database, we need to address those issues in a Class Model. The primary tool for doing that is normalization. However, we have to apply it somewhat differently in a Class Model because we do not usually define object identity explicitly as a knowledge responsibility. (We usually handle it implicitly through the way relationships are instantiated; more on that in the category on Relationships.)
Normalization involves the concept of Normal Form in the Relational Model. In the RDM, Normal Form is basically a fancy way of saying that the model is constructed to ensure that data and referential integrity are preserved. That is, data will always be synchronized and navigating relationships between tables will always be valid. That's because the database engine is implemented to ensure that consistency once the data itself is organized in schemas so that it is compliant with Normal Form.
Things are a bit more complicated for a running OO application because properties are not simply data. Object knowledge responsibilities are abstract data types that do not require a physical data store. In addition, behavior responsibilities are not data at all. Finally, running applications may have to deal with issues like concurrency that are determined uniquely by the problem context rather than being a single generic paradigm like two-phase commit in a database. As a result, ensuring data and referential integrity in a running OO application is managed in stages by the developer. The application of normalization to the Class Model is just one stage of that overall management. We use normalization of the Class Model during OOA to create a static structure that is internally consistent in much the same manner as data in a relational database. But we may still have to tinker with the dynamics to ensure data and referential integrity during OOD and OOP.
To understand how Normal Form plays in an OO context, let's look at what it means. There are actually several different levels of Normal Form in the Relational Model but in a Class Model we are usually only interested in the first three levels...
1NF: A relation is in 1NF if every attribute is a simple domain.
Basically this means that each property is logically indivisible at a given level of abstraction. In an OO context we have a very flexible view of logical indivisibility that is tied to subsystems (i.e., every class in a subsystem is at the same level of OOA/D abstraction). That allows an ADT attribute like Complex Number to be treated as a scalar value in a class in one <high-level> subsystem while it becomes a full class with Real and Imaginary attributes in another <low-level> subsystem.
The example I used of a telephone number is a case where the view of 1NF is flexible, depending on the level of abstraction. If one is interested in, say, the area code then the whole number cannot be a simple domain. In an RDB one would have to break out the number into a separate table with separate attributes for {country code, area code, exchange, number}. The key would be compound, containing all of those elements, and there would be no non-key attributes. In Data Modeling we would have:
[Customer]                             [TelephoneNumber]
* customerName          +------------- * countryCode
+ address               | +----------- * areaCode
...                     | | +--------- * exchange
+ country --------------+ | | +------- * number
+ area -------------------+ | |
+ exchange -----------------+ |
+ number ---------------------+
...
Of course this is messy on the Customer side, so the DBA would probably make an artificial key identifier for the [TelephoneNumber], say TNId:
[Customer]                      [TelephoneNumber]
* customerName      +---------- * TNId
+ address           |           + countryCode
...                 |           + areaCode
+ contactPhone -----+           + exchange
...                             + number
That is likely to be some sort of autonumber identifier but it could simply be a concatenation of the non-key attributes. The second case is legal because referential attributes are logically different than the data being stored; they are an artifact of the RDB implementation. However, the second case leads to a certain degree of silliness because the Customer then already has the entire phone number as a foreign key.
That's the reason that DBAs generally use a higher level of abstraction -- any referential identifier (foreign key) to the telephone number will be the telephone number so one is wasting a lot of infrastructure and access processing to make tables whenever they do not have non-key attributes. In addition, the RDB interface (e.g., SQL) can provide mechanisms for extracting the individual elements. So the DBA deliberately chooses a higher level of abstraction where the individual elements are not identified. In fact, RAD IDEs generally provide things like patterns for phone numbers that provide input verification and a mechanism for "parsing" simple domains.
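To make the extra infrastructure concrete, here is a minimal sketch of the TNId variant of the schema, using Python's built-in sqlite3 module with an in-memory database. The table and column names follow the diagram above; the column types, the UNIQUE constraint, and the sample values are my assumptions for illustration only.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

conn.executescript("""
    CREATE TABLE TelephoneNumber (
        TNId        INTEGER PRIMARY KEY,   -- artificial (autonumber) identifier
        countryCode TEXT NOT NULL,
        areaCode    TEXT NOT NULL,
        exchange    TEXT NOT NULL,
        number      TEXT NOT NULL,
        -- assumption: the four elements together also identify the row
        UNIQUE (countryCode, areaCode, exchange, number)
    );
    CREATE TABLE Customer (
        customerName TEXT PRIMARY KEY,
        address      TEXT,
        contactPhone INTEGER REFERENCES TelephoneNumber (TNId)
    );
""")

# Hypothetical sample data.
cur = conn.execute(
    "INSERT INTO TelephoneNumber (countryCode, areaCode, exchange, number) "
    "VALUES ('1', '555', '010', '0199')"
)
tn_id = cur.lastrowid
conn.execute("INSERT INTO Customer VALUES ('Acme', '1 Main St', ?)", (tn_id,))

# Getting from a Customer to an area code requires a join on the
# referential attribute.
row = conn.execute("""
    SELECT c.customerName, t.areaCode
    FROM Customer AS c
    JOIN TelephoneNumber AS t ON c.contactPhone = t.TNId
""").fetchone()
print(row)
```

Every access to a single element of the number now goes through a second table and a join, which is the overhead the DBA avoids by storing the whole number as one value.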
In OOA/D development a similar situation arises. If we need the area code, then we can't describe a telephone number as a simple knowledge ADT that is treated as a scalar value. One should break it out into a separate class with explicit attributes for the elements. As it happens, the DBA problem with redundancy in referential attributes does not exist in OOA/D because we do not need embedded identity attributes and we treat relationships at a higher level of abstraction. So we have:
[Customer] ------------------ [TelephoneNumber]
* customerName                + countryCode
+ address                     + areaCode
...                           + exchange
                              + number
Note that [Customer] has an explicit knowledge attribute, customerName, that uniquely identifies a Customer while [TelephoneNumber] does not have an explicit identifier attribute. In addition, there is no referential attribute defined in [Customer]; the relationship notation abstracts the mechanisms we use for referential integrity. That makes the OOA/D model independent of specific implementation optimizations.
[At OOP time identity will almost always be implicitly implemented through pointers and collection classes, usually when the objects are instantiated. Implicit identity is effectively managed through instantiation of relationships so one always navigates to the "right" instance. (See the category on Relationship for more discussion.) This use of relationships is a very fundamental difference between the OO view and the index-and-search RDB paradigm.]
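As a rough illustration of how the [Customer] and [TelephoneNumber] model might look at OOP time, here is a minimal Python sketch. The snake_case attribute names, the contact_phone field, and the sample values are my assumptions; the model itself says nothing about how the relationship is implemented.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: equality is object identity, not attribute values
class TelephoneNumber:
    # No identifier attribute; identity is implicit in the object itself.
    country_code: str
    area_code: str
    exchange: str
    number: str


@dataclass(eq=False)
class Customer:
    customer_name: str               # explicit knowledge attribute identifying the Customer
    address: str
    contact_phone: TelephoneNumber   # relationship held as an object reference, not a key value


phone = TelephoneNumber("1", "555", "010", "0199")   # hypothetical values
customer = Customer("Acme", "1 Main St", phone)

# Navigation follows the reference; there is no key lookup or search.
print(customer.contact_phone.area_code)   # prints 555
```

Using eq=False is a choice made for this sketch: it keeps two objects with the same digits distinct, which matches the idea that identity is carried by the instance and its relationships rather than by stored values.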
So there is no reason for the OOA/D modeler to not create the [TelephoneNumber] class. Note that putting countryCode, areaCode, etc. in Customer directly without a [TelephoneNumber] class would be an error, but that is because of...
2NF: A relation R is in 2NF if R is in 1NF and each non-key attribute in R is fully dependent upon every key.
What this means is that any non-key attributes in each table row (tuple) must be uniquely dependent on the key for the row. If the key is compound they must be dependent on every element of the tuple key. So in the RDB example above, we had two choices: either make the elements part of a compound key, which is as dependent as one can get, or have a key where each attribute value is determined as soon as one defines the key value. [Technically in this case the attributes form a secondary key, but the normal forms beyond 3NF deal with that.]
Now in the OO context we need to interpret this somewhat differently because we usually do not have explicit identity attributes. Nonetheless, it is fundamental to OOA/D that every object be abstracted from a uniquely identifiable problem space entity. So we do have unique identity and the properties of an object must be dependent directly on that identity. In other words, relationships raise the level of abstraction so that we don't care how identity is implemented, but we do still care about identity. (We just move those cares to the rules and policies of instantiating relationships.)
In the Class Model above the notion of telephone number is a unique problem space concept and there are many possible values of telephone number. However, the value of, say, areaCode will be fixed for any given telephone number. In other words, it is fundamentally dependent on the identity of a particular telephone number, regardless of how we implement identity. By contrast, consider:
[Customer]
* customerName
+ address
...
+ countryCode
+ areaCode
+ exchange
+ number
...
To be attributes of [Customer] these elements must be uniquely dependent on customerName. Superficially they seem to be. But what happens when the Customer gets a new areaCode because the utility decided to reorganize its service area? Does that change have anything to do with who the Customer is? No; at best it is about where the Customer lives relative to the way the utility defines telephone numbers.
[Note that the same arguments apply above and below even if Customer does not have an explicit identity attribute. We still have a notion of a Customer entity, that Customer has identity, and the individual attributes are not defined directly based on it.]
An even more interesting question is: what if the Customer has multiple phone numbers? How is the number element fully dependent on customerName if there can be different values? [We could get around that with multiple attributes like line1Number, line2Number, ... but that also starts to be pretty silly.]
The last question to ask is: what happens when the phone company re-allocates the Customer's old number to someone else when the areaCode was changed? Now those same values are dependent on a different customerName. That violates the notion of fully dependent.
The bottom line of the example is that if we care about the individual elements, we need to have a notion of a Telephone Number entity. The values of the individual elements are then clearly dependent on the identity of that entity.
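The three questions above can be played out in a small Python sketch. It assumes the relationship is held as a list of references (the phones attribute is my invention), and all names and values are hypothetical. A changed area code is treated as a different telephone number, since the element values are fixed for any given number.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so list.remove() matches the exact object
class TelephoneNumber:
    country_code: str
    area_code: str
    exchange: str
    number: str


@dataclass(eq=False)
class Customer:
    customer_name: str
    # A Customer may be related to any number of TelephoneNumber objects.
    phones: list = field(default_factory=list)


old_line = TelephoneNumber("1", "555", "010", "0199")
second_line = TelephoneNumber("1", "555", "010", "0142")
alice = Customer("Alice", [old_line, second_line])
bob = Customer("Bob")

# The utility reorganizes its service area: Alice is related to a new number.
new_line = TelephoneNumber("1", "556", "010", "0199")
alice.phones.remove(old_line)
alice.phones.append(new_line)

# The old number is re-allocated: only a relationship changes, no values do.
bob.phones.append(old_line)

print(alice.customer_name, [p.area_code for p in alice.phones])
print(bob.customer_name, [p.area_code for p in bob.phones])
```

Nothing about who Alice or Bob is had to change, and no element value of old_line had to change either; only the relationships were re-instantiated.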
3NF: A relation R is in 3NF if R is in 2NF and no non-key attribute of R is transitively dependent on any other non-key attribute of R. [Boyce-Codd Normal Form (BCNF) provides a more comprehensive expression that includes elements of compound keys vs. other key elements.]
Basically this just means that any given attribute's value is independent of the values of other attributes. That is, it is dependent on the key, the whole key, and nothing but the key. [Technically the Telephone Number example could violate 3NF when one considers that different countries allocate numbers for areaCode, exchange, and number differently. As a practical matter this simple model does not work internationally; one needs subclassing because some countries don't even have exchanges or area codes.]
Just substitute identity for key in the second sentence of the paragraph immediately above and one has the OOA/D view. The basic logic is the same. Consider the following class:
[House]
* builder
+ style
+ price
Now suppose we had the following sample of objects where builders specialize in a particular style of house:
builder  style     price
-------  --------  -----
A        Duplex    265K
B        Duplex    265K
C        Bungalow  245K
D        Ranch     250K
E        Bungalow  245K
F        Duplex    265K
G        Ranch     250K
H        Ranch     250K
What's wrong with this picture? All Duplexes cost 265K; all Bungalows cost 245K; and all Ranches cost 250K. In other words, the price is transitively dependent on builder through style. Though the price a builder charges is unique to the builder, it actually depends on what style of house the builder builds. So what we need is a little normalization in the RDB world:
[House]                     [Style]
* builder         +-------- * styleType
+ houseStyle -----+         + price
We could do the same thing in the OOA/D world:
[House] ------------------ [Style]
+ builder                  * styleType
                           + price
Note a subtle change: House::builder is no longer designated as an identifier. This reflects another difference between Class Modeling and Data Modeling. We tend to abstract things differently in an OO context. The name of a table in an RDB is essentially arbitrary; it simply identifies an n-ary relation in a unique fashion. (Good documentation practice urges use of meaningful names, though.) In OO development a House will abstract some identifiable entity in the problem space that is not a Builder (e.g., the big wooden thing with windows and doors). So the identity of the builder is almost never going to be the identity of a House object. In other words, in Data Modeling one is organizing data while in Class Modeling one is abstracting entities. But so long as one keeps track of what identity is in the Class Model, the same principles apply.
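The normalized [House] and [Style] model might be sketched in Python as follows. This is only an illustration under my own assumptions: the relationship is held as a plain object reference, prices are kept as the strings used in the sample, and the new price is invented.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Style:
    style_type: str   # identifies the Style
    price: str        # depends only on the Style


@dataclass(eq=False)
class House:
    builder: str      # ordinary knowledge attribute, not the identity of the House
    style: Style      # relationship to exactly one Style


duplex = Style("Duplex", "265K")
ranch = Style("Ranch", "250K")
houses = [House("A", duplex), House("B", duplex), House("D", ranch)]

# The price fact lives in one place, so one update reaches every related House.
duplex.price = "270K"   # hypothetical new price
print([(h.builder, h.style.price) for h in houses])
# prints [('A', '270K'), ('B', '270K'), ('D', '250K')]
```

Had price stayed in House, the same update would have to be repeated for every Duplex, which is exactly the integrity risk that 3NF removes.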
One possible approach for Class Modeling is to treat all classes as if they were tables and one was defining a relational database schema. Alas, that is a bit too simplistic. However, it is a very good metaphor, especially when doing normalization of knowledge attributes. The most comprehensive book on OO Class Modeling is Leon Starr's "Executable UML: How to Build Class Models". Yet he never mentions Normal Form. But he uses the table metaphor extensively when describing his "cookbook" guidelines for /doing/ normalization.
One reason the table-driven approach is simplistic is related to my point at the end of the 3NF description. The relational data model is applied differently in OOA/D because we perceive the problem in terms of real (albeit often conceptual) entities in some problem space while the RDB view is applied much more narrowly to organizing data. In addition, relationships and identity are much more abstract in OOA/D so classes become very important. In the RDB view the identity of a tuple (row) is paramount while the identity of an n-ary relation (table) is unique but semantically secondary.
The most important difference, though, is the notion of what a property in an n-ary relation is. In Data Modeling a property is data, pure and simple. In OOA/D, though, a property may be either data or a behavior. That is extremely important because OOA/D treats knowledge (data) and behavior differently on many levels. For one thing knowledge attributes have different values for different objects but what is the "value" of a behavior when all objects of a given class execute the same behavior?
The table metaphor doesn't work so well for behaviors because one can't look for patterns in the "values" of the object behaviors in practical samples. The "value" of a knowledge responsibility is what it knows, which can be described in pure semantic terms. Similarly, the "value" of a behavior responsibility is what it does, which can also be expressed in semantic terms.
This semantic notion of "value" works at the class level. Not coincidentally, that is exactly the level where we do normalization. Thus we can normalize on the basis of responsibility semantics for both knowledge and behavior, but we don't have the convenience of the table metaphor to verify the normalization for behaviors.
Thus the semantics of what the objects of a class know or do is logically indivisible at the level of abstraction of the containing subsystem (1NF). Similarly, the semantics of what the objects of a class know or do is dependent on the identity of the class (2NF). Finally, the semantics of what the objects of a class know or do is dependent only on the identity of the class (3NF).
Note that the lack of a table metaphor for verifying behavior responsibilities is not a major problem -- precisely because all objects of the class execute exactly the same behavior. Thus the sort of analysis I did above for the 3NF discussion is largely irrelevant; essentially one gets 3NF for behaviors for free in OOA/D because of the way OOA/D defines behaviors. So during normalization of behaviors one is primarily interested in 2NF. That translates into one-behavior-one-place.
In fact, one can distill Normal Form as applied to OO Class Models by rephrasing a dictum that dates from the '60s: Each responsibility <fact> should appear in exactly one class <place> in the Class Model. But we also need to ensure that it goes in the right place. To do that we also need the guideline: The responsibility should depend on the object identity, the whole object identity, and nothing but the object identity. As it happens, most OOA/D authors do not mention normal form. They just provide these two guidelines in one form or another.
The remaining issue is why we care about normalization. The answer is basically the same as for why one cares about the relational data model when organizing data. Following the RDM essentially eliminates foot-shooting in the form of data and referential integrity problems. It also provides a structure that is more efficient to navigate.
Recall that I indicated that the TelephoneNumber example was not correct for internationalization. You could get to a proper subclassing solution directly by recognizing patterns in an encyclopedic set of samples of international phone numbers. That's fine but how do you make sure that you recognized the right pattern? One way to validate it is by testing it against Normal Form. You can find some remarkably subtle problems through normalization against a relatively small set of samples. It may not be foolproof because you may have missed some crucial situations, but you will usually be well ahead of the game if the Class Model is in 3NF as you understand the problem space.
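For knowledge attributes, testing a small sample against Normal Form can be partly mechanized. The sketch below uses the house sample from the 3NF discussion; the determines helper is a hypothetical name of my own. It checks whether one attribute's value fixes another's within the sample. A sample can refute a dependency or hint at one, but it cannot prove that the dependency holds in the problem space.

```python
from collections import defaultdict


def determines(rows, lhs, rhs):
    """Return True if, within this sample, each lhs value maps to a single rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return all(len(values) == 1 for values in seen.values())


# The house sample from the 3NF discussion.
houses = [
    {"builder": "A", "style": "Duplex", "price": "265K"},
    {"builder": "B", "style": "Duplex", "price": "265K"},
    {"builder": "C", "style": "Bungalow", "price": "245K"},
    {"builder": "D", "style": "Ranch", "price": "250K"},
    {"builder": "E", "style": "Bungalow", "price": "245K"},
    {"builder": "F", "style": "Duplex", "price": "265K"},
    {"builder": "G", "style": "Ranch", "price": "250K"},
    {"builder": "H", "style": "Ranch", "price": "250K"},
]

print(determines(houses, "style", "price"))    # True: style alone fixes price here
print(determines(houses, "style", "builder"))  # False: several builders share a style
```

A True result for two non-identifying attributes such as style and price is the cue to ask whether a separate entity is hiding in the class. As discussed earlier, this kind of check has no counterpart for behavior responsibilities.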