Class Models are based on the Relational Model portion of Set Theory. However, this is not the same as the Relational Data Model (RDM) that one sees applied in Data Models for relational databases. Unfortunately, the syntax for many elements employed in a UML Class Model and in the Entity Relationship Diagrams used for Data Modeling is virtually identical, which tends to cause a lot of confusion. A Class Model is not a Data Model! An OO Class Model represents a far more abstract view of the Relational Model than a Data Model does. (I discuss a number of the relevant differences in the category Persistence in OO Applications.)
However, because data and referential integrity are just as important to a running OO application as they are to a relational database, we need to address those issues in a Class Model. The primary tool for doing that is normalization. However, we have to apply it somewhat differently in a Class Model because we do not usually define object identity explicitly as a knowledge responsibility. (We usually handle it implicitly through the way relationships are instantiated; more on that in the category on Relationships.)
Normalization involves the concept of Normal Form in the Relational Model. In the RDM, Normal Form is basically a fancy way of saying that the model is constructed to ensure that data and referential integrity are preserved. That is, data will always be synchronized and navigating relationships between tables will always be valid. That's because the database engine is implemented to ensure that consistency once the data itself is organized in schemas so that it is compliant with Normal Form.
Things are a bit more complicated for a running OO application because properties are not simply data. Object knowledge responsibilities are abstract data types that do not require a physical data store. In addition, behavior responsibilities are not data at all. Finally, running applications may have to deal with issues like concurrency that are determined uniquely by the problem context rather than by a single generic paradigm like two-phase commit in a database. As a result, ensuring data and referential integrity in a running OO application is managed in stages by the developer. The application of normalization to the Class Model is just one stage of that overall management. We use normalization of the Class Model during OOA to create a static structure that is internally consistent in much the same manner as data in a relational database. But we may still have to tinker with the dynamics to ensure data and referential integrity during OOD and OOP.
To understand how Normal Form plays in an OO context, let's look at what it means. There are actually several different levels of Normal Form in the Relational Model, but in a Class Model we are usually only interested in the first three levels...
1NF: A relation is in 1NF if every attribute is a simple domain.
Basically this means that each property is logically indivisible at a
given level of abstraction. In an OO context we have a very flexible
view of logical indivisibility that is tied to subsystems (i.e., every
class in a subsystem is at the same level of OOA/D abstraction). That
allows an ADT attribute like Complex Number to be treated as a scalar
value in a class in one <high-level> subsystem while it becomes a full
class with Real and Imaginary attributes in another <low-level> subsystem.
The example I used of a telephone number is a case where the view of 1NF
is flexible, depending on the level of abstraction. If one is
interested in, say, the area code then the whole number cannot be a
simple domain. In an RDB one would have to break out the number into a
separate table with separate attributes for {country code, area code,
exchange, number}. The key would be compound, containing all of those
elements, and there would be no non-key attributes. In Data Modeling we
would have:
[Customer]                       [TelephoneNumber]
* customerName          +------- * countryCode
+ address               | +----- * areaCode
...                     | | +--- * exchange
+ country --------------+ | | +- * number
+ area -------------------+ | |
+ exchange -----------------+ |
+ number ---------------------+
...
Of course this is messy on the Customer side, so the DBA would probably
make an artificial key identifier for the [TelephoneNumber],
say TNId:
[Customer]                  [TelephoneNumber]
* customerName      +------ * TNId
+ address           |       + countryCode
...                 |       + areaCode
+ contactPhone -----+       + exchange
...                         + number
That is likely to be some sort of autonumber identifier but it could
simply be a concatenation of the non-key attributes. The second case is
legal because referential attributes are logically different than the
data being stored; they are an artifact of the RDB implementation.
However, the second case leads to a certain degree of silliness because
the Customer then already has the entire phone number as a foreign key.
That's the reason that DBAs generally use a higher level of abstraction
-- any referential identifier (foreign key) to the telephone number will
be the telephone number so one is wasting a lot of infrastructure and
access processing to make tables whenever they do not have non-key attributes.
In addition, the RDB interface (e.g., SQL) can provide mechanisms for
extracting the individual elements. So the DBA deliberately chooses a
higher level of abstraction where the individual elements are not
identified. In fact, RAD IDEs generally provide things like patterns for
phone numbers that provide input verification and a mechanism for
"parsing" simple domains.
In OOA/D development a similar situation arises. If we need the area
code, then we can't describe a telephone number as a simple knowledge
ADT that is treated as a scalar value. One should break it out into a
separate class with explicit attributes for the elements. As it
happens, the DBA problem with redundancy in referential attributes does
not exist in OOA/D because we do not need embedded identity attributes
and we treat relationships at a higher level of abstraction. So we have:
           1       R1       1
[Customer] ------------------ [TelephoneNumber]
* customerName                + countryCode
+ address                     + areaCode
...                           + exchange
                              + number
Note that [Customer] has an explicit knowledge attribute, customerName, that uniquely
identifies a Customer while [TelephoneNumber] does not have an explicit
identifier attribute. In addition, there is no referential attribute
defined in [Customer]; the relationship notation abstracts the
mechanisms we use for referential integrity. That makes the OOA/D model
independent of specific implementation optimizations.
[At OOP time identity will almost always be implicitly implemented
through pointers and collection classes, usually when the objects are
instantiated. Implicit identity is effectively managed through
instantiation of relationships so one always navigates to the "right"
instance. (See the category on Relationships for more discussion.) This use of relationships is a very fundamental difference
between the OO view and the index-and-search RDB paradigm.]
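As a rough illustration of that note, here is a minimal Python sketch of the Class Model above. The snake_case names and the sample values are my own assumptions, and a real OOP implementation might well choose different mechanisms. The point is only that R1 becomes an object reference and that [TelephoneNumber] needs no identifier attribute.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # objects compare by identity, not by attribute values
class TelephoneNumber:
    country_code: str
    area_code: str
    exchange: str
    number: str


@dataclass(eq=False)
class Customer:
    customer_name: str
    address: str
    # R1 is instantiated as an object reference; no TNId-style foreign key.
    telephone_number: TelephoneNumber


# Illustrative values only.
home_line = TelephoneNumber("1", "617", "555", "0100")
customer = Customer("Pat Smith", "12 Elm St", home_line)

# Navigating R1 reaches the related instance directly; nothing is searched.
area_code = customer.telephone_number.area_code
```

Two TelephoneNumber objects with identical element values are still distinct objects here, which is the implicit identity the note describes.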
So there is no reason for the OOA/D modeler to not create the
[TelephoneNumber] class. Note that putting countryCode, areaCode, etc.
in Customer directly without a [TelephoneNumber] class would be an
error, but that is because of...
2NF: A relation R is in 2NF if R is in 1NF and each non-key attribute in
R is fully dependent upon every key.
What this means is that
any non-key attributes in each table row (tuple) must be uniquely dependent on
the key for the row. If the key is compound they must be dependent on
every element of the tuple key. So in the RDB example above, we had two
choices: either make the elements part of a compound key, which is as
dependent as one can get, or have a key where each attribute
value is determined as soon as one defines the key value. [Technically
in this case the attributes form a secondary key, but the normal forms
beyond 3NF deal with that.]
Now in the OO context we need to interpret this somewhat differently
because we usually do not have explicit identity attributes.
Nonetheless, it is fundamental to OOA/D that every object be abstracted
from a uniquely identifiable problem space entity. So we do have unique
identity and the properties of an object must be dependent directly on
that identity. In other words, relationships raise the level of abstraction so
that we don't care how identity is implemented, but we do still care
about identity. (We just move those cares to the rules and policies of
instantiating relationships.)
In the Class Model above the notion of telephone number is a unique
problem space concept and there are many possible values of telephone
number. However, the value of, say, areaCode will be fixed for any
given telephone number. In other words, it is fundamentally dependent on the
identity of a particular telephone number, regardless of how we
implement identity. By contrast, consider:
[Customer]
* customerName
+ address
...
+ countryCode
+ areaCode
+ exchange
+ number
...
To be attributes of [Customer] these elements must be uniquely dependent
on customerName. Superficially they seem to be. But what happens when
the Customer gets a new areaCode because the utility decided to
reorganize its service area? Does that change have anything to do with
who the Customer is? No; at best it is about where the Customer lives
relative to the way the utility defines telephone numbers.
[Note that the same arguments apply above and below even if Customer does not have an explicit identity attribute. We still have a notion of a Customer entity, that Customer has identity, and the individual attributes are not defined directly based on it.]
An even more interesting question is: what if the Customer has multiple
phone numbers? How is the number element fully dependent on
customerName if there can be different values? [We could get around
that with multiple attributes like line1Number, line2Number, ... but
that also starts to be pretty silly.]
The last question to ask is: what happens when the phone company
re-allocates the Customer's old number to someone else when the areaCode
was changed? Now those same values are dependent on a different
customerName. That violates the notion of fully dependent.
The bottom line of the example is that if we care about the individual elements, we need to have a notion of a Telephone Number entity. The values of the individual elements are then clearly dependent on the identity of that entity.
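One way to make the "fully dependent" test concrete is to check it against a handful of samples. The sketch below is only an illustration: the helper name determines, the "phone" labels, and the sample rows are all hypothetical, and passing the check on a sample does not prove the dependency holds in the problem space.

```python
from collections import defaultdict


def determines(rows, identity, attribute):
    """True if every identity value maps to exactly one attribute value."""
    values_seen = defaultdict(set)
    for row in rows:
        values_seen[row[identity]].add(row[attribute])
    return all(len(values) == 1 for values in values_seen.values())


# Hypothetical samples. "phone" is only a label for which Telephone Number
# entity a row describes; Pat has two lines.
samples = [
    {"customerName": "Pat", "phone": "T1", "areaCode": "617", "number": "0100"},
    {"customerName": "Pat", "phone": "T2", "areaCode": "508", "number": "0199"},
    {"customerName": "Lee", "phone": "T3", "areaCode": "617", "number": "0142"},
]

# Fails: one customerName goes with two different areaCode values.
customer_determines_area = determines(samples, "customerName", "areaCode")

# Holds: each Telephone Number entity has exactly one areaCode.
phone_determines_area = determines(samples, "phone", "areaCode")
```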
3NF: A relation R is in 3NF if R is in 2NF and no non-key attribute of R
is transitively dependent on any other non-key attribute of R. [Boyce-Codd Normal Form (BCNF)
provides a more comprehensive expression that includes elements of
compound keys vs. other key elements.]
Basically this just means that any given attribute's value is
independent of the values of other attributes. That is, it is dependent
on the key, the whole key, and nothing but the key. [Technically the Telephone Number example could violate 3NF when one considers that
different countries allocate numbers for areaCode, exchange, and number
differently. As a practical matter this simple model does not work
internationally; one needs subclassing because some countries don't even
have exchanges or area codes.]
Just substitute identity for key in the second sentence of the
paragraph immediately above and one has the OOA/D view. The basic logic
is the same. Consider the following class:
[House]
* builder
+ style
+ price
Now suppose we had the following sample of objects where builders
specialize in a particular style of house:
Builder   Style      Price
-------   --------   -----
A         Duplex     265K
B         Duplex     265K
C         Bungalow   245K
D         Ranch      250K
E         Bungalow   245K
F         Duplex     265K
G         Ranch      250K
H         Ranch      250K
What's wrong with this picture? All Duplexes cost 265K; all Bungalows
cost 245K; and all Ranches cost 250K. In other words, the price is transitively
dependent on builder through style. Though the price a builder charges
is unique to the builder, it actually depends on what style of house
the builder builds. So what we need is a little normalization in the
RDB world:
[House]                  [Style]
* builder        +------ * styleType
+ houseStyle ----+       + price
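The pattern in the sample table and the resulting decomposition can be sketched in a few lines of Python. This is just an illustration using the eight sample rows above; the variable names are my own, and the dictionaries stand in for the two tables rather than for any particular database.

```python
from collections import defaultdict

# The sample of House objects from the table above: (builder, style, price).
houses = [
    ("A", "Duplex", "265K"), ("B", "Duplex", "265K"),
    ("C", "Bungalow", "245K"), ("D", "Ranch", "250K"),
    ("E", "Bungalow", "245K"), ("F", "Duplex", "265K"),
    ("G", "Ranch", "250K"), ("H", "Ranch", "250K"),
]

# Group prices by style; one price per style means style determines price.
prices_by_style = defaultdict(set)
for _builder, style, price in houses:
    prices_by_style[style].add(price)
style_determines_price = all(len(p) == 1 for p in prices_by_style.values())

# Decompose: [House] keeps builder -> houseStyle, [Style] keeps styleType -> price.
house_table = {builder: style for builder, style, _price in houses}
style_table = {style: price for _builder, style, price in houses}

# Joining the two tables through houseStyle recovers the original rows.
rejoined = [(b, s, style_table[s]) for b, s in house_table.items()]
```

After the split each price is recorded once per style rather than once per builder, and nothing is lost because the join reproduces the original sample.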
We could do the same thing in the OOA/D world:
        *       R1       1
[House] ------------------ [Style]
+ builder                  * styleType
                           + price
Note a subtle change: House::builder is no longer designated as an
identifier. This reflects another difference between Class Modeling and
Data Modeling. We tend to abstract things differently in an OO context.
The name of a table in an RDB is essentially arbitrary; it simply
identifies an n-ary relation in a unique fashion. (Good documentation
practice urges use of meaningful names, though.) In OO development a
House will abstract some identifiable entity in the problem space that is not a
Builder (e.g., the big wooden thing with windows and doors). So the identity of the builder is almost never going to be the
identity of a House object. In other words, in Data Modeling one is organizing
data while in Class Modeling one is abstracting entities. But so long
as one keeps track of what identity is in the Class Model, the same
principles apply.
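For what it is worth, the OOA/D version of House and Style might be sketched as follows. The names are my own assumptions and the 255K figure is made up purely to show an update; the sketch only demonstrates that price is reached by navigating R1 and that builder is ordinary knowledge rather than identity.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # a House is identified by being that object
class House:
    builder: str      # plain knowledge attribute, not an identifier
    style: "Style"    # R1: many Houses relate to one Style


@dataclass(eq=False)
class Style:
    style_type: str
    price: str


ranch = Style("Ranch", "250K")
houses = [House("D", ranch), House("G", ranch), House("H", ranch)]

# The price lives in one place, so one update is seen through every House.
ranch.price = "255K"  # hypothetical new price
prices = [house.style.price for house in houses]
```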
One possible approach for Class Modeling is to treat all classes as if they were tables and one was defining a relational database schema. Alas, that is a bit too simplistic. However, it is a very good metaphor, especially
when doing normalization of knowledge attributes. The most
comprehensive book on OO Class Modeling is Leon Starr's "Executable UML:
How to Build Class Models". He never mentions Normal Form, yet he uses the table metaphor extensively when describing his "cookbook"
guidelines for /doing/ normalization.
One reason the table-driven approach is simplistic is related to my point at the end
of the 3NF description. The relational data model is applied
differently in OOA/D because we perceive the problem in terms of real
(albeit often conceptual) entities in some problem space while the RDB
view is applied much more narrowly to organizing data. In addition,
relationships and identity are much more abstract in OOA/D so classes
become very important. In the RDB view the identity of a tuple (row) is
paramount while the identity of an n-ary relation (table) is unique but
semantically secondary.
The most important difference, though, is the notion of what a property
in an n-ary relation is. In Data Modeling a property is data, pure and
simple. In OOA/D, though, a property may be either data or a behavior.
That is extremely important because OOA/D treats knowledge (data) and
behavior differently on many levels. For one thing knowledge attributes
have different values for different objects but what is the "value" of a
behavior when all objects of a given class execute the same behavior?
The table metaphor doesn't work so well for behaviors because one can't
look for patterns in the "values" of the object behaviors in practical
samples. The "value" of a knowledge responsibility is what it knows,
which can be described in pure semantic terms. Similarly, the "value"
of a behavior responsibility is what it does, which can also be
expressed in semantic terms.
This semantic notion of "value" works at the class level. Not coincidentally,
that is exactly the level where we do normalization. Thus we can
normalize on the basis of responsibility semantics for both knowledge
and behavior, but we don't have the convenience of the table metaphor to
verify the normalization for behaviors.
Thus the semantics of what the objects of a class know or do is
logically indivisible at the level of abstraction of the containing
subsystem (1NF). Similarly, the semantics of what the objects of a
class know or do is dependent on the identity of the class (2NF).
Finally, the semantics of what the objects of a class know or do is
dependent only on the identity of the class (3NF).
Note that the lack of a table metaphor for verifying behavior
responsibilities is not a major problem -- precisely because all objects
of the class execute exactly the same behavior. Thus the sort of
analysis I did above for the 3NF discussion is largely irrelevant;
essentially one gets 3NF for behaviors for free in OOA/D because of the
way OOA/D defines behaviors. So during normalization of behaviors one
is primarily interested in 2NF. That translates into
one-behavior-one-place.
In fact, one can distill Normal Form as applied to OO Class Models by rephrasing a dictum that dates from the '60s: Each responsibility <fact> should appear in exactly one class <place> in the Class Model. But we also need to ensure that it goes in the right place. To do that we also need the guideline: The responsibility should depend on the object identity, the whole object identity, and nothing but the object identity. As it happens, most OOA/D authors do not mention normal form. They just provide these two guidelines in one form or another.
The remaining issue is why we care about normalization. The answer
is basically the same as for why one cares about the relational data model
when organizing data. Following the RDM essentially eliminates
foot-shooting in the form of data and referential integrity problems. It also
provides a structure that is more efficient to navigate.
Recall that I indicated that the TelephoneNumber example was not correct
for internationalization. You could get to a proper subclassing solution
directly by recognizing patterns in an encyclopedic set of samples of
international phone numbers. That's fine, but how do you make sure that
you recognized the right pattern? One way to validate it is by testing
it against Normal Form. You can find some remarkably subtle problems
through normalization against a relatively small set of samples. It may
not be foolproof because you may have missed some crucial situations,
but you will usually be well ahead of the game if the Class Model is in
3NF as you understand the problem space.