| 1: LDM Overview |
Objectives
| |
This module enables the student to:
|
What is an LDM?
| |
A Logical Data Model (LDM) is a graphical representation of the way a business organizes its information. It is not a database, and is platform-independent. It is a logical and unbiased representation of the business data. The data model defines which individual data elements exist and how they relate to one another. It provides an accurate model (or map) of the business information. The LDM allows the user to determine if all their business questions can be answered.
LDM Deliverable
|
Industry LDMs
| |
Teradata Solution Industry LDMs Logical data models (LDMs) are critical in the design, creation and maintenance of data warehouses and are the heart of many Teradata solution offers. Teradata has created industry specific LDMs that are designed to support the specific business needs of that industry. Each industry LDM is comprehensive and ensures that the appropriate data will be included in the warehouse for those critical business needs. The major Teradata Solution Industry specific LDMs are Communications, Financial, Retail, and Travel. The Teradata Division has invested many person years developing and implementing these models so that they require minimal modification for a customer implementation. This means the data modelers can get the data warehouse designed in minimum time with maximum functionality. Using a model that is already tested, accurate, and complete gives the customer a sense of security and confidence in our ability to deliver. Training for each specific industry LDM is available.
|
Benefits and Differentiators
| |
Overview There are many reasons to use the Teradata Solution Industry LDMs. One of the most common is faster warehouse implementation, as noted above. Teradata's LDMs jumpstart the planning, design and implementation process of a data warehouse. It is a template or starting point for the warehouse.
Industry LDM Differentiators
Before we can modify an industry specific LDM, we first need to understand why LDMs are needed and understand the basic structure and philosophy behind their design.
|
Why Create an LDM?
| |
Benefits of the LDM to the Customer
PS and the LDM
Sales and the LDM
Building an LDM for a large enterprise wide system is an immense task. It takes many person years. Although building an enterprise LDM is a huge task, it is typically implemented in phases, and doesn't have to be completed before any other work can be done. Because it is so complex, it takes many iterations to fine tune and be sure that nothing is missed. The good news is that our customers don't need to build their own LDM for any industry where we've developed a solution LDM to go with our Teradata Solution. Using a solution LDM can save the customer years of effort and millions of dollars.
|
Analogies Help Explain the LDM
| |
Foundation Analogy: The LDM is the foundation on which the physical design, implementation, and applications are built. The foundation of a building determines its overall structural integrity and ultimately its usefulness and longevity (and costs associated with later repairs). The LDM does the same for a data warehouse (DW). If the LDM is not designed correctly, the entire decision support capability is compromised. If the foundation is faulty, the building fails, causing huge cost at the least and disaster at the worst.
|
LDM vs. PDM
| |
An LDM is NOT the same as a PDM! A Physical Data Model (PDM) is what you actually implement on a hardware platform, and may be designed to accommodate specific platform performance issues. The LDM is exclusively a logical and unbiased representation of the business data. It does not address platform or performance considerations at all (e.g., It is platform and hardware independent). The PDM may be created from the LDM, but only after the LDM is complete and validated.
Simplified Scenario for a Complete DW Project:
|
Relational Theory
| |
Purpose & Foundation Relational Theory is the basis of all aspects of logical data modeling, from the modeling process, to normalization rules to the resulting design. Relational Theory is the basis for the underlying structure of relational databases as well as SQL (Structured Query Language). SQL is a set-processing programming language designed specifically to work with relational databases. It is not necessary to understand Relational Theory in depth in order to do good logical data modeling. However, these core concepts from Relational Theory underlie the techniques you will be learning. Relational Theory stresses:
Relational Structure
More in-depth discussions of these and related concepts are included throughout the rest of this course.
|
ERD Definition
| |
An LDM is ultimately documented in an ERD (Entity-Relationship Diagram). The ERD is a standard modeling technique for defining and graphically representing an LDM:
All of the concepts below are much more fully defined in the following modules.
|
ERD vs Table Layouts
| |
Instead of ERDs, a table layout format is used throughout this course to facilitate learning. Since people have a great deal of prior personal experience looking at tables in a wide variety of life contexts, they are very easy and intuitive to understand. ERDs are much less so. The final module of the course covers ER Diagramming conventions in detail, and shows how the table formats translate into ERDs. ERDs are not used in this course for teaching the modeling process. However, at a few key points in this course, ERD versions of the table layouts will be available for comparison. ERDs help people understand the big picture, while table layouts make the details clear.
|
Table Layouts
| |
Table layouts give you a more detailed definition of the tables. They are an alternate way to document an LDM that is especially useful while learning about data modeling. Table layouts show entities (sometimes called tables), attributes (sometimes called columns) and relationships, as does the ERD. But, by adding sample data values, table layouts provide a level of detail that make many concepts easier to grasp visually and intuitively. Here is the Table layout equivalent of the ERD just shown:
Table Layout Conventions
NOTE: ERDs and how they are used at customer engagements are discussed more comprehensively at the end of this course.
|
Modeling Process Overview
| |
The Big Picture Data modeling is the process of identifying, defining, validating and documenting all of the entities, relationships, and attributes and assertions that accurately represent the business data to fulfill the business requirements. It is a very detailed and complex task that can take many person-months or even years for a large enterprise. That's the bad news. The good news is that the process is highly logical and intuitive, and is made of many small reasonably simple tasks. The complexity of data modeling is in the number of pieces to be defined, not in the process of defining them. Using a simple business example makes learning the process fairly easy.
Modeling Steps
|
Try It: LDM vs. PDM
| |
Try It! Exercises Don't skip the interactive exercises that are sprinkled throughout this WBT. They are not only the most fun part of the course, but the richest for your learning. Data modeling is a big, complex task, and these exercises are designed to give you hands-on experience with each aspect of it. In addition to the small exercises, there are two larger ones, that may take you up to 30 minutes each. We recommend that you do them, as they bring all the prior learning together by simulating the entire data modeling process for a small enterprise.
|
Try It: LDM vs. PDM
| 2: Defining Entities |
Objectives
| |
This module enables the student to:
|
Entity
| |
What is an Entity? An entity is something about which information must be maintained: A person, place, thing, concept, idea, etc. It is simplest to visualize an entity as a table with rows and columns.
|
Rules for Identifying Entities
| |
The following are rules for identifying entities:
|
Entity Definition Problems
| |
People tend to be fairly casual in the way they think about and describe their business needs and processes. Ask questions to specify the meaning of names that are under consideration as entity names. Here are two common situations to be alert for, and to resolve explicitly when they arise.
|
Try It: Entities
| |
Directions: A hardware vendor has the following business requirements. Using the rules for identifying entities from the previous page (a copy is provided for your reference below), identify the three main entities you would define.
The following are rules for identifying entities:
|
Primary Key
| |
What Is a Primary Key? Remember the rule for defining entities: Each instance of an entity must be uniquely differentiated from all other instances? Primary Keys are the way we enforce this rule. A PK is a unique identifier. It is a column or group of columns whose combined values are unique for every row in the entity table. PK columns are traditionally shown as the first columns in a table (for human convenience, though they need not be in any order according to relational theory), and are called out explicitly as the PK in both ER Diagrams and Table Layouts.
|
Rules for Primary Keys
| |
For each entity, you must determine what unique identifier to use (there may be several candidates for PK), and fill it in on your model. Make sure it adheres to all of these rules for PKs:
|
PK Definition Issue: Uniqueness
| |
Identifying a key with unique values can be troublesome. Many times a user already has identifying numbers they use for their major entities. Often the currently used numbering does not enforce uniqueness, or has changed format over time, so that a single row has more than one identifier. To make it worse, users tend to be emotionally and habitually attached to their legacy numbering schemes. Always ask explicitly if there are any known duplicates, or if the numbering scheme has changed in the past. If the answer is yes, often the best answer is to assign an incremental number, named to match the entity. That way uniqueness is assured in all cases. You can still carry the other historical numbers for reference, without using them for identification.
|
Intelligent Keys
| |
PK Definition Issue: Intelligent Keys An intelligent key is one that is made up of several parts, each of which has meaning to the user. Examples include phone number (for telecommunication companies) and Social Security Number for the IRS. Whenever the user offers an intelligent key as a candidate for PK, advise them to use a uniquely assigned number and carry the intelligent key as an attribute because there are several serious problems with this kind of key:
|
Intelligent Key Disaster Example
| |
Our hardware vendor has always identified their parts in the past by the warehouse Aisle, Level and Bin number: A-3-47, so that they can instantly find whatever part the customer is asking for. An uninformed data modeler gives them a system that retains this PK for their Part table, as requested. Six months later they acquire another hardware concern and start selling twice as many types of parts. They expand their floor space and reorganize their inventory for customer convenience. Now they have to reassign new IDs to almost every part! It's a huge job, but they do it. Then a customer calls in with a question on a PO from before the change. Mass confusion results because the part numbers on the PO now refer to entirely different parts! They must create a translation table, or suffer endless confusion. All of their historical data has this problem. And it will all start over if they ever do another acquisition.
What is the correct number for a Hammer?
| |||||||||
PK Sample Data
| |
Table layouts provide space for sample data values, which can help validate with the user that the combination of attribute values is in fact unique. For multi-column keys, it is best to put duplicate values in each key column, and vary them so that the combinations are unique. This makes it graphically clear to designer and user why multiple columns are needed for the PK. It may also unearth errors in assumptions about valid data values.
|
Try It: PKs
| | Directions: Identify a unique PK for each of the main entities defined for the hardware vendor by choosing the best PK candidate column from the related drop down list. |
| 3: Modeling Relationships |
Objectives
| |
This module enables the student to:
|
Relationships
| |
What is a Relationship? Functions in a business are naturally related in the real world, so the data that represents them is also related. A relationship is a statement about how two entities interact with one another: Customers buy Products; Employees work in Departments. Defining the relationships between tables is a core component of relational design.
|
Direct and Indirect Relationships
| |
You must interview the user to determine if relationships are direct or indirect, because that decision depends on business rules.
|
Rules for Defining Relationships
| |
Each relationship:
|
Steps for Defining Relationships
| |
There are basically three steps to defining the relationship in an LDM:
|
Step 1: Relationship Matrix
| |
The purpose of the matrix is to reduce the work to a minimum while guaranteeing that no possible relationships are missed. The matrix defines the nature and name of the direct relationships between each possible pair of entities. In an actual data modeling engagement, the matrix is not generally used, but for learning purposes it makes the underlying process clear.
|
How to Complete a Relationship Matrix
| |
The following tasks describe how to complete a relationship matrix:
![]()
|
Example
| |
* The customer must order the part (e.g. generate a PO). Resulting Matrix
|
Try It: Matrix
| |
Ask the following questions for the remaining unshaded cells, and choose the most appropriate action for each:
![]()
|
Step 2: Cardinality
| |
Types of Relationship: Cardinality There are three main types of relationships between entities which are differentiated by their degree of cardinality:
|
Step 2: Defining Relationship Cardinality
| |
Once you know how the entities are related, it remains to determine the degree of cardinality.
|
Example
| |
Example: Customer/PO Relationship Cardinality
|
Try It: Cardinality
| |
Directions: Assume the same hardware store scenario for which you've already defined entities. Go through the matrix below and for each filled-in cell, designate the relationship cardinality (1:1, 1:M, or M:M).
|
Step 3: Foreign Keys
| |
What Is an FK? Remember the rule for defining relationships: Cardinality dictates the number of instances of one entity that can be related to a single instance of the other entity? Foreign Keys are the way to model and enforce this rule and the way to model relationships. A PK identifies uniqueness while an FK identifies relationship. FK columns are documented explicitly (marked as FK) in both ER Diagrams and Table Layouts. An FK is a column or group of columns whose combined values appear elsewhere in the model as a PK. Where it appears as a PK it is called the "parent" side of the relationship. Where it appears as an FK it is called the "child" side of the relationship. FK Constraints: ND and NN
|
Example
| |
Home Construction Enterprise Example:
Lot# 14 is a PK value in the Lot table. It is an FK value in the House table. These two designations together tell us that there is a relationship between a Lot and a House. It also shows that Lot is the parent (the side where the PK is), while Lot# in the House table is the child (the side where the FK is). The PK is said to "migrate" from the parent to the child table.
|
Rules for Foreign Keys
| |
You must document each direct relationship correctly in your model, and there is a different way to document each of the three types of relationships. Make sure it adheres to all of the rules for FKs:
|
Modeling 1:1 Relationships
| |
The following are tasks for modeling 1:1 relationships:
Example: House to Lot (1:1 relationship) for a contracting enterprise. Note: True 1:1 relationships tend to migrate into 1:M over time (as this one does). However, it is a valid way to model a relationship that the enterprise does not expect to change over a long period of time (i.e. longer than they are interested in tracking it).
Notice that there are two valid ways to model this relationship. Always model in the direction most likely to become 1:M over time. Since this is a contractor, and not a house moving company, the Lot is parent and the House is child. A contractor might build a second house on the same lot due to fire, for example.
|
Modeling 1:M Relationships
| |
The following are tasks for modeling 1:M relationships:
Notice that the sample data makes it obvious which way is correct:
|
Modeling M:M Relationships (Associative Entities)
| |
The following describes aspects of modeling M:M relationships (associative entities):
Example: Departments to Employees (M:M relationship) for a manufacturing enterprise. The business rule for this enterprise is that each employee may be assigned to multiple departments at any one time, and every employee must be assigned to at least one department at all times. Naturally, a department will normally have more than one employee assigned to it. There is only one valid way to model this relationship: Create a new associative entity with a PK made of the two FKs, naming the new entity for its parent entities.
|
Try It: FKs
| |
It is important at each step to document the model in order to validate its completeness and accuracy. Below is the model as it exists with the entities and PKs we've defined. Do this exercise first on paper. Draw the three main entities (Customer, PO, and Part), identify the three direct relationships, and then place FKs with appropriate constraints (NN, ND) based on what you've just learned about how to model relationships. Then, do the exercise here, using your written model as the basis for your answers.
|
Recursive Relationships
| |
What Is a Recursive Relationship? Good news: You've already done this! It models data hierarchies and networks. Another way to look at it is that recursive relationships are ones where an entity has a relationship to itself .
|
Referential Integrity
| |
What Is Referential Integrity? Referential integrity ensures that the relationships (PK/FK references) are intact and valid. For example, it is not possible in the real world for an employee to:
|
Example
| |
The PK/FK relationships defined in the LDM theoretically ensure that this could never happen, by invoking the rule: The value of an FK must always already exist in its related PK. This linkage is in fact an integral part of the definition of an FK. The implemented DBMS should automatically keep this from ever happening.
|
| 4: Modeling Attributes |
Objectives
| |
This module enables the student to:
|
Attributes
| |
What is an Attribute? A relevant, elementary data item which describes or provides detail about an entity (e.g., a column).
|
Rules for Identifying Attributes
| |
Rules for Identifying Attributes
|
Types of Attributes
| |
The following are types of attributes:
|
Derived Attribute Issues
| |
The attributes from which derived attributes are calculated are in the design, so carrying the derived attribute, in addition, creates redundant data. Derived attributes may be identified and defined in order to validate that the model can in fact deduce them, but should not be shown in the ER Diagram. The reason for this is that carrying redundant data goes against relational design theory and principles. There are several good reasons to avoid carrying redundant data:
|
Linguistic Issues
| |
People tend to be fairly casual in the way they think about and describe their business needs and processes. Ask questions to specify the meaning of names that are under consideration as attribute names. Here are two common situations to be alert for, and to resolve explicitly when they arise.
|
Try It: Attributes
| |
A hardware vendor has the following business requirements. Use the rules for identifying attributes on the previous page (a copy is provided below for your reference) to identify the attributes you would define.
Rules for Identifying Attributes:
|
Domains
| |
What Is a Domain? A domain defines the valid values for an attribute: The actual data value in that column within each row in a table. An attribute should contain only one domain of data, but many attributes may be drawn from the same domain. A few examples of very common domains are: A domain is defined by these primary characteristics:
|
Keep Domains Atomic
| |
The same data may be atomic with respect to one system, but not to another, depending on whether subsets of the data have intrinsic meaning to that enterprise. An example is Phone Number (12 digits: 03-827-456-2837):
To verify if an attribute (and its underlying domain) is atomic, ask the user if they would query against portions of it. If the answer is yes, break the attribute down into the smallest components that would be queried. Example: Would you ever want to report on any pieces of Customer Name? It is 2-5 domains depending on the user's answer:
While title and degree are not really part of customer's name, when a user mentions "customer name" in terms of their business needs, they may actually have these attributes in mind as well. Asking a question such as the one above will ferret out these additional attributes. Anticipated searches on subsets of the data do not indicate that the attribute should be broken out into its components in these cases:
|
Attribute Placement
| |
Overview Once you've identified the attributes, the question is which ones belong in which entities? A non-key attribute should be placed in only one entity. There is a formalized science to this process, called normalization, which is covered in detail later on in this course. However, to minimize work in actual implementation, the first cut for identifying and assigning attributes is generally by intuition. This results in a good model that can be validated and corrected by applying the formalized normalization rules. Assigning attributes intuitively is valid because in most cases the name of the attribute, or the way it's referred to tells you the answer. For example, customer name, address and type obviously belong in the Customer entity. Here's how they look after we've placed them:
|
Validate: Atomic Domains?
| |
Notice that the user originally specified only two attributes: name and address. When we apply the rules and concepts presented above (especially about domains) we discover that each of the original attributes breaks into multiple attributes in order to get to the atomic level. Use realistic sample data to help the user validate the attributes in terms of domain:
|
Validate: Missing Attributes?
| |
Users rarely just happened to mention everything they need at the atomic level. Always ask if there is more, and validate what is done so far. The easiest way to do this is to show them the table layout with sample data and solicit feedback. It's very likely here that the user would also want a start date, a phone number, perhaps e-mail, etc. for their customers. To save space and scrolling in this course, we're postulating that they are satisfied with the attributes shown. It is also a good idea to look at existing reports, screen layouts and file layouts to help identify attributes.
|
Validate: Missing Minor Entities?
| |
Sample data can help identify if entities are missing. Remember the relational rule that seeks to minimize redundant data? When the same descriptive data is repeated within a column, where only a few values are valid, it is generally advised to create minor entities. A minor entity is a table, usually with a code value as its PK and a single description attribute to define the code. The code is carried elsewhere in the system as an FK, and the full description is accessed from the minor entity if needed for reporting. Minor entities are a good way to maintain data accuracy and reduce redundancy. Example: A Gender code table. Clues to missing minor entities include:
|
Example
| |
Example: Employees may be paid by monthly salary or hourly wage. If there are many rows (say 50K employees), these long descriptions for each employee's PayType can cause problems:
|
Quick Quiz
| |
|
Constraints
| |
Constraint Concept Definition A constraint or assertion is a rule that defines limits on valid entity relationships or attribute values, based on business rules. Domain level constraints apply to all attributes in the domain. Attribute level constraints apply only to the specific attribute. Constraints can also apply across multiple entities and attributes. Constraint Examples
|
Validate and Document
| |
Some constraints are explicitly described in the LDM (e.g., a relationship or a domain) while others are only described in documentation associated with the LDM. It is important to validate the entire model against all of the identified constraints to make sure that they are supported by the design when it is complete. Sample data and table layouts can be very useful in validating constraints with the user.
|
NULLs Concept Definition
| |
NULL indicates that the value is unknown. NULL is not a blank or zero value. Reasons for NULLs in an attribute might indicate:
|
NULL vs Zero or Blank
| |
Here are some examples of possible differences in meaning (per the enterprise's business rules):
|
Issues with NULLs
| |
While the following are implementation issues, rather than logical modeling issues, they are the underlying reasons that the rules for NULLs were included in relational theory.
|
Validate: Missing Sub-Entities?
| |
There is a relational rule that all of the attributes in an entity should describe all of its rows. Large numbers of NULLs in the sample data may show that some attributes in fact apply to only a defined subset of the rows. This may indicate the need for a new entity, called a sub-entity or generalization. Sub-entities that are split out from one entity have:
|
Example
| |
Example: Notice that salaried employees have a monthly salary amount and bonus percent, while hourly employees have an hourly amount and overtime authorization. All employees have a PayType. The presence of NULLs in so many rows and columns indicates the need for sub-entities.
The Solution: Sub-Entities Create two more sub-entities from the original such that the NULLs no longer appear:
|
Try It: Place Attributes
| |
It is important at each step to document the model in order to validate
its completeness and accuracy. Below is the model as it exists with the entities
and relationships we've defined.
|
Congratulations
| |
Congratulations!
You've just created your first mini-LDM.
An ERD is provided for students who are already conversant in that format. Detailed discussion of ERDs is reserved for a later module. See an ER Diagram of the model. |
| 5: Bringing It All Together |
Objectives
| |
This module enables the student to describe and follow the high level steps for creating an LDM.
|
LDM Process Review
| |
Essentially the previous modules have taken you through this process, but without focusing on its formalized structure. Below is a summary of the entire process as you would do it on a real life project.
|
Step 1: Define Business Requirements/Scope
| |
The business requirements define the scope. As we develop the LDM we constantly validate it against the business requirements. Any component that does not relate directly to a business requirement is outside of scope for the project and is not included in the data model. Make sure all of the business requirements are understood!
|
Step 2: Define Entities
| |
|
Step 3: Define Relationships
| |
|
Step 4: Define Non-Key Attributes
| |
|
Step 5: Verify the Model
| |
|
What is Metadata?
| |
Metadata is information that describes the data structure of a system. An LDM is metadata. Metadata can be a difficult concept to grasp initially. Envision an LDM that describes a specific data model. Below is an example of how this would look for our hardware enterprise data model.
Notice the Entity that describes the hardware enterprise entities is called "Entity". This is why metadata is difficult to understand and describe: It is self-referent. "Entity" describes multiple levels of meaning in the sentence above. The graphic is much easier to understand.
|
Metadata and the PDM
| |
Just as there are two designs (LDM and PDM), there are two types of metadata. The LDM is a visual representation of metadata and is not implemented. When a database is implemented from the PDM, the data definitions for the tables, columns and relationships, as well as their physical characteristics (data type and length and indexing, etc.) are stored physically on the system. The tables that house these structural descriptions are also considered metadata because they define the implemented system.
|
Try It: Create an LDM
| |
| 6: Normalization |
Objectives
| |
This module enables the student to:
|
Overview
| |
What is Normalization? Normalization is a set of rules and a methodology for making sure that the attributes in a design are carried in the correct entity to map accurately to reality, eliminate data redundancy and minimize update anomalies. Most LDMs adhere first, second and third normal form, because a model in third normal form is as close to the real world conditions as is necessary or useful for most situations. Stated simply: One Fact, One Place!
The details of when to use each of these forms is covered in the following pages of this module. |
Power to the Data!
| |
The net benefit of normalization is that the power of a system is built into the data structure itself, rather than the applications. That is why it is important to design LDMs to be functionally unbiased. A normalized system gives the users control over their data. When they want more flexibility, they can enter more data in the existing design, rather than having to involve designers or programmers for enhancements and applications development. This makes the system much easier and less expensive to maintain over time.
|
Easy to Remember!
| |
This mnemonic can help you remember the purpose for each of the first three levels of normalization: Every non-key attribute in an entity must depend on: The KEY - 1st Normal Form (1NF) These three statements will be explained as we come to each normal form in turn. |
Intuition and Normalization!
| |
Placing attributes by intuition generally results in a model that is very close to normalized, and is much less work than formal normalization. However, the rules of normalization should be applied as a final check, and the model should be adjusted to comply with 3NF if discrepancies are identified.
|
Redesign and Normalization
| |
In many engagements, data modeling is done from the current "model", which may never have gone through the normalization process, or any design process for that matter, and may or may not have been relational. Often the designers must work from a combination of user interviews and current system (e.g., PDM) documentation. These situations generally include many normalization errors that must be identified and corrected. Legacy "designs" may be so riddled with denormalizations that it may be easier to design from scratch, and then go back and validate that all the original requirements and functionality are handled.
|
Earlier Is Better
| |
The earlier a design error is identified and corrected the less involved, and therefore the less expensive it is. For example, an error that requires redesign is many times more expensive if it occurs after implementation because it is more complex and interrupts production work, requiring data re-loading and having possible impact on application code and report generation (among other possible ripple effects). This is one of the main reasons for normalization.
|
Sample Data: The Designer's Best Friend
| |
Sample data is the easiest, most efficient, and simplest way to detect design errors early in the process when they can be quickly corrected in the LDM. It is best to do this on a related set of tables (perhaps by functional area) rather than on the entire model, which may involve hundreds of entities and be too complex to grasp as a whole. The issue is early validation and correction. Attribute names (if done carefully) make the model understandable, but allow users to assume that they contain the expected data. And two users' expectations may not be the same (e.g., Acct# for the accountant is a 9 digit number, for sales it is really a customer identifier that is a 5 digit alpha value). Someone is going to be disappointed when the first report is printed. Someone (probably you) will have to take up the slack when the error is discovered. If they'd seen the table with sample data in it, they would have clarified their needs instantly and the design would have been corrected before implementation.
|
1st Normal Form
| |
What is First Normal Form (1NF) Each non-key attribute should have a 1:1 relationship with the PK of the entity. There should be no repeating groups (multiple occurrences of attribute(s) based on the same domain). Repeating Groups An example here might be if you kept employee pay information on the Employee table. You'd need pay date, pay amount, paycheck number, etc. And you'd need to add a new occurrence of that group of information every pay period.
|
Example: Mixed Domain Names
| |
Here is how a repeating attribute looks, when domain name is not included in the attribute name:
This is a very common example, these days, as more and more phone options become available to people. You can see that the user may want to break "phone" into home phone and office phone, and probably will want to add voice mail phone, cell phone, and likely others over time. How would you know how many occurrences to plan for? No matter what you decide, over time it's likely to outgrow the planning.
|
The Solution
| |
It's best to avoid the problem altogether by removing the attribute(s) that repeat into their own entity:![]() With this model there is no limit to the number of phones for a customer. If a customer adds 20 or even 50 phones over time, all that is required is data entry rather than design change and implementation. This is what we mean by "putting the power in the data". This model (CustomerPhone table) is also more flexible. It could support information about each phone number, such as when the customer prefers each to be used, etc. by adding those attributes to the CustomerPhone table during the design phase. Food for Thought Notice that this model would not support a customer having more than one of a given type of phone number (e.g., multiple voice mails). How would you modify this model to do that?
|
Repeating Groups
| |
Why it Matters: Data Problems The number of repetitions tends to grow over time:
Look for an attribute (or set of attributes) based on the same underlying domain(s):
|
The Origin of 1NF
| |
When relational theory was first developed, the existing systems often included variable length records, which table structures do not permit. 1NF was developed to address this issue, and states that all rows will have the same record layout, with no repeating items or groups, and likewise that all instances should have data for all attributes defined for an entity. There are two ways in which a record length might vary, if repeating groups are permitted:
|
Try It: 1NF
| |
In the model below, identify the four repeating attribute(s) that you would remove from this table by clicking them in the list below. Click the Check button when you've check-marked all you intend to. Click Cheat to see the correct answer.![]()
Show Model (Displays the complete model for this exercise.)
|
2nd Normal Form
| |
What is 2nd Normal Form (2NF)? Each non-key attribute must relate directly to each column of the PK. The attributes depend on the whole key.
|
Example: 2NF Violation
| |
Here is how attributes that violate 2NF look. Notice that social security number relates exclusively to employee, regardless of department. Also, department manager relates exclusively to department, regardless of what employees work there. As a result, both columns (SSN and DeptMgr) have redundant data in them. Percent time, however, relates to a specific employee in a specific department, and is a valid attribute for this associative entity.
Problems that arise from this 2NF violation:
|
The Whole Key
| |
Why it Matters: Data Problems Redundant data (e.g., many occurrences of the same description):
Correcting the Model
|
Try It: 2NF
| |
Part 1 Directions: Which columns violate 2NF? And what tables would you move them to? Select your answers from the drop down menus.![]()
Part 2 Directions: Did this Solve All of the Problems? Choose from the drop-down menu for each issue.
|
3rd Normal Form
| |
What is 3rd Normal Form? Every attribute describes the entity's PK rather than describing another attribute or the PK of a missing entity: The attributes depend on nothing but the key. This rule is essentially the same as 2NF, except that it applies equally to entities with single-column PKs. Its purpose is similar, and the issues that arise related to violations are similar to 2NF. Likewise, the 2NF methods for correcting it are appropriate.
|
Example 3NF Violation
| |
Here is how attributes that violate 3NF look. Notice that department name relates exclusively to department, not to employee at all. As a result it has redundant data for every employee in the same department.
Problems with this 3NF violation:
|
Extraneous Attributes
| |
Why it Matters: Data Problems Redundant data (e.g., many occurrences of the same description):
Recognizing the Pattern
Avoiding the Problem: Use sample data during design to identify this problem early.
Correcting the Model
|
Try It: 3NF
| |
Assume for this model (unlike the Podunk University model) that an employee
may work in only one department. Move the attributes that relate exclusively
to a portion of the PK to the entity that has that PK: ![]()
Show Model (Displays the complete model.) Directions: Did this Solve All of the Problems? Choose from the drop-down menu for each issue.
|
4th and 5th Normal Forms
| |
4NF and 5NF are mentioned here only for your information. The vast majority of models never apply these levels. Essentially these NFs are designed to impose the same level of consistency within a PK composed of more than two columns as the first 3NFs impose on attributes outside the PK. Entities with more than two columns in the PK often contain no non-key attributes. If non-key attributes do exist, 4NF and 5NF violations are unlikely because bringing the model into 3NF compliance precludes them. Usually 4NF and 5NF violations occur when the definition of the information to be represented is ambiguous (e.g., the user has either not really understood what they are asking for, or they have failed to state it clearly enough for the designer to understand it). 4NF and 5NF really represent two flip sides of the same issue: The PK must contain the minimum number of attributes that accurately describe all of the business rules.
|
Formal Definitions
| |
|
Normalization Review
| |
All NFs The normalization rules are a precise way to insure that all of the attributes are placed to:
|
Summary of NFs
| |
1NF: No repeating attributes or groups of attributes. 2NF: Attributes must not relate to a portion of the PK. 3NF: Attributes must not relate to attributes outside of the PK.
|
Summary of Corrective Measures
| |
Whenever a violation is identified:
|
Try It: Normalization
| |
The Challenge: We're being asked to migrate the hardware enterprise's
legacy system into the new design. It is one big Excel spreadsheet, that
maps directly to the PO, so it has several normalization violations.
Directions: For each NF, click the checkbox for each attribute that violates that NF. Click N/A if there is no violation for that NF. |
| 7: Modeling Time |
Objectives
| |
This module enables the student to create a logical model that includes the historical data.
|
Modeling History
| |
So far, we've looked at models that reflect the current state of a business, ignoring the passage of time. In reality, businesses (and their LDMs) must track history of all their major functions. Often the real on-going value in a data warehouse depends on the historical detail available. As in all other aspects of modeling, it is crucial to understand the business requirements.
|
History Example
| |
Our hardware enterprise currently includes the Part table shown on the left. Notice that while we can tell the price of a part, we have no way of tracking changes in the price over time. ![]() To track history, we'll clearly need to add the date each price becomes effective for each part. We can't add multiple dates into an existing Part table row, because it would violate 1NF (no repeating attributes). Instead we create a history table in which the PK is composed of the Part# (FK from the Part table) and the part price effective date (a non-FK component of the PK). This essentially creates a M:M relationship between Part and Calendar Date (an implied entity). Now we can track as many price changes as we need to.
|
Current Status vs. History
| |
Notice that we removed UnitPrice from the Part table. The most recent date in the history table is the current price. If kept in the Part table it would be redundant derived data, which we always strive to avoid in the LDM. During PDM the designer may elect to carry a "current status" attribute in the original table as a way of reducing processing overhead. It is not correct to model it into the LDM.
|
History Modeling Summary
| |
|
Try It: Model History
| |
Podunk University has decided that they need to track all department assignments across time, including cases where someone works in the same department more than once with a time gap in between. Right now their model only shows the original start date for each employee in each department, but no end date. And while they're making changes, they'd like to track percent time for each employee in each department assignment. How would you modify their model (below)? Recommendation: To save time, effort and errors do this exercise first on paper. Then, do the exercise on-line, using your written model as the basis for your answers. Directions: For each purple column name below click as many checkboxes as required to indicate all of its role(s) in a more appropriately modeled Employee/Department History entity. When you're done, click the Check button.
|
| 8: Using ER Diagrams |
Objectives
| |
This module enables the student to read and interpret an LDM in ER Diagram format.
|
Overview
| |
Types of Models LDMs may be used at varying levels of detail:
|
ERD Definition Review
| |
An LDM is ultimately documented in an ERD (Entity-Relationship Diagram). The ERD is a standard modeling technique for defining and graphically representing an LDM:
Note: ERDs can be LDMs or PDMs. In actual use, LDM ERDs don't usually have abbreviations, but PDM ERDs frequently do due to the physical database name size limitations. Abbreviations are used in this course to reduce scrolling for your convenience.
|
Translating from Tables to ERDs
| |
Throughout this module ERDs are shown for various data modeling scenarios covered earlier in the course. For each, a button is provided so that you can see the equivalent model using table layouts for comparison. This course is using the IDEF1x notation. There are other notation styles such as Information Engineering (uses the crows foot to designate the many side of a relationship.)
|
ERD Symbols
| |
ERD Development Tools There are several commercially available tools for developing ERDs, and each has a slightly different set of standards and symbols. We will be using ERwin in this course because it is the most commonly used in the field.
|
ERD Conventions
| |
The following conventions are used in this course, and in ER diagramming in general. The modeling notation shown is IDEF1X and is ERwin compatible. Note: Some customers and modelers prefer using IE notation.
|
Dependent Entity
| |
Entity Types
|
Dependent Entity Example
| |
In our hardware enterprise model, we defined a M:M relationship for PO/Part,
shown in an associative entity. In that model, a customer can order the
same part only once on any given PO. This means that if a customer initiates
a PO and orders 5 widgets, then calls back a day later (before the PO
is filled) and wants to add 6 more widgets to the PO, the order entry
person must update the current PO/Part record and change the order quantity
from 5 to 11. This may in fact be fine. Or it may not, in which case a
change to the model is required (shown below). If the hardware vendor
wants to track this kind of change, they may implement a constraint that
says that order quantity may not be changed, but that a new line item
should be added to the PO instead to reflect changes. This would allow
the same customer to order the same part multiple times on the same PO.
The model would look like this: PurchaseOrder is an independent entity.
POLineItem is a dependent entity. A line item has no existence without
its PO. Note that there is still a M:M relationship between PO and Part,
but that now it is mediated (indirect) through PO/LineItem.
|
Sub-Entity
| |
Review: Sub-Entity = Sub-Type Sub-entities are used to facilitate the relational rule that all of the attributes in an entity (a "Super-Entity") should describe all of its rows. If some attributes in fact apply to only a defined subset of the rows it may be modeled as a new entity, called a sub-entity or sub-type. Sub-entities that are split out from one entity have:
|
Sub-Entity Example
| |
The sub-types for an entity are shown in rounded boxes below a sub-type symbol.
|
Types of Sub-Types
| |
![]() See Table Equivalent
|
ERDs and NULLS
| |
In ERwin, you can't show NULL conditions in the logical model, but you can in the physical model. So, this can be a way to tell if an ERD is an LDM or a PDM. There is no way to show in ERwin notation that NULLs are or are not allowed in a given attribute. Conditions related to NULLs are documented in the constraints list, rather than the diagram. This is one place where table layouts with sample data can be very helpful.
|
Entity State NULLs
| |
An entity state NULL occurs if the attribute instance might or might not be NULL, depending on the state of the particular entity instance in which it occurs. In practical terms this means that an identifiable subset of the rows in a table may have NULLs in this attribute. This condition is a warning flag that a sub-entity may need to be defined.
|
Entity Type NULLs
| |
An entity type NULL occurs if the attribute instance is always NULL for the same types of instances of the entity. In practical terms this means that an identifiable subset of the rows in a table must have NULLs in this attribute. This condition indicates that a sub-entity does need to be defined. An example would be the Hourly/Salaried Employees as shown above. Hourly Employees always have NULL in MoAmt and Bonus. Salaried Employees always have NULL in HrAmt and Overtime.
|
Relationship Types
| |
Identifying vs. Non-Identifying
|
Identifying Relationship Example
| |
PurchaseOrder to POLineItem is a dependent relationship. POLineItem to Part is an independent relationship. Notice that this is the same model shown above to distinguish independent from dependent entities.
|
ERD and M:M
| |
Several Ways There are several ways to show M:M relationships on an ERD. Which one you use depends on the nature of the M:M relationship and whether or not you need to track attributes for the association.
|
M:M without Non-key Attributes
| |
![]() See Table Equivalent
|
M:M with Non-key Attributes
| |
![]() See Table Equivalent
|
M:M Recursive Relationship without Non-key Attributes
| |
![]() NOTE: This model could also be shown "resolved" as is the above example. In that case it would show a Part/Component Part entity, without additional attributes. This is the way M:M relationships are shown in many models. It is the only way to show it if there are attributes for the Part/Component Part entity. See Table Equivalent
|
ERDs Minimize M:M Relationships
| |
There will be many M:M relationships in every business, and these relationships can get extremely complex (the number of FKs in the PK is theoretically unlimited). To simplify the ERD the designer may choose to include associative entities only for relationships where actual data about the association is being tracked, while showing unattributed M:M relationships as a line with double labels (see example above). When using table layouts, all associative entities are always shown, for clarity, consistency and accuracy.
|
ERD vs. Table Layouts
| |
In Practical Reality Often design teams do not have time to specify sample data and draw out tables. They usually model directly into an ERD and just discuss sample data along the way as possible. Sample data may not always be available. Frequently the designers have to do the best they can with what they've got and then validate.
|
Try It: ER Diagrams
| | Step 1 of 3 |
Step 2 of 3
| |
As a result of Step 1, we've drawn the ERD below, showing independent
entities in square boxes, and associative entities in rounded boxes. The
next step is to modify the diagram to show PKs.
Directions: For each table, choose the the PK by clicking on the bottom-most attribute that is part of the PK.
|
Step 3 of 3
| |
| ||||||||||||||||||||||||||||||||||
| 9: Resources |
Data Modeling History
| |
This is optional additional information about the evolution of data modeling as a field, and is not necessary to the practical understanding or utilization of data modeling theory and techniques.
|
Relational Theory's Founding Fathers
| |
|
1960's: Real Men Automate Processes!
| |
|
1970's: Come On People, Now, Let's Get Together!
| |
|
1980's: We Don't Need No Stinking Process Models!
| |
|
1990's: Come On People, Now, Let's Get Together! (redux)
| |
|
For More Info
| |
|
Glossary
| |
This Glossary of Logical Data Modeling terminology includes industry standard definitions.
|
| |
1st Normal Form Each non-key attribute in the entity is a single-occurrence fact that is functionally dependent upon the PK. There should be a 1:1 relationship between the PK and each attribute in an entity. There should be no repeating groups.
2nd Normal Form
3rd Normal Form
4th Normal Form
5th Normal Form Aggregate Attribute: SEE: Derived/summary attribute.
Aggregate Entity
Application Data
As-Is Assertion: SEE: Constraint
Associative Entity
Atomic Data Object
Attribute
Availability (of data) BDO: SEE: Business Data Object
Big Bang Data Warehouse Business Data Item: SEE: Business Data Object
Business Data Object
Business Process Model
Business Question
Business Question Analysis
Business SME
Candidate Key
Cardinality
Categorization: SEE: Generalization.
Cell Column: SEE: Attribute
Complete Generalization
Compound Key
Constraint
Contradicting Definitions
Coverage (of data)
Data Accuracy
Data Definition Language
Data Demographics
Data Object
Data Manipulation Language
Data SME
Data Source
Data Sourcing Model
Data Warehouse
Database DDL: SEE: Data Definition Language. Decision Support
Degree
Demographics: SEE: Data Demographics and Query Demographics
Dependent Entity Derived Data: SEE: Derived/Summary Attribute.
Derived/ Summary Attribute
Dimension DML: Data Manipulation Language.
Domain DSS: Decision Support System. DW: Data Warehouse or Data Warehousing.
Dynamic Materialization (of a BDO)
Element
Entity
Entity Pool
Entity Relationship Diagram
Entity State Null
Entity Type
Entity Type Null ERD: Entity Relationship Diagram.
Facet FK: Foreign Key.
Foreign Key
Generalization
Hierarchy Relationship
Homograph
IDEF0
IDEF1X
Identifying Relationship
Incomplete Generalization
Independent Entity
Industry LDM Instance (of a data object)
Instance Diagram Intersection Entity: SEE: Associative Entity.
Isomorphic LDM
Join
Key Attribute
Key Migration
Latency (of data) LDM: Logical Data Model. LDO: Logical Data Object. Logical Data Item: SEE: Logical Data Object.
Logical Data Model
Logical Data Object
Many-to-Many Relationship
Metadata MRA: SEE: Multi-relationship Assertion.
Multi-relationship Assertion
Network Relationship
Non-identifying Relationship
Normalization
Null Occurrence: SEE: Instance and Row. Occurrence Diagram: SEE: Instance Diagram.
One-to-Many Relationship
One-to-One Relationship
Orphan
Orthogonal PDM: Physical Data Model. PDO: Physical Data Object. Physical Data Item: SEE: Physical Data Object.
Physical Data Model
Physical Data Object PK: Primary Key.
Populate
Populated Primary Key
Quality (of data)
Query
Query Demographics
Recursive Relationship
Redundant Data
Referential Integrity
Relation
Relationship
Relationship Name
Report
Row
Select
Set Theory SME: Subject Matter Expert.
Snow Flake Schema Source Data Item: SEE: Source Data Object.
Source Data Object
Star Schema
Static Materialization (of a BDO)
SEE Dynamic Materialization. Sub-Entity: SEE: Generalization.
Subject Area
Subject Matter Expert Sub-type: SEE: Generalization. Summary Attribute: SEE: Derived/Summary Attribute. Super-type: SEE: Generalization. Table: SEE: Entity.
To-Be Tuple: SEE: Row.
View
Visible
|