Messy reality of data modelling
I gave a presentation at $work about data modelling, based on William Kent’s Data and Reality (read the second edition, not the third). As expected, the discussion turned out to be quite philosophical.
How difficult is it to define “one thing” in the database? It turns out there’s no absolutely correct answer.
(The rest of the post is the presentation itself)
Data Modeling: The Challenge of Vagueness
Entities are a state of mind. No two people agree on what the real world view is.
- Metaxides
Introduction
- Data modeling seems simple: map real-world things to database structures
- Reality: It’s profoundly complex and philosophical
- An information system models a “small, finite subset of the real world”
- But what exactly is that subset, and how do we define it?
The Deceptive Simplicity
We expect:
- One record in the employee file for each person employed
- Clear correspondence between database constructs and real-world things
But this immediately runs into trouble…
Four Fundamental Questions
- What constitutes “one thing”?
- When are two things “the same thing”?
- How do we handle change while maintaining identity?
- What categories should we use to classify things?
Question 1: What is “One Thing”?
That appears at first to be a trivial, irrelevant, irreverent, absurd question.
It’s not.
The Parts Example
Consider a parts inventory system:
- Does “part” mean one physical object?
- Or does it mean one kind of part?
Part #A123: Quantity 500 (in Warehouse 1)
Part #A123: Quantity 200 (in Warehouse 2)
Is this one thing or many things?
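One way to make the ambiguity visible is to model the kind of part and the stock held at a location as separate records. This is only a minimal sketch; the class names `PartType` and `StockLine` are hypothetical and not taken from Kent's book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartType:
    """One *kind* of part, identified by its part number."""
    part_number: str


@dataclass(frozen=True)
class StockLine:
    """A quantity of one part type held in one warehouse."""
    part: PartType
    warehouse: str
    quantity: int


a123 = PartType("A123")
stock = [
    StockLine(a123, "Warehouse 1", 500),
    StockLine(a123, "Warehouse 2", 200),
]

# Three defensible answers to "how many parts are there?"
kinds_of_part = len({line.part for line in stock})         # distinct part types
stock_lines = len(stock)                                   # rows in the inventory
physical_objects = sum(line.quantity for line in stock)    # items on the shelves
```

The same two inventory rows give 1, 2, or 700 depending on which meaning of "part" the question intends.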
The Book Example
What is “one book”?
- The abstract work (regardless of language or edition)
- A specific edition
- A specific physical copy
- A specific printing
Library database vs. Bookstore database vs. Publisher database
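The levels above can be sketched as a small hierarchy in which "how many books?" depends on the level you count at. This is an illustrative sketch only: the `Work`, `Edition`, and `Copy` classes, the barcodes, and the `count_books` helper are all assumptions, and the printing level is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    """The abstract work, regardless of edition."""
    title: str


@dataclass(frozen=True)
class Edition:
    work: Work
    label: str


@dataclass(frozen=True)
class Copy:
    """One physical copy on a shelf (barcode is a made-up identifier)."""
    edition: Edition
    barcode: str


work = Work("Data and Reality")
second = Edition(work, "2nd edition")
third = Edition(work, "3rd edition")
copies = [Copy(second, "c-001"), Copy(second, "c-002"), Copy(third, "c-003")]


def count_books(copies, level):
    """Count distinct 'books' at the chosen level of abstraction."""
    key = {
        "work": lambda c: c.edition.work,
        "edition": lambda c: c.edition,
        "copy": lambda c: c,
    }[level]
    return len({key(c) for c in copies})
```

A publisher might count at the "work" or "edition" level, while a library that lends physical items would more likely count at the "copy" level.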
The Warehouse Example
What is “one warehouse”?
- A single building?
- A group of buildings?
- A floor within a building?
The IBM location in Santa Teresa has one building number but eight distinct towers called “building A”, “building B”, etc. How many buildings are there?
The Healthcare Example
What is “one patient record”?
- All information about a person across their lifetime?
- Information from one hospital visit?
- Information related to one condition?
- Information accessible to one provider?
Question 2: How Many Things Is It?
A single entity can be multiple things simultaneously in our data model.
The Soccer Player Example
When Joe Smith, playing halfback, scores a goal, data about two things is modified:
- The number of goals by Joe Smith
- The number of goals by a halfback
That human figure is represented as (and is) two things.
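In code, a single real-world event ends up touching two separate tallies. The sketch below assumes two hypothetical counters, `goals_by_player` and `goals_by_position`, and a hypothetical `record_goal` helper.

```python
from collections import Counter

goals_by_player = Counter()
goals_by_position = Counter()


def record_goal(player, position):
    """One goal on the field updates data about two 'things'."""
    goals_by_player[player] += 1
    goals_by_position[position] += 1


record_goal("Joe Smith", "halfback")
```

Nothing in the data itself says whether the goal "belongs" to the player or to the position; the model simply records it under both.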
The Healthcare Example
A doctor in a hospital system might be:
- An employee (HR system)
- A care provider (clinical system)
- A researcher (research database)
- A resource (scheduling system)
Each with different attributes and relationships.
The Dual Role Example
A husband and wife both work for the same company.
Each person must be considered twice:
- Once as an employee
- Once as a dependent of an employee
How many people are involved?
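A small sketch shows how the count depends on what is being counted. The names `Person`, `Employment`, and `Dependency`, and the two sample people, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Employment:
    person: Person


@dataclass(frozen=True)
class Dependency:
    dependent: Person
    employee: Person


alex, sam = Person("Alex"), Person("Sam")

# Both work for the company, and each is the other's dependent.
employments = [Employment(alex), Employment(sam)]
dependencies = [Dependency(alex, sam), Dependency(sam, alex)]

role_records = len(employments) + len(dependencies)
people = {e.person for e in employments} | {d.dependent for d in dependencies}
```

Counting role records gives four; counting distinct `Person` values gives two. A system that stores employees and dependents in unrelated files has no way to arrive at the second number.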
Question 3: The Challenge of Change
How much can something change before it becomes something else?
The Car Example
Suppose you and I start trading parts of our cars: tires, wheels, transmissions, suspensions, and so on.
At what point have we exchanged cars?
The DMV’s arbitrary decision: the “essence” of a car is the engine block.
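The engine-block rule amounts to choosing an identity key and ignoring every other attribute. A minimal sketch, assuming cars are plain dictionaries and that `identity_key` and `swap` are hypothetical helpers with made-up serial numbers:

```python
def identity_key(car):
    """Under the engine-block convention, only this attribute defines the car."""
    return car["engine_block"]


def swap(a, b, part):
    """Trade one named part between two cars."""
    a[part], b[part] = b[part], a[part]


mine = {"engine_block": "EB-1", "tires": "T-mine", "transmission": "X-mine"}
yours = {"engine_block": "EB-2", "tires": "T-yours", "transmission": "X-yours"}

# Trading everything except the engine block leaves identity untouched.
for part in ("tires", "transmission"):
    swap(mine, yours, part)
identity_after_part_swaps = (identity_key(mine), identity_key(yours))

# Trading the engine block is, by this convention, trading cars.
swap(mine, yours, "engine_block")
identity_after_engine_swap = (identity_key(mine), identity_key(yours))
```

A different convention (say, the chassis) would be just as easy to code and just as arbitrary.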
The Healthcare Example
Patient identity through time:
- Different physical body (cells replace themselves)
- Different mental states
- Different capabilities
- Different diagnoses
Is a patient with dementia the “same person” as before?
The Organization Example
Is it still the same company after changes in:
- Employees? (Of course)
- Management? (Yes)
- Owners? (Maybe)
- Buildings and facilities? (Yes)
- Locations? (Probably)
- Name? (Probably)
- Principal business? (Maybe)
Versions and Time
- When do we discard the old and let the new replace it?
- When do we treat old and new as distinct things?
- When do we try to do both?
“These several things are different versions of the same thing”
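One common way to "do both" is to keep a stable identifier and append a new version whenever something changes. The sketch below is an assumption about how that might look, not a prescription; `CompanyVersion`, `as_of`, and the sample company names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CompanyVersion:
    company_id: int    # stable identity shared by all versions
    name: str
    valid_from: date


history = [
    CompanyVersion(1, "Acme Tools", date(2001, 1, 1)),
    CompanyVersion(1, "Acme Robotics", date(2015, 6, 1)),
]


def as_of(history, company_id, day):
    """Return the version in effect on `day`, or None if none existed yet."""
    candidates = [
        v for v in history
        if v.company_id == company_id and v.valid_from <= day
    ]
    return max(candidates, key=lambda v: v.valid_from, default=None)
```

Here `as_of(history, 1, date(2010, 1, 1))` returns the "Acme Tools" version and a 2020 date returns "Acme Robotics". Whether a renamed company should keep the same `company_id` at all is exactly the modeling decision the questions above are asking about.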
Question 4: Categories and Classification
What is it? In what categories do we perceive the thing to be?
The Employee Example
Does “employee” include:
- Part-time employees?
- Contract employees?
- Employees of subsidiary companies?
- Former employees?
- Retired employees?
- Employees on leave?
- Someone who has accepted an offer but not started?
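Different applications can answer these questions differently, which suggests writing each definition down as an explicit rule. A sketch under invented rules: the `Worker` class, the status values, and both predicates are hypothetical examples, not recommended policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    status: str            # e.g. "active", "on_leave", "retired", "offer_accepted"
    contract: bool = False


def is_employee_for_payroll(w):
    # Assumed rule: payroll covers non-contract staff who are active or on leave.
    return w.status in {"active", "on_leave"} and not w.contract


def is_employee_for_building_access(w):
    # Assumed rule: anyone currently active gets a badge, contractors included.
    return w.status == "active"


workers = [
    Worker("Ann", "active"),
    Worker("Bob", "active", contract=True),
    Worker("Cy", "on_leave"),
    Worker("Di", "retired"),
    Worker("Eve", "offer_accepted"),
]

payroll = {w.name for w in workers if is_employee_for_payroll(w)}
badges = {w.name for w in workers if is_employee_for_building_access(w)}
```

With these rules, payroll sees Ann and Cy as employees while building access sees Ann and Bob. Both lists are "the employees", just for different purposes.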
The Healthcare Example
What is a “patient”?
- Someone currently admitted to the hospital?
- Anyone who has ever received care?
- Someone with an upcoming appointment?
- Someone in the emergency waiting room?
- An unborn fetus being monitored?
Fuzzy Boundaries
“A more amusing example is to imagine a continuum of physical objects between some given chair and table… There will be some strange objects in this continuum which cannot clearly be assigned to either class.”
The Role vs. Category Problem
Is something defined by:
- What it is? (intrinsic nature)
- What it’s used for? (role)
- Where it is? (context)
The same hollow metal tube might be called a pipe, an axle, a lamp pole, a mop handle…
The Changing Category
Categories can change with time:
- A dependent becomes an employee, then a customer
- A slab of marble becomes a sculpture
- A person becomes a patient, then recovers
Practical Implications for Data Modeling
The Arbitrary Nature of Models
- No model is “correct” in an absolute sense
- Models are conventions agreed upon by users
- Different applications may need different models
- Integration requires reconciling these differences
Guidelines for Better Data Modeling
- Acknowledge ambiguity upfront
- Define clear conventions for your specific context
- Document assumptions about identity and categories
- Design for change and evolution
- Consider how different stakeholders view the same entities
Example: Healthcare Patient Model
Option 1: Person-centric
- One record per person
- All encounters, conditions as related entities
- Good for: Longitudinal care, population health
Option 2: Encounter-centric
- One record per hospital visit
- Person as a related entity
- Good for: Billing, operational metrics
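The two options can hold the same facts and still disagree about what "one record" is. The sketch below is illustrative only; `EncounterRecord`, `PersonRecord`, the identifiers, and the charges are all made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EncounterRecord:
    """Option 2: one record per visit, with the person as a reference."""
    encounter_id: str
    person_id: str
    charge: int


@dataclass
class PersonRecord:
    """Option 1: one record per person, with encounters hanging off it."""
    person_id: str
    encounters: list = field(default_factory=list)


encounters = [
    EncounterRecord("E1", "P1", 120),
    EncounterRecord("E2", "P1", 80),
    EncounterRecord("E3", "P2", 50),
]

# Derive the person-centric view from the same encounter facts.
people = {}
for e in encounters:
    people.setdefault(e.person_id, PersonRecord(e.person_id)).encounters.append(e)

records_encounter_centric = len(encounters)
records_person_centric = len(people)
lifetime_charges_p1 = sum(e.charge for e in people["P1"].encounters)
```

The encounter-centric view has three records and makes per-visit billing trivial; the person-centric view has two and makes a longitudinal question such as `lifetime_charges_p1` a direct lookup.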
The Philosophical Reality
“Before we go charging off to design or use a data structure, let’s think about the information we want to represent. Do we have a very clear idea of what that information is like? Do we have a good grasp of the semantic problems involved?”
Remember:
“Becoming an expert in data structures is like becoming an expert in sentence structure and grammar. It’s not of much value if the thoughts you want to express are all muddled.”
Conclusion: Embracing the Challenge
- Data modeling is as much philosophy as technology
- The goal isn’t perfect modeling (impossible) but useful modeling
- Success comes from understanding the inherent vagueness and making deliberate choices
Discussion
- What entities in our organization have ambiguous boundaries?
- Where have we encountered “one thing vs. many things” problems?
- How do we handle identity through change?