
Geek corner

Why do I need a schema anyway?

posted Feb 16, 2016, 1:41 PM by Michael Holleran   [ updated Feb 16, 2016, 1:41 PM by Laura Carrubba ]

"Art is limitation; the essence of every picture is the frame." — G.K. Chesterton


This article is not yet another argument in the tiresome SQL vs. NoSQL debate. I think both technologies have their place. Instead, it explains what a schema offers when your data is structured enough to benefit from one.

Most NoSQL databases store data either in key/value form, or as XML/JSON documents. In almost all cases, they lack the concept of a schema. This presents certain advantages: programmers can store any data they want, they can change how they store the data over time without migrating old data, and so on. That makes sense for unstructured data, but when it comes to structured data, these advantages are offset by significant, and (I think) under-reported, downsides regarding the value of the data and its long-term viability.

In this article, I describe how a schema can be an important asset when dealing with many types of data, and how the concept of schema can be extended to make it even more useful. 

Why a schema?

When writing software, we usually think of what the system is supposed to do. We should also think about what the software is not supposed to do.

In many ways, that's what a schema does. It's a way to define how data should behave, and how it should not behave. It's a way to draw the line between the "good" space, where data is consistent, and the "bad" space, where data is not consistent.

That is the main purpose of a schema. It's not a crutch to help the database engine. It's not an arbitrary set of limits created solely for the purpose of frustrating the programmer's creativity. It's about carving out a well-defined area in an infinite space of possibilities.

Advantages of having a schema

As a communication tool
The first advantage of having a schema is that it brings structure. This may sound tautological but I don't think it is. Having a formally defined structure for your data means that all parts of the system will have at least that much in common. A schema diagram is a great tool for communicating in a team.

As an error-catching mechanism
Having a well-defined schema will catch errors that would otherwise go undetected: null values where there shouldn't be any, misspelled attribute/column names, out-of-range values, broken referential integrity, and so on.

Problem                Example
Invalid data           Product price = true (meaningless -- should be a number)
Missing data           Line item does not have a price
Extraneous data        Line item has an extra attribute named "Color" -- we don't know what it means
Referential integrity  Order does not belong to any customer
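The four problems in the table can be reproduced with a small sketch using Python's built-in sqlite3 module. The table and column names (customer, orders, line_item) are hypothetical, chosen only for illustration. Because SQLite would store a Python True as the integer 1, a non-numeric string stands in for the invalid price here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );
    CREATE TABLE line_item (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id),
        price REAL NOT NULL
            CHECK (typeof(price) IN ('integer', 'real') AND price >= 0)
    );
    INSERT INTO customer (id, name) VALUES (1, 'Ada');
    INSERT INTO orders (id, customer_id) VALUES (10, 1);
""")


def rejected(sql, params=()):
    """Return the error class name if the database refuses the statement, else None."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return None
    except sqlite3.Error as exc:
        return type(exc).__name__


results = {
    "invalid data": rejected(
        "INSERT INTO line_item (order_id, price) VALUES (10, ?)", ("free",)),
    "missing data": rejected(
        "INSERT INTO line_item (order_id) VALUES (10)"),
    "extraneous data": rejected(
        "INSERT INTO line_item (order_id, price, color) VALUES (10, 9.99, 'red')"),
    "referential integrity": rejected(
        "INSERT INTO orders (id, customer_id) VALUES (11, 999)"),
    "valid row": rejected(
        "INSERT INTO line_item (order_id, price) VALUES (10, 9.99)"),
}

for problem, outcome in results.items():
    print(f"{problem}: {outcome or 'accepted'}")
```

Each bad row is refused by the database itself, with no validation code in the application; only the last, well-formed row is stored.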

Discoverability - reports, other apps, etc...
Another under-appreciated benefit of a schema is the discoverability it brings to your data. A well-defined schema means that other systems may also be able to use your data: ELT tools, reporting tools, even app generators.

For performance
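A schema makes indexing easier: the database knows which columns exist and what types they hold, so indexes can be declared on them.
A schema also informs how the database retrieves your data, since the query planner can pick among those indexes.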
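As a rough illustration of the indexing point, here is a sketch using Python's built-in sqlite3 module. The orders table and the index name idx_orders_customer are hypothetical. It declares an index on a known column and asks SQLite how it would run a lookup on that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        total REAL NOT NULL
    );
    -- The index can be declared because the column is part of the schema.
    CREATE INDEX idx_orders_customer ON orders (customer_id);
""")

# EXPLAIN QUERY PLAN describes the access path without running the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = ?", (42,)
).fetchall()

# The last column of each plan row holds the human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The exact wording of the plan varies between SQLite versions, but it should name idx_orders_customer rather than describe a full table scan.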

For migration
Perhaps most importantly, a schema will make migrating the data much easier. Data tends to outlive applications. Your data will have to be transformed in any number of ways over its lifetime.

As Sarah Mei recently wrote in her remarkably clear and cogent piece:

"Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value." -- Sarah Mei

Disadvantages of having a schema

It takes more time up front.

You can't store whatever you feel like.

You have to learn some data modeling.

There is nothing wrong with storing schema-less data if that makes sense for your particular problem. But we should stop pretending that NoSQL is the best solution for everything.

The C in ACID

posted Feb 16, 2016, 1:41 PM by Michael Holleran   [ updated Feb 16, 2016, 1:41 PM by Laura Carrubba ]

Everyone who works with databases is familiar with the acronym ACID, which lists the attributes of a proper transaction. It should be:
  • Atomic
  • Consistent
  • Isolated
  • Durable
We all know about the A and the D -- they're relatively intuitive. Far fewer people truly understand the I, but that's for another article. Today, I'd like to focus on the C. What exactly does it mean for data to be consistent?

Consistent means that the data is in a valid state; in other words, it follows the definition of the schema. For instance, if the column is defined as NOT NULL, it shouldn't ever be null. If it's defined as a foreign key, then the referred object should always exist. The list goes on.

Many databases allow you to go further and define domains. For instance, perhaps the customer's status should be one of Bronze, Silver or Gold, or the customer's age should be between 0 and 125.
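The two domains mentioned above can be sketched with CHECK constraints, here using Python's built-in sqlite3 module. The customer table and the try_insert helper are hypothetical, and other databases offer richer mechanisms for declaring domains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        status TEXT NOT NULL CHECK (status IN ('Bronze', 'Silver', 'Gold')),
        age INTEGER NOT NULL CHECK (age BETWEEN 0 AND 125)
    )
""")


def try_insert(name, status, age):
    """Return True if the row satisfies the declared domains, else False."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (name, status, age) VALUES (?, ?, ?)",
                (name, status, age),
            )
        return True
    except sqlite3.IntegrityError:
        return False


valid_customer = try_insert("Ada", "Gold", 36)     # inside both domains
bad_status = try_insert("Bob", "Platinum", 40)     # status outside its domain
bad_age = try_insert("Cy", "Silver", 130)          # age outside its domain

stored = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(valid_customer, bad_status, bad_age, stored)
```

Only the first row is stored; the other two never reach the table.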

These definitions are good and useful because they are easy to declare, and once they are declared, you don't have to think about them. The database is going to do whatever it needs to do to make sure that these definitions remain true, no matter what happens to the data.

For anything more complicated, you typically have to use triggers and stored procedures -- not that there's anything wrong with that, mind you. Triggers and stored procedures participate in transactions, and therefore are part of consistency. In fact, they can be considered to be part of the schema, if you use the term loosely.
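To make the point about transactions concrete, here is a sketch of a trigger that maintains an aggregate, using Python's built-in sqlite3 module. The orders and line_item tables and the trigger name are hypothetical. The trigger only handles inserts; a complete version would also need triggers for updates and deletes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL DEFAULT 0);
    CREATE TABLE line_item (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id),
        price REAL NOT NULL
    );
    -- Add each new line item's price to its order's total.
    CREATE TRIGGER line_item_added AFTER INSERT ON line_item
    BEGIN
        UPDATE orders SET total = total + NEW.price WHERE id = NEW.order_id;
    END;
    INSERT INTO orders (id) VALUES (1);
""")


def order_total():
    return conn.execute("SELECT total FROM orders WHERE id = 1").fetchone()[0]


# Committed insert: the trigger's update is committed along with it.
with conn:
    conn.execute("INSERT INTO line_item (order_id, price) VALUES (1, 10.0)")
total_after_commit = order_total()

# Uncommitted insert: the trigger has fired, but inside the open transaction.
conn.execute("INSERT INTO line_item (order_id, price) VALUES (1, 5.0)")
total_inside_txn = order_total()

# Rolling back undoes the insert and the trigger's update together.
conn.rollback()
total_after_rollback = order_total()

print(total_after_commit, total_inside_txn, total_after_rollback)
```

The total never drifts away from the line items, because the trigger's work is committed or rolled back with the statement that fired it.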

But of course, triggers and stored procedures are going to be vendor-dependent, and are often difficult to write and debug. In addition, they add to the database load, which can lead to scalability issues. So the non-trivial logic is often defined in the middle tier, using a language like C#, Java, Python, etc...

There is a big gap between declaring a schema, and writing procedural code. Defining a constraint as part of a schema is (comparatively) easy, and you don't have to explain what it means to the database. For instance, a foreign key definition will automatically cover inserts, updates and deletes. Not only that, but it's also self-documenting: everyone will know what it means.
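The claim that a single foreign key declaration covers inserts, updates and deletes can be checked with a short sketch using Python's built-in sqlite3 module. The customer and orders tables and the violates helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );
    INSERT INTO customer (id, name) VALUES (1, 'Ada');
    INSERT INTO orders (id, customer_id) VALUES (10, 1);
""")


def violates(sql):
    """Return True if the database rejects the statement as an integrity violation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


checks = {
    # Insert an order for a customer that does not exist.
    "insert": violates("INSERT INTO orders (id, customer_id) VALUES (11, 999)"),
    # Repoint an existing order at a customer that does not exist.
    "update": violates("UPDATE orders SET customer_id = 999 WHERE id = 10"),
    # Delete a customer who still has an order.
    "delete": violates("DELETE FROM customer WHERE id = 1"),
}
print(checks)
```

One REFERENCES clause guards all three operations; the equivalent procedural code would need a separate check for each.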

As soon as you start writing procedural code (whether in triggers and stored procedures, or other languages), you're leaving all that behind, and taking responsibility for a lot of things. You have to make sure that your code does the right thing at the right time, and in particular, you're responsible for managing the dependencies between the bits of code that you may have. This problem is exacerbated when the logic governing the data is expressed in more than one place. It's not unusual to have some of that logic defined in triggers and stored procedures, some in the middle tier, and (shudder) even some in the presentation layer. Getting a global view of how all this logic works is daunting. Changing any of it can be a frightening proposition, since a seemingly innocent change may trip any number of non-obvious dependencies.

Wouldn't it be nice to be able to do more than trivial definitions as part of the schema? What if we could extend schema definition to include higher-level constructs, like complex derivations, aggregates, and multi-table validations? That wouldn't solve all of our problems, but it would allow us to work at a higher level of abstraction.

That's what database reactive programming aims for. We're pushing the declarative aspect of database schemas to a whole new level. By doing so, we want to capture more of the logic as declarations, and less as code.
