From NoSQL to SQL: Turning Key-Value Pairs into Old Fashioned Tables for Analytics
NoSQL databases are great when you're dealing with unstructured data and you don't know what the future will hold. CRMs are a great use case for this – logging ever-increasing customer data that might be generated in a free-form way from different kinds of inputs.
But while this is great for agile development and accepting on-the-fly structural updates, it makes the data difficult to work with for analytics (sorting, grouping, and analyzing columns), especially if you're used to working with tabular/columnar data such as in Excel.
The solution is to pivot the key-value rows into a columnar structure. This can be done by:
- Identifying the key attributes you want to analyze, and then
- Creating a new view of the data.
My favorite approach is one suggested by the talented folks at Panoply: flattening out the data using CASE statements. For example, let's say your data looks like this:
| A_I | ID   | Key    | Value      |
|-----|------|--------|------------|
| 01  | 0001 | Name   | John       |
| 02  | 0001 | City   | New York   |
| 03  | 0001 | Gender | M          |
| 04  | 0002 | Name   | Sally      |
| 05  | 0002 | City   | Washington |
| 06  | 0002 | Gender | F          |
... but for purposes of analytics, you want it to look like:
| ID   | Name  | City       | Gender |
|------|-------|------------|--------|
| 0001 | John  | New York   | M      |
| 0002 | Sally | Washington | F      |
Then, your solution would be:
```sql
SELECT
  id,
  MAX(CASE WHEN "key" = 'Name'   THEN value ELSE NULL END) AS name,
  MAX(CASE WHEN "key" = 'City'   THEN value ELSE NULL END) AS city,
  MAX(CASE WHEN "key" = 'Gender' THEN value ELSE NULL END) AS gender
FROM table_name
GROUP BY 1;
```

(Note that the string literals in the CASE statements have to match the keys exactly as they're stored – here 'Name', 'City', and 'Gender', matching the sample data above.)
To break that down: you're grouping the rows by the ID field, which connects the different key:value pairs, then creating each of the columns via a MAX(CASE ...) expression – the CASE picks out the value for one key, and MAX collapses each group down to that single non-NULL value.
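If you want to experiment with this pivot locally before running it against your warehouse, here's a minimal sketch using Python's built-in sqlite3 module with the sample data from above (the table name `kv` is made up for illustration):

```python
import sqlite3

# In-memory database holding the key-value table from the example above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (a_i INTEGER, id TEXT, key TEXT, value TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?, ?, ?)", [
    (1, "0001", "Name", "John"), (2, "0001", "City", "New York"),
    (3, "0001", "Gender", "M"), (4, "0002", "Name", "Sally"),
    (5, "0002", "City", "Washington"), (6, "0002", "Gender", "F"),
])

# The CASE-statement pivot: one MAX(CASE ...) per key you want as a column
pivot = conn.execute("""
    SELECT id,
           MAX(CASE WHEN key = 'Name'   THEN value END) AS name,
           MAX(CASE WHEN key = 'City'   THEN value END) AS city,
           MAX(CASE WHEN key = 'Gender' THEN value END) AS gender
    FROM kv
    GROUP BY id
    ORDER BY id
""").fetchall()
print(pivot)
# [('0001', 'John', 'New York', 'M'), ('0002', 'Sally', 'Washington', 'F')]
```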
Generally speaking, this CASE-statement approach should work to map whatever key-value pairs you need; you'll need one CASE per distinct key you want as a column. I've found that by the time you really need to rotate a key:value table, you're only looking for a specific subset of the data, so this can help you clear out the noise. You can also take this as an opportunity to cast certain values into the correct types (e.g., casting date strings into a proper datetime format).
Another benefit of this rotation is that you can change the type of the data during the process. For example, I tend to recast UNIX timestamps as human-readable timestamps, because that just makes my life so much easier when sharing the data with non-technical folks.
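As a sketch of that timestamp recasting (the variable name and sample value here are made up for illustration – the same idea applies whether you convert in SQL or downstream):

```python
from datetime import datetime, timezone

# A sample UNIX timestamp, as you might find stored in a key-value 'value' column
raw_ts = 1609459200  # seconds since the epoch

# Recast to a human-readable UTC timestamp for sharing with non-technical folks
readable = datetime.fromtimestamp(raw_ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(readable)  # 2021-01-01 00:00:00
```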
Saving the View
Last but not least, you want to make sure to save the view you've created. Just wrap your query in:
```sql
CREATE VIEW view_name AS
  [...your work here...]
```
The benefit of having this as a view is that the view stays linked to the original data. As rows are updated, added, or replaced in the underlying table, your view will reflect that.
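Here's a quick sqlite3 sketch of that "live" behavior (table, view, and key names are made up for illustration): a row inserted into the source table after the view is created still shows up when you query the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (id TEXT, key TEXT, value TEXT)")
conn.execute("INSERT INTO kv VALUES ('0001', 'Name', 'John')")

# Wrap the pivot in CREATE VIEW so it stays linked to the source table
conn.execute("""
    CREATE VIEW people AS
    SELECT id, MAX(CASE WHEN key = 'Name' THEN value END) AS name
    FROM kv GROUP BY id
""")
before = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]

# Rows added to the source table later show up in the view automatically
conn.execute("INSERT INTO kv VALUES ('0002', 'Name', 'Sally')")
after = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(before, after)  # 1 2
```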
You can also use the CREATE TABLE AS syntax (https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_AS.html), but those tables will be static – great if the data isn't going to be updated, but awful if you need it to continue being refreshed.
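The contrast is easy to see in the same sqlite3 sandbox (names again made up for illustration): a table built with CREATE TABLE AS is a one-time snapshot, so later inserts into the source don't reach it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (id TEXT, key TEXT, value TEXT)")
conn.execute("INSERT INTO kv VALUES ('0001', 'Name', 'John')")

# CREATE TABLE AS takes a one-time snapshot of the query result
conn.execute("CREATE TABLE people_snapshot AS SELECT DISTINCT id FROM kv")

# A row added to the source afterward does NOT appear in the snapshot
conn.execute("INSERT INTO kv VALUES ('0002', 'Name', 'Sally')")
count = conn.execute("SELECT COUNT(*) FROM people_snapshot").fetchone()[0]
print(count)  # 1
```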
Big thanks to the folks at Panoply who helped me work through this problem when we tried to turn Hubspot .json data into something easy to work with in Metabase.
Thank you for reading.
Was this useful? Interesting? Have something to add? Let me know. Seriously, I love getting email and hearing from readers. Shoot me a note at email@example.com with your thoughts and I promise I'll respond.
If you found this interesting, you can sign up for updates when there's a new post. It's really easy: