Semantic Validation

Whenever you import a SQL query, Single Origin breaks it down using a common definition model. The results go through an automated validation flow comparing potential views, dimensions, and data entities (from the import) to existing views, dimensions, and data entities (from production). We flag both exact and similar matches.

πŸ“˜

Deeper Dive

For a more technical dive into Single Origin's semantic validation, see this blog post!

How does semantic validation work?

At a high level, we apply distributive and commutative properties to applicable operators and transform a SQL query into a canonical format. Recall the example query from our case study:

SELECT
  oi.created_at,
  u.age,
  COUNT(DISTINCT oi.order_id) AS count_orders
FROM
  `sandbox-demo-db.thelook_ecommerce.users` AS u
  JOIN `sandbox-demo-db.thelook_ecommerce.order_items` AS oi ON u.id = oi.user_id
WHERE
  oi.status = 'Returned'
  AND oi.created_at >= '2022-01-01'
GROUP BY
  created_at,
  age

This led to an orders view:

SELECT
  *
FROM
  `sandbox-demo-db.thelook_ecommerce.users` AS u
  INNER JOIN `sandbox-demo-db`.thelook_ecommerce.order_items AS oi ON u.id = oi.user_id

We mentioned that multiple sources can lead to the same view & semantic validation would flag this, so what other queries could we import with the same (or similar) view?

Exact Match

The following query is logically equivalent to "Returns by Age," although the SQL style is different:

WITH users AS (
  SELECT
    id AS user_id,
    age AS user_age
  FROM
    `sandbox-demo-db.thelook_ecommerce.users`
),
orders AS (
  SELECT
    user_id,
    created_at AS created,
    order_id AS id
  FROM
    `sandbox-demo-db.thelook_ecommerce.order_items`
  WHERE
    status = 'Returned'
    AND created_at >= '2022-01-01'
)
SELECT
  orders.created,
  users.user_age,
  COUNT(DISTINCT orders.id) AS count_orders
FROM
  users
  JOIN orders ON users.user_id = orders.user_id
GROUP BY
  1,
  2

When we transform this SQL into a canonical format, we remove column aliases, standardize the join syntax, etc. Thus, "Returns by Age V2" has the exact same view, dimensions, and data entity as "Returns by Age."

Same View

Suppose that in addition to tracking returns by age, we want to track cancellations by gender. To pull this data, we execute the following query:

SELECT
  oi.created_at,
  u.gender,
  COUNT(DISTINCT oi.order_id) AS count_orders
FROM
  `sandbox-demo-db.thelook_ecommerce.users` AS u
  JOIN `sandbox-demo-db.thelook_ecommerce.order_items` AS oi ON u.id = oi.user_id
WHERE
  oi.status = 'Cancelled'
GROUP BY
  1,
  2

When we transform this SQL into a canonical format, we remove grouping and filtering to generate the view; thus, "Cancellations by Gender" can reuse the orders view. All that needs to be imported is two dimensions: status and gender.

Similar View

Suppose that we want to track returns by product brand, as well as user age. To pull this data, we execute the following query:

SELECT
  oi.created_at,
  u.age,
  p.brand,
  COUNT(DISTINCT oi.order_id) AS count_orders
FROM
  `sandbox-demo-db.thelook_ecommerce.users` AS u
  JOIN `sandbox-demo-db.thelook_ecommerce.order_items` AS oi ON u.id = oi.user_id
  JOIN `sandbox-demo-db.thelook_ecommerce.products` AS p ON p.id = oi.product_id
WHERE
  oi.status = 'Returned'
  AND oi.created_at >= '2022-01-01'
GROUP BY
  created_at,
  age,
  brand

This generates a view (called orders with products) that looks like:

SELECT
  *
FROM
  `sandbox-demo-db`.thelook_ecommerce.users AS u
  INNER JOIN `sandbox-demo-db`.thelook_ecommerce.order_items AS oi ON u.id = oi.user_id
  INNER JOIN `sandbox-demo-db`.thelook_ecommerce.products AS p ON p.id = oi.product_id

orders with products looks similar to orders, but we cannot be sure they are the same: the additional join could filter out some data. The validation flow will thus flag this as a similar view, but not an exact match.