Semantic Validation
Whenever you import a SQL query, Single Origin breaks it down using a common definition model. The results go through an automated validation flow comparing potential views, dimensions, and data entities (from the import) to existing views, dimensions, and data entities (from production). We flag both exact and similar matches.
Deeper Dive
For a more technical dive into Single Origin's semantic validation, see this blog post!
How does semantic validation work?
At a high level, we apply distributive and commutative properties to applicable operators and transform a SQL query into a canonical format. Recall the example query from our case study:
SELECT
oi.created_at,
u.age,
COUNT(DISTINCT oi.order_id) AS count_orders
FROM
`sandbox-demo-db.thelook_ecommerce.users` AS u
JOIN `sandbox-demo-db.thelook_ecommerce.order_items` AS oi ON u.id = oi.user_id
WHERE
oi.status = 'Returned'
AND oi.created_at >= '2022-01-01'
GROUP BY
created_at,
age
This led to an orders
view:
SELECT
*
FROM
`sandbox-demo-db.thelook_ecommerce.users` AS u
INNER JOIN `sandbox-demo-db`.thelook_ecommerce.order_items AS oi ON u.id = oi.user_id
We mentioned that multiple sources can lead to the same view & semantic validation would flag this, so what other queries could we import with the same (or similar) view?
Exact Match
The following query is logically equivalent to "Returns by Age," although the SQL style is different:
WITH users AS (
SELECT
id AS user_id,
age AS user_age
FROM
`sandbox-demo-db.thelook_ecommerce.users`
),
orders AS (
SELECT
user_id,
created_at AS created,
order_id AS id
FROM
`sandbox-demo-db.thelook_ecommerce.order_items`
WHERE
status = 'Returned'
AND created_at >= '2022-01-01'
)
SELECT
orders.created,
users.user_age,
COUNT(DISTINCT orders.id) AS count_orders
FROM
users
JOIN orders ON users.user_id = orders.user_id
GROUP BY
1,
2
When we transform this SQL into a canonical format, we remove column aliases, standardize the join syntax, etc. Thus, "Returns by Age V2" has the exact same view, dimensions, and data entity as "Returns by Age."
Same View
Suppose that in addition to tracking returns by age, we want to track cancellations by gender. To pull this data, we execute the following query:
SELECT
oi.created_at,
u.gender,
COUNT(DISTINCT oi.order_id) AS count_orders
FROM
`sandbox-demo-db.thelook_ecommerce.users` AS u
JOIN `sandbox-demo-db.thelook_ecommerce.order_items` AS oi ON u.id = oi.user_id
WHERE
oi.status = 'Cancelled'
GROUP BY
1,
2
When we transform this SQL into a canonical format, we remove grouping and filtering to generate the view; thus, "Cancellations by Gender" can reuse the orders
view. All that needs to be imported is two dimensions: status
and gender
.
Similar View
Suppose that we want to track returns by product brand, as well as user age. To pull this data, we execute the following query:
SELECT
oi.created_at,
u.age,
p.brand,
COUNT(DISTINCT oi.order_id) AS count_orders
FROM
`sandbox-demo-db.thelook_ecommerce.users` AS u
JOIN `sandbox-demo-db.thelook_ecommerce.order_items` AS oi ON u.id = oi.user_id
JOIN `sandbox-demo-db.thelook_ecommerce.products` AS p ON p.id = oi.product_id
WHERE
oi.status = 'Returned'
AND oi.created_at >= '2022-01-01'
GROUP BY
created_at,
age,
brand
This generates a view (called orders with products
) that looks like:
SELECT
*
FROM
`sandbox-demo-db`.thelook_ecommerce.users AS u
INNER JOIN `sandbox-demo-db`.thelook_ecommerce.order_items AS oi ON u.id = oi.user_id
INNER JOIN `sandbox-demo-db`.thelook_ecommerce.products AS p ON p.id = oi.product_id
orders with products
looks similar to orders
, but we cannot be sure they are the same: the additional join could filter out some data. The validation flow will thus flag this as a similar view, but not an exact match.
Updated about 2 years ago