Query Audit

Query Auditing is a tool that Admins can use to get insights about sets of queries that exist outside of the Single Origin platform. You can find Query Audit by navigating to "Mission Control > Query Audit". From there you can see a history of all query audits that have been run, as well as create a new one. Clicking a historical audit takes you to the results page for that audit.

Query Auditing has a similar flow to Bulk Import, but it does not create drafts or entities in the Single Origin platform. Instead, Query Auditing can be used to better understand where consolidation opportunities exist in a user's system outside of the Single Origin platform.

Sources

The Query Audit tool requires users to import a set of queries with the following information:

  • the compiled query that was run,
  • who/what ran the query, and
  • a list of tags that you want to associate with the query. Tags could be a single item (like an Airflow task ID or team that owns the pipeline) or multiple items.

The set of queries for an audit can come from one of three sources:

  1. CSV file upload
  2. Compliant Query Table, where the table has the same structure as the CSV files
  3. Query History Table, where the table is the standard query-history audit log for your environment (e.g. Snowflake or BigQuery)

CSV file upload

The required format for CSV file upload is:

  • query: STRING - a SQL query to include in the audit
  • user: STRING - the email of the "owner" of the query, or the email of who "ran" the query
  • tags: STRING (comma separated) - tags of metadata to associate with the query. Tags can be a single item (like a task ID or the team that owns the pipeline) or multiple items

🚧

CSV Header

Please remove the header row from the CSV files: the first row of the CSV should contain a query, user, and tags, not column names.

So an example CSV looks like:

"select lines.L_SHIPDATE as ship_date, parts.P_BRAND as part_brand, AVG(lines.L_DISCOUNT) as avg_discount from `sandbox-demo-db.tpch_sf1.lineitem` lines join `sandbox-demo-db.tpch_sf1.part` parts on lines.L_PARTKEY = parts.P_PARTKEY group by lines.L_SHIPDATE, parts.P_BRAND",[email protected],"tpch_sf1, customer_orders_by_day"
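A row like this can be produced with Python's csv module, which quotes any field containing commas (here, both the query and the tags). A minimal sketch using the example values above:

```python
import csv
import io

# One audit entry: (query, user, tags). Fields containing commas are
# quoted automatically by csv.writer, and no header row is written.
rows = [
    (
        "select lines.L_SHIPDATE as ship_date, parts.P_BRAND as part_brand, "
        "AVG(lines.L_DISCOUNT) as avg_discount "
        "from `sandbox-demo-db.tpch_sf1.lineitem` lines "
        "join `sandbox-demo-db.tpch_sf1.part` parts "
        "on lines.L_PARTKEY = parts.P_PARTKEY "
        "group by lines.L_SHIPDATE, parts.P_BRAND",
        "[email protected]",
        "tpch_sf1, customer_orders_by_day",
    ),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
```

To produce a file instead of an in-memory buffer, open one with `open("query_audit.csv", "w", newline="")` and pass the file handle to `csv.writer`.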

Compliant Query Table

The required format for Compliant Query Table is similar, but we assume that there is a column representing the time the query was executed:

  • start_time: TIMESTAMP - the time the query was executed
  • query: STRING - a SQL query to include in the audit
  • user: STRING - the email of the "owner" of the query, or the email of who "ran" the query
  • tags: STRING (comma separated) - tags of metadata to associate with the query. Tags can be a single item (like a task ID or the team that owns the pipeline) or multiple items

So an example Compliant Query Table looks like:

start_time: 2022-06-18 20:39:14.668604 UTC
query: select lines.L_SHIPDATE as ship_date, parts.P_BRAND as part_brand, AVG(lines.L_DISCOUNT) as avg_discount from sandbox-demo-db.tpch_sf1.lineitem lines join sandbox-demo-db.tpch_sf1.part parts on lines.L_PARTKEY = parts.P_PARTKEY group by lines.L_SHIPDATE, parts.P_BRAND
user: [email protected]
tags: tpch_sf1, customer_orders_by_day

Query History Table

In addition to auditing a set of queries you have curated (into either a CSV or a Compliant Table), you can also pull directly from your "Query History" logs. Note that this functionality assumes that you are exporting your logs in a standard format, e.g. with BigQuery, by exporting audit logs through a standard log sink.

To audit using a query history table, fill in the user email you want to filter to (i.e. the user that ran the queries), as well as the time range in which the queries were run. Then choose a Single Origin Connector that can read tables, enter the name of the table you want to use, and apply any final filters: INTERACTIVE (queries run interactively), BATCH (queries run as part of a batch process), or HIGH (queries run at a high priority level).
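The filtering the UI applies can be pictured as a simple predicate over history rows. A sketch in Python, assuming each exported log row carries user_email, creation_time, and priority fields (field names vary by environment and are assumptions here):

```python
from datetime import datetime

def select_history_rows(rows, user_email, start, end, priority=None):
    """Keep rows for one user, within [start, end], optionally by priority."""
    out = []
    for row in rows:
        if row["user_email"] != user_email:
            continue
        if not (start <= row["creation_time"] <= end):
            continue
        if priority is not None and row["priority"] != priority:
            continue
        out.append(row)
    return out

rows = [
    {"user_email": "[email protected]",
     "creation_time": datetime(2022, 6, 18, 20, 39),
     "priority": "INTERACTIVE",
     "query": "select 1"},
    {"user_email": "[email protected]",
     "creation_time": datetime(2022, 6, 18, 21, 0),
     "priority": "BATCH",
     "query": "select 2"},
]

picked = select_history_rows(
    rows, "[email protected]",
    datetime(2022, 6, 18), datetime(2022, 6, 19),
    priority="INTERACTIVE",
)
```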

Results

Once the set of queries is uploaded, Single Origin first groups queries based on their source tables and then derives the common tables and common metrics. The results of the audit include:

  • one row per group, where every query in the group references the same common set of tables
  • common formulas in the group and how frequently each formula appears

You can view the results for a query audit by clicking on the desired run in the history table. From here you can navigate the results in Single Origin; furthermore, you can export the results CSV to explore locally.

📘

Results CSV

The results CSV is pipe-delimited (|), so to view it in an app like Google Sheets you will need to specify a custom separator on import.
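Reading the pipe-delimited results programmatically only requires setting the delimiter. A sketch, assuming the export includes a header row with the columns described below (the sample content is illustrative, not real output):

```python
import csv
import io

# Illustrative results in the pipe-delimited layout.
sample = (
    "group|queriesInGroup|commonDatasets|commonFormulaCount|similarity|tags\n"
    "0|[1, 3]|[tpch_sf1.lineitem]|[AVG(L_DISCOUNT): 2]|EXACT|[tpch_sf1]\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="|")
rows = list(reader)
```

With a real export you would pass `open("results.csv")` in place of the StringIO buffer.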

  • group: INTEGER - each row has a unique group number, starting at 0 and increasing by 1.
  • queriesInGroup: ARRAY OF INTEGERS - when an import is processed, a row number is assigned to each row, starting at 1. If the audit finds similarities between queries, queriesInGroup lists the row numbers of the similar queries. For example, if you upload a CSV and the first and third rows are exactly the same, then queriesInGroup = [1, 3].
  • commonDatasets: ARRAY OF STRINGS - the datasets shared by the queries in the group. A dataset must appear in all queries in the group to be included in commonDatasets.
  • commonFormulaCount: ARRAY OF STRINGS - the common dimensions and data entities for the queries in the group, and how often each common expression occurs. An expression must appear in at least 2 queries in the group to be included in commonFormulaCount.
  • similarity: STRING - EXACT or SIMILAR. Indicates whether the expressions from commonDatasets and commonFormulaCount are exactly the same, or not exactly the same but similar.
  • tags: ARRAY OF STRINGS - the unique list of tags associated with the queries in the group.

Note that if your source is a Query History Table, then we use the insertId column from BigQuery as the tag for the query. This way you can find the compiled SQL queries in your logs by using a WHERE clause (insertId = 'tag').

Currently Single Origin outputs groups of queries that are exact matches (similar matches are coming soon), along with some metadata on the groups. Queries are first grouped based on their source tables; then the common tables and common metrics are derived. Only groups where queriesInGroup contains more than one query are returned.
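The exact-match grouping can be pictured as a two-step pass: bucket queries by their set of source tables, then keep only buckets holding more than one query. A toy sketch; extract_tables is a naive regex stand-in for real SQL parsing, not Single Origin's implementation:

```python
import re
from collections import defaultdict

def extract_tables(query: str) -> frozenset:
    """Naive stand-in for SQL parsing: grab identifiers after FROM/JOIN."""
    found = re.findall(r"\b(?:from|join)\s+`?([\w.$-]+)`?", query, re.IGNORECASE)
    return frozenset(found)

def group_queries(queries):
    """Group queries (1-indexed rows) by their set of source tables,
    returning only groups with more than one query."""
    buckets = defaultdict(list)
    for row_number, query in enumerate(queries, start=1):
        buckets[extract_tables(query)].append(row_number)
    return [rows for rows in buckets.values() if len(rows) > 1]

queries = [
    "select a from db.t1 join db.t2 on t1.id = t2.id",
    "select b from db.t3",
    "select c from db.t1 join db.t2 on t1.id = t2.id",
]
groups = group_queries(queries)  # rows 1 and 3 share {db.t1, db.t2}
```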

The amount of time it takes to process a set of queries varies depending on complexity, but as a rough guide you can expect an audit of 100 SQL queries to take 1 minute.

