Query Audit

Query Audits allow Admins to gain insight into query sets outside the Single Origin platform, and to identify consolidation opportunities in a user’s external system. Query Audits follow a similar flow to Bulk Imports, but they do not create drafts or entities within the Single Origin platform.

Navigate to the Query Audit page by selecting Audits from the homepage, or by selecting the gear icon in the top right corner of the screen and clicking Query Audit in the left-hand nav. From here, administrators can create new query audits and view a history of previously run query audits. Click on a query audit to see more details about that audit.

Prerequisites for a new Query Audit

  • Queries must come from a valid source that can be imported into Single Origin
  • Queries must include:
    • A compiled, executed query
    • The executor of the query
    • A list of tags to associate with the query. Tags can either be a single item, like an Airflow task ID, or multiple items.

📘

Note

Your personal connector must have permission to access all referenced tables.

Valid Sources

Queries may come from 1 of 3 sources:

  • A CSV file
  • A Compliant Query Table
  • A Query History Table

CSV file upload

Each CSV file must follow the below format, with each column separated by a comma. If a header is present, use the column names below (query, user, and tags).

queryusertags
STRING - a SQL query to include in the audit. Include " at the start and end of the query.STRING - the email of the "owner" of the query or the email of who "ran" the querySTRING (comma separated) - tags of metadata to associate with the query. tags can be a single item (like a task ID or team that owns the pipeline) or multiple items. Include " at the start and end of the list.

An example CSV looks like this:

queryusertags
select lines.L_SHIPDATE as ship_date, parts.P_BRAND as part_brand, AVG(lines.L_DISCOUNT) as avg_discount from sandbox-demo-db.tpch_sf1.lineitem lines join sandbox-demo-db.tpch_sf1.part parts on lines.L_PARTKEY = parts.P_PARTKEY group by lines.L_SHIPDATE, parts.P_BRAND[email protected]tpch_sf1, customer_orders_by_day

Compliant Query Table

A Compliant Query Table is similar in structure to a CSV file, but must include a column representing the time the query was executed:

start_timequeryusertags
TIMESTAMP - the time the query was executedSTRING - a SQL query to include in the auditSTRING - the email of the "owner" of the query or the email of who "ran" the querySTRING (comma separated) - tags of metadata to associate with the query. tags can be a single item (like a task ID or team that owns the pipeline) or multiple items

An example Compliant Query Table looks like this:

start_timequeryusertags
2022-06-18 20:39:14.668604 UTCselect lines.L_SHIPDATE as ship_date, parts.P_BRAND as part_brand, AVG(lines.L_DISCOUNT) as avg_discount from sandbox-demo-db.tpch_sf1.lineitem lines join sandbox-demo-db.tpch_sf1.part parts on lines.L_PARTKEY = parts.P_PARTKEY group by lines.L_SHIPDATE, parts.P_BRAND[email protected]tpch_sf1, customer_orders_by_day

Query History Table

Query History Tables are exported from Data Warehouse providers, and provide a history of query executions in a standardized format. Select a user’s email to filter the queries by user, and select a time range. Choose a table and apply any additional filters. Filters include:

  1. INTERACTIVE (queries run interactively)
  2. BATCH (queries run as part of a batch process)
  3. HIGH (queries run at a high priority level).

Results

Uploaded queries are grouped based on their source table, then by common tables and metrics. The results of the query audit include:

  • One row per group, where every query in the group references the same common set of tables
  • Common formulas in the group & how frequently each formula appears

View the results for a query audit by selecting the desired run in the history table. From here, users can navigate the results in Single Origin, and export the results CSV to explore locally.

📘

Results CSV

The results CSV is returned in pipe form. To view in an app like Google Sheets, use a custom separator on import.

ColumnDescription
groupINTEGER - each row should have a unique group number, starting at 0 and increasing by 1.
queriesInGroupARRAY OF INTEGERS - when an import is processed, a row number is appended to the import starting at 1. If the audit finds similarities between queries, then queriesInGroup represents the row numbers of similar queries. For example, if you upload a CSV and the first and third rows are the same, then queriesInGroup = [1, 3].
commonDatasetsARRAY OF STRINGS - For the queries in the group, what are common datasets? A dataset must appear in all queries in the group to be included in commonDatasets.
commonFormulaCountARRAY OF STRINGS - For the queries in the group, what are common dimensions and data entities? And how often does the common expression occur? An expression must appear in at least 2 queries in the group to be included in commonFormulaCount.
similaritySTRING - EXACT or SIMILAR. Are expressions from commonDatasets and commonFormulaCount exactly the same, or not exactly the same but similar?
tagsARRAY OF STRINGS - For the group queries, what is the unique list of tags associated with the queries?

Note that if your source is a Query History Table, then we use the insertId column from BigQuery as the tag for the query. This way, you can find the compiled SQL queries in your logs using a WHERE clause (insertId = 'tag').

Currently, Single Origin outputs groups of queries that are either similar or exact matches, as well as some metadata on the groups.

The time to process a set of queries varies depending on complexity. Expect an audit of 100 SQL queries to take 1 minute.

Batch Import

The Batch Import form is identical to the Query Audit form. To start a Batch Import of SQL queries into Single Origin, click the "Import Queries" button at the top right corner of the Query Audit details page.

Issues

  • Why did my query fail to import?
    • There are several potential reasons:
      1. The SQL query is invalid.
      2. The SQL query is valid, but it references a table in a different project/warehouse than your Single Origin Connector. In this case, we do not have schema metadata for the table.
      3. Your Single Origin connector has not finished syncing. If there is a sync in process, then we may only have partial schema metadata.
    • Why does a query in my source not show up in the results?
      • In the results table, we only display queries that are similar or exact matches to other queries in the set. If a query is unique (e.g., references a table that no other queries reference), it will not appear in the results table.
    • Why do I see "No Results Found" as a result of my audit?
      • This can happen for one of two reasons:
        1. There is no duplication between queries in your set.
        2. We failed to process some (or all) queries, which led to the appearance of no duplication between queries in your set.
    • Can I combine the results of multiple query audits?
      • At this time, each audit is specific to the source queries provided for that audit. You cannot combine audits from different sets of queries - you would have to provide the entire set of queries into a single audit.