For GCP BigQuery

This guide walks through creating and exporting three Parquet files (columns.parquet, views.parquet, and query-historys.parquet) from BigQuery, and then converting query-historys.parquet to resolve a timestamp format mismatch.

Create and Upload Three Parquet Files

Generate And Export columns.parquet

Run Query

  1. Go to the BigQuery console.
  2. Click on Compose new query.
  3. Enter the following SQL query to gather the data for the columns table:
  4. Click Run to execute the query.
  5. After execution, click Save Results and select Save as Table. Name the table columns and specify your dataset.
SELECT 
  table_catalog AS table_catalog,
  table_schema AS table_schema,
  table_name AS table_name,
  column_name AS column_name,
  ordinal_position AS ordinal_position,
  is_nullable AS is_nullable,
  data_type AS data_type
FROM 
  `region-us.INFORMATION_SCHEMA.COLUMNS` -- replace with your region
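
If you prefer to script these steps instead of using the console, the same flow (run the query, then save the results as a table) can be reproduced with the BigQuery Python client. This is a minimal sketch rather than part of the console flow above; your_dataset is a placeholder for the dataset that should hold the columns table, and it assumes google-cloud-bigquery is installed and default credentials are configured.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  table_catalog, table_schema, table_name, column_name,
  ordinal_position, is_nullable, data_type
FROM `region-us.INFORMATION_SCHEMA.COLUMNS`  -- replace with your region
"""

# Equivalent of "Save Results" -> "Save as Table": write the query result
# into a destination table named `columns` in your dataset.
job_config = bigquery.QueryJobConfig(
    destination=f"{client.project}.your_dataset.columns",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()  # wait for completion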

Corresponding parquet schema

message schema {
  string table_catalog = 1;
  string table_schema = 2;
  string table_name = 3;
  string column_name = 4;
  int32 ordinal_position = 5;
  string is_nullable = 6;
  string data_type = 7;
}

Export Columns Table To Parquet

  1. In the BigQuery console, navigate to your dataset, open the columns table, and click Export.
  2. Choose Export to Google Cloud Storage.
  3. Set the file format to PARQUET and the compression to SNAPPY.
  4. Specify the destination path in Google Cloud Storage (e.g., gs://your_bucket/columns.parquet).
  5. Click Export to start the export process.
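
The export can also be scripted. The sketch below is a hedged example using the BigQuery Python client's extract job to write the columns table to Cloud Storage as SNAPPY-compressed Parquet; your_dataset and your_bucket are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Export the saved `columns` table to GCS as SNAPPY-compressed Parquet.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET,
    compression=bigquery.Compression.SNAPPY,
)
extract_job = client.extract_table(
    f"{client.project}.your_dataset.columns",   # table saved in the previous step
    "gs://your_bucket/columns.parquet",         # destination path in GCS
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish

The same pattern works for the views and query-historys tables in the sections below; only the table name and destination path change.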

Generate And Export views.parquet

Run Query

  1. Go to the BigQuery console.
  2. Click on Compose new query.
  3. Enter the following SQL query to gather the data for the views table:
  4. Click Run to execute the query.
  5. After execution, click Save Results and select Save as Table. Name the table views and specify your dataset.
SELECT 
  table_catalog AS table_catalog,
  table_schema AS table_schema,
  table_name AS table_name,
  view_definition AS view_definition
 FROM 
  `region-us.INFORMATION_SCHEMA.VIEWS` -- replace with your region

Corresponding parquet schema

message schema {
  string table_catalog = 1;
  string table_schema = 2;
  string table_name = 3;
  string view_definition = 4;
}

Export Views Table To Parquet

  1. In the BigQuery console, navigate to your dataset, open the views table, and click Export.
  2. Choose Export to Google Cloud Storage.
  3. Set the file format to PARQUET and the compression to SNAPPY.
  4. Specify the destination path in Google Cloud Storage (e.g., gs://your_bucket/views.parquet).
  5. Click Export to start the export process.
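
Before moving on, it can be worth confirming that the exported file's schema matches the one listed above. A quick pyarrow check, assuming you have copied views.parquet to your working directory:

import pyarrow.parquet as pq

# The file should expose four string columns: table_catalog, table_schema,
# table_name, and view_definition.
print(pq.read_schema("views.parquet"))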

Generate And Export query-historys.parquet

Run Query

  1. Go to the BigQuery console.
  2. Click on Compose new query.
  3. Enter the following SQL query to gather the query history:
  4. Click Run to execute the query.
  5. After execution, click Save Results and select Save as Table. Name the table query-historys and specify your dataset.
SELECT
  job_id AS query_id,
  query AS query,
  user_email AS query_user,
  user_email AS email,
  TIMESTAMP_DIFF(end_time, creation_time, MICROSECOND) / 1e6  AS total_elapsed_time,
  STRUCT(
    UNIX_SECONDS(end_time) AS seconds,
    CAST(EXTRACT(NANOSECOND FROM end_time) AS INT) AS nanos
  ) AS updated_at,
  STRUCT(
    UNIX_SECONDS(start_time) AS seconds,
    CAST(EXTRACT(NANOSECOND FROM start_time) AS INT) AS nanos
  ) AS start_time,
  STRUCT(
    UNIX_SECONDS(end_time) AS seconds,
    CAST(EXTRACT(NANOSECOND FROM end_time) AS INT) AS nanos
  ) AS end_time,
  total_bytes_processed AS processed_bytes,
  ARRAY(SELECT table_id FROM UNNEST(referenced_tables)) AS source_tables,
  (SELECT table_id FROM UNNEST(referenced_tables) LIMIT 1) AS destination_table,
  '' AS default_database, 
  '' AS default_schema, 
  statement_type AS statement_type,
  ARRAY<FLOAT64>[] AS allocated_cost,
  NULL AS produced_rows,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS execution_time,
  '' AS warehouse_size,
  '' AS warehouse_id,
  '' AS warehouse_name,
  state AS execution_status,
  '' AS query_tag  
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT -- replace with your region
WHERE
  state = "DONE"
ORDER BY
  creation_time DESC
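
The three STRUCT(seconds, nanos) columns mirror the layout of google.protobuf.Timestamp used in the schema below. As an illustration of what the SQL computes, the same split can be done on a Python datetime; BigQuery TIMESTAMP values carry microsecond precision, so nanos is always a multiple of 1000:

from datetime import datetime, timezone

def to_timestamp_parts(dt: datetime) -> tuple[int, int]:
    # Split a timestamp into the (seconds, nanos) pair used by
    # google.protobuf.Timestamp, matching UNIX_SECONDS(...) and
    # EXTRACT(NANOSECOND FROM ...) in the query above.
    return int(dt.timestamp()), dt.microsecond * 1000

print(to_timestamp_parts(datetime(2024, 1, 1, 0, 0, 0, 250000, tzinfo=timezone.utc)))
# (1704067200, 250000000)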

Corresponding parquet schema

message schema {
  string query_id = 1;
  string query = 2;
  string query_user = 3;
  string email = 4;
  double total_elapsed_time = 5;
  google.protobuf.Timestamp updated_at = 6;
  google.protobuf.Timestamp start_time = 7;
  google.protobuf.Timestamp end_time = 8;
  int64 processed_bytes = 9;
  repeated string source_tables = 10;
  string destination_table = 11;
  string default_database = 12;
  string default_schema = 13;
  string statement_type = 14;
  repeated double allocated_cost = 15;
  int64 produced_rows = 16;
  int64 execution_time = 17;
  string warehouse_size = 18;
  string warehouse_id = 19;
  string warehouse_name = 20;
  string execution_status = 21;
  string query_tag = 22;
}

Export query-historys Table To Parquet

  1. In the BigQuery console, navigate to your dataset, open the query-historys table, and click Export.
  2. Choose Export to Google Cloud Storage.
  3. Set the file format to PARQUET and the compression to SNAPPY.
  4. Specify the destination path in Google Cloud Storage (e.g., gs://your_bucket/query-historys.parquet).
  5. Click Export to start the export process.
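
Query history tables can be large, and BigQuery can only export more than 1 GB of table data when the destination uses a wildcard URI. For big projects, a hedged variation of the export sketch shown for the columns table writes the history to a set of sharded Parquet files instead of a single one:

from google.cloud import bigquery

client = bigquery.Client()

# A wildcard URI lets BigQuery shard the export into numbered files,
# which is required once the table exceeds 1 GB.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET,
    compression=bigquery.Compression.SNAPPY,
)
client.extract_table(
    f"{client.project}.your_dataset.query-historys",
    "gs://your_bucket/query-historys-*.parquet",
    job_config=job_config,
).result()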

Additional Conversion for query-historys

Due to timestamp format mismatches between BigQuery's RECORD export and Protobuf's time representation, you need to perform a conversion on query-historys.parquet. Use the following Python script for this transformation.

python

import pyarrow as pa
import pyarrow.parquet as pq

# read Parquet
parquet_file = 'origin-query-historys.parquet'
table = pq.read_table(parquet_file)

# read schema
old_schema = table.schema
print("Old Schema:", old_schema)

# build a new schema: cast the nanos subfield of the timestamp structs to int32
new_fields = []
for field in old_schema:
    if field.name in ['updated_at', 'start_time', 'end_time']:
        subfields = field.type
        new_subfields = []
        for subfield in subfields:
            if subfield.name == 'nanos':
                new_subfield = pa.field('nanos', pa.int32(), nullable=subfield.nullable)
            else:
                new_subfield = subfield
            new_subfields.append(new_subfield)
        new_field = pa.field(field.name, pa.struct(new_subfields))
    elif field.name == 'output_rows':
        new_field = pa.field('output_rows', pa.int32())
    else:
        new_field = field
    new_fields.append(new_field)

new_schema = pa.schema(new_fields)
print("New Schema:", new_schema)

# convert to new schema
def convert_table_schema(table, new_schema):
    df = table.to_pandas()
    new_table = pa.Table.from_pandas(df, schema=new_schema)
    return new_table

# write table
new_table = convert_table_schema(table, new_schema)
pq.write_table(new_table, 'output-query-historys.parquet')
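
As a quick sanity check after running the script, you can confirm that the nanos subfield of each timestamp struct now comes out as int32, which is what the schema above expects:

import pyarrow.parquet as pq

schema = pq.read_schema('output-query-historys.parquet')
# Each timestamp struct should now report nanos as int32.
for name in ('updated_at', 'start_time', 'end_time'):
    print(name, schema.field(name).type)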

Processing File

After generating the files, you can provide them directly to Single Origin. Reading files directly from GCP is currently under development and will be supported soon.