For GCP BigQuery
This guide details how to create and export three Parquet files (columns.parquet, views.parquet, and query-historys.parquet) from BigQuery, and how to convert query-historys.parquet to resolve timestamp format mismatches.
Create and Upload Three Parquet Files
Generate And Export columns.parquet
Run Query
- Go to the BigQuery console.
- Click Compose new query.
- Enter the following SQL query to create the columns table:
- Click Run to execute the query.
- After execution, click Save Results and select Save as Table. Name the table columns and specify your dataset. (A scripted alternative is sketched after the schema below.)
SELECT
table_catalog AS table_catalog,
table_schema AS table_schema,
table_name AS table_name,
column_name AS column_name,
ordinal_position AS ordinal_position,
is_nullable AS is_nullable,
data_type AS data_type
FROM
`region-us.INFORMATION_SCHEMA.COLUMNS` -- replace with your region
Corresponding parquet schema
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string column_name = 4;
int32 ordinal_position = 5;
string is_nullable = 6;
string data_type = 7;
}
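If you prefer to script this step rather than use the console, the sketch below runs the same query and writes the results to a destination table with the google-cloud-bigquery client. It is only a sketch: the project, dataset, and table IDs are placeholders, not values from this guide.

import pyarrow  # noqa: F401  # not required here; see the conversion script later in this guide
from google.cloud import bigquery

# Placeholder IDs -- replace with your own project and dataset.
client = bigquery.Client(project="your-project")
destination = "your-project.your_dataset.columns"

sql = """
SELECT table_catalog, table_schema, table_name, column_name,
       ordinal_position, is_nullable, data_type
FROM `region-us.INFORMATION_SCHEMA.COLUMNS`  -- replace with your region
"""

# Write the query results directly into the destination table.
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()  # wait for the job to finish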
Export Columns Table To Parquet
- In the BigQuery console, navigate to your dataset and open the columns table, then click Export.
- Choose Export to Google Cloud Storage.
- Set the file format to PARQUET and the compression to SNAPPY.
- Specify the destination path in Google Cloud Storage (e.g., gs://your_bucket/columns.parquet).
- Click Export to start the export process.
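The export can also be scripted as an extract job instead of the console steps above. This is a sketch with placeholder names; the same pattern can be reused later for the views and query-historys tables by changing the table ID and destination URI.

from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project ID

# Export the saved table to Cloud Storage as Snappy-compressed Parquet.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET,
    compression=bigquery.Compression.SNAPPY,
)
extract_job = client.extract_table(
    "your-project.your_dataset.columns",   # placeholder table ID
    "gs://your_bucket/columns.parquet",    # destination path from this guide
    job_config=extract_config,
)
extract_job.result()  # wait for the export to finish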
Generate And Export views.parquet
Run Query
- Go to the BigQuery console.
- Click Compose new query.
- Enter the following SQL query to create the views table:
- Click Run to execute the query.
- After execution, click Save Results and select Save as Table. Name the table views and specify your dataset.
SELECT
table_catalog AS table_catalog,
table_schema AS table_schema,
table_name AS table_name,
view_definition AS view_definition
FROM
`region-us.INFORMATION_SCHEMA.VIEWS` -- replace with your region
Corresponding parquet schema
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string view_definition = 4;
}
Export Views Table To Parquet
- In the BigQuery console, navigate to your dataset and open the views table, then click Export.
- Choose Export to Google Cloud Storage.
- Set the file format to PARQUET and the compression to SNAPPY.
- Specify the destination path in Google Cloud Storage (e.g., gs://your_bucket/views.parquet).
- Click Export to start the export process.
Generate And Export query-historys.parquet
Run Query
- Go to the BigQuery console.
- Click Compose new query.
- Enter the following SQL query to create the query-historys table:
- Click Run to execute the query.
- After execution, click Save Results and select Save as Table. Name the table query-historys and specify your dataset.
SELECT
job_id AS query_id,
query AS query,
user_email AS query_user,
user_email AS email,
TIMESTAMP_DIFF(end_time, creation_time, MICROSECOND) / 1e6 AS total_elapsed_time,
STRUCT(
UNIX_SECONDS(end_time) AS seconds,
CAST(EXTRACT(MICROSECOND FROM end_time) * 1000 AS INT64) AS nanos
) AS updated_at,
STRUCT(
UNIX_SECONDS(start_time) AS seconds,
CAST(EXTRACT(MICROSECOND FROM start_time) * 1000 AS INT64) AS nanos
) AS start_time,
STRUCT(
UNIX_SECONDS(end_time) AS seconds,
CAST(EXTRACT(MICROSECOND FROM end_time) * 1000 AS INT64) AS nanos
) AS end_time,
total_bytes_processed AS processed_bytes,
ARRAY(SELECT table_id FROM UNNEST(referenced_tables)) AS source_tables,
(SELECT table_id FROM UNNEST(referenced_tables) LIMIT 1) AS destination_table,
'' AS default_database,
'' AS default_schema,
statement_type AS statement_type,
ARRAY<FLOAT64>[] AS allocated_cost,
NULL AS produced_rows,
TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS execution_time,
'' AS warehouse_size,
'' AS warehouse_id,
'' AS warehouse_name,
state AS execution_status,
'' AS query_tag
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT -- replace with your region
WHERE
state = "DONE"
ORDER BY
creation_time DESC
Corresponding parquet schema
message schema {
string query_id = 1;
string query = 2;
string query_user = 3;
string email = 4;
double total_elapsed_time = 5;
google.protobuf.Timestamp updated_at = 6;
google.protobuf.Timestamp start_time = 7;
google.protobuf.Timestamp end_time = 8;
int64 processed_bytes = 9;
repeated string source_tables = 10;
string destination_table = 11;
string default_database = 12;
string default_schema = 13;
string statement_type = 14;
repeated double allocated_cost = 15;
int64 produced_rows = 16;
int64 execution_time = 17;
string warehouse_size = 18;
string warehouse_id = 19;
string warehouse_name = 20;
string execution_status = 21;
string query_tag = 22;
}
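The seconds/nanos structs built in the query correspond to the google.protobuf.Timestamp fields in this schema: whole seconds since the Unix epoch plus the sub-second remainder in nanoseconds. A minimal illustration with made-up values:

from datetime import datetime, timezone

# Made-up example values, only to show how the pair is interpreted.
seconds, nanos = 1_700_000_000, 123_456_000

# A protobuf Timestamp consumer reads this as the UTC instant below
# plus the nanosecond remainder.
ts = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(ts.isoformat(), f"+ {nanos} ns")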
Export query-historys Table To Parquet
- In the BigQuery console, navigate to your dataset and open the query-historys table, then click Export.
- Choose Export to Google Cloud Storage.
- Set the file format to PARQUET and the compression to SNAPPY.
- Specify the destination path in Google Cloud Storage (e.g., gs://your_bucket/query-historys.parquet).
- Click Export to start the export process.
Additional Conversion for query-historys
Due to timestamp format mismatches between BigQuery's RECORD export and Protobuf's time representation, you need to perform a conversion on query-historys.parquet. Use the following Python script for this transformation.
import pyarrow as pa
import pyarrow.parquet as pq

# Read the Parquet file exported from BigQuery.
parquet_file = 'origin-query-historys.parquet'
table = pq.read_table(parquet_file)

# Inspect the original schema.
old_schema = table.schema
print("Old Schema:", old_schema)

# Build a new schema: narrow the nanos subfields of the timestamp structs to int32.
new_fields = []
for field in old_schema:
    if field.name in ['updated_at', 'start_time', 'end_time']:
        new_subfields = []
        for subfield in field.type:
            if subfield.name == 'nanos':
                new_subfields.append(pa.field('nanos', pa.int32(), nullable=subfield.nullable))
            else:
                new_subfields.append(subfield)
        new_field = pa.field(field.name, pa.struct(new_subfields))
    elif field.name == 'output_rows':
        # Only applies if an output_rows column is present in the export.
        new_field = pa.field('output_rows', pa.int32())
    else:
        new_field = field
    new_fields.append(new_field)

new_schema = pa.schema(new_fields)
print("New Schema:", new_schema)

# Cast the table to the new schema via a pandas round trip.
def convert_table_schema(table, new_schema):
    df = table.to_pandas()
    return pa.Table.from_pandas(df, schema=new_schema)

# Write the converted file.
new_table = convert_table_schema(table, new_schema)
pq.write_table(new_table, 'output-query-historys.parquet')
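As an optional sanity check (assuming the output path used above), read the converted file back and confirm that the nanos subfields are now int32:

import pyarrow.parquet as pq

# The nanos subfields of updated_at, start_time, and end_time should now be int32.
converted = pq.read_table('output-query-historys.parquet')
print(converted.schema)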
Processing the File
After generating the file, you can provide it directly to Single Origin. Accessing files from GCP is currently under development and will be supported soon.