For Spark
This guide details the process for creating input data files for Spark.
Generate Input Data Files
Schema Parquet Files
Schema data is conveyed in 3 separate files for tables, columns, and views.
tables.parquet
message TableItem {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string table_owner = 4;
int64 row_count = 5;
int64 bytes = 6;
string comment = 7;
int64 created = 8;
int64 last_altered = 9;
}
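As a rough starting point, the sketch below builds rows with these columns from the built-in Spark catalog and writes them out with PySpark. This is a minimal sketch, not the required implementation: the catalog name "spark_catalog" and the output path are assumptions, and owner, size, and timestamp fields are left NULL where the catalog API does not expose them.

# Minimal PySpark sketch for producing tables.parquet (assumptions noted inline).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

table_schema = StructType([
    StructField("table_catalog", StringType()),
    StructField("table_schema", StringType()),
    StructField("table_name", StringType()),
    StructField("table_owner", StringType()),
    StructField("row_count", LongType()),
    StructField("bytes", LongType()),
    StructField("comment", StringType()),
    StructField("created", LongType()),
    StructField("last_altered", LongType()),
])

rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        if tbl.tableType == "VIEW":
            continue  # views go into views.parquet instead
        rows.append((
            "spark_catalog",   # assumption: single built-in catalog
            db.name,
            tbl.name,
            None,              # owner is not exposed by the catalog API
            None, None,        # row_count / bytes: fill from ANALYZE TABLE stats if available
            tbl.description,
            None, None,        # created / last_altered: fill from DESCRIBE EXTENDED if available
        ))

# Note: Spark writes a directory of part files; coalesce(1) keeps it to one part.
spark.createDataFrame(rows, table_schema).coalesce(1).write.mode("overwrite").parquet("tables.parquet")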
columns.parquet
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string column_name = 4;
int32 ordinal_position = 5;
string is_nullable = 6;
string data_type = 7;
bytes numeric_precision = 8;
bytes numeric_scale = 9;
bytes datetime_precision = 10;
}
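A similar sketch can produce columns.parquet, again assuming the built-in catalog and using "spark_catalog" as the catalog name. The precision and scale fields are left NULL here; if needed, they can be parsed out of the data_type string (e.g. decimal(10,2)).

# Minimal PySpark sketch for producing columns.parquet.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BinaryType)

spark = SparkSession.builder.getOrCreate()

column_schema = StructType([
    StructField("table_catalog", StringType()),
    StructField("table_schema", StringType()),
    StructField("table_name", StringType()),
    StructField("column_name", StringType()),
    StructField("ordinal_position", IntegerType()),
    StructField("is_nullable", StringType()),
    StructField("data_type", StringType()),
    StructField("numeric_precision", BinaryType()),
    StructField("numeric_scale", BinaryType()),
    StructField("datetime_precision", BinaryType()),
])

rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        # Ordinal position is taken from the catalog's column order.
        for pos, col in enumerate(spark.catalog.listColumns(tbl.name, db.name), start=1):
            rows.append((
                "spark_catalog", db.name, tbl.name, col.name, pos,
                "YES" if col.nullable else "NO",
                col.dataType,
                None, None, None,  # precision/scale left NULL in this sketch
            ))

spark.createDataFrame(rows, column_schema).coalesce(1).write.mode("overwrite").parquet("columns.parquet")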
views.parquet
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string view_definition = 4;
string table_owner = 5;
}
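For views.parquet, a sketch like the one below filters the catalog listing to views and uses SHOW CREATE TABLE as a stand-in for view_definition; substitute the raw SELECT text if your catalog exposes it (for example via the "View Text" row of DESCRIBE TABLE EXTENDED). The catalog name and output path are assumptions, and the owner is left NULL.

# Minimal PySpark sketch for producing views.parquet.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

view_schema = StructType([
    StructField("table_catalog", StringType()),
    StructField("table_schema", StringType()),
    StructField("table_name", StringType()),
    StructField("view_definition", StringType()),
    StructField("table_owner", StringType()),
])

rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        if tbl.tableType != "VIEW":
            continue
        # SHOW CREATE TABLE returns the full CREATE VIEW statement.
        ddl = spark.sql(f"SHOW CREATE TABLE `{db.name}`.`{tbl.name}`").first()[0]
        rows.append(("spark_catalog", db.name, tbl.name, ddl, None))

spark.createDataFrame(rows, view_schema).coalesce(1).write.mode("overwrite").parquet("views.parquet")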
Retrieving Table, View, and Column Metadata in Spark
The method for retrieving metadata in Apache Spark depends on the metastore or catalog you are using. Different storage formats and catalogs expose this information in different ways.
1️⃣ Hive Metastore (Common for Spark with Hive Integration)
- Tables & Views: Can be retrieved from INFORMATION_SCHEMA.TABLES.
- Columns: Available in INFORMATION_SCHEMA.COLUMNS.
- Views: Identified using table_type = 'VIEW'.
2️⃣ Spark Built-in Catalog (spark_catalog)
- Tables & Views: Listed using SHOW TABLES.
- Columns: Retrieved using SHOW COLUMNS or DESCRIBE TABLE (see the sketch after this list).
- Views: No direct distinction from tables, but DESCRIBE TABLE can provide insight.
3️⃣ Apache Iceberg (For Iceberg-based Tables)
- Tables & Views: Managed through Iceberg catalog metadata.
- Columns: Available via the table schema in Iceberg's metadata (e.g. DESCRIBE TABLE).
- Views: Iceberg does not natively support views.
4️⃣ Delta Lake (For Delta Table Metadata)
- Tables & Views: Managed in the Delta transaction log.
- Columns: Can be extracted with DESCRIBE TABLE; table-level details such as size come from DESCRIBE DETAIL.
- Views: Typically managed outside Delta, in the Spark Catalog or Hive Metastore.
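As a concrete illustration of option 2, the built-in catalog listings above map to plain Spark SQL commands. In this sketch, db_name and tbl_name are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tables and views in a database (views appear alongside tables).
spark.sql("SHOW TABLES IN db_name").show()

# Column names and types for one table.
spark.sql("DESCRIBE TABLE db_name.tbl_name").show()

# Extended output adds provider, location, and (for views) the view text.
spark.sql("DESCRIBE TABLE EXTENDED db_name.tbl_name").show(truncate=False)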
Event logs
We extract the physical plan and runtime statistics from Spark events. These are the same events used to power the Spark History UI.
We ingest the raw JSON events for each instance of a Spark job run (denoted by spark_app_id). These should be provided together in a parquet file using the schema below:
spark-events.parquet
message SparkEvent {
string spark_app_id = 1;
// JSON for a single event. For a given spark_app_id, there can be hundreds of events.
string event_json = 2;
}
Where to find Spark event logs
From the Spark History UI, on the "Environment" tab:
- spark.eventLog.dir denotes the location where event logs are persisted
- spark.app.id denotes the spark_app_id of this job run
- The events for this spark_app_id are typically in a file named
<spark_event_log_dir>/<spark_app_id>
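Assuming non-rolling, uncompressed event logs (one newline-delimited JSON file per application at <spark_event_log_dir>/<spark_app_id>), a sketch like the following packages them into spark-events.parquet. The event log directory and application IDs below are placeholders; adjust for compressed (.lz4/.zstd) or rolling event logs.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

event_log_dir = "hdfs:///spark-logs"          # value of spark.eventLog.dir (placeholder)
app_ids = ["application_1700000000000_0001"]  # spark_app_id values to export (placeholder)

events = None
for app_id in app_ids:
    # Each line of the event log file is one JSON event.
    df = (spark.read.text(f"{event_log_dir}/{app_id}")
          .select(F.lit(app_id).alias("spark_app_id"),
                  F.col("value").alias("event_json")))
    events = df if events is None else events.unionByName(df)

events.coalesce(1).write.mode("overwrite").parquet("spark-events.parquet")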
Query logs (if using SQL)
For Spark jobs that execute a single SQL query, we can ingest these queries to provide additional insights.
query-historys.parquet
message QueryHistoryItem {
string spark_app_id = 1;
// SQL text
string query = 2;
// If the query contains table names that aren't fully qualified, these fields
// specify a default catalog and schema used to resolve the table reference
string default_database = 12;
string default_schema = 13;
}
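A minimal sketch for producing query-historys.parquet from known (spark_app_id, SQL) pairs follows; the application ID, query text, and default database/schema values are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    ("application_1700000000000_0001",
     "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
     "spark_catalog",   # default_database used to resolve unqualified table names (placeholder)
     "sales"),          # default_schema used to resolve unqualified table names (placeholder)
]

(spark.createDataFrame(rows, "spark_app_id string, query string, "
                             "default_database string, default_schema string")
 .coalesce(1).write.mode("overwrite").parquet("query-historys.parquet"))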