For Spark
This guide details the process for creating input data files for Spark.
Generate Input Data Files
Schema Parquet Files
Schema data is conveyed in 3 separate files for tables, columns, and views.
tables.parquet
message TableItem {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string table_owner = 4;
int64 row_count = 5;
int64 bytes = 6;
string comment = 7;
int64 created = 8;
int64 last_altered = 9;
}
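As a rough starting point, the sketch below builds rows with these columns from the built-in Spark catalog and writes them out with PySpark. This is a minimal sketch, not the required implementation: the catalog name "spark_catalog" and the output path are assumptions, and owner, size, and timestamp fields are left NULL where the catalog API does not expose them.

# Minimal PySpark sketch for producing tables.parquet (assumptions noted inline).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

table_schema = StructType([
    StructField("table_catalog", StringType()),
    StructField("table_schema", StringType()),
    StructField("table_name", StringType()),
    StructField("table_owner", StringType()),
    StructField("row_count", LongType()),
    StructField("bytes", LongType()),
    StructField("comment", StringType()),
    StructField("created", LongType()),
    StructField("last_altered", LongType()),
])

rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        if tbl.tableType == "VIEW":
            continue  # views go into views.parquet instead
        rows.append((
            "spark_catalog",   # assumption: single built-in catalog
            db.name,
            tbl.name,
            None,              # owner is not exposed by the catalog API
            None, None,        # row_count / bytes: fill from ANALYZE TABLE stats if available
            tbl.description,
            None, None,        # created / last_altered: fill from DESCRIBE EXTENDED if available
        ))

# Note: Spark writes a directory of part files; coalesce(1) keeps it to one part.
spark.createDataFrame(rows, table_schema).coalesce(1).write.mode("overwrite").parquet("tables.parquet")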
columns.parquet
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string column_name = 4;
int32 ordinal_position = 5;
string is_nullable = 6;
string data_type = 7;
bytes numeric_precision = 8;
bytes numeric_scale = 9;
bytes datetime_precision = 10;
}
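A similar sketch can produce columns.parquet, again assuming the built-in catalog and using "spark_catalog" as the catalog name. The precision and scale fields are left NULL here; if needed, they can be parsed out of the data_type string (e.g. decimal(10,2)).

# Minimal PySpark sketch for producing columns.parquet.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BinaryType)

spark = SparkSession.builder.getOrCreate()

column_schema = StructType([
    StructField("table_catalog", StringType()),
    StructField("table_schema", StringType()),
    StructField("table_name", StringType()),
    StructField("column_name", StringType()),
    StructField("ordinal_position", IntegerType()),
    StructField("is_nullable", StringType()),
    StructField("data_type", StringType()),
    StructField("numeric_precision", BinaryType()),
    StructField("numeric_scale", BinaryType()),
    StructField("datetime_precision", BinaryType()),
])

rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        # Ordinal position is taken from the catalog's column order.
        for pos, col in enumerate(spark.catalog.listColumns(tbl.name, db.name), start=1):
            rows.append((
                "spark_catalog", db.name, tbl.name, col.name, pos,
                "YES" if col.nullable else "NO",
                col.dataType,
                None, None, None,  # precision/scale left NULL in this sketch
            ))

spark.createDataFrame(rows, column_schema).coalesce(1).write.mode("overwrite").parquet("columns.parquet")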
views.parquet
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string view_definition = 4;
string table_owner = 5;
}
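For views.parquet, a sketch like the one below filters the catalog listing to views and uses SHOW CREATE TABLE as a stand-in for view_definition; substitute the raw SELECT text if your catalog exposes it (for example via the "View Text" row of DESCRIBE TABLE EXTENDED). The catalog name and output path are assumptions, and the owner is left NULL.

# Minimal PySpark sketch for producing views.parquet.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

view_schema = StructType([
    StructField("table_catalog", StringType()),
    StructField("table_schema", StringType()),
    StructField("table_name", StringType()),
    StructField("view_definition", StringType()),
    StructField("table_owner", StringType()),
])

rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        if tbl.tableType != "VIEW":
            continue
        # SHOW CREATE TABLE returns the full CREATE VIEW statement.
        ddl = spark.sql(f"SHOW CREATE TABLE `{db.name}`.`{tbl.name}`").first()[0]
        rows.append(("spark_catalog", db.name, tbl.name, ddl, None))

spark.createDataFrame(rows, view_schema).coalesce(1).write.mode("overwrite").parquet("views.parquet")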
Retrieving Table, View, and Column Metadata in Spark
The method for retrieving metadata in Apache Spark depends on the metastore or catalog you are using. Different storage formats and catalogs expose this information in different ways.
1️⃣ Hive Metastore (Common for Spark with Hive Integration)
- Tables & Views: Can be retrieved from INFORMATION_SCHEMA.TABLES.
- Columns: Available in INFORMATION_SCHEMA.COLUMNS.
- Views: Identified using table_type = 'VIEW'.
2️⃣ Spark Built-in Catalog (spark_catalog)
- Tables & Views: Listed using SHOW TABLES.
- Columns: Retrieved using SHOW COLUMNS or DESCRIBE TABLE (see the sketch after this list).
- Views: No direct distinction from tables, but DESCRIBE TABLE can provide insight.
3️⃣ Apache Iceberg (For Iceberg-based Tables)
- Tables & Views: Managed through Iceberg catalog metadata.
- Columns: Available via the table schema in Iceberg's metadata (e.g. DESCRIBE TABLE).
- Views: Iceberg does not natively support views.
4️⃣ Delta Lake (For Delta Table Metadata)
- Tables & Views: Managed in the Delta transaction log.
- Columns: Can be extracted with DESCRIBE TABLE; table-level details such as size come from DESCRIBE DETAIL.
- Views: Typically managed outside Delta, in the Spark Catalog or Hive Metastore.
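As a concrete illustration of option 2, the built-in catalog listings above map to plain Spark SQL commands. In this sketch, db_name and tbl_name are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tables and views in a database (views appear alongside tables).
spark.sql("SHOW TABLES IN db_name").show()

# Column names and types for one table.
spark.sql("DESCRIBE TABLE db_name.tbl_name").show()

# Extended output adds provider, location, and (for views) the view text.
spark.sql("DESCRIBE TABLE EXTENDED db_name.tbl_name").show(truncate=False)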
Event logs
We extract the physical plan and runtime statistics from Spark events. These are the same events used to power the Spark History UI.
We ingest the raw JSON events for each instance of a Spark job run (denoted by spark_app_id). These should be provided together in a parquet file using the schema below:
spark-events.parquet
message SparkEvent {
string spark_app_id = 1;
// JSON for a single event. For a given spark_app_id, there can be hundreds of events.
string event_json = 2;
}
Where to find Spark event logs
From the Spark History UI, on the "Environment" tab:
- spark.eventLog.dir denotes the location where event logs are persisted
- spark.app.id denotes the spark_app_id of this job run
- The events for this spark_app_id are typically in a file named
<spark_event_log_dir>/<spark_app_id>
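Assuming non-rolling, uncompressed event logs (one newline-delimited JSON file per application at <spark_event_log_dir>/<spark_app_id>), a sketch like the following packages them into spark-events.parquet. The event log directory and application IDs below are placeholders; adjust for compressed (.lz4/.zstd) or rolling event logs.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

event_log_dir = "hdfs:///spark-logs"          # value of spark.eventLog.dir (placeholder)
app_ids = ["application_1700000000000_0001"]  # spark_app_id values to export (placeholder)

events = None
for app_id in app_ids:
    # Each line of the event log file is one JSON event.
    df = (spark.read.text(f"{event_log_dir}/{app_id}")
          .select(F.lit(app_id).alias("spark_app_id"),
                  F.col("value").alias("event_json")))
    events = df if events is None else events.unionByName(df)

events.coalesce(1).write.mode("overwrite").parquet("spark-events.parquet")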
Query logs (if using SQL)
For Spark jobs that execute a single SQL query, we can ingest these queries to provide additional insights.
query-historys.parquet
message QueryHistoryItem {
string spark_app_id = 1;
// SQL text
string query = 2;
// If the query contains table names that aren't fully qualified, these fields
// specify a default catalog and schema used to resolve the table reference
string default_database = 12;
string default_schema = 13;
}
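A minimal sketch for producing query-historys.parquet from known (spark_app_id, SQL) pairs follows; the application ID, query text, and default database/schema values are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    ("application_1700000000000_0001",
     "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
     "spark_catalog",   # default_database used to resolve unqualified table names (placeholder)
     "sales"),          # default_schema used to resolve unqualified table names (placeholder)
]

(spark.createDataFrame(rows, "spark_app_id string, query string, "
                             "default_database string, default_schema string")
 .coalesce(1).write.mode("overwrite").parquet("query-historys.parquet"))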