Connect Single Origin to OSS Spark
Create Spark metadata, event-log, and optional query-history input files that Single Origin can ingest.
Single Origin uses Spark schema metadata and runtime events to analyze query behavior, identify scan inefficiencies, and support optimization recommendations.
Required files
Prepare the following files for Single Origin:
| File | Required | Description |
|---|---|---|
| tables.parquet | Yes | Table-level metadata from your Spark catalog or storage layer. |
| columns.parquet | Yes | Column-level metadata from your Spark catalog or storage layer. |
| views.parquet | Yes | View definitions and ownership metadata, when available. |
| spark-events.parquet | Yes | Spark event log records used to extract physical plans and runtime statistics. |
| query-historys.parquet | Optional | SQL text and default context for Spark jobs that execute a single SQL query. |
Prerequisites
- Identify the Spark catalog or storage layer you use, such as Hive Metastore, Spark built-in catalog, Apache Iceberg, or Delta Lake.
- Locate your Spark event logs from the Spark History UI or from your configured event-log storage path.
- Prepare a destination where you can write Parquet files for upload to Single Origin.
Generate tables.parquet
Export table-level metadata from your catalog or storage layer.
The file should include:
| Field | Description |
|---|---|
| table_catalog | Catalog name. |
| table_schema | Schema or database name. |
| table_name | Table name. |
| table_owner | Table owner. |
| row_count | Number of rows, when available. |
| bytes | Table size in bytes, when available. |
| comment | Table comment. |
| created | Creation timestamp. |
| last_altered | Last altered timestamp. |
| partition_cols | Partition columns. |
| sort_cols | Sort columns. |
row_count, bytes, partition_cols, and sort_cols are especially useful because Single Origin uses them to recommend table-scan optimizations.
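As a minimal sketch of shaping exported metadata into the fields above, the following Python normalizes one table's metadata into a row. The helper name make_table_row, the use of lists for partition_cols and sort_cols, and the sample values are illustrative assumptions, not part of a documented Single Origin schema. Fields your catalog cannot supply are left as None.

```python
from typing import Any, Dict, List

# Field names taken from the table above, in the same order.
TABLE_FIELDS: List[str] = [
    "table_catalog", "table_schema", "table_name", "table_owner",
    "row_count", "bytes", "comment", "created", "last_altered",
    "partition_cols", "sort_cols",
]


def make_table_row(**values: Any) -> Dict[str, Any]:
    """Return a row with every expected field; missing fields become None."""
    unknown = set(values) - set(TABLE_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return {field: values.get(field) for field in TABLE_FIELDS}


# Hypothetical sample values, for illustration only.
row = make_table_row(
    table_catalog="spark_catalog",
    table_schema="sales",
    table_name="orders",
    row_count=1_000_000,
    bytes=52_428_800,
    partition_cols=["order_date"],
)
print(row["partition_cols"], row["sort_cols"])
```

A list of such rows can then be written to tables.parquet with whichever Parquet writer you already use, such as a Spark DataFrame writer.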
Generate columns.parquet
Export column-level metadata from your catalog or storage layer.
The file should include:
| Field | Description |
|---|---|
| table_catalog | Catalog name. |
| table_schema | Schema or database name. |
| table_name | Table name. |
| column_name | Column name. |
| ordinal_position | Column position in the table. |
| is_nullable | Whether the column can contain NULL. |
| data_type | Spark data type. |
| numeric_precision | Numeric precision, when applicable. |
| numeric_scale | Numeric scale, when applicable. |
| datetime_precision | Datetime precision, when applicable. |
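When the catalog only reports a Spark type string, numeric_precision and numeric_scale may need to be derived from it. The sketch below assumes type strings in the form Spark prints for decimals, such as decimal(10,2), and leaves both fields as None for every other type. The function names are hypothetical, and how Single Origin expects data_type to be spelled is not specified in this guide.

```python
import re
from typing import Any, Dict, Optional

# Matches Spark decimal type strings such as "decimal(10,2)".
_DECIMAL = re.compile(r"^decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)$", re.IGNORECASE)


def numeric_fields(data_type: str) -> Dict[str, Optional[int]]:
    """Derive precision and scale from a decimal type string, else None."""
    match = _DECIMAL.match(data_type.strip())
    if match is None:
        return {"numeric_precision": None, "numeric_scale": None}
    return {
        "numeric_precision": int(match.group(1)),
        "numeric_scale": int(match.group(2)),
    }


def make_column_row(table_catalog: str, table_schema: str, table_name: str,
                    column_name: str, ordinal_position: int,
                    is_nullable: bool, data_type: str) -> Dict[str, Any]:
    """Build one columns.parquet row using the field names listed above."""
    row: Dict[str, Any] = {
        "table_catalog": table_catalog,
        "table_schema": table_schema,
        "table_name": table_name,
        "column_name": column_name,
        "ordinal_position": ordinal_position,
        "is_nullable": is_nullable,
        "data_type": data_type,
    }
    row.update(numeric_fields(data_type))
    row["datetime_precision"] = None  # not derived in this sketch
    return row


# Hypothetical sample column, for illustration only.
amount = make_column_row("spark_catalog", "sales", "orders",
                         "amount", 3, True, "decimal(10,2)")
print(amount["numeric_precision"], amount["numeric_scale"])
```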
Generate views.parquet
Export view metadata when your Spark catalog supports views.
The file should include:
| Field | Description |
|---|---|
| table_catalog | Catalog name. |
| table_schema | Schema or database name. |
| table_name | View name. |
| view_definition | SQL definition for the view. |
| table_owner | View owner. |
Choose a metadata source
Use the source that matches your Spark environment:
| Source | How to retrieve metadata |
|---|---|
| Hive Metastore | Use INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS. Identify views with table_type = 'VIEW'. |
| Spark built-in catalog | Use commands such as SHOW TABLES, SHOW COLUMNS, and DESCRIBE TABLE. Views are not directly distinguished. |
| Apache Iceberg | Use Iceberg metadata. Partition columns come from partition-specs; sort columns come from sort-orders. Iceberg does not natively support views. |
| Delta Lake | Use Delta logs and DESCRIBE DETAIL. Views are usually managed outside Delta in Spark Catalog or Hive Metastore. |
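For the Apache Iceberg row, partition and sort columns can be read from a table's metadata JSON. The sketch below assumes the format-version 2 layout (schemas, partition-specs, and sort-orders lists with current and default IDs) and resolves only top-level columns. The function name and the sample metadata are hypothetical.

```python
from typing import Any, Dict, List


def iceberg_partition_and_sort_cols(metadata: Dict[str, Any]) -> Dict[str, List[str]]:
    """Resolve partition and sort source columns from Iceberg table metadata."""
    # Map field IDs to column names using the current schema.
    schema = next(s for s in metadata["schemas"]
                  if s["schema-id"] == metadata["current-schema-id"])
    names = {f["id"]: f["name"] for f in schema["fields"]}

    # Use the default partition spec and default sort order.
    spec = next(s for s in metadata["partition-specs"]
                if s["spec-id"] == metadata["default-spec-id"])
    order = next(o for o in metadata["sort-orders"]
                 if o["order-id"] == metadata["default-sort-order-id"])

    # Nested source fields are not handled here and would raise KeyError.
    return {
        "partition_cols": [names[f["source-id"]] for f in spec["fields"]],
        "sort_cols": [names[f["source-id"]] for f in order["fields"]],
    }


# Trimmed, made-up metadata for illustration only.
sample = {
    "current-schema-id": 0,
    "schemas": [{"schema-id": 0, "type": "struct", "fields": [
        {"id": 1, "name": "order_id", "required": True, "type": "long"},
        {"id": 2, "name": "order_date", "required": False, "type": "date"},
    ]}],
    "default-spec-id": 0,
    "partition-specs": [{"spec-id": 0, "fields": [
        {"name": "order_date_day", "transform": "day",
         "source-id": 2, "field-id": 1000},
    ]}],
    "default-sort-order-id": 1,
    "sort-orders": [{"order-id": 1, "fields": [
        {"transform": "identity", "source-id": 1,
         "direction": "asc", "null-order": "nulls-first"},
    ]}],
}
cols = iceberg_partition_and_sort_cols(sample)
print(cols)
```

The sketch returns source column names and discards the transform (for example, day); whether Single Origin expects the transform to be recorded as well is not stated in this guide.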
Generate spark-events.parquet
Single Origin uses Spark event logs to extract physical plans and runtime statistics. These are the same underlying events used by the Spark History UI.
Create spark-events.parquet with the following fields:
| Field | Description |
|---|---|
| spark_app_id | Spark application ID. |
| event_json | Raw Spark event JSON. |
To locate event logs:
- Open the Spark History UI.
- Go to the Environment tab.
- Find spark.eventLog.dir to identify the event-log directory.
- Find spark.app.id to identify the Spark application or job run.
- Locate event files under <spark_event_log_dir>/<spark_app_id>.
Filter event logs to the event types Single Origin needs:
- org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart
- org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
- SparkListenerJobStart
- SparkListenerStageCompleted
- SparkListenerEnvironmentUpdate
Filtering to these event types can reduce the size of the logs you share with Single Origin.
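The filtering step can be sketched in plain Python. This assumes an uncompressed event log with one JSON object per line, each carrying an Event key that names the event type. Compressed logs would need to be decompressed first. The function name and the sample lines are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator

# Event types listed above.
WANTED_EVENTS = frozenset({
    "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
    "org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate",
    "SparkListenerJobStart",
    "SparkListenerStageCompleted",
    "SparkListenerEnvironmentUpdate",
})


def filter_events(spark_app_id: str, lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield spark-events rows for the event types Single Origin needs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated lines, e.g. from an in-progress log
        if isinstance(event, dict) and event.get("Event") in WANTED_EVENTS:
            # Keep the raw JSON text unchanged in event_json.
            yield {"spark_app_id": spark_app_id, "event_json": line}


# Made-up sample lines for illustration only.
sample_lines = [
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0}',
    '{"Event": "SparkListenerStageCompleted"}',
]
rows = list(filter_events("app-20240101000000-0001", sample_lines))
print(len(rows))
```

In practice the lines would come from an open event-log file, and the resulting rows would be written to spark-events.parquet.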
Optional: Generate query-historys.parquet
Create query-historys.parquet for Spark jobs that execute a single SQL query.
The file should include:
| Field | Description |
|---|---|
| spark_app_id | Spark application ID. |
| query | SQL query text. |
| default_database | Default database used to resolve unqualified table references. |
| default_schema | Default schema used to resolve unqualified table references. |
The filename query-historys.parquet is intentionally preserved for compatibility with the existing Single Origin import workflow.
Upload the files
After you generate the required files, upload them to Single Origin as input data files.
Before uploading, verify that:
- The files use the expected names.
- The exported columns match the field names above.
- spark_app_id values match between spark-events.parquet and query-historys.parquet, when query history is provided.
- Event logs include SQL execution and stage-completion events.
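The checks above can be sketched as a small pre-upload validation over rows already loaded into memory. The function name check_inputs is hypothetical, and the sketch interprets "match" as every spark_app_id in the query history also appearing in the event rows, which is an assumption rather than a documented rule.

```python
import json
from typing import Dict, Iterable, List, Optional

REQUIRED_FILES = ["tables.parquet", "columns.parquet",
                  "views.parquet", "spark-events.parquet"]
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"
STAGE_COMPLETED = "SparkListenerStageCompleted"


def check_inputs(file_names: Iterable[str],
                 event_rows: Iterable[Dict[str, str]],
                 query_rows: Optional[Iterable[Dict[str, str]]] = None) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems: List[str] = []

    # The files use the expected names.
    present = set(file_names)
    for name in REQUIRED_FILES:
        if name not in present:
            problems.append(f"missing file: {name}")

    # Event logs include SQL execution and stage-completion events.
    event_rows = list(event_rows)
    event_types = {json.loads(r["event_json"]).get("Event") for r in event_rows}
    for needed in (SQL_START, STAGE_COMPLETED):
        if needed not in event_types:
            problems.append(f"no {needed} events found")

    # spark_app_id values in query history also appear in the event rows.
    if query_rows is not None:
        event_ids = {r["spark_app_id"] for r in event_rows}
        query_ids = {r["spark_app_id"] for r in query_rows}
        for app_id in sorted(query_ids - event_ids):
            problems.append(f"query history app not in event logs: {app_id}")
    return problems


# Made-up sample rows for illustration only.
events = [
    {"spark_app_id": "app-1", "event_json": json.dumps({"Event": SQL_START})},
    {"spark_app_id": "app-1", "event_json": json.dumps({"Event": STAGE_COMPLETED})},
]
queries = [{"spark_app_id": "app-2", "query": "SELECT 1"}]
problems = check_inputs(REQUIRED_FILES, events, queries)
print(problems)
```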