Connect Single Origin to OSS Spark

Create Spark metadata, event-log, and optional query-history input files that Single Origin can ingest.

Single Origin uses Spark schema metadata and runtime events to analyze query behavior, identify scan inefficiencies, and support optimization recommendations.

Required files

Prepare the following files for Single Origin:

| File | Required | Description |
| --- | --- | --- |
| tables.parquet | Yes | Table-level metadata from your Spark catalog or storage layer. |
| columns.parquet | Yes | Column-level metadata from your Spark catalog or storage layer. |
| views.parquet | Yes | View definitions and ownership metadata, when available. |
| spark-events.parquet | Yes | Spark event log records used to extract physical plans and runtime statistics. |
| query-historys.parquet | Optional | SQL text and default context for Spark jobs that execute a single SQL query. |

Prerequisites

  1. Identify the Spark catalog or storage layer you use, such as Hive Metastore, Spark built-in catalog, Apache Iceberg, or Delta Lake.
  2. Locate your Spark event logs from the Spark History UI or from your configured event-log storage path.
  3. Prepare a destination where you can write Parquet files for upload to Single Origin.

Generate tables.parquet

Export table-level metadata from your catalog or storage layer.

The file should include:

| Field | Description |
| --- | --- |
| table_catalog | Catalog name. |
| table_schema | Schema or database name. |
| table_name | Table name. |
| table_owner | Table owner. |
| row_count | Number of rows, when available. |
| bytes | Table size in bytes, when available. |
| comment | Table comment. |
| created | Creation timestamp. |
| last_altered | Last altered timestamp. |
| partition_cols | Partition columns. |
| sort_cols | Sort columns. |

row_count, bytes, partition_cols, and sort_cols are especially useful because Single Origin uses them to recommend table-scan optimizations.
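As a rough illustration of the record shape, the sketch below models one tables.parquet row and writes a list of rows with pyarrow. The field names come from the list above. Everything else is an assumption made for the example: the column types (strings for timestamps, lists of names for partition_cols and sort_cols), the helper name write_tables_parquet, and the sample values. Single Origin may expect different types.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TableRow:
    """One tables.parquet record; field names follow the field list above."""
    table_catalog: str
    table_schema: str
    table_name: str
    table_owner: Optional[str] = None
    row_count: Optional[int] = None        # when available
    bytes: Optional[int] = None            # when available
    comment: Optional[str] = None
    created: Optional[str] = None          # assumption: ISO 8601 string
    last_altered: Optional[str] = None     # assumption: ISO 8601 string
    partition_cols: List[str] = field(default_factory=list)  # assumption: list of names
    sort_cols: List[str] = field(default_factory=list)       # assumption: list of names


def write_tables_parquet(rows, path="tables.parquet"):
    """Write TableRow objects to Parquet. Requires pyarrow (imported lazily)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumed types; the source only specifies field names.
    schema = pa.schema([
        ("table_catalog", pa.string()),
        ("table_schema", pa.string()),
        ("table_name", pa.string()),
        ("table_owner", pa.string()),
        ("row_count", pa.int64()),
        ("bytes", pa.int64()),
        ("comment", pa.string()),
        ("created", pa.string()),
        ("last_altered", pa.string()),
        ("partition_cols", pa.list_(pa.string())),
        ("sort_cols", pa.list_(pa.string())),
    ])
    table = pa.Table.from_pylist([asdict(r) for r in rows], schema=schema)
    pq.write_table(table, path)


# Hypothetical sample row, for illustration only.
example = TableRow(
    table_catalog="spark_catalog",
    table_schema="sales",
    table_name="orders",
    row_count=1_200_000,
    bytes=734_003_200,
    partition_cols=["order_date"],
)
```

Calling write_tables_parquet([example]) would then produce the file, provided pyarrow is installed in the environment.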

Generate columns.parquet

Export column-level metadata from your catalog or storage layer.

The file should include:

| Field | Description |
| --- | --- |
| table_catalog | Catalog name. |
| table_schema | Schema or database name. |
| table_name | Table name. |
| column_name | Column name. |
| ordinal_position | Column position in the table. |
| is_nullable | Whether the column can contain NULL. |
| data_type | Spark data type. |
| numeric_precision | Numeric precision, when applicable. |
| numeric_scale | Numeric scale, when applicable. |
| datetime_precision | Datetime precision, when applicable. |
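One way to fill numeric_precision and numeric_scale is to read them from the Spark type string, since Spark renders decimal types as decimal(precision,scale). The sketch below does that with a regular expression. The helper names, the 1-based ordinal_position, and leaving datetime_precision empty are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Matches Spark decimal type strings such as "decimal(10,2)".
_DECIMAL = re.compile(r"^decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)$", re.IGNORECASE)


def decimal_precision_scale(data_type: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (precision, scale) for a decimal type string, else (None, None)."""
    match = _DECIMAL.match(data_type.strip())
    if match is None:
        return (None, None)
    return (int(match.group(1)), int(match.group(2)))


def column_row(catalog, schema, table, name, position, nullable, data_type):
    """Build one columns.parquet record using the field names listed above."""
    precision, scale = decimal_precision_scale(data_type)
    return {
        "table_catalog": catalog,
        "table_schema": schema,
        "table_name": table,
        "column_name": name,
        "ordinal_position": position,   # assumption: 1-based position
        "is_nullable": nullable,
        "data_type": data_type,
        "numeric_precision": precision,
        "numeric_scale": scale,
        "datetime_precision": None,     # not derived from the type string in this sketch
    }


# Hypothetical column, for illustration only.
row = column_row("spark_catalog", "sales", "orders", "amount", 2, True, "decimal(10,2)")
```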

Generate views.parquet

Export view metadata when your Spark catalog supports views.

The file should include:

| Field | Description |
| --- | --- |
| table_catalog | Catalog name. |
| table_schema | Schema or database name. |
| table_name | View name. |
| view_definition | SQL definition for the view. |
| table_owner | View owner. |

Choose a metadata source

Use the source that matches your Spark environment:

| Source | How to retrieve metadata |
| --- | --- |
| Hive Metastore | Use INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS. Identify views with table_type = 'VIEW'. |
| Spark built-in catalog | Use commands such as SHOW TABLES, SHOW COLUMNS, and DESCRIBE TABLE. Views are not directly distinguished. |
| Apache Iceberg | Use Iceberg metadata. Partition columns come from partition-specs; sort columns come from sort-orders. Iceberg does not natively support views. |
| Delta Lake | Use Delta logs and DESCRIBE DETAIL. Views are usually managed outside Delta in Spark Catalog or Hive Metastore. |
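For the Spark built-in catalog, DESCRIBE TABLE returns rows of (col_name, data_type, comment), and for partitioned tables it typically repeats the partition columns after a "# Partition Information" marker row. The sketch below separates regular columns from partition columns under that assumed layout. The exact output can vary across Spark versions and table formats, so treat the marker handling as an assumption and check it against your own output. The function name and sample rows are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def split_describe_rows(rows: Iterable[Sequence]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split DESCRIBE TABLE rows into (name, data_type) pairs and partition column names."""
    columns: List[Tuple[str, str]] = []
    partition_cols: List[str] = []
    in_partition_section = False
    for row in rows:
        name = (row[0] or "").strip()
        data_type = (row[1] or "").strip()
        if name.startswith("# Detailed Table Information"):
            break  # DESCRIBE TABLE EXTENDED appends table details after this marker
        if name.startswith("# Partition Information"):
            in_partition_section = True
            continue
        if not name or name.startswith("#"):
            continue  # blank separator rows and the repeated "# col_name" header
        if in_partition_section:
            partition_cols.append(name)
        else:
            columns.append((name, data_type))
    return columns, partition_cols


# Hypothetical rows shaped like DESCRIBE TABLE output for a partitioned table.
sample = [
    ("id", "bigint", None),
    ("amount", "decimal(10,2)", None),
    ("order_date", "date", None),
    ("# Partition Information", "", ""),
    ("# col_name", "data_type", "comment"),
    ("order_date", "date", None),
]
columns, partition_cols = split_describe_rows(sample)
```

With a live session, the input could come from spark.sql("DESCRIBE TABLE sales.orders").collect(), where sales.orders is a placeholder table name; the returned Row objects support the positional access used here.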

Generate spark-events.parquet

Single Origin uses Spark event logs to extract physical plans and runtime statistics. These are the same underlying events used by the Spark History UI.

Create spark-events.parquet with the following fields:

| Field | Description |
| --- | --- |
| spark_app_id | Spark application ID. |
| event_json | Raw Spark event JSON. |

To locate event logs:

  1. Open the Spark History UI.
  2. Go to the Environment tab.
  3. Find spark.eventLog.dir to identify the event-log directory.
  4. Find spark.app.id to identify the Spark application or job run.
  5. Locate event files under <spark_event_log_dir>/<spark_app_id>.

Filter event logs to the event types Single Origin needs:

  • org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart
  • org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
  • SparkListenerJobStart
  • SparkListenerStageCompleted
  • SparkListenerEnvironmentUpdate

Filtering to these event types can reduce the size of the logs you share with Single Origin.
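Spark event logs are written as one JSON object per line, with the event type in the "Event" field, so the filter can be a small line-by-line pass. The sketch below keeps only the five event types listed above and emits records with the spark_app_id and event_json fields. It assumes the log has already been decompressed if compression was enabled. The function name and the sample application ID are placeholders.

```python
import json

# Event types listed above, as they appear in the "Event" field of each log line.
NEEDED_EVENTS = {
    "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
    "org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate",
    "SparkListenerJobStart",
    "SparkListenerStageCompleted",
    "SparkListenerEnvironmentUpdate",
}


def filter_events(lines, spark_app_id):
    """Return spark-events.parquet records for the event types Single Origin needs."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if isinstance(event, dict) and event.get("Event") in NEEDED_EVENTS:
            # Keep the raw JSON text so nothing in the event is lost.
            rows.append({"spark_app_id": spark_app_id, "event_json": line})
    return rows


# Hypothetical log lines and application ID, for illustration only.
sample_lines = [
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0}',
    '{"Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", "executionId": 0}',
    "",
]
rows = filter_events(sample_lines, "app-20240101000000-0001")
```

In practice, lines would be an open file handle over an event log file, and the resulting records can be written to spark-events.parquet with any Parquet writer.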

Optional: Generate query-historys.parquet

Create query-historys.parquet for Spark jobs that execute a single SQL query.

The file should include:

| Field | Description |
| --- | --- |
| spark_app_id | Spark application ID. |
| query | SQL query text. |
| default_database | Default database used to resolve unqualified table references. |
| default_schema | Default schema used to resolve unqualified table references. |

The filename query-historys.parquet is intentionally preserved for compatibility with the existing Single Origin import workflow.

Upload the files

After you generate the required files, upload them to Single Origin as input data files.

Before uploading, verify that:

  • The files use the expected names.
  • The exported columns match the field names above.
  • spark_app_id values match between spark-events.parquet and query-historys.parquet, when query history is provided.
  • Event logs include SQL execution and stage-completion events.
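The checks above can be scripted before upload. The following is a minimal sketch that compares exported column names against the field lists in this guide and reports query-history application IDs that have no matching event records. The helper names and sample inputs are hypothetical, and the sketch works on plain Python lists so it does not depend on a particular Parquet library.

```python
from typing import Dict, Iterable, List, Set

# Field names per file, copied from the field lists in this guide.
EXPECTED_FIELDS: Dict[str, List[str]] = {
    "tables.parquet": [
        "table_catalog", "table_schema", "table_name", "table_owner", "row_count",
        "bytes", "comment", "created", "last_altered", "partition_cols", "sort_cols",
    ],
    "columns.parquet": [
        "table_catalog", "table_schema", "table_name", "column_name",
        "ordinal_position", "is_nullable", "data_type", "numeric_precision",
        "numeric_scale", "datetime_precision",
    ],
    "views.parquet": [
        "table_catalog", "table_schema", "table_name", "view_definition", "table_owner",
    ],
    "spark-events.parquet": ["spark_app_id", "event_json"],
    "query-historys.parquet": [
        "spark_app_id", "query", "default_database", "default_schema",
    ],
}


def missing_fields(file_name: str, actual_columns: Iterable[str]) -> List[str]:
    """Return expected field names that are absent from the exported columns."""
    actual = set(actual_columns)
    return [name for name in EXPECTED_FIELDS[file_name] if name not in actual]


def unmatched_app_ids(event_app_ids: Iterable[str], query_app_ids: Iterable[str]) -> Set[str]:
    """Return query-history application IDs with no records in spark-events.parquet."""
    return set(query_app_ids) - set(event_app_ids)


# Hypothetical inputs, for illustration only.
gaps = missing_fields("views.parquet", ["table_catalog", "table_schema", "table_name"])
orphans = unmatched_app_ids(["app-1", "app-2"], ["app-2", "app-3"])
```

If pyarrow is available, the column names of an exported file can be read with pyarrow.parquet.read_schema(path).names and passed to missing_fields.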