For Spark
This guide details the process for creating and exporting five Parquet files (tables.parquet, columns.parquet, views.parquet, query-historys.parquet, query-stats.parquet) from the storage metadata and query logs of the Spark query engine.
This Guide Will Walk You Through Three Main Steps:
1. Create an S3 bucket.
2. Create and upload five Parquet files with the specified schemas and query logs.
3. Set up an IAM user with AmazonS3ReadOnlyAccess and save the access keys in Single Origin.
Create an S3 Bucket
Prerequisites
AWS Account: Ensure you have an AWS account.
Instructions
- Login to AWS Management Console: Go to the AWS Management Console and log in.
- Create an S3 Bucket:
- Navigate to the S3 service.
- Click on Create bucket.
- Enter a unique bucket name (e.g., your-unique-bucket-name).
- Select the AWS region where you want to create the bucket.
- Leave the default settings and click Create bucket.
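If you prefer to create the bucket from code rather than the console, a minimal boto3 sketch could look like the following; the bucket name and region are placeholders to replace with your own values.

# Minimal sketch: create the bucket with boto3 (assumes AWS credentials are already configured).
import boto3

region = "us-west-2"                    # placeholder region
bucket = "your-unique-bucket-name"      # placeholder; bucket names must be globally unique

s3 = boto3.client("s3", region_name=region)
if region == "us-east-1":
    # us-east-1 rejects an explicit LocationConstraint
    s3.create_bucket(Bucket=bucket)
else:
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )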
Create And Upload Five Parquet Files
- Upload the five Parquet files with the schemas defined below to a three-level folder structure based on the current year, month, and day (e.g., s3a://your-unique-bucket-name/2024/06/30, s3a://your-unique-bucket-name/2024/07/01); a write-and-upload sketch follows the tables.parquet schema below:
Parquet Schemas
tables.parquet schema
message TableItem {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string table_owner = 4;
int64 row_count = 5;
int64 bytes = 6;
string comment = 7;
int64 created = 8;
int64 last_altered = 9;
}
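As a rough example of producing tables.parquet and placing it under the dated prefix, the sketch below writes one placeholder row with pandas and uploads a single object with boto3; adapt the metadata collection to your own catalog (options are described in the next section). pandas needs pyarrow installed to write Parquet.

# Sketch: build tables.parquet locally and upload it to s3a://<bucket>/<yyyy>/<mm>/<dd>/.
# The sample row is illustrative only; replace it with metadata from your catalog.
from datetime import date
import boto3
import pandas as pd

table_rows = [{
    "table_catalog": "spark_catalog",
    "table_schema": "default",
    "table_name": "orders",
    "table_owner": "etl_user",
    "row_count": 1000,
    "bytes": 65536,
    "comment": "demo table",
    "created": 1719705600000,        # epoch milliseconds, matching the int64 fields
    "last_altered": 1719705600000,
}]

local_path = "/tmp/tables.parquet"
pd.DataFrame(table_rows).to_parquet(local_path, index=False)

key = date.today().strftime("%Y/%m/%d") + "/tables.parquet"
boto3.client("s3").upload_file(local_path, "your-unique-bucket-name", key)

columns.parquet, views.parquet, and the two query files can be produced the same way with their respective schemas.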
columns.parquet schema
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string column_name = 4;
int32 ordinal_position = 5;
string is_nullable = 6;
string data_type = 7;
bytes numeric_precision = 8;
bytes numeric_scale = 9;
bytes datetime_precision = 10;
}
views.parquet schema
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string view_definition = 4;
string table_owner = 5;
}
Retrieving Table, View, and Column Metadata in Spark
The method for retrieving metadata in Apache Spark depends on the metadata catalog you are using. Different storage formats and catalogs expose this information in different ways.
1️⃣ Hive Metastore (Common for Spark with Hive Integration)
Tables & Views: Can be retrieved from INFORMATION_SCHEMA.TABLES.
Columns: Available in INFORMATION_SCHEMA.COLUMNS.
Views: Identified using table_type = 'VIEW'.
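Spark SQL does not always expose INFORMATION_SCHEMA directly, so a hedged alternative is to read the metastore's backing database over JDBC; the URL, credentials, and the DBS/TBLS join below assume a standard Hive metastore schema on MySQL and will need adjusting for your deployment (the MySQL JDBC driver must be on the Spark classpath).

# Sketch: read table metadata straight from the Hive metastore database.
# Assumes an active SparkSession named `spark`; all connection values are placeholders.
metastore_tables = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://metastore-host:3306/hive")
    .option("dbtable",
            "(SELECT d.NAME AS table_schema, t.TBL_NAME, t.TBL_TYPE, t.OWNER, t.CREATE_TIME "
            "FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID) AS m")
    .option("user", "hive")
    .option("password", "********")
    .load()
)
metastore_tables.show()   # TBL_TYPE = 'VIRTUAL_VIEW' marks views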
2️⃣ Spark Built-in Catalog (spark_catalog)
Tables & Views: Listed using SHOW TABLES.
Columns: Retrieved using SHOW COLUMNS or DESCRIBE TABLE.
Views: No direct distinction from tables, but DESCRIBE TABLE can provide insight.
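A minimal PySpark sketch using the built-in catalog API is shown below; the database name is a placeholder, and the tableType field returned by listTables can also help tell views apart from regular tables.

# Sketch: enumerate tables, views, and columns via spark.catalog.
# Assumes an active SparkSession named `spark`.
for t in spark.catalog.listTables("default"):
    print(t.name, t.tableType, t.isTemporary)          # tableType is e.g. MANAGED, EXTERNAL, VIEW
    for c in spark.catalog.listColumns(t.name, "default"):
        print("  ", c.name, c.dataType, c.nullable)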
3️⃣ Apache Iceberg (For Iceberg-based Tables)
Tables & Views: Managed through Iceberg catalog metadata.
Columns: Available from the table schema stored in Iceberg metadata (e.g., via DESCRIBE TABLE in Spark).
Views: Iceberg does not natively support views.
4️⃣ Delta Lake (For Delta Table Metadata)
Tables & Views: Managed in Delta logs.
Columns: Can be extracted from DESCRIBE DETAIL.
Views: Views are typically managed outside Delta in the Spark Catalog or Hive Metastore.
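For Delta tables, a short sketch of pulling size and freshness metadata with DESCRIBE DETAIL might look like this; the table name is a placeholder.

# Sketch: Delta-specific metadata via DESCRIBE DETAIL (assumes an active SparkSession `spark`).
detail = spark.sql("DESCRIBE DETAIL default.my_delta_table").collect()[0]
print(detail["name"], detail["numFiles"], detail["sizeInBytes"], detail["lastModified"])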
Spark Logical Plan Representation
Each node in a Logical Plan will be represented using the following structure:
Logical Plan Node Schema
query-historys.parquet schema
message LogicalPlanNode {
string nodeType = 1; // Node type (e.g., Project, Filter, Join)
string expression = 2; // Core expression for the node
repeated LogicalPlanNode children = 3; // Child nodes
DatasetSchema schema = 4; // Schema associated with the node
}
message DatasetSchema {
repeated Column columns = 1; // List of columns in the schema
string tableName = 2; // Optional table name or alias
}
message Column {
string name = 1; // Column name
string dataType = 2; // Data type of the column
bool nullable = 3; // Nullable flag
string metadata = 4; // Additional metadata (e.g., comments)
}
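One way to capture the plan that feeds this structure is sketched below; the query is a placeholder, and the JSON call goes through Spark's internal Java objects, so it is not a stable public API, and the mapping from the plan JSON into LogicalPlanNode is left to your exporter.

# Sketch: capture a query's logical plan (assumes an active SparkSession `spark`).
df = spark.sql("SELECT o.id, o.amount FROM default.orders o WHERE o.amount > 100")

# Human-readable parsed/analyzed/optimized/physical plans:
df.explain(extended=True)

# JSON form of the optimized logical plan via internal APIs (may change between Spark versions):
plan_json = df._jdf.queryExecution().optimizedPlan().toJSON()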
query-stats.parquet schema
message schema {
string stage_id = 1;
string node_type = 2;
string node_attributes = 3;
int64 execution_time = 4;
int64 input_bytes = 5;
}
The query profile refers to the actual execution plan of a query, which comprises multiple operators. For each operator, it provides real-time performance metrics, including the size of input/output data, the time consumed by the operator, and memory usage. Single Origin requires these metrics to analyze the performance of customer-submitted queries and provide optimization suggestions based on the analysis results.
We can call the Spark REST API at http://<driver-host>:4040/api/v1/applications/<app-id>/stages to get the query stats.
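A rough sketch of calling that endpoint and shaping the response like query-stats.parquet is shown below; the driver host, port, and field mapping are assumptions that may need adjusting for your Spark version.

# Sketch: collect per-stage metrics from the Spark REST API.
import requests

base = "http://driver-host:4040/api/v1"                       # placeholder host/port
app_id = requests.get(f"{base}/applications").json()[0]["id"]

rows = []
for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
    rows.append({
        "stage_id": str(stage["stageId"]),
        "node_type": stage.get("name", ""),                   # closest per-stage analogue
        "node_attributes": stage.get("details", ""),
        "execution_time": stage.get("executorRunTime", 0),    # milliseconds
        "input_bytes": stage.get("inputBytes", 0),
    })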
- Verify the Upload: Check your S3 bucket to ensure that the files tables.parquet, columns.parquet, views.parquet, query-historys.parquet, and query-stats.parquet have been uploaded to the three-level folder named with the current year, month, and day (e.g., s3a://your-unique-bucket-name/2024/06/30/views.parquet, s3a://your-unique-bucket-name/2024/06/30/columns.parquet, s3a://your-unique-bucket-name/2024/06/30/tables.parquet, s3a://your-unique-bucket-name/2024/06/30/query-historys.parquet, s3a://your-unique-bucket-name/2024/06/30/query-stats.parquet).
Set Up the S3 access_key, secret_key, and bucket name for Single Origin
Prerequisites
IAM User: Ensure you have an IAM user with programmatic access.
Instructions
Create an IAM User with AmazonS3ReadOnlyAccess:
i. Go to the IAM service in the AWS Management Console.
ii. Click on Users and then Add user.
iii. Enter a user name (e.g., S3ReadOnlyUser) and select Programmatic access.
iv. Click on Next: Permissions.
v. Select Attach policies directly and search for AmazonS3ReadOnlyAccess.
vi. Select the policy and click Next: Tags, then Next: Review, and finally Create user.
vii. Note down the Access Key ID and Secret Access Key.
Save Access Keys in Single Origin:
Log in to Single Origin and navigate to the settings or configuration page for S3 access.
Enter the Access Key ID and Secret Access Key in the respective fields.
Save the configuration.
Verify the Configuration: Ensure that the Single Origin system can access the S3 bucket using the provided access keys.
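To sanity-check the keys locally before relying on Single Origin, a small boto3 sketch such as the following can list the dated prefix; the keys, bucket, and prefix are placeholders, and real credentials should never be committed to source control.

# Sketch: verify the read-only keys can list the uploaded files.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",            # placeholder
    aws_secret_access_key="...",            # placeholder
)
resp = s3.list_objects_v2(Bucket="your-unique-bucket-name", Prefix="2024/06/30/")
print([obj["Key"] for obj in resp.get("Contents", [])])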
Trigger Processing
Single Origin will automatically detect the uploaded Parquet files in the S3 bucket once the files (tables.parquet, columns.parquet, views.parquet, query-historys.parquet, and query-stats.parquet) have been successfully uploaded to the three-level folder structure for the current year, month, and day. After the files are in place, click the 'Start Import' button on the Single Origin page. A notification will show the number of query logs processed. You can then start the task and proceed with the optimization work. Once completed, you can view the data for the corresponding date on the page.