For Trino and Presto
This guide details the process for creating and exporting 3 Parquet files (tables.parquet, columns.parquet, views.parquet) from the storage and query log from Trino/Presto query engine.
This Guide Will Walk You Through Three Main Steps:
1、Creating an S3 bucket
2、Creating and uploading three Parquet files with a specified schema and query log.
3、Set Up an IAM User with AmazonS3ReadOnlyAccess and Save Access Keys in Single Origin.
Create A S3 Bucket
Prerequisites
AWS Account: Ensure you have an AWS account.
Instructions
- Login to AWS Management Console: Go to the AWS Management Console and log in.
- Create an S3 Bucket:
- Navigate to the S3 service.
- Click on Create bucket.
- Enter a unique bucket name (e.g., your-unique-bucket-name).
- Select the AWS region where you want to create the bucket.
- Leave the default settings and click Create bucket.
- Enter a unique bucket name (e.g., your-unique-bucket-name).
- Select the AWS region where you want to create the bucket.
- Leave the default settings and click Create bucket.
Create And Upload Three Parquet Files and Query Log
- Upload three Parquet files with the defined schema to a three-level folder structure for current year, month, and day (eg, s3a://your-unique-bucket-name/2024/06/30,s3a://your-unique-bucket-name/2024/07/01):
tables.parquet schema
message TableItem {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string table_owner = 4;
int64 row_count = 5;
int64 bytes = 6;
string comment = 7;
int64 created = 8;
int64 last_altered = 9;
}
columns.parquet schema
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string column_name = 4;
int32 ordinal_position = 5;
string is_nullable = 6;
string data_type = 7;
bytes numeric_precision = 8;
bytes numeric_scale = 9;
bytes datetime_precision = 10;
}
views.parquet scehma
message schema {
string table_catalog = 1;
string table_schema = 2;
string table_name = 3;
string view_definition = 4;
string table_owner = 5;
}
query-historys.json
After you enable the logging for Trino/Presto queries, the below is a sample query history record:
{
"instant": {
"epochSecond": 1649685758,
"nanoOfSecond": 64613000
},
"thread": "dispatcher-query-0",
"level": "INFO",
"loggerName": "com.github.trino.querylog.QueryLogListener",
"message": {
"createTime": {
"nano": 39000000,
"epochSecond": 1649685758
},
"context": {
"user": "admin",
"principal": {
"empty": false,
"present": true
},
"traceToken": {
"empty": true,
"present": false
},
"remoteClientAddress": {
"empty": false,
"present": true
},
"userAgent": {
"empty": false,
"present": true
},
"clientInfo": {
"empty": true,
"present": false
},
"source": {
"empty": true,
"present": false
},
"catalog": {
"empty": true,
"present": false
},
"schema": {
"empty": true,
"present": false
},
"resourceGroupId": {
"empty": false,
"present": true
},
"resourceEstimates": {
"executionTime": {
"empty": true,
"present": false
},
"cpuTime": {
"empty": true,
"present": false
},
"peakMemoryBytes": {
"empty": true,
"present": false
}
},
"serverAddress": "192.168.56.2",
"serverVersion": "355",
"environment": "demo",
"queryType": {
"empty": false,
"present": true
}
},
"metadata": {
"queryId": "20220411_140237_00000_ai6km",
"transactionId": {
"empty": false,
"present": true
},
"query": "select \* from hive.dw1.flight where count in (10,264)",
"updateType": {
"empty": true,
"present": false
},
"preparedQuery": {
"empty": true,
"present": false
},
"queryState": "QUEUED",
"uri": "<http://192.168.56.2:8080/v1/query/20220411_140237_00000_ai6km">,
"plan": {
"empty": true,
"present": false
},
"payload": {
"empty": true,
"present": false
}
}
},
"endOfBatch": false,
"loggerFqcn": "org.apache.logging.log4j.spi.AbstractLogger",
"threadId": 141,
"threadPriority": 5
}
- Verify the Upload: Check your S3 bucket to ensure that the files tables.parquet, columns.parquet, views.parquet, and query-historys.json have been uploaded to the three levels folder named with the current year, month, and day.
Set Up S3 access_key and secret_key and bucket-name for Single Origin
Prerequisites
IAM User: Ensure you have an IAM user with programmatic access.
Instructions
- Create an IAM User with AmazonS3ReadOnlyAccess:
i. Go to the IAM service in the AWS Management Console.
ii. Click on Users and then Add user.
iii. Enter a user name (e.g., S3ReadOnlyUser) and select Programmatic access.
iv. Click on Next: Permissions.
v. Select Attach policies directly and search for AmazonS3ReadOnlyAccess.
vi. Select the policy and click Next: Tags, then Next: Review, and finally Create user.
vii. Note down the Access Key ID and Secret Access Key. - Save Access Keys in Our System:
Log in to our system and navigate to the settings or configuration page for S3 access.
Enter the Access Key ID and Secret Access Key in the respective fields. - Save the configuration.
Verify the Configuration: Ensure that Single Origin system can access the S3 bucket using the provided access keys.
Trigger Processing
Single Origin will automatically detect the uploaded Parquet files and query-history.json in the S3 bucket. Once the files are successfully uploaded to the a three-level folder structure for current year, month, and day. After you set the file correctly, click the 'Start Import' button on the Single Origin page. A notification will show the number of query logs processed. You can then start the task and proceed with the optimization work. Once completed, you can view the data for the corresponding date on the page.
Updated 4 months ago