Skip to main content

COPY INTO <table>

The COPY INTO command in Databend allows you to load data from files located in multiple locations. This is the recommended method for loading a large amount of data into Databend.

One of its key features is that it provides idempotency by keeping track of files that have already been processed for a default period of 7 days, you can customize this behavior using the load_file_metadata_expire_hours global setting.

The files must already be staged in one of the following locations:

  • Named internal stage.
  • Named external stage that references an external location (Amazon S3-compatible Storage, Google Cloud Storage, or Microsoft Azure).
  • External location:
    • Buckets created in Supported Object Storage Solutions.
    • Remote servers from where you can access the files by their URL (starting with "https://...").
    • IPFS.

Syntax

/* Standard data load */
COPY INTO [<database>.]<table_name>
FROM { internalStage | externalStage | externalLocation }
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET} [ formatTypeOptions ] ) ]
[ copyOptions ]

/* Data load with transformation(Only support Parquet format) */
COPY INTO [<database>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
FROM ( SELECT [<file_col> ... ]
FROM { internalStage | externalStage } )
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = ( TYPE = {PARQUET} [ formatTypeOptions ] ) ]
[ copyOptions ]

internalStage

internalStage ::= @<internal_stage_name>[/<path>]

externalStage

externalStage ::= @<external_stage_name>[/<path>]

Also see How to Create External Stage.

externalLocation

Databend's external location is a feature that allows users to access data stored outside of Databend's internal storage system, such as in cloud storage services like AWS S3 or Azure Blob Storage.

By defining an external location, users can query data stored in external systems directly from Databend without having to load it into Databend's internal storage.

externalLocation ::=
's3://<bucket>[<path>]'
CONNECTION = (
ENDPOINT_URL = 'https://<endpoint-URL>'
ACCESS_KEY_ID = '<your-access-key-ID>'
SECRET_ACCESS_KEY = '<your-secret-access-key>'
ROLE_ARN = '<your-ARN-of-IAM-role>'
EXTERNAL_ID = '<your-external-ID>'
SESSION_TOKEN = '<your-session-token>'
REGION = '<region-name>'
ENABLE_VIRTUAL_HOST_STYLE = 'true|false'
)
info

To access your Amazon S3 buckets, you can use one of two methods: providing AWS access keys and secrets, or specifying an AWS IAM role and external ID for authentication.

By specifying an AWS IAM role and external ID, you can provide more granular control over which S3 buckets a user can access. This means that if the IAM role has been granted permissions to access only specific S3 buckets, then the user will only be able to access those buckets. An external ID can further enhance security by providing an additional layer of verification. For more information, see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-role.html

ParameterDescriptionRequired
s3://<bucket>[<path>]External files located at the AWS S3 compatible object storage.Required
ENDPOINT_URLThe bucket endpoint URL starting with "https://". To use a URL starting with "http://", set allow_insecure to true in the [storage] block of the file databend-query-node.toml.Required
ACCESS_KEY_IDYour access key ID for connecting the AWS S3 compatible object storage. If not provided, Databend will access the bucket anonymously.Optional
SECRET_ACCESS_KEYYour secret access key for connecting the AWS S3 compatible object storage.Optional
ROLE_ARNAmazon Resource Name (ARN) of an AWS Identity and Access Management (IAM) role.Optional
EXTERNAL_IDYour external ID for authentication when accessing specific Amazon S3 buckets.Optional
SESSION_TOKENYour temporary credential for connecting the AWS S3 service.Optional
REGIONAWS region name. For example, us-east-1.Optional
ENABLE_VIRTUAL_HOST_STYLEIf you use virtual hosting to address the bucket, set it to "true".Optional

FILES = ( 'file1' [ , 'file2' ... ] )

Specify a list of one or more files names (separated by commas) to be loaded.

PATTERN = 'regex_pattern'

A PCRE2-based regular expression pattern string, enclosed in single quotes, specifying the file names to match. Click here to see an example. For PCRE2 syntax, see http://www.pcre.org/current/doc/html/pcre2syntax.html.

FILE_FORMAT

See Input & Output File Formats.

copyOptions

copyOptions ::=
[ SIZE_LIMIT = <num> ]
[ PURGE = <bool> ]
[ FORCE = <bool> ]
[ DISABLE_VARIANT_CHECK = <bool> ]
[ ON_ERROR = { continue | abort | abort_N } ]
[ MAX_FILES = <num> ]
ParameterDescriptionRequired
SIZE_LIMITSpecifies the maximum rows of data to be loaded for a given COPY statement. Defaults to 0 meaning no limits.Optional
PURGEIf True, the command will purge the files in the stage after they are loaded successfully into the table. Default: False.Optional
FORCEDefaults to False meaning the command will skip duplicate files in the stage when copying data. If True, duplicate files will not be skipped.Optional
DISABLE_VARIANT_CHECKIf True, this will allow the variant field to insert invalid JSON strings. Default: False.Optional
ON_ERRORDecides how to handle a file that contains errors: 'continue' to skip and proceed, 'abort' to terminate on error, 'abort_N' to terminate when errors ≥ N. Default is 'abort'. Note: 'abort_N' not available for Parquet files.Optional
MAX_FILESSets the maximum number of files to load that have not been loaded already. The value can be set up to 500; any value greater than 500 will be treated as 500.Optional
tip

When importing large volumes of data, such as logs, it is recommended to set both PURGE and FORCE to True. This ensures efficient data import without the need for interaction with the Meta server (updating the copied-files set). However, it is important to be aware that this may lead to duplicate data imports.

Examples

1. Loading Data from an Internal Stage

COPY INTO mytable
FROM @my_internal_s1
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET);

2. Loading Data from an External Stage

COPY INTO mytable
FROM @my_external_s1
PATTERN = 'books.*parquet'
FILE_FORMAT = (TYPE = PARQUET);
COPY INTO mytable
FROM @my_external_s1
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET);

3. Loading Data from External Locations

This example reads 10 rows from a CSV file and inserts them into a table:

-- Authenticated by AWS access keys and secrets.
COPY INTO mytable
FROM 's3://mybucket/data.csv'
CONNECTION = (
ENDPOINT_URL = 'https://<endpoint-URL>'
ACCESS_KEY_ID = '<your-access-key-ID>'
SECRET_ACCESS_KEY = '<your-secret-access-key>'
)
FILE_FORMAT = (type = CSV field_delimiter = ',' record_delimiter = '\n' skip_header = 1)
SIZE_LIMIT = 10;

This example loads data from a CSV file without specifying the endpoint URL:

COPY INTO mytable
FROM 's3://mybucket/data.csv'
FILE_FORMAT = (type = CSV field_delimiter = ',' record_delimiter = '\n' skip_header = 1)
SIZE_LIMIT = 10;

4. Loading Data with Pattern Matching

This example uses pattern matching to only load from CSV files containing sales in their names:

COPY INTO mytable
FROM 's3://mybucket/'
PATTERN = '.*sales.*[.]csv'
FILE_FORMAT = (type = CSV field_delimiter = ',' record_delimiter = '\n' skip_header = 1);

Where .* is interpreted as zero or more occurrences of any character. The square brackets escape the period character (.) that precedes a file extension.

If you want to load from all the CSV files, use PATTERN = '.*[.]csv':

COPY INTO mytable
FROM 's3://mybucket/'
PATTERN = '.*[.]csv'
FILE_FORMAT = (type = CSV field_delimiter = ',' record_delimiter = '\n' skip_header = 1);

5. Loading Data with AWS IAM Role

-- Authenticated by AWS IAM role and external ID.
COPY INTO mytable
FROM 's3://mybucket/'
CONNECTION = (
ENDPOINT_URL = 'https://<endpoint-URL>',
ROLE_ARN = 'arn:aws:iam::123456789012:role/my_iam_role',
EXTERNAL_ID = '123456'
)
PATTERN = '.*[.]csv'
FILE_FORMAT = (type = CSV field_delimiter = ',' record_delimiter = '\n' skip_header = 1);

6. Loading Data with Compression

This example reads 10 rows from a CSV file compressed as GZIP and inserts them into a table:

COPY INTO mytable
FROM 's3://mybucket/data.csv.gz'
CONNECTION = (
ENDPOINT_URL = 'https://<endpoint-URL>',
ACCESS_KEY_ID = '<your-access-key-ID>',
SECRET_ACCESS_KEY = '<your-secret-access-key>'
)
FILE_FORMAT = (type = CSV field_delimiter = ',' record_delimiter = '\n' skip_header = 1 compression = AUTO)
SIZE_LIMIT = 10;

7. Loading Parquet Files

COPY INTO mytable
FROM 's3://mybucket/'
CONNECTION = (
ACCESS_KEY_ID = '<your-access-key-ID>',
SECRET_ACCESS_KEY = '<your-secret-access-key>'
)
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET);

8. Controlling Parallel Processing

In Databend, the max_threads setting specifies the maximum number of threads that can be utilized to execute a request. By default, this value is typically set to match the number of CPU cores available on the machine.

When loading data into Databend with COPY INTO, you can control the parallel processing capabilities by injecting hints into the COPY INTO command and setting the max_threads parameter. For example:

COPY /*+ set_var(max_threads=6) */ INTO mytable FROM @mystage/ pattern='.*[.]parq' FILE_FORMAT=(TYPE=parquet);

For more information about injecting hints, see SET_VAR.