Data Import | AllegroGraph 8.2.0

Introduction

The AllegroGraph Loader is an efficient tool for loading data into AllegroGraph. It can load data in many formats, listed below in the Specifying Sources and the Loading non-RDF data sections.

The AllegroGraph Loader is one of the agtool utilities. It imports data into a repository quickly and easily. When possible, it makes use of multiple CPU cores to load the data in parallel.

The agtool program is the general program for AllegroGraph command-line operations. (In earlier releases, there was a separate agload program. A change from earlier releases is the -u option, previously a short version of the --base-uri option, is no longer supported.)

Bulk data can be loaded with tools besides agtool load. New WebView and traditional AGWebView both have loading tools. The various programming languages which can be used with AllegroGraph (Java, Python, and Lisp) all also have functionality for loading data from files. See addFile in Python, add in Java, and Adding Triples in the Lisp Reference for Lisp.

Usage

If you run agtool load on the same machine as the AllegroGraph server and as the same user as the user who started the AllegroGraph server, then agtool load runs as if run by a user with AllegroGraph superuser privileges even if no username or password is provided (because of OS authentication).

If run on the same machine as a different user or on a different machine, then the username of a user with write privileges for the repository into which data will be loaded and that user's password must be provided. In that case, you must use a REPO SPEC to designate the repository as there is no other way to specify a username and password in the command.

(In earlier versions of AllegroGraph, agtool load could only be run by the user who started the AllegroGraph server on the same machine as the server is running. That restriction no longer applies.)

The version of agtool must be the same as the AllegroGraph server (that is, agtool must be the one distributed with the server, not from an earlier or later version of AllegroGraph).

The agtool load command line is:

agtool load [OPTIONS] REPO_SPEC SOURCE*

where

OPTIONS are arguments that affect how data is loaded, what indices are used, whether the repository is optimized after loading completes, what the graph component should be if it is not specified in the source, and other aspects of loading. For non-RDF data files, arguments also determine how the data should be transformed into triples. See the several Options sections below for a complete list. One important option is --input, which specifies the format of the data. (Files with meaningful extensions, for example nt for N-Triple format and ttl for Turtle format, will be assumed to be of the associated format if --input is not specified.) Another important option is --supersede. If specified when the repository specified by REPO_SPEC already exists, that repository will be deleted, then created again and filled with the loaded data. All existing data will be lost, recoverable only from backups, assuming backups exist.
REPO_SPEC provides information about the repository into which the data will be loaded, the AllegroGraph user issuing the load command, and the host running the AllegroGraph server. See the Repository Spec section below for information on the REPO_SPEC and also the Repository Specification document.
SOURCE is either a file name, a dash (-) or an @ followed by a file name (see below). SOURCE can be specified multiple times. Source files can be (1) files in one of the various formats that specify triples or quads directly or (2) non-RDF data files from which triples can be constructed based on other arguments to agtool load. See Specifying sources below for more information.

Here is a simple agtool load command line. The REPO_SPEC names the repository (my-repository) but does not specify the catalog (which defaults to the root catalog), the user (which means the user must be the one who started AllegroGraph), the host (which defaults to localhost), or the port (which defaults to 10035). The source file is mydata-1 and has no extension so its format is specified with the --input argument at N-Triples.

agtool load --input ntriples my-repository mydata-1

Here is a more complex example using a REPO_SPEC which specifies the user and password (test:xyzzy) host (localhost), the port (10777), the catalog (my-catalog), and the repository (my-repository). Multiple input files are specified, each with an extension which indicates the data format, ttl, meaning Turtle format, so --input need not be specified:

agtool load http://test:xyzzy@localhost:10777/catalogs/mycatalog/repositories/my-database mydata-2.ttl mydata-3.ttl

The specified catalog must already exist on the server. Different catalogs can contain repositories with the same names so if you specify a catalog, you are calling for the use or creation of the repository with the specified name in that catalog.

If a repository named by the REPO_SPEC argument does not exist, it will be created. If the repository exists, it will be opened and data from the input files will be added to it. If the --supersede option is suppled and the repository exists, it will be deleted and a new one created. Use that option with caution as superseded data is gone and cannot be recovered except from backups.

Repository Specification

Repository specifications are fully described in the Repository Specification document, which describes various short form ways to specify repositories. Here we give a shorter description of repository specifications without describing the various short forms.

The general form of a REPO_SPEC is

[http(s)://[USER:PASSWORD@]HOST[:PORT][/catalogs/CATALOG]/repositories/]REPO

The various elements are:

The scheme: Either http or https.
USER and PASSWORD: These must be supplied unless agtool load is being run by the same user who started the AllegroGraph server on the same machine where the server is running. The user specified must have write permission in the repository if it exists or in the catalog if the repository will be created.
HOST: If agtool load is being run on the same machine as the server, the value should resolve to 127.0.0.1 (127.0.0.1 and its equivalent localhost usually being suitable values). The default is 127.0.0.1. Otherwise, the hostname or IP address of the machine running the AllegroGraph server.
PORT: The port on which the AllegroGraph server on HOST is listening. The default is 10035 and this value (and thus the HOST) must be specified if a different port is being used.
CATALOG: The name of the catalog which contains the REPOSITORY (if it already exists) or will contain it (if it is being created). Catalogs are defined in the configuration file and can only be created at server startup time so the specified CATALOG must already exist. If not specified, defaults to the root catalog.
RESPOSITORY: The repository into which the data will be loaded. If it exists and the --supersede option is not specified, loaded data will be added to any existing data. If it does not exist, it will be created. If the --supersede option is specified, the repository will be deleted, if it exists, and then (re)created, and the data loaded into it.

In its simplest form, a REPO_SPEC is just a repository name, which then expands to

http://127.0.0.1:10035/catalogs/root/repositories/repository-name

If anything must have a value other than the default shown, the complete form must be used, possibly leaving out the [USER:PASSWORD@] and, if the catalog is the root catalog, [/catalogs/root]. (See the Repository Specification document for other short form options for specifying a repository.)

In earlier releases, there were options --port and --catalog (which had abbreviations). These are deprecated but are still accepted. If the REPO_SPEC provides a value for either, the corresponding option, if supplied, must specify the same value. Specifying any of these argument signals a warning that the argument is deprecated and will cease to be supported in a later release.

Specifying sources

Source files can either specify actual triples or quads, or they can contain non-RDF data which will, based on additional arguments, be transformed into triples.

File extensions typically tell agtool load what is the format of the file to be loaded. Recognized file extensions are listed in the Loading options section below. The --input argument to agtool load specifies the format if there is no meaningful extension.

Files on Amazon S3

If a file is location on Amazon S3, you must call agtool load with AWS authentication on the command line as specified in the section Accessing and operating on files on Amazon S3 in the agtool document. Files in S3 must be prefaced by s3://, like the following:

s3://bucketname/a/b/c/filename

Sources That Specify Triples Or Quads

Files that specify triple or quads can be in the following supported formats:

Non-RDF Data Sources

Files that specify non-RDF data can be in the following supported formats:

These are transformed into triples following rules specified by arguments to agtool load described in the section Loading non-RDF Data below.

Specifying - as the SOURCE argument

If the character - is given as a SOURCE, it will be interpreted as standard input. This allows agtool load to accept data directly from other programs. For example:

cat foo.nt bar.nt baz.nt | agtool load --input ntriples test_repo -

Loading lists of files

Loading entire directories (by specifying the directory as the FILE argument) is not supported, but wildcards may be used. Wildcards will be expanded by the shell, as expected. Also, you can specify a single file which contains a list of files to load, as we describe next.

If FILE begins with the character @, agtool load interprets everything after the @ as a file name which contains a list of actual source files. By default the file name items are separated by a CR, LF or CRLF sequence. That is the usual case and covers most actual situations.

Because UNIX allows newlines to appear in any file name, agtool load will permit the files to be separated by NULL characters. If you want this, you must specify the -0 option (the numeral zero, not an uppercase letter o). Here is an example:

agtool load -0 test_repo @null-separated-list-of-sources.txt

Use the @file-name syntax when there are more source files than the shell will pass to agtool load as arguments.

Loading files from HDFS filesystems

agtool load supports loading files from HDFS (Hadoop Distributed File System). In order for the load to succeed, the following conditions must be met:

There must be a working hdfs program in your PATH.
HDFS input files must use the hdfs:// prefix when specified on the agload command line

Here is a sample command line for loading the file lubm-50.nt.gz:

agtool load repo hdfs:///user/bruce/lubm-50.nt.gz

When those conditions are met, agtool load works with HDFS files just as with other files. The command line can mix HDFS files with other files.

Loading from HDFS file systems has been tested with the Cloudera Hadoop distribution.

File encoding

N-Triples, N-Quad, NQX, and Turtle files must have character encoding UTF-8. RDF/XML, TriG, and TriX files use character encodings as defined by standard XML parsing rules. Conversion programs outside of AllegroGraph (such as iconv) can be used to convert files to UTF-8 format if necessary. All such conversions must be done prior to processing by agtool load.

Optimal file types

agtool load is optimized for loading N-Triple and N-Quad files. Loading of files of those types will be spread over multiple cores even if only a single file is specified. agtool load will spread loading of multiple files of other types (Turtle, TriX, TriG, RDF/XML) over multiple cores but each single file will be loaded by a single core. Therefore loading a single, large Turtle file takes much longer than loading a single, equivalently large N-Triple or N-Quad file on machines with multiple cores.

The Rapper program, from http://librdf.org/raptor/rapper.html, will convert files from one format to another.

Loading triple attributes

AllegroGraph supports attributes for triples, as described in the Triple Attributes document. Attributes are name/value pairs associated with triples. Attributes can only be associated with triples when the triple is added to the repository. Using agtool load to load an NQX file is one way to associate attributes with triples. Only Extended N-Quad files (NQX), with type .nqx, can specify different attributes for each triple. Here are two lines from an NQX file specifying attributes:

<http://dbpedia.org/resource/Arif_Babayev> <http://dbpedia.org/property/placeOfDeath> <http://dbpedia.org/resource/Baku> <http://ex#trans@@1142684573200001> {"color": "red"} .  
<http://dbpedia.org/resource/Arif_Naqvi> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Businesspeople_from_Karachi> {"color": "blue"} .

The first includes a graph, the second does not. The loader distinguishes them because it can recognize the JSON input {"color": "red"} and {"color": "blue"}. The "color" attribute name must be defined in the repository prior to loading an NQX file containing these lines.

The --attributes option (see below) allows specification of default attributes that will be applied to every triple that does not have attributes specified. Note that the default attributes will be applied to every triple loaded from any type of file other than an NQX file, since attributes cannot be specified in any such file.

Attribute names must be defined before use, so all attribute names appearing in an NQX file or as the value of the --attributes option must be already defined. See Defining attributes for information on defining attributes.

agtool load failure

agtool load might fail (meaning fail to successfully load all data from all the specified files) for various reasons, including hardware failure and network problems. The most common cause of failure is corrupted or invalid input files.

If agtool load encounters invalid data, its behavior depends on the --error-strategy option desribed below, but by default it will stop loading data.

Note that agtool load adds triples to the repository as it works. Therefore, if agtool load fails (for whatever reason) after it has begun, you may find some but not all triples added, and you should be prepared for this possibility. Here are some ways to be prepared:

Start with an empty repository and do not add triples from other sources until agtool load completes: then if agtool load fails, you can delete the repository and start again.
Backup the repository prior to loading with agtool load and do not add triples from other sources until agtool load completes: then if agtool load fails, you can restore the repository from the backup and start again (see Repository Backup and Restore for information on backing up and restoring).

But sometimes you have to load with agtool load while triples are being added from other sources, and you must be prepared in that case to deal with an agtool load failure. (We do not have specific recommendations as there are many possible cases and equally many ways to deal with them but note that if the graph field is not otherwise used, you can often arrange it so triples loaded with agtool load have a different graph than triples from other sources and that makes it easy to tell which were loaded by agtool load and which were added by other means.)

See the --error option below for more information on handling errors.

A more complex example

The following example loads a single source file into AllegroGraph.

agtool load --supersede -v --with-indices "ospgi,posgi,spogi" -e ignore lesmisload lesmis.rdf

In this example, there are several OPTIONS, the REPO_SPEC is lemisload (so HOST is localhost, PORT is 10035, and the CATALOG is the root catalog) and the SOURCE is lemis.rdf. If there is an existing repository named lesmisload in the root catalog, it will be deleted (as specified by --supersede). The program will generate verbose messages (-v) and will ignore any errors (-e ignore). The repository will generate three triple indices: ospgi, posgi and spogi.

Options

The following options may be used on the agtool load command line.

Repository options

These options control the creation and settings for the repository.

--supersede, -s: If the repository named by REPO_SPEC exists, it will be deleted before loading data. --supersede should be used with care as there is no way to recover the deleted repository other than restoring it from a backup. If the repository named by REPO_SPEC exists and is open, the agtool load command will fail.
--fti NAME, -f NAME: Create a free-text index named name.; This index will include both any newly added triples and all existing triples. Managing the free-text index will slow down loading speed somewhat.
--with-indices INDICES: Specify the repsoitory's indices. When supplied, the parameter should be a list of index names separated by commas (for example: spogi,posgi,ospgi,i). If not specified, newly created repositories will use the standard set of indices and the indices of existing repositories will remain unchanged. ¹

--optimize: Perform a full level 2 optimization of all indices after the load completes (default: Do not optimize). Index optimization is discussed in the Optimizing indices section of the Triple Indices document.

--bulk: Enable bulk mode while processing the loading job. Bulk mode turns off transaction log processing and can provide considerable performance gain for large jobs. Because an unexpected failure during a bulk load can result in an unrecoverably corrupted repository, we recommend you make a backup before using this option.; Specifying this option will cause the load to fail when the repository DBNAME is a replication instance (see the Multi-master Replication document) as replication depends on complete transaction logs.

--attributes

Specifies attributes for any triples which do not have attributes specified in the file being loaded. See the Loading triple attributes section above for more information. Since attributes can only be specified in NQX (extended N-Quad) files, attributes specified by this argument will be applied to all triples loaded from files of all other types. Attributes must be already defined within the repository in order to be values of this argument.

Attributes are specified as a JSON object. Here is an example:

--attributes '{ "pet": "dog", "key2": [ "red", "blue" ] }'

We have surrounded the value with single quotes to avoid interpretation of the braces ({ and }) or the double quotes as shell metacharacters. Three attributes are added, one with name "pet" and and two with name "color". The "pet" attribute has the value "dog", one "color" attribute has the value "red", and the other has the value "blue".

--parameter NAME=VALUE: When the action of agtool load will create a new repository, either because the DBNAME argument names a repository which does not already exist in the specified catalog (the root catalog if no catalog is specified), or because the --supersede argument is specified, then the repository will be created using the various catalog directives listed in the Catalog directives section of the Server Configuration and Control document. Such directives include StringTableSize, StringTableCompression, ExpectedStoreSize, and so on. Many of these directives are inheritable, meaning they can be specified at the top-level and apply to any catalog that does not also specify a value in its definition, so if you are looking in a configuration file for the value which will be used, look at the top-level directives as well as the specific catalog definition.; Only one directive (NAME=VALUE) can be specified for each occurrence of --parameter but the option can be specified as often as desired.

--automate-nd-datatype-mappings: Causes the loader to check for geospatial datatypes and set up mappings automatically (default: no).

Loading options

These options control how agtool load processes data sources.

--input FORMAT, -i FORMAT

Specify the input format to use. The recognized values are: ntriples (and ntriple, nt), nquads (and nquad, nq), nqx, jsonld, trix, trig, turtle (and ttl), rdfxml (and owl, rdf, rdfs, rdf/xml), csv, json, jsonl, and guess.

The default is guess. Use guess if (1) All sources have a recognizable extension, and (2) Every file is actually of the format indicated by its extension.

Recognized N-Triple extensions (--input FORMAT is ntriples, ntriple, or nt):

.nt, .ntriple, .ntriples, .nt.gz, .ntriple.gz, .ntriples.gz,  
.nt.bz2, .ntriple.bz2, and .ntriples.bz2.

Recognized N-Quads extensions (--input FORMAT is nquads, nquad, or nq):

.nq, .nquad, .nquads, .nq.gz, .nquad.gz, .nquads.gz, .nq.bz2,  
.nquad.bz2, and .nquads.bz2.

Recognized Extended N-Quad extension (see the Loading triple attributes section above for information on NQX files, --input FORMAT is nqx):
```
.nqx, .nqx.gz, and .nqx.bz2 
```
Recognized JSON-LD extensions (--input FORMAT is jsonld):
```
.jsonld, .jsonld.gz, .jsonld.bz2 
```

Recognized Turtle extensions (--input FORMAT is turtle or ttl):

.ttl, .turtle, .ttl.gz, .turtle.gz, .ttl.bz2, and .turtle.bz2.

Recognized TriX extensions (--input FORMAT is trix):
```
.trix, .trix.gz, and .trix.bz2. 
```
Recognized TriG extensions (--input FORMAT is trig):
```
.trig, .trig.gz, and .trig.bz2. 
```

Recognized RDF/XML extensions (--input FORMAT is rdfxml, owl, rdf, rdfs, or rdf/xml):

.rdf, .rdfs, .owl, .rdf.gz, .rdfs.gz, .owl.gz, .rdf.bz2, .rdfs.bz2, and .owl.bz2.

Recognized CSV extensions (--input FORMAT is csv):
```
.csv, .csv.gz, .csv.bz2 
```
Recognized JSON and JSONlines extensions (--input FORMAT is json or jsonl):
```
.json, .jsonl, .json.gz, .json.bz2, .jsonl.gz, .jsonl.bz2 
```

If a format other than guess is specified, it will take precedence over a file's extension. For example, if you have an N-Triples file named triples.rdf and specify --input ntriples then triples.rdf will be parsed as an N-Triples file.

If you use multiple source formats in one agtool load command, then you need to ensure that the source file's extensions match their contents. Otherwise, you will need to use multiple command invocations.

The only two compression formats handled by agtool load are gzip and bzip2. Any files which are compressed must be named with .gz or .bz2 extensions in order to be decompressed. All supported formats permit .[format].gz and .[format].bz2 extensions, allowing agtool load to determine the data format from the [format] portion. If a file has extension .gz or .bz2 without also specifying the format, you must use the --input option. For example, if you are loading a gzip'ed N-Quads file and the file is named btc-2011-chunk-045.gz, then you must specify --input nquads.

The use of standard input with agtool load (by specifying FILE to be -) always requires a non-default value for the input format, since standard input has no file type.

--error-strategy STRATEGY, -e STRATEGY

The available options for error strategy are cancel, ignore, print and save. Cancel is the default, and will stop the loading process as soon as an error is detected.

Error strategy ignore will attempt to silently continue upon an error, although it will print a warning (and then continue) when an entire file is skipped because its input format could not be determined (usually because the file type is not recognized or is missing).

Error strategy print is like ignore but will print the error to standard output. Error strategy save is like print but will also log the error to agload.log in the current working directory.

Error strategy applies to all recoverable errors, not just parsing errors.

Remember that if agtool load fails for whatever reason, some triples will have been added to the repository and some will not have been added (except for very unusual edge cases).

--error-strategy ignore will in some cases print warnings even though it will not stop processing files or loading triples. For example, if a file does not have a recognizable type and its format has not been specified with --input, the file will be skipped when the --error-strategy is ignore but a warning will be printed:

agtool load -e ignore repo wrong-file  
Cannot guess the format of wrong-file. Use --input to specify it.  
Load finished 0 sources in 00:00:01 (1.00 seconds).  No triples added.  
Terminating agtool load processes, please wait...

--loaders COUNT, -l COUNT

The loaders option specifies the number of processes which will be connecting to AllegroGraph and committing triples. It is for optimization of agtool load performance. The system will try to use as many cores as you specify.

The default depends on the number of cores on your server. If you have 1 or 2 cores, loaders will default to 1. If you have 4 cores, loaders will default to 3. For more cores the default is the number of cores minus one, up to a maximum of 32. agtool load also has a task dispatcher process and AllegroGraph has its own processes.

If you are not getting satisfactory performance for your load, try increasing or decreasing the number of loaders. If your data has no blank nodes, you may want to set the number of loaders to the number of logical cores on the machine and use --blank none. If you have files dense with blank nodes try decreasing the number of loaders to free up machine resources. For example on an 8 core, 48GB hyperthreaded server, we use --loaders 5 for good performance while loading Billion Triple Challenge. For Lubm-8000 we use --loaders 16.

--base-uri URI

Specify a base URI for formats that support it, such as RDF/XML or Turtle. Note that if standard input is a source and rdfxml is the input format, --base-uri must be specified. (In earlier releases, -u was accepted as a short version of this option. -u is no longer accepted. You must use --base-uri.)

--graph GRAPH, -g GRAPH

Specify a default graph for formats that support it, such as N-Triples. Special values are:

:default, use the default graph node for the repository (this is the default value for graph).
:source, use the source of the triple (i.e. the filename) as the default graph. This cannot be used if standard-input is used a source.
:blank, generate a blank node before loading triples and use it as the default graph.
:root, assign all triples the same graph as the subject of the toplevel JSON-LD object

Any other value is interpreted as a resource (URI) or literal and used as the default graph. Note that strict RDF does not allow literals to be used in the graph slot.

The three special values start with a colon (:) to allow for the usage of default, source, or blank as graph names. See the Examples using the graph option section for more information on the use of this option.

Formats that include the fourth element (like N-Quads) will use the default-graph only for the data that does not explicitly specify it.

JSON-LD note: A JSON-LD error is now signaled if a string which is neither a resource nor a URL is passed to --graph argument. :root is supported as a special value of --graph`, resulting in using the toplevel subject as a default graph for all triples added for the JSON-LD object with that subject.

--external-references, -x

If specified, then external references in RDF/XML and JSON-LD source files will be followed during load.

--external-reference-timeout

If specified, changes the default value of timeout for HTTP requests to external context documents referenced in JSON-LD source files.

--metadata FILENAME

Load attribute definitions and the static filter definition from FILENAME. See Triple Attributes for information on attributes.

Less common options

These options are useful in specific circumstances but do not generally need to be used.

--help, -h

Print the command line summary.

--verbose, -v

The presence of the verbose option will cause additional information to be printed to standard output. This argument can be specified multiple times to increase the verbosity. We recommend using --verbose --verbose if you encounter a problem with loading, but note --verbose --verbose generates a lot of output when loading many files, so it may fill a terminal's scrollback. See also the --debug option which also may be useful when an error occurs during loading. --debug -v -v provides maximum information about loading.

Specifically, the verbosity levels are:

0 (-v not supplied): Only report periodic load rate information.
1 (-v): As above, plus print the job option summary before starting the operation.
2 (-v -v): As above, plus print the name of every file that has been processed.

Specifying more than two -vs is equivalent to specifying two.

--quiet

Reduce output (default: no).

--blank STRATEGY, -b STRATEGY

Determine how to handle blank node identifiers in N-Triple and N-Quad files. STRATEGY must be one of file, job or none.

By default, blank node identifiers are scoped to the source in which they appear. I.e., the blank node _:b1 in file file1.nt is considered to be different than the blank node _:b1 in file file2.nt. AllegroGraph calls this the file strategy and uses it as the default.

Blank node strategy job will consider all blank nodes found in N-Triple and N-Quad files to be in the same "scope". This means that the _:b1 in file1.nt will be considered to be the same as the one found in file2.nt.

Blank node strategy none will cause agtool load to error if any sources contains blank nodes. Loading is faster when the blank node strategy is none.

Note that the blank node strategies of job and file only apply to N-Triple and N-Quad sources. Other formats such as RDF/XML or Turtle are defined to have a blank node scope of the file and this option is ignored. Using a blank node strategy of none will, however, still signal an error if any source files contains a blank node.

--debug

If specified, additional information will be printed when an error occurs.

Typically, --debug is useful only when agtool load returns an exit code of 1, indicating that there was an unhandled error. If this occurs, re-run agtool load using the debug option and send the output to AllegroGraph support for more assistance. The option also causes information to be written to agload.log. See also the --verbose option, which causes other information about loading to be output. --debug -v -v provides maximum information about loading.

--relax-syntax

For N-Triples and N-Quad files, this flag tells AllegroGraph to ignore certain syntax errors. In particular:

Blank node names may use underscore (_) and dash (-) characters.
Literals may be used in the Graph position (for N-Quads).
URIs are not required to include a colon (:).

This is a non-standard parser extension and should only be used when necessary.

--duplicates SETTING, -d SETTING

Changes the handling of duplicate triples in the repository. The valid values for SETTING are keep (the default); delete, meaning delete all but one of all groups of triples that match subject, predicate, object, and graph; and delete-spo, meaning delete groups that match subject, predicate, and object, regardless of graph. In AllegroGraph, triples are deleted only by the function delete-duplicate-triples. If duplicate deletion is specified, that function is (in effect) called at the end of data loading with arguments suitable for the specified argument, and the load completes when duplicates are deleted. This argument does not affect things after the load is complete as future duplicate deletions are only done when delete-duplicate-triples is called.

--dry-run

If specified, then print the loading strategy and stop. I.e., no triples will be loaded.

agtool load loads N-Triples and N-Quads files most efficiently so it can be faster to convert source files before loading them. More information on using rapper and AllegroGraph is described in our documentation.

--null, -0

Use to specify that the file specified after an @ sign in the SOURCE inputs is a null separated list rather than a newline separated one. This is useful for loading files with newlines or other strange characters in their names.

Custom options for specific formats

--json-ld-context CONTEXT: If a string context, a pathname or a URI is provided it is loaded (for pathnames and URIs), parsed and used as a top-level JSON-LD context.
--json-store-source YES/NO: Store the raw JSON (JSONlines, JSON-LD) representation of the object that was loaded (default: yes).

Deprecated and Removed Options

These options are deprecated but still accepted (though they will generate a warning if specified). All can be specified in the REPO_SPEC argument, where they are all described.

--host: See the REPO_SPEC section above for information on this argument.
--port PORT, -p PORT: See the REPO_SPEC section above for information on this argument.
--scheme SCHEME: See the REPO_SPEC section above for information on this argument.
--catalog CATALOG,-c CATALOG: See the REPO_SPEC section above for information on this argument.

The following option is no longer supported:

--encoding ENCODING, -C ENCODING: In earlier releases, this option could be used to specify the character encoding of the file(s) being loaded. This is no longer supported. N-Triple, N-Quad, NQX, and Turtle files must use UTF-8 encoding. RDF/XML, TriG, and TriX files use character encodings as defined by standard XML parsing rules. There are conversion programs, such as iconv, which will convert files to UTF-8 character encoding.

The use of the following option has been deprecated as it is no longer needed:

--dispatch-strategy STRATEGY: STRATEGY must be auto or file. The dispatch strategy tells agtool load how it to parallelize loading.; The default is auto which is combination of dispatch strategies based on file format. N-Triple and N-Quad files will be broken up and loaded in pieces. --dispatch-strategy file means that no files are broken into pieces for loading. There is no reason to specify file (which was useful in much earlier releases and is kept for backward compatibility). Note that RDF/XML, Turtle, TriG, and TriX formats are always dispatched on a file basis regardless of the value of this option.

Loading RDF-star data

RDF-star and SPARQL-star are an evolving set of specifications described in this document. RDF-star formats, in short, can contain triples about other triples in the data. Triples that are subjects or objects of other triples are called quoted triples. AllegroGraph supports loading of several RDF-star formats: Turtle-star, TriG-star, N-Triples-star, N-Quads-star. Note that for now RDF-star mode is not enabled for the repository automatically, so in order to load Turtle-star data into a new repository, you have to create the repository beforehand and enable RDF-star mode for it explicitly:

% agtool repos create my-repository --rdf-star-mode  
% agtool load my-repository rdf-star-data.ttl

% agtool repos create my-repository  
% agtool rdf-star-mode enable my-repository  
% agtool load my-repository rdf-star-data.ttl

See RDF-star document for more information.

Loading non-RDF data

Besides various file formats representing triples, agtool load can also load data from common data exchange formats and transform it into triples. The supported formats include: JSON, JSONlines, and CSV.

Here are some examples of the typical use of this feature:

% agtool load -i csv --tr-id 'http://example.com/empId/${id}' \  
  --tr-prefix http://example.com/ \  
  --tr-transform 'boss=http://example.com/empId/${boss}' \  
  --tr-type hired=date \  
  --csv-columns id,first,last,salary,position,hired,boss,website \  
  my-repository staff.csv

The command above will import the CSV file staff.csv into the repository my-repository in the root catalog with columns named id, first, last, salary, position, hired, boss, and website. The columns will be transformed into the predicates, with the prefix http://example.com/ (specified by the --tr-prefix http://example.com/ arguments). In the process it will:

generate the subject for each triple according to the template http://example.com/empId/${id};
transform the object of the triple with the predicate boss according to the template http://example.com/empId/${boss};
add the type xsd:date to the object of the triple with the predicate hired.

Here is another example, where the file type is JSONlines:

% agtool load --supersede -i jsonl --tr-skip first \  
         --tr-skip last --tr-transform 'foaf:name=$first $last' \  
         --tr-lang position=en --tr-graph salary=g2 \  
         my-repository staff.json

This will import the JSONlines file staff.json into the existing repository my-repository in the root catalog, creating triples for each key-value pair except for the keys first and last, which are handled specially. Pairs with keys first and last will cause a triple to be added with the predicate foaf:name and an object constructed from the concatenation of the values of first and last. In addition, the triple created from the key-value pair with a key salary will be put into a graph g2, and the object of the triple with the predicate position will be assigned language en.

Below is the description of the set of options that are applicable to such files.

Load-transform options

These options allow for performing transformations on the input data or specifying format-specific customizations of the load procedure.

--transform-rules URI

Specify the subject for the transform rules RDF definitions in the current graph. If this option is used the rules must be stored in the repository in the form of triples. See transform rules examples for a more detailed explanation of RDF transform-rules definition.

--tr-id TEMPLATE

Specifies the template for the subject of all imported triples. The template must be a string which may include one or more references to keys in the form $key or ${key}. Here is an example:

--tr-id http://example.com/empId/${first}_${last}

--tr-prefix KEY=URI

Specifies the default prefix for a specific predicate. Here is an example (specify prefix for the key 'salary'):

--tr-prefix salary=http://example.com/staff/

--tr-rename KEY=VALUE

Specifies the renaming of a particular key. Here is an example:

--tr-rename first=first_name

--tr-type KEY=VALUE

Specifies the type for the values with a particular key. Here is an example (specify 'xsd:float' type for the key 'reading'):

--tr-type reading=xsd:float

--tr-lang KEY=VALUE

Specifies the language for the values with a particular key. Here is an example (specify German language for the key 'desc'):

--tr-lang desc=de

--tr-skip KEY

Indicate that no triple should be created for a particular key. Also see skipall below. This rule overrides --tr-use so if --tr-skipall yes, --tr-use hired, and --tr-skip hired all appear, hired is skipped.

--tr-transform KEY=TEMPLATE or KEY=FUNCTION

Specify a transform template or function that will be applied to the value for a particular key. Here are some examples.

Specify the capitalization of the value of the key 'position':

--tr-transform position=string-capitalize

Specify a creation of a new key 'name' from the template that uses the values of the keys 'first' and 'last':

 --tr-transform "name=$first $last"

The available transform functions are:

string-capitalize
string-downcase
string-upcase
strip-whitespace

--tr-graph KEY=VALUE

Specifies the graph for a particular key. Here is an example (specify graph ' ' for the key 'foo'):

--tr-graph foo=<bar>

--tr-skipall yes

Exclude everything unless included with --tr-use (rather than including everything unless excluded with tr-skip).

--tr-use VALUE

Include VALUE when --tr-skipall has been specified. --tr-use can be specified multiple times. For example this call will skip everything except position, salary, hired,and boss:

agtool load -i csv --tr-id 'http://example.com/empId/$ ' \ --tr-use hired \ --tr-use position \ --tr-use salary \ --tr-use boss \ --tr-skipall yes \ REPO_SPEC staff.csv

--csv-columns COMMA-SEPARATED LIST

If your CSV document doesn't contain a top row with column names you can provide them via this option. Here is an example:

--csv-columns id,first,last,salary

--csv-separator CHARACTER

Specifies the separator character for CSV files (default is the comma ,). The labels space and tab can be used in place of those characters here and in other CSV arguments that specify characters. Here is an example:

--csv-separator tab

--csv-quote CHARACTER

Specifies the quote character for CSV files (default is the double quote ").

--csv-whitespace COMMA-SEPARATED LIST

Specifies the set of characters that are treated as whitespace in CSV files (default is space and tab). The labels space and tab are allowed to represent those characters. Here is an example (use space and underscore characters as whitespace):

--csv-whitespace space,_

--csv-double-quote-p YES/NO

Specifies whether to use double quote for escaping quotes inside CSV values (default: no).

--csv-escape CHARACTER

Specifies the escape character for CSV files (default is the backslash '\').

--json-qualified-keys YES/NO

Specifies whether or not keys in transform rules are treated as paths into nested JSON objects (e.g. parent.child.grandchild). See the example below for more details.

--json-store-source YES/NO

Specifies whether or not JSON source objects are stored in the repository under the http://franz.com/ns/allegrograph/6.4/load-meta#source predicate.

Example using transform rules to load CSV documents

Here is a very simple CSV document to import that is stored in the file staff.csv:

id,first,last,position,salary,hired,boss,website  
1,Bruce,Smith,developer,100000,2001-01-01,3,http://bruce.com  
2,Jim,Crane,CEO,1000000,1995-03-15,,http://jimco.com  
3,Jane,Doe,manager,200000,1995-10-31,2,http://janeco.com

It can be imported with the following agtool command line using the transformation rules expressed as parameters:

agtool load \  
   -i csv \  
   --tr-id 'http://example.com/empId/${id}' \  
   --tr-prefix http://example.com/ \  
   --tr-type hired=date \  
   --tr-type website=uri \  
   --tr-transform 'foaf:name=$first $last' \  
   --tr-skip id \  
   --tr-skip first \  
   --tr-skip last \  
   --tr-type boss=uri \  
   --tr-transform 'boss=http://example.com/empId/${boss}' \  
   --tr-rename boss=manager \  
   --tr-transform position=string-capitalize \  
   --tr-lang position=en \  
   --tr-graph 'salary=<http://example.com/g2>' \  
   REPO-SPEC staff.csv

The same transform-rules can also be expressed as RDF triples:

@prefix ldm: <http://franz.com/ns/allegrograph/6.4/load-meta#> .  
@prefix ldt: <http://franz.com/ns/allegrograph/6.4/load-transform#> .  
 
ldm:tr1 ldm:rule _:r1 ;  
        ldm:rule _:r2 ;  
        ldm:rule _:r3 ;  
        ldm:rule _:r4 ;  
        ldm:rule _:r5 ;  
        ldm:rule _:r6 ;  
        ldm:rule _:r7 ;  
        ldm:rule _:r8 ;  
        ldm:rule _:r9 ;  
        ldm:id "http://example.com/empId/${id}" ;  
        ldm:prefix "http://example.com/" .  
 
_:r1 ldm:key "hired" ;  
     ldm:type "date" .  
 
_:r2 ldm:key "website" ;  
     ldm:type "uri" .  
 
_:r3 ldm:key "position" ;  
     ldm:transform ldt:string-capitalize ;  
     ldm:lang "en" .  
 
_:r4 ldm:key "id" ;  
     ldm:skip true .  
 
_:r5 ldm:key "first" ;  
     ldm:skip true .  
 
_:r6 ldm:key "last" ;  
     ldm:skip true .  
 
_:r7 ldm:key "boss" ;  
     ldm:type "uri" ,  
     ldm:transform "http://example.com/empId/${boss}" ,  
     ldm:rename "manager" .  
 
_:r8 ldm:key "name" ;  
     ldm:prefix "foaf" ;  
     ldm:transform "$first $last" .  
 
_:r9 ldm:key "salary" ;  
     ldm:graph '<http://example.com/g2>' .

Loading the example CSV file with either the provided command-line options or by referencing the transform-rules using the command-line option --transform-rules tr1 will result in creating the following triples:

<http://example.com/empId/1> <http://xmlns.com/foaf/0.1/name> "Bruce Smith" .  
<http://example.com/empId/1> <http://example.com/position>    "Developer"@en .  
<http://example.com/empId/1> <http://example.com/salary>      "100000"^^<xsd:integer> <http://example.com/g2> .  
<http://example.com/empId/1> <http://example.com/hired>       "2001-01-01"^^<xsd:date> .  
<http://example.com/empId/1> <http://example.com/manager>     <http://example.com/empId/3> .  
<http://example.com/empId/1> <http://example.com/website>     <http://bruce.com> .  
 
<http://example.com/empId/2> <http://xmlns.com/foaf/0.1/name> "Jim Crane" .  
<http://example.com/empId/2> <http://example.com/position>    "CEO"@en .  
<http://example.com/empId/2> <http://example.com/salary>      "1000000"^^<xsd:integer> <http://example.com/g2> .  
<http://example.com/empId/2> <http://example.com/hired>       "1995-03-15"^^<xsd:date> .  
<http://example.com/empId/1> <http://example.com/website>     <http://jimco.com> .  
 
<http://example.com/empId/3> <http://xmlns.com/foaf/spec/name> "Jane Doe" .  
<http://example.com/empId/3> <http://example.com/position>     "Manager"@en .  
<http://example.com/empId/3> <http://example.com/salary>       "200000"^^<xsd:integer> <http://example.com/g2> .  
<http://example.com/empId/3> <http://example.com/hired>        "1995-10-31"^^<xsd:date> .  
<http://example.com/empId/3> <http://example.com/manager>      <http://example.com/empId/2> .  
<http://example.com/empId/1> <http://example.com/website>      <http://janeco.com> .

RDF triples for skipall:

The skipall rule with one use rule can be expressed as RDF triples using properties ldm:skipall and ldm:use:

ldm:tr1 ldm:rule _:r1 ;  
        ldm:id "http://example.com/empId/${id}" ;  
        ldm:skipall true .  
 
_:r1 ldm:key "hired" ;  
     ldm:use true .

Example using transform rules to load complex JSON documents

Here is a JSON document containing nested objects and arrays of objects that is stored in the file listing.json:

{  
  "producer": {  
    "name": "Apple",  
    "country": "US"  
  },  
  "product": [  
    {  
      "name": "iPhone",  
      "model": [  
        {  
          "version": 13,  
          "variant": "Mini"  
        },  
        {  
          "version": 15,  
          "variant": "Pro Max"  
        }  
      ]  
    },  
    {  
      "name": "MacBook",  
      "model": [  
        {  
          "year": 2023,  
          "variant": "Pro"  
        }  
      ]  
    }  
  ]  
}

JSON load process traverses the whole JSON document and applies the whole rule set to every nested object, selecting rules whose keys match fields of the nested object currently being processed. The subject for every JSON object is either a fresh blank node or a resource constructed according to the rules (--tr-id rule for the top-level object and --tr-transform rules for the nested ones). For example, the following load operation

agtool load \  
       -i json \  
       --json-store-source no \  
       --tr-prefix 'http://example.org/' \  
       REPO-SPEC listing.json

produces the following data:

@prefix : <http://example.org/> .  
 
_:anon1 :producer _:anon2 ;  
        :product  _:anon3 ;  
        :product  _:anon6 .  
 
_:anon2 :country "US" ;  
        :name    "Apple" .  
 
_:anon3 :model _:anon4 ;  
        :model _:anon5 ;  
        :name  "iPhone" .  
 
_:anon4 :variant "Mini" ;  
        :version 13 .  
 
_:anon5 :variant "Pro Max" ;  
        :version 15.  
 
_:anon6 :model _:anon7 ;  
        :name  "MacBook" .  
 
_:anon7 :variant "Pro" ;  
        :year    2023 .

When using the default transform process, transform rules are only suitable for simple one-level JSON objects or arrays of such objects, because any rules that do not match fields in the current object will be treated as applicable and will create new triples for each nested object. For example, attempting to add an rdf:type declaration

agtool load \  
       -i json \  
       --json-store-source no \  
       --tr-prefix 'http://example.org/' \  
       --tr-transform 'rdf:type=http://example.org/Listing' \  
       REPO-SPEC listing.json

will add rdf:type :Listing triple to each of the anonymous nodes in the example above, which is likely undesired and will require multiple load operations with different skip rules to get right. Also, the context for each rule is limited to the object, in which the corresponding field is located, so the product= rule can only refer to fields directly in the {"producer": ..., "product": ...} object, which is not very useful this example.

In order to provide more flexibility when loading complex JSON documents, the following option (introduced in version 8.1.0) can be used

    --json-qualified-keys yes

This option enables fully qualified JSON keys syntax in transform rules: in order to reference a particular field, a fully qualified path to it must be used, starting from the root of the document.

With qualified keys enabled, users can

specify different rules for fields of different objects regardless of any naming conflicts,
specify rules for constructing subjects for nested objects using any information from the context (only fresh blank nodes can be used reliably without this option),
use qualified field references to access information in the context, which in this case is the whole JSON document, starting from the root.

For example:

agtool load \  
       -i json \  
       --json-store-source no \  
       --json-qualified-keys yes \  
       --tr-prefix 'http://example.org/' \  
       --tr-id 'http://example.org/${producer.name}_${producer.country}' \  
       --tr-transform 'rdf:type=http://example.org/Producer' \  
       --tr-type 'rdf:type=uri' \  
       --tr-skip 'producer' \  
       --tr-transform 'product=http://example.org/${product.name}' \  
       --tr-transform 'rdf:product.type=http://example.org/Product' \  
       --tr-type 'product=uri' \  
       --tr-skip 'product.name' \  
       --tr-transform 'product.model=http://example.org/${product.name}_${product.model.version}_${product.model.variant}' \  
       --tr-type 'product.model=uri' \  
       --tr-transform 'rdf:product.model.type=http://example.org/Product' \  
       --tr-skip 'product.model.version' \  
       --tr-skip 'product.model.variant' \  
       REPO-SPEC listing.json

will produce the following result:

@prefix : <http://example.org/> .  
 
:Apple_US a :Producer ;  
          :product :MacBook ;  
          :product :iPhone .  
 
:MacBook a :Product ;  
         :model :MacBook_2023_Pro .  
 
:MacBook_2023_Pro rdf:type :Product  
 
:iPhone a :Product ;  
        :model :iPhone_13_Mini ;  
        :model :iPhone_15_Pro .  
 
:iPhone_13_Mini a :Product .  
 
:iPhone_15_Pro a :Product .

Note that when an array of objects is encountered during load, the process iterates through the array and applies the rules to each member of the array separately, so fields of objects which are enclosed in lists are only accessible in rules for the fields of those objects. In the example above, the product.model rule can refer to product.name (because product object is accessible from the root of the document), and to product.name and product.model.version (because at that point we know for sure which of the product and model objects from the lists we are processing).

Also please note that --json-qualified-keys yes will become the default in the future and the original behavior will be deprecated and removed, so we advise you to always use the qualified keys.

Examples using the graph option

We have an N-Triples file /tmp/test.nt containing the following single line specifying a subject, predicate, and object but no graph:

<http://franz.com/Node60469> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://franz.com/OFFICER> .

Here are some examples using various values for the --graph argument. If the triple were displayed in N-Quads format, it would look like:

<http://franz.com/Node60469> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://franz.com/OFFICER> [graph] .

with various values of [graph]. In all cases except the first, we only show the value of [graph].

RDF compliance: To be RDF-compliant, graphs must be resources. Other values, such as literals, are not permitted. However AllegroGraph will accept and store non-resource values such as literals. This makes AllegroGraph more flexible with regard to what can be stored, but SPARQL queries which involve graphs will not work with non-compliant graph values. Compliant values are recommended unless there is some good reason to use non-compliant values.

First, we specify :default as the value of the --graph argument, or we equivalently (since the default is :default) leave the argument out:

agtool load --graph :default --verbose --supersede foo /tmp/test.nt

agtool load --verbose --supersede foo /tmp/test.nt

Either call uses the default graph node for the particular AllegroGraph repository, which is not displayed in N-Quads format, so the N-Quads output for the triple is:

<http://franz.com/Node60469> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://franz.com/OFFICER> .

Contrast that with this call:

agtool load --graph :source --verbose --supersede foo /tmp/test.nt

The graph is the source file and [graph] is then:

<file:///tmp/test.nt>

In this next call, --graph is :blank:

agtool load --graph :blank --verbose --supersede foo /tmp/test.nt

agtool load generates a new blank node and uses that as the default graph for the whole job. [graph] is then similar to:

_:bC87E16D1x1

Here we use the resource <http://foo.com/abc#123> for the default graph:

agtool load --graph http://foo.com/abc#123 --verbose --supersede foo /tmp/test.nt

[graph] is then:

<http://foo.com/abc#123>

Non-compliant --graph values such as literals

As said above, AllegroGraph will accept values which are not resources as graph values. Such graph values cannot be used with SPARQL queries (the queries can involve compliant subjects, predicates, and objects even if the graph value is non-compliant). Here are some examples which show how AllegroGraph deals with non-complaint values of --graph:

The value of --graph is a string with escaped quotation marks:

agtool load --graph \"abc123\" --verbose --supersede foo /tmp/test.nt

[graph] is then (this is not RDF-compliant):

"abc123"

The value of --graph is \"adios\"@es-mx:

agtool load --graph \"adios\"@es-mx --verbose --supersede foo /tmp/test.nt

[graph] is then (this is not RDF-compliant):

"adios"@es-mx

Some --graph examples which likely do not do what is wanted

Here are a couple of examples where the result is likely not what is intended. These values are accepted by AllegroGraph despite not being RDF-compliant but may not be what was intended.

In this example, the value of --graph is a string with unescaped quotation marks. The shell will remove unescaped quotation marks so the resulting value is not a literal:

agtool load --graph "abc123" --verbose --supersede foo /tmp/test.nt

[graph] is then:

<abc123>

This is not legal RDF. Just above we show how to specifiy a string value by escaping the quotation marks.

In this example, the graph is the word default with no colon:

agtool load --graph default --verbose --supersede foo /tmp/test.nt

[graph] is then:

<default>

This is not legal RDF. The value :default, described above, is likely what was intended.

Notes

agtool load may load the sources in a different order than they appear in the command line.
agtool load makes an attempt to optimize the dispatching of files for maximum use of loaders.

Footnotes

Spaces can also be used as separators but this is deprecated. If spaces are used they must be escaped from the shell in some fashion, such as wrapping the index names in quotation marks (for example: "spogi posgi ospgi i"). ↩

AllegroGraph 8.2.0 Data Import

Introduction

Usage

Repository Specification

Specifying sources

Files on Amazon S3

Sources That Specify Triples Or Quads

Non-RDF Data Sources

Specifying - as the SOURCE argument

Loading lists of files

Loading files from HDFS filesystems

File encoding

Optimal file types

Loading triple attributes

agtool load failure

A more complex example

Options

Repository options

Loading options

Less common options

Custom options for specific formats

Deprecated and Removed Options

Loading RDF-star data

Loading non-RDF data

Load-transform options

Example using transform rules to load CSV documents

Example using transform rules to load complex JSON documents

Examples using the graph option

Non-compliant --graph values such as literals

Some --graph examples which likely do not do what is wanted

Notes

Footnotes