The AllegroGraph Loader (agload) is a command-line utility for importing data into a triple-store quickly and easily. When possible, it makes use of multiple CPU cores to load the data in parallel.
agload must be run on the same machine as the AllegroGraph server, and as the same user as the user who started the AllegroGraph server. Further, the version of agload must be the same as the AllegroGraph server (that is, agload must be the one distributed with the server, not from an earlier or later version of AllegroGraph).
agload [OPTIONS] DBNAME SOURCE*
where DBNAME is the name of an AllegroGraph triple-store, and SOURCE is either a file name, a dash (-) or an @ followed by a file name (see below). SOURCE can be specified multiple times. Source files can be in the RDF/XML, Turtle, TriX, TriG, N-Triples or N-Quads format.
If a database named DBNAME does not exist, it will be created. If the database exists, it will be used. Note: the --supersede option will cause an existing database to be deleted and re-created, so use that option with caution. Also note that DBNAME is affected by the presence of the --catalog option: different catalogs can contain databases with the same names, so if you specify a catalog, you are calling for the use or creation of the database DBNAME in that catalog.
More on the FILE argument
If the character - is given as a SOURCE, it will be interpreted as standard input. This allows agload to accept data directly from other programs. For example:
cat foo.nt bar.nt baz.nt | agload --input ntriples test_db -
Loading files from HDFS filesystems
agload supports loading files from HDFS (Hadoop Distributed File System). In order for the load to succeed, the following conditions must be met:
There must be a working hdfs program in your PATH.
HDFS input files must use the hdfs:// prefix when specified on the agload command line.
Here is a sample command line for loading the file lubm-50.nt.gz:
agload repo hdfs:///user/bruce/lubm-50.nt.gz
When those conditions are met, agload works with HDFS files just as with other files. The command line can mix HDFS files with other files.
Loading from HDFS file systems has been tested with the Cloudera Hadoop distribution.
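The two conditions above can be checked before a load is started. The following is a minimal sketch, assuming a POSIX shell; the helper names is_hdfs_source and have_hdfs_program are hypothetical and are not part of agload, and /tmp/local.nt is a made-up local file name.

```shell
#!/bin/sh
# Sketch: pre-flight checks before handing HDFS sources to agload.
# The helper names are hypothetical; they are not part of agload.

# Succeeds if the argument uses the hdfs:// prefix required on the command line.
is_hdfs_source() {
  case "$1" in
    hdfs://*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Succeeds if a program named hdfs can be found in PATH.
have_hdfs_program() {
  command -v hdfs >/dev/null 2>&1
}

# Classify each source the way the prefix rule implies.
for src in hdfs:///user/bruce/lubm-50.nt.gz /tmp/local.nt; do
  if is_hdfs_source "$src"; then
    echo "HDFS:  $src"
  else
    echo "local: $src"
  fi
done

# Warn (on standard error) when the hdfs program is missing.
if ! have_hdfs_program; then
  echo "warning: no hdfs program in PATH; HDFS sources will not load" >&2
fi
```

The checks only look at the command line and PATH; they do not verify that the HDFS cluster itself is reachable.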
Optimal file types
agload is optimized for loading N-Triple and N-Quad files. Loading of files of those types will be spread over multiple cores even if only a single file is specified. agload will spread loading of multiple files of other types (Turtle, TriX, TriG, RDF/XML) over multiple cores, but each single file will be loaded by a single core. Therefore, on machines with multiple cores, loading a single large Turtle file takes much longer than loading a single, equivalently large N-Triple or N-Quad file.
Loading lists of files
Loading entire directories (by specifying the directory as the FILE argument) is not supported, but wildcards may be used. Wildcards will be expanded by the shell, as expected. Also, you can specify a single file which contains a list of files to load, as we describe next.
If FILE begins with the character @, agload interprets everything after the @ as the name of a file which contains a list of actual source files. By default the file name items are separated by a CR, LF or CRLF sequence. That is the usual case and covers most actual situations.
Because UNIX allows newlines to appear in any file name, agload will permit the files to be separated by NULL characters. If you want this, you must specify the -0 option (the numeral zero, not an uppercase letter o). Here is an example:
agload -0 test_db @null-separated-list-of-sources.txt
Use the @file-name syntax when there are more source files than the shell will pass to agload as arguments.
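List files of both kinds can be produced with find. The following is a minimal sketch, assuming a POSIX shell with a find that supports -print0 (GNU and BSD find both do); the temporary directory and the file names a.nt, b.nt and c.nt are made up for the demonstration, and the agload calls are shown only as comments.

```shell
#!/bin/sh
# Sketch: build list files for agload's @file-name syntax.
# The directory layout and file names are made up for this demonstration.

workdir=$(mktemp -d)
mkdir "$workdir/data"
touch "$workdir/data/a.nt" "$workdir/data/b.nt" "$workdir/data/c.nt"

# Newline-separated list: fine when no file name contains a newline.
find "$workdir/data" -type f -name '*.nt' > "$workdir/sources.txt"

# NUL-separated list: safe for any file name; pair it with the -0 option.
find "$workdir/data" -type f -name '*.nt' -print0 > "$workdir/sources0.txt"

# Count the entries in each list as a sanity check.
newline_count=$(wc -l < "$workdir/sources.txt" | tr -d ' ')
nul_count=$(tr -cd '\0' < "$workdir/sources0.txt" | wc -c | tr -d ' ')
echo "newline-separated entries: $newline_count"
echo "NUL-separated entries:     $nul_count"

# The corresponding loads would then be (not run here):
#   agload test_db @"$workdir/sources.txt"
#   agload -0 test_db @"$workdir/sources0.txt"
```

Because the list lives in a file, its length is not limited by the shell's argument limit, which is the situation the @file-name syntax is meant for.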
agload might fail (meaning fail to successfully load all data from all the specified files) for various reasons, including hardware failure and network problems. The most common cause of failure is corrupted or invalid input files.
If agload encounters invalid data, its behavior depends on the --error-strategy option described below, but by default it will stop loading data.
Note that agload adds triples to the database as it works. Therefore, if agload fails (for whatever reason) after it has begun, you may find some but not all triples added, and you should be prepared for this possibility. Here are some ways to be prepared:
Start with an empty database and do not add triples from other sources until agload completes: then if agload fails, you can delete the database and start again.
Back up the database prior to loading with agload and do not add triples from other sources until agload completes: then if agload fails, you can restore the database from the backup and start again (see Backup and Restore for information on backing up and restoring).
But sometimes you have to load with agload while triples are being added from other sources, and you must be prepared in that case to deal with an agload failure. (We do not have specific recommendations as there are many possible cases and equally many ways to deal with them but note that if the graph field is not otherwise used, you can often arrange it so triples loaded with agload have a different graph than triples from other sources and that makes it easy to tell which were loaded by agload and which were added by other means.)
See the --error-strategy option below for more information on handling errors.
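One way to keep agload's triples distinguishable from triples added by other means is to give each load its own graph with the --graph option. The following is a minimal sketch, assuming a POSIX shell; the graph URI under example.org, the database name test_db and the source file data.nt are all hypothetical. The command is printed rather than executed.

```shell
#!/bin/sh
# Sketch: tag every triple from one agload run with a dedicated graph,
# so that a partial load can be identified afterwards.
# The graph URI, database name and source file are hypothetical.

load_graph="http://example.org/agload-run/$(date +%Y%m%d)"

# Print the command for review instead of running it.
echo agload --graph "$load_graph" test_db data.nt
```

This only helps when the graph field is not otherwise used: sources that already carry a graph for each statement, such as N-Quads, use the --graph value only for data that does not specify one.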
The following example loads a single source file into AllegroGraph.
./agload --supersede -v --with-indices "ospgi,posgi,spogi" -e ignore -p 10035 lesmisload lesmis.rdf
In this example, agload will load lesmis.rdf into the lesmisload repository on a running AllegroGraph server that is using port 10035. If there is an existing triple-store named lesmisload, then it will be deleted (as specified by --supersede). The program will generate verbose messages (-v) and will ignore any errors (-e ignore). The triple-store will generate three triple indices: ospgi, posgi and spogi.
The following options may be used on the agload command line.
These options control the creation and settings for the triple-store.
- --port PORT, -p PORT
- Set this to the port number of the server you would like to use. As said above, agload must be run on the same machine as the server and by the same user, and the agload version must be the same version as the server (that is, distributed with server). The default port value is 10035.
- --catalog CATALOG, -c CATALOG
- Use this option to assign the database to a catalog. If left off, then the root catalog will be used. 1
- --supersede, -s
- If the database DBNAME exists, it will be deleted before loading data. Supersede should be used with care as there is no way to recover the deleted store other than restoring it from backup.
- --fti NAME, -f NAME
- Create a free-text index named NAME. This index will include both newly added triples and all existing triples. Managing the free-text index will slow down loading somewhat.
- --with-indices INDICES
- Specify the triple-store's indices. When supplied, the parameter should be a list of index names separated by commas (for example: spogi,posgi,ospgi,i). If not specified, newly created triple-stores will use the standard set of indices and the indices of existing triple-stores will remain unchanged. 2
- Enable bulk mode while processing the loading job. Bulk mode turns off transaction log processing and can provide considerable performance gain for large jobs. Because an unexpected failure during a bulk load can result in an unrecoverably corrupted database, we recommend you make a backup before using this option.
These options control how agload processes data sources.
- --input FORMAT, -i FORMAT
Specify the input format to use. The recognized values are:
The default is guess, which works if (1) all sources have a recognizable extension, and (2) every file is actually of the format indicated by its extension.
- Recognized N-Triple extensions:
.nt, .ntriple, .ntriples, .nt.gz, .ntriple.gz, .ntriples.gz, .nt.bz2, .ntriple.bz2, and .ntriples.bz2.
- Recognized N-Quads extensions:
.nq, .nquad, .nquads, .nq.gz, .nquad.gz, .nquads.gz, .nq.bz2, .nquad.bz2, and .nquads.bz2.
- Recognized RDF/XML extensions:
.rdf, .rdfs, .owl, .rdf.gz, .rdfs.gz, .owl.gz, .rdf.bz2, .rdfs.bz2, and .owl.bz2.
- Recognized Turtle extensions:
.ttl, .turtle, .ttl.gz, .turtle.gz, .ttl.bz2, and .turtle.bz2.
- Recognized TriX extensions:
.trix, .trix.gz, and .trix.bz2.
- Recognized TriG extensions:
.trig, .trig.gz, and .trig.bz2.
If a format other than guess is specified, it will take precedence over a file's extension. For example, if you have an N-Triples file named triples.rdf and specify --input ntriples, triples.rdf will be parsed as an N-Triples file.
If you use multiple source formats in one agload command, then you need to ensure that the source files' extensions match their contents. Otherwise, you will need to use multiple command invocations.
The only two compression formats handled by agload are gzip and bzip2. Any files which are compressed must be named with .gz or .bz2 extensions in order to be decompressed. All supported formats permit .[format].gz and .[format].bz2 extensions, allowing agload to determine the data format from the [format] portion. If a file has the extension .gz or .bz2 without also specifying the format, you must use the --input option. For example, if you are loading a gzip'ed N-Quads file and the file is named btc-2011-chunk-045.gz, then you must specify the N-Quads format with --input.
The use of standard input with agload (by specifying FILE to be -) always requires a non-default value for the input format, since standard input has no file type.
- --error-strategy STRATEGY, -e STRATEGY
The available options for error strategy include cancel, ignore, and save. cancel is the default, and will stop the loading process as soon as an error is detected. Error strategy ignore will attempt to silently continue upon an error. A further strategy behaves like ignore but will print the error to standard output.
Error strategy applies to all recoverable errors, not just parsing errors.
Remember that if agload fails for whatever reason, some triples will have been added to the database and some will not have been added (except for very unusual edge cases).
- --loaders COUNT, -l COUNT
The loaders option corresponds to the number of processes which will be connecting to AllegroGraph and committing triples. It is for optimization of agload performance. The default depends on the number of physical cores on your server. If you have 1 or 2 cores, loaders will be set to 1. If you have 4 cores, loaders will be set to 3. For more cores it is the number of physical cores minus one. agload also has a task dispatcher process and AllegroGraph has its own processes.
If you are not getting satisfactory performance for your load, try increasing or decreasing the number of loaders. If your data has no blank nodes, you may want to set the number of loaders to the number of logical cores on the machine and use --blank none. If you have files dense with blank nodes, try decreasing the number of loaders to free up machine resources. For example, on an 8 core, 48GB hyperthreaded server, we use --loaders 5 for good performance while loading Billion Triple Challenge. For Lubm-8000 we use
- --base-uri URI, -u URI
- Specify a base URI for formats that support it, such as RDF/XML or Turtle. Note that if standard input is a source and rdfxml is the input format, --base-uri must be specified.
- --graph GRAPH, -g GRAPH
- Specify a default graph for formats that support it, such as N-Triples. Special values are:
:default, use the default graph node for the triple-store (this is the default value for --graph).
:source, use the source of the triple (i.e. the filename) as the default graph. This cannot be used if standard input is used as a source.
:blank, generate a blank node before loading triples and use it as the default graph.
Any other value is interpreted as a resource (URI) or literal and used as the default graph. Note that strict RDF does not allow literals to be used in the graph slot.
The three special values start with a colon (:) to allow for the usage of default, source, and blank as graph names. See the examples section for more information on the use of this option.
Formats that include the fourth element (like N-Quads) will use the default-graph only for the data that does not explicitly specify it.
- --external-references, -x
- If specified, then external references in RDF/XML source files will be followed during load.
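The default for --loaders described above can be written out as a small function. This is a minimal sketch, assuming a POSIX shell and taking the physical core count as an argument; default_loaders is a hypothetical helper name, and the 3-core case, which the text does not state, is assumed to follow the "cores minus one" rule.

```shell
#!/bin/sh
# Sketch of the documented --loaders default for a given number of
# physical cores. default_loaders is a hypothetical helper name.
# Assumption: 3 cores is treated as "cores minus one" (not stated in the text).
default_loaders() {
  cores=$1
  if [ "$cores" -le 2 ]; then
    echo 1
  else
    echo $((cores - 1))
  fi
}

# Show the default for a few machine sizes.
for n in 1 2 4 8; do
  echo "$n cores: $(default_loaders "$n") loaders"
done
```

The result is only a starting point: as noted above, the best value depends on the data, in particular on how dense it is with blank nodes.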
Less common options
These options are useful in specific circumstances but do not generally need to be used.
- --help, -h
- Print the command line summary.
- --verbose, -v
The presence of the verbose option will cause additional information to be printed to standard output. This argument can be specified multiple times to increase the verbosity. We recommend using --verbose --verbose if you encounter a problem with loading, but note --verbose --verbose generates a lot of output when loading many files, so it may fill a terminal's scrollback. See also the --debug option, which also may be useful when an error occurs during loading. --debug -v -v provides maximum information about loading.
Specifically, the verbosity levels are:
0 (-v not supplied): Only report periodic load rate information.
1 (-v): As above, plus print the job option summary before starting the operation.
2 (-v -v): As above, plus print the name of every file that has been processed.
- Specifying more than two -v options is equivalent to specifying two.
- --blank STRATEGY, -b STRATEGY
Determine how to handle blank node identifiers in N-Triple and N-Quad files. STRATEGY must be one of file, job, or none.
By default, blank node identifiers are scoped to the source in which they appear. I.e., a blank node identifier in file1.nt is considered to be different from the same identifier in file2.nt. AllegroGraph calls this the file strategy and uses it as the default.
Blank node strategy job will consider all blank nodes found in N-Triple and N-Quad files to be in the same "scope". This means that a blank node identifier in file1.nt will be considered to be the same as the identical identifier found in file2.nt.
Blank node strategy none will cause agload to signal an error if any source contains blank nodes. Loading is faster when the blank node strategy is none.
Note that the blank node strategies of job and file only apply to N-Triple and N-Quad sources. Other formats such as RDF/XML or Turtle are defined to have a blank node scope of the file and this option is ignored. Using a blank node strategy of none will, however, still signal an error if any source file contains a blank node.
- --debug
- If specified, additional information will be printed when an error occurs. --debug is useful only when agload returns an exit code of 1, indicating that there was an unhandled error. If this occurs, re-run agload using the debug option and send the output to AllegroGraph support for more assistance. The option also causes information to be written to agload.log. See also the --verbose option, which causes other information about loading to be output. --debug -v -v provides maximum information about loading.
- For N-Quad files, this flag allows a syntax to be used which is often found in Billion Triple Challenge. This is a non-standard parser extension and should only be used when necessary. (In earlier releases, this argument was named --relax-for-btc.)
- --duplicates SETTING, -d SETTING
- Changes the handling of duplicate triples in the store. The valid values for SETTING are delete, meaning delete all but one of each group of triples that match subject, predicate, object, and graph; and delete-spo, meaning the same but for groups that match subject, predicate, and object, regardless of graph. In AllegroGraph, duplicate triples are deleted only by the function delete-duplicate-triples. If duplicate deletion is specified, that function is (in effect) called at the end of data loading with arguments suitable for the specified setting, and the load completes when duplicates are deleted. This option does not affect anything after the load is complete, as future duplicate deletions are only done when delete-duplicate-triples is called.
- If specified, then print the loading strategy and stop. I.e., no triples will be loaded.
Use rapper to transform source files into N-Triples format before loading them.
agload loads N-Triples and N-Quads files most efficiently so it can be faster to convert source files before loading them. More information on using rapper and AllegroGraph is described in our documentation.
- --null, -0
- Use to specify that the file specified after an @ sign in the SOURCE inputs is a null-separated list rather than a newline-separated one. This is useful for loading files with newlines or other strange characters in their names.
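Converting sources to N-Triples with rapper, as mentioned among the options above, can also be done by hand before agload is called. The following is a minimal sketch, assuming a POSIX shell and the rapper utility from the Raptor RDF library with its -i (input syntax) and -o (output syntax) options; the helper names rapper_parser_for and conversion_command and the file name big.ttl are hypothetical. The commands are printed rather than run.

```shell
#!/bin/sh
# Sketch: work out the rapper command that converts a Turtle or RDF/XML
# file to N-Triples. Helper and file names are hypothetical.

# Map a file name to the rapper parser name for its format.
rapper_parser_for() {
  case "$1" in
    *.ttl|*.turtle)     echo turtle ;;
    *.rdf|*.rdfs|*.owl) echo rdfxml ;;
    *)                  return 1 ;;
  esac
}

# Print the conversion command for one file; output goes to <name>.nt.
conversion_command() {
  parser=$(rapper_parser_for "$1") || return 1
  echo "rapper -i $parser -o ntriples $1 > ${1%.*}.nt"
}

conversion_command big.ttl
conversion_command lesmis.rdf
```

Whether the conversion pays off depends on file size and core count: the gain comes from agload being able to split a single N-Triples file across several cores.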
The use of the following options has been deprecated as they are no longer needed:
- --encoding ENCODING, -C ENCODING
The N-Triple and N-Quad formats initially required data to be in 7-bit ASCII, which made loading data less convenient. AllegroGraph extended the format and added the --encoding option to allow users to load data in UTF-8. The more recent version of the N-Triple and N-Quad formats now allows UTF-8, which means that the --encoding option is no longer needed.
Note that the encoding parameter only applies to N-Triple and N-Quad source files because Turtle files always use UTF-8 and RDF/XML, TriG, and TriX files use character encodings as defined by standard XML parsing rules.
- --dispatch-strategy STRATEGY
STRATEGY must be auto or file. The dispatch strategy tells agload how to parallelize loading.
The default is auto, which is a combination of dispatch strategies based on file format. N-Triple and N-Quad files will be broken up and loaded in pieces.
--dispatch-strategy file means that no files are broken into pieces for loading. There is no reason to specify file (which was useful in much earlier releases and is kept for backward compatibility). Note that RDF/XML, Turtle, TriG, and TriX formats are always dispatched on a file basis regardless of the value of this option.
Examples of the graph option:
bin/agload --graph :default --verbose --supersede foo /tmp/test.nt
This call uses the default graph node for the particular AllegroGraph database:
triple-store(3): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> .
Contrast that with this call:
bin/agload --graph :source --verbose --supersede foo /tmp/test.nt
triple-store(6): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> <file:///tmp/test.nt> .
In this next call, --graph is :blank:
bin/agload --graph :blank --verbose --supersede foo /tmp/test.nt
agload generates a new blank node for the database and uses that as the default graph for the whole job:
triple-store(9): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> _:bC87E16D1x1 .
Here we use another value, which is the word (no colon!) default:
bin/agload --graph default --verbose --supersede foo /tmp/test.nt
triple-store(12): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> <default> .
Here we use the resource <http://foo.com/abc#123> for the default graph:
bin/agload --graph http://foo.com/abc#123 --verbose --supersede foo /tmp/test.nt
triple-store(24): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> <http://foo.com/abc#123> .
Finally, contrast these two cases. In the second, we use an actual string because we have escaped quotation marks:
bin/agload --graph "abc123" --verbose --supersede foo /tmp/test.nt
triple-store(30): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> <abc123> .
bin/agload --graph "\"abc123\"" --verbose --supersede foo /tmp/test.nt
triple-store(33): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> "abc123" .
bin/agload --graph "\"adios\"@es-mx" --verbose --supersede foo /tmp/test.nt
triple-store(36): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> "adios"@es-mx .
bin/agload --graph "\"123\"^^<http://www.w3.org/2001/XMLSchema#integer>" --verbose --supersede foo /tmp/test.nt
triple-store(39): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> "123"^^<http://www.w3.org/2001/XMLSchema#integer> .
agload may load the sources in a different order than they appear in the command line.
agload makes an attempt to optimize the dispatching of files for maximum use of loaders.
agload prior to version 4.4
agload was significantly upgraded in AllegroGraph 4.4. In general, calls to agload that worked in versions prior to 4.4 will continue to work but note the following differences:
- You cannot specify a directory as a FILE argument to agload.
Previously you could specify a directory to load all of the source files within it.
- The AGRAPH_PORT environment variable is no longer used to supply a default port value.
The port will default to 10035 unless you specify a different value with the --port option.
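Scripts that relied on AGRAPH_PORT can pass its value through explicitly. The following is a minimal sketch, assuming a POSIX shell; the helper names agload_port and agload_with_env_port are hypothetical and are not part of agload.

```shell
#!/bin/sh
# Sketch: keep honoring AGRAPH_PORT by passing it to --port explicitly.
# agload_port and agload_with_env_port are hypothetical helper names.

# Print the port to use: AGRAPH_PORT if set and non-empty, else 10035.
agload_port() {
  echo "${AGRAPH_PORT:-10035}"
}

# Run agload with that port, forwarding all other arguments unchanged.
agload_with_env_port() {
  agload --port "$(agload_port)" "$@"
}

echo "agload would use port $(agload_port)"
```

A call such as agload_with_env_port test_db data.nt then behaves like the pre-4.4 default; test_db and data.nt are placeholder names.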
1. Note that the root catalog has no name (and so can be specified, if necessary, with the empty string). It is not named root.
2. Spaces can also be used as separators but this is deprecated. If spaces are used they must be escaped from the shell in some fashion, such as wrapping the index names in quotation marks (for example: "spogi posgi ospgi i").