Table of Contents

Introduction

Pre-AllegroGraph 4.4 usage

Usage

More on the FILE argument

Optimal file types

Loading lists of files

agload notes

agload failure

Example

Options

Further Examples

Examples of the graph option:

Introduction

The AllegroGraph Loader (agload) is a command-line utility for putting data into a triple-store as easily as possible. When possible, it makes use of multiple CPU cores to load the data in parallel.

Bulk data can be loaded with tools besides agload. AGWebView has a loading tool. See also Adding Triples in the Lisp Reference.

Pre-AllegroGraph 4.4 usage

agload was significantly upgraded in the AllegroGraph 4.4 release. In general, calls to agload that worked in 4.x versions prior to version 4.4 will work in later versions, but note the following:

You cannot specify a directory as the FILE argument to agload. (Previously you could specify a directory and all files in the directory would be loaded.)
The AGRAPH_PORT environment variable is no longer used to supply a default port value. The port is 10035 unless you specify a different value in the call with the --port (or -p) argument.

Usage

agload must be run on the same machine as the AllegroGraph server, by the same user who started the server. Further, the version of agload must match the version of the AllegroGraph server (that is, agload must be the one distributed with the server, not one from an earlier or later version of AllegroGraph).

agload [OPTIONS] DBNAME FILE ... 

where DBNAME is an AllegroGraph triple store name, and FILE is one or more RDF files (files in rdf/xml, turtle, trix, ntriples, or nquads format).

If a database named DBNAME does not exist, it will be created. If the database exists, it will be used. Please note: the --supersede option will cause an existing database to be deleted and re-created, so use that option with caution. Also note that DBNAME is affected by the presence of the --catalog option: different catalogs can contain databases with the same names so if you specify a catalog, you are calling for the use or creation of the database DBNAME in that catalog.

More on the FILE argument

If the character - is given instead of a file name, it will be interpreted as standard input. This lets users pipe output from another program to agload, as shown in the following example:

cat foo.nt bar.nt baz.nt | agload --input ntriples test_db - 

Optimal file types

agload is optimized for loading ntriple and nquad files. Loading of files of those types will be spread over multiple cores even if only a single file is specified. agload will spread loading of multiple files of other types (turtle, trix, RDF/XML) over multiple cores, but each single file will be loaded by a single core. Therefore, on machines with multiple cores, loading a single large turtle file takes much longer than loading a single, equivalently large ntriple or nquad file.

The Rapper program (from http://librdf.org/raptor/rapper.html) will convert files from one format to another.
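As an illustration of such a conversion, the following sketch wraps rapper in a small shell function that turns a turtle file into an ntriples file, so that agload can split the result across cores. The function name turtle_to_ntriples and the file names are hypothetical; the -i and -o options select rapper's input and output syntax. It assumes rapper is installed and on the PATH, and reports an error otherwise.

```shell
# Hypothetical helper: convert a turtle file to ntriples with rapper.
# Usage: turtle_to_ntriples INPUT.ttl OUTPUT.nt
turtle_to_ntriples() {
    # Fail early (before creating the output file) if rapper is missing.
    if ! command -v rapper >/dev/null 2>&1; then
        echo "rapper not found on PATH" >&2
        return 127
    fi
    # -i names the input syntax, -o the output syntax; triples go to stdout.
    rapper -i turtle -o ntriples "$1" > "$2"
}

# Example (not run here):
#   turtle_to_ntriples big.ttl big.nt
#   agload test_db big.nt
```

Whether the conversion time pays for itself depends on file size and core count, so it is worth timing both approaches on a sample of your data.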

Loading lists of files

Loading entire directories (by specifying the directory as the FILE argument) is not supported, but wildcards may be used. Wildcards will be expanded by the shell, as expected. Also, you can specify a single file which contains a list of files to load, as we describe next.

If FILE begins with the character @, agload interprets everything after the @ as a file name which contains a list of actual source files. By default the file name items are separated by a CR, LF or CRLF sequence. That is the usual case and covers most actual situations.

Because UNIX allows newlines to appear in any file name, agload will permit the files to be separated by NULL characters. If you want this, you must specify the -0 option (the numeral zero, not an uppercase letter o). Here is an example:

agload -0 test_db @null-separated-list-of-sources.txt 

Use the @file-name syntax when there are more source files than the shell will pass to agload as arguments.
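One way to produce such a list file is with find. The sketch below is only an illustration: build_source_list is a hypothetical helper, it collects only files ending in .nt, and the directory and list names in the usage comments are placeholders. Passing -0 as the third argument writes null-separated names for use with agload -0.

```shell
# Hypothetical helper: write every *.nt file under DIR to LISTFILE.
# Usage: build_source_list DIR LISTFILE [-0]
build_source_list() {
    dir=$1
    list=$2
    if [ "${3:-}" = "-0" ]; then
        # Null-separated names, safe for file names containing newlines.
        find "$dir" -type f -name '*.nt' -print0 > "$list"
    else
        # Newline-separated names, the default that agload expects.
        find "$dir" -type f -name '*.nt' > "$list"
    fi
}

# Examples (not run here):
#   build_source_list /data/rdf sources.txt
#   agload test_db @sources.txt
#
#   build_source_list /data/rdf sources0.txt -0
#   agload -0 test_db @sources0.txt
```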

agload notes

agload may load the sources in a different order than they appear in the command line.

agload makes an attempt to optimize the dispatching of files for maximum use of loaders.

agload failure

agload might fail (meaning fail to successfully load all data from all the specified files) for various reasons, including hardware failure and network problems. The most common cause of failure is corrupted or invalid input files.

If agload encounters invalid data, its behavior depends on the --error-strategy option described below, but by default it will stop loading data.

Note that agload adds triples to the database as it works. Therefore, if agload fails (for whatever reason) after it has begun, you may find some but not all triples added, and you should be prepared for this possibility. Here are some ways to be prepared:

But sometimes you have to load with agload while triples are being added from other sources, and you must be prepared in that case to deal with an agload failure. (We do not have specific recommendations as there are many possible cases and equally many ways to deal with them but note that if the graph field is not otherwise used, you can often arrange it so triples loaded with agload have a different graph than triples from other sources and that makes it easy to tell which were loaded by agload and which were added by other means.)

See the --error-strategy option below for more information on handling errors.
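As a rough sketch of the graph-tagging idea mentioned above, a wrapper script could generate a distinct graph name for each agload run and pass it with --graph. The helper name load_graph_uri and the base URI http://example.org/agload/ are placeholders assumed for illustration; they are not part of agload.

```shell
# Hypothetical helper: print a graph URI that differs for each load
# (to the second), built from a base URI plus a UTC timestamp.
# The default base is a placeholder; substitute a URI from your own namespace.
load_graph_uri() {
    base=${1:-http://example.org/agload/}
    printf '%s%s\n' "$base" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Example (not run here): tag every triple from this load with one graph.
#   agload --graph "$(load_graph_uri)" test_db foo.nt
```

This only helps for formats that take a default graph, such as ntriples. nquad sources supply their own graph values, so triples loaded from them would not carry the tag.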

Example

The following example loads a single file into AllegroGraph.

./agload --supersede -v --with-indices "ospgi,posgi,spogi" -e ignore -p 10035 lesmisload lesmis.rdf  
 

In this example, agload will load lesmis.rdf into the lesmisload repository on a running AllegroGraph server that is using port 10035. If there is an existing triple-store named lesmisload, then it will be deleted (as specified by --supersede). The program will generate verbose messages (-v) and will ignore errors (-e ignore), if any. The triple store will generate three triple indices: ospgi, posgi and spogi.

Options

The following options may be placed on the agload command line.

--port PORT, -p PORT
Set this to the port number of the server you would like to use. As said above, agload must be run on the same machine as the server and by the same user, and the agload version must be the same version as the server (that is, distributed with the server). The default port value is 10035.
--catalog CATALOG, -c CATALOG
Use this option to assign the database to a catalog. Absence of a catalog argument implies the root catalog (default). Note that the root catalog has no name (and so can be specified, if necessary, with the empty string ""). It is not named "root".
--input FORMAT, -i FORMAT

Use this option to specify which input format for agload to use. The recognized values are: rdfxml, ntriples, nquads, trix, turtle and guess.

The default is guess. Use guess under the following conditions: 1) All sources have a recognizable extension, and 2) Every file is actually of the format of its extension.

Recognized ntriple extensions:  
.nt, .ntriple, .ntriples, .nt.gz, .ntriple.gz, .ntriples.gz,  
.nt.bz2, .ntriple.bz2, and .ntriples.bz2.  
 
Recognized nquads extensions:  
.nq, .nquad, .nquads, .nq.gz, .nquad.gz, .nquads.gz, .nq.bz2,  
.nquad.bz2, and .nquads.bz2.  
 
Recognized rdfxml extensions:  
.rdf, .rdfs, .owl, .rdf.gz, .rdfs.gz, .owl.gz, .rdf.bz2,  
.rdfs.bz2, and .owl.bz2.  
 
Recognized turtle extensions:  
.ttl, .turtle, .ttl.gz, .turtle.gz, .ttl.bz2, and .turtle.bz2.  
 
Recognized trix extensions:  
.trix, .trix.gz, and .trix.bz2. 

If any other input format is used, the specified input format will take precedence over the extension. For example, if you have an ntriples file named triples.rdf and specify --input ntriples then triples.rdf will be parsed as an ntriples file.

If you use multiple formats in one agload command, it is to your advantage to name the files such that --input guess can determine what they are. Note also, agload cannot guess the format of a file based on its contents.

The only two compression formats handled by agload are gzip and bzip2. Any files which are compressed must be named with .gz or .bz2 extensions in order to be decompressed. All supported formats permit .[format].gz and .[format].bz2 extensions, allowing agload to determine the data format from the [format] portion. If a file has extension .gz or .bz2 without also specifying the format, you must use the --input option. For example, if you are loading a gzip'ed nquads file and the file is named btc-2011-chunk-045.gz, then you must specify --input nquads.

The use of standard input with agload (by specifying FILE to be -) always requires a non-default value for the input format, since standard input has no file type.
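To check ahead of time which files would need an explicit --input value, the extension lists above can be mirrored in a small shell function. This is a sketch, not agload's actual implementation: guess_input_format is a hypothetical helper, and the value unknown is a sentinel chosen here for names that carry no recognized format extension.

```shell
# Hypothetical helper mirroring the extension lists above.
# Prints the --input value that matches the file name, or "unknown".
guess_input_format() {
    name=$1
    # Strip one trailing compression suffix, if present.
    case $name in
        *.gz)  name=${name%.gz} ;;
        *.bz2) name=${name%.bz2} ;;
    esac
    # Match the remaining extension against the recognized formats.
    case $name in
        *.nt|*.ntriple|*.ntriples) echo ntriples ;;
        *.nq|*.nquad|*.nquads)     echo nquads ;;
        *.rdf|*.rdfs|*.owl)        echo rdfxml ;;
        *.ttl|*.turtle)            echo turtle ;;
        *.trix)                    echo trix ;;
        *)                         echo unknown ;;
    esac
}

# Example:
#   guess_input_format btc-2011-chunk-045.gz   # prints "unknown"
#   guess_input_format triples.nt.gz           # prints "ntriples"
```

A file reported as unknown is one for which --input would have to be given explicitly.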

--dispatch-strategy STRATEGY

STRATEGY must be auto or file. The dispatch strategy tells agload how it might parallelize loading. ntriple and nquad files can be broken up and the pieces loaded in parallel.

The default is auto, which is a combination of dispatch strategies based on file format. ntriple and nquad files will be broken up and loaded in pieces. --dispatch-strategy file means that no files are broken into pieces for loading. There is no reason to specify file (which was useful in much earlier releases and is kept for backward compatibility). Note that rdfxml, turtle and trix formats are always dispatched on a file basis regardless of the value of this option.

--blank STRATEGY, -b STRATEGY

This switch determines how blank nodes whose names are used in multiple files are handled. STRATEGY must be one of file, job or none.

Blank node strategy file is the default. With this strategy, agload will not consider blank nodes in different files to be the same blank node. For example, if _:b1 is found in file1.nt and _:b1 is found in file2.nt during the same load, they will be assigned different UPIs in AllegroGraph. (A UPI is a Unique Part Identifier.) Contrast this with blank node strategy job:

Blank node strategy job will consider all blank nodes found in ntriple and nquad files to be in the same "scope". This means that if _:b1 is found in file1.nt and _:b1 is found in file2.nt they will be assigned the same UPI in AllegroGraph.

Blank node strategy none will cause agload to signal an error if any ntriple or nquad file being loaded contains blank nodes. Loading of ntriple and nquad files which do not contain blank nodes is faster when the blank node strategy is none.

Note that blank node strategy only applies to ntriple and nquad files. Other formats such as rdfxml, turtle and trix are defined to have a blank node scope of the file and so for such files, this option is ignored.

--error-strategy STRATEGY, -e STRATEGY

The available options for error strategy are cancel, ignore, print and save. Cancel is the default, and will stop the loading process as soon as an error is detected. Error strategy ignore will attempt to silently continue upon an error. Error strategy print is like ignore but will print the error to standard output. Error strategy save is like print but will also log the error to agload.log in the current working directory.

Error strategy applies to all recoverable errors, not just parsing errors.

Remember that if agload fails for whatever reason, some triples will have been added to the database and some will not have been added (except for very unusual edge cases).

--help, -h
Print helpful information.
--verbose, -v
The presence of the verbose option will cause additional information to be printed to standard output. This argument can be specified multiple times to increase the verbosity. We recommend using --verbose --verbose if you encounter a problem with loading. --verbose --verbose generates a lot of output when loading many files, so it may fill a terminal's scrollback. See also the --debug option which also may be useful when an error occurs during loading. --debug -v -v provides maximum information about loading.
--debug
The debug flag will cause debugging information to be output upon error. If agload returns an exit code of 1, which means unhandled error, use the debug option before contacting support. The option also causes information to be written to agload.log. See also the --verbose option, which causes other information about loading to be output. --debug -v -v provides maximum information about loading.
--loaders COUNT, -l COUNT

The loaders option corresponds to the number of processes which will be connecting to AllegroGraph and committing triples. It is for optimization of agload performance. The default depends on the number of physical cores on your server. If you have 1 or 2 cores, loaders will be set to 1. If you have 4 cores, loaders will be set to 3. For more cores it is the number of physical cores minus one. agload also has a task dispatcher process and AllegroGraph has its own processes.

If you are not getting satisfactory performance for your load, try increasing or decreasing the number of loaders. If your data has no blank nodes, you may want to set the number of loaders to the number of logical cores on the machine and use --blank none. If you have files dense with blank nodes try decreasing the number of loaders to free up machine resources. For example on an 8 core, 48GB hyperthreaded server, we use --loaders 5 for good performance while loading Billion Triple Challenge. For Lubm-8000 we use --loaders 16.
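The default rule described above can be written out as a small function, which is handy as a starting point when experimenting with --loaders. This is a sketch under stated assumptions: default_loaders is a hypothetical helper, and the text does not say what happens with exactly 3 cores, so the function assumes the "cores minus one" rule applies there too. The argument is meant to be the number of physical cores; tools that report logical CPUs will give a larger number on hyperthreaded machines.

```shell
# Hypothetical helper: default loader count for a given number of physical cores.
# 1 or 2 cores -> 1 loader; otherwise cores minus one (so 4 cores -> 3).
# The 3-core case is an assumption; the documentation does not cover it.
default_loaders() {
    cores=$1
    if [ "$cores" -le 2 ]; then
        echo 1
    else
        echo $((cores - 1))
    fi
}

# Example:
#   default_loaders 4   # prints 3
#   default_loaders 8   # prints 7
```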

--encoding ENCODING, -C ENCODING

The encoding option controls how the ntriples and nquads parsers interpret characters in source files. The W3C standard for ntriples requires ntriple and nquad files to be 7-bit ASCII (see footnote 1). By default, agload will signal an error for non-ASCII ntriple and nquad files. You can use --encoding utf-8 to allow agload to accept files with UTF-8 characters in them.

agload also recognizes other encodings. See the "Name" and "Nicknames" table in the Allegro Common Lisp documentation for a list of valid encoding names.

Note that the encoding parameter only applies to ntriple and nquad source files because Turtle files always use UTF-8 and RDF/XML and TriX files use character encodings as defined by standard XML parsing rules.

--relax-syntax
For nquad files, this flag allows a syntax to be used which is often found in Billion Triple Challenge. This is a non-standard parser extension and should only be used when necessary. (In earlier releases, this argument was named --relax-for-btc.)
--duplicates SETTING, -d SETTING
Changes the handling of duplicate triples in the store. The valid values for SETTING are keep (the default); delete, meaning delete all but one triple from each group of triples that match in subject, predicate, object, and graph; and delete-spo, meaning delete all but one triple from each group that matches in subject, predicate, and object, regardless of graph. In AllegroGraph, duplicate triples are deleted only by the function delete-duplicate-triples. If duplicate deletion is specified, that function is (in effect) called at the end of data loading with arguments suitable for the specified setting, and the load completes when the duplicates have been deleted. This option has no effect after the load is complete: later duplicate deletions happen only when delete-duplicate-triples is called.
--supersede, -s
Delete the store before loading data. This flag should be used with caution, as it can cause an existing database to be deleted. When it is used and a database named DBNAME exists, agload will delete the triple store DBNAME and re-create it before loading triples into it.
--bulk
This flag enables bulk mode while processing the loading job. Bulk mode turns off transaction file processing while the load is occurring. This can provide considerable performance gain for large jobs. We recommend you do a backup before doing a bulk load.
--fti NAME, -f NAME
This option allows for the creation of a free text index after database creation which will be populated during the loading process. This does slow down loading. This option is included for convenience.
--dry-run
This flag tells agload to print the loading strategy but not to load any triples.
--with-indices INDICES
Used to specify the indices of the triple store. The parameter should be a list of index names separated by commas. Example: spogi,posgi,ospgi,i. If not specified, newly created triple stores will use the standard set of indices and existing triple stores will retain their current indices. Note: spaces can also be used as separators but are deprecated. If spaces are used they must be escaped from the shell in some fashion, such as wrapping the index names in quotation marks. For example: "spogi posgi ospgi i".
--rapper
This flag tells agload to use rapper to transform rdf files into ntriples files before loading them. Rapper is described at http://librdf.org/raptor/rapper.html.
--base-uri URI, -u URI
This option is used to specify a base URI for formats that support it, such as rdfxml. Note that if standard input is a source and rdfxml is the input format, --base-uri must be specified.
--graph GRAPH, -g GRAPH

This option is used to specify a default graph for formats that support it, such as ntriples. Special values are:

:default, use the default graph node for the triple-store (this is the default value for graph).
:source, use the source of the triple (i.e. the filename) as the default graph node.
:blank, generate a blank node before loading triples and use the blank node as the default graph.

Any other value is interpreted as a resource or literal and used as the default graph.

Note that nquad sources supply a graph node value and therefore do not use the default specified here. The three special values start with a colon (:) because users may want to use default, source, or blank as graph names, specified with (for example) --graph default. The colon thus designates the special meaning of those values and allows the plain words to be used like any other graph name. See the examples section for more information on the use of the graph option.

--null, -0
This option is used to signal to agload that the file specified after an @ sign in the FILE inputs is a null separated list. This is useful for loading files with newlines or other strange characters in their names.
--external-references, -x
This option will cause agload to follow external references in RDF/XML source files.

Further Examples

Examples of the graph option:

bin/agload --graph :default --verbose --supersede foo /tmp/test.nt 

This call uses the default graph node for the particular AllegroGraph database:

triple-store(3): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> . 

Contrast that with this call:

bin/agload --graph :source --verbose --supersede foo /tmp/test.nt  
 
triple-store(6): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <file:///tmp/test.nt> . 

In this next call, --graph is :blank:

bin/agload --graph :blank --verbose --supersede foo /tmp/test.nt 

agload generates a new blank node for the database and uses that as the default graph for the whole job:

triple-store(9): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> _:bC87E16D1x1 . 

Here we use another value, which is the word (no colon!) default:

bin/agload --graph default --verbose --supersede foo /tmp/test.nt  
 
triple-store(12): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <default> . 

Here we use the resource <http://foo.com/abc#123> for the default graph:

bin/agload --graph http://foo.com/abc#123 --verbose --supersede foo /tmp/test.nt
 
triple-store(24): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <http://foo.com/abc#123> . 

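Finally, contrast the following cases. In the first, the value is interpreted as a resource. In the others, the escaped quotation marks make the value an actual literal: a plain string, a string with a language tag, and a typed literal: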

bin/agload --graph "abc123" --verbose --supersede foo /tmp/test.nt   
 
triple-store(30): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <abc123> .   
 
bin/agload --graph "\"abc123\"" --verbose --supersede foo /tmp/test.nt   
 
triple-store(33): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> "abc123" .   
 
bin/agload --graph "\"adios\"@es-mx" --verbose --supersede foo /tmp/test.nt  
 
triple-store(36): (print-triples (get-triples-list) :format :nquads)  
<subject> <predicate> <object> "adios"@es-mx .  
 
bin/agload --graph "\"123\"^^<http://www.w3.org/2001/XMLSchema#integer>" --verbose --supersede foo /tmp/test.nt
 
triple-store(39): (print-triples (get-triples-list) :format :nquads)  
<subject> <predicate> <object> "123"^^<http://www.w3.org/2001/XMLSchema#integer> .
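Nested shell quoting such as "\"abc123\"" is easy to get wrong. As a sketch, a small helper can add the literal quotation marks instead. graph_literal is a hypothetical name, not part of agload, and the helper assumes the value itself contains no quotation marks or backslashes (it does no escaping).

```shell
# Hypothetical helper: wrap a plain string as a literal for --graph.
# Usage: graph_literal VALUE [LANG]
# Assumes VALUE contains no double quotes or backslashes.
graph_literal() {
    if [ -n "${2:-}" ]; then
        # Literal with a language tag, e.g. "adios"@es-mx
        printf '"%s"@%s\n' "$1" "$2"
    else
        # Plain string literal, e.g. "abc123"
        printf '"%s"\n' "$1"
    fi
}

# Example (not run here), equivalent to the language-tagged call above:
#   bin/agload --graph "$(graph_literal adios es-mx)" --verbose --supersede foo /tmp/test.nt
```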

Footnotes

  1. There is a draft standard that extends this to UTF-8 but AllegroGraph does not in general implement draft standards.