AllegroGraph Loader

Introduction

The AllegroGraph Loader (agload) is a command-line utility for putting data into a triple-store as easily as possible. It makes use of multiple CPU cores to load the data in parallel. agload has been significantly upgraded in the AllegroGraph 4.4 release. In general, calls to agload that worked in 4.x versions prior to version 4.4 will work in version 4.4, but note the following:

You can no longer specify a directory as the FILE argument to agload. (Previously you could specify a directory and all files in the directory would be loaded.)
The AGRAPH_PORT environment variable is no longer used to supply a default port value. The port is 10035 unless you specify a different value in the call with the --port (or -p) argument.

Usage

agload [OPTIONS] DBNAME FILE ...

where DBNAME is an AllegroGraph triple store name, and FILE is one or more RDF files separated by whitespace. The supported RDF formats are rdf/xml, turtle, trix, ntriples and nquads.

If a database named DBNAME does not exist, it will be created. If the database exists, it will be used. Please note: the --supersede option will cause an existing database to be deleted and re-created. Use with caution. Also note that DBNAME is affected by the presence of the --catalog option.

More on the FILE argument

If the character - is given instead of a file name, it is interpreted as stdin. This lets users pipe output from another program to agload, as shown in the following example:

cat foo.nt bar.nt baz.nt | agload --input ntriples test_db -
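The same pipeline can be driven from a script. The following is a minimal Python sketch, assuming only the standard library; the helper name stream_files_to is hypothetical, and the agload invocation in the trailing comment simply mirrors the shell example above.

```python
import shutil
import subprocess


def stream_files_to(argv, paths):
    """Concatenate the files in `paths` onto the stdin of the command `argv`.

    Returns the exit code of the command.
    """
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE)
    try:
        for path in paths:
            with open(path, "rb") as source:
                # Copy in chunks so large files are never held in memory.
                shutil.copyfileobj(source, proc.stdin)
    finally:
        # Closing stdin tells the command that the input is complete.
        proc.stdin.close()
    return proc.wait()


# Hypothetical usage, equivalent to the shell pipeline above:
# stream_files_to(["agload", "--input", "ntriples", "test_db", "-"],
#                 ["foo.nt", "bar.nt", "baz.nt"])
```

Because stdin carries no file name, the input format still has to be given explicitly, just as in the shell version.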

Loading lists of files

Loading entire directories is no longer supported, but wildcards may be used. Also, you can load a single file which specifies a list of files to load, as we describe next.

If FILE begins with the character @, agload interprets everything after the @ as the name of a file which contains a list of actual source files. By default the file names in the list are separated by a CR, LF or CRLF sequence. That is the usual case and covers almost all situations.

But because UNIX allows newlines to appear in any file name, agload will also permit the file names to be separated by null characters. If you want this, you must specify the -0 option (the numeral zero, not an uppercase letter O). Here is an example:

agload -0 test_db @null-separated-list-of-sources.txt

Use the @file-name syntax when there are more source files than the shell will pass to agload as arguments. Directory processing is no longer supported. Use the @file-name syntax or wildcards instead. Wildcards will be expanded by the shell, as expected.
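As a rough illustration of the @file-name mechanism, the following Python sketch writes a list file in either of the two separator styles described above and builds the matching argument list. The helper names write_source_list and agload_args are hypothetical, and ending every entry with a separator (as find -print0 does) is an assumption; only the @ prefix and the -0 option come from this document.

```python
def write_source_list(paths, list_path, null_separated=False):
    """Write the source file names to list_path, one entry per separator."""
    separator = "\0" if null_separated else "\n"
    if not null_separated:
        for path in paths:
            # A line break inside a name would be read as two entries.
            if "\n" in path or "\r" in path:
                raise ValueError(
                    "%r contains a line break; use null_separated=True" % path)
    # newline="" stops Python from translating "\n" on any platform.
    with open(list_path, "w", encoding="utf-8", newline="") as out:
        for path in paths:
            out.write(path + separator)


def agload_args(dbname, list_path, null_separated=False):
    """Build the argument list for loading from a list file."""
    args = ["agload"]
    if null_separated:
        args.append("-0")
    return args + [dbname, "@" + list_path]
```

For instance, agload_args("test_db", "null-separated-list-of-sources.txt", null_separated=True) yields the same arguments as the command shown earlier in this section.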

agload notes

agload may load the sources in a different order than they appear in the command line.

agload makes an attempt to optimize the dispatching of files for maximum use of loaders.

Example

The following example loads a single file into AllegroGraph.

./agload --supersede -v --with-indices "ospgi,posgi,spogi" -e ignore -p 10035 lesmisload lesmis.rdf

In this example, agload will load lesmis.rdf into the lesmisload repository on a running AllegroGraph server that is using port 10035. If there is an existing triple-store named lesmisload, then it will be deleted (as specified by --supersede). The program will generate verbose messages (-v) and will ignore errors (-e ignore), if any. The triple store will generate three triple indices: ospgi, posgi and spogi.

Note on quotation marks in agload command line

As the example above shows, the values of some agload arguments can be strings (the value of --with-indices for example). The example given should work if typed as shown at a shell prompt, but running the command through a program like ssh may fail because of shell quoting issues. This is not the place to discuss that problem in detail or to recommend solutions, although backslashes and/or additional double quotes often seem to be involved. If you are using ssh or something similar and things are not working, be aware that shell quoting issues may be the problem.
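One way to see how much quoting each layer needs is to build the argument list in a script and let a library produce the quoted command. The sketch below uses Python's standard shlex module (shlex.join requires Python 3.8 or later). It is an illustration under assumptions, not something agload prescribes: the helper names are hypothetical, and the host name is borrowed from the Billion Triple Challenge example later in this document.

```python
import shlex

# The --graph value contains double quotes that must reach agload intact.
ARGV = ["bin/agload", "--graph", '"abc123"', "--verbose",
        "--supersede", "foo", "/tmp/test.nt"]


def local_command(argv):
    """Quote argv for one level of shell interpretation."""
    return shlex.join(argv)


def ssh_command(host, argv):
    """Quote argv for two levels: the local shell and the remote shell.

    ssh hands the remote command to the remote user's shell as a single
    string, so that string must itself be one quoted word locally.
    """
    return shlex.join(["ssh", host, shlex.join(argv)])


if __name__ == "__main__":
    print(local_command(ARGV))
    print(ssh_command("myagraphserver", ARGV))
```

Splitting either result back with shlex.split recovers the original arguments, which is a quick check that no quotation marks were lost along the way.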

Options

The following options may be placed on the agload command line.

--port PORT, -p PORT: Set this to the front-end port number of the server you would like to use. Agload must be run on the same machine as the server, and the agload version must be the same version as the server. The default value is 10035.
--catalog CATALOG, -c CATALOG: Use this option to assign the database to a catalog. Absence of a catalog argument implies the root catalog (default).
--input FORMAT, -i FORMAT: Use this option to specify which input format agload should use. The recognized values are: rdfxml, ntriples, nquads, trix, turtle and guess. The default is guess. Use guess under the following conditions: 1) All sources have a recognizable extension, and 2) Every file is actually of the format of its extension. Recognized ntriples extensions are: .nt, .ntriple, .ntriples, .nt.gz, .ntriple.gz, .ntriples.gz, .nt.bz2, .ntriple.bz2, and .ntriples.bz2. Recognized nquads extensions are: .nq, .nquad, .nquads, .nq.gz, .nquad.gz, .nquads.gz, .nq.bz2, .nquad.bz2, and .nquads.bz2. Recognized rdfxml extensions are: .rdf, .rdfs, .owl, .rdf.gz, .rdfs.gz, .owl.gz, .rdf.bz2, .rdfs.bz2, and .owl.bz2. Recognized turtle extensions are: .ttl, .turtle, .ttl.gz, .turtle.gz, .ttl.bz2, and .turtle.bz2. And recognized trix extensions are: .trix, .trix.gz, and .trix.bz2. If any other input format is specified, it takes precedence over the extension. For example, if you have an ntriples file named triples.rdf and specify --input ntriples then triples.rdf will be parsed as an ntriples file. If you use multiple formats in one agload command, it is to your advantage to name the files such that --input guess can determine what they are. Note also that agload cannot currently guess the format of a file based on its contents. The only two compression formats handled by agload are gzip and bzip2. Any files which are compressed must be named with .gz or .bz2 extensions in order to be decompressed. For example, if loading Billion Triple Challenge and the file is named btc-2011-chunk-045.gz then specify --input nquads and agload will automatically determine the compression format from the file type. The use of stdin with agload always requires a non-default value for the input format, since stdin has no file type.
--dispatch-strategy STRATEGY: STRATEGY must be auto or file. The dispatch strategy tells agload how it might parallelize loading. ntriple and nquad files can be broken up and the pieces loaded in parallel. The default is auto, which is a combination of dispatch strategies based on file format. ntriple and nquad files will be broken up and loaded in pieces. Specifying --dispatch-strategy file prevents loading such files in pieces. That is generally only useful when sources are on Solid State Disks (SSDs), or when an emulation of agload versions prior to 4.4 is desired. This setting is only applicable for ntriple and nquad formats. Rdfxml, turtle and trix formats are always dispatched on a file basis.
--blank STRATEGY, -b STRATEGY: This switch changes how blank nodes whose names are used in multiple files are handled. STRATEGY must be one of file, job or none. The use of blank node strategy none with ntriple or nquad files which have blank nodes will cause an error. Blank node strategy none is simply an optimization which is available for ntriple and nquad files which do not have blank nodes. Blank node strategy file is the default. With this strategy, agload will not consider blank nodes in different files to be the same blank node. For example, if _:b1 is found in file1.nt and _:b1 is found in file2.nt during the same load, they will be assigned different UPIs in AllegroGraph. Contrast this with blank node strategy job, which will consider all blank nodes found in ntriple and nquad files to be in the same "scope". This means that if _:b1 is found in file1.nt and _:b1 is found in file2.nt they will be assigned the same UPI in AllegroGraph. For example, Billion Triple Challenge has blank nodes which span multiple files. Note that blank node strategy only applies to ntriple and nquad files. Other formats such as rdfxml, turtle and trix are defined to have a blank node scope of the file.
--error-strategy STRATEGY, -e STRATEGY: The available options for error strategy are cancel, ignore, print and save. Cancel is the default, and will stop the loading process as soon as an error is detected. Error strategy ignore will attempt to silently continue upon an error. Error strategy print is like ignore but will print the error to stdout. Error strategy save is like print but will also log the error to agload.log in the directory from which agload is invoked. Error strategy applies to all recoverable errors, not just parsing errors.
--help, -h: Print helpful information.
--verbose: The presence of the verbose option will cause more information to be printed to stdout. This includes basic information about the load as well as a report of completed files. This generates a lot of output when loading many files (one line per file), so can quickly fill a terminal's scrollback.
--debug: The debug flag will cause debugging information to be output upon error. If agload returns an exit code of 1, which means unhandled error, use the debug flag before contacting support. The flag also enables more verbose output.
--patch FILENAME: If provided, this option will load a patch file into agload before doing anything else. FILENAME is intended to be provided by AllegroGraph support for troubleshooting or correction of problems discovered in the field.
--loaders COUNT, -l COUNT: The loaders option corresponds to the number of processes which will be connecting to AllegroGraph and committing triples. It is for optimization of agload performance. The default depends on the number of physical cores on your server. If you have 1 or 2 cores, loaders will be set to 1. If you have 4 cores, loaders will be set to 3. For more cores it is the number of physical cores minus one. agload also has a task dispatcher process and AllegroGraph has its own processes. If you are not getting satisfactory performance for your load, try increasing or decreasing the number of loaders. If your data has no blank nodes, you may want to set the number of loaders to the number of logical cores on the machine and use --blank none. If you have files dense with blank nodes try decreasing the number of loaders to free up machine resources. For example on an 8 core, 48GB hyperthreaded server, we use --loaders 5 for good performance while loading Billion Triple Challenge. For Lubm-8000 we use --loaders 16.
--encoding ENCODING, -C ENCODING: The encoding option affects the way the ntriple and nquad parser will interpret characters in the source files. Please note that ntriple and nquad formats are defined in the ASCII charset, and this is the default value. However ntriple/nquad files are often generated from data which is not ASCII and the characters are not escaped. While this is not standard, the files are often still interpretable, so given an appropriate --encoding switch, agload will attempt to support them. For example, to load an 8 bit unicode ntriple or nquad file, use the option -C utf8. ENCODING must be a valid AllegroCL external format name or nickname. For example: big5, jis or latin1. (Note: ascii is a nickname for latin1.) For a list of valid encoding names, see the "Name" and "Nicknames" columns in: http://www.franz.com/support/documentation/8.2/doc/iacl.htm#basic-ef-types-3
--relax-syntax: For nquad files, this flag allows a syntax to be used which is often found in Billion Triple Challenge. This is a non-standard parser extension and should only be used when necessary. (In earlier releases, this argument was named --relax-for-btc.)
--duplicates SETTING, -d SETTING: Changes the handling of duplicate triples in the store. The valid values for SETTING are keep (the default); delete, meaning delete all but one of all groups of triples that match subject, predicate, object, and graph; and delete-spo, meaning delete groups that match subject, predicate, and object, regardless of graph. In AllegroGraph, triples are deleted only by the function delete-duplicate-triples. If duplicate deletion is specified, that function is (in effect) called at the end of data loading with arguments suitable for the specified argument, and the load completes when duplicates are deleted. This argument does not affect things after the load is complete as future duplicate deletions are only done when delete-duplicate-triples is called.
--supersede, -s: Delete the store before loading data. This flag should be used with caution, as it can cause an existing database to be deleted. When used, if a database named DBNAME exists, agload will delete it and re-create it before loading triples into it.
--bulk: This flag enables bulk mode while processing the loading job. Bulk mode turns off transaction file processing while the load is occurring. This can provide considerable performance gain for large jobs. We recommend you do a backup before doing a bulk load.
--fti NAME, -f NAME: This option allows for the creation of a free text index after database creation which will be populated during the loading process. This does slow down loading. This option is included for convenience.
--dry-run: This flag tells agload to print the loading strategy but not to load any triples.
--with-indices INDICES: Used to specify the indices of the triple store. The parameter should be a list of index names separated by commas. Example: spogi,posgi,ospgi,i. If not specified, newly created triple stores will use the standard set of indices and existing triple stores will retain their current indices. Note: spaces can also be used as separators but are deprecated. If spaces are used they must be escaped from the shell in some fashion, such as wrapping the index names in quotation marks. For example: "spogi posgi ospgi i".
--rapper: This flag tells agload to use rapper to transform rdf files into ntriples files before loading them. (This flag existed in previous versions of agload and is kept for compatibility.)
--base-uri URI, -u URI: This option is used to specify a base URI for formats that support it, such as rdfxml. Note that if stdin is a source and rdfxml is the input format, --base-uri must be specified.
--graph GRAPH, -g GRAPH: This option is used to specify a default graph for formats that support it, such as ntriples. Special values are: :default, use the default graph node for the triple-store (this is the default value for graph); :source, use the source of the triple (i.e. the filename) as the default graph node; :blank, generate a blank node before loading triples and use the blank node as the default graph. Any other value is interpreted as a resource or literal and used as the default graph. Note that nquad sources supply a graph node value and therefore do not use the default specified here. The three special values start with a colon (:) because users may want to use default, source, or blank as graph names, specified with (for example) --graph default. The colon thus designates the special meaning of those values and allows the plain words to be used like any other graph name. See the examples section for more information on the use of the graph option.
--null, -0: This option is used to signal to agload that the file specified after an @ sign in the FILE inputs is a null separated list. This is useful for loading files with newlines or other strange characters in their names.
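The extension rules listed under --input can be checked before starting a large job. The following Python sketch encodes the table from that option's description. It is only a model of the documented behavior, not agload's own code: the function name classify is hypothetical, and treating extensions as case-sensitive is an assumption.

```python
# Base extensions per input format, as listed under --input.
FORMAT_EXTENSIONS = {
    "ntriples": (".nt", ".ntriple", ".ntriples"),
    "nquads": (".nq", ".nquad", ".nquads"),
    "rdfxml": (".rdf", ".rdfs", ".owl"),
    "turtle": (".ttl", ".turtle"),
    "trix": (".trix",),
}

# The only two compression formats the document says agload handles.
COMPRESSION_EXTENSIONS = {".gz": "gzip", ".bz2": "bzip2"}


def classify(filename):
    """Return (format, compression) for filename; either may be None.

    A format of None means --input guess could not identify the file,
    so an explicit --input value would be needed.
    """
    compression = None
    name = filename
    for extension, kind in COMPRESSION_EXTENSIONS.items():
        if name.endswith(extension):
            compression = kind
            name = name[:-len(extension)]  # strip .gz or .bz2
            break
    for input_format, extensions in FORMAT_EXTENSIONS.items():
        if name.endswith(extensions):
            return input_format, compression
    return None, compression
```

With this model, a file such as btc-2011-chunk-045.gz classifies as gzip-compressed with no recognizable format, which matches the advice above to pass --input nquads explicitly for it.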

Further Examples

Examples of the graph option:

bin/agload --graph :default --verbose --supersede foo /tmp/test.nt

This call uses the default graph node for the particular AllegroGraph database:

triple-store(3): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> .

Contrast that with this call:

bin/agload --graph :source --verbose --supersede foo /tmp/test.nt  
 
triple-store(6): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <file:///tmp/test.nt> .

In this next call, --graph is :blank:

bin/agload --graph :blank --verbose --supersede foo /tmp/test.nt

agload generates a new blank node for the database and uses that as the default graph for the whole job:

triple-store(9): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> _:bC87E16D1x1 .

Here we use another value, which is the word (no colon!) default:

bin/agload --graph default --verbose --supersede foo /tmp/test.nt  
 
triple-store(12): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <default> .

Here we use the resource <http://foo.com/abc#123> for the default graph:

bin/agload --graph http://foo.com/abc#123 --verbose --supersede foo /tmp/test.nt
 
triple-store(24): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <http://foo.com/abc#123> .

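Finally, contrast the following cases. In the first, the value is read as a resource. In the others, the escaped quotation marks make it a literal: a plain string, a language-tagged string, and a typed literal, respectively: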

bin/agload --graph "abc123" --verbose --supersede foo /tmp/test.nt   
 
triple-store(30): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> <abc123> .   
 
bin/agload --graph "\"abc123\"" --verbose --supersede foo /tmp/test.nt   
 
triple-store(33): (print-triples (get-triples-list) :format :nquads)   
<subject> <predicate> <object> "abc123" .   
 
bin/agload --graph "\"adios\"@es-mx" --verbose --supersede foo /tmp/test.nt  
 
triple-store(36): (print-triples (get-triples-list) :format :nquads)  
<subject> <predicate> <object> "adios"@es-mx .  
 
bin/agload --graph "\"123\"^^<http://www.w3.org/2001/XMLSchema#integer>" --verbose --supersede foo /tmp/test.nt

triple-store(39): (print-triples (get-triples-list) :format :nquads)
<subject> <predicate> <object> "123"^^<http://www.w3.org/2001/XMLSchema#integer> .

Example of how to load Billion Triple Challenge:

The Billion Triple Challenge takes approximately 11 hours to load on a server which has 8 cores with hyperthreading, 48 GB of RAM, and a 3 [spinning] disk RAID 0 array for /disk1. We set the expected triple store size in the AllegroGraph config file to be 2200000000.

$ ssh myagraphserver

$ cd /agraph/installation/directory

(start agload) $ nohup time bin/agload --verbose --bulk --with-indices "spogi" -l 5 -d keep -e ignore -i nquads -b job -C utf8 --relax-syntax btc2011 /disk1/test-sources/btc/2011/*

(from another terminal) $ tail -f /agraph/installation/directory/nohup.out
Create the triple-store btc2011 and load 219 sources ...