
Deleting duplicate triples

Introduction

Two triples within a store may be SPO-identical or SPOG-identical. Two different triples are SPO-identical if they have identical subject, predicate, and object. They are SPOG-identical if they also have identical graphs. All triples also have a unique triple id, which is determined by the system and not under user control. Distinct triples always have distinct triple ids even if they are SPO- or SPOG-identical.
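The two definitions can be sketched in a few lines of Python. This is only an illustration of the definitions, not AllegroGraph code: the Triple class, the helper names, and the triple ids are all made up for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str      # "default-graph" stands in for the default graph
    triple_id: int  # system-assigned in a real store; invented here


def spo_identical(a: Triple, b: Triple) -> bool:
    """Two different triples with the same subject, predicate, and object."""
    return a.triple_id != b.triple_id and a[:3] == b[:3]


def spog_identical(a: Triple, b: Triple) -> bool:
    """SPO-identical triples that also have the same graph."""
    return spo_identical(a, b) and a.graph == b.graph


t1 = Triple("franz:person1", "rdf:type", "franz:person", "default-graph", 2314)
t2 = Triple("franz:person1", "rdf:type", "franz:person", "default-graph", 4345)
t3 = Triple("franz:person1", "rdf:type", "franz:person", "graph1", 3417)

print(spo_identical(t1, t2), spog_identical(t1, t2))  # True True
print(spo_identical(t1, t3), spog_identical(t1, t3))  # True False
```

Note that SPOG-identical triples are always SPO-identical as well, but not the other way around.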

SPO-identical and SPOG-identical triples are also called duplicate triples. The term duplicate triples refers to both SPO- and SPOG-identical triples and so is ambiguous.

It is uncommon to deliberately add duplicate triples to a store. Duplicates are usually the result of loading a data file twice or of loading different files which happen to contain some of the same data. Uncoordinated hand entry by multiple people may also produce duplicates. There may occasionally be reasons to load duplicates deliberately, such as wanting to determine whether separate large data files contain duplicate data. This may be difficult to determine by other means, particularly if the files use different formats.

Unless duplicate suppression is enabled (see below), the AllegroGraph system neither detects nor prevents loading of duplicate triples. It is not an error for a store to contain duplicate triples.

Duplicate triples do, however, use resources unnecessarily, can slow down query processing, and may cause misleading results, particularly for queries involving counts of triples with specific components.

AllegroGraph provides facilities for identifying and for deleting duplicate triples. These facilities are described in this document. We first describe what can be done to identify and delete duplicates in general, and then we describe each interface (AGWebview, REST, etc.).

Visible triples

A user can be restricted from viewing certain triples (see Security Implementation). The triples that a user can view are said to be visible to that user. When we talk about duplicates in this document, we always mean duplicates among the triples visible to the current user. The store may contain a single triple that is visible to the user along with additional SPO-identical triples that are not visible, because the user is restricted from seeing triples in the graphs of those other triples. Triples can also have attributes (see Triple Attributes) that restrict which triples a user can see, and duplicate triples can have different attributes.
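As a rough illustration of why visibility matters, the following sketch computes SPO duplicates only over the triples a user can see. It is a toy model: visibility is reduced to a set of graph names, the function name and the triple ids are invented, and real visibility is determined by the security filters and attributes mentioned above.

```python
def visible_spo_duplicates(triples, visible_graphs):
    """SPO duplicates among the triples whose graph the user may see.

    Each triple is (subject, predicate, object, graph, made-up id).
    """
    visible = [t for t in triples if t[3] in visible_graphs]
    lowest_id = {}
    for t in visible:
        key = t[:3]
        lowest_id[key] = min(t[4], lowest_id.get(key, t[4]))
    # Every visible triple except the lowest-id member of its group.
    return [t for t in visible if t[4] != lowest_id[t[:3]]]


store = [
    ("franz:person1", "rdf:type", "franz:person", "default-graph", 2314),
    ("franz:person1", "rdf:type", "franz:person", "graph1", 2341),
]

# A user restricted to the default graph sees one triple and no duplicates.
print(visible_spo_duplicates(store, {"default-graph"}))  # []

# A user who can see both graphs sees one SPO duplicate.
print(len(visible_spo_duplicates(store, {"default-graph", "graph1"})))  # 1
```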

Permission to delete duplicates

Even if duplicate triples are visible to a user, the user may not have permission to delete some or all of the duplicates (again, see Security Implementation). Any command to delete duplicates issued by the user will not delete duplicates the user does not have permission to delete.

Duplicates in federated stores

The functionality (described below) for listing and deleting duplicate triples is not supported in federated stores (see AllegroGraph Federation in the Introduction). You can list/delete duplicates in each individual store which makes up the federation, but not in the federated store itself.

Listing duplicate triples

AllegroGraph will generate a list of all SPO-identical or all SPOG-identical duplicates. We have a repository named duptest that contains (in simplified format) the following 5 triples:

PREFIX franz: <http://www.franz.com/>  
 
franz:person1 rdf:type franz:person default-graph [2314]  
franz:person1 rdf:type franz:person default-graph [4345]  
franz:person1 rdf:type franz:person default-graph [4678]  
franz:person1 rdf:type franz:person graph1 [3417]  
franz:person1 rdf:type franz:person graph1 [2341] 

For each triple, we have shown the Subject, Predicate, Object, Graph, and a notional triple id. Note that while the triple id of a triple can be accessed programmatically, users have no control over what id is assigned. We provide these made-up values just so we can talk about the triple ids in our example. Triple ids are positive integers.

All five triples are SPO-identical. The first three and the last two are also SPOG-identical but none of the first three is SPOG-identical to either of the last two.

We will now list duplicates (we show how to list duplicates programmatically below). If two triples are identical, the one with the higher id is the duplicate. Thus the list of SPO-identical duplicate triples will be

franz:person1 rdf:type franz:person default-graph [4345]  
franz:person1 rdf:type franz:person default-graph [4678]  
franz:person1 rdf:type franz:person graph1 [3417]  
franz:person1 rdf:type franz:person graph1 [2341] 

The triple

franz:person1 rdf:type franz:person default-graph [2314] 

is the non-SPO-duplicate, with the same S, P, and O as the other four, but with the lowest id.

The list of SPOG-identical duplicate triples will be

franz:person1 rdf:type franz:person default-graph [4345]  
franz:person1 rdf:type franz:person default-graph [4678]  
franz:person1 rdf:type franz:person graph1 [3417] 

The two triples

franz:person1 rdf:type franz:person default-graph [2314]  
franz:person1 rdf:type franz:person graph1 [2341] 

are the non-SPOG-duplicates since each has the lowest id among its SPOG fellows.

If you delete SPOG-identical triples, then two of the first three triples and one of the last two will be deleted, with these triples remaining:

franz:person1 rdf:type franz:person default-graph [2314]  
franz:person1 rdf:type franz:person graph1 [2341] 

Those are the non-SPOG-duplicates listed just above. Similarly, if you delete SPO-identical triples, four of the five triples will be deleted, leaving:

franz:person1 rdf:type franz:person default-graph [2314] 
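The rule used in this example (within each group of identical triples, every triple except the one with the lowest id is a duplicate) can be sketched as follows. This is a plain Python model of the example above, not the AllegroGraph implementation; the function name list_duplicates is hypothetical and the ids are the notional ones used in the example.

```python
from collections import defaultdict

# (subject, predicate, object, graph, notional triple id)
TRIPLES = [
    ("franz:person1", "rdf:type", "franz:person", "default-graph", 2314),
    ("franz:person1", "rdf:type", "franz:person", "default-graph", 4345),
    ("franz:person1", "rdf:type", "franz:person", "default-graph", 4678),
    ("franz:person1", "rdf:type", "franz:person", "graph1", 3417),
    ("franz:person1", "rdf:type", "franz:person", "graph1", 2341),
]


def list_duplicates(triples, mode="spog"):
    """Return all triples except the lowest-id member of each identical group."""
    key_len = 3 if mode == "spo" else 4  # compare S, P, O or S, P, O, G
    groups = defaultdict(list)
    for t in triples:
        groups[t[:key_len]].append(t)
    duplicates = []
    for members in groups.values():
        members.sort(key=lambda t: t[4])
        duplicates.extend(members[1:])  # members[0] has the lowest id and is kept
    return duplicates


spo_ids = sorted(t[4] for t in list_duplicates(TRIPLES, "spo"))
spog_ids = sorted(t[4] for t in list_duplicates(TRIPLES, "spog"))
print(spo_ids)   # [2341, 3417, 4345, 4678]
print(spog_ids)  # [3417, 4345, 4678]
```

Deleting the listed triples leaves exactly the survivors shown above: id 2314 in SPO mode, and ids 2314 and 2341 in SPOG mode.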

Getting a list of duplicates

Here are curl commands to get a list of duplicate triples using the REST/HTTP interface. They are applied to the duptest repo we discussed above. You do not see the triple id in any of this output.

First we look at SPO duplicates (we have broken lines for readability). The default graph is not displayed. The other graph (called graph1 above) is <http://franz.com/OFFICER>.

% curl -X GET --header "Accept: text/x-nquads" -u test:xyzzy \  
      "https://localhost:10650/repositories/duptest/statements/duplicates?mode=spo"  
<http://www.franz.com/#person1>  
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  
    <http://www.franz.com/#person>  
    <http://franz.com/OFFICER> .  
<http://www.franz.com/#person1>  
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  
    <http://www.franz.com/#person>  
    <http://franz.com/OFFICER> .  
<http://www.franz.com/#person1>  
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  
    <http://www.franz.com/#person> .  
<http://www.franz.com/#person1>  
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  
    <http://www.franz.com/#person> . 

Since we are looking at SPO-duplicates and all five triples are SPO-identical, we get four duplicates.

Here are the SPOG duplicates:

% curl -X GET --header "Accept: text/x-nquads" -u test:xyzzy \  
      "localhost:10650/repositories/duptest/statements/duplicates?mode=spog"  
<http://www.franz.com/#person1>  
     <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  
     <http://www.franz.com/#person>  
     <http://franz.com/OFFICER> .  
<http://www.franz.com/#person1>  
     <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  
     <http://www.franz.com/#person> .  
<http://www.franz.com/#person1>  
     <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  
     <http://www.franz.com/#person> . 

This returns three duplicates: two with the default graph and one with the OFFICER graph.

The Lisp function for listing duplicates is get-duplicate-triples, which returns a cursor containing the duplicates:

triple-store-user(15): (pprint  
                         (get-triples-list  
                           :cursor (get-duplicate-triples :mode :spog)))  
 
(<person1 type person OFFICER> <person1 type person> <person1 type person>)  
triple-store-user(16): (pprint  
                         (get-triples-list  
                           :cursor (get-duplicate-triples :mode :spo)))  
(<person1 type person OFFICER>  
 <person1 type person OFFICER>  
 <person1 type person>  
 <person1 type person>)  
triple-store-user(17): 

In the Python interface, use RepositoryConnection.getDuplicateStatements.

In the Java interface, see getDuplicateStatements().

Automating duplicate deletion

You can arrange for duplicate triples to be deleted at commit time. Once this is enabled, no new duplicates will be committed. Enabling duplicate deletion at commit time never affects triples that are already committed, however: any duplicates already in the store when you enable the feature remain until you take action to delete them.
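The behavior can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the AllegroGraph API: ToyStore and its attributes are invented, and we assume that a new triple identical to an already committed one is what gets dropped at commit.

```python
class ToyStore:
    """In-memory stand-in for a repository (hypothetical, for illustration)."""

    def __init__(self):
        self.committed = []   # (s, p, o, g) tuples already committed
        self.pending = []     # triples added since the last commit
        self.suppress = None  # None, "spo", or "spog"

    def add(self, s, p, o, g="default-graph"):
        self.pending.append((s, p, o, g))

    def commit(self):
        key_len = {"spo": 3, "spog": 4}.get(self.suppress)
        if key_len is None:
            # No suppression: everything pending is committed as-is.
            self.committed.extend(self.pending)
        else:
            # Only pending triples are checked; committed ones are untouched.
            seen = {t[:key_len] for t in self.committed}
            for t in self.pending:
                if t[:key_len] not in seen:
                    seen.add(t[:key_len])
                    self.committed.append(t)
        self.pending = []


store = ToyStore()
store.add("franz:person1", "rdf:type", "franz:person")
store.add("franz:person1", "rdf:type", "franz:person")
store.commit()
print(len(store.committed))  # 2: duplicates committed before suppression

store.suppress = "spog"
store.add("franz:person1", "rdf:type", "franz:person")  # dropped at commit
store.add("franz:person2", "rdf:type", "franz:person")  # kept
store.commit()
print(len(store.committed))  # 3: the two old duplicates remain
```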

Purging deleted triples

When triples are deleted they are not immediately removed from indices because they must remain in all indices until all transactions which may potentially need to see those triples (such as transactions that started before the triples were deleted) have completed (been committed or rolled back). A triple that cannot possibly be accessed by any live transaction is referred to as an inaccessible triple. AllegroGraph provides a facility to purge inaccessible triples from indices. See Purging Deleted Triples for a complete discussion of deleted triple purging.

The AGWebview interface dealing with duplicate triples

AGWebview is a browser-based interface to AllegroGraph. It is the standard way users interact with AllegroGraph. Duplicate triples are handled on the Repository Overview Page:

Repository Overview Page

The three choices that deal with duplicates are Export duplicate statements, Delete duplicate statements, and Suppress duplicate statements:

Items concerning duplicate triples

Getting a file of duplicate triples: Export duplicate statements

The dropdown menu to the right of the Export duplicate statements choice allows you to select whether you are interested in SPOG-identical duplicates or SPO-identical duplicates:

export-duplicates menu

Select which type of duplicates you are interested in, and then click on the Export duplicate statements choice. You will be prompted for a filename and location, and the duplicates will be written to that file in N-Quads format. Those triples are the triples which will be deleted if you request duplicate deletion. Note that triples which are not visible to the current user will not be written to the file even if they are duplicates.

Deleting duplicate triples: Delete duplicate statements

If you select this choice, a popup window appears asking whether you want to delete SPO-identical triples or SPOG-identical triples:

Delete-duplicate popup window

Select the desired deletion mode and click OK. The duplicates of the selected type will be deleted from the store.

Restoring deleted duplicates

If, before deleting duplicates, you wrote a file of duplicates, you can restore the deleted duplicates by loading the file (see Data Loading).

Commits and duplicates: Suppress duplicate statements

You can have the system automatically delete duplicate triples at commit time. In the AGWebview interface, you can enable this feature using the Suppress duplicate statements choice on the Repository page. Clicking on that choice displays a popup window with a menu of three choices: do not suppress duplicates at commit time, suppress SPOG-identical duplicates at commit time, or suppress SPO-identical duplicates at commit time.

Suppress duplicates choices

REST interface to deleting duplicates

The REST interface is described in the REST/HTTP interface document, which has a section on the commands relating to duplicate triples. In brief, you can get duplicate triples with

GET /repositories/[name]/statements/duplicates 

The mode argument can be spo or spog (the default).

You can delete them with

DELETE /repositories/[name]/statements/duplicates 

The mode argument can be spo or spog (the default).

You can get the duplicate suppression strategy with

GET /repositories/[name]/suppressDuplicates 

and set the duplicate suppression strategy with

PUT /repositories/[name]/suppressDuplicates 

The type argument can be false (no automatic duplicate deletion at commit time), spo, or spog. Disabling automatic duplicate suppression at commit time can also be done with

DELETE /repositories/[name]/suppressDuplicates 
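The following sketch shows how these requests might be assembled with the Python standard library. It only builds the requests and does not send them. The base URL, the helper names, and the omission of authentication are assumptions made for the example; the paths and the mode and type arguments are the ones described above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:10650"  # assumption: adjust scheme, host, and port


def duplicates_request(repo, method="GET", mode="spog"):
    """Build a GET (list) or DELETE (remove) request for duplicate statements."""
    if method not in ("GET", "DELETE") or mode not in ("spo", "spog"):
        raise ValueError("unsupported method or mode")
    query = urlencode({"mode": mode})
    url = f"{BASE_URL}/repositories/{repo}/statements/duplicates?{query}"
    return Request(url, method=method, headers={"Accept": "text/x-nquads"})


def suppression_request(repo, method="GET", strategy=None):
    """Build a GET, PUT, or DELETE request for the suppression strategy."""
    url = f"{BASE_URL}/repositories/{repo}/suppressDuplicates"
    if method == "PUT":
        if strategy not in ("false", "spo", "spog"):
            raise ValueError("strategy must be 'false', 'spo', or 'spog'")
        url += "?" + urlencode({"type": strategy})
    elif method not in ("GET", "DELETE"):
        raise ValueError("unsupported method")
    return Request(url, method=method)


req = duplicates_request("duptest", "DELETE", "spo")
print(req.get_method(), req.full_url)

req = suppression_request("duptest", "PUT", "spog")
print(req.get_method(), req.full_url)
```

A request built this way could be sent with urllib.request.urlopen once credentials have been added, which this sketch leaves out.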

Java interface to deleting duplicates

The Java interface is described in the Javadocs. The relevant class is AGRepositoryConnection, and the methods are getDuplicateStatements() (for getting duplicates) and deleteDuplicates() (for deleting them).

Python interface to deleting duplicates

The relevant methods, both in the RepositoryConnection class, are:

def getDuplicateStatements(mode)  
def deleteDuplicateStatements(mode) 

Note that you cannot enable suppression of duplicates at commit time using the Python interface. See the Python API document.
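As a hedged sketch of how the two methods might be combined, the helper below lists the duplicates and then deletes them. It assumes that getDuplicateStatements returns something iterable and that mode is passed as "spo" or "spog"; check the Python API document for the actual types. The helper name and the StubConnection class are hypothetical, and the stub exists only so the example runs without a server.

```python
def delete_duplicates_with_report(conn, mode="spog"):
    """List the duplicates first, then delete them; return how many were listed."""
    if mode not in ("spo", "spog"):
        raise ValueError("mode must be 'spo' or 'spog'")
    listed = list(conn.getDuplicateStatements(mode))  # assumes an iterable result
    conn.deleteDuplicateStatements(mode)
    return len(listed)


class StubConnection:
    """Hypothetical stand-in for a RepositoryConnection."""

    def __init__(self, duplicates_by_mode):
        self.duplicates_by_mode = duplicates_by_mode
        self.deleted_modes = []

    def getDuplicateStatements(self, mode):
        return iter(self.duplicates_by_mode[mode])

    def deleteDuplicateStatements(self, mode):
        self.deleted_modes.append(mode)
        self.duplicates_by_mode[mode] = []


conn = StubConnection({"spo": ["d1", "d2", "d3", "d4"], "spog": ["d1", "d2", "d3"]})
print(delete_duplicates_with_report(conn, "spog"))  # 3
```

With a real connection, the same helper would be called with the connection object in place of the stub.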

Lisp interface to deleting duplicates

See Deleting triples. The function get-duplicate-triples returns a cursor of duplicates. The function delete-duplicate-triples deletes duplicates. The function duplicate-suppression-strategy controls whether duplicates are deleted at commit time.