Replication and High Availability

Introduction

The Multi-master Replication facility provides a more flexible tool for warm standby than the single-primary replication described here. This older standby tool remains supported.

This document continues the Replication document and gives specific instructions on how to initiate database replication.

Replication is the process by which one or more databases can be kept in sync with a single master database. We refer to the master database as the primary database and the replicants as the secondary databases.

High availability refers to the ability to switch between the primary and secondary at will, so if the primary is suddenly unavailable (for whatever reason), the secondary can be promoted to become the new primary.

The primary database can handle normal read and write operations while it is acting as the replication master. The secondary databases can only handle read operations from clients while they are replicating the master. (More precisely, the clients are free to add and delete triples but they cannot commit these changes to the secondary databases during replication.)

Replication occurs across the network, so any set of AllegroGraph servers connected by a network can participate in replication.

Replication occurs in real time. As commits are made to the primary database they are sent as soon as possible to the secondaries.

Replication can only be done between two instances of the same database. The primary will have at least the same commits as the replica and possibly more. The agtool archive backup command helps you create databases that can be used as secondaries.

The ReplicationPorts configuration option (see Top-level directives in Server Configuration and Control) allows specification of a range of port numbers from which the primary will select listening port numbers for incoming connections from replicas. If the option is specified and none of the ports in the specified range is available, the request to establish a replica will fail. If ReplicationPorts is not specified, the port number will be chosen by the OS.

UUIDs

Every database is assigned a uuid (Universally Unique IDentifier) when it is created. It is a string like de7f021d-b191-99f4-0181-001517d76b50. This is like a fingerprint for the database and can be used to identify it even if the database name changes. The uuid is stored in the file uuid in the database directory. The uuid is important for replication and for point-in-time recovery (see Point-in-Time Recovery).
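As a quick illustration, the following shell sketch checks that a string has the same 8-4-4-4-12 hexadecimal layout as the example uuid above, and reads the uuid file from a database directory. The function names is_uuid and db_uuid are hypothetical and not part of agtool, and the assumption that the uuid file holds only the uuid string is ours rather than something this document states.

```shell
#!/bin/sh
# Hypothetical helpers; not part of agtool.

# Succeeds if $1 is laid out like de7f021d-b191-99f4-0181-001517d76b50
# (8-4-4-4-12 hexadecimal digits, lower- or uppercase).
is_uuid() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$'
}

# Prints the uuid stored in the file "uuid" of database directory $1.
# Assumption: the file contains only the uuid string.
db_uuid() {
    tr -d '[:space:]' < "$1/uuid"
}
```

Comparing the output of db_uuid for two database directories is one way to check, before starting replication, that both are instances of the same database.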

Transaction Logs

Transaction logs record important information about database state. The state of the database is maintained persistently using files on one or more disks. A commit changes the database from one state to another. A commit will likely involve changing two or more files and this means that there will be a period of time during which one file was updated and another is yet to be updated. If the machine crashes at this point the database would be left in an inconsistent state.

Therefore AllegroGraph stores the changes it would make to the files in the transaction log first and then it updates the files. Thus if the machine crashes during the file update, AllegroGraph can look in the transaction log for the set of steps still needed to be done to complete the state change for the commit.

A further optimization is that the database files are not updated on each commit. Instead the commit is only reflected in the transaction log and the in-memory copy of the file data. Periodically an operation called a checkpoint is done. A checkpoint updates all the database files on disk and writes a record in the transaction log to note that it has done this.

While a database is active, transaction logs are written and never read, except during replication, which is described later.

When a database is opened the most recent transaction logs are read to ensure that all commits after the last checkpoint have been applied to the database files. If a database is closed normally then a checkpoint is the last operation performed so there will be no commits after the checkpoint.
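The log-first discipline described above can be illustrated with a toy model. The sketch below is not AllegroGraph's log format or implementation: the file names, the COMMIT and CHECKPOINT record syntax, and all function names are invented for illustration. Commits are appended to the log only, a checkpoint applies pending commits to the data file and records that it did so, and opening the database replays whatever follows the last checkpoint.

```shell
#!/bin/sh
# Toy write-ahead log. Record syntax and names are invented for illustration.
WAL_DIR=$(mktemp -d)
LOG="$WAL_DIR/tlog"      # stands in for the transaction log
DATA="$WAL_DIR/data"     # stands in for the database files
: > "$LOG"
: > "$DATA"

# A commit is recorded in the log only; the data file is not touched.
commit() { printf 'COMMIT %s\n' "$1" >> "$LOG"; }

# Print the payload of every commit logged after the last checkpoint.
pending() {
    awk '/^CHECKPOINT$/ { n = 0; next }
         /^COMMIT /     { buf[++n] = substr($0, 8) }
         END            { for (i = 1; i <= n; i++) print buf[i] }' "$LOG"
}

# Apply pending commits to the data file, then note the checkpoint in the log.
# Applying is idempotent (a line is added only if absent), so a crash
# between the two steps is repaired by running checkpoint again.
checkpoint() {
    pending | while IFS= read -r line; do
        grep -qxF -e "$line" "$DATA" || printf '%s\n' "$line" >> "$DATA"
    done
    printf 'CHECKPOINT\n' >> "$LOG"
}

# Opening the database replays anything logged after the last checkpoint.
open_db() { checkpoint; }
```

After a normal close (a final checkpoint), pending prints nothing, which corresponds to the case above where there are no commits after the last checkpoint.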

What one can conclude from this is that if you're not interested in replication or point-in-time recovery, then you can safely get rid of most of the transaction log files that accumulate on your disk. AllegroGraph provides an automatic way to remove or archive unneeded transaction logs using a process called the Transaction Log Archiver. There are more details on how to configure it in the Transaction Log Archiving document.

Transaction logs are named "tlog-uuid-N" where uuid is the uuid of the database and N is a number starting with 0 and incrementing each time AllegroGraph moves to a new transaction log.
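The naming scheme can be handled with ordinary shell parameter expansion. The helpers below are a sketch with hypothetical names (tlog_name, tlog_seq, tlog_uuid, sort_tlogs); they assume only the "tlog-uuid-N" pattern stated above and a uuid in the five-group form shown in the UUIDs section.

```shell
#!/bin/sh
# Hypothetical helpers for names of the form tlog-<uuid>-<N>.

# Build a log name from a uuid ($1) and a sequence number ($2).
tlog_name() { printf 'tlog-%s-%s\n' "$1" "$2"; }

# Extract N: everything after the last "-".
tlog_seq() { printf '%s\n' "${1##*-}"; }

# Extract the uuid: strip the "tlog-" prefix and the trailing "-N".
tlog_uuid() {
    rest=${1#tlog-}
    printf '%s\n' "${rest%-*}"
}

# Sort log names read from stdin by N, numerically.
# Split on "-": field 1 is "tlog", fields 2-6 are the uuid, field 7 is N.
sort_tlogs() { sort -t - -k 7,7n; }
```

Sorting numerically matters because a plain lexical sort would place tlog-...-10 before tlog-...-2.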

Replication

As commits are done, a database moves from one state to the next. If you have two copies of the same database, one after commit 10 was done and one after commit 20 was done, then replication allows you to move the first database from commit 10 to commit 20 using the transaction logs of the second database. Further, as more commits are added to the second database, replication will cause the first database to see the effect of those commits in its state as well.

There are two requirements for replication:

You can only replicate to the same database. This means that the database uuids must match.
The transaction logs must be available that contain the commit records needed to do the replication.

The steps for replication are as follows.

Assume we have two AllegroGraph servers, which we refer to as primary-server and secondary-server. primary-server is running on machine prime-host and listening at port 20000, and secondary-server is running on machine second-host and listening at port 30000. primary-server has a repository Sales that we wish to replicate. The host, port, user, and password are all encoded in a server specification (see the SERVER SPECs section of the agtool document). The server spec for primary-server would be something like the following, using one of the compact specifications:

user:password@prime-host:20000

and could also be specified with a URI ground store specification:

http://user:password@prime-host:20000

We will generally use the compact specification in our examples.

We first register a replication job on primary-server. We use a server spec to identify the host, port, user and password:

% agtool replicate \
  --primary user:password@prime-host:20000 \
  --name Sales \
  --jobname repl-1 \
  --register

The --uuid argument could have been used instead of the --name argument. Note we do not specify the secondary server at this time.

This tells the system that all Sales transaction log files recording commits subsequent to the registration of replication job repl-1 must be kept around until that replication job indicates it is done with them.

We wish to replicate the Sales repository on secondary-server.

We make a backup of the Sales repository. This backup must be done after the replication job repl-1 is registered. Otherwise some commits after the backup but before the registration may be lost. On machine prime-host do:

% agtool archive backup user:password@prime-host:20000/Sales <sales-dir>

<sales-dir> must be the path of a non-existent or empty directory. You can do the backup while the Sales database is in use.

Now restore that backup to secondary-server:

% agtool archive --port 30000 --replica restore Sales.sec <sales-dir>

Here we have chosen to call the restored database Sales.sec instead of Sales just to illustrate when we use the name on the secondary machine and when we use the name on the primary machine.

We pass the --replica argument to ensure that no processes open this database and modify it before we have a chance to start replication.

Warning! Failure to restore the database in --replica mode can cause database corruption if any operations are performed on the secondary before or during replication.

Finally, we set up secondary-server as the repl-1 replica of Sales:

% agtool replicate \
  --primary user:password@prime-host:20000 \
  --secondary user:password@second-host:30000 \
  --name Sales \
  --jobname repl-1

Once replication starts, it runs until it is explicitly stopped. Should either machine or both (primary-server or secondary-server) go down, replication resumes once the machines and their AllegroGraph servers are running again. To stop replication, run agtool replicate again and pass it the --stop argument. You can also stop replication using the AGWebView browser interface.

agtool replicate will mark the Sales.sec repository on secondary-server as no-commit so that no changes other than from replication will be made to this repository. This is done because if such changes were permitted, then secondary-server's repository would no longer be a copy of primary-server's repository.
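The setup steps of this section can be collected into one place. The sketch below only prints the four commands in the required order so they can be reviewed before being run on the appropriate machine; it does not execute agtool. The variable names, the function name, and the backup path /tmp/sales-backup are hypothetical. It also assumes the backup directory is reachable from second-host, which this document does not cover.

```shell
#!/bin/sh
# Sketch only: prints the agtool commands from this section in order.
# Variable names are hypothetical; values are the example values used above.
PRIMARY_SPEC="user:password@prime-host:20000"
SECONDARY_SPEC="user:password@second-host:30000"
REPO="Sales"                     # name on the primary
REPLICA_REPO="Sales.sec"         # name chosen for the restored copy
JOB="repl-1"
BACKUP_DIR="/tmp/sales-backup"   # assumption: non-existent or empty directory

print_setup_steps() {
    # 1. Register the job first so the needed transaction logs are kept.
    echo "agtool replicate --primary $PRIMARY_SPEC --name $REPO --jobname $JOB --register"
    # 2. Back up the repository on prime-host, after registration.
    echo "agtool archive backup $PRIMARY_SPEC/$REPO $BACKUP_DIR"
    # 3. Restore on second-host in --replica mode so nothing modifies the copy.
    echo "agtool archive --port 30000 --replica restore $REPLICA_REPO $BACKUP_DIR"
    # 4. Start replication; --name is the repository name on the primary.
    echo "agtool replicate --primary $PRIMARY_SPEC --secondary $SECONDARY_SPEC --name $REPO --jobname $JOB"
}

print_setup_steps
```

The order is the point of the sketch: registering before backing up is what guarantees that no commit between the backup and the start of replication is lost.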

High Availability

With replication running as prepared above you can change the roles of primary-server and secondary-server, thus bringing secondary-server online as a read/write repository. This only works if there is exactly one secondary replicating the primary.

% agtool replicate --primary user:password@prime-host:20000 \
                   --secondary user:password@second-host:30000 \
                   --name Sales --switch-roles --become-client

Note that the --primary and --secondary arguments are the same as when we started the replication. In this case the command is sent to the primary machine so we use the database name on the primary.

--switch-roles causes AllegroGraph to put the primary-server repository in no-commit mode. Then all commits that the primary-server has not sent yet are sent to the secondary so that it is totally up to date before the switch. Then the role switch occurs and the no-commit flag is removed from the Sales.sec database on secondary-server.

--become-client tells AllegroGraph on primary-server to start replicating from Sales.sec on secondary-server. If --become-client is not passed in, then secondary-server will still become a read/write server for Sales.sec but primary-server will not attempt to follow commits made on secondary-server.

The agtool replicate program will initiate the role switch and will exit right away, before the role switch has been completed. However, a role switch can take some time (ten seconds or more) depending on how many outstanding commits are still to be sent. During this time, both databases will be in no-commit mode. A process that attempts to commit during this period will receive an error message saying that commits are not possible at this time.

To switch the roles back to their original state, with primary-server as the primary and secondary-server as the secondary, we issue the same command with the --primary and --secondary server specs exchanged, because second-host is the primary at this point. The database name also changes to Sales.sec, the name on the current primary.

% agtool replicate --primary user:password@second-host:30000 \
                   --secondary user:password@prime-host:20000 \
                   --name Sales.sec --switch-roles --become-client
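Because the arguments must always describe the roles as they are at the time the command is issued, it can help to generate the command from the current state. The helper below is a hypothetical sketch (switch_roles_cmd is not an agtool feature); it prints the command rather than running it and uses only the options shown in this section.

```shell
#!/bin/sh
# Hypothetical helper: prints the role-switch command for the CURRENT roles.
# $1 = server spec of the current primary
# $2 = server spec of the current secondary
# $3 = repository name as known on the current primary
switch_roles_cmd() {
    printf 'agtool replicate --primary %s --secondary %s --name %s --switch-roles --become-client\n' \
        "$1" "$2" "$3"
}

# First switch: prime-host is the primary and the repository is Sales there.
switch_roles_cmd user:password@prime-host:20000 user:password@second-host:30000 Sales

# Switching back: second-host is now the primary and the repository is Sales.sec there.
switch_roles_cmd user:password@second-host:30000 user:password@prime-host:20000 Sales.sec
```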

Command Reference

The replication program is one of the agtool utilities; agtool is the general program for AllegroGraph command-line operations. The replication primary and secondary are identified by server specifications (server-specs), which encapsulate all information about a server, including the host, port, scheme, user, and password. See the SERVER SPECs section of the agtool document for more information on server-specs.

agtool replicate [--primary primary-server-spec]
  [--secondary secondary-server-spec]
  [--catalog|-c cat] [--name name] [--uuid uuid]
  [--jobname jobname] [--stop]
  [--status] [--switch-roles] [--become-client]
  [--list]

In earlier releases, the arguments --primary-host, --primary-port, --secondary-host, --secondary-port, --user, and --password were used to specify the primary and secondary hosts. These arguments are still accepted but their use is deprecated. If they are specified as well as --primary and --secondary, the values specified to --primary and --secondary, including their defaults, are used. That means that --primary my-host --primary-port 12345 will use port 10035, as that is the default port for a server spec.

agtool replicate can start and stop replication, switch primary and secondary roles, and get the status from the primary and secondary.
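The precedence rule is easier to see with the port worked out explicitly. The function below is a simplified sketch, not the agtool parser: spec_port is a hypothetical name, and it handles only specs shaped like the examples in this document (an optional scheme, optional user:password@, a host, an optional port, and an optional repository path). The full server-spec syntax is defined in the agtool document.

```shell
#!/bin/sh
# Hypothetical, simplified illustration of the port a server spec resolves to.
# Not the agtool parser; IPv6 hosts and unusual passwords are not handled.
spec_port() {
    spec=${1#*://}         # drop a leading scheme such as "http://"
    spec=${spec%%/*}       # drop any "/repository" suffix
    hostport=${spec##*@}   # drop "user:password@" if present
    case $hostport in
        *:*) printf '%s\n' "${hostport##*:}" ;;   # explicit port in the spec
        *)   printf '10035\n' ;;                  # server-spec default port
    esac
}

spec_port my-host                          # the spec alone already implies 10035
spec_port user:password@prime-host:20000   # explicit port is kept
```

Since the spec my-host already resolves to port 10035 by default, a separate --primary-port 12345 has no effect when --primary is given.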

Databases can be named by catalog and name or by their uuid.

The jobname names this particular replication job. This is important for transaction log archiving as described below. If a jobname argument is not passed in but one is needed, then one will be created.

Replication is started if none of the --stop, --status, or --switch-roles arguments is passed.

If --stop is passed in then replication will stop. For stopping replication only these arguments need be provided:

agtool replicate  
  [--secondary secondary-server]  
  [--catalog|-c cat] [--name name] [--uuid uuid]  
  [--user|-u user] [--password password]  
  --stop

For a status report pass the --status argument. You will see status from both the primary and secondary sides.

--switch-roles causes AllegroGraph to put the primary database in no-commit mode. Then all commits that the secondary has not seen yet are sent to the secondary so that it is totally up to date before the switch. Then the role switch occurs and the no-commit flag is removed from the former secondary database.

--list causes the list of jobnames associated with the specified server(s) to be printed. The --primary and --secondary arguments specify servers, so associated jobnames will be printed for whichever one is specified, or for both if both are specified. For example:

% agtool replicate \
     --primary primary-server-spec \
     --name test \
     --list
Jobnames on primary:
replica-1
replica-2
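For scripting, the job names can be pulled out of this output. The filter below is a sketch with a hypothetical name (list_jobnames); it assumes the output has exactly the shape of the example above, that is, header lines beginning with "Jobnames on" followed by one job name per line. The real output may differ between releases.

```shell
#!/bin/sh
# Reads --list output on stdin and prints one job name per line.
# Assumption: header lines start with "Jobnames on", as in the example above.
list_jobnames() {
    awk '!/^Jobnames on / && NF { print $1 }'
}
```

It would be used by piping the output of the agtool replicate --list command shown above into list_jobnames.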