Introduction
See the document Large Language Models (LLM) and Vector Databases for general information about Large Language Model support in AllegroGraph. In this document we discuss embeddings and creating vector databases.
An embedding is a vector representation of natural language text. A vector database is a table of embeddings and the associated original text. An AllegroGraph Vector Database associates embeddings with literals found in the triple objects of a graph database. This Vector Database also stores the subject and predicate of each triple whose object was embedded. This permits a mapping from literals to subject URIs in support of nearest-neighbor matching between an input string and the embedded object literals, as shown in the example below.
We assume in this document that you have defined the llm namespace this way:
PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/>
See Namespaces and query options for information on namespaces.
In AllegroGraph the magic predicates llm:nearestNeighbor and llm:askMyDocuments use the vector database. A query clause of the form
(?uri ?score ?originalText) llm:nearestNeighbor (?text ?vectorDatabase ?topN ?minScore)
binds ?uri to the subject of each of the ?topN best-matching items in the vector database named by the literal bound to ?vectorDatabase, keeping only matches whose score is above the float bound to ?minScore. The llm:nearestNeighbor predicate also binds the matching score to ?score and the source text to ?originalText.
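The matching step can be pictured with a small sketch. This is not AllegroGraph's implementation: the scoring metric is assumed here to be cosine similarity, and the function names and the (uri, text, vector) row layout are hypothetical, chosen only to mirror the ?topN and ?minScore arguments.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vector, rows, top_n, min_score):
    """rows: (uri, original_text, vector) tuples, a hypothetical layout.

    Returns up to top_n (uri, score, original_text) tuples, best first,
    keeping only those whose score is above min_score.
    """
    scored = [(uri, cosine_similarity(query_vector, vec), text)
              for uri, text, vec in rows]
    kept = [row for row in scored if row[1] > min_score]
    kept.sort(key=lambda row: row[1], reverse=True)
    return kept[:top_n]

# Toy two-dimensional "embeddings"; real embeddings are much larger.
rows = [
    ("http://example.org/a", "alpha", [1.0, 0.0]),
    ("http://example.org/b", "beta", [0.6, 0.8]),
    ("http://example.org/c", "gamma", [0.0, 1.0]),
]
matches = nearest_neighbors([1.0, 0.0], rows, top_n=2, min_score=0.5)
for uri, score, text in matches:
    print(uri, round(score, 3), text)
```

In this toy data the third row scores 0 against the query vector, so it falls below the minimum score and is never returned, regardless of the top_n value.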
The predicate llm:askMyDocuments implements Retrieval Augmented Generation (RAG). The predicate first retrieves the nearest-neighbor matching document fragments from the vector database, then forms a larger LLM prompt that combines this background content with the original query. A query clause of the form
(?response ?citation ?score) llm:askMyDocuments (?query ?vectorDatabase ?topN ?minScore)
binds ?response to the LLM's response to a large prompt that combines the ?query with text content from the ?vectorDatabase, selected from the ?topN nearest-neighbor matches scoring above ?minScore. It also binds ?citation to the subject URI of each matching document that contributes to the response, which lets the predicate explain its response.
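As a rough illustration of the prompt-assembly step in Retrieval Augmented Generation, the sketch below combines retrieved fragments with the query. The prompt wording, the function name, and the (citation, score, text) tuple layout are all assumptions for illustration; the actual prompt that llm:askMyDocuments sends to the LLM is not described in this document.

```python
def build_rag_prompt(query, matches):
    """Combine retrieved fragments with the user's query into one prompt.

    matches: list of (citation_uri, score, text) tuples, a hypothetical
    layout mirroring what a nearest-neighbor lookup might return.
    """
    # Number each fragment so a response could refer back to its source.
    context = "\n".join(
        f"[{i}] {text}" for i, (_uri, _score, text) in enumerate(matches, start=1)
    )
    return (
        "Answer the question using only the background below.\n\n"
        f"Background:\n{context}\n\n"
        f"Question: {query}"
    )

matches = [
    ("http://example.org/doc1", 0.91, "Marie Curie studied radioactivity."),
    ("http://example.org/doc2", 0.85, "Isaac Newton formulated the laws of motion."),
]
prompt = build_rag_prompt("Who studied radioactivity?", matches)
citations = [uri for uri, _score, _text in matches]
print(prompt)
```

Keeping the citation URIs alongside the fragments is what makes it possible to report which documents contributed to a response.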
An LLM embed specification is a file in lineparse form that tells AllegroGraph which object strings in a repository should be converted to numerical vectors and stored in the vector database. The embedder stores that vector along with the original text and the corresponding subject URI in a vector database associated with the repo.
See the document Large Language Models for details and the document Lineparse Format for a general description of lineparse files.
A file containing the embed specification is the last argument to agtool here:
agtool llm index reponame specification-file
To index strings, the Large Language Model (LLM) embedder contacts an LLM API to process each text item and return a large vector of numbers (for example, a vector of 1536 elements for OpenAI embeddings). The lineparse-formatted specification file determines which text strings are sent to the LLM server.
Min and Max in the table below refer to the number of arguments to the named item.
Lineparse Items in specification file
item | min | max | required |
---|---|---|---|
embedder | 0 | 1 | no |
model | 0 | 1 | no |
api-key | 0 | 1 | no |
vector-database-name | 1 | 1 | yes |
if-exists | 0 | 1 | no |
splitter | 0 | 1 | no |
include-predicates | 0 | no max | no |
exclude-predicates | 0 | no max | no |
include-types | 0 | no max | no |
exclude-types | 0 | no max | no |
limit | 0 | 1 | no |
property | 0 | no max | no |
The embedder is the service that converts a text string to a vector of floats. The default is "demo", which creates an embedding quickly and at no cost and does not require an api-key, but the resulting embedding carries no meaning. The embedder "openai" computes a useful embedding vector but is slower and typically not free to use.
The model is used to distinguish between different ways that an embedder can create an embedding. If not specified, the default for that embedder is chosen. For example, for the embedder openai the default model is text-embedding-ada-002.
The api-key is the appropriate API key for the chosen embedder, if such a key is required. An OpenAI API key must be obtained from openai.com. (Keys shown in examples in the AllegroGraph documentation are not valid.)
You must specify the vector-database-name. It can be any repo specification but is typically just the name of a repo. If it specifies an existing repo, it must be a vector database repo (as created by a previous call to agtool llm index or directly by agtool repos create).
If the specified vector database does not exist then it will be created. In that case you must have specified the embedder, and possibly the model and api-key, so those can be stored in the newly created vector store.
If the specified vector-database-name already exists then the value of if-exists is consulted. if-exists can be "open" (the default), meaning open the existing vector store and add data to it. If if-exists is "supersede" then the vector database will be re-created, in which case the values of embedder, and possibly model and api-key, are used.
The vector-database-dim is a property of the LLM server used. The default, 1536, is the size of embeddings generated by OpenAI models.
Currently only one splitter is defined (list), so that line should be omitted or the value should be list. We recommend that this line be omitted since we are developing alternative ways to specify the splitting of text. This directive may go away in the future.
The strings that are processed are those found in the object position of a triple.
The embedder selects object literals for embedding. When include-predicates is specified, the embedder creates vectors only for objects in triples with the included predicates. When exclude-predicates is specified, the embedder omits objects in triples with those predicates.
You have the option to specify both include-predicates and exclude-predicates, only one, or neither. If the same predicate is listed in both include-predicates and exclude-predicates then exclude-predicates takes precedence. However, there is no reason to specify both exclude-predicates and include-predicates.
The point of exclude-predicates is that you may want to include all predicates in the embedding selection, except for a short list of predicates you want to exclude. However, if you also have an include-predicates item, only the listed predicates are considered rather than all predicates.
You can specify more than one predicate either on the same line or different lines.
For example, to include three predicates you can write:
include-predicates <http://sample.com/pred-a> <http://sample.com/count\#234>
include-predicates <http://sample.com/pred-c>
Note that because in the Lineparse format the hash character (#) starts a comment, if a URL contains a hash character one must precede the hash character with a backslash in order to turn the hash character into a normal character that doesn't start a comment. The example above demonstrates that.
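When specification files are generated by a script, the escaping rule is easy to forget. The following is a minimal sketch of a helper that escapes the hash character before writing an include-predicates line; the function names are hypothetical, and it assumes the input URIs are not already escaped.

```python
def escape_for_lineparse(uri):
    """Backslash-escape '#' so lineparse does not treat it as a comment start.

    Assumes the URI has not been escaped already.
    """
    return uri.replace("#", "\\#")

def include_predicates_line(predicate_uris):
    """Render one include-predicates line for the given predicate URIs."""
    escaped = " ".join(f"<{escape_for_lineparse(uri)}>" for uri in predicate_uris)
    return f"include-predicates {escaped}"

line = include_predicates_line([
    "http://sample.com/pred-a",
    "http://sample.com/count#234",
])
print(line)
```

URIs without a hash character pass through unchanged, so the helper can be applied to every predicate uniformly.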
If you specify either include-types or exclude-types then that further refines the search for text to process. In that case only triples of the form
subject predicate text-object
are considered if there is also a triple
subject rdf:type type
where type is one of the include-types, if any are specified, and type is not one of the exclude-types, if any are specified.
Also the predicate must obey the include-predicates and exclude-predicates if any are specified.
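The selection rules for predicates and types can be summarized in a short sketch. This is one reading of the rules as described here, not AllegroGraph's code: the function name and the four-element triple tuples (with a flag marking literal objects) are hypothetical, and a subject with several types is assumed to qualify if any one of its types passes the filter.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def select_literals(triples, include_predicates=(), exclude_predicates=(),
                    include_types=(), exclude_types=()):
    """triples: (subject, predicate, object, object_is_literal) tuples.

    Returns the (subject, predicate, object) triples whose object literal
    would be selected for embedding under the rules sketched above.
    """
    # Collect the rdf:type values of every subject.
    types_of = {}
    for s, p, o, _is_literal in triples:
        if p == RDF_TYPE:
            types_of.setdefault(s, set()).add(o)

    def predicate_ok(p):
        if p in exclude_predicates:  # exclusion takes precedence
            return False
        return not include_predicates or p in include_predicates

    def type_ok(s):
        if not include_types and not exclude_types:
            return True  # no type filtering requested
        return any(
            (not include_types or t in include_types) and t not in exclude_types
            for t in types_of.get(s, ())
        )

    return [(s, p, o) for s, p, o, is_literal in triples
            if is_literal and predicate_ok(p) and type_ok(s)]

triples = [
    ("ex:newton", RDF_TYPE, "http://franz.com/HistoricalFigure", False),
    ("ex:newton", RDFS_LABEL, "Isaac Newton", True),
    ("ex:newton", "ex:note", "Born in 1642", True),
    ("ex:rover", RDF_TYPE, "ex:Dog", False),
    ("ex:rover", RDFS_LABEL, "Rover", True),
]
selected = select_literals(
    triples,
    include_predicates={RDFS_LABEL},
    include_types={"http://franz.com/HistoricalFigure"},
)
print(selected)
```

Only the label of the subject typed as a historical figure survives both filters; the note is dropped by the predicate filter and the other label by the type filter.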
The property item allows you to specify a predicate and object to be associated with each object embedded. The property item can be repeated and always has two arguments:
property name value
This adds a triple with predicate http://franz.com/prop/name and the given value as object. The value should be a literal or resource in ntriple format. Thus for a resource you would write <http://foo.com/bar> and for a literal "\"a literal value\"". Anything that's not in ntriple syntax is considered a literal, so a value of
"a literal value"
would be considered the same literal as
"\"a literal value\""
For each item indexed, the vector database stores the embedding vector along with the subject URI, the predicate URI, the original text of the object literal and optionally the object type (when include-predicates or exclude-predicates is specified).
Practical Example
Suppose we have an AllegroGraph server running on localhost:10035 with a repository called HistoricalFigures that contains information about people from the past. For each historical person, there is a unique subject URI with predicates rdfs:label and rdf:type. The historical figures have type <http://franz.com/HistoricalFigure>.
We write the embed specification file historicalFigures.def as follows:
gpt
api-key "sk-U01ABc2defGHIJKlmnOpQ3RstvVWxyZABcD4eFG5jiJKlmno"
vector-database-name "historicalFigures"
limit 1000000
splitter list
include-predicates <http://www.w3.org/2000/01/rdf-schema\#label>
include-types <http://franz.com/HistoricalFigure>
This configuration tells the embedder to use OpenAI (gpt) embeddings, provides our API key (the example shown is not a valid key), and sets the name of the vector database to "historicalFigures". Note that we had to escape the # character with a backslash in the include-predicates line, because lineparse treats # as the start of a comment.
The vector-database-dim is 1536 for OpenAI embeddings.
The limit of 1000000 is optional, in case we need to limit the size of the Vector Database.
The splitter list is the default value for text splitting (i.e., no splitting).
The include-predicates and include-types tell the embedder that we want to include triples where the subject has type <http://franz.com/HistoricalFigure> and the predicate is rdfs:label. Note that we had to escape the '#' character in the URI for rdfs:label, because lineparse interprets '#' as a comment start character.
We now run
agtool llm index localhost:10035/HistoricalFigures historicalFigures.def
We use the term index to refer to the embedding process because a Vector Database is essentially an index from embedding vectors to original text.
By default agtool will print the text of each literal embedded, allowing us to monitor progress. Waiting for responses from the LLM API takes the majority of time in the embedding process.
When the embedding index completes, we can execute a magic predicate that uses the vector database we created:
(?uri ?score ?originalText) llm:nearestNeighbor ("Famous Scientist" "historicalFigures" 10 0.8)
The query binds ?uri and ?originalText to up to 10 of the best matches for "Famous Scientist" among the historical figures with a matching ?score above 0.8.
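To run the clause it has to be embedded in a complete SPARQL query that declares the llm prefix. The sketch below builds such a query string; the helper name and the SELECT wrapper are assumptions for illustration, and how the query is submitted (AGWebView, a client library, or agtool) is left open.

```python
LLM_PREFIX = "PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/>"

def nearest_neighbor_query(text, vector_database, top_n, min_score):
    """Build a SPARQL query string around the llm:nearestNeighbor clause."""
    def quote(value):
        # Escape backslashes and double quotes for a SPARQL string literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return (
        f"{LLM_PREFIX}\n"
        "SELECT ?uri ?score ?originalText WHERE {\n"
        "  (?uri ?score ?originalText) llm:nearestNeighbor "
        f"({quote(text)} {quote(vector_database)} {top_n} {min_score})\n"
        "}"
    )

query = nearest_neighbor_query("Famous Scientist", "historicalFigures", 10, 0.8)
print(query)
```

Quoting the text argument in one place keeps user-supplied search strings containing double quotes from breaking the query.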