Introduction
See the document Large Language Models (LLM) and Vector Databases for general information about Large Language Model support in AllegroGraph.
See the document LLM Embed Specification for definitions of embedding and vector database. In order to create embeddings from arbitrary text documents, we generally want to parse or split the document into smaller text chunks, called windows. We want the individual windows to be large enough to contain meaningful knowledge but small enough to fit under the token limits of the embedding model.
There are many types of documents to split (emails, web pages, PDF documents, Word documents, LaTeX documents, and plain text files, to name a few) and many ways to split them (detecting sentences, paragraphs, sections, subsections, and scanning windows over the text, to name a few of those). In this release we provide only two types of splitting, by number of characters or by number of lines, and only for plain text (.txt) files.
A window is specified by a size and an overlap. The units of size and overlap are characters or lines, depending on the type of split applied. The splitting function scans the document and produces windows containing chunks of text of the given size, with adjacent windows sharing the given amount of overlapping text.
We can think of window splitting as more like human reading than document parsing. A window is what the eye can view: a word surrounded by other words in a sentence, itself surrounded by a few other sentences. The eye doesn't care whether its viewing window begins or ends on a word or a sentence or even inside a word. The attention is focused on the sentence in the middle. We scroll our window over the entire document sequentially, just like a person reading the document from top to bottom.
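The scanning described above can be sketched in a few lines of Python. This is only an illustration, not AllegroGraph's implementation: the function name split_windows is hypothetical, and the handling of the final, shorter window is an assumption. The window advances by size minus overlap units each step, where a unit is a line or a character.

```python
def split_windows(units, size=10, overlap=3):
    """Return overlapping windows over a sequence of units (lines or characters).

    Hypothetical helper for illustration; the real splitter may differ,
    especially in how it treats the end of the document.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and strictly less than size")
    step = size - overlap  # how far the window advances each time
    windows = []
    start = 0
    while start < len(units):
        windows.append(units[start:start + size])
        if start + size >= len(units):
            break  # this window already reached the end of the document
        start += step
    return windows


lines = [f"line {i}" for i in range(1, 25)]  # a 24-line stand-in document
windows = split_windows(lines, size=10, overlap=3)
for index, window in enumerate(windows):
    print(index, window[0], "...", window[-1], f"({len(window)} lines)")
```

Because a string is also a sequence, the same function covers character-based splitting: split_windows("abcdefgh", size=4, overlap=1) yields "abcd", "defg", and "gh".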
An LLM split specification is an optional file in lineparse form that tells AllegroGraph how to split the documents into windows. If no split specification file is provided, the splitter assumes the default values in the table below.
The files containing the text to split are the last arguments to agtool, as shown here:
agtool llm split [--split-spec split.def] reponame file.txt ...
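When the split is scripted, the synopsis above maps directly onto an argument list. The sketch below assumes only what the synopsis shows (the optional --split-spec flag, then the repository, then one or more files). The helper name split_command is hypothetical.

```python
def split_command(repo, files, split_spec=None):
    """Build the argv for `agtool llm split`, following the synopsis above."""
    if not files:
        raise ValueError("at least one file to split is required")
    argv = ["agtool", "llm", "split"]
    if split_spec is not None:
        # The specification file is optional; defaults apply without it.
        argv += ["--split-spec", split_spec]
    return argv + [repo, *files]


argv = split_command("reponame", ["file.txt"], split_spec="split.def")
print(" ".join(argv))

# To actually run it (requires agtool on PATH and a reachable server):
# import subprocess; subprocess.run(argv, check=True)
```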
Lineparse items in specification file
The split specification lineparse file is optional. Omitting it entirely results in the splitter using the default values shown in the table below. Omitting a specific item results in the splitter using the default for that item. Thus, having no line for split-overlap results in split-overlap having the default value of 3.
item | min | max | required | default |
---|---|---|---|---|
split-type | 0 | 1 | no | window-lines |
split-size | 0 | 1 | no | 10 |
split-overlap | 0 | 1 | no | 3 |
index-name | 0 | 1 | no | windowIndex |
content-predicate | 0 | 1 | no | <http://franz.com/split/content> |
source-predicate | 0 | 1 | no | <http://franz.com/split/source> |
index-predicate | 0 | 1 | no | <http://franz.com/split/index> |
content-type | 0 | 1 | no | <http://franz.com/split/Window> |
oversize-content-type | 0 | 1 | no | <http://franz.com/split/OversizeWindow> |
oversize-content-size | 0 | 1 | no | 10000 |
split-type can be window-lines or window-chars.
split-size is the length of each window, in lines (if split-type is window-lines) or characters (if split-type is window-chars).
split-overlap is the amount of overlapping text between windows. Its value must be strictly less than split-size.
index-name is the name used to form an indexed identifier (subject) for each text chunk window, e.g., using the default value: <http://franz.com/split/WindowIndex-0> for the first indexed window.
source-predicate is a predicate to link the split text chunk to its source filename.
content-type identifies the type of the split text content, useful later for selecting split text for embedding.
oversize-content-type and oversize-content-size: The concept of "oversize" applies to window text chunks that may have too many characters for the LLM's embedding token limit. If a window size exceeds oversize-content-size characters, then the split text window has type oversize-content-type. (In earlier versions of the 8.1.1 documentation, this value was incorrectly named oversize-content-limit. oversize-content-size is the correct name.)
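The way defaults, the overlap constraint, and the oversize rule fit together can be sketched as follows. The default values come from the table above; the helper names resolve_spec and window_type are hypothetical, and the checks shown are only those the text states, not a description of what the splitter validates internally.

```python
DEFAULTS = {
    "split-type": "window-lines",
    "split-size": 10,
    "split-overlap": 3,
    "index-name": "windowIndex",
    "content-predicate": "<http://franz.com/split/content>",
    "source-predicate": "<http://franz.com/split/source>",
    "index-predicate": "<http://franz.com/split/index>",
    "content-type": "<http://franz.com/split/Window>",
    "oversize-content-type": "<http://franz.com/split/OversizeWindow>",
    "oversize-content-size": 10000,
}


def resolve_spec(overrides=None):
    """Fill in defaults for omitted items and check the documented constraints."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown items: {sorted(unknown)}")
    spec = {**DEFAULTS, **overrides}
    if spec["split-type"] not in ("window-lines", "window-chars"):
        raise ValueError("split-type must be window-lines or window-chars")
    if not spec["split-overlap"] < spec["split-size"]:
        raise ValueError("split-overlap must be strictly less than split-size")
    return spec


def window_type(text, spec):
    """Pick the type for a window: oversize if it exceeds the character limit."""
    if len(text) > spec["oversize-content-size"]:
        return spec["oversize-content-type"]
    return spec["content-type"]


spec = resolve_spec({"index-name": "USC"})
print(spec["split-overlap"], window_type("We the People", spec))
```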
Practical Example
Suppose we have an AllegroGraph server running on localhost:10035 and a file usc.txt that contains the United States Constitution (we do not supply this file, but it is easily downloadable from many sources, including https://constitutioncenter.org/the-constitution/full-text).
We write the split specification file usc.def as follows:
split
split-type window-lines
split-size 10
split-overlap 3
index-name USC
content-predicate <http://franz.com/split/content>
source-predicate <http://franz.com/split/source>
index-predicate <http://franz.com/split/index>
content-type <http://franz.com/split/Window>
oversize-content-type <http://franz.com/split/OversizeWindow>
oversize-content-size 10000
Note that this configuration contains the default value for each lineparse item, except for index-name. So we could just as well have written the file as:
split
index-name USC
This configuration tells the splitter to use the window-lines splitting method (split the file by lines); the size of each window is 10 lines, with an overlap of 3 lines between adjacent windows. The splitter will create triples with a unique subject ID based on the index-name. The index-predicate links each ID to an integer index value. The source-predicate associates each ID with the filename usc.txt (identified when agtool llm split ... is run, see below). The content-predicate links the ID to the text content within the splitting window. The content-type assigns a type to each ID.
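As a rough illustration of the triples this describes, the following sketch builds N-Triples-style lines for one window. The subject namespace and predicates follow the values in this document. The use of rdf:type to attach the content-type is an assumption, and the helper names are hypothetical.

```python
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"
# Assumption: the window type is attached with rdf:type.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def escape_literal(text):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def window_triples(index, text, source, index_name="USC"):
    """Return the four triples describing one window, as N-Triples lines."""
    subject = f"<http://franz.com/split/{index_name}-{index}>"
    return [
        f'{subject} <http://franz.com/split/content> "{escape_literal(text)}" .',
        f'{subject} <http://franz.com/split/index> "{index}"^^{XSD_INTEGER} .',
        f'{subject} <http://franz.com/split/source> "{escape_literal(source)}" .',
        f'{subject} {RDF_TYPE} <http://franz.com/split/Window> .',
    ]


for triple in window_triples(0, "We the People\nof the United States", "usc.txt"):
    print(triple)
```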
The US Constitution is a relatively short document and the selected window sizes are small, so the concept of "oversize" does not come into play in this example. First, we create a destination repository to store the split text (replace localhost:10035 with your actual host and port, if different, and http with https if that is how the store was set up):
% agtool repos create http://localhost:10035/repositories/usc --supersede
We now run:
agtool llm split --split-spec usc.def http://localhost:10035/repositories/usc usc.txt
This inserts triples into the repository usc that represent the text chunk windows split from the document. Every window contains ten lines except the last one.
Here are some sample RDF triples resulting from the splitting (output somewhat rearranged for ease of viewing).
<http://franz.com/split/USC-26> <http://franz.com/split/content>
"The Congress shall have power to enforce this article by appropriate legislation.
27th Amendment
No law, varying the compensation for the services of
the Senators and Representatives, shall take effect, until an election
of Representatives shall have intervened.
"
<http://franz.com/split/USC-26> <http://franz.com/split/index> "26"^^<http://www.w3.org/2001/XMLSchema#integer>
<http://franz.com/split/USC-26> <http://franz.com/split/source> "usc.txt"
To see how the splitting works in relation to retrieval augmented generation (RAG), let's see what happens when we embed the split text. First, we create an index specification file usc-vec.def (again, replace localhost:10035 if necessary):
gpt
embedder openai
if-exists supersede
api-key "<api key goes here>"
vector-database-name localhost:10035/usc-vec
limit 1000000
splitter list
include-predicates <http://franz.com/split/content>
include-types <http://franz.com/split/Window>
See the document LLM Embed Specification for definitions of the index specification lineparse items. This lineparse file creates a vector database called usc-vec. We add the embeddings to the vector database with the following command (the --quiet argument suppresses the voluminous output):
agtool llm index --quiet localhost:10035/usc usc-vec.def
Finally, after building the vector database, we can test it with a query. You can run a query on the vector database from any repository, including the vector database itself, but generally we favor running it in the "knowledge" repository the split text came from, in this case usc.
Place the following SPARQL query in a file query.rq:
PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/>
PREFIX franzOption_openaiApiKey: <franz:api key goes here>
SELECT ?response {
bind("Can a state have more than two senators, and if not, why?"
as ?query)
(?response ?score ?citation ?content) llm:askMyDocuments (?query "localhost:10035/usc-vec" 10 0.8).
} LIMIT 1
Run the query with agtool:
agtool query http://localhost:10035/repositories/usc query.rq --output-format simple-csv
And observe the response:
"No, a state cannot have more than two senators.
This is outlined in the 17th Amendment of the U.S. Constitution,
which states that the Senate of the United States shall be composed
of two Senators from each State."