Introduction

See the document Large Language Models (LLM) and Vector Databases for general information about Large Language Model support in AllegroGraph.

See the document LLM Embed Specification for definitions of embedding and vector database. In order to create embeddings from arbitrary text documents, we generally want to parse or split the document into smaller text chunks, called windows. We want the individual windows to be large enough to contain meaningful knowledge but small enough to fit under the token limits of the embedding model.

There are many types of documents to split (emails, web pages, PDF documents, Word documents, LaTeX documents, and plain text files, to name a few) and many ways to split them (detecting sentences, paragraphs, sections, subsections, and scanning windows over the text, to name a few of those). In this release we provide only two types of splitting, by number of characters or by number of lines, and only for plain text (.txt) files.

A window is specified by a size and an overlap. The units of size and overlap are characters or lines, depending on the type of split applied. The splitting function scans the document and produces windows containing chunks of text of the given size, with adjacent windows sharing the amount of text given by the overlap.
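To make size and overlap concrete, here is a minimal Python sketch of line-based windowing. It is not AllegroGraph's implementation: the function name split_into_line_windows is hypothetical, and the sketch assumes that each window starts (size - overlap) lines after the previous one and that the final window may be shorter than size.

```python
from typing import List


def split_into_line_windows(text: str, size: int = 10, overlap: int = 3) -> List[str]:
    """Split text into windows of `size` lines; adjacent windows share `overlap` lines.

    Assumption (not taken from the AllegroGraph source): the window start
    advances by size - overlap lines each step.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    lines = text.splitlines(keepends=True)
    step = size - overlap  # assumed stride between window starts
    windows: List[str] = []
    start = 0
    while start < len(lines):
        windows.append("".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # this window already reaches the end of the document
        start += step
    return windows


# Usage: a 25-line document with the default size (10) and overlap (3).
doc = "".join(f"line {i}\n" for i in range(1, 26))
for number, window in enumerate(split_into_line_windows(doc)):
    print(number, len(window.splitlines()))
```

Under these assumptions the 25-line document yields four windows starting at lines 1, 8, 15, and 22; the first three hold ten lines each and the last holds the remaining four.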

We can think of window splitting as more like human reading than document parsing. A window is what the eye can view: a word surrounded by other words in a sentence, surrounded by a few other sentences. The eye doesn't care whether its viewing window begins or ends on a word or a sentence or even inside a word. The attention is focused on the sentence in the middle. We scroll our window over the entire document sequentially, just like a person reading the document from top to bottom.

An LLM split specification is an optional file in lineparse form that tells AllegroGraph how to split the documents into windows. If no split specification file is provided, the splitter assumes the default values in the table below.

The files containing the text to split are the last arguments to agtool here:

agtool llm split [--split-spec split.def] reponame file.txt ...  

Lineparse items in specification file

The split specification lineparse file is optional. If it is omitted entirely, the splitter uses the default values shown in the table below. If a specific item is omitted, the splitter uses the default for that item. For example, a file with no split-overlap line gets the default split-overlap value of 3.

item                    min  max  required  default
split-type              0    1    no        window-lines
split-size              0    1    no        10
split-overlap           0    1    no        3
index-name              0    1    no        windowIndex
content-predicate       0    1    no        <http://franz.com/split/content>
source-predicate        0    1    no        <http://franz.com/split/source>
index-predicate         0    1    no        <http://franz.com/split/index>
content-type            0    1    no        <http://franz.com/split/Window>
oversize-content-type   0    1    no        <http://franz.com/split/OversizeWindow>
oversize-content-limit  0    1    no        10000
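The defaulting rule can be sketched in Python as a dictionary of defaults overlaid with whatever items the specification supplies. The item names and default values come from the table above; the parser is a hypothetical simplification that reads only "item value" lines under a split header and does not implement the full lineparse format.

```python
from typing import Dict

# Defaults copied from the table of lineparse items.
SPLIT_DEFAULTS: Dict[str, str] = {
    "split-type": "window-lines",
    "split-size": "10",
    "split-overlap": "3",
    "index-name": "windowIndex",
    "content-predicate": "<http://franz.com/split/content>",
    "source-predicate": "<http://franz.com/split/source>",
    "index-predicate": "<http://franz.com/split/index>",
    "content-type": "<http://franz.com/split/Window>",
    "oversize-content-type": "<http://franz.com/split/OversizeWindow>",
    "oversize-content-limit": "10000",
}


def parse_split_spec(spec_text: str) -> Dict[str, str]:
    """Collect 'item value' pairs from a simplified split spec (hypothetical parser)."""
    items: Dict[str, str] = {}
    for raw_line in spec_text.splitlines():
        line = raw_line.strip()
        if not line or line == "split":
            continue  # skip blank lines and the section header
        name, _, value = line.partition(" ")
        items[name] = value.strip()
    return items


def effective_settings(spec_text: str = "") -> Dict[str, str]:
    """Start from the defaults, then override with items present in the spec."""
    settings = dict(SPLIT_DEFAULTS)
    settings.update(parse_split_spec(spec_text))
    return settings


# Usage: a spec that sets only index-name keeps every other default.
settings = effective_settings("split\n  index-name USC\n")
print(settings["index-name"], settings["split-size"], settings["split-overlap"])
```

With a specification that sets only index-name, every other item keeps its default, and with no specification at all the result is simply the table of defaults.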

Practical Example

Suppose we have an AllegroGraph server running on localhost:10035 and a file usc.txt that contains the United States Constitution. We do not supply this file, but it is easily downloadable from many sources, including https://constitutioncenter.org/the-constitution/full-text.

We write the split specification file usc.def as follows:

split  
  split-type window-lines  
  split-size 10  
  split-overlap 3  
  index-name USC  
  content-predicate <http://franz.com/split/content>  
  source-predicate <http://franz.com/split/source>  
  index-predicate <http://franz.com/split/index>  
  content-type <http://franz.com/split/Window>  
  oversize-content-type <http://franz.com/split/OversizeWindow>  
  oversize-content-limit 10000 

Note that this configuration contains the default value for each lineparse item, except for index-name. So we could just as well have written the file as:

split  
  index-name USC 

This configuration tells the splitter to use the window-lines splitting method (split the file by lines); the size of each window is 10 lines, with an overlap of 3 lines between adjacent windows. The splitter will create triples with a unique subject ID based on the index-name. The index-predicate links each ID to an integer index value. The source-predicate associates each ID with the filename usc.txt (identified when agtool llm split ... is run, see below). The content-predicate links the ID to the text content within the splitting window. The content-type assigns a type to each ID.

The US Constitution is a relatively short document and the selected window sizes are small, so the concept of "oversize" does not come into play in this example. First, we create a destination repository to store the split text (replace localhost:10035 with your actual host and port if they differ, and replace http with https if that is how the server was set up):

agtool repos create http://localhost:10035/repositories/usc --supersede 

We now run:

agtool llm split --split-spec usc.def http://localhost:10035/repositories/usc usc.txt 

This inserts triples into the repository usc that represent the text windows split from the document. Every window contains ten lines except the last one.

Here are some sample RDF triples resulting from the splitting (output somewhat rearranged for ease of viewing).

<http://franz.com/split/USC-26> <http://franz.com/split/content>   
"The Congress shall have power to enforce this article by appropriate legislation.  
 
27th Amendment  
No law, varying the compensation for the services of  
the Senators and Representatives, shall take effect, until an election  
of Representatives shall have intervened.  
"  
<http://franz.com/split/USC-26> <http://franz.com/split/index>  "26"^^<http://www.w3.org/2001/XMLSchema#integer>  
<http://franz.com/split/USC-26> <http://franz.com/split/source> "usc.txt" 
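The shape of these triples can be illustrated with a short Python sketch that formats one window as N-Triples-style lines. This is not the splitter's code: the function window_triples is hypothetical, the subject pattern (index-name, a hyphen, then the index) is inferred from the sample above, and the use of rdf:type for the content-type triple is an assumption, since the sample output does not show that triple.

```python
from typing import List

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"


def literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def window_triples(index_name: str, index: int, content: str, source: str,
                   namespace: str = "http://franz.com/split/") -> List[str]:
    """Format the triples for one window (hypothetical helper).

    The subject pattern is inferred from the sample output; the rdf:type
    predicate for the content-type triple is an assumption.
    """
    subject = f"<{namespace}{index_name}-{index}>"
    return [
        f"{subject} <{namespace}content> {literal(content)} .",
        f'{subject} <{namespace}index> "{index}"^^{XSD_INTEGER} .',
        f"{subject} <{namespace}source> {literal(source)} .",
        f"{subject} {RDF_TYPE} <{namespace}Window> .",
    ]


# Usage: print the triples for a one-line window numbered 26.
for triple in window_triples("USC", 26, "27th Amendment\n", "usc.txt"):
    print(triple)
```

The predicates and the type IRI here are the defaults from the specification table; a different content-predicate or content-type in the split specification would change the corresponding IRIs.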

To see how the splitting works in relation to retrieval augmented generation (RAG), let's see what happens when we embed the split text. First, we create an index specification file usc-vec.def (again replace localhost:10035 if necessary):

gpt  
  embedder openai  
  if-exists supersede  
  api-key "<api key goes here>"  
  vector-database-name localhost:10035/usc-vec  
  limit 1000000  
  splitter list  
  include-predicates <http://franz.com/split/content>  
  include-types <http://franz.com/split/Window> 

See the document LLM Embed Specification for definitions of the index specification lineparse items. This lineparse file creates a vector database called usc-vec. We add the embeddings to the vector database with the following command (the --quiet argument suppresses the voluminous output):

 agtool llm index --quiet localhost:10035/usc usc-vec.def 

Finally, after building the vector database, we can test it with a query. You can run a query on the vector database from any repository, including the vector database itself, but generally we favor running it in the "knowledge" repository the split text came from, in this case usc.

Place the following SPARQL query in a file query.rq:

PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/>  
PREFIX franzOption_openaiApiKey: <franz:api key goes here>  
SELECT  ?response {  
  bind("Can a state have more than two senators, and if not, why?"  
       as ?query)  
  (?response ?score ?citation ?content) llm:askMyDocuments (?query "localhost:10035/usc-vec" 10 0.8).  
} LIMIT 1 

Run the query with agtool:

agtool query http://localhost:10035/repositories/usc query.rq --output-format simple-csv 

And observe the response:

"No, a state cannot have more than two senators.  
This is outlined in the 17th Amendment of the U.S. Constitution,  
which states that the Senate of the United States shall be composed  
of two Senators from each State."
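The four agtool invocations used in this example can be collected into one small Python script. This is a convenience sketch only: the function names pipeline_commands and run_pipeline are hypothetical, the argument lists simply reproduce the commands shown above, and running them requires agtool on the PATH, a reachable server, and the files usc.txt, usc.def, usc-vec.def, and query.rq in the working directory.

```python
import subprocess
from typing import List


def pipeline_commands(host: str = "localhost:10035", repo: str = "usc",
                      text_file: str = "usc.txt", split_spec: str = "usc.def",
                      index_spec: str = "usc-vec.def",
                      query_file: str = "query.rq") -> List[List[str]]:
    """Build the agtool commands from this example as argument lists."""
    repo_url = f"http://{host}/repositories/{repo}"
    return [
        # 1. Create (or replace) the destination repository.
        ["agtool", "repos", "create", repo_url, "--supersede"],
        # 2. Split the text file into windows stored as triples.
        ["agtool", "llm", "split", "--split-spec", split_spec, repo_url, text_file],
        # 3. Embed the windows into the vector database named in the index spec.
        ["agtool", "llm", "index", "--quiet", f"{host}/{repo}", index_spec],
        # 4. Ask a question against the embedded windows.
        ["agtool", "query", repo_url, query_file, "--output-format", "simple-csv"],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Usage: print the commands for review; call run_pipeline(...) to execute them.
for command in pipeline_commands():
    print(" ".join(command))
```

Printing the commands first makes it easy to confirm the host, repository, and file names before anything is executed, which matters here because the first command supersedes any existing repository of the same name.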