LLM Split Specification | AllegroGraph 8.2.0

Introduction

See the document Large Language Models (LLM) and Vector Databases for general information about Large Language Model support in AllegroGraph.

See the document LLM Embed Specification for definitions of embedding and vector database. In order to create embeddings from arbitrary text documents, we generally want to parse or split the document into smaller text chunks, called windows. We want the individual windows to be large enough to contain meaningful knowledge but small enough to fit under the token limits of the embedding model.

There are many types of documents to split (emails, web pages, PDF documents, Word documents, LaTex documents, and plain text files, to name a few) and many ways to split them (detecting sentences, paragraphs, sections, subsections, and scanning windows over the text, to name a few of those). In this release we provide only 2 types of splitting, by number of characters or number of lines, and only for plain text (.txt) files.

A window is specified by a size and an overlap. The units of size and overlap are in characters or lines, depending on the type of split applied. The splitting function scans the document and windows containing chunks of text of the given size.

We can think of window splitting as more like human reading than document parsing. A window is what the eye can view: a word surrounded by other words in a sentence surrounded by a few other sentences. The eye doesn't care whether its viewing window begins or ends on a word or a sentence or even inside a word. The attention is focused on the sentence in the middle. We scroll our window over the entire document sequentially, just like a person reading the document from top to bottom.

You start with a text file of interest and you create a new, empty repository. Then you give instructions to split the text into triples and those triples are stored in the repository. Once the repo of split text triples has been created, a vector database can be created from that repo which can be queried.

The US Constitution as an example

We will use the United States Constitution as an example. We do not supply this but easily downloadable from many sources including https://constitutioncenter.org/the-constitution/full-text. In this document, we assume you have downloaded a copy into the file usc.txt.

Giving instructions for splitting a text document

LLM split instructions can be specified in two ways:

Using the New WebView text split command, described below in the section Using WebView to split text.
Creating a split instruction file (or simply using the defaults) and calling agtool with the command llm split and additional arguments. See Splitting with agtool.

Using WebView to split text

You can use a dialog in New WebView to split text. We assume you have the United States Constitution saved as a text file named usc.txt.

Create a repo (we name it usc) and open it in New WebView. Choose Add, delete & import data from the repository menu on the left so this page is displayed:

Add, delete, import data text file choice

The Import text file as triples choice is marked with a red arrow. Click on that choice and this page is displayed:

Text file import page

Fill in the Import Options fields, whose meanings are described in the tables in the section below. There are many fields but, at least when starting out, the defaults can be accepted for almost all. For the example detailed below, specify the Index name only: usc. The text file to split is specified by dragging it into the Drag-N-Drop box or by using a file select dialog. We click VERIFY & IMPORT FILE to complete the process. (The example has just one text file but as many as you wish can be specified.)

Splitting with agtool

There is a command-line option for splitting text, using the agtool command tool. (Note agtool is not available with AllegroGraph.cloud.)

The instructions for splitting text are given in an optional file in lineparse format with the extension .def. If a split specification file is not provided, the splitting uses the default values shown in the table below.

Here is the agtool command line for splitting. One or more files containing the text to split are given as the last arguments:

agtool llm split [--split-spec split.def] reponame file.txt ...

In the US Constitution example, here is the .def file which we name usc.def:

split  
  split-type window-lines  
  split-size 10  
  split-overlap 3  
  index-name usc  
  content-predicate <http://franz.com/split/content>  
  source-predicate <http://franz.com/split/source>  
  index-predicate <http://franz.com/split/index>  
  content-type <http://franz.com/split/Window>  
  oversize-content-type <http://franz.com/split/OversizeWindow>  
  oversize-content-size 10000

In fact all values in that file are defaults except index-name usc. We wanted to show the general form of the file. So the command line for this example is (see Repository Specification for how to specify a repository in an agtool command line):

agtool llm split --split-spec usc.def [usc-repo-spec] usc.txt

See the Practical example for more details.

Lineparse items in .def specification file

Instructions for splitting text are either specified in the dialog displayed in New WebView or in a .def file in lineparse format that is used by agtool. There are defaults for all values. When using agtool the split specification lineparse file is optional. Omitting it entirely results in the splitter choosing the default values shown in the table below. Omitting a specific item results in the splitter using the default for that value. Thus having no line for split-overlap results in split-overlap having the default value of 3.

item	min	max	required	default
split-type	0	1	no	window-lines
split-size	0	1	no	10
split-overlap	0	1	no	3
index-name	0	1	no	windowIndex
content-predicate	0	1	no	<http://franz.com/split/content>
source-predicate	0	1	no	<http://franz.com/split/source>
index-predicate	0	1	no	<http://franz.com/split/index>
content-type	0	1	no	<http://franz.com/split/Window>
oversize-content-type	0	1	no	<http://franz.com/split/OversizeWindow>
oversize-content-size	0	1	no	10000

split-type can be window-lines or window-chars.
split-size is the length of each window, in lines (if split-type is window-lines) or characters (if split-type is window-chars).
split-overlap is the amount of overlapping text between windows. Its value must be strictly less than split-size.
index-name is the name used to form an indexed identifier (subject) for each text chunk window, e.g using the default value:
```
<http://franz.com/split/WindowIndex-0> 
```

for the first indexed window.

source-predicate is a predicate to link the split text chunk to its source filename.
content-type identifies the type of the split text content, useful later for selecting split text for embedding.
oversize-content-type and oversize-content-size: The concept of "oversize" applies to window text chunks that may have too many characters for the LLM's embedding token limit. If a window size exceeds oversize-content-size characters, then the split text window has type oversize-content-type. (In earlier vesions of the 8.1.1 documentation, this value was incorrectly named oversize-content-limit. oversize-content-size is the correct name.)

Practical Example

We have given partial instructions on splitting the text of the US constitution in the discussion above. We repeat the steps here showing the two methods.

Splitting Using New WebView

This method must be used when running AllegroGraph.cloud and is optional when running a locally installed AllegroGraph server.

Create a repository named usc (Choose a catalog from the list on the left, one typically chooses root; click on the Create Repository button on the upper right, specify the name usc. If the repo already exists, delete all its triples.)
Open the usc repo and click on Add, delete, & import data from the Repository menu on the left.

Add, delete, import data text file choice

Click on the Import text file as triples choice (marked with a red arrow). This page is displayed:

Text file import page

Specify usc as the index name (already done in the illustration). All other values are defaults. Drag-N-Drop usc.txt into the import box (or use a file selection tool) and click Verify and Import File.

After processing completes, split text triples will be stored in the usc as described below.

Splitting Using agtool

Suppose we have an Allegrograph running on localhost:10035 and we have a file usc.txt that contains the United States Constitution (again. we do not supply this but easily downloadable from many sources including https://constitutioncenter.org/the-constitution/full-text).

We write the split specification file usc.def as follows:

split  
  split-type window-lines  
  split-size 10  
  split-overlap 3  
  index-name usc  
  content-predicate <http://franz.com/split/content>  
  source-predicate <http://franz.com/split/source>  
  index-predicate <http://franz.com/split/index>  
  content-type <http://franz.com/split/Window>  
  oversize-content-type <http://franz.com/split/OversizeWindow>  
  oversize-content-size 10000

Note that this configuration contains the default value for each lineparse item, except for index-name. So we could just as well have written the file as:

split  
  index-name USC

This configuration tells the splitter to use the window-lines splitting method (split the file by lines); the size of each window is 10 lines, with an overlap of 3 lines between adjacent windows. The splitter will create triples with a unique subject ID based on the index-name. The index-predicate will link each ID to an integer index value. The source-predicate will associate each ID to the filename usc.txt (identified when agtool llm split ... is run, see below). The content-predicate links the ID to the text content within the splitting window. The content-type assigns a type to each ID.

The US Constitution is a relatively short document and the selected window sizes are small, so the concept of "oversize" does not come into play in this example. First, we create a destination repository to store the split text (replace localhost:10035 with your actual host and port, if different and http with https if that is how the store was set up):

% agtool repos create http://localhost:10035/repositories/usc  --supersede

We now run

agtool llm split --split-spec usc.def http://localhost:10035/repositories/usc usc.txt

This inserts triples into the repository usc, representing text chunk windows, all with ten lines except the last one, containing text split from the document.

Once splitting is complete

Here are some sample RDF triples resulting from the splitting (output somewhat rearranged for ease of viewing).

<http://franz.com/split/USC-26> <http://franz.com/split/content>   
"The Congress shall have power to enforce this article by appropriate legislation.  
 
27th Amendment  
No law, varying the compensation for the services of  
the Senators and Representatives, shall take effect, until an election  
of Representatives shall have intervened.  
"  
<http://franz.com/split/USC-26> <http://franz.com/split/index>  "26"^^<http://www.w3.org/2001/XMLSchema#integer>  
<http://franz.com/split/USC-26> <http://franz.com/split/source> "usc.txt"

This repo is suitable for embedding as described next.

Embedding split text

To see how the splitting works in relation to retrieval augmented generation (RAG), let's see what happens when we embed the split text. See the document LLM Embed Specification for a description of embedding and definitions of the index specification lineparse items. The embedding process creates a vector database associated with the split text repo usc we have created.

We can embed with New WebView, described next, and with agtool. When running AllegroGraph.cloud you must use New WebView.

Embedding with New WebView

When the usc repo is open, open the Repository control menu and select Create LLM Embedding:

Repository Control menu

Fill in the dialog, as shown using the following values (see the document LLM Embed Specification for definitions of the index specification items):

New of existing vector database: usc-vec.
If Vector Database exits: supersede
Limit: 1000000
Embedded: openai
Model: (this will fill in automatically)
API key: enter your openaiApiKey.
Include triples to index by predicate: http://franz.com/split/content
Incluse triples to index by types: http://franz.com/split/Window

Create LLM Embedding dialog

When ready, click Launch Index at the bottom. Creating the index may take some time. When it competes, you are ready to query the database.

Embedding with agtool

First, create an index specification file usc-vec.def (replace localhost:10035 if necessary):

embed  
  embedder openai  
  if-exists supersede  
  api-key "<api key goes here>"  
  vector-database-name localhost:10035/usc-vec  
  limit 1000000  
  splitter list  
  include-predicates <http://franz.com/split/content>  
  include-types <http://franz.com/split/Window>

See the document LLM Embed Specification for definitions of the index specification lineparse items. This lineparse file creates a vector database called usc-vec. We add the embeddings to the vector database with (the --quiet argument supresses the voluminous output):

 agtool llm index --quiet localhost:10035/usc usc-vec.def

Querying a vector database

Finally, after building the usc-vec vector database, we can test it with a query. You can run a query on the vector database from any repository, including the vector database itself, but generally we favor running it in the "knowledge" repository containing the split text, in this case usc.

Here is a sample query. Copy and paste it into the usc repo query page, or use agtool query as we describe below.

PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/>  
PREFIX franzOption_openaiApiKey: <franz:api key goes here>  
SELECT  ?response {  
  bind("Can a state have more than two senators, and if not, why?"  
       as ?query)  
  (?response ?score ?citation ?content) llm:askMyDocuments (?query "localhost:10035/usc-vec" 10 0.8).  
} LIMIT 1

To use agtool, Place the SPARQL query above in a file query.rq. Then run the query in agtool.

agtool query http://localhost:10035/repositories/usc query.rq --output-format simple-csv

And observe the response

"No, a state cannot have more than two senators.  
This is outlined in the 17th Amendment of the U.S. Constitution,  
which states that the Senate of the United States shall be composed  
of two Senators from each State."

Updating databases when documents are updated

The US constitution is rarely updated but many documents are revised from time to time. Suppose you have a text document mydoc-1.txt and you split it into a repository and create an associated vector database, as we have described in this document.

Then a revised version, mydoc-2.txt comes out. You can redo the splitting entirely (specify supersede in the If Vector Database exists field of the New WebView dialog) and must do so if there are deletions as well as additions. But if there are only additions, you can just update the existing repo by specifying if-exists open in the .def file or specify open instead of supersede in the If Vector Database exists field of the New WebView dialog. The embedder will skip embedding any strings already in the vector database. Skipping existing embeddings is much faster than redoing everything when processing additions. Only new text will be used to create new triples in the vector database.

AllegroGraph 8.2.0 LLM Split Specification