The BioDB-Loader Toolkit

Introduction

BioDB-Loader is a toolkit for loading bioinformatics databases into the Common Lisp environment, and for querying databases within that environment. BioDB-Loader contains utilities for loading flatfiles from the Swiss-Prot, Prosite, ENZYME, EcoCyc, and MetaCyc databases.

 At the core of the BioDB-Loader is a parameterized parser for bioinformatics databases. The parameterized parser can parse any bioinformatics database that conforms to a certain syntactic style. Specialized parsers for each of the preceding databases were constructed by configuring the parameterized parser appropriately. Similarly, users can create parsers for new bioinformatics DBs relatively easily if those DBs fall into the family accepted by the parameterized parser.

 Once a DB has been parsed, users can use a number of utilities to retrieve entries within a DB.

Terminology

The BioDB-Loader toolkit parses databases that consist of a series of entries. For example, the following text is an entry from the Swiss-Prot DB. Each line of the entry begins with a tag, or attribute. Associated with each attribute are one or more data values, such as the dates for the DT attributes, or the authors on the RA attributes. For example, the first line of this Swiss-Prot entry contains the tag "ID" and the values "26KD_HELPY", "STANDARD", "PRT" and "198 AA".
 
 
  ID   26KD_HELPY     STANDARD;      PRT;   198 AA.
  AC   P21762;
  DT   01-MAY-1991 (Rel. 18, Created)
  DT   01-NOV-1997 (Rel. 35, Last sequence update)
  DT   01-NOV-1997 (Rel. 35, Last annotation update)
  DE   26 KD ANTIGEN.
  GN   HP1563.
  OS   Helicobacter pylori (Campylobacter pylori).
  OC   Bacteria; Proteobacteria; epsilon subdivision; Helicobacter group;
  OC   Helicobacter.
  RN   [1]
  RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
  RC   STRAIN=915;
  RX   MEDLINE; 91100336.
  RA   O'TOOLE P.W., LOGAN S.M., KOSTRZYNSKA M., WADSTROM T., TRUST T.J.;
  RT   "Isolation and biochemical and molecular analyses of a
  RT   species-specific protein antigen from the gastric pathogen
  RT   Helicobacter pylori.";
  RL   J. Bacteriol. 173:505-513(1991).
  RN   [2]
  RP   SEQUENCE FROM N.A.
  RC   STRAIN=26695 / ATCC 700392;
  RX   MEDLINE; 97394467.
  RA   TOMB J.-F., WHITE O., KERLAVAGE A.R., CLAYTON R.A., SUTTON G.G.,
  RA   FLEISCHMANN R.D., KETCHUM K.A., KLENK H.-P., GILL S., DOUGHERTY B.A.,
  RA   NELSON K., QUACKENBUSH J., ZHOU L., KIRKNESS E.F., PETERSON S.,
  RA   LOFTUS B., RICHARDSON D., DODSON R., KHALAK H.G., GLODEK A.,
  RA   MCKENNEY K., FITZGERALD L.M., LEE N., ADAMS M.D., HICKEY E.K.,
  RA   BERG D.E., GOCAYNE J.D., UTTERBACK T.R., PETERSON J.D., KELLEY J.M.,
  RA   COTTON M.D., WEIDMAN J.M., FUJII C., BOWMAN C., WATTHEY L., WALLIN E.,
  RA   HAYES W.S., BORODOVSKY M., KARP P.D., SMITH H.O., FRASER C.M.,
  RA   VENTER J.C.;
  RT   "The complete genome sequence of the gastric pathogen Helicobacter
  RT   pylori.";
  RL   Nature 388:539-547(1997).
  CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
  CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
  CC   -!- SIMILARITY: BELONGS TO THE AHPC/TSA FAMILY.
  CC   --------------------------------------------------------------------------
  CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
  CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
  CC   the European Bioinformatics Institute.  There are no  restrictions on  its
  CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
  CC   modified and this statement is not removed.  Usage  by  and for commercial
  CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
  CC   or send an email to license@isb-sib.ch).
  CC   --------------------------------------------------------------------------
  DR   EMBL; M55507; AAA18984.1; -.
  DR   EMBL; AE000654; AAD08603.1; -.
  DR   PIR; A33168; A33168.
  DR   TIGR; HP1563; -.
  DR   PFAM; PF00578; AhpC-TSA; 1.
  KW   Antioxidant; Antigen.
  FT   CONFLICT     36     36       A -> V (IN REF. 1).
  FT   CONFLICT     64     64       Q -> H (IN REF. 1).
  FT   CONFLICT     98     98       T -> S (IN REF. 1).
  SQ   SEQUENCE   198 AA;  22235 MW;  104F3BCA CRC32;
       MLVTKLAPDF KAPAVLGNNE VDEHFELSKN LGKNGAILFF WPKDFTFVCP TEIIAFDKRV
       KDFQEKGFNV IGVSIDSEQV HFAWKNTPVE KGGIGQVTFP MVADITKSIS RDYDVLFEEA
       IALRGAFLID KNMKVRHAVI NDLPLGRNAD EMLRMVDALL HFEEHGEVCP AGWRKGDKGM
       KATHQGVAEY LKENSIKL
  //
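
To make the terminology concrete, the sketch below shows how the ID and AC lines of the entry above might look once parsed. It assumes the alist representation described later in this document (one element per tag, with the tag as a symbol and each value as a separate string). The variable name *sample-entry* is ours, the fragment is built by hand rather than by the toolkit, and only standard Common Lisp is used.

```lisp
;; Illustrative only: a hand-built fragment of the entry above,
;; not output produced by the BioDB-Loader parser.
(defparameter *sample-entry*
  '((ID "26KD_HELPY" "STANDARD" "PRT" "198 AA")
    (AC "P21762")))

;; All values of a tag are the rest of its alist element.
(cdr (assoc 'ID *sample-entry*))

;; The first value of a tag.
(second (assoc 'AC *sample-entry*))
```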

Example Session

Let us consider an example of using the BioDB-Loader to parse and process the databases ENZYME and Prosite.

The following calls make the BioDB package the current Lisp package, then parse the ENZYME database from a file that is included with the BioDB-Loader distribution and load it into internal data structures of the BioDB toolkit.


 
 

  (in-package :biodb)
  (setq endb (parse-enzyme "biodb-loader/enzyme.dat"))
The call to parse-enzyme returns a Lisp biodb defstruct that serves as a handle to the parsed DB. The user might choose to have more than one DB loaded into Lisp simultaneously. For example, this will load the Prosite database that is included with the BioDB-Loader distribution.


 

    (setq psdb (parse-prosite "biodb-loader/prosite.dat"))

To make Enzyme the current DB, meaning that subsequent BioDB calls refer to Enzyme, we would execute the following.
 

  (goto-db endb)
A DB processed by the BioDB toolkit must consist of a set of entries, and each entry consists of a set of records. Usually one of the records within an entry specifies a unique identifier for that entry, which we call its primary key. In the case of Enzyme, the ID line contains the primary key. The following call returns the Enzyme entry whose unique ID is 1.1.1.3.
 
 
  (setq e (get-entry "1.1.1.3"))
Note that if Enzyme were not the current DB for the BioDB toolkit, we could still access it by specifying a db argument to get-entry:
 
 
  (setq e (get-entry "1.1.1.3" :db endb))
The variable E now contains a pointer to the entry we requested from Enzyme. We can obtain the value of a record within the entry as follows.
 
 
  (get-values e 'de)
  (get-values e 'CA)
Note that each attribute is specified as a Lisp symbol, and that the Lisp reader normally converts all symbols to uppercase on input, so the case specified for the symbol is irrelevant. Get-Values returns all of the values of a given record. If only the first value is desired, then Get-Value should be used:
 
 
  (get-value e 'DR)
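
The remark above about symbol case can be checked directly in any Lisp that uses the default, upcasing reader. This is a minimal sketch in standard Common Lisp; it does not apply to a case-sensitive Lisp image.

```lisp
;; With the default readtable case (:upcase), both spellings are read
;; as the same symbol, so either can be passed as an attribute.
(eq 'de 'DE)

;; The symbol's name is stored in uppercase.
(symbol-name 'de)
```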
In cases where the user might want to alter records from a DB, we have supplied update functions for adding a new value to a record, for replacing a specific value of a record, and for replacing all of the values of a record.
 
 
  (add-value e 'DR "P00561, AK1H_ECOLI")
  (get-values e 'DR)

  (replace-value e 'DR "P01579, AK1H_ECOLI" "P00561, AK1H_ECOLI")
  (get-values e 'DR)

  (put-values e 'DE '("Acetoin dehydrogenase"))
  (get-values e 'DE)
By default, the parser functions supplied by the BioDB toolkit allow fast indexed retrieval of entries based on the primary key only. However, the user may specify that additional attributes should be indexed as follows.
 
 
  (setq endb (parse-enzyme "~/databases/enzyme.dat" :index-attributes '(AN DE)))
  (goto-db endb)
The loader has built a hash table that allows fast retrieval of entries using the AN or the DE records. The following call returns the one or more entries whose AN record has the value "Lactic acid dehydrogenase". Note that because, in general, more than one entry might match the specified value, Get-Entries-By-Indexed-Attribute always returns a list of entries.
 
 
  (get-entries-by-indexed-attribute 'AN "Lactic acid dehydrogenase")

  (setq elist (get-entries-by-indexed-attribute 'DE "Deleted entry"))
We could print the primary keys of all of the matched entries as follows.
 
  (print-ids elist)
It is also possible to write queries that do not rely solely on indices created by the BioDB toolkit. Such queries rely on iterating across either all entries within a DB, or across entries returned by an index-based query. For example, the following query returns all Enzyme entries that have the substring "dehydrogenase" in the DE record, and relies on function Get-All-Entries, which returns a list of all entries in the BioDB.
 
  (setq elist 
    (loop for e in (get-all-entries)
      when (search "dehydrogenase" (get-value e 'DE))
      collect e) )
Again, we could print all of those entries using Print-IDs.
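
One detail of the substring query above: the standard function search compares characters case-sensitively by default. If the DE values in a given DB mix upper and lower case, a case-insensitive test can be supplied. The string below is made up for illustration and is not taken from the Enzyme DB.

```lisp
;; SEARCH uses EQL by default, so "d" does not match "D".
(search "dehydrogenase" "Alcohol Dehydrogenase")                     ; => NIL

;; With CHAR-EQUAL the comparison ignores case; the match starts at index 8.
(search "dehydrogenase" "Alcohol Dehydrogenase" :test #'char-equal)  ; => 8
```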

 A query can have multiple selection criteria. For example, the following query returns all Prosite pattern entries that were created in April 1990.
 

  (setq elist 
    (loop for e in (get-all-entries)
      when (and (string-equal "APR-1990 (CREATED)" (get-value e 'DT))
                (find "PATTERN" (get-values e 'ID) :test #'string-equal))
      collect e) )
Faster performance can be obtained for the preceding query by looping through a set of entries selected through an indexed query. This assumes that DT was included in index-attributes when the DB was parsed.
 
  (setq elist 
    (loop for e in (get-entries-by-indexed-attribute 'DT "APR-1990 (CREATED)")
      when (find "PATTERN" (get-values e 'ID) :test #'string-equal)
      collect e) )

Dictionary of BioDB-Loader Operations

Background Information

All symbols in the BioDB-Loader toolkit are in the Common Lisp package called biodb.

 Many of the operations in the toolkit take an argument called db. In all cases, this argument must be a defstruct of type BioDB as returned by a call to parse-flatfile-db or to a more specialized parsing function such as parse-swissprot.

 Many of the operations in the toolkit take an argument called entry. This argument encodes a single database entry, and must be a Lisp alist, as returned by a BioDB-Loader function such as get-all-entries or get-entry.
 
 

Parsing Operations

Function: parse-ecocyc

Arguments:      (filename &key index-attributes)

Parse-ecocyc parses a file containing a segment of data from the EcoCyc database or the MetaCyc database, and loads it into Common Lisp memory. The function returns a BioDB defstruct that serves as a handle to the parsed file. The database is indexed on the ID attribute (tag "UNIQUE-ID"). Secondary indices are built for any attributes included in index-attributes.
 
 

Argument          Description
filename          The name of the file containing the EcoCyc or MetaCyc data.
index-attributes  A list of symbols representing tags that are used to construct secondary indices on the loaded data. An index is built for each tag included in the list. The default value is NIL.

Function: parse-enzyme

Arguments:      (filename &key index-attributes)

Parse-enzyme parses a file containing the ENZYME database, and loads it into Common Lisp memory. The function returns a BioDB defstruct that serves as a handle to the parsed database. The database is indexed on the ID attribute (tag "ID"). Secondary indices are built for any attributes included in index-attributes.
 
 

Argument          Description
filename          The name of the file containing the ENZYME database.
index-attributes  A list of symbols representing tags that are used to construct secondary indices on the loaded data. An index is built for each tag included in the list. The default value is NIL.

Function: parse-flatfile-db

Arguments:      (filename name entry-separator separator-alist continuator-alist removal-alist spacers &key comment-tag primary-key index-attributes)

Parse-flatfile-db parses a flatfile database into a Lisp structure that includes the database contents, coded as a list of alists (one alist per database entry), as well as one or more indices for the data, implemented as hash tables. A flatfile DB consists of an ASCII file in which data is stored in lines of text of the form

<tag1><spacers><data1>
<tag2><spacers><data2>
    ...
<tagN><spacers><dataN>
<entry-separator>

where the tag is a combination of one or more characters, the spacers are usually one or more spaces or tab characters, and the data can contain different data elements separated by defined characters called separators.  Data for a given tag can occupy more than one line, in which case lines following the first either start with the same tag or have no tag at all (they start with spacers). For the purposes of this description, a group of lines so associated to the same tag is called a tag group.
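
The line format above can be illustrated with a small sketch that splits one flatfile line into its tag and its data. This is not the toolkit's parser: the function name sketch-split-line is hypothetical, spacers are assumed to be given as a list of characters, and an untagged continuation line is reported with a tag of NIL.

```lisp
(defun sketch-split-line (line &optional (spacers '(#\Space #\Tab)))
  "Return two values: the tag of LINE (NIL for an untagged continuation
line) and the data that follows the spacers."
  (flet ((spacer-p (ch) (member ch spacers)))
    (let* ((tag-end (or (position-if #'spacer-p line) (length line)))
           (data-start (or (position-if-not #'spacer-p line :start tag-end)
                           (length line))))
      (values (if (zerop tag-end) nil (subseq line 0 tag-end))
              (subseq line data-start)))))

;; A tagged line yields its tag and data.
(sketch-split-line "DE   26 KD ANTIGEN.")   ; => "DE", "26 KD ANTIGEN."

;; A line that starts with spacers belongs to the preceding tag group.
(sketch-split-line "     MLVTKLAPDF")       ; => NIL, "MLVTKLAPDF"
```

A real parser would also need to handle the string form of spacers (such as " - ") and the entry separator; both are left out here to keep the sketch short.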
 
 
 
 

Argument
Description
filename A string containing the complete path of the flatfile DB to be loaded.
name A string with the name of the database to be loaded.
entry-separator A string used to separate different entries within the DB flatfile (example: "//").
separator-alist Association list that carries the characters used as data separators. Data associated with different tags may use a different set of separator characters. Each element of the alist is of the form

(<tag> <list of characters>)

Example: (DI (#\; #\.)) means that within a line (or lines) preceded by the DI tag, both the semicolon and the period are used to separate different data elements.

If all tags use the same separator, then separator-alist will have only one element of the form

(default <list of characters>)

so there is no need to create an alist with all tags having the same separators.

continuator-alist Association list that carries the character, called the continuator, used to connect data that flows across several lines in a tag group. Each line in the flatfile ends with a #\Newline character. During parsing, the continuator-alist is checked for each tag and, when a tag runs over more than one line (i.e., a tag group is found), the #\Newline characters are either deleted (when no continuator is defined) or replaced by the tag's continuator character as it appears in continuator-alist. Each element of the alist is of the form

(<tag> character)

Examples:

(RT #\Space)   -> Connect continuing RT lines with spaces.
(CC #\Newline) -> Connect continuing CC lines with newline characters (equivalent to "keep the newlines at the end of continuing CC lines").

removal-alist Association list that carries the characters to be eliminated from the parsed data corresponding to given tags. In some cases it is desirable to eliminate some characters from the parsed data, such as quotes, parentheses, or spaces. This alist carries lists of such characters for different tags. As with separator-alist, each element is of the form

(<tag> <list of characters>)

Example: (RT (#\")) means that double quotes have to be eliminated from RT data elements.

spacers List of characters or a string used to separate the tag from the data portion of a flatfile line.

Example: 
(#\space #\tab) means that one or more spaces, tabs, or a combination thereof, can be used to separate the tag from the data portion on any given line.
" - " means that the string " - " is used to separate the tag from the corresponding data.

comment-tag A Lisp symbol that corresponds to the DB tag used for comments (example: 'CC). Default value: NIL
primary-key A symbol. DB data can be indexed during parsing according to the value associated with the tag that primary-key names. The default value is NIL (no indexing).
index-attributes A list of symbols representing tags which are used to construct secondary indices on the loaded data. An index will be built for each tag included in the list. The default value is NIL.
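
The interplay of separator-alist, continuator-alist, and removal-alist can be sketched in standard Common Lisp as follows. The function names are hypothetical and the code is not taken from the toolkit. In particular, trimming spaces around each data element and dropping empty elements are our assumptions about reasonable behavior, not documented features.

```lisp
(defun sketch-lookup (tag alist)
  "Return the setting for TAG in ALIST, falling back to the DEFAULT element."
  (second (or (assoc tag alist) (assoc 'default alist))))

(defun sketch-join-lines (lines continuator)
  "Join the data LINES of one tag group. The newline between two lines is
replaced by CONTINUATOR, or simply dropped when CONTINUATOR is NIL."
  (with-output-to-string (out)
    (loop for (line . more) on lines
          do (write-string line out)
             (when (and more continuator)
               (write-char continuator out)))))

(defun sketch-split-data (data separators removals)
  "Split DATA at any character in SEPARATORS, delete the characters in
REMOVALS, and trim spaces. Empty elements are dropped (an assumption)."
  (let ((pieces '())
        (start 0))
    (flet ((finish (end)
             (let ((piece (string-trim
                           " "
                           (remove-if (lambda (ch) (member ch removals))
                                      (subseq data start end)))))
               (unless (string= piece "")
                 (push piece pieces)))))
      (loop for i from 0 below (length data)
            when (member (char data i) separators)
              do (finish i)
                 (setf start (1+ i)))
      (finish (length data)))
    (nreverse pieces)))

;; The two OC lines of the Swiss-Prot entry shown earlier, joined with a
;; space and split at semicolons and periods.
(sketch-split-data
 (sketch-join-lines
  '("Bacteria; Proteobacteria; epsilon subdivision; Helicobacter group;"
    "Helicobacter.")
  (sketch-lookup 'OC '((OC #\Space))))
 (sketch-lookup 'OC '((default (#\; #\.))))
 '())
```

The call at the end evaluates to a list of five strings, one per taxonomic level. Passing (#\") as the third argument would additionally strip double quotes, as in the RT example above.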

The output of  parse-flatfile-db  is a defstruct of type biodb whose fields are as follows.
 
 

name Database name.
source-filename Complete path of the original flatfile DB.
db-header In some cases, the first entry in a database consists completely of a long comment describing the database, its version and other information. parse-flatfile-db can detect such "headers" and save them in this field as a single string with newline characters separating the different lines of text.
entry-list This is a list of association lists, one for each database entry. For a given entry, an element of its corresponding association list will be of the form

(<tag> "data string" ["data string"*])

The tag is encoded as a symbol. Each data element for that tag is encoded as a separate string.

primary-key Symbol corresponding to the tag on which the primary index is based.
primary-index Hash table used as primary index.
attribute-indices Association list carrying a hash table for each tag provided by the user in index-attributes (see above). Each element of the alist is of the form

(<tag> <corresponding hash table>)
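
The relationship between entry-list and the hash-table indices can be sketched as follows. The function sketch-build-index is hypothetical and is not the toolkit's implementation; the entries and their IDs are made up; and the use of an EQUAL hash table (so that string keys match case-sensitively) is an assumption.

```lisp
(defun sketch-build-index (entries tag)
  "Return a hash table mapping each value of TAG to the list of ENTRIES
that carry that value."
  (let ((index (make-hash-table :test #'equal)))
    (dolist (entry entries index)
      (dolist (value (cdr (assoc tag entry)))
        (push entry (gethash value index))))))

;; Made-up entries in the (<tag> "data string" ...) form described above.
(defparameter *sketch-entries*
  '(((ID "X1") (DE "Deleted entry"))
    ((ID "X2") (DE "Some enzyme"))
    ((ID "X3") (DE "Deleted entry"))))

(defparameter *sketch-de-index*
  (sketch-build-index *sketch-entries* 'DE))

;; A lookup is a single hash probe rather than a scan of every entry.
;; Two of the three made-up entries share this DE value.
(length (gethash "Deleted entry" *sketch-de-index*))   ; => 2
```

Because several entries can share a value, each hash-table slot holds a list, which is consistent with get-entries-by-indexed-attribute always returning a list.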

Function: parse-prosite

Arguments:      (filename &key index-attributes)

Parse-prosite parses a file containing the Prosite database, and loads it into Common Lisp memory. The function returns a BioDB defstruct that serves as a handle to the parsed database. The database is indexed on the ID attribute (tag "ID"). Secondary indices are built for any attributes included in index-attributes.
 
 

Argument          Description
filename          The name of the file containing the Prosite database.
index-attributes  A list of symbols representing tags that are used to construct secondary indices on the loaded data. An index is built for each tag included in the list. The default value is NIL.

Function: parse-swissprot

Arguments:      (filename &key index-attributes)

Parse-swissprot parses a file containing the Swiss-Prot database, and loads it into Common Lisp memory. The function returns a BioDB defstruct that serves as a handle to the parsed database. The database is indexed on the ID attribute (tag "ID"). Secondary indices are built for any attributes included in index-attributes.
 
 

Argument          Description
filename          The name of the file containing the Swiss-Prot database.
index-attributes  A list of symbols representing tags that are used to construct secondary indices on the loaded data. An index is built for each tag included in the list. The default value is NIL.


Searching Operations

Function: get-entries-by-indexed-attribute

Arguments:      (attribute value &key (db *current-db*))

Get-entries-by-indexed-attribute  returns a list of the one or more entries of db such that a value of attribute of that entry is value. For this function to operate properly, attribute must have been specified as one of the attributes to index when db was originally loaded, using the parameter index-attributes.
 
 

Argument   Description
attribute  A symbol containing the name of the attribute to query.
value      The value (usually a string) to match.
db         The database to query.

Function: get-entry

Arguments:      (value &key (db *current-db*))

Get-entry returns the entry in db whose primary key has value as its value. The tag that is the primary key is specified in the call used to parse the db. For example, in our Swiss-Prot example earlier in this document, the tag "ID" is the primary key for each Swiss-Prot entry.
 
 

Argument  Description
value     A value (usually a string) of the primary-key tag that is sought.
db        The database to be searched.


Operations on Entries

Function: add-value

Arguments:      (entry attribute value &key (db *current-db*))

Add-value  adds value as an additional value of attribute of entry.
 
 

Argument   Description
entry      An alist that is a DB entry.
attribute  A symbol naming an attribute within entry.
value      The new value to be added to attribute of entry.
db         A BioDB defstruct representing a database.

Function: get-all-entries

Arguments:      (&optional (db *current-db*))

Get-all-entries  returns a list of all of the entries in db. Each element of the list is an alist that can serve as the entry argument to other BioDB-Loader functions such as get-value.
 
 

Argument  Description
db        A BioDB defstruct representing a database.

Function: get-value

Arguments:      (entry attribute &key (db *current-db*))

Get-value returns the first of the values of attribute of entry in db.
 
 

Argument   Description
entry      An alist representing an entry in db.
attribute  A symbol naming an attribute in entry.
db         A BioDB defstruct representing a database.

Function: get-values

Arguments:      (entry attribute &key (db *current-db*))

Get-values  returns a list of the values of attribute of entry in db. This function always returns a list, even if the attribute has a single value.
 
 

Argument   Description
entry      An alist representing an entry in db.
attribute  A symbol naming an attribute in entry.
db         A BioDB defstruct representing a database.

Function: put-values

Arguments:      (entry attribute values &key (db *current-db*))

Put-values  replaces all existing values of attribute of entry with values (which must be a list).
 
 

Argument   Description
entry      An alist describing a DB entry.
attribute  A symbol naming an attribute in entry.
values     A list of the new values for attribute.
db         A BioDB defstruct describing a database.


Other Operations

Function: goto-db

Arguments:      (db)

Goto-db sets the current-db for the BioDB-Loader toolkit to be db, allowing the db argument to be omitted from many future calls to BioDB-Loader operations.
 
 

Argument  Description
db        The biodb defstruct object that should become the current db.
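
The way a current DB lets the db argument be omitted can be sketched with a special variable that serves as the default of a keyword argument. The names below are stand-ins chosen for illustration and are not the toolkit's own definitions; a plain list plays the role of a DB.

```lisp
(defvar *sketch-current-db* nil
  "Stand-in for the toolkit's current DB.")

(defun sketch-goto-db (db)
  "Make DB the current DB for later calls."
  (setf *sketch-current-db* db))

(defun sketch-count-entries (&key (db *sketch-current-db*))
  "Count the entries of DB. The default is evaluated at call time, so it
always reflects the most recent call to SKETCH-GOTO-DB."
  (length db))

(sketch-goto-db '(entry-1 entry-2))
(sketch-count-entries)                ; uses the current DB => 2
(sketch-count-entries :db '(only-1))  ; an explicit DB overrides it => 1
```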