File Formats
Kindred can load several file formats that contain text together with its annotations. Below are examples of each format, with code for loading them.
JSON format
This format, used by PubAnnotation and PubTator, stores the text and annotation data together in a single file. Furthermore, multiple documents can be stored in a single file.
The format is standard JSON and is either a dictionary (for a single document) or a list of dictionaries (for multiple documents). Each dictionary needs three fields: text, denotations, and relations. The text field holds the text of the document. The denotations are the entity annotations; each provides a unique identifier (id), an entity type (obj), and a location in the text (span, with begin and end character offsets). The relations are the relation annotations; each refers to the identifiers of the entities it connects.
Example file: example.json
{
  "text": "The colorectal cancer was caused by mutations in APC",
  "denotations":
    [{"id":"T1", "obj":"disease",
      "span":{"begin":4,"end":21}},
     {"id":"T2", "obj":"gene",
      "span":{"begin":49,"end":52}}],
  "relations":
    [{"id":"R1","pred":"causes",
      "subj":"T2", "obj":"T1"}]
}
To load a whole corpus made up of multiple files in this format, use the following code, assuming that the files are in the example directory. This will create a kindred.Corpus object.
import kindred
corpus = kindred.load('json','example')
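Before loading a file, it can help to confirm that it has the three required fields and that each span really covers the intended text. The following is a minimal sketch using only the Python standard library; it is not part of Kindred, and the helper name check_document is hypothetical. It assumes, as the example above suggests, that begin is inclusive and end is exclusive.

```python
import json

# The example document from above, inlined so the sketch is self-contained.
raw = """
{
  "text": "The colorectal cancer was caused by mutations in APC",
  "denotations": [
    {"id": "T1", "obj": "disease", "span": {"begin": 4, "end": 21}},
    {"id": "T2", "obj": "gene", "span": {"begin": 49, "end": 52}}
  ],
  "relations": [
    {"id": "R1", "pred": "causes", "subj": "T2", "obj": "T1"}
  ]
}
"""

def check_document(doc):
    """Hypothetical helper: return {entity id: covered text} or raise ValueError."""
    for field in ("text", "denotations", "relations"):
        if field not in doc:
            raise ValueError("missing field: " + field)
    text = doc["text"]
    covered = {}
    for denotation in doc["denotations"]:
        begin = denotation["span"]["begin"]
        end = denotation["span"]["end"]
        # Assumption: begin is inclusive and end is exclusive.
        if not 0 <= begin < end <= len(text):
            raise ValueError("span out of range for " + denotation["id"])
        covered[denotation["id"]] = text[begin:end]
    for relation in doc["relations"]:
        # Every relation argument must point at a known entity identifier.
        for role in ("subj", "obj"):
            if relation[role] not in covered:
                raise ValueError("unknown entity in " + relation["id"])
    return covered

data = json.loads(raw)
# A file holds either one dictionary or a list of dictionaries.
documents = data if isinstance(data, list) else [data]
entity_text = [check_document(doc) for doc in documents]
print(entity_text)
```

Running the check on each file before calling kindred.load makes offset mistakes visible early, since a wrong span shows up as unexpected covered text.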
BioC XML format
The BioC XML format contains text and annotations together in a single file. Furthermore, it is designed to store more than one document. It stores each document as a “document” element within a larger “collection”. Each document contains passages (e.g. sections of a paper), which in turn contain the text, entity annotations, and relations. On loading, each passage is turned into a single kindred.Document. An example of the format is outlined below.
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM 'BioC.dtd'>
<collection>
  <source></source>
  <date></date>
  <key></key>
  <document>
    <id></id>
    <passage>
      <offset>0</offset>
      <text>The colorectal cancer was caused by mutations in APC</text>
      <annotation id="T1">
        <infon key="type">disease</infon>
        <location offset="4" length="17"/>
        <text>colorectal cancer</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">gene</infon>
        <location offset="49" length="3"/>
        <text>APC</text>
      </annotation>
      <relation id="R1">
        <infon key="type">causes</infon>
        <node refid="T2" role="subj"/>
        <node refid="T1" role="obj"/>
      </relation>
    </passage>
  </document>
</collection>
To load a whole directory of BioC XML files, use the code below. This will create a single kindred.Corpus object, with each passage found in any XML file in the directory turned into a kindred.Document.
corpus = kindred.load('bioc','example')
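Unlike the JSON format, BioC gives each entity location as an offset and a length rather than as begin and end positions. The sketch below uses the standard library's xml.etree.ElementTree (not Kindred's own loader) to read one passage and recover the text each annotation covers. It assumes that annotation offsets are counted from the start of the document, so the passage offset is subtracted first; the function name passage_entities is hypothetical.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example collection, inlined for a self-contained sketch.
bioc = """<collection>
  <document>
    <passage>
      <offset>0</offset>
      <text>The colorectal cancer was caused by mutations in APC</text>
      <annotation id="T1">
        <infon key="type">disease</infon>
        <location offset="4" length="17"/>
        <text>colorectal cancer</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">gene</infon>
        <location offset="49" length="3"/>
        <text>APC</text>
      </annotation>
    </passage>
  </document>
</collection>"""

def passage_entities(passage):
    """Hypothetical helper: list (id, type, covered text) for one passage."""
    passage_offset = int(passage.findtext("offset"))
    text = passage.findtext("text")
    entities = []
    for annotation in passage.findall("annotation"):
        location = annotation.find("location")
        # Assumption: offsets are relative to the document, not the passage.
        start = int(location.get("offset")) - passage_offset
        length = int(location.get("length"))
        entity_type = annotation.find("infon[@key='type']").text
        entities.append((annotation.get("id"), entity_type, text[start:start + length]))
    return entities

root = ET.fromstring(bioc)
entities = []
for passage in root.iter("passage"):
    entities.extend(passage_entities(passage))
print(entities)
```

Comparing the recovered text with the text element inside each annotation is a quick way to spot offsets that have drifted.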
Simple Tag format
This format is not designed for production use; it is intended for illustration and testing. It is Kindred-specific. It is an XML-based format that keeps all annotations inline, which makes it easier to see which entities are annotated. A relation tag provides a relation annotation and must have a type attribute. All of its other attributes are assumed to be relation arguments. Any non-relation tag is assumed to be an entity annotation and must wrap around text. It must also have an id attribute.
Example file: example.simple
The <disease id="T1">colorectal cancer</disease> was caused by mutations in <gene id="T2">APC</gene>
<relation type="causes" subj="T2" obj="T1" />
It is most useful for quickly creating examples for testing. For example, the code below creates a kindred.Corpus with a single document containing the associated text and annotations.
text = '<drug id="1">Erlotinib</drug> is a common treatment for <cancer id="2">NSCLC</cancer>. <drug id="3">Aspirin</drug> is the main cause of <disease id="4">boneitis</disease>. <relation type="treats" subj="1" obj="2" />'
corpus = kindred.Corpus(text,loadFromSimpleTag=True)
If you do need to load a directory of these files (with the suffix .simple), the following command will load them into a kindred.Corpus object.
corpus = kindred.load('simpletag','example')
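To make the tag rules concrete, the sketch below shows one way inline tags could be turned into plain text, entity spans, and relations using only the standard library. It is an illustration of the format, not Kindred's parser: the function parse_simple_tag and the wrapping doc element are hypothetical, and it assumes the markup is well-formed XML once wrapped.

```python
import xml.etree.ElementTree as ET

def parse_simple_tag(markup):
    """Hypothetical helper: return (plain text, entities, relations)."""
    # Wrap the fragment in a root element so it parses as a single XML document.
    root = ET.fromstring("<doc>" + markup + "</doc>")
    text = root.text or ""
    entities = []
    relations = []
    for child in root:
        if child.tag == "relation":
            # The type attribute names the relation; all others are arguments.
            arguments = {k: v for k, v in child.attrib.items() if k != "type"}
            relations.append({"type": child.attrib["type"], "args": arguments})
        else:
            # Any other tag is an entity wrapping its text and carrying an id.
            start = len(text)
            text += child.text or ""
            entities.append({"id": child.attrib["id"], "type": child.tag,
                             "start": start, "end": len(text)})
        # Text following the tag belongs to the document itself.
        text += child.tail or ""
    return text, entities, relations

markup = ('The <disease id="T1">colorectal cancer</disease> was caused by '
          'mutations in <gene id="T2">APC</gene> '
          '<relation type="causes" subj="T2" obj="T1" />')
text, entities, relations = parse_simple_tag(markup)
for entity in entities:
    print(entity["id"], entity["type"], text[entity["start"]:entity["end"]])
print(relations)
```

The entity spans computed here match the offsets used in the JSON and BioC examples, which shows that all three formats describe the same annotations.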
Streaming
Some corpora are too large to load into memory in a single go. Kindred supports streaming a corpus in the BioC format in chunks. The code below uses an iterator to load smaller kindred.Corpus objects, each containing a subset of the documents.
for corpus in kindred.iterLoad('example.bioc.xml',corpusSizeCutoff=3):
    pass  # process each smaller corpus here
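The general idea behind chunked loading can be sketched with a plain generator. This is not how Kindred implements iterLoad; it only illustrates the pattern of yielding fixed-size groups so that one group is in memory at a time. The function iter_chunks is hypothetical, and it assumes the cutoff counts documents.

```python
from itertools import islice

def iter_chunks(documents, size_cutoff):
    """Hypothetical helper: yield lists of at most size_cutoff documents."""
    iterator = iter(documents)
    while True:
        # Pull the next group lazily; earlier groups can be garbage collected.
        chunk = list(islice(iterator, size_cutoff))
        if not chunk:
            return
        yield chunk

# Seven placeholder document identifiers, produced lazily.
document_ids = ("doc%d" % i for i in range(7))
chunk_sizes = [len(chunk) for chunk in iter_chunks(document_ids, 3)]
print(chunk_sizes)
```

Because the input is consumed lazily, peak memory depends on the chunk size rather than on the size of the whole corpus.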