Software

The ART project produced SAPIENT, a tool for manual, sentence-based semantic annotation of papers. SAPIENT incorporates SSSplit, an XML-aware sentence splitter that was also created within the ART project. In the SAPIENT Automation (SAPIENTA) project we have released a new version of SAPIENT, called SAPIENTA, which automatically annotates Core Scientific Concepts (CoreSCs) at the sentence level and also permits multi-label manual annotation. SAPIENTA includes an improved version of SSSplit, which works with the PubMed Central DTD and with papers in SciXML, and can also be applied to plain text and other XML schemas. SSSplit can also be run from the command line to obtain sentence boundaries for a batch of papers in XML.

You can download the latest versions of SAPIENTA and SSSplit for non-commercial use below.

SAPIENT stands for “Semantic Annotation of Papers: Interface & ENrichment Tool”. It is an annotation interface, implemented as a web application, that helps users annotate scientific papers in XML, sentence by sentence, with a set of concepts called Core Scientific Concepts (CoreSCs; see the paper “Guidelines for the Annotation of General Scientific Concepts”, where they are called GSCs, a name since rebranded as CoreSCs). CoreSCs constitute the set of concepts essential for describing a scientific investigation. However, SAPIENT can also be used with other annotation schemes to annotate papers in XML sentence by sentence. SAPIENT also incorporates Oscar3 functionality, allowing the automatic annotation of chemical named entities.

SAPIENTA stands for “Semantic Annotation of Papers: Interface & ENrichment Tool Automated”. It incorporates a machine learning classifier, trained using Conditional Random Fields (CRF), for identifying CoreSCs. The classifier has been evaluated on 265 chemistry and biochemistry papers, yielding more than 50% average accuracy across the 11 Core Scientific Concepts. The automatically generated concepts have been used to produce automatic summaries, which were evaluated by chemistry experts in a question answering task, yielding a precision of 75% and a recall of 66%. SAPIENTA also allows multi-label annotation at the sentence level and has been used by three biology experts to annotate 50 biology papers from PubMed Central that are relevant to Cancer Risk Assessment (CRA).

SAPIENT Sentence Splitter (SSSplit) is an XML-aware sentence splitter which preserves XML markup and identifies sentences through the addition of inline markup. We developed our own sentence splitter because widely available sentence splitters could not handle XML properly. The XML markup carries useful information about document structure and formatting in the form of inline tags, and this information is important for determining the logical structure of the paper.

SSSplit is written in the platform-independent Java language (version 1.6), and is based on and extends open source Perl code for handling plain text. To make our sentence splitter XML-aware, we translated the Perl regular expression rules into Java and modified them to make them compatible with the SciXML and PubMed journal schemas.
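To illustrate what XML-aware splitting involves, here is a minimal sketch in Java. It is not SSSplit's code and does not reproduce its rules: the class name, the single boundary rule, and the <s sid="..."> wrapper element are assumptions made for this example only. The sketch treats tags as opaque tokens, never splits inside an open inline element, and marks each sentence with inline markup. It assumes well-formed input with no comments or processing instructions, and it has no abbreviation handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and rules are hypothetical, not SSSplit's.
class XmlAwareSplitSketch {

    // A token is either a complete tag or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

    // Naive boundary rule: terminator, whitespace, then an uppercase letter.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=[.!?])\\s+(?=[A-Z])");

    /** Splits paragraph content into sentences, keeping inline tags intact. */
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of inline elements currently open
        Matcher tokens = TOKEN.matcher(paragraph);
        while (tokens.find()) {
            String token = tokens.group();
            if (token.startsWith("<")) {
                if (token.startsWith("</")) {
                    depth--;
                } else if (!token.endsWith("/>")) {
                    depth++;
                }
                current.append(token);
            } else if (depth > 0) {
                // Text inside an inline element is never split, so the
                // added sentence markup cannot cut an element in two.
                current.append(token);
            } else {
                String[] parts = BOUNDARY.split(token, -1);
                for (int i = 0; i < parts.length; i++) {
                    current.append(parts[i]);
                    if (i < parts.length - 1) {
                        flush(current, sentences);
                    }
                }
            }
        }
        flush(current, sentences);
        return sentences;
    }

    /** Wraps each sentence in a hypothetical <s sid="n"> element. */
    static String annotate(String paragraph) {
        StringBuilder out = new StringBuilder();
        int id = 1;
        for (String sentence : split(paragraph)) {
            out.append("<s sid=\"").append(id++).append("\">")
               .append(sentence).append("</s>");
        }
        return out.toString();
    }

    // Moves the buffered sentence, if any, into the result list.
    private static void flush(StringBuilder current, List<String> sentences) {
        String sentence = current.toString().trim();
        if (!sentence.isEmpty()) {
            sentences.add(sentence);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String paragraph = "We grew <it>E. coli</it> overnight. "
                + "The yield of H<sub>2</sub>O<sub>2</sub> was low.";
        System.out.println(annotate(paragraph));
    }
}
```

In this sketch the full stop in "E. coli" does not trigger a split because it sits inside an open <it> element, which a splitter working on the raw character stream would not know. A real splitter needs many more rules, for example for abbreviations and for sentences that end immediately before a tag.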

For more details about SAPIENT and SSSplit, you can also refer to our BioNLP 2009 paper. Please cite this paper if you find SAPIENT or SSSplit useful:

Liakata M., Q Claire and Soldatova L. N. (2009) Semantic Annotation of Papers: Interface and Enrichment Tool (SAPIENT). Proceedings of BioNLP 2009, Boulder, Colorado, pp 193–200

For SAPIENTA, publication is pending, but you can e-mail liakata-At-ebi-dot-ac-dot-uk to obtain more information about our manuscript “Automatic recognition of conceptualisation zones in scientific articles to aid biological information extraction”.

To download files, click on the appropriate name below.