Ontologies provide the basis for identifying concepts in text mining technologies. Subsequent extraction of facts and relationships between these concepts enables data mining and provides the foundation for novel “in silico” knowledge discovery methods. OntoChem uses ontologies to extract implicit, previously unknown and useful information from databases and document collections such as patents or scientific literature.
Ontology (from the Greek ὤν, ὄντος "being; that which is", present participle of the verb εἰμί "be", and -λογία, -logia: science, study, theory) is the philosophical study of the nature of being, existence, or reality as such, as well as the basic categories of being and their relations. In computer science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts, enabling semantic data integration, data mining and knowledge generation. Ontologies are explicit specifications of a topic, including a vocabulary of terms and concepts with defined logical relationships to each other (source: http://en.wikipedia.org/wiki/Ontology_(information_science)).
OntoChem develops novel tools and algorithms to build, update, validate and merge general, chemical and biological ontologies. OntoChem's approach allows for stable concept IDs and a modular approach to quickly assemble and validate derived meta-ontologies. A further unique selling point is the scalability of OntoChem's patented methods, which enables ontologies to contain up to a billion terms for annotation.
Use of ontologies
Our data and knowledge extraction technology OCMiner® uses ontologies for a variety of information retrieval tasks:
- Classification of entities, for example assigning specific compounds to a compound class, relating physiological symptoms to a disease, or defining specific relationship types using a custom-developed regular expression syntax language
- Ontology-aware search engines such as our demo server www.ocminer.com allow searching for concepts: for example, the search term “plants” returns documents mentioning specific plants such as “salix” or “Filipendula ulmaria”
- Finding specific relationships between domains, e.g. which compounds have been isolated from plants – information that was previously only available from manually curated databases is now generated on the fly
- Similarity and ranking of documents based on the ontology concepts they contain. This gives more relevant results than conventional technologies using word frequencies or keywords.
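The concept search described in the list above can be sketched as query expansion over an "is a" hierarchy. This is only an illustration, not OCMiner's implementation: the toy hierarchy, the function names `descendants` and `search`, and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent); a real ontology has many more levels.
IS_A = {
    "Salix alba": "Salix",
    "Salix": "Salicaceae",
    "Salicaceae": "plants",
    "Filipendula ulmaria": "Filipendula",
    "Filipendula": "Rosaceae",
    "Rosaceae": "plants",
}

def descendants(concept, is_a=IS_A):
    """Return every concept that is directly or indirectly 'a kind of' concept."""
    children = defaultdict(set)
    for child, parent in is_a.items():
        children[parent].add(child)
    found, stack = set(), [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def search(concept, documents):
    """Return ids of documents mentioning the concept or any of its descendants."""
    terms = {term.lower() for term in descendants(concept) | {concept}}
    return [doc_id for doc_id, text in documents.items()
            if any(term in text.lower() for term in terms)]

# Made-up sample documents
documents = {
    "d1": "Extracts of Salix alba bark were analysed.",
    "d2": "Filipendula ulmaria flowers contain salicylates.",
    "d3": "A new route for aspirin synthesis.",
}
hits = search("plants", documents)  # ['d1', 'd2']
```

Neither matching document contains the word "plants"; both are found only because the hierarchy links the mentioned species to the searched concept.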
OntoChem develops ontologies in the areas of chemistry, species, diseases, anatomy, cell lines, proteins, pharmacological effects, languages, geopolitical and climate zones, company information for business intelligence and others.
Ontologies together with heuristic and linguistic methods are applied for semantic processing of unstructured information sources like scientific articles, patents and others. Using for example the species, chemistry and geographical ontologies, one may retrieve relationships for the white willow (Salix alba) as follows:
Salix alba <is a> Salix <is a> Salicaceae <is a> … <is a> species
Salix alba <contains> D(-)-Salicin
D(-)-Salicin <is a> aromatic compounds <is a> … <is a> compounds
D(-)-Salicin <is a> 6 membered heterocycles <is a> … <is a> compounds
D(-)-Salicin <is found in> Filipendula ulmaria <is a> Filipendula <is a> … <is a> species
Salix alba <is a> anti-inflammatory agent <is a> … <is a> effect
Salix alba <treats> rheumatic fever <is a> … <is a> diseases
Salix alba <range of distribution> northern africa <is part of> africa <is part of> world <is part of> regions
Salix alba <range of distribution> europe <is part of> world <is part of> regions
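Relationship chains like these can be modelled as subject-predicate-object triples that are followed transitively. The sketch below is a minimal illustration under that assumption; it includes only links stated explicitly in the listing (the steps elided with "…" are left out), and the helper names `objects`, `closure` and `compound_classes_of` are hypothetical.

```python
# Triples copied from the explicit links in the listing; elided steps are omitted.
TRIPLES = [
    ("Salix alba", "is a", "Salix"),
    ("Salix", "is a", "Salicaceae"),
    ("Salix alba", "contains", "D(-)-Salicin"),
    ("D(-)-Salicin", "is a", "aromatic compounds"),
    ("D(-)-Salicin", "is a", "6 membered heterocycles"),
    ("Salix alba", "range of distribution", "northern africa"),
    ("Salix alba", "range of distribution", "europe"),
    ("northern africa", "is part of", "africa"),
    ("africa", "is part of", "world"),
    ("europe", "is part of", "world"),
]

def objects(subject, predicate, triples=TRIPLES):
    """Direct objects of one subject for one relationship type."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def closure(start, predicate, triples=TRIPLES):
    """All concepts reachable from start by following predicate repeatedly."""
    seen, queue = [], [start]
    while queue:
        for nxt in objects(queue.pop(0), predicate, triples):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen

def compound_classes_of(species):
    """Combine two domains: compounds a species contains, then their classes."""
    classes = []
    for compound in objects(species, "contains"):
        for cls in closure(compound, "is a"):
            if cls not in classes:
                classes.append(cls)
    return classes

lineage = closure("Salix alba", "is a")           # ['Salix', 'Salicaceae']
regions = closure("northern africa", "is part of")  # ['africa', 'world']
classes = compound_classes_of("Salix alba")
```

The last call crosses the species and chemistry domains in one query, which is the kind of cross-domain relationship described earlier.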
OntoChem has implemented technologies to build, validate and update dictionaries, controlled vocabularies, taxonomies or ontologies comprising more than 100 million terms from various domains. Examples are our ontologies for general species, plants and fungi, cell lines, general anatomy, plant and fungal anatomy, diseases, pharmacological and physiological effects, cosmetology, proteins, genes, chemistry, languages, geopolitical and climate zones, company information for business intelligence and domain-specific relationship ontologies.
Each ontology concept contains further data, such as relationships to other concepts, links to external sources, language information, its synonyms and related updating information. OntoChem's ontologies can be stored and used in various formats such as OBO, CSV, XML (using specific flavors such as RDF, OWL, CML, SBML or others), SKOS etc.
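As an illustration of such a concept record, the following sketch holds a stable ID, synonyms, parent links and external references, and writes them out as an OBO-style [Term] stanza. The class layout, the `EX:` identifiers and the `EXDB:` cross-reference are hypothetical placeholders, not OntoChem's actual schema or IDs.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    concept_id: str                               # stable identifier
    name: str                                     # preferred term
    synonyms: list = field(default_factory=list)
    parents: list = field(default_factory=list)   # ids of "is a" parents
    xrefs: list = field(default_factory=list)     # links to external sources

def to_obo(concept):
    """Serialize one concept as an OBO flat-file [Term] stanza."""
    lines = ["[Term]", f"id: {concept.concept_id}", f"name: {concept.name}"]
    lines += [f'synonym: "{s}" EXACT []' for s in concept.synonyms]
    lines += [f"xref: {x}" for x in concept.xrefs]
    lines += [f"is_a: {p}" for p in concept.parents]
    return "\n".join(lines)

# Hypothetical ids, for illustration only
willow = Concept(
    concept_id="EX:0000002",
    name="Salix alba",
    synonyms=["white willow"],
    parents=["EX:0000001"],
    xrefs=["EXDB:12345"],
)
stanza = to_obo(willow)
```

Keeping the ID separate from the name is what allows terms and synonyms to change between ontology versions without breaking existing annotations.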
When ontologies are used for text mining, specific modules enhance their value, either by generating an enriched ontology with additional terms in advance or by being applied at annotation time:
- Spelling variations (e.g. British-American English, plural forms)
- Diacritic character, space/hyphen/apostrophe handling
- Ontology dependent conditional black and white lists, case sensitive annotations
- Automated detection of acronyms and abbreviations
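Two of the modules listed above, diacritic and space/hyphen/apostrophe handling and spelling variation, could look roughly as follows. This is a minimal sketch using Python's standard library; the function names, the tiny British/American word list and the naive plural rule are assumptions for illustration and far simpler than a production module.

```python
import re
import unicodedata

def normalize(term, fold_case=True):
    """Strip diacritics and apostrophes, unify spaces/hyphens, optionally fold case."""
    decomposed = unicodedata.normalize("NFKD", term)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    without_apostrophes = re.sub(r"['\u2019]", "", without_marks)
    spaced = re.sub(r"[\s\-]+", " ", without_apostrophes).strip()
    # Case folding is optional because some annotations must stay case sensitive.
    return spaced.casefold() if fold_case else spaced

# Hypothetical, deliberately tiny British -> American spelling table
BRITISH_TO_AMERICAN = {"tumour": "tumor", "anaemia": "anemia", "oestrogen": "estrogen"}

def spelling_variants(term):
    """Generate British/American and naive plural variants of a term."""
    variants = {term}
    for british, american in BRITISH_TO_AMERICAN.items():
        if british in term:
            variants.add(term.replace(british, american))
    variants |= {v + "s" for v in list(variants) if not v.endswith("s")}
    return variants

print(normalize("Sjögren’s  syndrome"))    # sjogrens syndrome
print(sorted(spelling_variants("tumour")))  # ['tumor', 'tumors', 'tumour', 'tumours']
```

Such a function can be applied either once, to enrich the ontology with extra terms, or to both the ontology terms and the text at annotation time, matching the two options described above.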
A unique ontology format has been developed to extract relationships between named entities (NE) in text. Domain-specific relationship ontologies are used together with the related ontologies and a new regular expression syntax language to extract relationships with high precision and recall.
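OntoChem's relationship syntax is not shown here, so the following sketch only conveys the general idea with ordinary Python regular expressions: named entities are assumed to be already annotated inline as `[[type:text]]` (a hypothetical format), and a pattern looks for a compound, a trigger phrase and a species within one sentence.

```python
import re

# Hypothetical pattern: compound, trigger phrase, species, all in the same sentence.
ISOLATED_FROM = re.compile(
    r"\[\[compound:(?P<compound>[^\]]+)\]\]"
    r"[^.\[]*?\b(?:isolated|extracted|obtained)\s+from\b[^.\[]*?"
    r"\[\[species:(?P<species>[^\]]+)\]\]"
)

def extract_relations(annotated_text):
    """Return (species, 'contains', compound) triples found in annotated text."""
    return [(m["species"], "contains", m["compound"])
            for m in ISOLATED_FROM.finditer(annotated_text)]

# Made-up example sentences in the assumed annotation format
positive = ("[[compound:D(-)-Salicin]] was isolated from the bark of "
            "[[species:Salix alba]].")
negative = "[[compound:Aspirin]] was synthesized. [[species:Salix alba]] grows in Europe."

relations = extract_relations(positive)
# [('Salix alba', 'contains', 'D(-)-Salicin')]
```

Because the pattern refers to entity types rather than to individual names, a single rule covers every compound and species in the underlying ontologies.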
To create, manage, update and validate ontologies we have developed a range of different software tools.
Chemistry ontology editor
We have developed the first specialized chemistry ontology editor, SODIAC (structure ontology development and individual assignment center), to support the development of chemical ontologies. Using the OBO format, it implements known functionality of an ontology editor together with a chemistry structure editor that allows structure-based addition, editing and ontology checks. SODIAC can be used to annotate conventional structure files or chemical databases, whereby each compound will be assigned to its chemical structure classes.
Using SODIAC, we have developed a chemical ontology that comprises structure based classifications but also biology related classifications of chemical compounds. Particular emphasis has been given to natural products, for example steroids or sugars, but also to all classes of heterocycles and compound classes that are of interest for medicinal chemistry. In addition, classifications such as vitamins, food and flavor, cosmetics, drugs and FEMA compounds can be assigned.
OntViewer is designed to display, review and check very large ontologies of multi-GB size, such as the chemistry or the proteins/genes ontologies. It also performs logical, statistical and consistency checks on the ontology.
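Which checks OntViewer performs is not specified in detail. As an assumption, two typical consistency checks on an "is a" hierarchy are sketched below: parents that are referenced but never defined, and concepts whose ancestry never reaches a root because it runs into a cycle. All names and the sample data are hypothetical.

```python
def dangling_parents(is_a):
    """Return (concept, parent) pairs where the parent is not a defined concept."""
    return sorted((c, p) for c, parents in is_a.items()
                  for p in parents if p not in is_a)

def concepts_in_cycles(is_a):
    """Return concepts that lie on an 'is a' cycle or descend from one.

    Concepts whose defined parents are all resolved are removed repeatedly;
    whatever remains can never reach a root.
    """
    remaining = {c: {p for p in parents if p in is_a}
                 for c, parents in is_a.items()}
    changed = True
    while changed:
        resolved = [c for c, parents in remaining.items() if not parents]
        changed = bool(resolved)
        for c in resolved:
            del remaining[c]
        for parents in remaining.values():
            parents.difference_update(resolved)
    return sorted(remaining)

# Made-up hierarchy: concept -> list of parents (roots have an empty list)
sample = {
    "compounds": [],
    "heterocycles": ["compounds"],
    "a": ["b"],        # a and b form a cycle
    "b": ["a"],
    "c": ["a"],        # c descends from the cycle
    "x": ["missing"],  # parent is not defined anywhere
}
```

For this sample, `dangling_parents` reports the pair `("x", "missing")` and `concepts_in_cycles` reports `a`, `b` and `c`.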
HugeEdit is a simple and fast text editor for displaying, searching and editing very large data files of multi-GB size with many millions of lines, without the need to hold the complete data in memory. It is especially suited to working with column-separated data that is too large to be edited in standard spreadsheet editors.
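How HugeEdit achieves this is not described. One common technique, shown here purely as an assumption, is to scan the file once to record the byte offset of each line and then seek directly to the requested line, so that only the offsets and the line being viewed are held in memory.

```python
import os
import tempfile

def build_line_index(path):
    """Scan the file once and return the byte offset at which each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets

def read_line(path, offsets, line_number):
    """Fetch one line (0-based) by seeking to its offset instead of reading the file."""
    with open(path, "rb") as handle:
        handle.seek(offsets[line_number])
        return handle.readline().decode("utf-8").rstrip("\r\n")

# Small demonstration with a made-up tab-separated file
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.write("id\tname\nC1\tsalicin\nC2\tsalicylic acid\n")
    demo_path = tmp.name

index = build_line_index(demo_path)
row = read_line(demo_path, index, 2).split("\t")  # ['C2', 'salicylic acid']
os.remove(demo_path)
```

The same index makes it cheap to jump to an arbitrary row of column-separated data and split only that row into fields.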
OntoChem has also developed a series of custom-built command line tools that aid the creation, updating and validation of ontologies:
- Searching and proposing candidate synonym concept terms in document collections
- Automated generation of spelling variations
- Checking and correcting homonyms or logical errors within or between ontologies
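The homonym check in the list above can be illustrated with a simple index from terms to the concepts that use them. This is a sketch under assumed data structures; the ontology names, the concept IDs and the function `find_homonyms` are hypothetical.

```python
from collections import defaultdict

def find_homonyms(ontologies):
    """Return terms that are attached to more than one concept.

    ontologies: ontology name -> {concept id: [terms and synonyms]}
    Terms are compared case-insensitively.
    """
    owners = defaultdict(set)
    for ontology_name, concepts in ontologies.items():
        for concept_id, terms in concepts.items():
            for term in terms:
                owners[term.casefold()].add((ontology_name, concept_id))
    return {term: sorted(found) for term, found in owners.items()
            if len(found) > 1}

# Made-up miniature ontologies with hypothetical ids
ontologies = {
    "species": {"SP:1": ["cat", "Felis catus"]},
    "proteins": {"PR:1": ["CAT", "chloramphenicol acetyltransferase"]},
}
homonyms = find_homonyms(ontologies)
# {'cat': [('proteins', 'PR:1'), ('species', 'SP:1')]}
```

A term flagged this way is a candidate for a case-sensitive annotation rule or a conditional black or white list entry.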
Together, our technologies provide a straightforward and comprehensive toolbox for various tasks when working with ontologies.
OntoChem's ontologies, together with OCMiner®, are ideally suited for high-speed, high-quality annotation and search of large data volumes. For example, after annotating PubMed abstracts in the demo application www.ocminer.com, the chemistry search term “heterocyclic compounds” retrieves 3,124,129 hit documents, while a native PubMed search (http://www.ncbi.nlm.nih.gov/pubmed?term=heterocyclic) finds 24,524 hit documents.
Using the cell line “SKMEL-28” as a search term retrieves 296 documents, while the native PubMed search (http://www.ncbi.nlm.nih.gov/pubmed?term=skmel-28) delivers 26 hit documents.