OCMiner
OCMiner is OntoChem’s chemistry-aware semantic search engine. With OCMiner, it is straightforward to annotate and search chemical compounds and concepts together with life science ontologies such as diseases, anatomy, species taxonomies and proteins in text documents, patents and databases.
Particular strengths of our technology are:
- Chemical ontologies that allow searching for chemical classes such as monoterpenes, steroids and drugs...
- Chemical structure searching, including sub-structure and similarity searches based on identified chemical names such as IUPAC names, synonyms and trivial names
- Smart, high-quality conversion of input data formats such as PDF into readable XML that can be displayed in standard web browsers
- Very large, high-speed dictionaries to annotate names, ontology terms and phrases
- Combined ontology and free text search, including disambiguation, homonym resolution, fuzzy and Boolean search
Data Sources
OCMiner can annotate plain text, HTML, XML, PDF, Word, Excel and PowerPoint documents as well as databases such as PubMed. PDFs are the main source of information from scientific literature and patents, but they are not suitable for high-quality automated content processing and data mining. We have developed a proprietary PDF reader to extract the information encoded in PDFs:
- characters that follow unknown coding schemata or semantics are resolved by using OCR to uncover the correct encoding; additional character semantics are provided by font data
- words and their logical order are reassembled by spatial analysis
- recognition of document formatting, such as titles or footers of the original document
- transformation into XML documents, suitable for display as HTML pages
- all data is stored after indexing in a database
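The spatial analysis step can be illustrated with a minimal sketch. It assumes each extracted word comes with the position of its bounding box, and it rebuilds reading order by grouping words into lines and sorting each line from left to right. The `Word` class, the `reading_order` function and the `line_tolerance` parameter are hypothetical names chosen for this illustration; they do not describe OCMiner's proprietary PDF reader, which also has to handle columns, tables and other layouts.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word's bounding box
    y: float  # top edge of the bounding box, growing downward


def reading_order(words, line_tolerance=3.0):
    """Group words into lines by vertical position, then sort each line by x."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if it sits close to that line's first word.
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


# Words as they might come out of a PDF, in arbitrary order (assumed data).
scrambled = [
    Word("acid", 60, 10.5),
    Word("in", 50, 30.0),
    Word("Acetic", 10, 10.0),
    Word("water", 70, 30.2),
    Word("dissolves", 10, 29.5),
]
print(reading_order(scrambled))
```

A single tolerance value is only adequate for single-column text with uniform line spacing; multi-column pages would need an additional step that separates columns first.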
Annotation with Database Terms
The input text is split into sentences and tokens. These tokens are then indexed using dictionaries that are fed by domain-specific relational databases containing chemical compounds, anatomy terms, species and taxonomy hierarchies, proteins, diseases etc. The databases are updated at regular intervals.
- each recognized named entity is annotated with a unique identifier pointing to the ontology used, its meaning and synonyms
- confidence values are attached to the annotation, triggered by curated word lists and the token environment to reduce false positive annotations and to allow for disambiguation of terms
- named entities are annotated with concept identifiers by using phrase ontology dictionaries, chemical concept substructures and text fragment similarities
- parent concept identifiers enable ontology searches
- the included free text search engine enables advanced search logic, including Boolean and fuzzy searches
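The following minimal sketch shows how such an annotation pass might look in principle: text is split into sentences and tokens, each token is looked up in a dictionary, and the stored parent concept identifier allows a search by chemical class. The `ONTOLOGY` table, its identifiers and the functions `annotate` and `search_by_class` are invented for this example and are not OCMiner's actual data model or API; confidence values and disambiguation are left out.

```python
import re

# Hypothetical toy ontology: term -> (concept identifier, parent concept identifier)
ONTOLOGY = {
    "limonene": ("CHEM:0001", "CHEM:monoterpene"),
    "menthol": ("CHEM:0002", "CHEM:monoterpene"),
    "cortisol": ("CHEM:0003", "CHEM:steroid"),
}


def annotate(text):
    """Split text into sentences and tokens and annotate known terms."""
    annotations = []
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence_index, sentence in enumerate(sentences):
        for match in re.finditer(r"[A-Za-z0-9-]+", sentence):
            entry = ONTOLOGY.get(match.group().lower())
            if entry:
                concept_id, parent_id = entry
                annotations.append({
                    "term": match.group(),
                    "sentence": sentence_index,
                    "id": concept_id,
                    "parent": parent_id,
                })
    return annotations


def search_by_class(annotations, parent_id):
    """Ontology search: return all annotated terms that belong to a parent concept."""
    return [a["term"] for a in annotations if a["parent"] == parent_id]


found = annotate("Limonene and menthol were detected. Cortisol was absent.")
print(search_by_class(found, "CHEM:monoterpene"))
```

Storing the parent identifier with each annotation means a query for a class such as monoterpenes never has to enumerate the individual compound names.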
Dictionaries
OCMiner dictionaries are based on our patented Compact Trie based Dictionary (CTD) technology for named entity recognition. These dictionaries can hold single- and multi-word entries and provide fast term lookup (>100,000 terms per second), with support for prefix and suffix lookup as well as chemical name recognition.
Currently our dictionaries hold more than 100 million named entity terms.
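To illustrate the general idea of trie-based lookup for single- and multi-word entries, here is a minimal sketch of an ordinary token-level trie with longest-match annotation. It is a textbook structure, not the patented CTD technology, and it makes no attempt at compact storage, prefix or suffix lookup, or the stated throughput. The class name `TrieDictionary`, its methods and the identifiers in the example are hypothetical.

```python
class TrieDictionary:
    """Plain token-level trie; each path of tokens spells one dictionary entry."""

    def __init__(self):
        self.root = {}

    def add(self, term, identifier):
        node = self.root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[None] = identifier  # the None key marks the end of an entry

    def longest_match(self, tokens, start):
        """Return (end index, identifier) of the longest entry starting at start."""
        node, best = self.root, None
        for end in range(start, len(tokens)):
            node = node.get(tokens[end].lower())
            if node is None:
                break
            if None in node:
                best = (end + 1, node[None])
        return best

    def annotate(self, text):
        # Whitespace tokenization only; punctuation handling is omitted here.
        tokens = text.split()
        results, i = [], 0
        while i < len(tokens):
            match = self.longest_match(tokens, i)
            if match:
                end, identifier = match
                results.append((" ".join(tokens[i:end]), identifier))
                i = end
            else:
                i += 1
        return results


dictionary = TrieDictionary()
dictionary.add("acetic acid", "CHEM:1")
dictionary.add("acetic anhydride", "CHEM:2")
dictionary.add("water", "CHEM:3")
print(dictionary.annotate("Acetic acid reacts with water"))
```

Entries that share leading tokens, such as the two "acetic" terms, share a path in the trie, so the lookup cost depends on the length of the matched phrase rather than on the number of entries.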