SARminer: Automatic Extraction of Effects and Applications of Chemical Compounds from Patent Texts
Patents are a rich source of information on the effects and applications of chemical products (so-called Structure-Activity Relationships, SAR) in many areas of life, e.g. as drugs and medicines, food additives, cosmetics, colorings, or building materials. Patents provide a basis for decision making in the research and development of new chemical products, and they also serve as a knowledge source for already known chemical compounds and products. Today's commercial SAR databases are still compiled manually by human curators and are therefore error-prone, expensive, and incomplete.
The goal of the SARminer project is to develop a prototype method for automatically extracting SAR data from patent full texts.
Figure 1: SARminer workflow
SARminer's processing of patent texts consists of four essential steps: annotation of named entities, syntactic analysis, relation extraction, and final normalization. The extracted and normalized SAR triples are then stored efficiently and linked with their context within the patent, together with the priority date and other metadata of the corresponding patent application. New methods will be developed which, on the one hand, take the linguistic and structural peculiarities of patent texts into account and, on the other hand, make extensive use of OntoChem's fine-grained ontologies in the areas of chemistry, pharmacology, biology, and medicine.
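To make the four steps more concrete, the following is a minimal toy sketch in Python. It is not SARminer's actual method: the lexicons, identifiers (`CHEM:0001`, `EFFECT:0042`), the `SarTriple` structure, the patent number, and the priority date are all hypothetical placeholders, and the syntactic analysis is replaced by a naive search for a cue verb between two entities. A real system would rely on a parser and on large ontologies instead.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicons standing in for large ontologies (assumption).
COMPOUNDS = {"aspirin": "CHEM:0001", "acetylsalicylic acid": "CHEM:0001"}
EFFECTS = {"inflammation": "EFFECT:0042", "pain": "EFFECT:0007"}
# Surface cue verb -> normalized relation name (assumption).
RELATION_CUES = {"inhibits": "inhibits", "reduces": "inhibits", "treats": "treats"}


@dataclass(frozen=True)
class Entity:
    text: str   # lower-cased surface form found in the sentence
    kind: str   # "compound" or "effect"
    start: int
    end: int


@dataclass(frozen=True)
class SarTriple:
    compound_id: str
    relation: str
    effect_id: str
    sentence: str        # context within the patent
    patent_id: str       # metadata of the patent application
    priority_date: str


def annotate(sentence):
    """Step 1: dictionary-based named entity annotation."""
    lowered = sentence.lower()
    entities = []
    for kind, lexicon in (("compound", COMPOUNDS), ("effect", EFFECTS)):
        for term in lexicon:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, lowered):
                entities.append(Entity(term, kind, match.start(), match.end()))
    return sorted(entities, key=lambda e: e.start)


def find_cue(sentence, left, right):
    """Step 2 (naive stand-in for syntactic analysis): find a cue verb
    in the text between two entities."""
    between = sentence.lower()[left.end:right.start]
    for token in re.findall(r"[a-z]+", between):
        if token in RELATION_CUES:
            return token
    return None


def extract(sentence, patent_id, priority_date):
    """Steps 3 and 4: relation extraction and normalization to identifiers."""
    entities = annotate(sentence)
    triples = []
    for compound in (e for e in entities if e.kind == "compound"):
        for effect in (e for e in entities if e.kind == "effect"):
            if effect.start < compound.end:
                continue  # this sketch only handles "compound ... effect" order
            cue = find_cue(sentence, compound, effect)
            if cue is None:
                continue
            triples.append(SarTriple(
                compound_id=COMPOUNDS[compound.text],
                relation=RELATION_CUES[cue],
                effect_id=EFFECTS[effect.text],
                sentence=sentence,
                patent_id=patent_id,
                priority_date=priority_date,
            ))
    return triples


# Placeholder patent number and date, for illustration only.
triples = extract("Acetylsalicylic acid reduces inflammation and pain.",
                  patent_id="XX0000000", priority_date="2000-01-01")
for t in triples:
    print(t.compound_id, t.relation, t.effect_id)
```

Keeping the source sentence, patent identifier, and priority date on each triple mirrors the idea of linking extracted facts back to their context, so that a stored triple can always be traced to the passage it came from.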
The project will provide a basis for the development of marketable products. Envisaged applications include the automated creation of SAR databases, which will eventually replace conventional manually curated databases or, at least, facilitate their manual creation. Extracted SAR data may also be integrated into the back end of dedicated search engines. Prospective customers for such services are pharmaceutical and biotechnology companies, patent attorneys, and publishing houses.
The SARminer project is a cooperation of OntoChem with the University of Leipzig and is funded by the German Federal Ministry of Education and Research (BMBF).