Beyond unstructured textual data for life science
MALERBA, Donato
2005-01-01
Abstract
Biomedical information contained in text repositories (e.g. Medline) represents the vast majority of the genomic information accumulated over the years. Methods to transform unstructured text into structured information are necessary to provide access to these resources. Regularities in the structured data can then be discovered by means of data mining techniques. Transforming unstructured text into structured information means turning texts into structured objects described in terms of discrete data (e.g. words denoting biomedical entities such as genes or disease names), characterized by one or more attributes and possibly by relations between data (e.g. words denoting relations between genes and drug reactions or diseases). Entity and relation extraction is generally performed with Information Extraction techniques, which analyze natural language texts at different levels of complexity in order to extract features of entities and relations. The extracted features may express statistical (e.g. word frequency), lexical (e.g. alphanumeric or capitalized words), structural (e.g. the order of sentences in a text and of entities in a sentence), syntactic (e.g. singular/plural and proper/common nouns, base/conjugated verbs) or domain-specific knowledge (e.g. an entity belonging to a dictionary). Biomedical entities can also be described in terms of specialized taxonomies available in the life sciences (e.g. GeneOntology, MeSH, UMLS). Association rule mining on biomedical literature that exploits the MeSH taxonomy to discover associations between entities at different levels of abstraction has already been investigated. While previous work ignores information on relations among objects, we propose to exploit object interactions by resorting to a first-order formalism and a multi-relational approach to association rule mining. In this case, the mining process is able to extract association rules involving objects and relations at different levels of granularity with respect to a hierarchy defined on the objects of interest.
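As a rough illustration of what mining at different levels of granularity can mean, the sketch below computes the support and confidence of patterns over extracted relation facts, either at the level of individual entities or after lifting each entity to its parent in a hierarchy. It is a propositional simplification under stated assumptions, not the first-order, multi-relational method proposed here: the taxonomy `PARENT`, the facts in `DOCS`, and the helper names `generalize`, `support` and `confidence` are all hypothetical and chosen only for the example.

```python
# Hypothetical toy hierarchy (entity -> parent concept); not taken from MeSH.
PARENT = {
    "BRCA1": "gene",
    "TP53": "gene",
    "breast cancer": "disease",
    "lung cancer": "disease",
}

# Hypothetical facts extracted from four abstracts: (relation, subject, object).
DOCS = [
    {("associated_with", "BRCA1", "breast cancer")},
    {("associated_with", "BRCA1", "breast cancer"),
     ("associated_with", "TP53", "lung cancer")},
    {("associated_with", "TP53", "breast cancer")},
    {("inhibits", "TP53", "lung cancer")},
]


def generalize(fact, level):
    """Level 0 keeps entity names; level 1 replaces them with their parent."""
    relation, subject, obj = fact
    if level == 0:
        return fact
    return (relation, PARENT.get(subject, subject), PARENT.get(obj, obj))


def support(pattern, docs, level):
    """Fraction of documents whose generalized facts contain the whole pattern."""
    hits = sum(
        1 for facts in docs
        if pattern <= {generalize(f, level) for f in facts}
    )
    return hits / len(docs)


def confidence(body, head, docs, level):
    """Support of body and head together, divided by the support of the body."""
    body_support = support(body, docs, level)
    if body_support == 0:
        return 0.0
    return support(body | head, docs, level) / body_support


if __name__ == "__main__":
    specific = {("associated_with", "BRCA1", "breast cancer")}
    general = {("associated_with", "gene", "disease")}
    print("support at entity level:", support(specific, DOCS, 0))
    print("support at concept level:", support(general, DOCS, 1))
```

On this toy data the specific pattern holds in two of the four documents, while its generalization to "a gene associated with a disease" holds in three, which is why rules that are too rare at the entity level may still become frequent at a more abstract level of the hierarchy.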