SIABO

Semantic Information Access through Biomedical Ontologies


Contact:
Troels Andreasen

Project description

This project addresses IT research targeting biomedical applications in an industrial research environment.

1 Scientific summary
2 Research idea and plan
3 Overview of the project approach
4 Innovative aspects and relations to international and national research
5 Strategic impact and relevance to society and industry
6 References

1 Scientific summary

The scientific aim of the project is to provide a systems architecture for representing, organizing, and accessing the conceptual content of biomedical texts using a formal ontology. Ontologies are formal tools for structuring the concepts of a scientific domain by means of relationships between the concepts, e.g. along the specialization/generalization dimension. The present approach introduces the notion of generative ontologies, that is, ontologies providing ever more specialized concepts reflecting the phrase structure of natural language. The project seeks to set up a novel so-called "ontological semantics" mapping noun phrases into points in the generative ontology. This enables an advanced form of text data mining that identifies paraphrases and conceptual relationships and measures distances between key concepts in texts. Thus, the project is unique in its attempt to provide a formal underpinning to conceptual similarity or relatedness of meaning. The project focuses on ontological engineering of biomedical ontologies applying the notions of lattices and relation algebras, which facilitates visualization of concepts as "ontoscapes". The project has clear affinities to contemporary research in the semantic web area, to description logic, and to XML approaches, and it gains its distinct innovative scientific profile by means of the above-mentioned notions. The intended benefits of the proposed system lie within the areas of knowledge discovery, data management, search, and data visualization.

2 Research idea and plan

The aim of this project is to provide solutions to the problems of identifying, representing, and comparing content in biomedical scientific texts, thereby establishing a semantic approach to information access drawing on concepts appearing in the scientific texts targeted. The purpose is to provide easier and more direct access to the content of scientific texts as well as to provide means for surveying and visualizing collections of texts based on conceptual content. Flexibility and task-orientation in the access to scientific texts depend crucially on the possibility of bringing described topics, as well as concepts and their relationships, into play. A foundation for knowledge discovery is established by organizing the local data sources into ontologies. The interrelation of the local sources through the ontologies facilitates the discovery of previously unknown connections between these sources. Relating the sources to external data through ontologies furthermore enables the discovery of connections between the local data and external contextual knowledge about the subjects in the sources. The thesis of the project is that using ontology-based methods to represent core concepts and their mutual relations in an ontology-structured knowledge base on top of a primary database of scientific texts will bring about much-improved functionality in the areas of search results, automatic indexing and classification, as well as task-oriented viewing of and navigating in collections of documents. The project derives its direct relevance to business concerns from providing solutions to concrete questions such as the following:

• How to semi-automatically collect and structure the domain knowledge relevant to a collection of texts by using descriptions of central concepts, terminology, taxonomies, and existing knowledge resources in the field?
• How to optimally represent domain knowledge in a formal system which, on the one hand, enables consistency checks and inference of new knowledge, and on the other, ensures a transparent and easily accessible systematization providing new perspectives on the knowledge domain of the texts?
• How to describe and index scientific texts once relevant domain knowledge has been established and systematized?
• How to group and classify scientific texts on the basis of properties expressed in the formal descriptions of these texts?
• How to simplify and improve search and navigation in texts by taking into account the full amount of domain knowledge and the descriptions based on that knowledge?
• What possibilities with respect to visualizing a scientific “knowledge landscape” are created by the combination of a domain-specific ontology with a text database indexed with reference to that ontology?

The primary research problems which have to be addressed in order to provide principled solutions to the above questions, include in particular:

• How to represent conceptual knowledge extracted from arbitrary collections of scientific texts?
• How to build a model of the contents of an, in principle, arbitrary collection of texts with a rich conceptual structure?
• How to relate complex natural language expressions to composite conceptual representations?
• How to provide a user with a task-oriented overview of the contents of a text database?

The commercial aim of the project is to build a framework for the treatment of scientific literature within the Novo Nordisk research and development departments. The purpose is to provide scientists easier means of combining and surveying information from various sources. Furthermore, an important objective is to enhance the possibilities for finding relevant information in external sources, such as scientific journals and articles, that relate to e.g. an initial article, a textual description or a conceptual expression.

3 Overview of the project approach

Below, the foundations of content description are introduced, and after that the methods of semi-automatic content modelling are described. Finally the concept-based approach to text indexing and search is sketched.

A. Theoretical foundations of content description.

Formal ontologies are formulated in a dedicated language, for instance a logical language like description logic. The conceptual approach envisaged builds on a knowledge base represented as a ‘generative ontology’ whose concepts (nodes) are dynamically linked by semantic relations. A generative ontology is at the outset a skeleton ontology comprising a finite number of primitive concepts ordered by the generic relation ISA. Complex concept terms are formed by introducing a finite number of (associative) ontological relationships, e.g., AGENT, CAUSE, TEMPORALITY, LOCATION, whose arguments are the concepts in the skeleton ontology. The argument concept terms are either primitive concepts in the skeleton ontology, or complex concept terms. Thus, concept structures may be nested, giving rise to an infinity of arbitrarily complex concepts. This is what makes the ontology generative. By means of the concept-algebraic description language, OntoLog, developed in its seminal form in the OntoQuery project [1,4], we can build the ontology and represent concepts independently of their various linguistic realizations. For example, synonymous phrases like obese children, children with obesity, children who are obese are represented by the same (complex) concept descriptor. The content of the generative ontology is elicited from domain experts, culled from dictionaries, thesauri, taxonomies, etc., and supplemented by statistically extracted excerpts from the text database. The conceptual approach employed goes beyond conventional methodologies in that contemporary text processing involves only limited morpho-syntactic analysis, whereas here semantic analysis plays a major role.
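The idea of a generative ontology can be illustrated with a minimal Python sketch. It is not OntoLog itself: the tiny skeleton ontology, the relation label CHR ("characterized by"), the phrase table and the helper `make` are all hypothetical names introduced only for illustration. The sketch shows a finite ISA skeleton, a finite set of relations, nested concept terms, and three synonymous phrases sharing one descriptor.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical skeleton ontology: primitive concept -> its ISA parent.
ISA = {
    "child": "person",
    "person": "entity",
    "obesity": "disease",
    "disease": "state",
    "state": "entity",
}

# Finite set of associative relations. AGENT, CAUSE, TEMPORALITY and LOCATION
# come from the text; CHR ("characterized by") is an assumed extra label.
RELATIONS = {"AGENT", "CAUSE", "TEMPORALITY", "LOCATION", "CHR"}


@dataclass(frozen=True)
class Concept:
    """A concept term: a primitive head plus optional (relation, argument) pairs."""
    head: str
    features: Tuple[Tuple[str, "Concept"], ...] = ()

    def __str__(self):
        if not self.features:
            return self.head
        inner = ", ".join(f"{rel}: {arg}" for rel, arg in self.features)
        return f"{self.head}[{inner}]"


def make(head, **features):
    """Build a (possibly nested) concept term, checking head and relation names."""
    if head not in ISA and head not in ISA.values():
        raise ValueError(f"unknown primitive concept: {head}")
    unknown = set(features) - RELATIONS
    if unknown:
        raise ValueError(f"unknown relations: {sorted(unknown)}")
    # Sort by relation name so that argument order does not affect identity.
    ordered = sorted(features.items(), key=lambda kv: kv[0])
    return Concept(head, tuple(ordered))


obese_child = make("child", CHR=make("obesity"))

# Hypothetical phrase table: different wordings, one descriptor.
PHRASES = {
    "obese children": obese_child,
    "children with obesity": make("child", CHR=make("obesity")),
    "children who are obese": make("child", CHR=make("obesity")),
}

# Arguments may themselves be complex terms, so nesting is unbounded.
nested = make("child", CHR=make("disease", CAUSE=make("obesity")))

print(obese_child)  # child[CHR: obesity]
print(nested)       # child[CHR: disease[CAUSE: obesity]]
```

Because the descriptor is an immutable value, two phrases map to "the same concept" exactly when their descriptors compare equal, which is what allows paraphrases to be identified without comparing strings.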
Semantic analysis and representation are based on conceptual knowledge from the ontology and associative networks. A core challenge is to provide semantic analyses of larger chunks of text based on the generative ontology. Nominal expressions, including nouns modified by prepositional phrases, noun-noun compounds, and genitive constructions, are particularly interesting in this respect, because they are typically the heaviest contributors to document content. Such expressions, however, present very hard semantic problems because their semantics involves unexpressed semantic relations. A central problem to be addressed is how a finite set of ontological, language-independent relations may contribute to computing relevant semantic relations in the representation of document content [19,20]. This partial semantic analysis results in a formal representation of the themes of the text expressed as sets of OntoLog descriptors, which are then compared to the formal concept descriptors defined in the ontology. The result of this comparison forms the basis of the ontology-based access to the text database.

As in more conventional approaches, indexing, i.e., attribution of some form of description to documents, text fragments, etc., constitutes the basis of the semantics-based approach of this project. The main difference is that, under our approach, such attributed descriptions, rather than being taken from lists of words, are constituted by structures of descriptors representing ontological concepts. A major challenge is defining descriptor similarity in terms of the structure and relations of the ontology [2,6]. Search in a text collection indexed by ontology-based descriptors can draw on similarity measures relating these descriptors, so that conceptual reasoning can be replaced by simple similarity computation, thereby allowing for scaling to very large information bases.
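How similarity computation can stand in for conceptual reasoning may be sketched as follows. The measure used here, the Jaccard overlap of the ISA-ancestor sets of two concepts, is only one simple choice and is not claimed to be the measure developed in the project; the small ontology, the document index and the function names are hypothetical.

```python
# Hypothetical ISA graph: concept -> list of direct parents
# (a list, so that multiple inheritance is possible).
ISA = {
    "insulin": ["hormone"],
    "glucagon": ["hormone"],
    "hormone": ["substance"],
    "obesity": ["disease"],
    "diabetes": ["disease"],
    "disease": ["state"],
    "substance": ["entity"],
    "state": ["entity"],
}


def ancestors(concept):
    """Return the concept together with everything reachable upwards via ISA."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in ISA.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def similarity(a, b):
    """Assumed measure: share of ISA ancestors the two concepts have in common."""
    ua, ub = ancestors(a), ancestors(b)
    return len(ua & ub) / len(ua | ub)


def rank(query, index):
    """Order documents by the best similarity between the query and their descriptors."""
    scored = [
        (max(similarity(query, d) for d in descriptors), doc)
        for doc, descriptors in index.items()
    ]
    return [doc for score, doc in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical index: document id -> set of descriptors attributed to it.
INDEX = {
    "doc1": {"glucagon"},
    "doc2": {"diabetes"},
    "doc3": {"insulin", "obesity"},
}

print(rank("insulin", INDEX))  # ['doc3', 'doc1', 'doc2']
```

A document about glucagon is ranked above one about diabetes for the query "insulin" even though neither contains the query word, because glucagon shares more of the ontology's structure with insulin. At query time no inference is needed beyond set operations on ancestor sets, which can be precomputed.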
A core problem is to investigate this kind of ontology-based similarity search as a semantics-based but still efficient approach [22].

B. Methods of semi-automatic content modelling.

Today’s computerised ontologies are usually hand-crafted through extensive and costly empirical work. State-of-the-art methods and tools for automatic ontology construction in large measure fail to capture the relevant meaning structures. Examples of the shortcomings of current automatic tools are misplacement of concepts in the ontology, e.g. sub-subconcepts of a concept registered as direct subconcepts, and a typical inability to capture relations other than hierarchical ones (ISA and part-whole relations), thus ignoring associative relationships. Semi-automatic content modelling is to be realized in part by applying principles of terminological concept modelling. It is a major challenge to examine the interaction between terminological ontologies and the concept of generative ontology. The backbone of terminological ontologies is constituted by the ISA relationship. However, associative relationships form the basis for identifying characteristics of concepts which can be used to introduce feature specifications on concepts. In ISA hierarchies, subdividing dimensions which group subconcepts can be introduced on the basis of feature specifications, the advantage being that the ontology will build on more solid principles. In the CAOS project (Computer-Aided Ontology Structuring) [31,33] a number of principles have been proposed for working with feature specifications. In the area of ontology building we will work with the problem of integrating domain-specific and general ontologies. Furthermore, we will investigate the relation between the formal ontology and the language-specific expressions found in the texts.
Ontology building also involves enhancement by inductively/statistically extracted excerpts from documents reflected in associations and classifications based on occurrences in documents. Our basic assumption is precisely that combining the notion of generative ontology with inductive knowledge can successfully constitute the foundation of a semantics-based approach to information access which will be realistic also for very large databases.
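The role of feature specifications and subdividing dimensions discussed under B can be illustrated with a small sketch. It does not reproduce the CAOS principles; it only shows how subconcepts of one concept might be grouped by the dimension of their feature specification. The concepts, dimensions and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical feature specifications on subconcepts of "insulin":
# concept -> {dimension: value}.
FEATURES = {
    "human insulin": {"SOURCE": "human"},
    "porcine insulin": {"SOURCE": "pig"},
    "rapid-acting insulin": {"DURATION": "short"},
    "long-acting insulin": {"DURATION": "long"},
}


def subdivide(features):
    """Group subconcepts under each subdividing dimension and feature value."""
    dimensions = defaultdict(dict)
    for concept, spec in features.items():
        for dimension, value in spec.items():
            dimensions[dimension].setdefault(value, []).append(concept)
    return {dim: dict(groups) for dim, groups in dimensions.items()}


for dimension, groups in subdivide(FEATURES).items():
    print(dimension, groups)
```

Grouping by dimension makes it visible that "human insulin" and "rapid-acting insulin" are not competing siblings: they are subdivisions of the same concept along different criteria.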

4 Innovative aspects and relations to international and national research

As has been described in several articles concerning the retrieval of biomedical information, there is a growing need for mechanisms to gather, organize and present biomedical information. The solutions have concentrated on integrating text mining, text classification and ontologies, and research is presently ongoing in these areas. The EU Network of Excellence “Semantic Interoperability and Data Mining in Biomedicine” (http://www.semanticmining.org), created in 2004, has gathered 25 institutions from 11 EU countries to support the development of “generic methods and tools supporting the critical tasks of the field; data mining, knowledge discovery, knowledge representation, abstraction and indexing of information, semantic-based information retrieval in a complex and high-dimensional information space, and knowledge-based adaptive systems for provision of decision support for dissemination of evidence based medicine”. Within the areas of text mining, data enhancements and semantic web technologies there has been progress, e.g. with respect to parsing techniques and classification methods, but the area of organising the extracted data has only been sketched. An important goal in the proposed project is to develop and implement technologies that combine ontological engineering and knowledge discovery approaches, thereby adding significant strength to the biomedical text mining techniques.

A. Knowledge representation and modelling.

Knowledge representation and modelling by ontologies is an active area of research, with description logic playing a crucial role in providing notations for ontologies. Focussing on semantic mark-up of documents (Semantic Web), proposals have been made for text mark-up languages, e.g., extensions of XML and RDF, based on the mainstream description-logic paradigm.
A key notion in the formal framework adopted in this project is the algebraic notion of a lattice, which is the mathematical structure accounting for multi-hierarchical classification organisations. However, algebraic lattices are to be extended with attributes and relations in order to accommodate the conceptual models intended in the project. The use of lattices as the logical framework for ontology representation and classification gives the project a distinct profile. At the same time, however, it does not prevent the project from associating with similar contemporary research projects based on, e.g., the description-logic paradigm. One advantage of our pursuit of lattice models is that lattices, in contrast to description logic, come with suggestive classification diagrams providing intuitive graphical visualizations of ontologies, e.g. by Hasse diagrams or by entity-relationship diagrams. However, from an implementation perspective the lattice-algebraic approach is ambitious, though progress has been made recently through our collaboration with Prof. Mai Gehrke, New Mexico, USA, one of the leading mathematicians in this field. As a back-up position, we have proposed a notion of conceptual or ontological grammars in the OntoQuery research project. This notion provides much of the representational and computational expressivity of the lattice framework, but is more tractable and instructive. The potential of this tool is yet to be explored, and the present project is a well-suited application for it.

B. Search, indexing and mining.

Search and indexing, too, are active fields of research in, e.g., information retrieval and databases, recently also attracting attention due to internet use. Generally, in these fields of research the focus is on indexing, evaluation, and ranking techniques, exhibiting results based on methods from AI and involving statistical analysis and data mining.
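The lattice view of ontologies described under A can be made concrete with a small sketch. It assumes, purely for illustration, that each concept is identified with its set of attributes, so that a concept is more specific the more attributes it carries; the project's extended lattices with attributes and relations are richer than this. The names `leq`, `join`, `meet` and `hasse_edges` are hypothetical.

```python
def leq(a, b):
    """a is at least as specific as b when a carries every attribute of b."""
    return a >= b


def join(a, b):
    """Least upper bound: the most specific common generalization."""
    return a & b


def meet(a, b):
    """Greatest lower bound: the most general common specialization."""
    return a | b


def hasse_edges(elements):
    """Covering pairs (lower, upper): the edges drawn in a Hasse diagram."""
    edges = []
    for lo in elements:
        for hi in elements:
            if lo == hi or not leq(lo, hi):
                continue
            # Skip the pair if some third element lies strictly between them.
            between = any(
                m not in (lo, hi) and leq(lo, m) and leq(m, hi) for m in elements
            )
            if not between:
                edges.append((lo, hi))
    return edges


# Hypothetical concepts, each represented by its attribute set.
TOP = frozenset()
HORMONE = frozenset({"hormone"})
PEPTIDE = frozenset({"peptide"})
PEPTIDE_HORMONE = frozenset({"hormone", "peptide"})

ELEMENTS = [TOP, HORMONE, PEPTIDE, PEPTIDE_HORMONE]

for lo, hi in hasse_edges(ELEMENTS):
    print(sorted(lo), "ISA", sorted(hi))
```

The concept with both attributes sits below two parents, which is the multi-hierarchical classification that a plain tree cannot express, and the covering pairs are exactly what a Hasse diagram would draw.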
Only to a limited extent have knowledge modelling and search been linked, due to the fact that knowledge structures, e.g., ontologies, cannot function as a basis if search is prepared using only string- and word-based indexing. If concepts are to be referenced in a description resulting from automatic indexing, this indexing must involve recognizing concepts based on text fragments. The OntoQuery project has worked with linking ontological modelling of domain knowledge and ontology-based search. The link is brought about by an analysis unfolding an ontological semantics for natural language.

C. Viewing structured information.

An area characterized by many scattered suggestions but only limited coherent research efforts is that of visualization of information. Inspired by the natural graphical representations coming with the lattice organization of ontologies, one of the ideas we wish to pursue in the project is to enable visualization of conceptual content in the form of so-called "conceptual globes". In these globes the lattice-structured ontological diagrams are arranged so that the centre of the globe forms the top node with specializations sprouting radially in all directions. This is suggested as a "mental" conceptualisation as well as a graphical one. This organisation offers radial regionalisation as sectors, wedges, and cones as well as spherical regionalisation as "continents", cartographic projections, slices, and perspectives.
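The radial arrangement behind the conceptual globe can be sketched in two dimensions, as a flat slice through the globe. The sketch assumes a tree-shaped ontology (a lattice with shared subconcepts would need extra handling) and uses a hypothetical function `radial_layout`: the top node is placed at the centre, depth becomes radius, and each concept's angular sector is divided among its specializations.

```python
import math


def radial_layout(children, root, start=0.0, end=2 * math.pi, depth=0, out=None):
    """Assign (x, y) positions: radius = depth, angle = middle of the node's sector."""
    if out is None:
        out = {}
    angle = (start + end) / 2
    out[root] = (depth * math.cos(angle), depth * math.sin(angle))
    kids = children.get(root, [])
    if kids:
        width = (end - start) / len(kids)
        for i, kid in enumerate(kids):
            # Each specialization receives an equal wedge of its parent's sector.
            radial_layout(
                children, kid, start + i * width, start + (i + 1) * width, depth + 1, out
            )
    return out


# Hypothetical tree-shaped ontology: concept -> direct specializations.
CHILDREN = {
    "entity": ["substance", "state"],
    "substance": ["hormone"],
    "state": ["disease"],
}

POSITIONS = radial_layout(CHILDREN, "entity")
for concept, (x, y) in POSITIONS.items():
    print(f"{concept:10s} x={x:+.2f} y={y:+.2f}")
```

In this layout a sector of the disc corresponds to one branch of the ontology, which matches the radial regionalisation into sectors and wedges mentioned above; a three-dimensional globe would use two angles instead of one.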

5 Strategic impact and relevance to society and industry

In order to be competitive, companies need to have access to the contents of the ever-increasing amount of documentation about their products, processes and projects. Only a semantics-based approach to information management addressing content is adequate to that task. High-tech and research-based companies wish to document as accurately as possible the immaterial values they command, in particular the knowledge and expertise available from all resources in the company. This is a strategic concern in the company. In principle, a company or organisation using the approach described above will be able to answer questions like: What do we know about the field of knowledge X? From which resources (e.g. collection of documents) is this knowledge available? How does X relate to the overall landscape of knowledge commanded by the company? The possibility of answering such questions is brought about by extracting knowledge from company documents and representing this knowledge in a flexible and powerful language internal to the computer. The content of each document is described as a set of arbitrarily complex conceptual descriptors formulated in this language, facilitating detailed comparison of the content of documents. The properties of an ontology-based system as sketched above lead to easier access to data sources, locally as well as globally, and to integration of the scientific research with the information available on the subject within Novo Nordisk and in the scientific literature. Altogether, the proposed system will enable the research units within Novo Nordisk to structure the findings and documentation on the experiments, to relate and discover information relevant to the research subjects and access this information easily, and to browse the globally available information on the subjects.
The academic relevance of the proposed system is significant within the areas of ontological engineering, semantic web technologies and applications of text mining technologies, and the conclusions drawn from the construction of the system are intended to contribute to the scientific work done in these research areas.