Contact:
Troels Andreasen
Project description
This project addresses IT research targeting biomedical applications in an industrial
research environment.
Contents
1 Scientific summary
2 Research idea and plan
3 Overview of the project approach
4 Innovative aspects and relations to international and national research
5 Strategic impact and relevance to society and industry
6 References
1 Scientific summary
The scientific aim of the project is to provide a systems architecture for representing,
organizing, and accessing the conceptual content of biomedical texts using a formal
ontology.
Ontologies are formal tools for structuring the concepts of a scientific domain by
means of relationships between the concepts, e.g. along the specialization/generalization
dimension. The present approach introduces the notion of generative ontologies, that is,
ontologies providing ever more specialized concepts reflecting the phrase structure of
natural language. The project seeks to set up a novel so-called "ontological semantics"
mapping noun phrases into points in the generative ontology. This enables an advanced
form of data mining of texts identifying paraphrases and conceptual relationships, and
measuring distances between key concepts in texts. Thus, the project is unique in its
attempt to provide a formal underpinning to conceptual similarity or relatedness of
meaning.
The project focuses on ontological engineering of biomedical ontologies applying
the notion of lattices and relation-algebras, which facilitates visualization of concepts as
"ontoscapes".
The project has clear affinities to contemporary research in the semantic web area,
to description logic, and to XML approaches, and gains its distinct innovative
scientific profile by means of the above-mentioned notions.
The intended benefits of the proposed system lie within the areas of knowledge
discovery, data management, search, and data visualization.
2 Research idea and plan
The aim of this project is to provide solutions to the problems of identifying,
representing, and comparing content in biomedical scientific texts, thereby establishing
a semantic approach to information access drawing on concepts appearing in the
scientific texts targeted. The purpose is to provide easier and more direct access to the
content of scientific texts as well as to provide means for surveying and visualizing
collections of texts based on conceptual content. Flexibility and task-orientation in the
access to scientific texts depend crucially on the possibility of bringing described topics,
as well as concepts and their relationships, into play.
A foundation for knowledge discovery is established by organizing the local data
sources into ontologies. The interrelation of the local sources through the ontologies
facilitates the discovery of previously unknown connections between these sources. The
relation of the sources through ontologies with external data furthermore enables the
discovery of connections between the local data and external contextual knowledge
about the subjects in the sources.
The thesis of the project is that using ontology-based methods in representing core
concepts and their mutual relations in an ontology-structured knowledge base on top of
a primary database of scientific texts will bring about a much-improved functionality in
the areas of search results, automatic indexing and classification, as well as task-oriented
viewing of and navigating in collections of documents.
The project derives its direct relevance to business concerns from providing
solutions to concrete questions such as the following:
• How to semi-automatically collect and structure the domain knowledge relevant
to a collection of texts by using descriptions of central concepts, terminology,
taxonomies, and existing knowledge resources in the field?
• How to optimally represent domain knowledge in a formal system which, on the
one hand, enables consistency checks and inference of new knowledge, and on
the other, ensures a transparent and easily accessible systematization providing
new perspectives on the knowledge domain of the texts?
• How to describe and index scientific texts once relevant domain knowledge has
been established and systematized?
• How to group and classify scientific texts on the basis of properties expressed in
the formal descriptions of these texts?
• How to simplify and improve search and navigation in texts by taking into
account the full amount of domain knowledge and the descriptions based on that
knowledge?
• What possibilities with respect to visualizing a scientific “knowledge landscape” are
created by the combination of a domain-specific ontology with a text database
indexed with reference to that ontology?
The primary research problems that must be addressed in order to provide
principled solutions to the above questions include, in particular:
• How to represent conceptual knowledge extracted from arbitrary collections of
scientific texts?
• How to build a model of the contents of an, in principle, arbitrary collection of
texts with a rich conceptual structure?
• How to relate complex natural language expressions to composite conceptual
representations?
• How to provide a user with a task-oriented overview of the contents of a text
database?
The commercial aim of the project is to build a framework for the treatment of scientific
literature within the Novo Nordisk research and development departments. The purpose
is to provide scientists with easier means of combining and surveying information from
various sources. Furthermore, an important objective is to enhance the possibilities for
finding relevant information in external sources, such as scientific journals and articles,
that relate to e.g. an initial article, a textual description or a conceptual expression.
3 Overview of the project approach
Below, the foundations of content description are introduced first, followed by the methods
of semi-automatic content modelling. The concept-based approach to text indexing and
search is sketched as part of the foundations.
A. Theoretical foundations of content description. Formal ontologies are formulated in a
dedicated language, for instance a logical language like description logic. The
conceptual approach envisaged builds on a knowledge base represented as a ‘generative
ontology’ whose concepts (nodes) are dynamically linked by semantic relations.
A generative ontology is at the outset a skeleton ontology comprising a finite
number of primitive concepts ordered by the generic relation ISA. Complex concept
terms are formed by introducing a finite number of (associative) ontological
relationships, e.g., AGENT, CAUSE, TEMPORALITY, LOCATION, whose arguments
are the concepts in the skeleton ontology. The argument concept terms are either
primitive concepts in the skeleton ontology, or complex concept terms. Thus, concept
structures may be nested, giving rise to an infinity of arbitrarily complex concepts. This
is what makes the ontology generative. By means of the concept-algebraic description
language, OntoLog, developed in its seminal form in the OntoQuery project [1,4], we
can build the ontology and represent concepts independently of their various linguistic
realizations. For example, synonymous phrases like obese children, children with
obesity, children who are obese are represented by the same (complex) concept
descriptor. The content of the generative ontology is elicited from domain experts,
culled from dictionaries, thesauri, taxonomies, etc., and supplemented by statistically
extracted excerpts from the text database.
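The construction described above can be illustrated with a minimal sketch. It assumes a toy skeleton ontology, a hypothetical relation CHR ("characterized by") alongside the relations named above, and a bracket notation chosen here for readability; none of these is taken from the actual OntoLog definition. The paraphrase table is written by hand, so the sketch shows only the target representation, not the linguistic analysis that would produce it.

```python
from dataclasses import dataclass

# Hypothetical skeleton ontology: primitive concept -> its ISA parent.
ISA = {
    "child": "person",
    "person": "entity",
    "obesity": "disease",
    "disease": "entity",
}

# Finite set of associative relations; CHR is an assumption made for this example.
RELATIONS = {"AGENT", "CAUSE", "TEMPORALITY", "LOCATION", "CHR"}


@dataclass(frozen=True)
class Concept:
    """A concept term: a primitive head plus optional (relation, argument) pairs."""
    head: str
    attributes: frozenset = frozenset()

    def __str__(self):
        if not self.attributes:
            return self.head
        parts = sorted(f"{rel}: {arg}" for rel, arg in self.attributes)
        return f"{self.head}[{', '.join(parts)}]"


def concept(head, **attributes):
    """Build a concept term, checking the head and the relation names."""
    if head not in ISA and head not in ISA.values():
        raise ValueError(f"unknown primitive concept: {head}")
    attrs = set()
    for rel, arg in attributes.items():
        if rel not in RELATIONS:
            raise ValueError(f"unknown relation: {rel}")
        # Arguments may be primitive names or already-built (nested) concept terms.
        attrs.add((rel, arg if isinstance(arg, Concept) else concept(arg)))
    return Concept(head, frozenset(attrs))


# Hand-written mapping: three surface phrases, one concept descriptor.
PARAPHRASES = {
    "obese children": concept("child", CHR="obesity"),
    "children with obesity": concept("child", CHR="obesity"),
    "children who are obese": concept("child", CHR="obesity"),
}

# Nesting: the argument of a relation is itself a complex concept term.
NESTED = concept("disease", CAUSE=concept("obesity", LOCATION="child"))

print(PARAPHRASES["obese children"])  # child[CHR: obesity]
print(NESTED)                         # disease[CAUSE: obesity[LOCATION: child]]
```

Because arguments may themselves be complex terms, the finite skeleton and the finite relation set already yield an unbounded set of well-formed concept terms, which is the generative property referred to above.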
The conceptual approach employed goes beyond conventional methodologies in that
contemporary text processing involves only limited morpho-syntactic analysis, whereas
here semantic analysis plays a major role. Semantic analysis and representation are
based on conceptual knowledge from the ontology and associative networks. A core
challenge is to provide semantic analyses of larger chunks of text based on the
generative ontology. Nominal expressions, including nouns modified by prepositional
phrases, noun-noun compounds, and genitive constructions are particularly interesting
in this respect, because they are typically the heaviest contributors to document content.
Such expressions, however, present very hard semantic problems because their
semantics involves unexpressed semantic relations. A central problem to be addressed is
how a finite set of ontological, language-independent relations may contribute to
computing relevant semantic relations in the representation of document content
[19,20]. This partial semantic analysis results in a formal representation of the themes of
the text expressed as sets of OntoLog-descriptors, which are then compared to the
formal concept descriptors defined in the ontology. The result of this comparison forms
the basis of the ontology-based access to the text database.
As in more conventional approaches, indexing, i.e., attribution of some form of
description to documents, text fragments, etc., constitutes the basis of the semantics-based
approach of this project. The main difference is that, under our approach, such
attributed descriptions, rather than being taken from lists of words, are constituted by
structures of descriptors representing ontological concepts. A major challenge is
defining descriptor similarity in terms of the structure and relations of the ontology
[2,6]. Search in a text collection indexed by ontology-based descriptors can draw on
similarity-measures relating these so that conceptual reasoning can be replaced by
simple similarity computation thereby allowing for a scaling to very large information
bases. A core problem is to investigate this kind of ontology-based similarity search as a
semantics-based but still efficient approach [22].
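To make the idea of replacing reasoning by similarity computation concrete, the following sketch ranks documents against a query concept. It assumes a toy ISA graph, a toy index, and a Jaccard overlap of upward closures as the similarity measure; this measure is a stand-in chosen for simplicity and is not claimed to be the one investigated in the project.

```python
# Hypothetical ISA graph: concept -> set of direct parents (multiple parents allowed).
ISA = {
    "insulin": {"hormone", "drug"},
    "glucagon": {"hormone"},
    "hormone": {"substance"},
    "drug": {"substance"},
    "substance": set(),
    "diabetes": {"disease"},
    "disease": set(),
}

# Hypothetical index: document id -> set of concepts attributed to it.
INDEX = {
    "doc1": {"glucagon"},
    "doc2": {"diabetes"},
    "doc3": {"insulin", "diabetes"},
}


def ancestors(concept):
    """Return the concept together with everything reachable upward via ISA."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(ISA[node])
    return seen


def similarity(a, b):
    """Jaccard overlap of the upward closures of a and b (1.0 for identical concepts)."""
    up_a, up_b = ancestors(a), ancestors(b)
    return len(up_a & up_b) / len(up_a | up_b)


def rank(query):
    """Order documents by their best-matching concept; ties broken by document id."""
    scores = {
        doc: max(similarity(query, c) for c in concepts)
        for doc, concepts in INDEX.items()
    }
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


print(similarity("insulin", "glucagon"))  # 0.4
print(rank("insulin"))                    # ['doc3', 'doc1', 'doc2']
```

A query for "insulin" thus also retrieves the document indexed only by "glucagon", ranked below the exact match, without any inference step at query time.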
B. Methods of semi-automatic content modelling. Today’s computerised ontologies are
usually hand-crafted through extensive and costly empirical work. State-of-the-art
methods and tools for automatic ontology construction in large measure fail to capture
the relevant meaning structures. Examples of the shortcomings of current automatic
tools are misplacement of concepts in the ontology, e.g. sub-sub concepts of a concept
registered as subconcepts, and a typical lack of ability to capture other than hierarchical
relations (ISA and part-whole relations), thus ignoring associative relationships.
Semi-automatic content modelling is to be realized in part by applying principles of
terminological concept modelling. It is a major challenge to examine the interaction
between terminological ontologies and the concept of generative ontology. The backbone
of terminological ontologies is constituted by the ISA-relationship. However,
associative relationships form the basis for identifying characteristics of concepts which
can be used to introduce feature specifications on concepts. In ISA hierarchies,
subdividing dimensions which group sub-concepts can be introduced on the basis of
feature specifications, the advantages being that the ontology will build on more solid
principles. In the CAOS project (Computer-Aided Ontology Structuring) [31,33] a
number of principles have been proposed for working with feature specifications.
In the area of ontology building we will work with the problem of integrating
domain-specific and general ontologies. Furthermore, we will investigate the relation
between the formal ontology and the language-specific expressions found in the texts.
Ontology building also involves enhancement by inductively/statistically extracted
excerpts from documents reflected in associations and classifications based on
occurrences in documents. Our basic assumption is precisely that combining the notion
of generative ontology with inductive knowledge can successfully constitute the
foundation of a semantics-based approach to information access which will be realistic
also for very large databases.
4 Innovative aspects and relations to international and national research
As has been described in several articles concerning the retrieval of biomedical
information, there is a growing need for mechanisms to gather, organize and present
biomedical information.
The solutions have concentrated on integrating text mining, text classification and
ontologies, and research is presently ongoing in these areas. The EU Network of
Excellence “Semantic Interoperability and Data Mining in Biomedicine”1, created in 2004,
has gathered 25 institutions from 11 EU countries to support the development of
“generic methods and tools supporting the critical tasks of the field; data mining,
knowledge discovery, knowledge representation, abstraction and indexing of
information, semantic-based information retrieval in a complex and high-dimensional
information space, and knowledge-based adaptive systems for provision of decision
support for dissemination of evidence based medicine”.
Within the areas of text mining, data enhancements and semantic web technologies
there has been progress e.g. with respect to parsing techniques and classification
methods, but the area of organising the extracted data has only been sketched. An
important goal in the proposed project is to develop and implement technologies that
combine ontological engineering and knowledge discovery approaches, thereby adding
significant strength to the biomedical text mining techniques.
A. Knowledge representation and modelling. Knowledge representation and modelling
by ontologies is an active area of research, with description logic playing a crucial role in
providing notations for ontologies. Focussing on semantic mark-up of documents
(Semantic Web), proposals have been made for text mark-up languages, e.g., extensions
of XML and RDF, based on the mainstream description-logic paradigm.
1 http://www.semanticmining.org
A key notion in the adopted formal framework in this project is the algebraic notion
of lattice, which is the mathematical structure accounting for multi-hierarchical
classification organisations. However, algebraic lattices are to be extended with
attributes and relations in order to accommodate the conceptual models intended in the
project.
The use of lattices as the logical framework for ontology representation and
classification gives the project a distinct profile. At the same time, however, it does not
prevent the project from associating with similar contemporary research projects based
on, e.g., the description-logic paradigm. One advantage of our pursuit of lattice models
is that lattices, in contrast to description logic, come with suggestive classification
diagrams providing intuitive graphical visualizations of ontologies, e.g. by Hasse
diagrams or by entity set-relationship diagrams.
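The link between a lattice-ordered ontology and its Hasse diagram can be sketched as follows, assuming a small hypothetical set of stated ISA pairs. The Hasse diagram keeps only the covering pairs, so a concept that was also registered directly under its grandparent loses that redundant edge; the least upper bound (join) of two concepts is their most specific common generalization.

```python
from itertools import product

# Hypothetical stated ISA pairs (lower, upper); ("insulin", "substance") is redundant.
STATED = {
    ("insulin", "hormone"),
    ("insulin", "drug"),
    ("hormone", "substance"),
    ("drug", "substance"),
    ("insulin", "substance"),
}


def strict_order(pairs):
    """Transitive closure: all (a, b) such that a lies strictly below b."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def hasse_edges(pairs):
    """Covering pairs only: (a, b) with no concept strictly between a and b."""
    below = strict_order(pairs)
    nodes = {n for pair in below for n in pair}
    return {
        (a, b)
        for (a, b) in below
        if not any((a, c) in below and (c, b) in below for c in nodes)
    }


def join(a, b, pairs):
    """Least upper bound of a and b, or None if no unique one exists."""
    below = strict_order(pairs)
    nodes = {n for pair in below for n in pair}

    def leq(x, y):
        return x == y or (x, y) in below

    uppers = [u for u in nodes if leq(a, u) and leq(b, u)]
    least = [u for u in uppers if all(leq(u, v) for v in uppers)]
    return least[0] if least else None


print(sorted(hasse_edges(STATED)))
print(join("hormone", "drug", STATED))  # substance
```

The closure computation here is deliberately naive and suited only to small examples; it says nothing about how the lattice framework would be implemented at scale.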
However, from an implementation perspective the lattice algebraic approach is
ambitious, though progress has been made recently through our collaboration with Prof.
Mai Gehrke, New Mexico, USA, one of the leading mathematicians in this field. As a
back-up position, we have proposed a notion of conceptual or ontological grammars in
the OntoQuery research project. This notion provides much of the representational and
computational expressivity of the lattice framework, but is more tractable and
instructive. The potential of this tool is yet to be explored and the present project is a
perfectly suited application for this.
B. Search, indexing and mining. Search and indexing, too, are active fields of research
in, e.g., information retrieval and databases, recently also attracting attention due to
internet use.
Generally, in these fields of research the focus is on indexing, evaluation, and
ranking techniques, exhibiting results based on methods from AI and involving
statistical analysis and data mining. Only to a limited extent have knowledge modelling
and search been linked, due to the fact that knowledge structures, e.g., ontologies,
cannot function as a basis if search is prepared using only string- and word-based
indexing. If concepts are to be referenced in a description resulting from an automatic
indexing, this indexing must involve recognizing concepts based on text fragments. The
OntoQuery project has worked with linking ontological modelling of domain
knowledge and ontology-based search. The link is brought about by an analysis
unfolding an ontological semantics for natural language.
C. Viewing structured information. An area characterized by many scattered
suggestions but only limited coherent research efforts is that of visualization of
information. Inspired by the natural graphical representations coming with the lattice
organization of ontologies, one of the ideas we wish to pursue in the project is to enable
visualization of conceptual content in the form of so-called "conceptual globes". In these
globes the lattice-structured ontological diagrams are arranged so that the centre of the
globe forms the top node with specializations sprouting radially in all directions. This is
suggested as a "mental" conceptualisation as well as a graphical one. This organisation
offers radial regionalisation as sectors, wedges, and cones as well as spherical
regionalisation as "continents", cartographic projections, slices, and perspectives.
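The radial arrangement can be illustrated with a two-dimensional simplification, assuming a hypothetical tree-shaped fragment of an ontology (a general lattice with multiple parents and a true spherical layout are beyond this sketch). The top node sits at the centre, the depth of a concept determines its radius, and each subtree receives an angular wedge proportional to its number of leaves.

```python
import math

# Hypothetical tree-shaped ontology fragment: concept -> list of specializations.
TREE = {
    "substance": ["hormone", "drug"],
    "hormone": ["insulin", "glucagon"],
    "drug": ["metformin"],
}


def leaves(node):
    """Number of leaf concepts below (or equal to) the given node."""
    kids = TREE.get(node, [])
    return 1 if not kids else sum(leaves(k) for k in kids)


def radial_layout(root):
    """Map each concept to (x, y): radius = depth, angle = middle of its wedge."""
    positions = {}

    def place(node, depth, start, end):
        angle = (start + end) / 2
        positions[node] = (depth * math.cos(angle), depth * math.sin(angle))
        kids = TREE.get(node, [])
        total = sum(leaves(k) for k in kids)
        cursor = start
        for kid in kids:
            # Each subtree gets a share of the parent's wedge.
            width = (end - start) * leaves(kid) / total
            place(kid, depth + 1, cursor, cursor + width)
            cursor += width

    place(root, 0, 0.0, 2 * math.pi)
    return positions


for name, (x, y) in radial_layout("substance").items():
    print(f"{name:10s} radius={math.hypot(x, y):.1f}")
```

Sectors and wedges then correspond to angular intervals of this layout, which is one possible reading of the radial regionalisation mentioned above.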
5 Strategic impact and relevance to society and industry
In order to be competitive, companies need access to the contents of the ever-increasing
amount of documentation about their products, processes and projects. Only a
semantics-based approach to information management addressing content is adequate to
that task.
High-tech and research based companies wish to document as accurately as possible
the immaterial values they command, in particular, the knowledge and expertise
available from all resources in the company. This is a strategic concern in the company.
In principle, a company or organisation using the present approach as described above
will be able to answer questions like: What do we know about the field of knowledge
X? From which resources (e.g. collection of documents) is this knowledge available?
How does X relate to the overall landscape of knowledge commanded by the company?
The possibility of answering such questions is brought about by extracting
knowledge from company documents and representing this knowledge in a flexible and
powerful language internal to the computer. The content of each document is described
as a set of arbitrarily complex conceptual descriptors formulated in this language,
which facilitates detailed comparison of the content of documents. The properties of an
ontology-based system as sketched above lead to easier access to data sources, locally
as well as globally, and to integration of the scientific research with the information
available on the subject within Novo Nordisk and in the scientific literature.
Altogether, the proposed system will enable the research units within Novo Nordisk
to structure the findings and documentation on the experiments, to relate and discover
information relevant to the research subjects, to access this information easily, and
to browse the globally available information on the subjects.
The academic relevance of the proposed system is significant within the areas of
ontological engineering, semantic web technologies and applications of text mining
technologies, and the conclusions drawn from the construction of the system are
intended to contribute to the scientific work done in these research areas.
6 References