
Information extraction

Information extraction (IE) is a subfield of information retrieval whose task is to extract structured data from unstructured sources. We focus mainly on textual web sources. Tim Berners-Lee envisioned a Semantic Web in which all information on the internet is represented as a semantic graph. In practice, very few web sources are fully semantically annotated, and the vast majority of data published in the future will not be hand-labelled either. That is why we try to approximate the Semantic Web through ontology-based information extraction.


State-of-the-art research on IE is divided into two main approaches: rule-based and machine-learning-based. Traditionally, systems use manually defined rules to extract data. Rules can be hardcoded into applications, given explicitly as regular expressions, or defined in other pattern languages. They can also be generated semi-automatically using seed expansion techniques, where an initial set of results is manually annotated and extraction rules are then derived from it. Such approaches are usually unaware of emerging patterns and regularities in the data, so data mining techniques can be used to uncover these patterns even when the origin of the data or its inherent uncertainties are unknown. Recently, machine learning classifiers (e.g. naive Bayes, logistic regression, support vector machines and especially conditional random fields) have become popular, and they give the best results on large datasets.
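As a rough illustration of the rule-based approach, the sketch below applies a few hand-written regular expressions to a sentence. The rule names, the patterns and the `extract` helper are assumptions made for this example only; they are not taken from our system.

```python
import re

# Hypothetical hand-written rules: each label maps to one regular expression.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def extract(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    results = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            results.append((label, match.group(), match.start(), match.end()))
    return sorted(results, key=lambda r: r[2])


sample = "Contact jane.doe@example.org about the 2012 workshop."
for label, value, start, end in extract(sample):
    print(label, value, start, end)
```

Every new kind of data needs a new pattern written by hand, which is the maintenance cost that motivates semi-automatic rule generation and learned classifiers.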

Our contribution

We combine both approaches and also take semantic data into account. Our goal is a completely "parameter-less" information extraction system that leverages and populates a custom ontology, which, besides the textual data, is the only input to the system. To achieve this goal, we are going to merge the following base IE tasks:

- ENTITY EXTRACTION: Labeling of words, phrases or symbols with the appropriate ontology concepts.
- RELATION EXTRACTION: Identification of specific n-ary relations between the mentioned concepts.
- COREFERENCE RESOLUTION: Clustering of entity mentions (i.e. references to specific entities), represented implicitly or as concepts or pronouns.
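
To make the three tasks concrete, here is a deliberately naive sketch that runs each of them on one sentence. It is not the IOBIE method: the toy `ONTOLOGY`, the single "worked at" rule, the `worksAt` relation name and the nearest-antecedent pronoun heuristic are all assumptions chosen to keep the example short.

```python
import re

# Toy ontology (hypothetical): surface form -> concept.
ONTOLOGY = {"Tim Berners-Lee": "Person", "CERN": "Organization"}
PRONOUNS = {"he", "she"}


def extract_entities(text):
    """Entity extraction: label known surface forms with ontology concepts."""
    mentions = []
    for surface, concept in ONTOLOGY.items():
        start = text.find(surface)
        while start != -1:
            mentions.append((start, surface, concept))
            start = text.find(surface, start + 1)
    return sorted(mentions)


def extract_relations(text, mentions):
    """Relation extraction: one hand-written binary rule, 'X worked at Y'."""
    relations = []
    for (s1, a, c1), (s2, b, c2) in zip(mentions, mentions[1:]):
        between = text[s1 + len(a):s2]
        if c1 == "Person" and c2 == "Organization" and "worked at" in between:
            relations.append(("worksAt", a, b))
    return relations


def resolve_pronouns(text, mentions):
    """Coreference: link each pronoun to the nearest preceding Person mention."""
    links = []
    for token in re.finditer(r"\w+", text):
        if token.group().lower() in PRONOUNS:
            people = [m for m in mentions
                      if m[2] == "Person" and m[0] < token.start()]
            if people:
                links.append((token.group(), people[-1][1]))
    return links


sample = "Tim Berners-Lee worked at CERN. He proposed the Web there."
entities = extract_entities(sample)
print(entities)
print(extract_relations(sample, entities))
print(resolve_pronouns(sample, entities))
```

In this sketch the three steps run one after another and each only consumes the previous step's output, whereas the goal stated above is to merge the tasks so that they can inform each other.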

IOBIE - Intelligent Ontology-based Information Extraction


More information about the latest research results can be found on the researchers' websites: