1. Introduction
Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken or written. However, human language is often imprecise and ambiguous; sometimes it is hard to understand even for another person. NLP is therefore very complicated and combines fields such as computer science, artificial intelligence and linguistics.
Current approaches to NLP are based on machine learning, especially statistical machine learning, which relies on the analysis of large corpora of typical real-world examples (sets of documents that have been manually annotated with the correct values to be learned).
Common NLP tasks in software nowadays include the following:
Word segmentation
Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
Morphological segmentation
Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.
Tagging
Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others.
Parsing
Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).
Chunking
Also called shallow parsing, chunking identifies parts of speech and short phrases (such as noun phrases). Part-of-speech tagging tells you whether words are nouns, verbs, adjectives, etc., but it gives no clue about the structure of the sentence or of the phrases in it. Sometimes it is useful to have more information than just the parts of speech of words, without needing the full parse tree that a parser would produce. Named entity recognition is an example of a task where chunking may be preferable.
Named entity recognition
Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.
2. Computer Tools
The field of natural language processing is still developing, and we will probably have to wait a few years until the current methods and tools can be used in real-life applications. Nevertheless, there is already interesting software worth noting. These are mostly free tools developed primarily by academic groups from many countries. Naturally, the most mature software has been developed for the English language.
2.1. NLTK
The Natural Language Toolkit (known commonly as NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing for the Python programming language. NLTK comes with a number of corpora that have been preprocessed (often manually) to various degrees. Processing is organized in layers, where conceptually each layer relies on the processing in the adjacent lower layer: tokenization comes first, then words are tagged, then groups of words are parsed into grammatical elements such as noun phrases or sentences, and finally sentences or other grammatical units can be classified. NLTK can also generate statistics about the occurrences of various elements and draw graphs that represent statistical aggregates of the results.
Some examples of NLTK usage are presented below.
Word relations
Using corpora for different languages, such as the very popular WordNet (a lexical database for English) or Słowosieć (WordNet's equivalent for Polish, with fewer features), one can get basic information about each word, such as its synonyms, definitions, etc.
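For example, a minimal lookup with NLTK's WordNet interface (assuming a recent NLTK release with the WordNet corpus downloaded; the word "dog" is only an illustrative query) could look as follows:

```python
from nltk.corpus import wordnet as wn  # requires the WordNet corpus (nltk.download('wordnet'))

# List the senses (synsets) of an illustrative word with their definitions and synonyms.
for synset in wn.synsets("dog"):
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", synset.lemma_names())
```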

Tokenization and tagging
Tokenization splits a whole sentence into single units (tokens). Tagging describes those units as parts of speech; for example, RB denotes adverbs, NN nouns, MD modals, WDT wh-determiners, and VBZ a verb in the third person singular present tense.
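A minimal sketch of these two steps in NLTK (using the same sentence as in the named entity example below; the tokenizer and tagger resources are assumed to be installed via nltk.download()):

```python
import nltk

sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # annotate each token with a Penn Treebank tag
print(tagged)                           # e.g. [('At', 'IN'), ('eight', 'CD'), ...]
```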

Named entities identification
The named entity chunker can be used on tagged text. For example, from the sentence At eight o'clock on Thursday morning Arthur didn't feel very good. it extracts two trees: an outer S tree and an inner PERSON tree.
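A hedged sketch of that step, reusing the tagged tokens from the previous example (the NE chunker resources are assumed to be installed):

```python
import nltk

sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = nltk.ne_chunk(tagged)   # wraps recognized entities in subtrees, e.g. (PERSON Arthur/NNP)
print(tree)                    # the outer S tree containing the inner PERSON subtree
# tree.draw()                  # optionally display the tree in a window
```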

2.2. MaltParser
MaltParser is a system for data-driven dependency parsing. It is an implementation of inductive dependency parsing, where the syntactic analysis of a sentence amounts to the derivation of a dependency structure, and where inductive machine learning is used to guide the parser at non-deterministic choice points. MaltParser is developed at Växjö University and Uppsala University. It is written in Java, but can be executed from NLTK in Python. In the example below, MaltParser was used on a tokenized and tagged sentence.

For this example the engmalt.linear-1.7 model was used; it is trained on the Penn Treebank. Models are also available for other languages such as French, Swedish or even Polish (prepared by Alina Wróblewska from IPI PAN and trained on the Polish Dependency Bank).

The tree output shows only words, but the textual output generated by MaltParser (from which the tree has been drawn) also describes parts of speech and Stanford dependency types, e.g. advmod (adverbial modifier), nsubj (nominal subject) or dobj (direct object). Such information may be useful for further text processing.
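As a rough illustration (not necessarily the exact setup used here), MaltParser can be driven from Python through NLTK's wrapper. The installation directory and model file name below are assumptions, and the wrapper's constructor arguments differ between NLTK releases; a recent NLTK 3 is assumed:

```python
from nltk.parse.malt import MaltParser

# Assumed local installation paths; adjust to where MaltParser and the
# pre-trained engmalt.linear-1.7 model are actually unpacked.
parser = MaltParser("maltparser-1.7.2", "engmalt.linear-1.7.mco")

tokens = "This is just an example sentence .".split()
graph = parser.parse_one(tokens)   # returns an NLTK DependencyGraph
print(graph.to_conll(4))           # word, POS tag, head index, dependency relation
print(graph.tree())                # tree view built from the dependencies
```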
2.3. C&C / Boxer
C&C and Boxer are tools included in the C&C package. The first is a parser (using combinatory categorial grammar) developed by James Curran (University of Sydney) and Stephen Clark (University of Oxford); the second is a semantic analysis tool developed by Johan Bos (University of Edinburgh). Both tools can produce graphical and textual (Prolog or XML) output.
Let us use C&C and Boxer on an example sentence: This is just an example sentence that anyone can write.

The parsed sentence is divided into many related parts (NP for a noun phrase and N for nouns; there can also be VP for a verb phrase or D for a determiner). In addition, each word is described as a part of speech, using symbols similar to those of the NLTK tagger.

Boxer output is based on Discourse Representation Theory, a framework used in linguistics for exploring meaning under a formal semantics approach.
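Both tools are command-line programs, so the usual two-step pipeline can be scripted, for instance from Python. The sketch below is only illustrative: the binary locations and option names are assumptions based on common C&C usage and must be verified against the installed version:

```python
import subprocess

# Write the example sentence to a file for the parser.
with open("input.txt", "w") as f:
    f.write("This is just an example sentence that anyone can write.\n")

# Step 1: parse with the C&C parser, producing output in the format Boxer expects.
# Paths and option names are assumptions; consult the documentation of your C&C build.
subprocess.run(["bin/candc", "--models", "models",
                "--candc-printer", "boxer",
                "--input", "input.txt", "--output", "parsed.ccg"], check=True)

# Step 2: run Boxer on the parser output to obtain Discourse Representation Structures.
subprocess.run(["bin/boxer", "--input", "parsed.ccg", "--output", "output.drs"],
               check=True)
```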
2.4. TaKIPI
TaKIPI is a morphosyntactic tagger for the Polish language. It assigns contextually appropriate morphosyntactic descriptions (tags) to subsequent words in a text. The name is an acronym: Tager Korpusu IPI PAN (Tagger of the IPI PAN Corpus). The TaKIPI software has been developed at Wrocław University of Technology and the Polish Academy of Sciences. It is just a console application, so using it from any programming language requires writing a wrapper that runs TaKIPI with the appropriate arguments.
For the input sentence To jest przykładowe zdanie. (This is an example sentence.) the output is the following:

As can be seen, for each word of the input text TaKIPI prints base forms (lemmas), which are precisely tagged as parts of speech. TaKIPI uses Morfeusz, a morphological analyser for Polish. Morfeusz can also be used as a separate tool, for example from Java or Python.
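A minimal sketch of the wrapper mentioned above, in Python; the executable name and the way input and output are exchanged are assumptions that have to be adapted to the installed TaKIPI version:

```python
import subprocess

def run_takipi(text, takipi_cmd="takipi"):
    """Run the TaKIPI console application on the given text and return its output.

    The executable name and the assumption that text is passed via stdin/stdout
    are placeholders; adjust them to the options of the installed TaKIPI version.
    """
    result = subprocess.run([takipi_cmd], input=text, capture_output=True,
                            text=True, check=True)
    return result.stdout

print(run_takipi("To jest przykładowe zdanie."))
```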
2.5. CLARIN
Common Language Resources and Technology Infrastructure (CLARIN) is a European consortium working on language resources and software tools for natural language processing. The resources include digital archives, corpora, electronic dictionaries and language models. The tools help perform such tasks as syntactic and semantic analysis or speech recognition.
The tools of the Polish CLARIN group are at an early stage of development, but a few (unfinished) services are already available, for example a tagger, a chunker and a relation chunker; these segment a text into phrases (nominal, verbal, adjectival), identify the dominant elements in the phrases and determine relations between words. Demonstration versions of the aforementioned tools are available on the project's web page. They can produce interactive HTML output or a simple text document.

ChunkRel tokenizes the text, tags each word as a part of speech and annotates it with chunk labels (NP, VP, ADJP for an adjectival phrase and AGP for an agreement phrase). Relations between parts of the text are defined in a separate node of the XML output document.
3. Experiment
In the experiment I wanted to prepare an application able to extract some information from a few (more than one) sentences written in Polish, using one or more of the tools described above.
First of all, I assumed that the output of the application would be a simplified version of the resource description models used within the semantic web, i.e. triples consisting of a subject, an object and a predicate between them.
Another assumption is that the CLARIN chunker with relations will be used to process the plain text before it is parsed by my application.
The last assumption is that the software is prepared (at least so far) to work with very simple sentences, just to study the possibilities, get familiar with the advantages and disadvantages, and see whether my approach is good enough to be extended or another approach should be proposed.

Because my idea is based on the semantic web, one more element has to be defined before a text can be processed: an ontology, which defines a set of concepts, the relationships between them and the available vocabulary. In my project each concept is defined by at least 3 properties (and at most 5). These properties are the following:
- Subject Type (required): the subject's part of speech: noun, verb, adjective or adverb
- Subject Form (optional): describes a noun subject by its grammatical case
- Object Type (required): the object's part of speech: noun, verb, adjective or adverb
- Object Form (optional): describes a noun object by its grammatical case
- Base Form (required): the predicate name, e.g. the base form of the verb
At the current phase of the work I have defined 6 base concepts, each shown below with its pattern, an example sentence and the extracted triple:
Nominative Noun ----- być ----> Instrumental Noun
Kot jest zwierzęciem. (A cat is an animal.)
kot ----- być ----> zwierzę

Nominative Noun ----- mieć ----> Accusative Noun
Pies ma uszy. (A dog has ears.)
pies ----- mieć ----> ucho

Nominative Noun ----- lubić ----> Accusative Noun
Kura lubi jajka. (A hen likes eggs.)
kura ----- lubić ----> jajko

Nominative Noun ----- lubić ----> Verb
Krowa lubi muczeć. (A cow likes to moo.)
krowa ----- lubić ----> muczeć

Noun ----- jaki ----> Adjective
Ciężki słoń. (A heavy elephant.)
słoń ----- jaki ----> ciężki

Nominative Noun ----- co robi ----> Verb
Struś biega. (An ostrich runs.)
struś ----- co robi ----> biegać
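The application itself is written in C# (see below), but purely as an illustration the ontology could be expressed as a small table whose fields mirror the properties listed above. This Python sketch is not the actual implementation:

```python
# Illustrative representation of the six base concepts; the real application is in C#.
ONTOLOGY = [
    # subject type, subject form, base form (predicate), object type, object form
    {"subj_type": "noun", "subj_form": "nominative", "base": "być",
     "obj_type": "noun", "obj_form": "instrumental"},
    {"subj_type": "noun", "subj_form": "nominative", "base": "mieć",
     "obj_type": "noun", "obj_form": "accusative"},
    {"subj_type": "noun", "subj_form": "nominative", "base": "lubić",
     "obj_type": "noun", "obj_form": "accusative"},
    {"subj_type": "noun", "subj_form": "nominative", "base": "lubić",
     "obj_type": "verb", "obj_form": None},
    {"subj_type": "noun", "subj_form": None, "base": "jaki",
     "obj_type": "adjective", "obj_form": None},
    {"subj_type": "noun", "subj_form": "nominative", "base": "co robi",
     "obj_type": "verb", "obj_form": None},
]
```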
The input for the parser application is an XML file generated by CLARIN. So far I download it manually, but this could be done automatically by interacting with the web page (until the CLARIN source/binary files become available for download). These XML files are defined by the ccl.dtd file, which I used to generate classes, so each node can be mapped into an object. To simplify operations on those objects, I extended their classes with some methods (e.g. to check whether a given element is a noun or a verb).
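The real classes are generated from ccl.dtd in C#; the Python fragment below only sketches the same idea. The element names (tok, orth, lex, base, ctag) follow the CCL format used by the CLARIN tools, and the tag prefixes (subst for nouns; fin, praet, inf for verbs) come from the IPI PAN tagset, but in this sketch they should be treated as assumptions:

```python
import xml.etree.ElementTree as ET

class Token:
    """A single token with its orthographic form, lemma and morphosyntactic tag."""
    def __init__(self, orth, base, ctag):
        self.orth, self.base, self.ctag = orth, base, ctag

    def is_noun(self):
        return self.ctag.startswith("subst")

    def is_verb(self):
        return self.ctag.split(":")[0] in ("fin", "praet", "inf")

    def is_nominative(self):
        return ":nom" in self.ctag

def load_tokens(path):
    # Assumed CCL layout: <tok><orth>...</orth><lex><base>...</base><ctag>...</ctag></lex></tok>
    root = ET.parse(path).getroot()
    return [Token(tok.findtext("orth"),
                  tok.findtext("lex/base"),
                  tok.findtext("lex/ctag"))
            for tok in root.iter("tok")]
```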
Once the whole text is stored in memory in the form of objects, the proper processing is done. It includes finding the verb of a sentence, then finding noun and adjective phrases, and finally matching the data against the ontology elements. When no subject is found in a sentence (no nominative noun), the algorithm tries to match the last used subject with the current sentence. As a result, a collection of subjects connected with objects is created.
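A simplified sketch of that matching step, again in Python and operating on the Token objects and ONTOLOGY table from the previous sketches. Only the verb-based concepts are handled here, while the described application additionally works with noun and adjective phrases and takes grammatical cases into account:

```python
def extract_triples(sentences, ontology):
    """Very simplified matching of sentences against the ontology (illustrative only).

    `sentences` is a list of sentences, each being a list of Token objects.
    """
    triples, last_subject = [], None
    for tokens in sentences:
        verb = next((t for t in tokens if t.is_verb()), None)
        subject = next((t for t in tokens if t.is_noun() and t.is_nominative()), None)
        if subject is None:
            subject = last_subject              # reuse the subject of the previous sentence
        if verb is None or subject is None:
            continue
        for concept in ontology:
            if concept["base"] != verb.base:
                continue
            # pick the first remaining token whose part of speech matches the concept
            obj = next((t for t in tokens
                        if t is not subject and t is not verb
                        and ((concept["obj_type"] == "noun" and t.is_noun())
                             or (concept["obj_type"] == "verb" and t.is_verb()))), None)
            if obj is not None:
                triples.append((subject.base, concept["base"], obj.base))
                break
        last_subject = subject
    return triples
```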
The SIMPLE PARSER application has been developed in C# using the newest version of the .NET Framework (4.5); as the IDE I use Visual Studio 2012. However, it does not use any .NET-specific features, so it can easily be migrated to other object-oriented languages, e.g. Java.
In the pictures below I present some examples of the program's execution for simple input data.




Although the input data is very simple, I think it is possible to develop a more complex parser covering different types of sentences. The first thing to do is of course to extend the ontology, but more important, in my opinion, is developing a precise and 'intelligent' algorithm.
4. Materials
IPI PAN Tags description (official document)
Natural language processing (Wikipedia)
Natural Language Toolkit (official webpage)
Python's Natural Language Toolkit (IBM developerWorks)