Morphology in Computational Linguistics
1) Introduction
Ruslan Mitkov, a professor of Computing and Communications at Lancaster
University, edited the ‘Oxford Handbook of Computational Linguistics’, which
discusses how morphology, syntax, semantics, and pragmatics are applied in NLP
and computational linguistics.
Morphology is the study of the
internal structure of words and the meaningful units that compose them. These
units, called morphemes, may be roots, prefixes, suffixes, or grammatical
markers that modify meaning or function.
1. Root (base word):
· teach → the core meaning is “to instruct.”
2. Prefix:
· un- + happy → unhappy (prefix un- adds the meaning “not”).
3. Suffix:
· quick + -ly → quickly (suffix -ly changes an adjective into an adverb).
4. Grammatical marker (inflection):
· walk + -ed → walked (suffix -ed marks past tense).
In computational linguistics, the
knowledge of morphology becomes crucial because computers must not only process
whole words but also understand how words are formed.
Computational morphology applies techniques from computer science,
linguistics, and artificial intelligence to automatically analyse words
(break them into components) and generate words (construct surface forms from
grammatical features) in natural languages. Let’s take an example:
A computational system takes the word
“unhappiness” and automatically analyses it as:
· un- (prefix meaning “not”)
· happy (root)
· -ness (suffix forming a noun)
The same system can also generate a correct surface word. For
example, given the features:
· ROOT: happy
· PREFIX: un
· SUFFIX: ness
It will automatically construct the word “unhappiness”, changing the final y
of the root to i before the suffix.
This is essential for tasks like
machine translation, spell checkers, search engines, speech recognition,
document indexing, corpus annotation, and text-to-speech software.
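The “unhappiness” example above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the lexicon (PREFIXES, ROOTS, SUFFIXES) and the function names join, generate, and analyse are invented for this sketch, and the only spelling change handled is y → i.

```python
# Toy lexicon: every entry here is an illustrative assumption, not a real resource.
PREFIXES = {"un": "not"}
ROOTS = {"happy", "kind"}
SUFFIXES = {"ness": "noun-forming"}


def join(stem, suffix):
    """Attach a suffix, changing a final consonant + y to i (happy + ness -> happiness)."""
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
        stem = stem[:-1] + "i"
    return stem + suffix


def generate(root, prefix="", suffix=""):
    """Generation: build a surface word from a root plus an optional prefix and suffix."""
    word = join(root, suffix) if suffix else root
    return prefix + word


def analyse(word):
    """Analysis: return (prefix, root, suffix) for a combination that regenerates the word."""
    for prefix in [""] + list(PREFIXES):
        for suffix in [""] + list(SUFFIXES):
            for root in ROOTS:
                if generate(root, prefix, suffix) == word:
                    return prefix, root, suffix
    return None  # the word cannot be built from this toy lexicon


print(analyse("unhappiness"))
print(generate("happy", prefix="un", suffix="ness"))
```

Here analysis is done by trying every combination and checking whether generation reproduces the input, which only works for tiny lexicons. Practical systems use more efficient methods.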
Languages differ greatly in morphology. Isolating languages like Chinese use
little affixation, whereas agglutinative languages like Turkish or Finnish
contain long words formed from many morphemes. Thus, computational systems must
handle diverse patterns of word formation, making morphology a core study area
in language technology.
2) Overview of Morphology
Morphology studies how different
morphemes combine to form complex words. These morphemes can be:
· Free morphemes (can stand alone): book, run, chair
· Bound morphemes (cannot stand alone): -ing, -ed, un-, -s
Morphological processes include:
A. Inflection
Changes grammatical properties (tense, number, case) without changing category:
play → played, book → books
B. Derivation
Creates new words or categories:
happy → happiness, teach → teacher
C. Compounding
Joining two free morphemes:
blackboard, sunflower
Beyond prefixes and suffixes, some languages use infixes or circumfixes (as in
Arabic). Other processes include zero morphology, where the form does not
change (sheep → sheep), and subtractive morphology, where material is removed
from the base rather than added. The complexity of these processes requires
computers to learn or model many rules for correct analysis and generation.
3) Structure & Ambiguity in Morphology
Ambiguity is one of the greatest challenges in both analysis and generation in
natural language processing. Words can be morphologically ambiguous, meaning
one surface form can have multiple analyses. For example:
· Second (English) can function as a noun, an ordinal number, or a verb.
· Okuma (Turkish) can mean “reading”, “don’t read”, or “to my arrow”,
depending on morpheme boundaries (oku + ma or ok + um + a).
Computational morphology needs to choose the correct analysis based on
context. This requires identifying:
· Correct root or stem
· Proper affix boundaries
· Grammatical features (tense, mood, case, etc.)
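To make the ambiguity concrete, the sketch below enumerates every segmentation of “okuma” that a small lexicon allows. The lexicon, the category labels, and the glosses are simplified assumptions made up for this illustration, not a real description of Turkish.

```python
# Hypothetical toy lexicon for the Turkish example; glosses are simplified.
ROOTS = {"oku": ("read", "V"), "ok": ("arrow", "N")}

# Suffixes allowed after each category: (form, gloss, resulting category).
SUFFIXES = {
    "V": [("ma", "NEG", "V-done"), ("ma", "VERBAL-NOUN", "N")],
    "N": [("um", "POSS.1SG", "N-poss")],
    "N-poss": [("a", "DATIVE", "N-done")],
}


def analyses(word):
    """Return every way of splitting the word into a known root plus allowed suffixes."""
    results = []

    def extend(rest, category, glosses):
        if not rest:  # the whole word has been consumed: one complete analysis
            results.append(" + ".join(glosses))
            return
        for form, gloss, next_category in SUFFIXES.get(category, []):
            if rest.startswith(form):
                extend(rest[len(form):], next_category, glosses + [gloss])

    for root, (meaning, category) in ROOTS.items():
        if word.startswith(root):
            extend(word[len(root):], category, [f"{root} '{meaning}'"])
    return results


for analysis in analyses("okuma"):
    print(analysis)
```

The function returns three analyses for this one surface form. Picking among them needs information from the surrounding sentence, which this sketch does not attempt.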
4) Computational Morphology (Very Simple Explanation)
Computational morphology is about teaching computers how to understand and
create different word forms.
· Morphological Analysis (Breaking a word)
The computer takes a full word (surface form) and breaks it into:
Root word, Grammatical information (features)
Example:
walked → walk + PAST
(“walked” is the surface form, “walk” is the root, “PAST” is the tense)
· Morphological Generation (Making a word)
The computer starts with:
Root word, Features (tense, number, person, etc.)
…and creates the correct surface word.
Example:
walk + 3rd person + singular + present → walks
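The two directions can be sketched as a pair of functions over a shared table of endings. The root list, the feature tags (PAST, 3SG.PRES, PROG), and the function names are assumptions chosen for this illustration. Only regular verbs are covered and no spelling changes are applied.

```python
# Hypothetical regular verbs and feature tags, used only for illustration.
ROOTS = {"walk", "play", "talk"}
ENDINGS = {"PAST": "ed", "3SG.PRES": "s", "PROG": "ing"}


def generate(root, feature):
    """Generation: root + feature tag -> surface form."""
    return root + ENDINGS[feature]


def analyse(surface):
    """Analysis: surface form -> list of (root, feature) pairs."""
    found = []
    if surface in ROOTS:
        found.append((surface, "BASE"))
    for feature, ending in ENDINGS.items():
        stem = surface[: -len(ending)]
        if surface.endswith(ending) and stem in ROOTS:
            found.append((stem, feature))
    return found


print(analyse("walked"))
print(generate("walk", "3SG.PRES"))
```

Because both functions read the same ENDINGS table, anything the generator can produce the analyser can also take apart. This plain concatenation would produce wrong spellings for stems such as carry or make.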
To do this correctly, two things are
needed:
A. Morphotactics (Order of morphemes)
These are rules about which pieces of words can join together and in what order.
Example:
In English, you can add -ed after a verb, but you cannot say edwalk.
B. Morphophonemics (for spelling changes, Morphographemics)
These are the sound changes (morphophonemics) or spelling changes
(morphographemics) that happen when affixes attach.
Examples:
carry + ed → carried (y changes to i)
make + ing → making (drop the e)
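The two spelling changes in these examples can be written as small rules applied at the boundary between stem and suffix. This is a rough sketch: the function name attach and the exact conditions are assumptions, and real English spelling has more rules and exceptions than are shown here.

```python
VOWELS = tuple("aeiou")


def attach(stem, suffix):
    """Join a stem and a suffix, applying two simplified English spelling rules."""
    # Rule 1: consonant + y becomes i, unless the suffix itself starts with i
    # (carry + ed -> carried, but carry + ing -> carrying).
    if (stem.endswith("y") and len(stem) > 1
            and not stem[-2:-1].startswith(VOWELS)
            and not suffix.startswith("i")):
        return stem[:-1] + "i" + suffix
    # Rule 2: drop a final e before a vowel-initial suffix (make + ing -> making),
    # but keep a double e (see + ing -> seeing).
    if stem.endswith("e") and not stem.endswith("ee") and suffix.startswith(VOWELS):
        return stem[:-1] + suffix
    # Default: plain concatenation (walk + ed -> walked).
    return stem + suffix


for stem, suffix in [("carry", "ed"), ("make", "ing"), ("carry", "ing"), ("play", "ed")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```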
In short:
A computational morphology system
must understand:
· Which parts can combine (morphotactics), and
· How spelling/sound changes happen (morphophonemics).
Only then can a computer correctly
break words apart or form new ones.
5) Finite-State Morphology
A Finite-State Transducer (FST) is a finite-state machine whose arcs carry
pairs of symbols, so it reads one string and writes another. In morphology it
maps between:
- Lexical level (root + grammar features): walk + PAST
- Surface level (the actual word): walked
Why FSTs are useful
- Very fast
- Can both analyse and generate words, because the same transducer can be run
in either direction
- Store rules in a small, compact way
- Handle morpheme order (morphotactics)
- Handle spelling changes (morphographemics)
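The idea that one transducer serves both directions can be shown with a very small hand-built FST. This is a sketch under stated assumptions: the FST class, the build function, and the tags +BASE, +PAST, and +3SG are invented here, an empty string stands for an epsilon (nothing read or written), and no real finite-state toolkit is used.

```python
class FST:
    """A tiny transducer: each arc pairs a lexical symbol with a surface symbol."""

    def __init__(self):
        self.arcs = {}       # state -> list of (lexical, surface, next_state)
        self.finals = set()
        self.n_states = 1    # state 0 is the start state

    def new_state(self):
        self.n_states += 1
        return self.n_states - 1

    def add_arc(self, src, lexical, surface, dst):
        self.arcs.setdefault(src, []).append((lexical, surface, dst))

    def run(self, symbols, side):
        """Return all outputs; side=0 reads the lexical level, side=1 the surface level."""
        results = []
        stack = [(0, 0, [])]  # (state, position in input, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols) and state in self.finals:
                results.append("".join(out))
            for arc in self.arcs.get(state, []):
                read, write, dst = arc[side], arc[1 - side], arc[2]
                if read == "":  # epsilon on the input side: move without consuming
                    stack.append((dst, pos, out + [write]))
                elif pos < len(symbols) and symbols[pos] == read:
                    stack.append((dst, pos + 1, out + [write]))
        return results


def build(roots):
    """Build a transducer for regular verbs: root letters, then one feature tag."""
    fst = FST()
    stem_end = fst.new_state()  # reached after any complete root
    final = fst.new_state()
    fst.finals.add(final)
    for root in roots:
        state = 0
        for i, ch in enumerate(root):
            nxt = stem_end if i == len(root) - 1 else fst.new_state()
            fst.add_arc(state, ch, ch, nxt)  # root letters map to themselves
            state = nxt
    fst.add_arc(stem_end, "+BASE", "", final)
    mid = fst.new_state()
    fst.add_arc(stem_end, "+PAST", "e", mid)  # +PAST is realised as "e" then "d"
    fst.add_arc(mid, "", "d", final)
    fst.add_arc(stem_end, "+3SG", "s", final)
    return fst


fst = build(["walk", "talk"])
print(fst.run(list("walk") + ["+PAST"], side=0))  # generation
print(fst.run(list("walked"), side=1))            # analysis
```

The same arcs are used for both calls. Only the choice of which side to read and which to write changes. A string such as “edwalk” yields no analysis because no path through the arcs matches it.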
6) Handling Morphotactics (Allowed Word Building Rules)
Morphotactics = rules about which morphemes can join together.
Example:
· Correct: dog → dogs
· Incorrect: sheeps, boyses → these must be blocked
In FSTs:
- Each word type (noun, verb, adjective) has its own small dictionary called a
sub-lexicon.
- These sub-lexicons say what is allowed:
  - nouns → can take plural
  - adjectives → cannot take plural
So morphotactics keeps word formation
legal and grammatical.
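The sub-lexicon idea can be sketched as a table that lists, for each class of stems, which suffixes may follow. The class names, the word lists, and the function is_legal are assumptions made for this illustration. The empty string stands for “no suffix”.

```python
# Hypothetical sub-lexicons: each class of stems has its own small dictionary.
SUBLEXICONS = {
    "noun_regular": {"dog", "boy", "book"},
    "noun_invariant": {"sheep"},  # plural looks the same as the singular
    "adjective": {"happy", "quick"},
}

# Which suffixes each class may take; "" means the bare stem is a word.
ALLOWED_SUFFIXES = {
    "noun_regular": {"", "s"},
    "noun_invariant": {""},
    "adjective": {""},
}


def is_legal(word):
    """Accept a word only if it is a known stem plus a suffix allowed for its class."""
    for name, stems in SUBLEXICONS.items():
        for suffix in ALLOWED_SUFFIXES[name]:
            stem = word[: len(word) - len(suffix)]
            if word.endswith(suffix) and stem in stems:
                return True
    return False


for word in ["dogs", "sheep", "sheeps", "boyses", "quicks"]:
    print(word, is_legal(word))
```

With this table, “dogs” and “sheep” are accepted, while “sheeps”, “boyses”, and “quicks” are blocked because no sub-lexicon licenses that stem and suffix combination.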
Conclusion
Computational morphology helps
computers:
- understand how words are built
- know the meaning of different word parts
- choose the correct word form
- handle tasks like translation, speech, and text
search
By using finite-state methods, rule
systems, and modern machine-learning models, computational morphology keeps
improving.
This makes language technology more accurate, faster, and better connected to
real linguistic knowledge.