Text Normalization - Preparing text for processing

Many NLP tasks require raw text to be transformed into a format that is suitable for processing. This process is called text normalization, and it includes tasks such as tokenization, lemmatization, stemming and sentence segmentation.

Tokenization is splitting the text into its terms (tokens). This is not always a matter of splitting on spaces. For example, instead of splitting ‘New York’ into ‘New’ and ‘York’, you may need to recognize that these two words go together.

Stemming is stripping the suffixes off a word, while lemmatization is determining that two words have the same root even though they may look different (e.g. sing and sang).
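To make the difference concrete, here is a minimal sketch. The suffix list and the lemma table are made-up illustrations, not a real stemming algorithm or lexicon. Note how the stemmer cannot relate sang to sing, while the lookup-based lemmatizer can.

```python
# Toy stemmer: strips a few common English suffixes.
# The suffix list is an illustrative assumption, not a real algorithm.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when a stem of at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written lookup table (hypothetical data).
LEMMAS = {"sang": "sing", "sung": "sing", "sings": "sing", "singing": "sing"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return LEMMAS.get(word, word)

words = ["sings", "singing", "sang"]
stems = [toy_stem(w) for w in words]        # ['sing', 'sing', 'sang']
lemmas = [toy_lemmatize(w) for w in words]  # ['sing', 'sing', 'sing']
print(stems, lemmas)
```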

Sentence segmentation refers to splitting the text into sentences, usually at a period, exclamation mark or question mark.
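A naive version of this can be sketched with a regular expression. The function name is hypothetical, and the rule (split on whitespace that follows ., ! or ?) is deliberately simplistic.

```python
import re

def naive_split_sentences(text):
    # Split on whitespace that follows ., ! or ?; the lookbehind keeps
    # the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = naive_split_sentences("It works. Does it? Yes!")
print(sentences)  # ['It works.', 'Does it?', 'Yes!']
```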

Before processing text, it’s important to know what counts as a word for your particular use case. Although this sounds simple, here are some questions worth thinking about:

  • Is a punctuation mark, like ! or ., a word?
  • Are words with the same lemma the same word type or different ones (e.g. sing vs sang)? A lemma is a set of forms that have the same stem and the same word meaning.
  • Are fillers, like uh and um, considered words?
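The answers change even simple statistics such as word counts. The sketch below counts the same sentence under different definitions of a word. The filler list and the lemma lookup are assumptions made up for the example.

```python
import re

text = "Um, I sing. I sang! Uh, you sing."
FILLERS = {"um", "uh"}     # assumed filler list
LEMMAS = {"sang": "sing"}  # assumed lemma lookup

lowered = text.lower()

# Definition 1: punctuation marks and fillers count as words.
with_punct = re.findall(r"\w+|[^\w\s]", lowered)

# Definition 2: only real words, with fillers removed.
words_only = [w for w in re.findall(r"\w+", lowered) if w not in FILLERS]

# Word types: distinct surface forms vs distinct lemmas.
word_types = set(words_only)
lemma_types = {LEMMAS.get(w, w) for w in words_only}

print(len(with_punct), len(words_only), len(word_types), len(lemma_types))
# 13 6 4 3
```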

Usually, the first step of text processing is tokenization: separating the text into words. Some of the difficulties in tokenization are:

  • recognizing named entities like ‘New York City’, ‘N.B.A.’, ‘Nobel Prize’
  • deciding what to do with contractions
  • and other considerations of what constitutes a word
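As a rough sketch of how the first two difficulties might be handled with rules, the example below keeps listed multiword entities together and expands listed contractions. Both lists are hypothetical; a real system would rely on a much larger lexicon or a trained model.

```python
import re

# Hypothetical lists for illustration only.
MULTIWORD = ["New York City", "Nobel Prize"]
CONTRACTIONS = {"don't": ["do", "n't"], "it's": ["it", "'s"]}

# Alternatives are tried left to right: multiword entities first, then
# words containing an apostrophe, then plain words, then single
# punctuation marks.
PATTERN = re.compile(
    "|".join(re.escape(p) for p in MULTIWORD) + r"|\w+'\w+|\w+|[^\w\s]"
)

def tokenize(text):
    tokens = []
    for piece in PATTERN.findall(text):
        # Expand known contractions; keep every other piece as one token.
        tokens.extend(CONTRACTIONS.get(piece.lower(), [piece]))
    return tokens

tokens = tokenize("I don't live in New York City.")
print(tokens)
# ['I', 'do', "n't", 'live', 'in', 'New York City', '.']
```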

Character normalization is also important to consider. This means looking at your character sets and making sure everything uses the same one. You may also have to decide whether to lowercase your text. Lowercasing can lose information (for example, ‘US’ and ‘us’ become identical), so again it depends on the data set.
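One common way to do this in Python is Unicode normalization from the standard library. The sketch below is only one possible approach: the function name is hypothetical, and NFKC is chosen as an assumption because it folds compatibility characters such as ligatures into their plain equivalents.

```python
import unicodedata

def normalize_chars(text, lowercase=False):
    # NFKC composes combining sequences and replaces compatibility
    # characters (ligatures, full-width letters) with standard ones.
    text = unicodedata.normalize("NFKC", text)
    return text.lower() if lowercase else text

# The 'fi' ligature (U+FB01) becomes the two letters 'f' and 'i'.
print(normalize_chars("\ufb01ne") == "fine")

# 'e' followed by a combining acute accent becomes a single character.
print(normalize_chars("cafe\u0301") == "caf\u00e9")

# Lowercasing makes 'US' and 'us' indistinguishable.
print(normalize_chars("US", lowercase=True) == normalize_chars("us", lowercase=True))
```

All three comparisons print True, and the last one shows the information loss that lowercasing can cause.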

A good production example of a tokenizer is spaCy’s tokenizer. It’s considered a lossless tokenizer, meaning that the tokens can be converted back into the raw text because the whitespace information is preserved in the tokens.
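The idea of a lossless tokenizer can be sketched in a few lines of plain Python. This is not spaCy’s implementation, only an illustration of the principle: each token is stored together with the whitespace that follows it, so joining the pairs gives back the original string. The function names are hypothetical.

```python
import re

def lossless_tokenize(text):
    # Each token is a (text, trailing_whitespace) pair. Leading whitespace
    # in the input ends up attached to an empty first token.
    pairs = []
    for m in re.finditer(r"(\S*)(\s*)", text):
        if m.group(0):  # skip the empty match at the end of the string
            pairs.append((m.group(1), m.group(2)))
    return pairs

def detokenize(pairs):
    return "".join(word + space for word, space in pairs)

raw = "Hello,  world!\nBye."
pairs = lossless_tokenize(raw)
print(pairs)  # [('Hello,', '  '), ('world!', '\n'), ('Bye.', '')]
print(detokenize(pairs) == raw)  # True
```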

At a high level it does a few main things:

  • First it splits the string roughly into words by whitespace
  • Then it looks at each word substring and checks to see if it’s an ‘exception’. If it is, the tokens defined for that exception are added without any further processing
  • Otherwise, if it matches a prefix, suffix, or infix, the word is split up and the remaining substring is run through the same checks again

Here’s their pseudocode:

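def tokenizer_pseudo_code(text, special_cases, find_prefix, find_suffix, find_infixes):
    # special_cases maps an exception string to its list of tokens
    tokens = []
    for substring in text.split(' '):
        suffixes = []  # collected here and added after the word itself
        while substring:
            if substring in special_cases:
                # exception: emit its predefined tokens and stop
                tokens.extend(special_cases[substring])
                substring = ''
            elif find_prefix(substring) is not None:
                # split a prefix off the front, continue with the rest
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                # split a suffix off the end, continue with the rest
                split = find_suffix(substring)
                suffixes.append(substring[-split:])
                substring = substring[:-split]
            elif find_infixes(substring):
                # split around every infix match
                infixes = find_infixes(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                substring = substring[offset:]
            else:
                # nothing left to split: the remainder is a token
                tokens.append(substring)
                substring = ''
        # suffixes were stripped from the outside in, so reverse them
        tokens.extend(reversed(suffixes))
    return tokens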
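To see this loop in action, here is a self-contained sketch with made-up rules. The exception table and the three regular expressions are assumptions chosen for the example and are far smaller than any real rule set. The loop is restated in compact form so the example runs on its own, with one small addition: a guard that avoids emitting an empty token when an infix sits at the start of the substring.

```python
import re

# Hypothetical rules for illustration only.
SPECIAL_CASES = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIX_RE = re.compile(r"^[(\[{]")
SUFFIX_RE = re.compile(r"[)\]}.,!?]$")
INFIX_RE = re.compile(r"[-~]")

def find_prefix(s):
    m = PREFIX_RE.search(s)
    return m.end() if m else None         # length of the prefix

def find_suffix(s):
    m = SUFFIX_RE.search(s)
    return len(m.group()) if m else None  # length of the suffix

def find_infixes(s):
    return list(INFIX_RE.finditer(s))

def tokenize(text):
    tokens = []
    for substring in text.split(" "):
        suffixes = []
        while substring:
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                substring = ""
            elif find_prefix(substring) is not None:
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                split = find_suffix(substring)
                suffixes.append(substring[-split:])
                substring = substring[:-split]
            elif find_infixes(substring):
                offset = 0
                for match in find_infixes(substring):
                    if match.start() > offset:  # guard against empty tokens
                        tokens.append(substring[offset:match.start()])
                    tokens.append(match.group())
                    offset = match.end()
                substring = substring[offset:]
            else:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    return tokens

tokens = tokenize("(we don't visit well-known U.K. towns!)")
print(tokens)
# ['(', 'we', 'do', "n't", 'visit', 'well', '-', 'known', 'U.K.', 'towns', '!', ')']
```

Because ‘U.K.’ is listed as an exception, it is emitted whole before the suffix rule gets a chance to strip its final period.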

Sentence segmentation is one more component of normalization. The main issue here is deciding what counts as a sentence boundary. Is it always a punctuation mark like a period, question mark or exclamation mark? How should abbreviations, which also end in a period, be dealt with?
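One simple way to deal with abbreviations is to keep a list of them and refuse to end a sentence on one. The sketch below assumes a tiny, made-up abbreviation list and a hypothetical function name. It still gets the wrong answer when an abbreviation really does end a sentence.

```python
# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_with_abbreviations(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A token ends a sentence if it ends in . ! or ? and is not a
        # known abbreviation.
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

result = split_with_abbreviations("Dr. Smith arrived. He sat down.")
print(result)  # ['Dr. Smith arrived.', 'He sat down.']
```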

Once these tasks are completed, raw text is turned into a format that is more suitable for processing.