text-analysis

Session on text analysis with NLTK, including discussion of cleaning data, creating text corpora, and analyzing texts programmatically.



Cleaning and Normalizing

Generally, however, our questions are more about topics than writing style. So, once we have a corpus—whether that is one text or millions—we usually want to clean and normalize it. There are three terms we are going to need: stop words, stemming, and lemmatization.

You probably know what removing punctuation and capitalization refer to, but the other terms may be new:

- Stop words are words that appear frequently in a language (the, is, and the like in English); they add grammatical structure but little semantic content, so they are often removed.
- Stemming chops the endings off words to reduce related forms to a common root, so burning becomes burn.
- Lemmatization also reduces burning to burn, but by looking each word up and mapping it to its dictionary form (its lemma) rather than by trimming characters.
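NLTK ships with tools for all three. As a minimal sketch (assuming the relevant NLTK data packages have been downloaded), here is how its English stop word list, Porter stemmer, and WordNet lemmatizer behave:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the data these tools rely on.
nltk.download('stopwords')
nltk.download('wordnet')

# A few of the English stop words NLTK knows about.
print(stopwords.words('english')[:8])

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming trims the word down; lemmatization looks it up,
# here treating it as a verb.
print(stemmer.stem('burning'))                   # 'burn'
print(lemmatizer.lemmatize('burning', pos='v'))  # 'burn'
```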

Language is messy: it was created for and by people, not computers. A sentence contains a lot of grammatical information that a computer cannot use. For example, I could say to you

The house is burning.

and you would understand me. You would also understand if I said

house burn.

The first has more information about tense, and about which house in particular, but the sentiment is the same either way.

In going from the first sentence to the normalized words, we removed the stop words (the and is), removed the punctuation and case, and lemmatized what was left (burning becomes burn, though we might equally have stemmed it; it's impossible to tell from this example). This results in what is essentially a “bag of words,” or a corpus of words without any structure. Because normalizing your text reduces the number of words (and therefore the number of dimensions in your data) and keeps only the words that contribute meaning to the document, this cleaning is usually desirable.
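Put together, a small NLTK pipeline along these lines takes the original sentence down to that bag of words. This is a sketch of one possible way to do it, assuming the tokenizer models, stop word list, and WordNet data are available:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Tokenizer models, stop word list, and WordNet data
# (recent NLTK versions may also need 'punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

sentence = "The house is burning."

# Lowercase, then split the sentence into tokens.
tokens = nltk.word_tokenize(sentence.lower())

# Drop punctuation and stop words.
stops = set(stopwords.words('english'))
words = [t for t in tokens if t.isalpha() and t not in stops]

# Lemmatize what is left, treating each word as a verb.
lemmatizer = WordNetLemmatizer()
normalized = [lemmatizer.lemmatize(w, pos='v') for w in words]

print(normalized)  # ['house', 'burn']
```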

For now, we just need to know that there are “clean” and “dirty” versions of text data. Sometimes our questions are about the clean data, but sometimes our questions are in the “dirt.”

Words into Numbers

In the next section, we are going to go through a series of methods built into NLTK that allow us to turn our words into numbers and visualizations. This only scratches the surface, but it should give you an idea of what is possible beyond simply counting words.
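As a tiny preview of that counting baseline, NLTK's FreqDist turns a bag of words into word frequencies. The word list here is just a made-up example:

```python
from nltk import FreqDist

bag = ['house', 'burn', 'house', 'fire']  # a toy bag of words
fdist = FreqDist(bag)

print(fdist['house'])        # 2
print(fdist.most_common(2))  # [('house', 2), ('burn', 1)]
```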
