Session on text analysis with NLTK, including discussion of cleaning data, creating text corpora, and analyzing texts programmatically.
The first function we will look at is concordance. “Concordance” in this context means the text on either side of the word. Here our text is behaving like a string: as discussed in the Python tutorial, Python does not interpret the contents of a string, so the context is measured in characters (including spaces) rather than in words or sentences. By default, each line of output is 79 characters wide with the target word in the middle, and up to 25 matches are shown.
In the Jupyter Notebook, type:
text1.concordance("whale")
The output shows each occurrence of the word “whale” in Moby Dick, centered on its own line with the text that surrounds it. Let’s try this with another word, “love.” Just replace the word “whale” with “love,” and we get the contexts in which Melville uses “love” in Moby Dick. Concordance is used (behind the scenes) for several other functions, including similar and common_contexts.
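To make the idea of a character-based context window concrete, here is a minimal sketch of a keyword-in-context display written in plain Python. It is not NLTK's implementation: the function name concordance_lines and the max_lines parameter are made up for illustration, and the sample sentence is invented. NLTK's own concordance method accepts width and lines arguments that play a similar role.

```python
def concordance_lines(tokens, word, width=79, max_lines=25):
    """Return keyword-in-context lines, each at most `width` characters wide.

    Illustrative sketch only; the name and parameters are hypothetical.
    """
    # Characters available on each side of the keyword and its two spaces.
    side = max(1, (width - len(word) - 2) // 2)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != word.lower():
            continue
        # Rebuild the text on each side, then cut it by characters, not words.
        left = " ".join(tokens[:i])[-side:]
        right = " ".join(tokens[i + 1:])[:side]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{side}} {token} {right}")
        if len(lines) == max_lines:
            break
    return lines


# Invented sample text, split on whitespace for simplicity.
tokens = "Call me Ishmael . The whale swam away and the white whale returned".split()
for line in concordance_lines(tokens, "whale", width=31):
    print(line)
```

Because the window is cut by character count, words at the edges of each line can be chopped in half, which is also what you see in the NLTK output.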
Let’s now see which words appear in similar contexts as the word “love.” NLTK has a built-in function for this as well: similar.
text1.similar("love")
Behind the scenes, NLTK finds all the contexts in which the word “love” appears. It then looks for other words that occur in those same contexts and lists the ones that share them most often. This gives a sense of which words are used the way “love” is. This is somewhat interesting on its own, but more interesting if we can compare it to something else. Let’s take a look at another text. What about Sense and Sensibility? Let’s see what words are similar to “love” in Jane Austen’s writing. In the next cell, type:
text2.similar("love")
We can compare the two and see immediately that Melville and Austen use the word “love” differently.
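The idea behind similar can be sketched in a few lines of plain Python. The version below is a simplification under stated assumptions, not NLTK's code: it treats a word's context as the pair (previous word, next word), lowercases everything, and ranks other words by how many of those pairs they share with the target. The function name similar_words and the sample sentence are invented for illustration.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, top_n=5):
    """Rank words by how many (left, right) contexts they share with `word`.

    Simplified illustration; the name and scoring are assumptions.
    """
    tokens = [t.lower() for t in tokens]
    word = word.lower()

    # Map each word to the set of (previous word, next word) pairs around it.
    contexts = defaultdict(set)
    for prev, current, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[current].add((prev, nxt))

    target = contexts.get(word, set())
    scores = Counter()
    for other, seen in contexts.items():
        if other == word:
            continue
        shared = len(seen & target)  # contexts both words appear in
        if shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(top_n)]


# Invented sample text, split on whitespace for simplicity.
tokens = (
    "i love the sea and i hate the storm "
    "but i love a calm day and i fear a calm night"
).split()
print(similar_words(tokens, "love"))
```

In this toy text, “hate” and “fear” turn up because they fill the same slots as “love” (“i ___ the”, “i ___ a”), not because they mean the same thing. That is worth keeping in mind when reading the lists for Melville and Austen.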
Let’s expand from novels for a minute and take a look at the NLTK Chat Corpus. In chats, text messages, and other digital communication platforms, “lol” is exceedingly common. We know it doesn’t simply mean “laughing out loud,” so maybe the similar function can provide some insight into what it does mean.
text5.similar("lol")
The resulting list consists largely of greetings, which suggests that “lol” probably has more of a phatic function. Phatic language is language used primarily to communicate social closeness. Phatic words stand in contrast to semantic words, which contribute meaning to the utterance.
If you are interested in this type of analysis, take a look at the common_contexts function in the NLTK book or in the NLTK docs.
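As a rough preview of what a common-contexts lookup does, the sketch below collects the (previous word, next word) pairs that surround two words and keeps only the pairs they have in common. It is a plain-Python illustration with an invented function name (shared_contexts) and an invented sample sentence, not NLTK's implementation. In NLTK itself the method takes a list of words, for example text5.common_contexts(["lol", "hi"]).

```python
def shared_contexts(tokens, first, second):
    """Return the (left, right) word pairs that surround both words.

    Illustrative sketch; the function name is hypothetical.
    """
    tokens = [t.lower() for t in tokens]

    def contexts(word):
        # Every (previous word, next word) pair around an occurrence of `word`.
        return {
            (tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] == word.lower()
        }

    return sorted(contexts(first) & contexts(second))


# Invented sample text, split on whitespace for simplicity.
tokens = "they say hi to john and they say lol to mary".split()
for left, right in shared_contexts(tokens, "hi", "lol"):
    print(f"{left}_{right}")
```

Each result names a slot, written here as left_right, that both words can fill. The more slots two words share, the more interchangeable they are in the text.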