I’m currently taking a text mining class with Dr. Catherine Blake at GSLIS. Our first assignment was to pre-process 50,000 .txt files containing scientific abstracts and information about them, all formatted with whitespace and linebreaks. We had to extract the award organization, abstract ID number, and the abstract itself. In addition, we had to split the abstract into sentences. You’d think, sure, that’s fine, just use a simple regular expression like this in Python, where abstract is the variable for a chunk of text:
import re
# Treat every period, exclamation point, or question mark as a sentence boundary
sentEnd = re.compile('[.!?]')
sentList = sentEnd.split(abstract)
The regex looks for periods, exclamation points, and question marks as locations for splitting sentences into a list. But when the text contains things like “Dr. Smith” or “E. coli” or “1. this, 2. that” or “in the U.S. blah”, as these scientific abstracts did, that’s not really good enough: every one of those periods gets treated as the end of a sentence. As it turns out, splitting text into sentences is a big headache! There’s no easy, one-size-fits-all-datasets solution. Any script you run will have to be customized to your text set. To start, though, we can use tools from the NLP (natural language processing) community.
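To make the problem concrete, here is a small sketch that runs the same naive pattern on a made-up sample (the text in `abstract` is invented for illustration, not taken from the assignment data). It only shows how the split goes wrong; it is not a fix.

```python
import re

# Same naive pattern as above: any period, exclamation point, or question mark.
sent_end = re.compile('[.!?]')

# Invented sample containing the kinds of abbreviations that cause trouble.
abstract = "Dr. Smith studied E. coli in the U.S. last year. The results were surprising!"

# Split, then drop surrounding whitespace and empty leftovers.
fragments = [piece.strip() for piece in sent_end.split(abstract) if piece.strip()]

for fragment in fragments:
    print(fragment)
```

The sample has two sentences, but the split produces six fragments ("Dr", "Smith studied E", "coli in the U", and so on), and the punctuation itself is thrown away along the way.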
One such tool is the NLTK, or Natural Language Toolkit, which contains a bunch of modules and data (like corpora) for use with Python. It’s free and pretty cool! The PunktSentenceTokenizer (see #6) was designed to split text into sentences “by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.” It’s pre-trained for English and a dozen other Western languages. I tested it out with a sentence from a random book on Project Gutenberg (The Romance of a Great Store, by Edward Hungerford, 1922, which appears to be all about Macy’s?), and also with some text I penned myself. Here’s how it went down in IDLE:
So, the Punkt tokenizer works great on the book prose, but not as well on my blurb, which was admittedly a toughie. To summarize, it has trouble recognizing:
- uncommon or domain-specific abbreviations (M.L.I.S.)
- two name initials (A.B. Collins-Davis), although one name initial is recognized (Mr. T. Pain)
- “Mr. Pain.”, though I don’t know why here, as “Mr. T. Pain” was fine
- common Latin abbreviations (e.g., i.e.)
- numbered lists (1. blah, 2. blah)
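For quirks like the ones listed above, one possible workaround is to hand Punkt a list of abbreviations you already know occur in your text. What follows is a minimal sketch under assumptions, not the code used for the assignment: it assumes NLTK is installed, the function name `split_with_known_abbrevs` is made up for illustration, and it builds a fresh, untrained tokenizer from `PunktParameters` rather than loading the pre-trained English model, so it knows only the abbreviations you give it.

```python
def split_with_known_abbrevs(text, abbreviations):
    """Split text into sentences, never breaking after the given abbreviations."""
    # Imported inside the function so this sketch still loads where NLTK is absent.
    from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

    params = PunktParameters()
    # Punkt stores abbreviations lowercased and without the trailing period,
    # e.g. "dr" for "Dr." and "m.l.i.s" for "M.L.I.S.".
    params.abbrev_types = set(a.lower().rstrip('.') for a in abbreviations)

    # Passing parameters instead of training text skips the unsupervised training step.
    tokenizer = PunktSentenceTokenizer(params)
    return tokenizer.tokenize(text)
```

A call such as `split_with_known_abbrevs(abstract, ['Dr.', 'Mr.', 'M.L.I.S.'])` should then keep “Dr. Smith” in one piece. The abbreviation list is still something you have to compile yourself from your own text set, and numbered lists would likely need separate handling.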
It’s also interesting to think about what to call a sentence. Is a quoted sentence within a longer sentence part of the long sentence? Or should you, like this tokenizer, split that into sentences as well? And leave a hanging quotation mark, as in the last printed line? What do you think? What IS a sentence? O_O
Anyway, the PunktSentenceTokenizer works well as a basic sentence splitter for English text, but it’s certainly a lesson in knowing your data, e.g. what abbreviations and other quirks are common in your text set. Any other tools out there that might be helpful?
Note: to work with this tokenizer, you’ll have to use Python 2.5 through 2.7 (not Python 3.0) and download and install the NLTK software (instructions here), then download the NLTK data (instructions here, under ‘Interactive installer’). I just stuck the data in my own directory rather than using /share/, since only I use my laptop.