
How to build a URL text summarizer with simple NLP

To view the source code, please visit my GitHub page.

Wouldn’t it be great if you could automatically get a summary of any online article? Whether you’re too busy or have too many articles in your reading list, sometimes all you really want is a short summary.

That’s why TL;DR (too long; didn’t read) is so commonly used these days. While this internet acronym can criticize a piece of writing as overly long, it is often used to give a helpful summary of a much longer story or complicated phenomenon. My last piece focused on estimating the read time of any article; this time we will build a TL;DR for any article.

Getting started

For this tutorial, we’ll be using two Python libraries:

  1. Web crawling: Beautiful Soup. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

  2. Text summarization: NLTK (Natural Language Toolkit). NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

Go ahead and get familiar with both libraries before continuing, and make sure to install them locally. Alternatively, run this command from the project repo directory:

pip install -r requirements.txt

Next, we need to download the stopwords corpus from NLTK. Open a Python shell and enter:

import nltk
nltk.download("stopwords")

Text Summarization using NLP

Let’s describe the algorithm:

  1. Get URL from user input

  2. Crawl the URL and extract the page text from the HTML (paragraph by paragraph, via <p> tags).

  3. Run the frequency-based summarization algorithm (implemented using NLTK) on the extracted sentences. The algorithm ranks sentences according to the frequency of the words they contain, and the top sentences are selected for the final summary.

  4. Return the highest-ranked sentences (I prefer five) as the final summary.

For step 2 (step 1 is self-explanatory), we’ll write a method called getTextFromURL, shown below:

import requests
from bs4 import BeautifulSoup

def getTextFromURL(url):
    # Fetch the page and join the text of every <p> paragraph into one string.
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return text

The method issues a GET request to the given URL and returns the concatenated text of all <p> paragraphs on the page.
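To sanity-check the method on its own, a quick call might look like this (the Wikipedia URL is just an arbitrary example, not part of the project):

# Example: print the first 300 characters extracted from an article page.
article_text = getTextFromURL("https://en.wikipedia.org/wiki/Automatic_summarization")
print(article_text[:300])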

From Text to TL;DR

We will use several methods here, including some that are not shown in this post (see the source code in the repo for the full implementation).

def summarizeURL(url, total_pars):
    # Fetch the page text and strip stray encoding characters left over from the HTML.
    url_text = getTextFromURL(url).replace(u"Â", u"").replace(u"â", u"")
    fs = FrequencySummarizer()
    final_summary = fs.summarize(url_text.replace("\n", " "), total_pars)
    return " ".join(final_summary)

The method calls getTextFromURL above to retrieve the text, then cleans it of stray encoding characters and newlines (\n).

Next, we run the FrequencySummarizer algorithm on the text. The algorithm tokenizes the input into sentences and computes a term-frequency map of the words. The frequency map is then filtered to ignore very rare and very frequent words; this discards noisy words such as determiners, which are very frequent but carry little information, as well as words that occur only a few times. To see the source code, click here.
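If you don’t want to dig into the repo, here is a minimal sketch of a frequency-based summarizer along the lines described above. The class name matches the one used in summarizeURL, but the min_cut/max_cut thresholds and the exact filtering rules are my assumptions, and the repo’s implementation may differ. It also assumes the NLTK punkt tokenizer has been downloaded (nltk.download("punkt")) in addition to the stopwords corpus.

from collections import defaultdict
from heapq import nlargest
from string import punctuation

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize


class FrequencySummarizer:
    def __init__(self, min_cut=0.1, max_cut=0.9):
        # Words whose normalized frequency falls outside (min_cut, max_cut) are ignored.
        self._min_cut = min_cut
        self._max_cut = max_cut
        self._stopwords = set(stopwords.words("english")) | set(punctuation)

    def _compute_frequencies(self, word_sent):
        # Count how often each non-stopword appears across all sentences.
        freq = defaultdict(int)
        for sentence in word_sent:
            for word in sentence:
                if word not in self._stopwords:
                    freq[word] += 1
        if not freq:
            return freq
        # Normalize by the most frequent word, then drop words that are too common or too rare.
        max_freq = float(max(freq.values()))
        for word in list(freq):
            freq[word] /= max_freq
            if freq[word] >= self._max_cut or freq[word] <= self._min_cut:
                del freq[word]
        return freq

    def summarize(self, text, n):
        # Split the text into sentences, then into lowercase words.
        sents = sent_tokenize(text)
        word_sent = [word_tokenize(s.lower()) for s in sents]
        freq = self._compute_frequencies(word_sent)
        # Score each sentence by the summed frequency of the words it contains.
        ranking = defaultdict(int)
        for i, sentence in enumerate(word_sent):
            for word in sentence:
                if word in freq:
                    ranking[i] += freq[word]
        # Return the n highest-scoring sentences in their original order.
        top_idx = nlargest(n, ranking, key=ranking.get)
        return [sents[i] for i in sorted(top_idx)]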

Finally, we return a list of the highest-ranked sentences, which is our final summary.
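To tie everything together, a small driver can prompt for a URL and print a five-sentence summary. This is just one way to wire it up, not necessarily how the repo’s entry point looks:

# Minimal driver: ask the user for a URL and print a 5-sentence summary.
if __name__ == "__main__":
    url = input("Enter a URL to summarize: ")
    print(summarizeURL(url, total_pars=5))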


Summary

That’s it! Try it out with any URL and you’ll get a pretty decent summary. Many summarization algorithms have been proposed in recent years (such as TF-IDF-based approaches), and there’s much more that could be done with this one. For example, go ahead and improve the filtering of the text. If you have any suggestions or recommendations, I’d love to hear them, so comment below!