URL text summarizer using Web Crawling and NLP (Python)

To skip this tutorial, feel free to download the source code from my GitHub repo here.

I’ve been asked by a few friends to develop a feature for a WhatsApp chatbot of mine that summarizes articles based on URL inputs. So when a friend sends an article to a WhatsApp group, the bot replies with a summary of the linked article. I like this feature because, from my personal research, 65% of group users don’t even click the shared URLs, but 97% of them will read a few lines of an article’s summary.

As a full-stack developer, it is important to know how to choose the right stack for each product you develop, depending on its requirements and limitations. For web crawling, I love using Python. The Python community offers efficient, easy-to-implement open-source libraries for both web crawling and text summarization. Once you’re done with this tutorial, you won’t believe how simple this task is to implement.

 

GETTING STARTED

For this tutorial, we’ll be using two Python libraries:

  1. Web crawling - Beautiful Soup. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

  2. Text summarization - NLTK (Natural Language Toolkit). NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

Go ahead and get familiar with the libraries before continuing, and make sure to install them locally. If you’re having trouble installing them, run these commands in your Terminal:

pip install beautifulsoup4
pip install -U nltk
pip install -U numpy
pip install -U setuptools
pip install -U sumy

After that, open a Python shell and enter:

import nltk
nltk.download("stopwords")
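
As a quick sanity check (a minimal sketch, assuming the installs above succeeded), you can verify that both libraries import correctly and that the stopwords corpus is available:

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Parse a tiny HTML snippet and pull out the paragraph text.
soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
print(soup.find('p').text)             # Hello, world!

# The stopwords corpus downloaded above should now load.
print(stopwords.words('english')[:5])  # ['i', 'me', 'my', 'myself', 'we']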

 

THE ALGORITHM

Let’s describe the algorithm:

  1. Get URL from user input

  2. Web crawl to extract the natural language text from the URL’s HTML (by paragraph tags, <p>).

  3. Run the summarizer class (implemented using NLTK) on the extracted sentences.

    1. The algorithm ranks sentences according to the frequency of the words they contain, and the top-ranked sentences are selected for the final summary (see the sketch right after this list).

  4. Return the highest-ranked sentences (I prefer 5) as the final summary.
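
To make step 3 concrete, here is a minimal sketch of such a frequency-based summarizer. It mirrors the FrequencySummarizer class used later in this tutorial, though the implementation in the post that inspired it differs in its details. It assumes English stopwords, and sent_tokenize additionally requires NLTK’s punkt tokenizer data (nltk.download("punkt")):

from collections import defaultdict
from heapq import nlargest
from string import punctuation

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize


class FrequencySummarizer:
    def __init__(self, min_cut=0.1, max_cut=0.9):
        # Words whose normalized frequency falls outside (min_cut, max_cut)
        # are treated as uninformative and ignored.
        self._min_cut = min_cut
        self._max_cut = max_cut
        self._stopwords = set(stopwords.words("english")) | set(punctuation)

    def _compute_frequencies(self, word_sent):
        # Count every non-stopword, then normalize by the most frequent word.
        freq = defaultdict(int)
        for sentence in word_sent:
            for word in sentence:
                if word not in self._stopwords:
                    freq[word] += 1
        max_freq = float(max(freq.values()))
        for word in list(freq):
            freq[word] /= max_freq
            if freq[word] >= self._max_cut or freq[word] <= self._min_cut:
                del freq[word]
        return freq

    def summarize(self, text, n):
        # Tokenize into sentences, then into lowercase words per sentence.
        # Requires the "punkt" tokenizer data: nltk.download("punkt")
        sents = sent_tokenize(text)
        word_sent = [word_tokenize(s.lower()) for s in sents]
        freq = self._compute_frequencies(word_sent)
        # Score each sentence by the summed frequency of its words.
        ranking = defaultdict(float)
        for i, sentence in enumerate(word_sent):
            for word in sentence:
                if word in freq:
                    ranking[i] += freq[word]
        # Return the n best sentences, restored to their original order.
        top_idx = nlargest(n, ranking, key=ranking.get)
        return [sents[i] for i in sorted(top_idx)]

Filtering out words that are too frequent or too rare keeps the scoring focused on the words that actually characterize the article.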

For step 2 (step 1 is self-explanatory), we’ll develop a method called getTextFromURL as shown below:

import requests
from bs4 import BeautifulSoup

def getTextFromURL(url):
    # Fetch the page and parse its HTML.
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    # Join the text of all <p> tags into one string.
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return text

The method issues a GET request to the given URL and returns the natural language text extracted from the page’s paragraph tags.
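
For example (with a hypothetical article URL, just for illustration):

text = getTextFromURL("https://example.com/some-article")  # hypothetical URL
print(text[:200])  # preview the first 200 characters of the extracted text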

For steps 3-4, we’ll develop a method called summarizeURL as shown below:

def summarizeURL(url, total_pars):
    # Strip common mojibake characters left over from the HTML encoding.
    url_text = getTextFromURL(url).replace(u"Â", u"").replace(u"â", u"")
    # FrequencySummarizer is the NLTK-based summarizer sketched above.
    fs = FrequencySummarizer()
    final_summary = fs.summarize(url_text.replace("\n", " "), total_pars)
    return " ".join(final_summary)

The method calls getTextFromURL to retrieve the text and cleans it of stray HTML characters and newlines (\n). It then runs the summarization algorithm (inspired by this post) on the cleaned text, which returns a list of the highest-ranked sentences, our final summary.
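
Putting it all together (again with a hypothetical URL), this prints a 5-sentence summary:

print(summarizeURL("https://example.com/some-article", 5))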

 

SUMMARY

That’s it! Try it out with any URL and you’ll get a pretty decent summary. The algorithm proposed in this article is, as stated, inspired by this post, which implements a simple text summarizer using the NLTK library. Many summarization algorithms have been proposed in recent years, and there’s no doubt there are even better solutions. If you have any suggestions or recommendations, I’d love to hear about them, so comment below!

Feel free to download the source code directly via my GitHub account.