Extracting the most frequently occurring words from a Wikipedia entry

Maciej Tarsa
4 min read · Sep 25, 2021
A map with the top word from each country's Wikipedia page

My 6-year-old son has been continuing his strong interest in political geography and recently discovered a book by Ian Wright titled ‘Brilliant Maps’. One of the maps in it showed the most frequently occurring word in each country’s English Wikipedia page.

As I have recently been delving into the world of Natural Language Processing, I thought it would be fun to recreate it by coding a simple Python application that returns the most frequently occurring words from a Wikipedia page and plots them on a map.

First, choose the URL to use. In this example, let’s use the Wikipedia page for the United Kingdom.

from urllib.request import urlopen
# specify the url of the web page
url = 'https://en.wikipedia.org/wiki/United_Kingdom'
source = urlopen(url).read()

Next, we need to extract the text from the Wikipedia page. I tried using the wikipedia Python package, but it struggled to retrieve pages for certain queries. I tried the BeautifulSoup package instead and it worked great.

from bs4 import BeautifulSoup
# make a soup
soup = BeautifulSoup(source,'lxml')

The code above parses the whole page, but we only need the text enclosed in <p> tags.

# extract the plain text content from paragraphs
text = ''
for paragraph in soup.find_all('p'):
    text += paragraph.text

Next, we can use the NLTK package to tokenise the words. Tokenising is essentially breaking the sentences down into separate words and punctuation marks. We will remove stopwords and punctuation in a moment.

import nltk
# download the tokenizer model and tokenize the words
nltk.download('punkt')
tokens = nltk.word_tokenize(text, language="english")

Next, we want to remove stopwords (commonly used words such as ‘in’, ‘a’, ‘the’) and punctuation. We can use the set of stopwords provided with NLTK and remove punctuation using the .isalpha() string method.

# download and prepare the set of stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# remove stopwords
tokens = [word for word in tokens if word.lower() not in stop_words]
# remove punctuation
tokens = [word for word in tokens if word.isalpha()]

Now, we can vectorise the tokens. The easiest thing to do is to create a dictionary of all the remaining words and count their occurrences.

# an empty dictionary to be returned at the end
vectors = {}
# iterate through all tokens
for token in tokens:
    # check if that token already exists in the dictionary
    try:
        i = vectors[token]
        # if it does, increment the count
        vectors[token] = i + 1
    # otherwise, assign a count of 1
    except KeyError:
        vectors[token] = 1
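
As a side note, Python’s built-in collections.Counter can do the same counting in a single call. This is just an equivalent alternative to the loop above:

from collections import Counter
# Counter builds the same word-to-count mapping as the loop above
vectors = dict(Counter(tokens))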

As this would return a very long list, we want to limit it, for example, to words that occurred in the page at least 20 times.

# only keep the ones with counts over 20
vectors_reduced = {key:value for (key,value) in vectors.items() if value >= 20}

Finally, sort the dictionary so that it shows the highest counts first.

import operator
# sort the result in descending order
vectors_sorted = dict(sorted(vectors_reduced.items(), key=operator.itemgetter(1), reverse=True))

All that’s left to do is to print the results.

# print the response
print(vectors_sorted)

This is the result:

{'UK': 177, 'per': 128, 'British': 114, 'cent': 114, 'United': 113, 'Kingdom': 104, 'Ireland': 101, 'England': 91, 'Britain': 77, 'Wales': 70, 'Scotland': 68, 'Northern': 64, 'world': 64, 'population': 49, 'Great': 39, 'million': 34, 'first': 34, 'century': 34, 'London': 32, 'people': 29, 'Scottish': 27, 'Welsh': 27, 'English': 26, 'also': 26, 'government': 26, 'around': 26, 'Europe': 25, 'include': 25, 'country': 24, 'countries': 24, 'Union': 24, 'number': 24, 'largest': 23, 'including': 22, 'European': 21, 'Irish': 20, 'international': 20, 'one': 20, 'national': 20}

To make the map at the top of this article, I iterated through a list of countries and extracted the top words for each one. I then combined this with geospatial information from GeoPandas in order to plot it using Matplotlib.
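
The plotting code isn’t reproduced here, but a minimal sketch along the following lines works. The top_words dictionary is hypothetical (it would be built by running the steps above for each country), and the sketch assumes the ‘naturalearth_lowres’ dataset bundled with older GeoPandas versions; country names in that dataset don’t always match Wikipedia page titles exactly, so some manual mapping may be needed.

import geopandas as gpd
import matplotlib.pyplot as plt

# hypothetical result of running the extraction for every country,
# e.g. {'United Kingdom': 'UK', 'France': 'French', ...}
top_words = {'United Kingdom': 'UK'}

# country boundaries shipped with GeoPandas
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

fig, ax = plt.subplots(figsize=(20, 10))
world.plot(ax=ax, color='lightgrey', edgecolor='white')

# label each country with its top word at the centroid of its geometry
for _, row in world.iterrows():
    word = top_words.get(row['name'])
    if word:
        centroid = row['geometry'].centroid
        ax.annotate(word, xy=(centroid.x, centroid.y), ha='center', fontsize=6)

ax.set_axis_off()
plt.show()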

I excluded some of the words: words that partly contain the name of the country or its capital, as well as words which generally occur often across many countries, such as ‘government’, ‘population’ or ‘country’. To do so, I created a global dictionary of words with their counts across all countries and then divided each count by the number of Wikipedia pages explored. This gave me the average number of times a word occurs on a page. For example, ‘country’ occurred an average of 29 times. I excluded any word with an average count greater than 8.
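
The exact code isn’t shown here, but the averaging step could look roughly like the sketch below. The per_country_vectors argument is a hypothetical dictionary mapping each country name to its word-count dictionary from the steps above; the country-name and capital-name exclusions are omitted for brevity.

def filter_common_words(per_country_vectors, max_average=8):
    # build a global dictionary of counts across all pages
    global_counts = {}
    for counts in per_country_vectors.values():
        for word, count in counts.items():
            global_counts[word] = global_counts.get(word, 0) + count
    # words that occur, on average, more than max_average times per page
    num_pages = len(per_country_vectors)
    too_common = {word for word, total in global_counts.items()
                  if total / num_pages > max_average}
    # return each country's counts with the overly common words removed
    return {country: {w: c for w, c in counts.items() if w not in too_common}
            for country, counts in per_country_vectors.items()}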

Thanks for reading this far. If you are interested in the full code, you can find it on my GitHub.
