Mining Wikipedia for Topic Modelling
DEFINITION
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.
OBJECTIVE
We will mine a Wikipedia page by searching for the keyword “Bangalore” and build a topic model from it.
PACKAGES USED
pattern for Wikipedia mining
gensim for topic modelling
NLTK for natural language processing
from pattern.web import Wikipedia
article = Wikipedia().search('Bangalore')
article.sections ### list the sections of the article
[WikipediaSection(title=u'Bangalore'),
WikipediaSection(title=u'Etymology'),
WikipediaSection(title=u'History'),
WikipediaSection(title=u'Early and medieval history'),
WikipediaSection(title=u'Foundation and early modern history'),
WikipediaSection(title=u'Later modern and contemporary history'),
WikipediaSection(title=u'Geography'),
WikipediaSection(title=u'Climate'),
WikipediaSection(title=u'Demographics'),
WikipediaSection(title=u'Civic administration'),
WikipediaSection(title=u'Pollution control'),
WikipediaSection(title=u'Slums'),
WikipediaSection(title=u'Waste management'),
WikipediaSection(title=u'Economy'),
WikipediaSection(title=u'Transport'),
WikipediaSection(title=u'Air'),
WikipediaSection(title=u'Rail'),
WikipediaSection(title=u'Road'),
WikipediaSection(title=u'Culture'),
WikipediaSection(title=u'Art and literature'),
WikipediaSection(title=u'Indian Cartoon Gallery'),
WikipediaSection(title=u'Theatre, music, and dance'),
WikipediaSection(title=u'Education'),
WikipediaSection(title=u'Media'),
WikipediaSection(title=u'Sports'),
WikipediaSection(title=u'Sister cities'),
WikipediaSection(title=u'See also'),
WikipediaSection(title=u'References'),
WikipediaSection(title=u'Further reading')]
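The sections attribute returns a plain list, so a small hypothetical helper (section_by_title is our own name, not part of pattern) can fetch a section by its title instead of by its position; each WikipediaSection exposes the title attribute shown above:
### hypothetical helper: fetch a WikipediaSection by its title
def section_by_title(article, title):
    for section in article.sections:
        if section.title == title:
            return section
    return None

history = section_by_title(article, 'History')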
For our topic modelling, let's simply take the first item on the list, the lead section:
ban = article.sections[0]
print ban.plaintext()
For other uses, see Bangalore (disambiguation).
Not to be confused with Mangalore.
Bangalore (/bæŋɡəˈlɔːr/), officially known as Bengaluru [14] ( [ˈbeŋɡəɭuːɾu] ( listen)), is the capital of the Indian state of Karnataka. It has a population of over ten million, [8] making it a megacity and the third most populous city and fifth most populous urban agglomeration in India. [15] It is located in southern India on the Deccan Plateau. Its elevation is over 900 m (3,000 ft) above sea level, the highest of India's major cities. [16]
A succession of South Indian dynasties, the Western Gangas, the Cholas and the Hoysalas, ruled the present region of Bangalore until in 1537 CE, Kempé Gowdā – a feudal ruler under the Vijayanagara Empire – established a mud fort considered to be the foundation of modern Bangalore. In 1638, the Marāthās conquered and ruled Bangalore for almost 50 years, after which the Mughals captured and sold the city to the Mysore Kingdom of the Wadiyar dynasty. It was captured by the British after victory in the Fourth Anglo-Mysore War (1799), who returned administrative control of the city to the Maharaja of Mysore. The old city developed in the dominions of the Maharaja of Mysore and was made capital of the Princely State of Mysore, which existed as a nominally sovereign entity of the British Raj.
In 1809, the British shifted their cantonment to Bangalore, outside the old city, and a town grew up around it, which was governed as part of British India. Following India's independence in 1947, Bangalore became the capital of Mysore State, and remained capital when the new Indian state of Karnataka was formed in 1956. The two urban settlements of Bangalore – city and cantonment – which had developed as independent entities merged into a single urban centre in 1949. The existing Kannada name, Bengalūru, was declared the official name of the city in 2006.
Bangalore is sometimes referred to as the "Silicon Valley of India" (or "IT capital of India") because of its role as the nation's leading information technology (IT) exporter. [1] [17] [18] Indian technological organisations ISRO, Infosys, Wipro and HAL are headquartered in the city. A demographically diverse city, Bangalore is the second fastest-growing major metropolis in India. [19] It is home to many educational and research institutions in India, such as Indian Institute of Science (IISc), Indian Institute of Management (Bangalore) (IIMB), National Institute of Fashion Technology, Bangalore, National Institute of Design, Bangalore (NID R&D Campus), National Law School of India University (NLSIU) and National Institute of Mental Health and Neurosciences (NIMHANS). Numerous state-owned aerospace and defence organisations, such as Bharat Electronics, Hindustan Aeronautics and National Aerospace Laboratories are located in the city. The city also houses the Kannada film industry.
text = ban.plaintext()
We will have to preprocess the text.
import nltk
import gensim
from gensim import corpora, models
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
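One setup note (an assumption about a fresh environment): sent_tokenize below relies on NLTK's punkt sentence tokenizer, which has to be downloaded once.
nltk.download('punkt')  ### one-time download of the punkt sentence tokenizer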
Let's first turn the text into a corpus of sentences.
corpus = nltk.tokenize.sent_tokenize(text)
We check the length of the corpus:
len(corpus)
20
We will have to tokenize each document into words.
token_corpus = [nltk.tokenize.word_tokenize(document) for document in corpus]
Let's inspect the first tokenized document:
token_corpus[0]
[u'For',
u'other',
u'uses',
u',',
u'see',
u'Bangalore',
u'(',
u'disambiguation',
u')',
u'.']
Convert all the tokens to lower case.
### turn each token to lower case
token_corpus = [[wrd.lower() for wrd in document] for document in token_corpus]
Now we will have to remove the stopwords, because they don't really add any value to our model.
stop_words = set('for a of the and is it major was to after in by are were which use or between i as with meanwhile'.split())
token_corpus = [[wrd for wrd in document if wrd not in stop_words]
                for document in token_corpus]
### let's check the first document again
token_corpus[0] ### good: 'is', 'of' and others have been removed
[u'other', u'uses', u'see', u'bangalore', u'disambiguation']
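As an alternative to a hand-picked list, NLTK ships a standard English stopword list; here is a minimal sketch of it (kept in separate variables so the outputs below still match, and it needs a one-time nltk.download('stopwords')):
from nltk.corpus import stopwords
### nltk.download('stopwords')  # one-time download
nltk_stop_words = set(stopwords.words('english'))
alt_corpus = [[wrd for wrd in document if wrd not in nltk_stop_words]
              for document in token_corpus]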
We will remove the full stops and other punctuation as well.
import string
token_corpus = [[wrd for wrd in document if wrd not in string.punctuation] for document in token_corpus]
### let's check again
token_corpus[0]
[u'other', u'uses', u'see', u'bangalore', u'disambiguation']
###### great, we have removed the dots from all the documents of the corpus
Let's make a dictionary out of the corpus so that each unique word gets an id.
dictionary = corpora.Dictionary(token_corpus)
2018-03-09 19:59:27,411 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-03-09 19:59:27,416 : INFO : built Dictionary(221 unique tokens: [u'wipro', u'technological', u'developed', u'over', u'settlements']...) from 20 documents (total 307 corpus positions)
print(dictionary.token2id)
{u'wipro': 175, u'technological': 174, u'developed': 106, u'over': 30, u'settlements': 143, u'dynasty': 87, u'entity': 108, u'its': 50, u'fifth': 23, u'nominally': 111, u'state-owned': 216, u'1949': 136, u'hal': 169, u'also': 217, u'had': 139, u'outside': 121, u'nimhans': 200, u'entities': 138, u'plateau': 40, u'under': 77, u'8': 20, u'has': 24, u'administrative': 96, u'silicon': 162, u'merged': 142, u'kingdom': 88, u'town': 125, u'1799': 95, u'elevation': 47, u'returned': 102, u'agglomeration': 21, u'ruler': 74, u'iimb': 189, u'dominions': 107, u'nation': 159, u'grew': 120, u'aerospace': 209, u'not': 8, u'neurosciences': 198, u'years': 94, u'electronics': 212, u'ruled': 73, u'school': 204, u'population': 31, u'level': 51, u'numerous': 215, u'university': 207, u'50': 83, u'cantonment': 118, u'iisc': 190, u'victory': 103, u'succession': 76, u'bengal\u016bru': 147, u'referred': 160, u'aeronautics': 208, u'3,000': 43, u'because': 155, u'old': 112, u'karnataka': 14, u'ten': 33, u'be': 5, u'existing': 149, u'mar\u0101th\u0101s': 89, u'national': 197, u'nlsiu': 201, u'up': 126, u'hindustan': 213, u'see': 3, u'maharaja': 101, u'design': 184, u'sea': 53, u'ft': 48, u'empire': 60, u'established': 61, u'vijayanagara': 79, u'megacity': 27, u'research': 203, u'mud': 70, u'state': 18, u'indian': 13, u'health': 187, u'captured': 85, u'above': 45, u'capital': 12, u'new': 133, u'bharat': 210, u'officially': 17, u'cholas': 57, u'foundation': 64, u'metropolis': 179, u'gangas': 65, u'existed': 109, u'headquartered': 170, u'sold': 92, u'million': 28, u'houses': 219, u'bangalore': 0, u'cities': 46, u'science': 205, u'on': 39, u'900': 44, u'hoysalas': 67, u'gowd\u0101': 66, u'region': 72, u'british': 98, u'1537': 54, u'became': 129, u'fastest-growing': 178, u'conquered': 86, u'south': 75, u'/b\xe6\u014b\u0261\u0259\u02c8l\u0254\u02d0r/': 9, u'into': 141, u'declared': 148, u'ce': 56, u'1947': 127, u'd': 183, u'feudal': 62, u'mental': 196, u'city': 22, u'management': 194, u'second': 180, u'1638': 82, u'leading': 158, u'sovereign': 115, u'name': 151, u'their': 124, u'bengaluru': 11, u'organisations': 173, u'valley': 165, u'until': 78, u'listen': 16, u'shifted': 123, u'urban': 35, u'``': 154, u'raj': 114, u'formed': 131, u'part': 122, u'western': 80, u'known': 15, u'highest': 49, u'2006': 146, u'present': 71, u'fort': 63, u'mughals': 90, u'made': 110, u'15': 36, u'14': 10, u'17': 167, u'16': 55, u'19': 181, u'18': 168, u'official': 152, u'\u2013': 81, u'war': 104, u'considered': 58, u'r': 202, u'kemp\xe9': 68, u'many': 195, u'following': 130, u'making': 26, u'mangalore': 7, u'control': 99, u'fashion': 186, u'remained': 134, u'almost': 84, u'disambiguation': 1, u'modern': 69, u'india': 25, u'confused': 6, u'kannada': 150, u"''": 153, u'single': 144, u'diverse': 177, u'governed': 119, u'home': 188, u'infosys': 171, u'technology': 164, u'campus': 182, u'film': 218, u'around': 117, u'information': 157, u'educational': 185, u'\u02c8be\u014b\u0261\u0259\u026du\u02d0\u027eu': 19, u'mysore': 91, u'when': 135, u'1': 166, u'located': 38, u'other': 2, u'role': 161, u'fourth': 100, u'populous': 32, u'dynasties': 59, u'independent': 140, u'exporter': 156, u"'s": 42, u'princely': 113, u'centre': 137, u'wadiyar': 93, u'who': 105, u'southern': 41, u'nid': 199, u'anglo-mysore': 97, u'most': 29, u'defence': 211, u'uses': 4, u'two': 145, u'such': 206, u'law': 193, u'demographically': 176, u'institute': 191, u'third': 34, u'deccan': 37, u'industry': 220, u'sometimes': 163, u'm': 52, u'1809': 116, u'1956': 128, u'independence': 132, 
u'laboratories': 214, u'institutions': 192, u'isro': 172}
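Notice in the id map that tokens such as ``, '' and – survived the punctuation filter: string.punctuation only lists single ASCII characters, so multi-character and non-ASCII punctuation slips through. A stricter sketch (not applied here, so the outputs below stay as shown) keeps only tokens containing at least one alphanumeric character:
### not applied here: drop any token with no alphanumeric character
cleaned_corpus = [[wrd for wrd in document if any(ch.isalnum() for ch in wrd)]
                  for document in token_corpus]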
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR), also known as the vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. (Source)
corpus_bow = [dictionary.doc2bow(document) for document in token_corpus]
print(corpus_bow)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1)], [(0, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 1), (31, 1), (32, 2), (33, 1), (34, 1), (35, 1)], [(25, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)], [(25, 1), (30, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1)], [(0, 2), (5, 1), (13, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 2)], [(0, 1), (22, 1), (73, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1)], [(22, 1), (85, 1), (91, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1)], [(12, 1), (18, 1), (22, 1), (91, 2), (98, 1), (101, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1), (111, 1), (112, 1), (113, 1), (114, 1), (115, 1)], [(0, 1), (22, 1), (25, 1), (98, 2), (112, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 1), (122, 1), (123, 1), (124, 1), (125, 1), (126, 1)], [(0, 1), (12, 2), (13, 1), (14, 1), (18, 2), (25, 1), (42, 1), (91, 1), (127, 1), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1)], [(0, 1), (22, 1), (35, 2), (81, 2), (106, 1), (118, 1), (136, 1), (137, 1), (138, 1), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1)], [(22, 1), (146, 1), (147, 1), (148, 1), (149, 1), (150, 1), (151, 2), (152, 1)], [(0, 1), (12, 1), (25, 2), (42, 1), (50, 1), (153, 2), (154, 2), (155, 1), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 1)], [(13, 1), (22, 1), (166, 1), (167, 1), (168, 1), (169, 1), (170, 1), (171, 1), (172, 1), (173, 1), (174, 1), (175, 1)], [(0, 1), (22, 1), (25, 1), (176, 1), (177, 1), (178, 1), (179, 1), (180, 1)], [(0, 3), (13, 2), (25, 2), (164, 1), (181, 1), (182, 1), (183, 1), (184, 1), (185, 1), (186, 1), (187, 1), (188, 1), (189, 1), (190, 1), (191, 5), (192, 1), (193, 1), (194, 1), (195, 1), (196, 1), (197, 4), (198, 1), (199, 1), (200, 1), (201, 1), (202, 1), (203, 1), (204, 1), (205, 1), (206, 1), (207, 1)], [(22, 1), (38, 1), (173, 1), (197, 1), (206, 1), (208, 1), (209, 2), (210, 1), (211, 1), (212, 1), (213, 1), (214, 1), (215, 1), (216, 1)], [(22, 1), (150, 1), (217, 1), (218, 1), (219, 1), (220, 1)]]
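Each pair is (token id, count). To make this readable we can map the ids back through the dictionary; for example (0, 1) in the first document says token 0, u'bangalore', appears once:
### map (id, count) pairs back to (word, count) for the first document
print [(dictionary[word_id], count) for word_id, count in corpus_bow[0]]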
Term Frequency-Inverse Document Frequency (TF-IDF)
tfidf = models.TfidfModel(corpus_bow)
2018-03-09 19:59:30,516 : INFO : collecting document frequencies
2018-03-09 19:59:30,518 : INFO : PROGRESS: processing document #0
2018-03-09 19:59:30,521 : INFO : calculating IDF weights for 20 documents and 220 features (281 matrix non-zeros)
tfidf_corpus = tfidf[corpus_bow]
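TF-IDF re-weights each raw count by how rare a word is across the corpus: words that occur in many sentences get a low weight, while distinctive words get a high one. A quick sketch to inspect the weights of the first document:
### inspect the TF-IDF weights of the first document
for word_id, weight in tfidf[corpus_bow[0]]:
    print dictionary[word_id], round(weight, 3)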
Building the model
Latent Semantic Indexing
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts. (Source)
lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[tfidf_corpus] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
2018-03-09 19:59:32,597 : INFO : using serial LSI version on this node
2018-03-09 19:59:32,599 : INFO : updating model with new documents
2018-03-09 19:59:32,620 : INFO : preparing a new chunk of documents
2018-03-09 19:59:32,628 : INFO : using 100 extra samples and 2 power iterations
2018-03-09 19:59:32,632 : INFO : 1st phase: constructing (221, 102) action matrix
2018-03-09 19:59:32,635 : INFO : orthonormalizing (221, 102) action matrix
2018-03-09 19:59:32,715 : INFO : 2nd phase: running dense svd on (102, 20) matrix
2018-03-09 19:59:32,721 : INFO : computing the final decomposition
2018-03-09 19:59:32,737 : INFO : keeping 2 factors (discarding 87.165% of energy spectrum)
2018-03-09 19:59:32,740 : INFO : processed documents up to #20
2018-03-09 19:59:32,742 : INFO : topic #0(1.183): 0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
2018-03-09 19:59:32,745 : INFO : topic #1(1.081): -0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
Checking the result:
print lsi.print_topics(2)
2018-03-09 19:59:33,669 : INFO : topic #0(1.183): 0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
2018-03-09 19:59:33,672 : INFO : topic #1(1.081): -0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
[(0, u'0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"'), (1, u'-0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"')]
It returns a list; we can print the topics one by one, and we can even split the strings if we like.
topic = lsi.print_topics(2)
2018-03-09 19:59:34,755 : INFO : topic #0(1.183): 0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
2018-03-09 19:59:34,763 : INFO : topic #1(1.081): -0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
print " Topic 1:\n"
print topic[0][1]
Topic 1:
0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
print " Topic 2:\n"
print topic[1][1]
Topic 2:
-0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
Let's check the keywords of our search term.
print gensim.summarization.keywords(ban.plaintext())
bangalore
indian state
urban
city
cities
nation
national
technology
mysore
technological organisations
india
entity
entities
film
institutions
institute
university
british
anglo
isro
SUMMARY
The model can be fine-tuned by training our own POS tagger and making sure that words which are not nouns don't appear, since a topic is usually a noun or a noun phrase; see the sketch below.
The model result is still reasonably okay, even though I had to hard-code some words that initially appeared in the topics and make sure they were removed, like "after", "was", etc.
The model is still okay, since all the words that contribute to the topics have something to do with Bangalore, like "capital", "mysore", "indian", etc.
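A minimal sketch of that noun filter, using NLTK's built-in tagger instead of training our own (it needs a one-time nltk.download('averaged_perceptron_tagger'); tags starting with NN mark nouns):
### keep only nouns before building the dictionary
### nltk.download('averaged_perceptron_tagger')  # one-time download
noun_corpus = [[wrd for wrd, tag in nltk.pos_tag(document)
                if tag.startswith('NN')]
               for document in token_corpus]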