Mining Wikipedia for Topic Modelling
DEFINITION
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.
OBJECTIVE
We will mine a Wikipedia page by searching for the keyword “Bangalore” and build a topic model from it.
PACKAGES USED
pattern for Wikipedia mining
gensim for topic modelling
NLTK for natural language processing
from pattern.web import Wikipedia
article = Wikipedia().search('Bangalore')
article.sections ### list the sections of the article
[WikipediaSection(title=u'Bangalore'),
WikipediaSection(title=u'Etymology'),
WikipediaSection(title=u'History'),
WikipediaSection(title=u'Early and medieval history'),
WikipediaSection(title=u'Foundation and early modern history'),
WikipediaSection(title=u'Later modern and contemporary history'),
WikipediaSection(title=u'Geography'),
WikipediaSection(title=u'Climate'),
WikipediaSection(title=u'Demographics'),
WikipediaSection(title=u'Civic administration'),
WikipediaSection(title=u'Pollution control'),
WikipediaSection(title=u'Slums'),
WikipediaSection(title=u'Waste management'),
WikipediaSection(title=u'Economy'),
WikipediaSection(title=u'Transport'),
WikipediaSection(title=u'Air'),
WikipediaSection(title=u'Rail'),
WikipediaSection(title=u'Road'),
WikipediaSection(title=u'Culture'),
WikipediaSection(title=u'Art and literature'),
WikipediaSection(title=u'Indian Cartoon Gallery'),
WikipediaSection(title=u'Theatre, music, and dance'),
WikipediaSection(title=u'Education'),
WikipediaSection(title=u'Media'),
WikipediaSection(title=u'Sports'),
WikipediaSection(title=u'Sister cities'),
WikipediaSection(title=u'See also'),
WikipediaSection(title=u'References'),
WikipediaSection(title=u'Further reading')]
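The sections attribute returns a plain list, so a small hypothetical helper (section_by_title is our own name, not part of pattern) can fetch a section by its title instead of by its position; each WikipediaSection exposes the title attribute shown above:
### hypothetical helper: fetch a WikipediaSection by its title
def section_by_title(article, title):
    for section in article.sections:
        if section.title == title:
            return section
    return None

history = section_by_title(article, 'History')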
For our topic modelling, let's simply take the first item on the list, the lead section:
ban = article.sections[0]
print ban.plaintext()
For other uses, see Bangalore (disambiguation).
Not to be confused with Mangalore.
Bangalore (/bæŋɡəˈlɔːr/), officially known as Bengaluru [14] ( [ˈbeŋɡəɭuːɾu] ( listen)), is the capital of the Indian state of Karnataka. It has a population of over ten million, [8] making it a megacity and the third most populous city and fifth most populous urban agglomeration in India. [15] It is located in southern India on the Deccan Plateau. Its elevation is over 900 m (3,000 ft) above sea level, the highest of India's major cities. [16]
A succession of South Indian dynasties, the Western Gangas, the Cholas and the Hoysalas, ruled the present region of Bangalore until in 1537 CE, Kempé Gowdā – a feudal ruler under the Vijayanagara Empire – established a mud fort considered to be the foundation of modern Bangalore. In 1638, the Marāthās conquered and ruled Bangalore for almost 50 years, after which the Mughals captured and sold the city to the Mysore Kingdom of the Wadiyar dynasty. It was captured by the British after victory in the Fourth Anglo-Mysore War (1799), who returned administrative control of the city to the Maharaja of Mysore. The old city developed in the dominions of the Maharaja of Mysore and was made capital of the Princely State of Mysore, which existed as a nominally sovereign entity of the British Raj.
In 1809, the British shifted their cantonment to Bangalore, outside the old city, and a town grew up around it, which was governed as part of British India. Following India's independence in 1947, Bangalore became the capital of Mysore State, and remained capital when the new Indian state of Karnataka was formed in 1956. The two urban settlements of Bangalore – city and cantonment – which had developed as independent entities merged into a single urban centre in 1949. The existing Kannada name, Bengalūru, was declared the official name of the city in 2006.
Bangalore is sometimes referred to as the "Silicon Valley of India" (or "IT capital of India") because of its role as the nation's leading information technology (IT) exporter. [1] [17] [18] Indian technological organisations ISRO, Infosys, Wipro and HAL are headquartered in the city. A demographically diverse city, Bangalore is the second fastest-growing major metropolis in India. [19] It is home to many educational and research institutions in India, such as Indian Institute of Science (IISc), Indian Institute of Management (Bangalore) (IIMB), National Institute of Fashion Technology, Bangalore, National Institute of Design, Bangalore (NID R&D Campus), National Law School of India University (NLSIU) and National Institute of Mental Health and Neurosciences (NIMHANS). Numerous state-owned aerospace and defence organisations, such as Bharat Electronics, Hindustan Aeronautics and National Aerospace Laboratories are located in the city. The city also houses the Kannada film industry.
text = ban.plaintext()
We will have to preprocess the text.
import nltk
import gensim
from gensim import corpora, models
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
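One setup note (an assumption about a fresh environment): sent_tokenize below relies on NLTK's punkt sentence tokenizer, which has to be downloaded once.
nltk.download('punkt')  ### one-time download of the punkt sentence tokenizer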
Let's first turn the text into a corpus of sentences.
corpus = nltk.tokenize.sent_tokenize(text)
We check the length of the corpus:
len(corpus)
20
We will have to tokenize each document into words.
token_corpus = [nltk.tokenize.word_tokenize(document) for document in corpus]
Let's inspect the first tokenized document:
token_corpus[0]
[u'For',
u'other',
u'uses',
u',',
u'see',
u'Bangalore',
u'(',
u'disambiguation',
u')',
u'.']
Convert all the tokens to lower case.
### turn each token to lower case
token_corpus = [[wrd.lower() for wrd in document] for document in token_corpus]
Now we will have to remove the stopwords, because they don't really add any value to our model.
stop_words = set('for a of the and is it major was to after in by are were which use or between i as with meanwhile'.split())
token_corpus = [[wrd for wrd in document if wrd not in stop_words]
                for document in token_corpus]
### let's check the first document again
token_corpus[0] ### good: 'is', 'of' and others have been removed
[u'other', u'uses', u'see', u'bangalore', u'disambiguation']
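As an alternative to a hand-picked list, NLTK ships a standard English stopword list; here is a minimal sketch of it (kept in separate variables so the outputs below still match, and it needs a one-time nltk.download('stopwords')):
from nltk.corpus import stopwords
### nltk.download('stopwords')  # one-time download
nltk_stop_words = set(stopwords.words('english'))
alt_corpus = [[wrd for wrd in document if wrd not in nltk_stop_words]
              for document in token_corpus]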
We will remove the full stops and other punctuation as well.
import string
token_corpus = [[wrd for wrd in document if wrd not in string.punctuation] for document in token_corpus]
### let's check again
token_corpus[0]
[u'other', u'uses', u'see', u'bangalore', u'disambiguation']
###### great, we have removed the dots from all the documents of the corpus
Let's make a dictionary out of the corpus so that each unique word gets an id.
dictionary = corpora.Dictionary(token_corpus)
2018-03-09 19:59:27,411 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-03-09 19:59:27,416 : INFO : built Dictionary(221 unique tokens: [u'wipro', u'technological', u'developed', u'over', u'settlements']...) from 20 documents (total 307 corpus positions)
print(dictionary.token2id)
{u'wipro': 175, u'technological': 174, u'developed': 106, u'over': 30, u'settlements': 143, u'dynasty': 87, u'entity': 108, u'its': 50, u'fifth': 23, u'nominally': 111, u'state-owned': 216, u'1949': 136, u'hal': 169, u'also': 217, u'had': 139, u'outside': 121, u'nimhans': 200, u'entities': 138, u'plateau': 40, u'under': 77, u'8': 20, u'has': 24, u'administrative': 96, u'silicon': 162, u'merged': 142, u'kingdom': 88, u'town': 125, u'1799': 95, u'elevation': 47, u'returned': 102, u'agglomeration': 21, u'ruler': 74, u'iimb': 189, u'dominions': 107, u'nation': 159, u'grew': 120, u'aerospace': 209, u'not': 8, u'neurosciences': 198, u'years': 94, u'electronics': 212, u'ruled': 73, u'school': 204, u'population': 31, u'level': 51, u'numerous': 215, u'university': 207, u'50': 83, u'cantonment': 118, u'iisc': 190, u'victory': 103, u'succession': 76, u'bengal\u016bru': 147, u'referred': 160, u'aeronautics': 208, u'3,000': 43, u'because': 155, u'old': 112, u'karnataka': 14, u'ten': 33, u'be': 5, u'existing': 149, u'mar\u0101th\u0101s': 89, u'national': 197, u'nlsiu': 201, u'up': 126, u'hindustan': 213, u'see': 3, u'maharaja': 101, u'design': 184, u'sea': 53, u'ft': 48, u'empire': 60, u'established': 61, u'vijayanagara': 79, u'megacity': 27, u'research': 203, u'mud': 70, u'state': 18, u'indian': 13, u'health': 187, u'captured': 85, u'above': 45, u'capital': 12, u'new': 133, u'bharat': 210, u'officially': 17, u'cholas': 57, u'foundation': 64, u'metropolis': 179, u'gangas': 65, u'existed': 109, u'headquartered': 170, u'sold': 92, u'million': 28, u'houses': 219, u'bangalore': 0, u'cities': 46, u'science': 205, u'on': 39, u'900': 44, u'hoysalas': 67, u'gowd\u0101': 66, u'region': 72, u'british': 98, u'1537': 54, u'became': 129, u'fastest-growing': 178, u'conquered': 86, u'south': 75, u'/b\xe6\u014b\u0261\u0259\u02c8l\u0254\u02d0r/': 9, u'into': 141, u'declared': 148, u'ce': 56, u'1947': 127, u'd': 183, u'feudal': 62, u'mental': 196, u'city': 22, u'management': 194, u'second': 180, u'1638': 82, u'leading': 158, u'sovereign': 115, u'name': 151, u'their': 124, u'bengaluru': 11, u'organisations': 173, u'valley': 165, u'until': 78, u'listen': 16, u'shifted': 123, u'urban': 35, u'``': 154, u'raj': 114, u'formed': 131, u'part': 122, u'western': 80, u'known': 15, u'highest': 49, u'2006': 146, u'present': 71, u'fort': 63, u'mughals': 90, u'made': 110, u'15': 36, u'14': 10, u'17': 167, u'16': 55, u'19': 181, u'18': 168, u'official': 152, u'\u2013': 81, u'war': 104, u'considered': 58, u'r': 202, u'kemp\xe9': 68, u'many': 195, u'following': 130, u'making': 26, u'mangalore': 7, u'control': 99, u'fashion': 186, u'remained': 134, u'almost': 84, u'disambiguation': 1, u'modern': 69, u'india': 25, u'confused': 6, u'kannada': 150, u"''": 153, u'single': 144, u'diverse': 177, u'governed': 119, u'home': 188, u'infosys': 171, u'technology': 164, u'campus': 182, u'film': 218, u'around': 117, u'information': 157, u'educational': 185, u'\u02c8be\u014b\u0261\u0259\u026du\u02d0\u027eu': 19, u'mysore': 91, u'when': 135, u'1': 166, u'located': 38, u'other': 2, u'role': 161, u'fourth': 100, u'populous': 32, u'dynasties': 59, u'independent': 140, u'exporter': 156, u"'s": 42, u'princely': 113, u'centre': 137, u'wadiyar': 93, u'who': 105, u'southern': 41, u'nid': 199, u'anglo-mysore': 97, u'most': 29, u'defence': 211, u'uses': 4, u'two': 145, u'such': 206, u'law': 193, u'demographically': 176, u'institute': 191, u'third': 34, u'deccan': 37, u'industry': 220, u'sometimes': 163, u'm': 52, u'1809': 116, u'1956': 128, u'independence': 132, 
u'laboratories': 214, u'institutions': 192, u'isro': 172}
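Notice in the id map that tokens such as ``, '' and – survived the punctuation filter: string.punctuation only lists single ASCII characters, so multi-character and non-ASCII punctuation slips through. A stricter sketch (not applied here, so the outputs below stay as shown) keeps only tokens containing at least one alphanumeric character:
### not applied here: drop any token with no alphanumeric character
cleaned_corpus = [[wrd for wrd in document if any(ch.isalnum() for ch in wrd)]
                  for document in token_corpus]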
Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR), also known as the vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. (Source)
corpus_bow = [dictionary.doc2bow(document) for document in token_corpus]
print(corpus_bow)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1)], [(0, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 1), (31, 1), (32, 2), (33, 1), (34, 1), (35, 1)], [(25, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)], [(25, 1), (30, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1)], [(0, 2), (5, 1), (13, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 2)], [(0, 1), (22, 1), (73, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1)], [(22, 1), (85, 1), (91, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1)], [(12, 1), (18, 1), (22, 1), (91, 2), (98, 1), (101, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1), (111, 1), (112, 1), (113, 1), (114, 1), (115, 1)], [(0, 1), (22, 1), (25, 1), (98, 2), (112, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 1), (122, 1), (123, 1), (124, 1), (125, 1), (126, 1)], [(0, 1), (12, 2), (13, 1), (14, 1), (18, 2), (25, 1), (42, 1), (91, 1), (127, 1), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1)], [(0, 1), (22, 1), (35, 2), (81, 2), (106, 1), (118, 1), (136, 1), (137, 1), (138, 1), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1)], [(22, 1), (146, 1), (147, 1), (148, 1), (149, 1), (150, 1), (151, 2), (152, 1)], [(0, 1), (12, 1), (25, 2), (42, 1), (50, 1), (153, 2), (154, 2), (155, 1), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 1)], [(13, 1), (22, 1), (166, 1), (167, 1), (168, 1), (169, 1), (170, 1), (171, 1), (172, 1), (173, 1), (174, 1), (175, 1)], [(0, 1), (22, 1), (25, 1), (176, 1), (177, 1), (178, 1), (179, 1), (180, 1)], [(0, 3), (13, 2), (25, 2), (164, 1), (181, 1), (182, 1), (183, 1), (184, 1), (185, 1), (186, 1), (187, 1), (188, 1), (189, 1), (190, 1), (191, 5), (192, 1), (193, 1), (194, 1), (195, 1), (196, 1), (197, 4), (198, 1), (199, 1), (200, 1), (201, 1), (202, 1), (203, 1), (204, 1), (205, 1), (206, 1), (207, 1)], [(22, 1), (38, 1), (173, 1), (197, 1), (206, 1), (208, 1), (209, 2), (210, 1), (211, 1), (212, 1), (213, 1), (214, 1), (215, 1), (216, 1)], [(22, 1), (150, 1), (217, 1), (218, 1), (219, 1), (220, 1)]]
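Each pair is (token id, count). To make this readable we can map the ids back through the dictionary; for example (0, 1) in the first document says token 0, u'bangalore', appears once:
### map (id, count) pairs back to (word, count) for the first document
print [(dictionary[word_id], count) for word_id, count in corpus_bow[0]]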
Term Frequency-Inverse Document Frequency (TF-IDF)
tfidf = models.TfidfModel(corpus_bow)
2018-03-09 19:59:30,516 : INFO : collecting document frequencies
2018-03-09 19:59:30,518 : INFO : PROGRESS: processing document #0
2018-03-09 19:59:30,521 : INFO : calculating IDF weights for 20 documents and 220 features (281 matrix non-zeros)
tfidf_corpus = tfidf[corpus_bow]
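TF-IDF re-weights each raw count by how rare a word is across the corpus: words that occur in many sentences get a low weight, while distinctive words get a high one. A quick sketch to inspect the weights of the first document:
### inspect the TF-IDF weights of the first document
for word_id, weight in tfidf[corpus_bow[0]]:
    print dictionary[word_id], round(weight, 3)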
Building the model
Latent Semantic Indexing
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts. (Source)
lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[tfidf_corpus] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
2018-03-09 19:59:32,597 : INFO : using serial LSI version on this node
2018-03-09 19:59:32,599 : INFO : updating model with new documents
2018-03-09 19:59:32,620 : INFO : preparing a new chunk of documents
2018-03-09 19:59:32,628 : INFO : using 100 extra samples and 2 power iterations
2018-03-09 19:59:32,632 : INFO : 1st phase: constructing (221, 102) action matrix
2018-03-09 19:59:32,635 : INFO : orthonormalizing (221, 102) action matrix
2018-03-09 19:59:32,715 : INFO : 2nd phase: running dense svd on (102, 20) matrix
2018-03-09 19:59:32,721 : INFO : computing the final decomposition
2018-03-09 19:59:32,737 : INFO : keeping 2 factors (discarding 87.165% of energy spectrum)
2018-03-09 19:59:32,740 : INFO : processed documents up to #20
2018-03-09 19:59:32,742 : INFO : topic #0(1.183): 0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
2018-03-09 19:59:32,745 : INFO : topic #1(1.081): -0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
Checking the result:
print lsi.print_topics(2)
2018-03-09 19:59:33,669 : INFO : topic #0(1.183): 0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
2018-03-09 19:59:33,672 : INFO : topic #1(1.081): -0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
[(0, u'0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"'), (1, u'-0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"')]
It returns a list; we can print the topics one by one, and we can even split the strings if we like.
topic = lsi.print_topics(2)
2018-03-09 19:59:34,755 : INFO : topic #0(1.183): 0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
2018-03-09 19:59:34,763 : INFO : topic #1(1.081): -0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
print " Topic 1:\n"
print topic[0][1]
Topic 1:
0.300*"state" + 0.274*"capital" + 0.259*"mysore" + 0.203*"british" + 0.179*"karnataka" + 0.152*"maharaja" + 0.141*"old" + 0.139*"indian" + 0.125*"bangalore" + 0.122*"sovereign"
print " Topic 2:\n"
print topic[1][1]
Topic 2:
-0.226*"british" + 0.178*"institute" + 0.158*"national" + -0.147*"maharaja" + -0.140*"captured" + -0.136*"mysore" + 0.135*"karnataka" + 0.128*"indian" + 0.128*"aerospace" + 0.126*"capital"
Let's check the keywords of our search term.
print gensim.summarization.keywords(ban.plaintext())
bangalore
indian state
urban
city
cities
nation
national
technology
mysore
technological organisations
india
entity
entities
film
institutions
institute
university
british
anglo
isro
SUMMARY
The model can be fine-tuned by training our own POS tagger and making sure that words which are not nouns don't appear, since a topic is usually a noun or a noun phrase; see the sketch below.
The model result is still reasonably okay, even though I had to hard-code some words that initially appeared in the topics and make sure they were removed, like "after", "was", etc.
The model is still okay, since all the words that contribute to the topics have something to do with Bangalore, like "capital", "mysore", "indian", etc.
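A minimal sketch of that noun filter, using NLTK's built-in tagger instead of training our own (it needs a one-time nltk.download('averaged_perceptron_tagger'); tags starting with NN mark nouns):
### keep only nouns before building the dictionary
### nltk.download('averaged_perceptron_tagger')  # one-time download
noun_corpus = [[wrd for wrd, tag in nltk.pos_tag(document)
                if tag.startswith('NN')]
               for document in token_corpus]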