Set Expansion

Given seed terms, expand the set

About Set Expansion

Some seed terms belonging to a set are given as input. The task is to expand the set by finding more elements that belong to it. Phrases are also supported as seed terms.

For example,

Input - Python, Perl, Java

Output - C++, JavaScript, PHP, Ruby, etc.

Crawling

The first phase of the project involves crawling webpages to gather the information to work on. The seed terms are passed to a search API, which returns search results in the form of web URLs. These pages are then downloaded, parsed and indexed. We can then apply the techniques described in the following sections to find the required terms.
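The parse-and-index step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual crawler: the `pages` dictionary stands in for pages already downloaded from search results, and the URLs, `TextExtractor`, `tokenize` and `build_index` are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize(html):
    """Strip markup and split the remaining text into lowercase tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # '+' and '#' are kept so that terms such as "c++" and "c#" survive.
    return re.findall(r"[a-z0-9+#]+", " ".join(parser.chunks).lower())


def build_index(pages):
    """pages: dict mapping URL -> raw HTML. Returns token -> set of URLs."""
    index = defaultdict(set)
    for url, html in pages.items():
        for token in tokenize(html):
            index[token].add(url)
    return index


# Stand-in for downloaded search results (hypothetical URLs and content).
pages = {
    "http://example.com/a": "<html><body><p>Python and Perl</p>"
                            "<script>var x;</script></body></html>",
    "http://example.com/b": "<ul><li>Java</li><li>Python</li><li>C++</li></ul>",
}
index = build_index(pages)
print(sorted(index["python"]))
```

An inverted index of this kind makes it cheap to look up which downloaded pages mention a given seed term before applying the heavier techniques.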

Word2Vec

Word2vec is a powerful tool that represents each word as a vector in a vector space. The similarity between any two words can be calculated from their corresponding vectors in this space. The Gensim module was used to build a model of about 1.7 GB by training on a Wikipedia dump of around 13 GB.
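To make the similarity idea concrete, here is a small sketch that ranks candidate words by their mean cosine similarity to the seed vectors. It deliberately does not use Gensim or the trained model: the 3-dimensional vectors are invented for illustration, and averaging over the seeds is an assumption about how several seeds might be combined, not necessarily what the project does.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand(seeds, vectors, topn=3):
    """Rank non-seed words by mean cosine similarity to the seed vectors."""
    scores = {}
    for word, vec in vectors.items():
        if word in seeds:
            continue
        scores[word] = sum(cosine(vec, vectors[s]) for s in seeds) / len(seeds)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, invented for illustration only.
vectors = {
    "python": [0.9, 0.1, 0.0],
    "perl":   [0.8, 0.2, 0.1],
    "java":   [0.85, 0.15, 0.05],
    "ruby":   [0.88, 0.12, 0.02],
    "banana": [0.0, 0.9, 0.4],
    "guitar": [0.1, 0.1, 0.95],
}
ranked = expand(["python", "perl", "java"], vectors)
print([word for word, _ in ranked])
```

With these toy vectors, 'ruby' comes out on top because its vector points in nearly the same direction as the three seeds. A real model has hundreds of dimensions, but the ranking principle is the same.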

Pattern Recognition

The other technique is pattern recognition. Here, we find webpages where the seed terms appear in a pattern (for example, a table or a list). These patterns are recognized, and the terms that are found within them but do not belong to our seed set might complete the set. These patterns may also be referred to as wrappers, since they wrap the elements of a set. Unlike word2vec, this technique does not require a training dataset and hence can find more terms. However, the reliability of the data it finds is a concern.
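One simple way to approximate this idea is to treat every HTML list or table as a candidate wrapper and keep the ones that contain several seed terms. The sketch below does that with the standard library only. The `GroupExtractor` class, the `candidates` function and the `min_hits` threshold are assumptions made for this example and are not taken from the project.

```python
from html.parser import HTMLParser


class GroupExtractor(HTMLParser):
    """Collects the text items of each <ul>, <ol> or <table> as one group."""

    CONTAINERS = ("ul", "ol", "table")
    ITEMS = ("li", "td", "th")

    def __init__(self):
        super().__init__()
        self.groups = []    # finished groups, one list of items each
        self._stack = []    # currently open containers (handles nesting)
        self._text = None   # text buffer for the item being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self._stack.append([])
        elif tag in self.ITEMS and self._stack:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.ITEMS and self._text is not None:
            item = " ".join("".join(self._text).split())
            if item and self._stack:
                self._stack[-1].append(item)
            self._text = None
        elif tag in self.CONTAINERS and self._stack:
            self.groups.append(self._stack.pop())


def candidates(html, seeds, min_hits=2):
    """Return non-seed items from groups containing >= min_hits seeds."""
    parser = GroupExtractor()
    parser.feed(html)
    parser.close()
    seed_set = {s.lower() for s in seeds}
    found = []
    for group in parser.groups:
        lowered = [item.lower() for item in group]
        hits = sum(1 for item in lowered if item in seed_set)
        if hits >= min_hits:
            for item in lowered:
                if item not in seed_set and item not in found:
                    found.append(item)
    return found


# Hypothetical page: a navigation list, a list of languages, and a table.
html = """
<ul><li>Home</li><li>About</li></ul>
<ul><li>Python</li><li>Perl</li><li>Ruby</li><li>PHP</li></ul>
<table><tr><td>Java</td><td>C++</td></tr></table>
"""
print(candidates(html, ["Python", "Perl", "Java"]))  # ['ruby', 'php']
```

Requiring at least two seeds per group filters out the navigation list and the table with a single seed, which is one cheap guard against the reliability problem mentioned above.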

Hybrid Model

A hybrid model was used. We took the top terms from word2vec and, to increase accuracy, lifted the rank of output terms that are also recognized by the pattern recognition technique. For example, 'fruit' is related to both 'mango' and 'banana' according to word2vec, but 'strawberry', 'orange', 'apple', etc. are what complete their set. It is very unlikely that 'fruit' would be present in the same list or table as 'mango' or 'banana' on a website, whereas other fruits are likely to be. Conversely, word2vec helps ensure the reliability of the data found by pattern recognition.
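The re-ranking step might look something like the sketch below. The text does not state how the rank is lifted, so the additive `boost` here is purely an assumption, as are the function name `hybrid_rank` and the invented scores, which echo the fruit example.

```python
def hybrid_rank(w2v_scores, pattern_terms, boost=0.5):
    """Re-rank word2vec candidates, lifting those also found in patterns.

    w2v_scores: dict mapping term -> word2vec similarity score.
    pattern_terms: set of terms extracted by pattern recognition.
    boost: hypothetical additive bonus; the actual rule is not documented.
    """
    combined = {
        term: score + (boost if term in pattern_terms else 0.0)
        for term, score in w2v_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Invented scores for illustration only.
w2v_scores = {"fruit": 0.80, "apple": 0.70, "orange": 0.65, "tree": 0.60}
pattern_terms = {"apple", "orange", "strawberry"}
print(hybrid_rank(w2v_scores, pattern_terms))
# ['apple', 'orange', 'fruit', 'tree']
```

Only terms that word2vec already proposed are ranked, so a pattern-only term such as 'strawberry' in this toy data is not admitted on its own; that is how word2vec acts as the reliability filter in this sketch.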

Sample Inputs and Outputs

Input :

Cricket, Football, Volleyball

Output :

1 : rugby

2 : hockey

3 : squash

4 : bowling

5 : tennis

6 : sport

7 : soccer

8 : boxing

9 : baseball

10 : darts

Input (phrase queries) :

A hot potato, Best of both worlds, Once in a blue moon

Output :

1 : couch potato

2 : kill two birds with one stone

3 : cut corners

4 : miss the boat

5 : hear it on the grapevine

6 : sit on the fence

7 : see eye to eye

8 : costs an arm and a leg

9 : add insult to injury

10 : a penny for your thoughts

Tags

#IIITH #IIIT-H #MajorProject #IRE #InformationRetrievalAndExtraction #SetExpansion #Word2Vec #PatternRecognition #Wrappers #Crawling #SearchAPI #Google #Bing #DuckDuckGo #Wiki #Faroo #Twitter #WebHose #StackOverflow #Parsing #Indexing #StopWordsRemoval #Similarity

Contributors

Sandeep Kasa (@sandeepkasa), Deep Jahan Grewal (@deepjahan) and Romil Punetha (@romilpunetha)