The codebase (assuming you should have reached out by now!) is here
For CS 106X, Programming Abstractions Accelerated, we were assigned an open-ended final project with only the requirements of recursion and invocation of some non-trivial data structure. The project is partly inspired by my being on Wikipedia a lot lately for finding test corpuses for Quizkly. Wikipedia is largely a web of hyperlinks between different pages, each with its own set of hyperlinks appearing in contexts throughout the page.
I was inspired to build a “concept tree”, best illustrated with a
With some scripting and recursive calls, this can be automated. I was inspired to create a weighted trie (or Lexicon) data structure, because every “train of thought” (a tree path) would have some merit resembling that of a real train of thought (i.e. cardiovascular system -> heart -> aorta), with the amount of merit a function of the edge weights (i.e. how strong a parent to child relation is).
hypothetical scenario.Suppose you are cramming for a test for which a lot of breadth is covered, perhaps a biology exam. You search up an article on “Cardiovascular System” – a concept you are not familiar with – and get a paragraph hyperlinking to ten more concepts you are not familiar with, and so on.
What do you do? Seek another source (but what if you really do need to master everything about this topic)? Click them one by one?
Personally, my strategy would be to look for the “big idea” and go into the subtopics only when necessary, in order of relevance (i.e. the hyperlink to “Cardiologists” can wait).
This poses some big questions:
- For an unfamiliar topic, how do we know which topics are more relevant?
- When should I pause and read, and when should I click? How can I do this knowing I only have so much time?
I broke down my task into two challenges that can be worked on in modular fashion:
A parent-or-child function (topic1, topic2)A decision function between parent or child
A relation_discovery function (topic)A way to generate new relations relative to the topic (only children if it’s the origin topic) that can recursively (with backtracking) call itself to build new subtrees for subtopics
For the final project…
I wrote the function as the number of ratio of to and from references between topic1 and topic2, standardized to the length of the page. It makes sense that the child topic would refer to the parent more times than vice-versa.
Relations discovery was just the links of topic1’s Wikipedia page.
Because I knew the C++ implementation for a console application would take up most of my time, I didn’t put much effort to improve on those two functions, considering they worked relatively well.
After the class ended…
I was inspired to improve on the parent-or-child function from the ontological construction work here. Using Stanford’s OpenIE Triplet Extraction library, I can extract (subject, predicate, object) triplets from sentences in which either or both topics appear.
Then using Google News’ word2vec, I compared the predicate’s semantic meaning to words like “includes” and “contains” that epitomize parent-to-child subject-object relation. For robustness, I can take an average over multiple such relation representative.
However, to my linguistic surprise, I could hardly think of singleton words (can you?) that represents a child-to-parent relation. Instead, I needed to calculate appropriate vectors (an active research area!) for phrases like “a part of”, “is an example of”. To do this, I decided to train a TF-IDF model in gensim.
To train the TF-IDF model, I initially calculated document frequency using the contents of all linked pages from the topic page, and the term frequency from within the content the topic page.
I thought even as the tree recurses to topics multiple degrees separation away from the original topic, all the terms is viewed “in the context” of the original topic (indeed, scientific predicates like hypothesize varies sharply from hypothesize used in the context of, say).
However, even a moderately complex topic makes this take forever for marginal benefit (since we only care about predicate words, not the subject / object). Instead, I set the first doc as simply the page summary, additional docs being generated as consecutive text blocks each of the same number of sentences as the page summary.
After implementing triplet extraction with phrase2vec, I effectively have a function that extracts the relation of the subject and object for any sentence.
I attempted to extract all (subject, relation_to, object) triplets for a few topics by running my algorithm on all sentences in which the original topic appears, but realized the number of relations discovered to a given topic is underwhelming, because of various reasons among others:
I can remedy 4 by processing only on the page’s summary, which would require the calling of more linked pages that require a much longer runtime. I can remedy 1-3 by calling more pages that likely have references to the given topic, which would require a longer runtime still.
Whereas previously extracting all links as relations produced redundant topics, using only links to the topic page was far underwhelming, which leads me to where I am now.
As a compromise, I want to use relation_discovery on the original topic and selectively open up new links then backtracking from pages that don’t seem too relevant to the original topic. This is because it’s still slower to run triplet extraction than open up a new page. A new pipeline would look like:
- The sentence is compound the original topic didn’t play the role of subject or object, and so wasn’t picked up on.
- The sentence is compound and the original topic did play the role of subject or object, but the triplet extraction algorithm didn’t pick up on it.
- There are many sentences that refer to the original topic with pronouns like “it” or “this process” so new information is looked over.
- Some relations that are picked up on are too loose for the original topic.
- Call relation_discovery on original topic
- For links of above-threshold similarity to the original topic as computed from GoogleNews / custom domain word2vec
- Open up page
- Run some qualification test on whether the page is relevant, backtrack if fail
- i.e. number of references to topics on tree we’re building
- i.e. highest word2vec similarity scores for terms of low frequency
- Find sentences in which a subtopic on the tree we’re building appears as a subject or object
- Run parent_or_child