Building a bespoke language model for student feedback

While investigating Natural Language Processing (NLP) and machine learning tools for extracting and navigating our student comments data, we found that many tools fall short on this type of text. Student feedback has a very specific style, vocabulary and context, which can hinder the effectiveness of generic tools and pre-trained machine learning models. Given our access to a large corpus of student feedback comments, we decided to turn the problem around and see if we could use our data to build new tools.

One of the foundational problems in Natural Language Processing is how to represent words and groups of words in a format to which mathematical algorithms can be applied. One such representation, developed at Google, is the Word2Vec model, which represents words as “high-dimensional vectors”. That is, each word is represented by a set of numbers (usually a couple of hundred), and these numbers define how the word fits in with other words in the language. This representation is learned entirely from the words and their placement relative to other terms in the corpus of text used to train the model. It is a purely data-driven machine learning approach with no input knowledge or rules about the language itself.
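To make "placement relative to other terms" concrete, the following is a minimal sketch of how word and context pairs can be extracted from a tokenised comment, which is the kind of raw signal a skip-gram style Word2Vec model learns from. The function name `context_pairs`, the window size and the example comment are illustrative assumptions, not details of our pipeline, and the sketch covers only pair extraction, not the training of the vectors.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical comment, already lower-cased and tokenised.
comment = ["the", "lecture", "was", "engaging"]
pairs = context_pairs(comment, window=1)
print(pairs)
```

Words that repeatedly appear with the same neighbours across many comments end up with similar vectors, which is why the training corpus has such a strong influence on the result.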

To train our own Word2Vec model, we used 250,000 student comments from the past four years of subject-level student feedback survey data. As a generic comparison model, we used a popular open-source Word2Vec model available in the Python Natural Language Toolkit (NLTK) package. This model was trained on a 100-billion-word corpus of Google News stories. While this corpus is far larger than the one behind our bespoke model, it is also less domain-specific, and we demonstrate the effect of this difference in the examples below.

The mathematical representation of words in Word2Vec makes it straightforward to compute a measure of the similarity between words (or terms), and we use some examples of this computed similarity to compare the two models.
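A common similarity measure for word vectors is cosine similarity, the cosine of the angle between two vectors. The sketch below is a minimal illustration under stated assumptions: the three-dimensional vectors are made up for the example (real Word2Vec vectors have hundreds of dimensions), and the helper names `cosine_similarity` and `most_similar` are ours rather than part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "lecture": [0.9, 0.1, 0.2],
    "tutorial": [0.8, 0.2, 0.3],
    "war": [0.1, 0.9, 0.1],
}


def most_similar(term, vectors, topn=2):
    """Rank every other term by cosine similarity to `term`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar("lecture", toy_vectors)
print(ranked)
```

With these toy values, "tutorial" ranks above "war" for the query "lecture" because its vector points in nearly the same direction.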

Table 1: Comparison of the terms that each Word2Vec model computes as most similar to the terms “occupation” and “real”. This example clearly demonstrates an advantage of a domain-specific model. In a university setting, the correct context for “occupation” is a job or work, and our bespoke model picks up this context correctly. The more generic model, however, relates the term to military occupation, with the most similar words including imperialism, tyranny and war.

Rank	Most similar terms (our model: trained on 250,000 comments from the Student Feedback Survey)	Most similar terms (comparison model: trained on 100 billion words from Google News)
1	future_job	occupation
2	work_environment	war
3	work_place	oppression
4	project_manager	imperialism
5	legal_practitioner	subjugation
6	workplaces	genuine
7	may_face	tyranny
8	working_environment	imperialist
9	real_situation	actual
10	global_business	oppressors
11	real_world	colonialism
12	care_women	colonialist
13	intended_profession	liberation
14	customers	dispossession
15	real_project	disengagement
16	managers	postwar
17	insight_real_world	Zionism
18	planners	profoundest
19	humanity	militarism
20	professional_career	invasion
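A query built from two terms, such as “occupation” and “real”, needs the two word vectors to be combined before ranking. We do not spell out the exact procedure here; one common approach, sketched below as an assumption, is to average the unit-length vectors of the query terms and rank the remaining vocabulary by cosine similarity to that average. The toy vectors and the helper names `unit`, `query_vector` and `rank_terms` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def query_vector(terms, vectors):
    """Average the unit-length vectors of the query terms."""
    units = [unit(vectors[t]) for t in terms]
    return [sum(column) / len(units) for column in zip(*units)]


def rank_terms(terms, vectors, topn=3):
    """Rank non-query terms by cosine similarity to the combined query."""
    q = unit(query_vector(terms, vectors))
    scores = []
    for word, vec in vectors.items():
        if word in terms:
            continue  # skip the query terms themselves
        scores.append((word, sum(a * b for a, b in zip(q, unit(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "occupation": [1.0, 0.0, 0.1],
    "real": [0.0, 1.0, 0.1],
    "real_world": [0.2, 0.9, 0.1],
    "future_job": [0.9, 0.3, 0.1],
    "exam": [0.1, 0.1, 1.0],
}
ranked = rank_terms(["occupation", "real"], toy_vectors)
print(ranked)
```

Because the query is an average, the ranking can mix terms close to either input word, which is consistent with a single ranked list containing both job-related and "real"-related terms.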

Table 2: Even for less ambiguous terms, it can be argued that the bespoke model is superior to the larger but more generic one. Here we search for terms similar to “lecture” and “authentic”. While both models give good in-context results, the purpose-trained model returns some more detailed and specific terms (e.g. real-world_cases, class_debates, story_telling) that make sense in the context of student feedback.

Rank	Most similar terms (our model: trained on 250,000 comments from the Student Feedback Survey)	Most similar terms (comparison model: trained on 100 billion words from Google News)
1	interaction_class	lectures
2	real-world_cases	authentically
3	class_debates	lecture
4	practice_theory	lectures
5	story_telling	presentation
6	easy_remember	contemporary
7	economic_models	colloquium
8	us_think_critically	authenticity
9	pragmatic	seminar
10	class_discussion	informative
11	lecture_tutorial	presentations
12	indigenous_perspective	symposium
13	relaxed_learning_environment	enlightening
14	throughout_lecture	seminar
15	active_learning	oration
16	robust	travelogue
17	inviting	sermon
18	worldly	intellectuality
19	real-world_experience	storyteller
20	lectures_bit_boring	timeless