Building a bespoke language model for student feedback

While investigating how Natural Language Processing (NLP) and machine learning tools could help us extract and navigate our student comments data, it became apparent that many of these tools fall short on this type of text. Student feedback has a very specific style, vocabulary and context, which can hinder the effectiveness of generic tools and pre-trained machine learning models. Given our access to a large corpus of student feedback comments, we decided to turn the problem around and see if we could use our own data to build new tools.

One of the issues at the very foundation of Natural Language Processing is how to represent words and groupings of words in a format to which mathematical algorithms can be applied. One such representation, developed at Google, is the Word2Vec model, which represents words as “high-dimensional vectors”. That is, each word is represented by a set of numbers (usually a couple of hundred), and these numbers define how a word fits in with other words in the language. This representation is learned entirely from the words and their placement relative to other terms in the corpus of text used to train the model. It is a purely data-driven machine learning approach, with no input knowledge or rules about the language itself.
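To make "placement relative to other terms" concrete, the sketch below shows one way of turning a tokenised comment into (centre word, context word) pairs, which is the kind of raw material a skip-gram style Word2Vec model learns from. It is a minimal illustration only: the function name `context_pairs`, the window size and the example comment are all assumptions made for this sketch, and the text above does not say which Word2Vec variant or window size was actually used.

```python
def context_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word within `window` positions.

    Hypothetical helper for illustration; not part of any library.
    """
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


# Invented example comment, not taken from the survey data.
comment = "the lecture linked theory to practice".split()
pairs = context_pairs(comment, window=1)

for centre, context in pairs:
    print(centre, "->", context)
```

During training, the model adjusts each word's vector so that words sharing many of the same context words end up with similar vectors. No grammar or dictionary is involved at any point.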

To train our own Word2Vec model, we used 250,000 student comments from the past four years of subject-level student feedback survey data. As a generic comparison model, we used a popular open-source Word2Vec model available in the Python Natural Language Toolkit (NLTK) package. This model was trained on a 100-billion-word corpus of Google News stories. While that corpus is far larger than ours, it is also less domain-specific, and the examples below demonstrate the effect of this difference.

The mathematical representation of words in Word2Vec makes it straightforward to compute a measure of the similarity between words (or terms), and we use some examples of this computed similarity to compare the two models.
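A common choice for this similarity measure is the cosine of the angle between two word vectors. The sketch below shows how a "most similar terms" ranking can be produced from it. The helper names and the three-dimensional vectors are made up purely for illustration (real Word2Vec vectors have a couple of hundred dimensions), and cosine similarity is assumed here rather than stated in the text above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Rank every other term by its cosine similarity to `query`."""
    q = vectors[query]
    scored = [
        (term, cosine_similarity(q, vec))
        for term, vec in vectors.items()
        if term != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "lecture": [0.9, 0.1, 0.0],
    "seminar": [0.8, 0.2, 0.1],
    "tutorial": [0.7, 0.3, 0.0],
    "invasion": [0.0, 0.1, 0.9],
}

ranked = most_similar("lecture", toy_vectors, topn=2)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

A score close to 1 means two terms point in nearly the same direction in the vector space, and a score near 0 means they are unrelated as far as the model is concerned.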

Table 1: Comparison of the most similar terms computed from the two Word2Vec models for the terms “occupation” and “real”. This example clearly demonstrates an advantage of a domain-specific model. In a university setting, the correct context for “occupation” is a job or line of work, and our bespoke model picks up this context correctly. The more generic model, however, relates these terms to military occupation, with the most similar words including imperialism, tyranny and war.

Rank  Most similar terms (our model:          Most similar terms (comparison model:
      trained on 250,000 comments from        trained on 100 billion words from
      the Student Feedback Survey)            Google News)
1     future_job                              occupation
2     work_environment                        war
3     work_place                              oppression
4     project_manager                         imperialism
5     legal_practitioner                      subjugation
6     workplaces                              genuine
7     may_face                                tyranny
8     working_environment                     imperialist
9     real_situation                          actual
10    global_business                         oppressors
11    real_world                              colonialism
12    care_women                              colonialist
13    intended_profession                     liberation
14    customers                               dispossession
15    real_project                            disengagement
16    managers                                postwar
17    insight_real_world                      Zionism
18    planners                                profoundest
19    humanity                                militarism
20    professional_career                     invasion


Table 2: Comparison of the most similar terms for “lecture” and “authentic”. Even for less ambiguous terms such as these, the bespoke model can arguably outperform the larger but more generic one. Both models give good in-context results, but the purpose-trained model returns more detailed and specific terms (e.g. real-world_cases, class_debates, story_telling) that make sense in the context of student feedback.

Rank  Most similar terms (our model:          Most similar terms (comparison model:
      trained on 250,000 comments from        trained on 100 billion words from
      the Student Feedback Survey)            Google News)
1     interaction_class                       lectures
2     real-world_cases                        authentically
3     class_debates                           lecture
4     practice_theory                         lectures
5     story_telling                           presentation
6     easy_remember                           contemporary
7     economic_models                         colloquium
8     us_think_critically                     authenticity
9     pragmatic                               seminar
10    class_discussion                        informative
11    lecture_tutorial                        presentations
12    indigenous_perspective                  symposium
13    relaxed_learning_environment            enlightening
14    throughout_lecture                      seminar
15    active_learning                         oration
16    robust                                  travelogue
17    inviting                                sermon
18    worldly                                 intellectuality
19    real-world_experience                   storyteller
20    lectures_bit_boring                     timeless