Nowadays, the term artificial intelligence (AI) is not new to anybody. "Intelligent" devices and algorithms are everywhere – as digital assistants or chatbots, for example. The notion that an algorithm can handle human language – a very complex construct – has always baffled me. After all, it is just code written by someone – and some very complicated maths. Can a machine truly "learn"? Can it "understand" language?
When I started my bachelor project as a design student, I knew nearly nothing about how machine learning algorithms work. I wanted to explore what "learning" means for a machine – and whether it in any way resembles what humans do. My dataset consists of so-called "word embeddings": each word is assigned a multidimensional vector. By comparing the vectors in multidimensional space, you also compare the relationships of those words to each other.
In my understanding, word embeddings show an AI's perspective on human language – its "vocabulary", if you will. By visualising a dataset trained by a machine learning algorithm, I got a better understanding of the possibilities and limitations of AI. And that it is only maths after all – but a powerful tool nonetheless.
Here are my five interactive data visualizations using a word embedding dataset trained on TED Talk transcripts. If you want to explore them right away, go ahead:
Otherwise, just keep scrolling! ;)
(1/8) A machine learning algorithm called Word2Vec was used for this project. It needs continuous text as training data – in this case, transcripts of over 4,000 TED Talks.
(2/8) Before the training process, each word in every sentence is assigned its context words – the words standing directly before or after it in the sentence. These pairs are later used as input for the machine learning algorithm.
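This pairing step can be sketched in a few lines of Python. The window size of two words on each side is an assumption for illustration; the window actually used in the project is not stated.

```python
# Extract (center word, context word) pairs from a tokenised sentence.
# `window` is the number of neighbours taken on each side (assumed: 2).
def context_pairs(sentence, window=2):
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs
```

For the sentence `["many", "people", "watch", "ted", "talks"]`, the word "people" is paired with "many", "watch" and "ted".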
(3/8) A so-called weight matrix, initially filled with random numbers, contains one column for each intended dimension of the word embedding – in my case, 100 dimensions. Each row represents an individual word from the training text (the TED Talks).
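Setting such a matrix up is straightforward. The sizes below come from the numbers in this post (about 7,500 words, 100 dimensions); the initialisation range is an assumption.

```python
import random

vocab_size, dims = 7500, 100  # one row per word, one column per dimension

# Initially just random numbers; training will gradually give them meaning.
weights = [[random.uniform(-0.5, 0.5) for _ in range(dims)]
           for _ in range(vocab_size)]
```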
(4/8) The machine training starts: the algorithm gets some context words as input (for simplification, only one word is represented in the sketch) and selects the equivalent rows in the matrix. That's actually already the 100-dimensional vector of the word. Right now, it has no meaning because the numbers were randomly initialised. This will change.
(5/8) A second weight matrix is introduced. Its rows also represent the individual words of the TED Talks. Multiplying the selected row of the first matrix with a row of the second matrix yields a score for how likely those two words are to appear next to each other in a sentence. This is done for each row in the second matrix.
(6/8) The result is a probability for every row (every word), and the probabilities sum up to 1, or 100%. In our example, the algorithm is supposed to predict the word "people" but calculated a probability of only 30%. This value should be increased.
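The conversion from raw row products to probabilities that sum to 1 is typically done with a softmax function. A toy sketch with three 3-dimensional rows standing in for thousands of 100-dimensional ones:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the maximum for numerical stability, then normalise.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values: the selected input row and three rows of the second matrix.
hidden = [0.2, -0.1, 0.4]
output_rows = [[0.5, 0.1, 0.0], [-0.3, 0.2, 0.6], [0.1, 0.1, 0.1]]
probs = softmax([dot(hidden, row) for row in output_rows])
```

`probs` now holds one probability per word, and they sum to exactly 1.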
(7/8) The last step is called back-propagation and is responsible for the "learning" process. It updates the numbers in the weight matrices to better match the intended outcome. This causes the 100-dimensional vectors of the words to change.
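A heavily simplified sketch of such an update, assuming cross-entropy loss over the softmax output (only the second matrix is updated here; real Word2Vec also updates the first):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def update_output_rows(output_rows, hidden, probs, target, lr=0.5):
    # Error per row: predicted probability minus 1 for the true context word.
    # Moving each row against this error nudges the prediction towards 1.
    for i, row in enumerate(output_rows):
        error = probs[i] - (1.0 if i == target else 0.0)
        for d in range(len(row)):
            row[d] -= lr * error * hidden[d]

hidden = [0.2, -0.1, 0.4]
output_rows = [[0.5, 0.1, 0.0], [-0.3, 0.2, 0.6], [0.1, 0.1, 0.1]]
before = softmax([dot(hidden, r) for r in output_rows])
update_output_rows(output_rows, hidden, before, target=1)
after = softmax([dot(hidden, r) for r in output_rows])
```

After the update, the probability of the intended word (`after[1]`) is higher than it was before.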
(8/8) These steps are repeated for every extracted word combination in the training data. The whole process also runs for several iterations – 20 in my case. With each iteration, the predictions get better and the word vectors gain meaning when compared to each other.
The transcripts contained over 90,000 individual words. For the following tools, I only considered words that were used more than 50 times.
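The frequency filter itself boils down to a word count – a minimal sketch, with the project's threshold of 50 as the default:

```python
from collections import Counter

def frequent_words(tokens, min_count=50):
    # Keep only words that occur more than `min_count` times.
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count > min_count}
```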
I now have a dataset consisting of about 7,500 words – each represented by a 100-dimensional vector. I can't visualize multidimensional space. So why not show every axis separately?
Since my word embeddings were trained on TED Talks, I also have information about the talks. Each of them has tags referring to its theme and content, e.g. "education" or "technology". I assumed that many of those terms could also be found in my dataset of words. If so, I could create a subset of only TED tags and compare their vectors to each other.
After doing so, I experimented with finding the most similar TED tag for each word of my whole dataset using cosine similarity. Maybe that’s a way of automatically sorting words by theme?
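A sketch of that lookup, using the tag names from this post but made-up 2-D vectors in place of the real 100-dimensional embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity: 1 for vectors pointing the same way, 0 for orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D stand-ins for the real 100-dimensional tag vectors.
tags = {"education": [1.0, 0.1], "technology": [0.0, 1.0]}

def nearest_tag(word_vector, tags):
    # The tag whose vector has the highest cosine similarity to the word.
    return max(tags, key=lambda t: cosine(word_vector, tags[t]))
```

A word vector pointing roughly the same way as "education", e.g. `[0.9, 0.2]`, gets sorted under that tag.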
Until now, I have only looked at the closeness of data points in multidimensional space. "Similarity" can mean a lot of things in this context. I would love to look at specific word relationships.
"good / bad“, "happy / unhappy“ and "safe / dangerous“ are word pairs with a similar relationship between two words. Same goes for "say / said“ and "go / went“. By finding examples like that, I could calculate how to get from one word to its significant other regarding a specific word relationship which could be of contextual or grammatical nature.