AI meets human language

Explorative perspectives on Word Embeddings in Natural Language Processing (NLP)

Nowadays, the term artificial intelligence (AI) is not new to anybody. "Intelligent" devices and algorithms are everywhere – as digital assistants or chatbots, for example. The notion that an algorithm can handle human language – a very complex construct – has always baffled me. After all, it is just code written by someone – and some very complicated maths. Can a machine truly "learn"? Can it "understand" language?

When I started my bachelor project as a design student, I knew nearly nothing about how machine learning algorithms work. I wanted to explore what "learning" means for a machine – and whether it in any way resembles what humans do. My dataset consists of so-called "word embeddings": each word is assigned a multidimensional vector. By comparing the vectors in multidimensional space, you also compare the relationships of those words to each other.

In my understanding, word embeddings show an AI's perspective on human language – its "vocabulary", if you will. By visualising a dataset trained by a machine learning algorithm, I got a better understanding of the possibilities and limitations of AI. And that it is only maths after all – but a powerful tool nonetheless.

Here are my five interactive data visualizations using a word embedding dataset trained on TED Talk transcripts. If you want to explore them right away, go ahead:

100D vectors
Cosine Similarity
Changing Expressions
TED tags
Word pairs

Otherwise, just keep scrolling! ;)


What are Word Embeddings?

Word embeddings are used in the field of Natural Language Processing (NLP). Machine learning algorithms work with mathematical functions – that is why language somehow has to be converted into numbers. You could just give every existing word a different random number – but that would be impractical. That is why word embeddings rely on different concepts of assigning fragments of language (e.g. words) numbers you can calculate with.
Imagine a person with two characteristics. Maybe their age and their wealth. They can now be placed on a cartesian coordinate system. 
If you add other people you can compare them to each other. Now imagine seeing the graph and not knowing what characteristics the axes represent. You can tell that two people are more similar to each other than others but you don’t know in which way. 
People are very complex – you could assign a high number of characteristics and represent each of them with a new axis. We have now entered multidimensional space which is hard to visualize.
Replacing people with words, you get word embeddings. Every word is represented by a number of coordinates in multidimensional space – a vector. Instead of semantics experts spending years cataloguing the different connotations and usages of words in text, a machine learning algorithm calculates those vectors in a few minutes or hours. But what does the result look like?
Word embeddings are based on the so-called distributional hypothesis, famously summarized by J.R. Firth: a word is defined by its context – the words that accompany it in a sentence. Two words that appear with similar context words should also be similar to each other.


How are word embeddings created?

Word embeddings are created during the process of machine training. They are rarely the end product but rather the means to calculate some sort of prediction. My word embedding dataset was trained on TED Talk transcripts with an existing algorithm called Word2Vec.

A big thank you goes out to Nils Freyer, M.Sc. (@ FH Aachen) for training the algorithm! The transcripts were provided in the
TED Talk Transcripts (2006-2021) dataset

(1/8) A machine learning algorithm called Word2Vec was used for this project. It needs continuous text as training data – in this case, the transcripts of over 4,000 TED Talks.

(2/8) Before the training process, each word in every single sentence is paired with its context words – the words that stand before or after it in the sentence. These pairs are later used as input for the machine learning algorithm.

(3/8) A so-called weight matrix, initially filled with random numbers, contains one column for each intended dimension of the word embedding – in my case, 100 dimensions. Each row represents an individual word from the training text (the TED Talks).

(4/8) The machine training starts: the algorithm gets some context words as input (for simplification, only one word is shown in the sketch) and selects the equivalent rows in the matrix. That is actually already the 100-dimensional vector of the word. Right now, it has no meaning because the numbers were randomly initialized. This will change.

(5/8) A second weight matrix is introduced. Its rows also represent the individual words of the TED Talks. By multiplying two rows, the probability that those words appear next to each other in a sentence is calculated. This is done for each row in the second matrix.

(6/8) The result is a probability for every row (every word), and the probabilities sum up to 1, or 100%. In our example, the algorithm is supposed to predict the word "people" but calculated a probability of only 30%. This value should be increased.

(7/8) The last step is called back-propagation and is responsible for the "learning" process. It updates the numbers in the weight matrices to better match the intended outcome. This causes the 100-dimensional vectors of the words to change.

(8/8) Those steps are repeated for every extracted word combination in the training data. The whole process also runs for several iterations – 20 in my case. With each iteration, the predictions get better and the word vectors gain meaning relative to each other.
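The eight steps above can be sketched in code. To be clear, this is not the project's actual training code – it is a minimal NumPy sketch with a made-up five-word vocabulary, a full softmax instead of Word2Vec's usual speed-ups, and an invented `train_step` helper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary standing in for the ~90,000 TED Talk words
vocab = ["we", "need", "more", "people", "here"]
word_to_id = {w: i for i, w in enumerate(vocab)}

V, D = len(vocab), 100                   # vocabulary size, embedding dimensions
W_in = rng.normal(size=(V, D)) * 0.01    # first weight matrix (the embeddings)
W_out = rng.normal(size=(V, D)) * 0.01   # second weight matrix

def train_step(context_word, target_word, lr=0.1):
    c, t = word_to_id[context_word], word_to_id[target_word]
    v = W_in[c]                          # step 4: pick the row = current word vector
    scores = W_out @ v                   # step 5: multiply against every row
    probs = np.exp(scores) / np.exp(scores).sum()  # step 6: probabilities, sum to 1
    # step 7: back-propagation – nudge both matrices toward the intended outcome
    grad = probs.copy()
    grad[t] -= 1.0                       # error: predicted minus intended (1 for target)
    W_in[c] -= lr * (W_out.T @ grad)
    W_out[:] -= lr * np.outer(grad, v)
    return probs[t]                      # probability of the target before this update

# step 8: repeat for every (context, target) pair, over several iterations
before = train_step("need", "people")
for _ in range(20):
    train_step("need", "people")
after = train_step("need", "people")
print(before, "->", after)               # the probability of "people" increases
```

In the real training run, the loop covers every word pair extracted from all transcripts, not just one pair.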

100 dimensions as 100 axes

The transcripts contained over 90,000 individual words. For the following tools, I only considered words that were used more than 50 times.

I now have a dataset of about 7,500 words – each represented by a 100-dimensional vector. I can't visualize multidimensional space. So why not show every axis separately?

100D vectors
When I started this project, I did not know how word embeddings work. I thought that every axis would have a specific meaning. If a human were tasked with arranging words along multiple axes, he or she might define one axis for the grammatical nature of a word, one for its positive or negative connotation, one that divides male and female words, and so forth. Because of the complexity of words and language, that would be a nearly impossible and extremely time-consuming task.
An algorithm is much faster and can handle large amounts of data. But it also cannot "understand" language as humans do. As a result, a specific axis of the coordinate system has no meaning on its own. You can only look at relationships between words by comparing their multidimensional positions relative to each other. That is why this depiction is not sufficient for understanding word embeddings. It also shows that humans sometimes expect way too much from AI.


The Chinese Room Argument

Let's pretend a person is sitting in a sealed room. There is a book written in a foreign language, with rules demonstrating how that language operates. Through a slot, the person receives written tasks and questions in this language and has to answer them without knowing what they mean. Of course, the first few answers are a result of guesswork. The only feedback is whether an answer was correct or not. Over time, the person will have learned what to write for a specific task. But she still won't have learned the language.
This thought experiment is known as the Chinese Room Argument by John Searle. It argues that an algorithm can only be trained to look as if it understands human language – but only from the outside. It is actually trained like a dog that learns how to react to commands. An AI that communicates with you doesn't necessarily understand what you are saying. It would be wrong to call it intelligent.

Cosine Similarity

How can we compare two vectors as a whole? Cosine similarity looks at the angle between two vectors, ignoring their lengths. The result is a single number that says how similar the two vectors – and thus the two represented words – are.

Cosine Similarity

What are vectors?

Vectors can be described by the position of a data point on each axis, or by their length and their angle in the coordinate system. They describe a direction from point A to point B.
The result of a cosine similarity calculation lies between -1 and +1. A similarity of 1 means the vectors point in exactly the same direction; -1 means they point in opposite directions. A lower number means lower similarity. With this method, you can compare any two vectors with each other.
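The calculation itself is short: the dot product of the two vectors, divided by the product of their lengths. A minimal sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    # -> a number between -1 and +1 that depends only on the angle
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            # identical -> 1.0
print(cosine_similarity(a, -a))                           # opposite -> -1.0
print(cosine_similarity(a, np.array([-3.0, 0.0, 1.0])))   # perpendicular -> 0.0
```

Because the lengths are divided out, only the direction of the vectors matters – which is exactly why it works well for comparing word vectors of different magnitudes.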

What does similarity mean in this context?

Words can be similar to each other in different ways. They might be synonyms. They could both be nouns. Maybe they are used within a thematic group or specific context, for example words referring to time. In my dataset, the most similar words to "week" are "month" and "year". The most similar word to the plural "weeks" is "months".
By searching for the most similar words, you can also guess in which grammatical context a word is usually used. "aim" and "love" can both be a verb or a noun, but one of them is most similar to other nouns, the other to verbs.
It is interesting that "good" and "bad" have a high cosine similarity. They are opposites, after all. But the word embeddings are trained by looking at a word's context words in a sentence, and "good" and "bad" are frequently used in similar ways, e.g. "a good idea", "a bad idea". That is why they have similar vectors in multidimensional space. As antonyms, those words do have a close relationship – but cosine similarity cannot determine the nature of a word relationship.
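Finding the "most similar word" then just means computing the cosine similarity against every other word and keeping the best scores. Here is a toy sketch – the vectors and the tiny four-word table are made up for illustration; the real dataset has about 7,500 words with 100 dimensions each:

```python
import numpy as np

# Hypothetical mini-embedding table (invented values, 3 dimensions)
embeddings = {
    "week":  np.array([0.9, 0.1, 0.0]),
    "month": np.array([0.8, 0.2, 0.1]),
    "year":  np.array([0.7, 0.3, 0.1]),
    "idea":  np.array([0.1, 0.9, 0.2]),
}

def most_similar(word, k=2):
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    v = embeddings[word]
    # score every other word, highest cosine similarity first
    scored = [(other, cos(v, u)) for other, u in embeddings.items() if other != word]
    return sorted(scored, key=lambda p: -p[1])[:k]

print(most_similar("week"))   # "month" and "year" come out on top
```

With the real TED Talk embeddings, the same loop is what produces neighbours like "month" and "year" for "week".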


Using cosine similarity, this tool searches for the most similar word in the dataset within a specific grammatical group, e.g. adjectives. In this way, the expression changes slightly with every click of a button.

Changing Expressions
The colored rectangles show the 100-dimensional coordinates of the word. Size refers to the magnitude of each number, color shows whether it is positive or negative.

Don't forget that antonyms are also considered similar. This means that the expression can also turn into its opposite in one step.

TED Tags

Since my word embeddings were trained on TED Talks, I also have information about the talks. Each of them has tags referring to its theme and content, e.g. "education" or "technology". I assumed that many of those terms could also be found in my dataset of words. If so, I could create a subset of only TED tags and compare their vectors to each other.

After doing so, I experimented with finding the most similar TED tag for each word of my whole dataset using cosine similarity. Maybe that’s a way of automatically sorting words by theme?

TED tags
My subset of TED tags contains 266 words that refer to the context of a talk. Here they are sorted by cosine similarity. Using color, they can be roughly grouped into clusters, though with fluid borders.
Now, each of the 7,500 word vectors is compared to the subset of 266 TED tags, and the most similar tag is chosen. For some words, that works really well. For example, "imagination" is most similar to the TED tag "creativity".
In the final depiction, words are sorted by their frequency, colored by their most similar TED tag and assigned a symbol for their most frequent grammatical usage.
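The tag assignment described above boils down to one cosine comparison per word against the tag subset. A minimal sketch, with invented two-dimensional vectors standing in for the real 100-dimensional ones:

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors; the real subsets are 266 tags and ~7,500 words
tags = {"creativity": np.array([0.9, 0.1]), "science": np.array([0.1, 0.9])}
words = {"imagination": np.array([0.8, 0.3]), "biology": np.array([0.2, 0.95])}

# For every word, pick the TED tag with the highest cosine similarity
best_tag = {w: max(tags, key=lambda t: cos(v, tags[t])) for w, v in words.items()}
print(best_tag)   # {'imagination': 'creativity', 'biology': 'science'}
```

This is the sense in which the tool "automatically sorts words by theme": each word simply inherits the theme of its nearest tag vector.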

Word Pairs

Until now, I have only looked at the closeness of data points in multidimensional space. "Similarity" can mean a lot of things in this context. I would love to look at specific word relationships.

"good / bad", "happy / unhappy" and "safe / dangerous" are word pairs with a similar relationship between the two words. The same goes for "say / said" and "go / went". By finding examples like that, I could calculate how to get from one word to its counterpart for a specific word relationship, which could be contextual or grammatical in nature.

Word pairs

Vector operations

We have already learned that vectors can define how to get from point A to point B in a coordinate system. You can also calculate a vector between two data points in word embeddings, for example from the data point "woman" to "man". This vector is then called "woman → man" and should be similar to the vector "girl → boy", because both of them refer to the relationship of female and male word pairs. In that way, vectors can represent a specific word relationship.
You can add such a vector to another data point. If you want to find the male equivalent of the word "queen", you can add "woman → man" to it. You won't land exactly on the data point you were looking for, but very close to it.
The interactive data visualization uses this concept to explore the TED Talk word embeddings. You can enter examples of a chosen word-pair relationship; the average vector from word A to word B is calculated and added to individual words.
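The "queen" example can be sketched in a few lines. The two-dimensional vectors here are invented for illustration; the real tool also averages the relationship vector over several example pairs instead of using just one:

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical toy vectors; the real dataset uses 100 dimensions
emb = {
    "woman": np.array([0.8, 0.1]),
    "man":   np.array([0.1, 0.8]),
    "queen": np.array([1.0, 0.3]),
    "king":  np.array([0.35, 0.95]),
}

# The relationship vector "woman -> man"
direction = emb["man"] - emb["woman"]

# Add it to "queen", then look for the nearest word (excluding "queen" itself)
target = emb["queen"] + direction
answer = max((w for w in emb if w != "queen"),
             key=lambda w: cos(emb[w], target))
print(answer)   # "king" is closest to the shifted point
```

As the text says, the shifted point does not land exactly on "king" – it only lands closest to it, which is why the nearest-neighbour search at the end is needed.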


Closing statement

AI is a fascinating topic that will constantly change and advance. But as it does, it becomes more and more complex, and difficult to understand without a technical background. That is why people get scared and might think that AI could develop its own "will". I think the greatest strength of AI is identifying patterns of everyday life by analyzing big piles of data. It can give you a different perspective and make your life easier. That is a powerful tool (which could also be used with bad intentions, of course). But a machine is not "intelligent". It can't "think" or "understand". Machine learning and human learning are two very different things. It's important to keep that in mind while talking to Alexa or ChatGPT.

Whatever your background is, I hope you enjoyed my journey exploring the functionalities of word embeddings. If you have questions or feedback, please get in touch!

Don't forget to test the interactive tools for yourself. You will find all five links at the top of this page.

Contact Info

Anna-Lena Keith
@ FH Aachen – University of Applied Sciences | design
Instagram: _annalena_designprojekte
Portfolio (German)


Exhibition @ FH Aachen in February 2023


Bachelor project as booklet
