Exploring a new way to compare storytelling in the classics by treating books as changing signals in a word embedding space and comparing them with dynamic time warping.
If you are a devoted reader of a particular author, you can immediately tell whether a given (unknown) text was written by them or not. Apart from idiosyncrasies and diction, the flow and development of the text is particularly unique to an author. You can actually visualize a curve of rising and falling emotional cues like thrill, optimism, and confusion in their works. Treating these works as signals should, in principle, allow us to compare writing styles. This post tries to do exactly that, using a technique called dynamic time warping on signals derived from word vectors of classic literary works.
1. Text to Signal
There are many ways to see text as a time series. I will use a pretty basic and intuitive technique with word embeddings.
1.1. Word Embeddings
Word embeddings project words from a language into some \(N\)-dimensional mathematical space. In simple terms, an embedding model provides a vector for each word it has seen, learned from the contexts in which the word appears. A popular method, word2vec, uses the Continuous Bag of Words and Skip-gram architectures and has a Python implementation in gensim (Word2Vec).
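As a rough sketch of what training such a model looks like with gensim (the two-sentence corpus here is a stand-in; in practice you would feed it every tokenized sentence from the books):

```python
from gensim.models import Word2Vec

# A toy corpus: in practice, each item is a tokenized sentence from a book.
sentences = [
    ["holmes", "examined", "the", "letter", "carefully"],
    ["the", "letter", "carried", "a", "faint", "scent", "of", "mystery"],
]

# Train a 100-dimensional skip-gram model (sg=1; sg=0 would use CBOW).
# Note: older gensim versions call this argument `size` instead of `vector_size`.
model = Word2Vec(sentences, vector_size=100, sg=1, window=5, min_count=1)

vector = model.wv["mystery"]   # the 100-dimensional embedding for "mystery"
print(vector.shape)            # (100,)
```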
An excellent primer on the topic is on Christopher Olah's blog here. This vector representation does two things for us:
- Gives us something much more amenable to mathematical analysis: numbers.
- Arranges the words in vector space according to their semantic meaning.
A sample of 1000 words from a 100-dimensional word space (t-SNE'd down to 2 dimensions) after training on 1000 Project Gutenberg ebooks is shown below. Although the words look mostly archaic, look out for nearby words with semantic connections (hover on the dots).
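For reference, a minimal sketch of such a projection with scikit-learn's TSNE, assuming the `model` from the previous snippet (`index_to_key` is the gensim 4.x vocabulary attribute; older versions use `index2word`):

```python
import numpy as np
from sklearn.manifold import TSNE

# Take a sample of 1000 vocabulary words and their 100-dimensional vectors.
words = model.wv.index_to_key[:1000]
vectors = np.array([model.wv[w] for w in words])

# Project the vectors down to 2 dimensions for plotting.
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)
```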
1.2. Cramming text
After training an embedding model (with 100 dimensions) and computing vectors for each word in a book, we are left with a matrix of size \(N_w \times 100\), where \(N_w\) is the number of words. A simple way to reduce the rows is to take sentence vectors by averaging the word vectors in each sentence (we could have used Doc2Vec here, but let's go with this), which gives an \(N_s \times 100\) matrix, where \(N_s\) is the number of sentences. To reduce the 100 columns, we could anchor on a few fixed word vectors like romance and mystery and compute the cosine distance of each row from these anchor points. But since a handful of hand-picked words might cover the word space poorly, a better way is to create anchor points in the number space directly.
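The averaging step could look like this (a sketch: `sentence_vectors` is a hypothetical helper, and `model` is the gensim model from earlier):

```python
import numpy as np

def sentence_vectors(sentences, model):
    """Average the word vectors in each sentence, skipping unknown words."""
    vecs = []
    for sent in sentences:
        known = [model.wv[w] for w in sent if w in model.wv]
        if known:
            vecs.append(np.mean(known, axis=0))
    return np.array(vecs)  # shape: (N_s, 100)
```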
A simple K-means clustering with 4 centers provides the anchor points, and now each book's matrix is of size \(N_s \times 4\). Below is the graph for The Sign of the Four by Arthur Conan Doyle.
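A minimal sketch of this step with scikit-learn (shown per book for simplicity; presumably the same anchors would have to be shared across books for the resulting signals to be comparable):

```python
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

# `sent_vecs` is the (N_s, 100) matrix of sentence vectors for one book.
kmeans = KMeans(n_clusters=4, random_state=0).fit(sent_vecs)
anchors = kmeans.cluster_centers_            # 4 anchor points in the word space

# Each sentence becomes its cosine distance to each of the 4 anchors.
signals = cosine_distances(sent_vecs, anchors)   # shape: (N_s, 4)
```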
The smooth lines are generated with a Gaussian filter, and they more or less capture the essence of the flow.
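Something along these lines, with an arbitrary smoothing width:

```python
from scipy.ndimage import gaussian_filter1d

# Smooth each of the 4 signals along the sentence axis (sigma is a free choice).
smooth = gaussian_filter1d(signals, sigma=10, axis=0)
```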
2. Comparing signals
Once we have a set of comparable signals for each book, the next step is the actual comparison. A simple option would be to extract features from these time series, or to directly apply techniques that learn from the series. But let's try something more crude and direct.
In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences which may vary in time or speed.
In short, this is exactly what we want. DTW has a simple working principle and is invariant to the lengths and positions of peaks and valleys in the signals. The pseudocode on Wikipedia should get you started.
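A direct translation of that pseudocode into Python might look like this (the basic \(O(nm)\) version, with no window constraint):

```python
import numpy as np

def dtw_distance(s, t):
    """Dynamic time warping distance between two 1-D sequences,
    following the classic pseudocode on Wikipedia."""
    n, m = len(s), len(t)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j],      # insertion
                                   dtw[i, j - 1],      # deletion
                                   dtw[i - 1, j - 1])  # match
    return dtw[n, m]
```

For two books, one option is to run this on each of the 4 anchor signals and sum the results, giving a single pairwise distance per book pair.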
2.1. Distances to coordinates
Once we have the pairwise distances, we can turn them into 2D Cartesian coordinates using Multidimensional Scaling (MDS). Although we won't talk about it here, MDS is a cool thing to look into for data visualization.
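With scikit-learn, this step is short (a sketch: `dist` is assumed to be the symmetric matrix of pairwise DTW distances between books):

```python
from sklearn.manifold import MDS

# Embed the precomputed DTW distance matrix into 2 dimensions.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)   # one (x, y) point per book
```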
Anyway, here is the scatterplot of 24 classic books.
Jane Austen gets a personal space of her own. Mark Twain looks versatile.
One thing that personally worries me while reading fiction is its easy predictability, not of facts (which doesn't hurt), but of what will happen next in a global emotional context. Reading The Lost Symbol made me feel like I was reading Digital Fortress all over again. Hopefully, that won't happen again.