exploration ml

While sweeping through my old notes, I found an entry with one line of text: blurring for text, like we do for images. I should have written something more, but I think the idea was to have an operation similar to blurring for text and see what it might mean. The general theme in image blurring is something along the lines of 'bringing ambiguity in the roles of the constituents', i.e., the pixels. This, if continued, makes it hard to perceive the semantics of the image.

burrito.png
Figure 1: A burrito fading into meaninglessness

The important point though is that even with the blurred pixels we still try to get some projection into the semantics. Let's try to do the same for text.

Here is the text that I picked for an example, randomly1. This is (are) first (few) line(s) from the wikipedia page on Animal after removing punctuation.

Animals are multicellular eukaryotic organisms that form the biological kingdom
Animalia With few exception animals consume organic material breathe oxygen are
able to move reproduce sexually and grow from a hollow sphere of cells the
blastula during embryonic development

1. Attempt #1

Using the usual weighted sum of neighbour approach, we can try applying a one dimensional kernel to word vectors which should give smoothened vector for each word.

I am taking \(w = 3\) and my kernel looks like \([\frac{\alpha}{2}, (1 - \alpha), \frac{\alpha}{2}]\). For values of \(\alpha\) like \(0.0, 0.1, 0.2, \ldots 1.0\), the next few paragraphs show the smoothened outputs. Note that I did not touch the edges so the first and last words are going to be intact.

If you have JavaScript enabled, you can see a bit more by hovering over the words. The face color of each word 2 defines the closeness with smoothened vector at that place. This is important since we are trying to impose meaning by projecting the vector on symbols where there might be none really. On hovering you can see top 3 closest projections along with cosine similarity with the smooth vector.

Also, yellow highlights word positions and underline highlights same words across all the cases.

\(\alpha = 0.0\)

Animals are multicellular eukaryotic organisms that form the biological kingdom
Animalia With few exception animals consume organic material breathe oxygen are
able to move reproduce sexually and grow from a hollow sphere of cells the
blastula during embryonic development

\(\alpha = 0.1\)

Animals are multicellular eukaryotic organisms that form the biological kingdom
Animalia With few exception animals consume organic material breathe oxygen are
able to move reproduce sexually and grow from a hollow sphere of cells the
blastula during embryonic development

\(\alpha = 0.2\)

Animals are multicellular eukaryotic organisms that form the biological kingdom
Animalia With few exception animals consume organic material breathe oxygen are
able to move reproduce sexually and grow from a hollow sphere of cells the
blastula during embryonic development

\(\alpha = 0.3\)

Animals are multicellular eukaryotic organisms that form the biological kingdom
Animalia With few exception animals consume organic material breathe oxygen are
able to move reproduce sexually and grow from a hollow sphere of cells the
blastula during embryonic development

\(\alpha = 0.4\)

Animals are multicellular eukaryotic organisms that form the biological kingdom
Animalia With few few animals consume organic material breathe oxygen are able
to move reproduce sexually and grow a a a sphere of cells the blastula during
embryonic development

\(\alpha = 0.5\)

Animals are are eukaryotic organisms that form the the kingdom Animalia With few
few animals consume organic material breathe oxygen are able to move reproduce
sexually and grow a a a of of of the the during embryonic development

\(\alpha = 0.6\)

Animals are are eukaryotic organisms that the the the kingdom Animalia With few
few animals consume organic organic oxygen are are to to to reproduce sexually
and grow a a a of of of the the during embryonic development

\(\alpha = 0.7\)

Animals are are multicellular that form the the the kingdom With few few few
consume animals organic organic oxygen are are to to to move and grow and a a a
of of of cells the embryonic during development

Next few \(\alpha\) values are kind of different since the center vector has lesser weights than neighbours.

\(\alpha = 0.8\)

Animals Animals are organisms that form the form the Animalia With few With few
consume animals consume organic oxygen are able to to to move and grow and a a a
of cells of cells the embryonic during development

\(\alpha = 0.9\)

Animals Animals are organisms that form the form the Animalia With few With few
consume organic consume breathe oxygen are able to able to move and grow and a
from a of cells of cells the embryonic during development

\(\alpha = 1.0\)

Animals Animals are organisms that form the form the Animalia With few With few
consume organic consume breathe oxygen are able to able to move and grow from a
from a of cells of cells the embryonic during development

Other than virtually nothing, you can notice a few specific words going away, like the add-ons in burrito. Nothing interesting happens as far as new words' introduction is concerned. A few words like of, a, few etc. are easily repeated and look pretty resilient and imposing (?) in a certain sense.

Much of these can be directly attributed to training corpus. For example, if the weighted average turns out to be around something else, you will start getting new words. Better question probably is if similar degradations (which looks similar to the case with image) are to be found across training sets/methods.

2. Attempt #2

I wonder how these mixings in spaces can be connected with our understanding of semantics. Working in the vector space is all nice and good—in fact smoothing with an all \(1/|k|\) kernel is what people use for basic document vectorization—but what about their projections into symbol space? A few cases like analogies are commonly understood, but there are probably other interesting cases coming from the side of human usage of language.

This mixing of representations is not very uncommon, both computationally (of course) and in regular usage. Look around various kinds of speech disfluencies 3. Freudian slips?

Another interesting piece is the progression of semantic degradations as we push mixing in certain direction. What are the washed out colors in the last blurred burritos image? Do we have an equivalent in higher order constituents like words?

I will probably not touch this exact thread, but hope to get on something similar in some time.


Some notes on the vectors used for generating this. I had the wiki-news-300d-1M-subword.vec.zip from fastext lying around so loaded that up in gensim. The top-3 word lookup was done using gensim's default implementation.

Footnotes:

1

This might be a decent sample since there are normal, as well as niche words in the text.

2

This is a bit hand wavy specially since I cherry picked the color scale; Check out the distances for something more meaningful

3

This is a stretch as far as mixing is concerned, but there probably are things here that can be modelled in our computational vector spaces