Original article: Identifying the right meaning of the words using BERT

Glossary

Uncased
    [model]

    The text has been lowercased before
    WordPiece tokenization, e.g., John Smith
    becomes john smith.

    The Uncased model also strips out any
    accent markers.

Cased
    [model]

    The true case and accent markers are
    preserved.

    Typically, the Uncased model is better
    unless you know that case information is
    important for your task (e.g., Named
    Entity Recognition or Part-of-Speech
    tagging).
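
To make the Uncased/Cased distinction concrete, here is a minimal sketch of the two tokenizers' behaviour. It assumes the Hugging Face transformers package, which is not part of the original article's apparatus; the example sentence is illustrative only.

    # Sketch only: the original article uses the google-research/bert code,
    # not the transformers package assumed here.
    from transformers import BertTokenizer

    uncased = BertTokenizer.from_pretrained("bert-base-uncased")
    cased = BertTokenizer.from_pretrained("bert-base-cased")

    text = "John Smith ordered a crème brûlée."

    # Uncased: text is lowercased and accent markers are stripped
    # before WordPiece tokenization (John Smith -> john smith).
    print(uncased.tokenize(text))

    # Cased: true case and accent markers are preserved.
    print(cased.tokenize(text))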

Apparatus

  • dataset: sentences including ‘duck’ (https://sentence.yourdictionary.com/duck)
  • embeddings: original BERT base uncased model (https://github.com/google-research/bert)
  • algorithm: PCA (https://tkv.io/posts/tutorial-on-pca/)

Hypothesis

Using the context can solve the problem of multiple-meaning words (homonyms and homographs) being collapsed into the same embedding vector.

Aim

To show that contextualised word embeddings solve this problem.

Questions

  • Can BERT embeddings be used to classify the different meanings of a word?
  • Can we separate the different meanings of ‘duck’ using these 768-dimensional vectors?

Method

  • Generate a contextualised embedding vector for every word, conditioned on the sentence it appears in, and keep only the embedding of the ‘duck’ token (see the sketch below).
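
A minimal sketch of this step, assuming the Hugging Face transformers port of BERT base uncased rather than the original google-research/bert code used in the article; the sentences and the duck_embedding helper are illustrative only.

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    def duck_embedding(sentence: str) -> torch.Tensor:
        """Return the 768-dimensional contextual embedding of the 'duck' token."""
        encoded = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            output = model(**encoded)
        hidden = output.last_hidden_state[0]  # shape: (sequence length, 768)
        tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())
        # 'duck' is a single WordPiece in the uncased vocabulary, so one row suffices.
        return hidden[tokens.index("duck")]

    # Same surface form, two different senses: the contextual vectors differ.
    v1 = duck_embedding("The duck swam across the pond.")
    v2 = duck_embedding("He had to duck to avoid the low beam.")
    print(torch.cosine_similarity(v1, v2, dim=0))

If ‘duck’ were split into several WordPieces, a common choice would be to average the pieces' vectors; here the word survives tokenization intact, so a single row of the final hidden layer is enough.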