The Illustrated Word2Vec (2019)

(jalammar.github.io)

180 points | by wcedmisten 14 days ago

4 comments

  • VHRanger 13 days ago
    This is a great guide.

    Also - despite the fact that language model embeddings [1] are currently all the rage, good old embedding models are more than good enough for most tasks.

    With just a bit of tuning, they're generally as good at many sentence embedding tasks [2], and with good libraries [3] you're getting something like 400k sentences/sec on a laptop CPU versus ~4k-15k sentences/sec on a V100 for LM embeddings.
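    To make the classic approach concrete, here's a toy sketch of the simplest sentence-embedding baseline (roughly what [2] refines): just average the word vectors. The 3-D vectors below are made up for illustration; a real system would load pretrained ~300-dimensional word2vec/GloVe vectors.

```python
# Toy sketch: a sentence embedding as the average of its word vectors.
# These vectors are invented for illustration, not trained.
word_vecs = {
    "good": [0.9, 0.1, 0.0],
    "old": [0.2, 0.7, 0.1],
    "embedding": [0.1, 0.2, 0.9],
}

def sentence_embedding(tokens):
    # Average the vectors of the in-vocabulary tokens, dimension-wise.
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

emb = sentence_embedding(["good", "old", "embedding"])
```

    The real baseline adds tricks like frequency weighting, but the core operation is this cheap, which is where the huge CPU throughput comes from.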

    When you should use language model embeddings:

    - Multilingual tasks. While some embedding models are multilingually aligned (eg. MUSE [4]), you still need to route the sentence to the correct embedding model file (you need something like langdetect). It's also cumbersome, with one ~400MB file per language.

    For LM embedding models, many are multilingual aligned right away.

    - Tasks that are very context specific or require fine-tuning. For instance, if you're making a RAG system for medical documents, the embedding space is best when it creates larger deviations for the difference between seemingly-related medical words.

    This means models with more embedding dimensions, and heavily favors LM models over classic embedding models.

    1. sbert.net

    2. https://collaborate.princeton.edu/en/publications/a-simple-b...

    3. https://github.com/oborchers/Fast_Sentence_Embeddings

    4. https://github.com/facebookresearch/MUSE

    • iman453 13 days ago
      Could you explain what you mean by

      > Tasks that are very context specific or require fine-tuning. For instance, if you're making a RAG system for medical documents, the embedding space is best when it creates larger deviations for the difference between seemingly-related medical words.

      (sorry I'm very new to ML stuff :))

      • SgtBastard 13 days ago
        Not the person you’re replying to, but:

        Foundational models (GPT-4, Llama 3, etc.) effectively compress “some” human knowledge into their neural network weights so that they can generate outputs from inputs.

        However, they can’t compress ALL human knowledge, for time and cost reasons, but also because not all knowledge is publicly available (it’s either personal information such as your medical records, or otherwise proprietary).

        So we build Retrieval-Augmented Generation (RAG) systems, where we retrieve additional knowledge that the model wouldn’t know about to help answer the query.

        We found early on that LLMs are very effective at in-context learning (look at one-shot and few-shot learning), so if you can include the right reference material and/or private information, the foundational models can demonstrate that they’ve “learnt” something and answer far more effectively.

        The challenge is: how do you find the right content to pass to the foundational model? One very effective way is to use vector search, which basically means:

        Pass your query to an embedding model, get a vector back. Then use that vector to perform a cosine-similarity search on all of the private data you have, that you’ve previously also generated an embedding vector for.
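        That retrieval step can be sketched in a few lines. The 2-D vectors and document names here are invented stand-ins for real embedding-model output:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Pretend these vectors were produced earlier by embedding your private docs.
doc_vectors = {
    "college_doc": [0.9, 0.1],
    "universe_doc": [0.1, 0.9],
}

query_vec = [0.8, 0.2]  # embedding of a "university"-flavoured query

# Retrieve the document whose vector is closest to the query's.
best = max(doc_vectors, key=lambda d: cosine(query_vec, doc_vectors[d]))
```

        Real systems use an approximate nearest-neighbour index instead of scanning every vector, but the similarity measure is the same.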

        The closest vectors are likely to be the most similar (and relevant), provided the embedding model is able to generate very different vectors for sources that cover superficially related topics but are actually very, very different.

        A good embedding model returns very different vectors for “University” and “Universe”, but similar ones for “University” and “College”.

      • tanananinena 13 days ago
        Classical word embeddings are static - their value doesn't change depending on the context they appear in.

        You can think of the word embedding as a weighted average of embeddings of words which co-occur with the initial word.

        So it's a bit of a blurry meaning.

        Is "bark" related to a dog? Or to a tree?

        Well, a bit of both, really. The embedding doesn't care about the context of the word - once it's been trained.

        So if you search for related documents based on word embeddings of your query - it can happen that you miss the mark. The embeddings simply don't encode the semantics you need.

        In fact, this can happen even with contextual embeddings when you look for something specific or in a specialized domain. With word embeddings it's just much more apparent.
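      The static-lookup point can be sketched like this. The 2-D vectors are made up (axis 0 as "dog-ness", axis 1 as "tree-ness"); a real table would come from training word2vec:

```python
# A static embedding is just a fixed lookup table: the same vector
# comes back for "bark" no matter the surrounding sentence.
embeddings = {
    "dog": [1.0, 0.0],
    "tree": [0.0, 1.0],
    "bark": [0.6, 0.4],  # a blurry average of both senses
}

def embed(word, context=None):
    # context is accepted but ignored -- that's what "static" means
    return embeddings[word]

same = embed("bark", context="the dog began to") == embed("bark", context="rough tree")
```

      A contextual model would instead produce different vectors for "bark" in those two sentences.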

  • kinow 12 days ago
    In case anyone is interested in how the author creates the illustrations, here's his video "My visualization tools (my Apple Keynote setup for visualizations and animations)": https://www.youtube.com/watch?v=gSPRxJLxIHA
  • dang 13 days ago
    Discussed at the time:

    The Illustrated Word2vec - https://news.ycombinator.com/item?id=19498356 - March 2019 (37 comments)

  • russfink 13 days ago
    “Embedding” —> representation(?)

    I do not think that word means what *I* think it means.

    • epistasis 13 days ago
      That is essentially correct. You take an object and "embed" it in a high-dimensional vector space to represent it.

      For a deep dive, I highly recommend Vicki Boykis's free materials:

      https://vickiboykis.com/what_are_embeddings/

      • mercurybee 13 days ago
        It's more common to refer to embeddings as low-dimensional.
        • epistasis 12 days ago
          Can you give an example of that?

          I've rarely seen embeddings with fewer than hundreds of dimensions.

          UMAP/t-SNE are dimensionality reduction techniques whose outputs could maybe be considered embeddings, but I haven't encountered that usage in anything relating to word2vec, LLMs, or much of the current AI fashion.

          • nmfisher 12 days ago
            If your vocabulary size is 10000, then that's also your initial one-hot "dimension". Projecting down to 512 floats (or whatever) is, relatively speaking, low dimension.
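            In code, that projection is just a table lookup. A sketch with the sizes from this thread (random weights standing in for trained ones):

```python
import random

# Hypothetical sizes from the discussion: 10k vocabulary, 512-dim embedding.
vocab_size, emb_dim = 10_000, 512
random.seed(0)

# The embedding matrix has one emb_dim-long row per vocabulary entry.
E = [[random.gauss(0.0, 0.01) for _ in range(emb_dim)] for _ in range(vocab_size)]

# Multiplying a one-hot vector by E selects exactly one row, so the
# "projection" from 10,000 dims down to 512 is implemented as a lookup.
token_id = 42
embedding = E[token_id]
```

            No actual matrix multiply happens in practice; frameworks implement the one-hot product as an indexed row fetch.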
            • epistasis 12 days ago
              Ah, I see. I understand, but I haven't encountered the terminology used that way. I tend to think of a one-hot encoding as not really a vector space, however, since the information is actually only lg2(10000) bits per token. The embedding vector of 512 (or however many) floats, by contrast, includes lots of information from positionally related tokens, so it carries quite a bit more than lg2(vocab).
              • karma_pharmer 12 days ago
                > I tend to think of one-hot encoding as not a vector space

                But it is a vector space. Technically the one-hot elements are the basis of the vector space, whose elements are weighted lists of words.

                And it is the vector space with the highest useful dimension for embedding that vocabulary. Seriously, what are you going to do with the extra axes if you're representing a 10,000-word vocabulary using 1,000,000-element vectors? Those extra axes simply go to waste. In the one-hot case the only thing wasted is the magnitudes.

                This is why word2vec embeddings are considered lower-dimensional embeddings. They're lower-dimension than the most-naive-but-not-wasteful representation.

                • epistasis 11 days ago
                  But when does anyone actually use a one-hot encoding as a vector space? They are used as a lookup table into a different representation, and the one-hot encoding is there to simplify the mathematical description, not to be used as a vector space. Or have you experienced systems where people actually use one-hot encodings?

                  While I would agree that technically, in terms of raw dimensionality, a one-hot encoding has a larger dimension than the embedding, that makes the embedding "lower" dimensionality, not "low" dimensionality. Who uses "low" as an absolute term when referring to 300-2000 dimensions?

              • srean 11 days ago
                Thought of letting this one go, but then changed my mind.

                > I tend to think

                > I haven't encountered

                Egocentric views of things are far less notable and far less interesting than what those things actually are. It makes for tiring reading. I hope this does not come off as too testy; my apologies if it does.

                • epistasis 11 days ago
                  When we are talking about two people trying to establish common language, is it not important to talk about what one has seen personally? I didn't do it to boost my ego; very odd to think that it came across that way!
                  • srean 11 days ago
                    >I didn't do it to boost my ego

                    I don't doubt that for a minute, and neither did I intend such an interpretation.

                    Perhaps if I were to debate a well known and well understood concept in epigenetics in terms of how I personally think about it (with not even an iota of intent to boost my ego), my comment above might resonate better. For the purpose of such a hypothetical debate, how I personally think about that concept in epigenetics becomes more of a distraction. Does it not?

                    No offense intended and none taken. I do learn a lot from your comments, especially about the intricacies of biology.

      • coreyp_1 13 days ago
        Quite frankly, it's stuff like this that makes me love the HN community.

        Thank you for the additional resource!

    • DISCURSIVE 13 days ago
      Yeah, we usually just say vector embeddings are the numerical representation of a piece of unstructured data. This glossary page puts it together quite nicely: https://zilliz.com/glossary/vector-embeddings