LLM2Vec: Text Embedding with Decoder-Only Language Models

Researchers have introduced LLM2Vec, a method that turns decoder-only large language models (LLMs) into strong text embedding models, a task previously dominated by bidirectional models. The approach builds on the strengths of decoder-only architectures instead of replacing them.

Text embedding models encode the semantic content of text into vector representations, which support NLP tasks such as semantic textual similarity and information retrieval. Until recently, the dominant approach was to adapt pre-trained bidirectional encoders or encoder-decoders such as BERT and T5 to the embedding task through multi-step training pipelines.
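To make the idea of comparing texts through their vectors concrete, here is a minimal sketch of cosine similarity, a common way to score how close two embeddings are. The three-dimensional vectors and the names `query`, `doc_a`, and `doc_b` are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not the output of any real model.
query = [0.9, 0.1, 0.0]
docs = {
    "doc_a": [0.8, 0.2, 0.1],  # points in nearly the same direction as the query
    "doc_b": [0.0, 0.1, 0.9],  # points in a very different direction
}

# Rank documents from most to least similar to the query.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query, docs[name]),
    reverse=True,
)
print(ranked)
```

Information retrieval with embeddings follows the same pattern: embed the query and the documents, then return the documents whose vectors score highest against the query.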

However, a recent study shows that decoder-only LLMs, which already perform strongly across NLP tasks, can also serve as text embedding models. The study introduces LLM2Vec, a simple approach that avoids both expensive adaptation and synthetic data generation.

Decoder-only LLMs were slow to be adopted for text embedding partly because of their causal attention mechanism: each token can attend only to the tokens before it, which limits how much of the input sequence its representation can reflect. LLM2Vec overcomes this limitation in three simple steps: enabling bidirectional attention, training with masked next token prediction, and applying unsupervised contrastive learning.
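The first two steps can be illustrated with a small sketch. It contrasts a causal attention mask with a bidirectional one, then shows how a masked next token prediction example might be laid out. Pairing a masked position i with the output at position i - 1 follows the description in the LLM2Vec paper; the function names, the `_` mask token, and the toy sentence are assumptions for illustration, and no real model is involved.

```python
def causal_mask(n):
    """Row i marks the positions token i may attend to: itself and earlier tokens."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every position in the sequence."""
    return [[True for _ in range(n)] for _ in range(n)]


def visible_positions(mask, i):
    """Indices that token i is allowed to attend to under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


def mntp_example(tokens, masked_positions, mask_token="_"):
    """Replace the chosen tokens with a mask token and map each masked
    position i to the output position i - 1 that is used to predict it."""
    inputs = [
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    # Position 0 has no previous position, so it is skipped here.
    targets = {i - 1: tokens[i] for i in sorted(masked_positions) if i > 0}
    return inputs, targets


tokens = ["the", "cat", "sat", "down"]

# Under causal attention the first token sees only itself.
print(visible_positions(causal_mask(4), 0))         # [0]
# Under bidirectional attention it sees the whole sequence.
print(visible_positions(bidirectional_mask(4), 0))  # [0, 1, 2, 3]

inputs, targets = mntp_example(tokens, {2})
print(inputs)   # ['the', 'cat', '_', 'down']
print(targets)  # {1: 'sat'}
```

Keeping the prediction at position i - 1 stays close to the next-token objective the model was pre-trained with, while the bidirectional mask lets that prediction draw on tokens to the right as well.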

The authors demonstrated LLM2Vec on LLMs ranging from 1.3 billion to 7 billion parameters. The transformed models outperformed encoder-only models by a large margin on word-level tasks and set a new state of the art for unsupervised models on the Massive Text Embeddings Benchmark (MTEB).

Furthermore, combining LLM2Vec with supervised contrastive learning achieved state-of-the-art results on MTEB among models trained only on publicly available data, without relying on proprietary or synthetic data sources.
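Contrastive learning trains a model so that each text's embedding lies closer to its positive pair than to the other texts in the batch. The following is a minimal sketch of a generic InfoNCE-style loss with in-batch negatives, not the authors' exact training objective. The function names, the temperature of 0.05, and the two-dimensional toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, positives, temperature=0.05):
    """Average cross-entropy where positives[i] is the correct match for
    queries[i] and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Hypothetical toy embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong positive

aligned_loss = info_nce_loss(queries, aligned)
swapped_loss = info_nce_loss(queries, swapped)
print(aligned_loss < swapped_loss)  # True
```

In the supervised setting the positives come from labeled pairs, such as a query and a relevant passage. In a SimCSE-style unsupervised setting, the same text is encoded twice with different dropout masks, and the two resulting embeddings form the positive pair.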

The work highlights the untapped potential of decoder-only LLMs for text embedding and offers a parameter-efficient way to turn these generative models into universal text encoders. If the results carry over to other models and tasks, LLM2Vec could improve both efficiency and performance across many NLP applications.

Download paper: https://arxiv.org/pdf/2404.05961v1.pdf
