Natural Language Processing from Introduction to Application: LangChain Models (Text Embedding Models: Embaas, Fake Embeddings, Google Vertex AI PaLM, and More)


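This article describes how to use the Embeddings class in LangChain. The Embeddings class is designed for interacting with embedding models. There are many embedding providers, such as OpenAI, Cohere, and Hugging Face, and this class provides a standard interface for all of them.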

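Embeddings create a vector representation of a piece of text. This is useful because it lets us work with text in vector space and perform operations such as semantic search. The base Embeddings class in LangChain exposes two methods:

embed_documents: works over multiple documents
embed_query: works over a single text (the query)

The two methods have different interfaces: one takes a list of texts, the other a single text. Another reason for keeping them separate is that some embedding providers use different embedding methods for the documents to be searched and for the query (the search query itself). The sections below show integration examples for text embeddings.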
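
As a rough illustration of this two-method interface, the sketch below defines a toy embedder in plain Python. It does not use LangChain: the class name `ToyEmbeddings`, its hashing scheme, and the two text prefixes are hypothetical and exist only to show that documents and queries can take different paths behind one interface.

```python
import hashlib
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in that mimics the embed_documents / embed_query split."""

    def __init__(self, size: int = 8) -> None:
        self.size = size  # number of dimensions per vector (at most 32 here)

    def _embed(self, text: str) -> List[float]:
        # Derive a deterministic pseudo-vector from a SHA-256 digest (not a real model).
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.size]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Documents go through a document-specific path (here: a prefix).
        return [self._embed("document: " + t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        # The query goes through a query-specific path.
        return self._embed("query: " + text)


toy = ToyEmbeddings(size=8)
doc_vectors = toy.embed_documents(["first text", "second text"])
query_vector = toy.embed_query("first text")
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))
```

Because the same text is prefixed differently on each path, the query vector for "first text" differs from its document vector, which is the kind of asymmetry some providers rely on.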


Embaas

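embaas is a fully managed NLP API service that offers features such as embedding generation, document text extraction, and document-to-embedding conversion. A variety of pre-trained models are available to choose from. The following shows how to use the embaas Embeddings API to generate embeddings for a given text: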

import os

# Set API key
embaas_api_key = "YOUR_API_KEY"
# or set environment variable
os.environ["EMBAAS_API_KEY"] = "YOUR_API_KEY"
from langchain.embeddings import EmbaasEmbeddings
embeddings = EmbaasEmbeddings()
# Create embeddings for a single document
doc_text = "This is a test document."
doc_text_embedding = embeddings.embed_query(doc_text)
# Print created embedding
print(doc_text_embedding)
# Create embeddings for multiple documents
doc_texts = ["This is a test document.", "This is another test document."]
doc_texts_embeddings = embeddings.embed_documents(doc_texts)
# Print created embeddings
for i, doc_text_embedding in enumerate(doc_texts_embeddings):
    print(f"Embedding for document {i + 1}: {doc_text_embedding}")
# Using a different model and/or custom instruction
embeddings = EmbaasEmbeddings(model="instructor-large", instruction="Represent the Wikipedia document for retrieval")

Fake Embeddings

LangChain also provides a FakeEmbeddings class, which we can use to test our pipelines:

from langchain.embeddings import FakeEmbeddings
embeddings = FakeEmbeddings(size=1352)
query_result = embeddings.embed_query("foo")
doc_results = embeddings.embed_documents(["foo"])
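
To make the testing use case concrete, the following sketch shows the kind of check a fake embedder enables. It uses a hypothetical standard-library stand-in, `RandomVectorEmbeddings`, rather than LangChain's FakeEmbeddings, and assumes only that a fake embedder returns vectors of the configured size whose values carry no meaning. The helper `check_pipeline_shapes` is likewise illustrative.

```python
import random
from typing import List


class RandomVectorEmbeddings:
    """Hypothetical stand-in for a fake embedder: random vectors of a fixed size."""

    def __init__(self, size: int, seed: int = 0) -> None:
        self.size = size
        self._rng = random.Random(seed)  # seeded so test runs are repeatable

    def _random_vector(self) -> List[float]:
        return [self._rng.gauss(0.0, 1.0) for _ in range(self.size)]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self._random_vector() for _ in texts]

    def embed_query(self, text: str) -> List[float]:
        return self._random_vector()


def check_pipeline_shapes(embedder, texts: List[str], expected_size: int) -> bool:
    """Return True if every vector produced for the pipeline has the expected size."""
    vectors = embedder.embed_documents(texts) + [embedder.embed_query(texts[0])]
    return all(len(vector) == expected_size for vector in vectors)


fake = RandomVectorEmbeddings(size=1352)
shapes_ok = check_pipeline_shapes(fake, ["foo", "bar"], expected_size=1352)
print(shapes_ok)
```

A check like this exercises the plumbing around the embedder (batching, storage, dimensionality) without calling a paid or slow provider.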

Google Vertex AI PaLM

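The Google Vertex AI PaLM integration is separate from the Google PaLM integration. Google has chosen to offer an enterprise version of PaLM through GCP, and this integration supports the models made available there. The PaLM API on Vertex AI is a Preview offering, subject to the Pre-GA Offerings Terms of the GCP Service Specific Terms. Pre-GA products and features may have limited support, and changes to pre-GA products and features may not be compatible with other pre-GA versions. For more information, see the launch stage descriptions. Further, by using the PaLM API on Vertex AI, we agree to the Generative AI Preview terms and conditions (Preview Terms). For the PaLM API on Vertex AI, we can process personal data as outlined in the Cloud Data Processing Addendum, subject to the applicable restrictions and obligations in the Agreement (as defined in the Preview Terms).

To use Vertex AI PaLM, we must have the google-cloud-aiplatform Python package installed and either have credentials configured for our environment or store the path to a service account JSON file in the GOOGLE_APPLICATION_CREDENTIALS environment variable. The code below relies on the google.auth library, which first looks for the application credentials variable mentioned above and then for system-level authentication: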

#!pip install google-cloud-aiplatform

from langchain.embeddings import VertexAIEmbeddings
embeddings = VertexAIEmbeddings()
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text])

Hugging Face Hub

Let's load the Hugging Face Embeddings class:

from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text])

HuggingFace Instruct

Let's load the HuggingFaceInstructEmbeddings class:

from langchain.embeddings import HuggingFaceInstructEmbeddings
embeddings = HuggingFaceInstructEmbeddings(
    query_instruction="Represent the query for retrieval: "
)
# Output printed while the model loads:
# load INSTRUCTOR_Transformer
# max_seq_length  512
text = "This is a test document."
query_result = embeddings.embed_query(text)

Jina

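Let's load the Jina Embeddings class:

from langchain.embeddings import JinaEmbeddings
# Placeholder: replace with your own Jina auth token
jina_auth_token = "YOUR_JINA_AUTH_TOKEN"
embeddings = JinaEmbeddings(jina_auth_token=jina_auth_token, model_name="ViT-B-32::openai")
text = "This is a test document."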
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text])

Llama-cpp

This example shows how to use Llama-cpp embeddings within LangChain:

!pip install llama-cpp-python

from langchain.embeddings import LlamaCppEmbeddings
llama = LlamaCppEmbeddings(model_path="/path/to/model/ggml-model-q4_0.bin")
text = "This is a test document."
query_result = llama.embed_query(text)
doc_result = llama.embed_documents([text])

MiniMax

MiniMax offers an embeddings service. The following example shows how to use LangChain to interact with MiniMax Inference for text embedding:

import os

os.environ["MINIMAX_GROUP_ID"] = "MINIMAX_GROUP_ID"
os.environ["MINIMAX_API_KEY"] = "MINIMAX_API_KEY"
from langchain.embeddings import MiniMaxEmbeddings
embeddings = MiniMaxEmbeddings()
query_text = "This is a test query."
query_result = embeddings.embed_query(query_text)
document_text = "This is a test document."
document_result = embeddings.embed_documents([document_text])
import numpy as np

query_numpy = np.array(query_result)
document_numpy = np.array(document_result[0])
similarity = np.dot(query_numpy, document_numpy) / (np.linalg.norm(query_numpy) * np.linalg.norm(document_numpy))
print(f"Cosine similarity between document and query: {similarity}")

Output:

Cosine similarity between document and query: 0.1573236279277012
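
The similarity expression above is the cosine of the angle between the two vectors, that is, their dot product divided by the product of their Euclidean norms. A minimal standard-library sketch of the same computation follows. The helper name `cosine_similarity` and the sample vectors are illustrative and are not part of LangChain or MiniMax.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way, at right angles, and in opposite directions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate little alignment, and values near -1 indicate opposite directions.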

ModelScope

Let's load the ModelScope Embeddings class:

from langchain.embeddings import ModelScopeEmbeddings
model_id = "damo/nlp_corom_sentence-embedding_english-base"
embeddings = ModelScopeEmbeddings(model_id=model_id)
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_results = embeddings.embed_documents(["foo"])

MosaicML

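MosaicML offers a managed inference service. We can either use a variety of open source models or deploy our own. The following example shows how to use LangChain to interact with MosaicML Inference for text embedding:

# Sign up for an account: https://forms.mosaicml.com/demo?utm_source=langchain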

from getpass import getpass

MOSAICML_API_TOKEN = getpass()
import os

os.environ["MOSAICML_API_TOKEN"] = MOSAICML_API_TOKEN
from langchain.embeddings import MosaicMLInstructorEmbeddings
embeddings = MosaicMLInstructorEmbeddings(
    query_instruction="Represent the query for retrieval: "
)
query_text = "This is a test query."
query_result = embeddings.embed_query(query_text)
document_text = "This is a test document."
document_result = embeddings.embed_documents([document_text])
import numpy as np

query_numpy = np.array(query_result)
document_numpy = np.array(document_result[0])
similarity = np.dot(query_numpy, document_numpy) / (np.linalg.norm(query_numpy)*np.linalg.norm(document_numpy))
print(f"Cosine similarity between document and query: {similarity}")

OpenAI

Let's load the OpenAI Embeddings class:

from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text])

Let's load the OpenAI Embeddings class with first-generation models (for example, text-search-ada-doc-001/text-search-ada-query-001). Note that these models are not recommended.

from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text])
import os

# If you are behind an explicit proxy, you can pass it through the OPENAI_PROXY environment variable
os.environ["OPENAI_PROXY"] = "http://proxy.yourcompany.com:8080"

SageMaker Endpoint

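Let's load the SagemakerEndpointEmbeddings class. This class can be used if we host our own Hugging Face model on SageMaker. Note that in order to handle batched requests, we need to adjust the return line in the predict_fn() function within the custom inference.py script, changing it from return {"vectors": sentence_embeddings[0].tolist()} to return {"vectors": sentence_embeddings.tolist()}: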

!pip3 install langchain boto3

from typing import Dict, List
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
import json

class ContentHandler(ContentHandlerBase):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: List[str], model_kwargs: Dict) -> bytes:
        input_str = json.dumps({"inputs": inputs, **model_kwargs})
        return input_str.encode('utf-8')

    def transform_output(self, output: bytes) -> List[List[float]]:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["vectors"]

content_handler = ContentHandler()

embeddings = SagemakerEndpointEmbeddings(
    # endpoint_name="endpoint-name", 
    # credentials_profile_name="credentials-profile-name", 
    endpoint_name="huggingface-pytorch-inference-2023-03-21-16-14-03-834", 
    region_name="us-east-1", 
    content_handler=content_handler
)
query_result = embeddings.embed_query("foo")
doc_results = embeddings.embed_documents(["foo"])
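
To see why the return line needs to change for batched requests, the sketch below replays the handler logic locally, without SageMaker or LangChain. The function `fake_endpoint`, its two-dimensional dummy vectors, and the use of `io.BytesIO` to imitate a streaming response body are assumptions made purely for illustration.

```python
import io
import json
from typing import Dict, List


def transform_input(inputs: List[str], model_kwargs: Dict) -> bytes:
    # Same request encoding as the ContentHandler above.
    return json.dumps({"inputs": inputs, **model_kwargs}).encode("utf-8")


def fake_endpoint(body: bytes, batched: bool) -> io.BytesIO:
    """Hypothetical endpoint: builds one dummy 2-d vector per input text."""
    texts = json.loads(body.decode("utf-8"))["inputs"]
    sentence_embeddings = [[float(len(t)), 1.0] for t in texts]
    # batched=False imitates `sentence_embeddings[0].tolist()` (first vector only);
    # batched=True imitates `sentence_embeddings.tolist()` (one vector per input).
    vectors = sentence_embeddings if batched else sentence_embeddings[0]
    return io.BytesIO(json.dumps({"vectors": vectors}).encode("utf-8"))


def transform_output(output: io.BytesIO) -> List:
    # Same response decoding as the ContentHandler above.
    return json.loads(output.read().decode("utf-8"))["vectors"]


request = transform_input(["foo", "quux"], {})
single = transform_output(fake_endpoint(request, batched=False))
batch = transform_output(fake_endpoint(request, batched=True))
print(single)
print(batch)
```

With the unbatched return line, two inputs yield a single flat vector, so the caller cannot pair vectors with texts. With the batched line, the response carries one vector per input.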

SelfHostedEmbeddings

Let's load the SelfHostedEmbeddings, SelfHostedHuggingFaceEmbeddings, and SelfHostedHuggingFaceInstructEmbeddings classes:

from langchain.embeddings import (
    SelfHostedEmbeddings,
    SelfHostedHuggingFaceEmbeddings,
    SelfHostedHuggingFaceInstructEmbeddings,
)
import runhouse as rh
# For an on-demand A100 with GCP, Azure, or Lambda
gpu = rh.cluster(name="rh-a10x", instance_type="A100:1", use_spot=False)

# For an on-demand A10G with AWS (no single A100s on AWS)
# gpu = rh.cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws')

# For an existing cluster
# gpu = rh.cluster(ips=['<ip of the cluster>'],
#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
#                  name='my-cluster')
embeddings = SelfHostedHuggingFaceEmbeddings(hardware=gpu)
text = "This is a test document."
query_result = embeddings.embed_query(text)

Similarly, for SelfHostedHuggingFaceInstructEmbeddings:

embeddings = SelfHostedHuggingFaceInstructEmbeddings(hardware=gpu)

Now let's load an embedding model with a custom load function:

def get_pipeline():
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        pipeline,
    )  # Must be inside the function in notebooks

    model_id = "facebook/bart-base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return pipeline("feature-extraction", model=model, tokenizer=tokenizer)


def inference_fn(pipeline, prompt):
    # Return last hidden state of the model
    if isinstance(prompt, list):
        return [emb[0][-1] for emb in pipeline(prompt)]
    return pipeline(prompt)[0][-1]


embeddings = SelfHostedEmbeddings(
    model_load_fn=get_pipeline,
    hardware=gpu,
    model_reqs=["./", "torch", "transformers"],
    inference_fn=inference_fn,
)
query_result = embeddings.embed_query(text)
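
The indexing in inference_fn is easier to follow with a small stand-in. The sketch below replaces the real pipeline with a hypothetical `stub_pipeline` so that it runs without transformers or runhouse. It assumes the feature-extraction output is laid out as [1][num_tokens][hidden_size] for a single string and as a list of such items for a list of strings. The two-dimensional "hidden states" are made up for readability.

```python
from typing import List, Union


def stub_pipeline(prompt: Union[str, List[str]]):
    """Hypothetical stand-in for a feature-extraction pipeline.

    Assumed output layout: [1][num_tokens][hidden_size] for one string,
    and a list of such items for a list of strings.
    """

    def features(text: str) -> List[List[List[float]]]:
        tokens = text.split()
        # One 2-d "hidden state" per token: [token position, token length].
        return [[[float(i), float(len(tok))] for i, tok in enumerate(tokens)]]

    if isinstance(prompt, list):
        return [features(text) for text in prompt]
    return features(prompt)


def inference_fn(pipeline, prompt):
    # Same selection logic as above: keep the last token's hidden state.
    if isinstance(prompt, list):
        return [emb[0][-1] for emb in pipeline(prompt)]
    return pipeline(prompt)[0][-1]


single = inference_fn(stub_pipeline, "This is a test document.")
batch = inference_fn(stub_pipeline, ["one two", "three"])
print(single)
print(batch)
```

Under this layout, the index `[0]` drops the leading batch dimension of size one and `[-1]` picks the final token, so each input text is reduced to a single vector.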

Sentence Transformers

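Sentence Transformers embeddings are called through the HuggingFaceEmbeddings integration. LangChain also provides an alias, SentenceTransformerEmbeddings, for users who are more familiar with using the SentenceTransformers package directly. SentenceTransformers is a Python package that can generate text and image embeddings, and it originates from Sentence-BERT: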

!pip install sentence_transformers > /dev/null

from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Equivalent to SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text, "This is not a test document."])

TensorFlow Hub

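TensorFlow Hub is a repository of trained machine learning models that are ready for fine-tuning and can be deployed anywhere. It lets us search and discover hundreds of trained, ready-to-deploy machine learning models in one place. The example below loads the TensorflowHubEmbeddings class: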

from langchain.embeddings import TensorflowHubEmbeddings
embeddings = TensorflowHubEmbeddings()

text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_results = embeddings.embed_documents(["foo"])

