LlamaIndex Components: Indexing

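Table of Contents

- I. Index Overview
  - Concepts
- II. How Each Index Works
  - 1. Summary Index (formerly List Index)
    - Querying
  - 2. Vector Store Index
    - Querying
  - 3. Tree Index
    - Querying
  - 4. Keyword Table Index
    - Querying
- III. Using VectorStoreIndex
  - 1. Loading data into the index
    - 1.1 Basic usage
    - 1.2 Using the ingestion pipeline to create nodes
    - 1.3 Creating and managing nodes directly
      - Handling document updates
  - 2. Storing the vector index
  - 3. Composable retrieval
- IV. Document Management
  - 1. Insertion
  - 2. Deletion
  - 3. Update
  - 4. Refresh
  - 5. Document tracking
- V. LlamaCloudIndex + LlamaCloudRetriever
  - 1. Access
  - 2. Setup
  - 3. Usage
  - 4. Retriever settings
- VI. Metadata Extraction
  - 1. Introduction
  - 2. Usage
  - 3. Custom extractors
  - 4. Modules
- VII. Module Guides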

This article is adapted from: https://docs.llamaindex.ai/en/stable/module_guides/indexing/

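I. Index Overview

Concepts

An Index is a data structure that lets us quickly retrieve the context relevant to a user query.
For LlamaIndex, it is the core foundation for retrieval-augmented generation (RAG) use cases.

At a high level, Indexes are built from Documents.
They are used to build query engines and chat engines, which enable question answering and chat over your data.

Under the hood, Indexes store data in Node objects (which represent chunks of the original documents) and expose a Retriever interface that supports additional configuration and automation.

The most common index by far is the VectorStoreIndex; the best place to start is the VectorStoreIndex usage guide.

For other indexes, see the guide on how each index works to help you decide which one suits your use case.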

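II. How Each Index Works

This guide describes how each index works.

Some terminology:

Node: corresponds to a chunk of text from a Document.
LlamaIndex takes in Document objects and internally parses/chunks them into Node objects.
Response synthesis: the module that synthesizes a response from the retrieved Nodes.
Different response modes can be specified.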

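1. Summary Index (formerly List Index)

The summary index simply stores Nodes as a sequential chain.

Querying

During query time, if no other query parameters are specified, LlamaIndex simply loads all Nodes in the list into the response synthesis module.

The summary index also offers several ways to query it: an embedding-based query that fetches the top-k neighbors, or a keyword filter.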
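The two query behaviours described above can be pictured with a small standalone sketch. It does not use LlamaIndex: `ToyNode`, `ToySummaryIndex`, and `required_keywords` are hypothetical names chosen for illustration, and the keyword filter is a plain substring match rather than whatever the library does internally.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """A chunk of text, standing in for a LlamaIndex Node (hypothetical class)."""
    node_id: str
    text: str


class ToySummaryIndex:
    """Stores nodes as a plain sequential list."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def retrieve(self, required_keywords=None):
        # With no query parameters, every node is returned in order.
        if not required_keywords:
            return list(self.nodes)
        # With a keyword filter, keep only nodes containing every keyword.
        wanted = [kw.lower() for kw in required_keywords]
        return [
            node for node in self.nodes
            if all(kw in node.text.lower() for kw in wanted)
        ]


summary_index = ToySummaryIndex([
    ToyNode("n1", "Llamas live in the Andes."),
    ToyNode("n2", "Alpacas are smaller than llamas."),
    ToyNode("n3", "Vector stores hold embeddings."),
])
all_nodes = summary_index.retrieve()
llama_nodes = summary_index.retrieve(required_keywords=["llamas"])
```

Here `all_nodes` holds every node in insertion order, while `llama_nodes` holds only the nodes that mention the keyword.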

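2. Vector Store Index

The vector store index stores each Node and its corresponding embedding in a vector store.

Querying

Querying a vector store index involves fetching the top-k most similar Nodes and passing them to the response synthesis module.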
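As a rough illustration of top-k retrieval, the sketch below ranks stored embeddings by cosine similarity to a query embedding. It is a toy under stated assumptions: the two-dimensional vectors are made up, cosine similarity is only one possible similarity measure, and `top_k_nodes` is a hypothetical helper, not a LlamaIndex function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_nodes(query_embedding, store, k=2):
    """Return the ids of the k nodes whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), node_id)
        for node_id, emb in store.items()
    ]
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored[:k]]


# Made-up embeddings; a real index would get these from an embedding model.
toy_store = {
    "n1": [1.0, 0.0],
    "n2": [0.8, 0.6],
    "n3": [0.0, 1.0],
}
closest = top_k_nodes([1.0, 0.1], toy_store, k=2)
```

The ids in `closest` are what would then be handed to response synthesis.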

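3. Tree Index

The tree index builds a hierarchical tree from a set of Nodes, which become the leaf nodes of the tree.

Querying

Querying a tree index involves traversing from the root nodes down to the leaf nodes.
By default (child_branch_factor=1), a query chooses one child node given a parent node.
If child_branch_factor=2, a query chooses two child nodes per level.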
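The effect of child_branch_factor can be sketched with a toy traversal. This is an assumption-laden illustration: `TreeNode`, `overlap`, and `query_tree` are hypothetical, and the word-overlap score is only a stand-in for however the real index decides which child is most relevant.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    text: str
    children: list = field(default_factory=list)


def overlap(query, text):
    """Stand-in relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def query_tree(root, query, child_branch_factor=1):
    """Walk from the root to the leaves, keeping the best children of each parent."""
    frontier = [root]
    leaves = []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:
                leaves.append(node)
                continue
            ranked = sorted(
                node.children,
                key=lambda child: overlap(query, child.text),
                reverse=True,
            )
            next_frontier.extend(ranked[:child_branch_factor])
        frontier = next_frontier
    return [leaf.text for leaf in leaves]


root = TreeNode("animals and storage", [
    TreeNode("llamas alpacas animals", [
        TreeNode("llamas live in the andes"),
        TreeNode("alpacas have soft wool"),
    ]),
    TreeNode("vector storage embeddings", [
        TreeNode("embeddings are vectors"),
        TreeNode("storage can be persistent"),
    ]),
])

one_branch = query_tree(root, "where do llamas live", child_branch_factor=1)
two_branches = query_tree(root, "where do llamas live", child_branch_factor=2)
```

With a branch factor of 1 the walk ends at a single leaf; with 2 it reaches every leaf of this small tree, at the cost of visiting more nodes.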

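4. Keyword Table Index

The keyword table index extracts keywords from each Node and builds a mapping from each keyword to the Nodes that contain it.

Querying

During query time, we extract relevant keywords from the query and match them against the pre-extracted Node keywords to fetch the corresponding Nodes.
The retrieved Nodes are passed to the response synthesis module.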
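A minimal sketch of the keyword-to-node mapping follows. The keyword extraction here is a naive stopword filter chosen for illustration (an assumption; the real index may use an LLM or another extractor), and all function names are hypothetical.

```python
from collections import defaultdict

# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "in", "are", "is", "do", "where", "what", "than"}


def extract_keywords(text):
    """Naive stand-in for keyword extraction: lowercase words minus stopwords."""
    words = (word.strip(".,?!").lower() for word in text.split())
    return {word for word in words if word and word not in STOPWORDS}


def build_keyword_table(nodes):
    """Map every keyword to the ids of the nodes it was extracted from."""
    table = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def retrieve_by_keywords(table, query):
    """Collect the nodes that share at least one keyword with the query."""
    matched = set()
    for keyword in extract_keywords(query):
        matched |= table.get(keyword, set())
    return sorted(matched)


toy_nodes = {
    "n1": "Llamas live in the Andes.",
    "n2": "Alpacas are smaller than llamas.",
    "n3": "Vector stores hold embeddings.",
}
keyword_table = build_keyword_table(toy_nodes)
hits = retrieve_by_keywords(keyword_table, "Where do llamas live?")
```

The table is built once at indexing time, so a query only needs a handful of dictionary lookups.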

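III. Using VectorStoreIndex

Vector stores are a key component of retrieval-augmented generation (RAG), so you will end up using them, directly or indirectly, in nearly every application you build with LlamaIndex.

Vector stores accept a list of Node objects and build an index from them.

1. Loading data into the index

1.1 Basic usage

The simplest way to use a vector store is to load a set of documents and build an index from them using from_documents: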

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader(
    "../../examples/data/paul_graham"
).load_data()

index = VectorStoreIndex.from_documents(documents)

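Tip

If you are using from_documents on the command line, it can be convenient to pass show_progress=True to display a progress bar during index construction.

When you use from_documents, your Documents are split into chunks and parsed into Node objects, lightweight abstractions over text strings that keep track of metadata and relationships.

For more on how to load documents, see Understanding Loading.

By default, VectorStoreIndex stores everything in memory.
For more on how to use persistent vector stores, see the section on storing the vector index below.

Tip

By default, VectorStoreIndex generates and inserts vectors in batches of 2048 nodes.
If you are memory constrained (or have memory to spare), you can change this by passing insert_batch_size set to your desired batch size (the default is 2048).

This is especially helpful when you are inserting into a remotely hosted vector database.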
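To make the batching idea concrete, here is a standalone sketch of inserting items in fixed-size batches. It is not LlamaIndex's internal code: `batched` and `insert_in_batches` are hypothetical helpers, and a plain list stands in for the vector store.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def insert_in_batches(nodes, insert_batch, batch_size=2048):
    """Call insert_batch once per slice and return how many calls were made."""
    calls = 0
    for batch in batched(nodes, batch_size):
        insert_batch(batch)
        calls += 1
    return calls


stored = []  # stands in for a (possibly remote) vector store
node_ids = [f"node-{i}" for i in range(5000)]
num_calls = insert_in_batches(node_ids, stored.extend, batch_size=2048)
```

With 5000 items and a batch size of 2048, three insert calls are made. A smaller batch size lowers peak memory per call but increases the number of round trips to a remote store.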

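1.2 Using the ingestion pipeline to create nodes

If you want more control over how your documents are indexed, we recommend using the ingestion pipeline.
This allows you to customize the chunking, metadata, and embedding of the nodes.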

from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)

# run the pipeline
nodes = pipeline.run(documents=[Document.example()])

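Tip

See the ingestion pipeline guide to learn more about how to use it.

1.3 Creating and managing nodes directly

If you want total control over your index, you can create and define nodes manually and pass them directly to the index constructor: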

from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node2 = TextNode(text="<text_chunk>", id_="<node_id>")
nodes = [node1, node2]
index = VectorStoreIndex(nodes)

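Handling document updates

When managing your index directly, you will need to deal with data sources that change over time.
Index classes have insertion, deletion, update, and refresh operations, which are covered in the following guides:

Metadata Extraction
Document Management

2. Storing the vector index

LlamaIndex supports dozens of vector stores.
You specify which one to use by passing in a StorageContext, on which you in turn set the vector_store argument, as in this example using Pinecone: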

import pinecone
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.vector_stores.pinecone import PineconeVectorStore

# init pinecone
pinecone.init(api_key="<api_key>", environment="<environment>")
pinecone.create_index(
    "quickstart", dimension=1536, metric="euclidean", pod_type="p1"
)

# construct vector store and customize storage context
storage_context = StorageContext.from_defaults(
    vector_store=PineconeVectorStore(pinecone.Index("quickstart"))
)

# Load documents and build index
documents = SimpleDirectoryReader(
    "../../examples/data/paul_graham"
).load_data()
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

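For more examples of how to use VectorStoreIndex, see our vector store index usage examples notebook.

For examples of how to use VectorStoreIndex with specific vector stores, see the vector stores under the Storing section.

3. Composable retrieval

The VectorStoreIndex (and any other index/retriever) is capable of retrieving generic objects, including: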

references to nodes
query engines
retrievers
query pipelines

If these objects are retrieved, they are automatically run using the provided query. For example:

from llama_index.core.schema import IndexNode

query_engine = other_index.as_query_engine()
obj = IndexNode(
    text="A query engine describing X, Y, and Z.",
    obj=query_engine,
    index_id="my_query_engine",
)

index = VectorStoreIndex(nodes=nodes, objects=[obj])
retriever = index.as_retriever(verbose=True)

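If the index node containing the query engine is retrieved, the query engine is run and the resulting response is returned as a node.

For more details, check out the guide.

IV. Document Management

Most LlamaIndex index structures allow insertion, deletion, update, and refresh operations.

1. Insertion

You can "insert" a new Document into any index data structure after building the index initially.
The Document is broken down into nodes and ingested into the index.

The underlying mechanism behind insertion depends on the index structure.
For instance, for the summary index, a new Document is inserted as additional node(s) in the list.
For the vector store index, a new Document (and its embeddings) is inserted into the underlying document/embedding store.

An example notebook showcasing the insert capabilities is available.
It shows how to construct an empty index, manually create Document objects, and add them to the index data structure.

An example code snippet is given below: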

from llama_index.core import SummaryIndex, Document

index = SummaryIndex([])
text_chunks = ["text_chunk_1", "text_chunk_2", "text_chunk_3"]

doc_chunks = []
for i, text in enumerate(text_chunks):
    doc = Document(text=text, id_=f"doc_id_{i}")
    doc_chunks.append(doc)

# insert
for doc_chunk in doc_chunks:
    index.insert(doc_chunk)

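2. Deletion

You can "delete" a Document from most index data structures by specifying a document_id.
(Note: the tree index currently does not support deletion.)
All nodes corresponding to the Document will be deleted.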

index.delete_ref_doc("doc_id_0", delete_from_docstore=True)

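delete_from_docstore defaults to False, in case you are sharing nodes between indexes that use the same docstore.
However, even when it is set to False, these nodes will not be used at query time, because they are deleted from the index's index_struct, which keeps track of which nodes can be used for querying.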
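The split between the docstore and the index_struct can be pictured with a toy model. This is a simplified sketch, not the library's data structures: `ToyIndex` is hypothetical, the docstore is a plain dict, and the index_struct is just a list of node ids.

```python
class ToyIndex:
    """Toy model: a docstore holding node content plus an index_struct of usable ids."""

    def __init__(self):
        self.docstore = {}       # node_id -> (ref_doc_id, text); may be shared
        self.index_struct = []   # node ids this index may use for queries

    def insert(self, ref_doc_id, chunks):
        for position, text in enumerate(chunks):
            node_id = f"{ref_doc_id}-node-{position}"
            self.docstore[node_id] = (ref_doc_id, text)
            self.index_struct.append(node_id)

    def delete_ref_doc(self, ref_doc_id, delete_from_docstore=False):
        doomed = [
            node_id for node_id in self.index_struct
            if self.docstore[node_id][0] == ref_doc_id
        ]
        # The index always forgets the nodes, so queries stop seeing them.
        self.index_struct = [n for n in self.index_struct if n not in doomed]
        # The stored copies are only dropped when explicitly requested.
        if delete_from_docstore:
            for node_id in doomed:
                del self.docstore[node_id]


toy = ToyIndex()
toy.insert("doc_id_0", ["first chunk", "second chunk"])
toy.insert("doc_id_1", ["third chunk"])

toy.delete_ref_doc("doc_id_0")  # flag left at its default of False
queryable_after_first = list(toy.index_struct)
stored_after_first = len(toy.docstore)

toy.delete_ref_doc("doc_id_1", delete_from_docstore=True)
stored_after_second = len(toy.docstore)
```

After the first call the nodes of doc_id_0 are no longer queryable but still sit in the docstore; the second call removes both the index entries and the stored copies.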

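3. Update

If a Document is already present in an index, you can "update" it by passing a Document with the same id_ (for example, if the information in the Document has changed).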

# NOTE: the document has a `doc_id` specified
doc_chunks[0].text = "Brand new document text"
index.update_ref_doc(
    doc_chunks[0],
    update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)

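Here, we passed in some extra kwargs to ensure the document is deleted from the docstore. This is of course optional.

4. Refresh

If you set the id_ of each document when loading your data, you can also automatically refresh the index.

The refresh_ref_docs() function only updates documents that have the same id_ but different text content.
Any documents not present in the index at all are also inserted.

refresh_ref_docs() also returns a boolean list indicating which documents in the input have been refreshed in the index.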

# modify first document, with the same doc_id
doc_chunks[0] = Document(text="Super new document text", id_="doc_id_0")

# add a new document
doc_chunks.append(
    Document(
        text="This isn't in the index yet, but it will be soon!",
        id_="doc_id_3",
    )
)

# refresh the index
refreshed_docs = index.refresh_ref_docs(
    doc_chunks, update_kwargs={"delete_kwargs": {"delete_from_docstore": True}}
)

# refreshed_docs[0] and refreshed_docs[-1] should be true

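Again, we passed in some extra kwargs to ensure the document is deleted from the docstore. This is of course optional.

If you print() the output of refresh_ref_docs(), you will see which input documents were refreshed: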

print(refreshed_docs)
# > [True, False, False, True]

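This is most useful when you are reading from a directory that is constantly updated with new information.

To set the document id_ automatically when using SimpleDirectoryReader, you can set the filename_as_id flag.
You can learn more in Customizing Documents.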
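The refresh rule (update on changed text, insert on unseen id, skip otherwise) can be written down as a short standalone sketch. It is an illustration only: `toy_refresh` is a hypothetical helper, a dict stands in for the index, and texts are compared directly, which is an assumption about how a change is detected.

```python
def toy_refresh(stored, incoming):
    """Update `stored` (doc id -> text) in place and report what changed."""
    refreshed = []
    for doc_id, text in incoming:
        if doc_id not in stored:
            stored[doc_id] = text      # unseen id: insert
            refreshed.append(True)
        elif stored[doc_id] != text:
            stored[doc_id] = text      # same id, new text: update
            refreshed.append(True)
        else:
            refreshed.append(False)    # same id, same text: skip
    return refreshed


stored_docs = {
    "doc_id_0": "text_chunk_1",
    "doc_id_1": "text_chunk_2",
    "doc_id_2": "text_chunk_3",
}
incoming_docs = [
    ("doc_id_0", "Super new document text"),   # changed
    ("doc_id_1", "text_chunk_2"),              # unchanged
    ("doc_id_2", "text_chunk_3"),              # unchanged
    ("doc_id_3", "This isn't in the index yet, but it will be soon!"),  # new
]
refreshed_flags = toy_refresh(stored_docs, incoming_docs)
```

The resulting flags mirror the `[True, False, False, True]` output shown earlier: the first document changed, the middle two did not, and the last one is new.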

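5. Document tracking

For any index that uses the docstore (i.e. all indexes except most vector store integrations), you can also see which documents you have inserted into the docstore.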

print(index.ref_doc_info)
"""
> {'doc_id_1': RefDocInfo(node_ids=['071a66a8-3c47-49ad-84fa-7010c6277479'], metadata={}),
   'doc_id_2': RefDocInfo(node_ids=['9563e84b-f934-41c3-acfd-22e88492c869'], metadata={}),
   'doc_id_0': RefDocInfo(node_ids=['b53e6c2f-16f7-4024-af4c-42890e945f36'], metadata={}),
   'doc_id_3': RefDocInfo(node_ids=['6bedb29f-15db-4c7c-9885-7490e10aa33f'], metadata={})}
"""

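Each entry in the output shows the id_ of an ingested document as the key, along with the node_ids of the nodes it was split into.

Lastly, the original metadata dictionary of each input document is also tracked.
You can read more about the metadata attribute in Customizing Documents.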
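The shape of this mapping is easy to reproduce by hand. The sketch below groups node ids by the document they came from; `build_ref_doc_info` and the ids are hypothetical, and the real RefDocInfo objects also carry metadata, which is omitted here.

```python
from collections import defaultdict


def build_ref_doc_info(nodes):
    """Group node ids by the id of the document they were split from."""
    info = defaultdict(list)
    for node_id, ref_doc_id in nodes:
        info[ref_doc_id].append(node_id)
    return dict(info)


# (node id, source document id) pairs with made-up ids.
toy_node_pairs = [
    ("node-a", "doc_id_0"),
    ("node-b", "doc_id_0"),
    ("node-c", "doc_id_1"),
]
toy_ref_doc_info = build_ref_doc_info(toy_node_pairs)
```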

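V. LlamaCloudIndex + LlamaCloudRetriever

LlamaCloud is a new generation of managed parsing, ingestion, and retrieval services, designed to bring production-grade context augmentation to your LLM and RAG applications.

Currently, LlamaCloud supports:

A managed ingestion API that handles parsing and document management
A managed retrieval API that configures optimal retrieval for your RAG system

1. Access

We are opening up a private beta of the managed ingestion and retrieval API to a limited set of enterprise partners.
If you are interested in centralizing your data pipelines and spending more time on your actual RAG use cases, get in touch with us.

If you have access to LlamaCloud, you can sign in to LlamaCloud and get an API key.

2. Setup

First, make sure you have the latest LlamaIndex version installed.

Note: If you are upgrading from v0.9.X, we recommend following our migration guide and uninstalling your previous version first.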

pip uninstall llama-index  # run this if upgrading from v0.9.x or older
pip install -U llama-index --upgrade --no-cache-dir --force-reinstall

The llama-index-indices-managed-llama-cloud package is included with the above install, but you can also install it directly:

pip install -U llama-index-indices-managed-llama-cloud

3. Usage

You can create an index on LlamaCloud using the following code:

import os

os.environ[
    "LLAMA_CLOUD_API_KEY"
] = "llx-..."  # can provide API-key in env or in the constructor later on

from llama_index.core import SimpleDirectoryReader
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# create a new index
index = LlamaCloudIndex.from_documents(
    documents,
    "my_first_index",
    project_name="default",
    api_key="llx-...",
    verbose=True,
)

# connect to an existing index
index = LlamaCloudIndex("my_first_index", project_name="default")

You can also configure a retriever for managed retrieval:

# from the existing index
index.as_retriever()

# from scratch
from llama_index.indices.managed.llama_cloud import LlamaCloudRetriever

retriever = LlamaCloudRetriever("my_first_index", project_name="default")

And of course, you can use the other index shortcuts with your new managed index:

query_engine = index.as_query_engine(llm=llm)

chat_engine = index.as_chat_engine(llm=llm)

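4. Retriever settings

The full list of retriever settings/kwargs is below:

dense_similarity_top_k: Optional[int] – if greater than 0, retrieve k nodes using dense retrieval
sparse_similarity_top_k: Optional[int] – if greater than 0, retrieve k nodes using sparse retrieval
enable_reranking: Optional[bool] – whether to enable reranking, which sacrifices some speed for accuracy
rerank_top_n: Optional[int] – the number of nodes to return after reranking the initial retrieval results
alpha: Optional[float] – the weighting between dense and sparse retrieval; 1 = full dense retrieval, 0 = full sparse retrieval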
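One common way to read an alpha weighting is as a linear blend of the dense and sparse scores. The sketch below shows that interpretation only; how LlamaCloud actually combines the two result sets is not stated here, so the formula, the function names, and the scores are all assumptions for illustration.

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Blend per-node scores: alpha * dense + (1 - alpha) * sparse (assumed formula)."""
    node_ids = set(dense) | set(sparse)
    return {
        node_id: alpha * dense.get(node_id, 0.0)
        + (1 - alpha) * sparse.get(node_id, 0.0)
        for node_id in node_ids
    }


def rank(scores):
    """Node ids ordered from highest to lowest blended score."""
    return sorted(scores, key=lambda node_id: scores[node_id], reverse=True)


# Made-up scores for three nodes.
dense_scores = {"n1": 0.9, "n2": 0.2}
sparse_scores = {"n2": 0.8, "n3": 0.5}

all_dense = rank(hybrid_scores(dense_scores, sparse_scores, alpha=1.0))
all_sparse = rank(hybrid_scores(dense_scores, sparse_scores, alpha=0.0))
```

At alpha = 1 the ranking follows the dense scores alone, and at alpha = 0 it follows the sparse scores alone, matching the two endpoints described in the settings list.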

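VI. Metadata Extraction

1. Introduction

In many cases, especially with long documents, a chunk of text may lack the context needed to disambiguate it from other similar chunks of text.

To combat this, we use LLMs to extract certain contextual information relevant to the document, to better help the retrieval and language models disambiguate similar-looking passages.

We show this in an example notebook and demonstrate its effectiveness in processing long documents.

2. Usage

First, we define a metadata extractor that takes a list of feature extractors to be processed in sequence.

We then feed this to the node parser, which adds the additional metadata to each node.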

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
)
from llama_index.extractors.entity import EntityExtractor

transformations = [
    SentenceSplitter(),
    TitleExtractor(nodes=5),
    QuestionsAnsweredExtractor(questions=3),
    SummaryExtractor(summaries=["prev", "self"]),
    KeywordExtractor(keywords=10),
    EntityExtractor(prediction_threshold=0.5),
]

Then, we can run the transformations on input documents or nodes:

from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=transformations)

nodes = pipeline.run(documents=documents)

Here is an example of the extracted metadata:

{'page_label': '2',
 'file_name': '10k-132.pdf',
 'document_title': 'Uber Technologies, Inc. 2019 Annual Report: Revolutionizing Mobility and Logistics Across 69 Countries and 111 Million MAPCs with $65 Billion in Gross Bookings',
 'questions_this_excerpt_can_answer': '\n\n1. How many countries does Uber Technologies, Inc. operate in?\n2. What is the total number of MAPCs served by Uber Technologies, Inc.?\n3. How much gross bookings did Uber Technologies, Inc. generate in 2019?',
 'prev_section_summary': "\n\nThe 2019 Annual Report provides an overview of the key topics and entities that have been important to the organization over the past year. These include financial performance, operational highlights, customer satisfaction, employee engagement, and sustainability initiatives. It also provides an overview of the organization's strategic objectives and goals for the upcoming year.",
 'section_summary': '\nThis section discusses a global tech platform that serves multiple multi-trillion dollar markets with products leveraging core technology and infrastructure. It enables consumers and drivers to tap a button and get a ride or work. The platform has revolutionized personal mobility with ridesharing and is now leveraging its platform to redefine the massive meal delivery and logistics industries. The foundation of the platform is its massive network, leading technology, operational excellence, and product expertise.',
 'excerpt_keywords': '\nRidesharing, Mobility, Meal Delivery, Logistics, Network, Technology, Operational Excellence, Product Expertise, Point A, Point B'}
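The way sequential extractors accumulate metadata on each node can be sketched without any LLM. Everything below is a stand-in: the nodes are plain dicts, and `title_extractor` and `keyword_extractor` use trivial string rules in place of the model calls the real extractors make.

```python
def title_extractor(node):
    """Stand-in title: the first sentence of the text."""
    return {"document_title": node["text"].split(".")[0]}


def keyword_extractor(node):
    """Stand-in keywords: the three longest distinct words (ties broken alphabetically)."""
    words = [w.strip(".,").lower() for w in node["text"].split()]
    longest = sorted(set(words), key=lambda w: (-len(w), w))[:3]
    return {"excerpt_keywords": ", ".join(longest)}


def run_transformations(nodes, extractors):
    """Apply each extractor in order, merging its output into every node's metadata."""
    for extractor in extractors:
        for node in nodes:
            node["metadata"].update(extractor(node))
    return nodes


toy_nodes = [
    {"text": "Llamas live in the Andes. They are social animals.", "metadata": {}},
]
enriched = run_transformations(toy_nodes, [title_extractor, keyword_extractor])
```

Each extractor only adds keys, so later extractors could in principle read what earlier ones wrote, which is why the order of the list matters.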

3. Custom extractors

If the provided extractors do not fit your needs, you can also define a custom extractor like so:

from typing import Dict, List

from llama_index.core.extractors import BaseExtractor

class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes) -> List[Dict]:
        metadata_list = [
            {
                "custom": node.metadata["document_title"]
                + "\n"
                + node.metadata["excerpt_keywords"]
            }
            for node in nodes
        ]
        return metadata_list

extractor.extract() automatically calls aextract() under the hood, to provide both sync and async entry points.
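The sync-over-async pattern can be sketched in a few lines. This is a generic illustration and not the BaseExtractor source: `ToyBaseExtractor` and `WordCountExtractor` are hypothetical classes, and using `asyncio.run` assumes no event loop is already running in the calling thread.

```python
import asyncio


class ToyBaseExtractor:
    """Toy base class: subclasses implement aextract, callers may use either entry point."""

    async def aextract(self, nodes):
        raise NotImplementedError

    def extract(self, nodes):
        # Synchronous entry point that drives the async implementation.
        return asyncio.run(self.aextract(nodes))


class WordCountExtractor(ToyBaseExtractor):
    async def aextract(self, nodes):
        # One metadata dict per input node, in the same order.
        return [{"word_count": len(text.split())} for text in nodes]


word_counts = WordCountExtractor().extract(["one two three", "four five"])
```

Subclasses only write the async method once, and synchronous callers get the same result through `extract`.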

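In a more advanced example, it can also make use of an llm to extract features from the node content and the existing metadata.
Refer to the source code of the provided metadata extractors for more details.

4. Modules

Below you will find guides and tutorials for various metadata extractors.

SEC Documents Metadata Extraction
LLM Survey Extraction
Entity Extraction
Marvin Metadata Extraction
Pydantic Metadata Extraction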

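VII. Module Guides

Vector Store Index
Summary Index
Tree Index
Keyword Table Index
Knowledge Graph Index
Knowledge Graph Query Engine
Knowledge Graph RAG Query Engine
REBEL + Knowledge Graph Index
REBEL + Wikipedia Filtering
SQL Query Engine
DuckDB Query Engine
Document Summary Index
Object Index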

2024-04-15 (Mon)

Original link: https://blog.csdn.net/lovechris00/article/details/137786993