This tutorial walks you step by step through extracting specific content from a web page with the LangChain library and building a question-answering system on top of it. Each step is explained with concrete code examples.
1. Loading Documents from a Web Page
First, we use WebBaseLoader to load documents from a given web page, using the bs4 library to filter for specific HTML elements.
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Keep only the post title, headers, and content
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()
len(docs[0].page_content)
This code first defines a SoupStrainer object that keeps only HTML elements with the given class names. It then creates a WebBaseLoader with the target URL and the filter, and finally calls load() to fetch the document; the last line returns the length of the first document's content.
Output:
USER_AGENT environment variable not set, consider setting it to identify your requests.
43131
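The USER_AGENT warning above is harmless, but it can be silenced by setting the environment variable before the loader makes its HTTP request. A minimal sketch — the agent string here is an arbitrary example, not a required value:

```python
import os

# Set a descriptive User-Agent before WebBaseLoader issues its request;
# setdefault leaves any value you have already exported untouched.
os.environ.setdefault("USER_AGENT", "langchain-rag-tutorial/0.1")

print(os.environ["USER_AGENT"])
```

Set it before importing or constructing the loader so the first request already carries it.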
2. Inspecting the Document Content
To confirm that the loaded content matches expectations, we can print the first 500 characters of the document.
print(docs[0].page_content[:500])
Output:
LLM Powered Autonomous AgentsDate: June 23, 2023 | Estimated Reading Time: 31 min | Author: Lilian WengBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.Agent System Overview#In
3. Splitting the Document
To handle the long document more effectively, we use RecursiveCharacterTextSplitter to split it into smaller chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
len(all_splits)
Output:
66
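RecursiveCharacterTextSplitter actually splits on a hierarchy of separators (paragraphs, then sentences, then words) and merges pieces back up toward chunk_size; the fixed-stride sketch below is only a simplified illustration of how chunk_size and chunk_overlap interact, not the library's real algorithm:

```python
def split_with_overlap(text, chunk_size, overlap):
    """Toy fixed-stride splitter: each chunk shares `overlap`
    characters with the previous one."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

print(split_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']  (each neighbouring pair shares 2 chars)
```

The overlap is what lets a sentence cut at a chunk boundary still appear whole in the next chunk.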
4. Inspecting the Split Chunks
We can inspect the content and metadata of the first chunk.
all_splits[0].page_content
Output:
'LLM Powered Autonomous Agents\n \nDate: June 23, 2023 | Estimated Reading Time: 31 min | Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n\nMemory'
len(all_splits[0].page_content)
Output:
969
all_splits[0].metadata
Output:
{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 8}
5. Creating a Vector Store
Next, we use Chroma and ZhipuAIEmbeddings to create a vector store for later similarity retrieval.
from langchain_chroma import Chroma
from langchain_community.embeddings import ZhipuAIEmbeddings

embed = ZhipuAIEmbeddings(
    model="Embedding-3",
    api_key="your api key",
)
vectorstore = Chroma.from_documents(all_splits[:30], embedding=embed)
6. Creating a Retriever
We create a retriever from the vector store and set the retrieval parameters.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")
len(retrieved_docs)
Output:
6
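Conceptually, similarity search embeds the query and ranks the stored chunk vectors by how close they are to it, returning the k best. The cosine-similarity version below is a self-contained sketch of that idea, not Chroma's actual implementation (Chroma uses an approximate-nearest-neighbour index):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    # Rank document vectors by similarity to the query, highest first.
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # → [0, 2]
```

Real embeddings have hundreds or thousands of dimensions, but the ranking logic is the same.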
7. Inspecting the Retrieval Results
Print the content of the first retrieved document.
print(retrieved_docs[0].page_content)
Output:
Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
8. Initializing the LLM
We use ChatOpenAI to initialize a large language model (LLM) for answer generation.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    temperature=0,
    model="GLM-4-Plus",
    openai_api_key="your api key",
    openai_api_base="https://open.bigmodel.cn/api/paas/v4/",
)
9. Pulling a Prompt Template
We pull a prompt template for question answering from the LangChain Hub.
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()
example_messages
Output:
[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:", additional_kwargs={}, response_metadata={})]
10. Inspecting the Prompt Template
Print the content of the rendered prompt.
print(example_messages[0].content)
Output:
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question
Context: filler context
Answer:
11. Building the QA Chain
We build a question-answering chain that feeds the retrieved documents and the question into the LLM and streams the generated answer.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)
Output:
Task Decomposition is the process of breaking down complex tasks into smaller, more manageable steps to enhance model performance and provide insight into the model's thinking process. This can be achieved through prompting techniques like Chain of Thought (CoT) or Tree of Thoughts (ToT), which explore multiple reasoning possibilities. It can be executed by LLMs with simple prompts, task-specific instructions, or with human inputs.
12. Built-in Chains
The same pipeline can be assembled from LangChain's built-in chains, create_retrieval_chain and create_stuff_documents_chain.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is Task Decomposition?"})
print(response["answer"])
Output:
Task Decomposition is a method used in AI systems to break down complex tasks into smaller, more manageable steps. It involves techniques like Chain of Thought (CoT) and Tree of Thoughts (ToT), which help the model think step by step and explore multiple reasoning possibilities. This process enhances the model's performance by making complex tasks easier to handle and interpret.
13. Inspecting the Context Documents
Print the context documents that were used to generate the answer.
for document in response["context"]:
    print(document)
    print()
Output:
page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 1585}

page_content='Fig. 11. Illustration of how HuggingGPT works. (Image source: Shen et al. 2023)
The system comprises of 4 stages:
(1) Task planning: LLM works as the brain and parses the user requests into multiple tasks. There are four attributes associated with each task: task type, ID, dependencies, and arguments. They use few-shot examples to guide LLM to do task parsing and planning.
Instruction:' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 17414}

page_content='Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 2192}

page_content='The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 17804}

page_content='Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language.
Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains.
Self-Reflection#
Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 2837}

page_content='(3) Task execution: Expert models execute on the specific tasks and log results.
Instruction:With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 19373}
14. Custom Prompt Template
We create a custom prompt template and use it in the QA chain.
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)
rag_chain.invoke("What is Task Decomposition?")
Output:
'Task Decomposition is the process of breaking down a complex task into smaller, more manageable steps. It can be achieved through methods like Chain of Thought (CoT) or Tree of Thoughts (ToT), which involve step-by-step reasoning or exploring multiple reasoning paths. This helps in simplifying task execution and improving model performance.\n\nThanks for asking!'
With the steps above, we have built a working question-answering system over web content. I hope this tutorial helps!
Notes:
The official LangChain tutorial uses OpenAI's embedding model, while this article uses ZhipuAI's. The difference is that the ZhipuAI embedding model cannot embed overly long input in a single request, which makes the code raise an error. Changing all_splits to all_splits[:30] lets it run successfully.
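An alternative to truncating the chunk list is to index all of it in small batches, so that each embedding request stays within the provider's limit. A sketch — the batch size of 10 is an assumption to tune against the actual limit, and the commented loop relies on the vectorstore and all_splits objects from the steps above (add_documents is the standard LangChain vector-store method):

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Index every chunk in batches instead of keeping only all_splits[:30]:
# for batch in batched(all_splits, 10):
#     vectorstore.add_documents(batch)
```

This way no content is lost from the index, at the cost of more embedding calls.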
Reference: https://python.langchain.com/docs/tutorials/rag/
If you have any questions, feel free to ask in the comments.