Vectorstores

A vectorstore stores embedded vector data.

Here we load the LangChain documentation with ReadTheDocsLoader.

First, download the HTML files:

wget -r -A.html -P rtdocs https://langchain.readthedocs.io/en/latest/
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader("rtdocs", features='html.parser')
data = loader.load()
len(data)
text_splitter = CharacterTextSplitter(chunk_size=300, separator="\n", chunk_overlap=0)
texts = text_splitter.split_text(data[0].page_content)
len(texts)
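CharacterTextSplitter greedily merges separator-delimited pieces into chunks no larger than `chunk_size`. A minimal pure-Python sketch of that idea (not LangChain's actual implementation; `simple_split` is a hypothetical helper for illustration):

```python
def simple_split(text, chunk_size=300, separator="\n"):
    """Greedily merge separator-delimited pieces into chunks <= chunk_size."""
    pieces = text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # note: a single long piece may still exceed chunk_size
    if current:
        chunks.append(current)
    return chunks

# 10 lines of 48 characters each, merged under a 120-character cap
sample = "\n".join(f"line {i}: " + "x" * 40 for i in range(10))
chunks = simple_split(sample, chunk_size=120)
print(len(chunks))  # two 48-char lines fit per chunk, so 5 chunks
```

With `chunk_overlap=0`, as in the cell above, chunks share no text; a positive overlap would repeat the tail of each chunk at the head of the next.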

Embed the split text chunks.

embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_texts(texts, embeddings)
Using embedded DuckDB without persistence: data will be transient

Run a query. The source data is in English, but you can also query in Japanese.

query = "チャットボットに関連するところを教えてください"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)
Question Answering: The second big LangChain use case. Answering questions over specific documents, only utilizing the information in those documents to construct an answer.
Chatbots: Since language models are good at producing text, that makes them ideal for creating chatbots.
print(docs[1].page_content)
Additional Resources#
Additional collection of resources we think may be useful as you develop your application!
LangChainHub: The LangChainHub is a place to share and explore other prompts, chains, and agents.
print(docs[2].page_content)
Personal Assistants: The main LangChain use case. Personal assistants need to take actions, remember interactions, and have knowledge about your data.
print(docs[3].page_content)
Indexes: Language models are often more powerful when combined with your own text data - this module covers best practices for doing exactly that.
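Conceptually, `Chroma.from_texts` embeds each chunk, and `similarity_search` embeds the query and returns the nearest chunks. A toy pure-Python sketch of that idea, with a hypothetical bag-of-words `embed()` standing in for OpenAIEmbeddings (real embeddings are dense float vectors, and Chroma uses a vector index rather than a full sort):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_search(query, texts, k=4):
    """Rank stored texts by similarity to the query; return the top k."""
    q = embed(query)
    return sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]

texts = [
    "Chatbots: language models are good at producing text.",
    "Indexes: combine language models with your own text data.",
    "Agents: let a language model decide which tools to call.",
]
print(similarity_search("good at producing text", texts, k=1)[0])
```

A real embedding model maps semantically related words (such as "チャットボット" and "chatbot") to nearby vectors, which is why the Japanese query above still retrieves the relevant English chunks; this word-overlap toy cannot do that.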