Lesson 2.8a: Embeddings, a walkthrough of the Chroma vector database code -- Large Language Model Application Development

Code for using the Chroma vector database
# MAGIC ## Read data
# COMMAND ----------
import pandas as pd
dais_pdf = pd.read_parquet(f"{DA.paths.datasets}/dais/dais23_talks.parquet")
display(dais_pdf)
# COMMAND ----------
dais_pdf["full_text"] = dais_pdf.apply(
??lambda row: f"""Title: {row["Title"]}
????????Abstract:?{row["Abstract"]}""".strip(),
??axis=1,
)
print(dais_pdf.iloc[0]["full_text"])
# COMMAND ----------
texts = dais_pdf["full_text"].to_list()
# COMMAND ----------
# MAGIC %md
# MAGIC ## Question 1
# MAGIC Set up Chroma and create collection
# COMMAND ----------
import chromadb
from chromadb.config import Settings
chroma_client = chromadb.Client(
    Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory=DA.paths.user_db,  # this is an optional argument; if you don't supply it, the data will be ephemeral
    )
)
# COMMAND ----------
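# MAGIC %md
# MAGIC With the legacy `duckdb+parquet` backend configured here, the data is held in memory and only written to `persist_directory` when the client persists it. A minimal sketch, assuming the pre-0.4 `chromadb` API that this `Settings` usage implies:
# COMMAND ----------
# Flush in-memory data to the configured persist_directory (legacy, pre-0.4 chromadb API)
chroma_client.persist()
# COMMAND ----------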
# MAGIC %md
# MAGIC
# MAGIC Assign the value of `my_talks` to the `collection_name` variable.
# COMMAND ----------
# TODO
collection_name = "<FILL_IN>"
# If you have created the collection before, delete it first so re-runs start clean
if collection_name in [collection.name for collection in chroma_client.list_collections()]:
    chroma_client.delete_collection(name=collection_name)

print(f"Creating collection: '{collection_name}'")
talks_collection = chroma_client.create_collection(name=collection_name)
# COMMAND ----------
# Test your answer. DO NOT MODIFY THIS CELL.
dbTestQuestion2_1(collection_name)
# COMMAND ----------
# MAGIC %md
# MAGIC ## Question 2
# MAGIC
# MAGIC [Add](https://docs.trychroma.com/reference/Collection#add) data to the collection.
# COMMAND ----------
# TODO
talks_collection.add(
    documents=<FILL_IN>,
    ids=<FILL_IN>
)
# COMMAND ----------
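# MAGIC %md
# MAGIC For reference, a minimal sketch of one way to complete the cell above: `documents` takes the list of talk texts, and each document needs a unique string ID. The `f"id{i}"` scheme below is an illustrative choice, not the only valid one; note that running this cell actually populates the collection, and adding the same IDs twice will raise an error.
# COMMAND ----------
# Example sketch: add every talk's full text with an arbitrary unique string ID
talks_collection.add(
    documents=texts,
    ids=[f"id{i}" for i in range(len(texts))],
)
# COMMAND ----------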
# Test your answer. DO NOT MODIFY THIS CELL.
dbTestQuestion2_2(talks_collection)
# COMMAND ----------
# MAGIC %md
# MAGIC ## Question 3
# MAGIC
# MAGIC [Query](https://docs.trychroma.com/reference/Collection#query) for relevant documents. If you are looking for talks related to language models, your query texts could be `language models`.
# COMMAND ----------
# TODO
import json
results = talks_collection.query(
    query_texts=<FILL_IN>,
    n_results=<FILL_IN>
)
print(json.dumps(results, indent=4))
# COMMAND ----------
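# MAGIC %md
# MAGIC A minimal sketch of one possible completion; the query text and `n_results` value below are illustrative choices, and the result is stored under a distinct `example_` name so it doesn't clobber your graded `results` variable:
# COMMAND ----------
# Example sketch: retrieve the 10 talks most similar to the query text
example_results = talks_collection.query(query_texts=["language models"], n_results=10)
print(json.dumps(example_results, indent=4))
# COMMAND ----------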
# Test your answer. DO NOT MODIFY THIS CELL.
dbTestQuestion2_3(results)
# COMMAND ----------
# MAGIC %md
# MAGIC ## Question 4
# MAGIC
# MAGIC Load a language model and create a [pipeline](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines).
# COMMAND ----------
# TODO
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Pick a model from HuggingFace that can generate text
model_id = "<FILL_IN>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline(
    "<FILL_IN>",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto",
    handle_long_generation="hole",
)
# COMMAND ----------
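# MAGIC %md
# MAGIC For reference, a minimal sketch of the cell above, assuming `gpt2` as a lightweight illustrative model (any Hugging Face causal LM works). The pipeline task for causal LMs is `text-generation`; distinct `example_*` names are used so this sketch doesn't clobber your graded variables.
# COMMAND ----------
# Example sketch: build a text-generation pipeline around a small causal LM
example_model_id = "gpt2"  # illustrative choice, not a required answer
example_tokenizer = AutoTokenizer.from_pretrained(example_model_id)
example_lm_model = AutoModelForCausalLM.from_pretrained(example_model_id)
example_pipe = pipeline(
    "text-generation",
    model=example_lm_model,
    tokenizer=example_tokenizer,
    max_new_tokens=512,
    device_map="auto",
    handle_long_generation="hole",
)
# COMMAND ----------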
# Test your answer. DO NOT MODIFY THIS CELL.
dbTestQuestion2_4(pipe)
# COMMAND ----------
# MAGIC %md
# MAGIC ## Question 5
# MAGIC
# MAGIC Prompt engineering for question answering
# COMMAND ----------
# TODO
# Come up with a question that you need the LLM assistant to help you with
# A sample question is "Help me find sessions related to XYZ"
# Note: your "XYZ" should be related to the query you passed in Question 3
question = "<FILL_IN>"

# Provide all of the similar documents returned by the query in Question 3
context = <FILL_IN>

# Feel free to be creative in how you construct the prompt. You can use the demo notebook as a reference.
# You can also specify in the text how you want the answers to look.
# Example requirement: "Recommend top-5 relevant sessions for me to attend."
prompt_template = <FILL_IN>
# COMMAND ----------
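# MAGIC %md
# MAGIC A minimal sketch of one way to assemble the pieces, assuming `example_results` from the Question 3 sketch above; the question wording and prompt layout are illustrative choices:
# COMMAND ----------
# Example sketch: tie the question to the Question 3 query and flatten the retrieved documents
example_question = "Help me find sessions related to language models"
example_context = " ".join(example_results["documents"][0])  # documents for the first query text
example_prompt_template = f"""{example_context}

{example_question}

Recommend the top-5 relevant sessions for me to attend."""
# COMMAND ----------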
# Test your answer. DO NOT MODIFY THIS CELL.
dbTestQuestion2_5(question, context, prompt_template)
# COMMAND ----------
# MAGIC %md
# MAGIC ## Question 6
# MAGIC
# MAGIC Submit the prompt to the language model to generate a response.
# MAGIC
# MAGIC Hint: If you run into the error `index out of range in self`, make sure to check out this [documentation page](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline.__call__.handle_long_generation).
# COMMAND ----------
# TODO
lm_response = pipe(<FILL_IN>)
print(lm_response[0]["generated_text"])
# COMMAND ----------
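# MAGIC %md
# MAGIC A minimal sketch of the call, using the illustrative `example_pipe` and `example_prompt_template` from the earlier sketches; the graded call has the same shape with your own variables:
# COMMAND ----------
# Example sketch: generate a completion from the assembled prompt
example_response = example_pipe(example_prompt_template)
print(example_response[0]["generated_text"])
# COMMAND ----------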
# Test your answer. DO NOT MODIFY THIS CELL.
dbTestQuestion2_6(lm_response)
# COMMAND ----------
# MAGIC %md
# MAGIC Notice that the output isn't exactly helpful. Head over to the optional OpenAI section below to try out GPT-3.5 instead!
# COMMAND ----------
# MAGIC %md
# MAGIC ## OPTIONAL (Non-Graded): Use OpenAI models for Q/A
# MAGIC
# MAGIC For this section to work, you need to generate an OpenAI API key.
# MAGIC
# MAGIC Steps:
# MAGIC 1. [Create an account](https://platform.openai.com/signup) on OpenAI.
# MAGIC 2. Generate an OpenAI [API key here](https://platform.openai.com/account/api-keys).
# MAGIC
# MAGIC Note: OpenAI does not have a free option, but it gives you $5 of credit. Once you have exhausted your $5 credit, you will need to add a payment method; you will be [charged per token](https://openai.com/pricing). **IMPORTANT**: Keep your OpenAI API key to yourself. If others have access to your key, they will be able to charge their usage to your account!
# COMMAND ----------
# TODO
import os
os.environ["OPENAI_API_KEY"] = "<FILL IN>"
# COMMAND ----------
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
# COMMAND ----------
# MAGIC %md
# MAGIC If you would like to estimate how much it would cost to use OpenAI, you can use the `tiktoken` library from OpenAI to count the number of tokens in your prompt.
# MAGIC
# MAGIC We will be using `gpt-3.5-turbo` since it is the most economical option ($0.002 / 1K tokens as of May 2023); GPT-4 charges $0.04 / 1K tokens. The code block below is referenced from OpenAI's documentation on ["Managing tokens"](https://platform.openai.com/docs/guides/chat/managing-tokens).
# COMMAND ----------
import tiktoken

price_token = 0.002  # USD per 1,000 tokens for gpt-3.5-turbo (May 2023)
encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
cost_to_run = len(encoder.encode(prompt_template)) / 1000 * price_token
print(f"It would take roughly ${round(cost_to_run, 5)} to run this prompt")
# COMMAND ----------
# MAGIC %md
# MAGIC We won't have to create a new vector database. We can simply send the `context` from above to OpenAI, using the Chat Completion API to interact with `gpt-3.5-turbo`. You can refer to their [documentation here](https://platform.openai.com/docs/guides/chat).
# MAGIC
# MAGIC Interestingly, OpenAI models use the system message to help set the behavior of the assistant. From OpenAI's [docs](https://platform.openai.com/docs/guides/chat/introduction):
# MAGIC
# MAGIC > Future models will be trained to pay stronger attention to system messages. The system message helps set the behavior of the assistant.
# MAGIC
# MAGIC
# COMMAND ----------
# TODO
gpt35_response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": <FILL_IN>},
    ],
    temperature=0,  # 0 makes outputs deterministic; the closer the value is to 1, the more random the outputs on each re-run
)
# COMMAND ----------
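# MAGIC %md
# MAGIC A minimal sketch of the fill-in above: pass the assembled `prompt_template` (which already contains the retrieved context and your question) as the user message. This uses the same pre-1.0 `openai` API as the cell above.
# COMMAND ----------
# Example sketch: send the full prompt as the user message
gpt35_response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt_template},
    ],
    temperature=0,  # deterministic output
)
# COMMAND ----------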
print(gpt35_response.choices[0]["message"]["content"])
# COMMAND ----------
from IPython.display import Markdown
Markdown(gpt35_response.choices[0]["message"]["content"])
# COMMAND ----------
# MAGIC %md
# MAGIC We can also check how many tokens OpenAI used for this request.
# COMMAND ----------
gpt35_response["usage"]["total_tokens"]