Evaluating Generative AI with LLMs (LLM as a Judge)

2024.02.01

The previous post covered how to evaluate whether RAG retrieval finds the right passages. This post turns to the generated text and considers how to evaluate it.

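Writer: 荒牧 大樹: Joined Net One Systems in 2007. After working on collaboration and cloud products, he now promotes AI and data analytics products and technologies. Recently he has been following next-generation computing platforms such as GPUs, FPGAs, and quantum computers.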

About LLM-based evaluation

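Building an LLM system involves many design choices. For RAG, these include the retrieval method, the chunk size, the number of documents to include, and which LLM to use. If you prepare an ideal answer for each question (prompt) and can compare the generated text against it quantitatively, you can pick the best combination. In natural language processing, metrics such as ROUGE and BLEU have traditionally been used, but they only compare two texts and produce a score. LLM as a Judge, which uses an LLM as the evaluator, was proposed as a way to produce metrics such as correctness and the relationship between the generated text and the reference answer.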

https://arxiv.org/abs/2306.05685
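To make the limitation of overlap-based metrics concrete, here is a minimal sketch of a ROUGE-1 style F1 score. It assumes simple whitespace tokenization (real ROUGE implementations add stemming and other normalization, and Japanese text would need a proper tokenizer), and the three sentences are made-up examples rather than data from this post.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences (hypothetical, not taken from the evaluation data).
reference = "the switch supports vlan tagging"
wrong_but_similar = "the switch supports no vlan tagging"
right_but_reworded = "vlan tags are handled by this device"

score_wrong = rouge1_f1(wrong_but_similar, reference)
score_right = rouge1_f1(right_but_reworded, reference)
print(f"contradicting answer: {score_wrong:.3f}")
print(f"reworded answer:      {score_right:.3f}")
```

The contradicting sentence scores far higher than the reworded one because it shares more words with the reference, which is the kind of gap an LLM-based judge is meant to close.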

LLM as a Judge on Databricks

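MLflow's Evaluate now includes LLM as a Judge functionality. Based on various experiments, a prompt containing grading criteria and graded examples is built and sent to an LLM, which performs the evaluation. Results can be viewed on the Experiments page, and the results for each set of conditions can be recorded as a run.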

Building and evaluating RAG

Generating text with LangChain on Databricks

This time, a RAG system is built by combining the Databricks vector store with LangChain. For embeddings, the Azure OpenAI ada-002 deployment created in the previous post is used.

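from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatDatabricks

# Prompt template: {context} receives the retrieved chunks, {question} the user query.
TEMPLATE = """You are an IT systems engineer answering questions about IT infrastructure management. If a question is not related to one of these topics, please decline to answer. If you do not know the answer, say that you do not know instead of making one up. Avoid answering in bullet points.

Use the following context to answer the question at the end.
{context}
Question: {question}
"""

prompt = PromptTemplate(template=TEMPLATE, input_variables=["context", "question"])

# chat_model and get_retriever() are defined elsewhere (not shown).
chain = RetrievalQA.from_chain_type(
    llm=chat_model,
    chain_type="stuff",  # put all retrieved documents into a single prompt
    retriever=get_retriever(),
    chain_type_kwargs={"prompt": prompt},
)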

Evaluating the RAG system with NET ONE BLOG data

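With mlflow.evaluate, you can run evaluations that use generative AI.
As in the previous post, NET ONE BLOG data is used, and GPT-4 scores three metrics: Correctness, Relevance, and Similarity. ROUGE is also computed when requested, so it is used as an additional metric. The comparison varies the chunk size and the number of retrieved documents. The chunk size takes one of three values: 512, 1024, or 2048. The number of documents is capped so that the total does not exceed 2048: at most 4 documents at chunk size 512, at most 2 at 1024, and at most 1 at 2048. The code below runs the evaluation while these conditions are varied.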

from mlflow.models import infer_signature
import mlflow
import langchain
from mlflow import MlflowClient

# Register models in Unity Catalog.
mlflow.set_registry_uri("databricks-uc")
model_name = "nos_data_llm.rag.nogblog_langchain"

# run_name, question, answer, eval_data and the answer_*_metric objects
# are defined elsewhere (not shown).
with mlflow.start_run(run_name=run_name) as run:
    signature = infer_signature(question, answer)

    # Log the chain so that mlflow.evaluate can load it by URI.
    model_info = mlflow.langchain.log_model(
        chain,
        loader_fn=get_retriever,
        artifact_path="chain",
        registered_model_name=model_name,
        pip_requirements=[
            "mlflow==2.9.2",
            "langchain==" + langchain.__version__,
            "databricks-vectorsearch",
        ],
        input_example=question,
    )

    # Evaluate against the ground-truth answers with built-in and LLM-judged metrics.
    evaluate_results = mlflow.evaluate(
        model=model_info.model_uri,
        data=eval_data.toPandas(),
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[
            mlflow.metrics.latency(),
            mlflow.metrics.rouge1(),
            mlflow.metrics.rougeL(),
            answer_relevance_metric,
            answer_correctness_metric,
            answer_similarity_metric,
        ],
        evaluator_config={
            "col_mapping": {
                "inputs": "query",
            }
        },
    )
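The comparison conditions described earlier (three chunk sizes, with the document count capped by a total of 2048) can be enumerated programmatically. The sketch below is only an illustration: the names `CONTEXT_BUDGET`, `max_documents`, `build_grid` and the run-name format are hypothetical, and it assumes every document count from 1 up to the cap is tried, which the post does not state.

```python
CONTEXT_BUDGET = 2048           # upper bound on chunk_size * num_docs, from the text
CHUNK_SIZES = (512, 1024, 2048)


def max_documents(chunk_size: int, budget: int = CONTEXT_BUDGET) -> int:
    """Largest number of chunks of this size that fits within the budget."""
    return budget // chunk_size


def build_grid(chunk_sizes=CHUNK_SIZES, budget=CONTEXT_BUDGET):
    """List every (chunk_size, num_docs) pair that stays within the budget."""
    grid = []
    for chunk_size in chunk_sizes:
        for num_docs in range(1, max_documents(chunk_size, budget) + 1):
            grid.append(
                {
                    "chunk_size": chunk_size,
                    "num_docs": num_docs,
                    # Hypothetical naming scheme for the MLflow run.
                    "run_name": f"chunk{chunk_size}_k{num_docs}",
                }
            )
    return grid


grid = build_grid()
for condition in grid:
    print(condition)
```

Each entry could then drive one pass through the logging and evaluation code above, so that every condition ends up as its own run.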

Evaluation results with LLM as a Judge

The results for each condition were recorded in the experiment.

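The results show a large difference in Correctness, and a chunk size of 512 with 4 documents appears to work best. However, the detailed results view shows quite a few NaN (empty) entries. Azure OpenAI, which was used here, was returning errors: the GPT-4 limit on API calls within a given time window had been reached, and rows that could not be evaluated as a result appear to have been recorded as empty.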
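Because unevaluated rows are recorded as empty, it is worth checking how many rows each average is actually based on before comparing runs. The following is a minimal sketch that assumes the per-row scores are available as a list of floats; `summarize_scores` is a hypothetical helper, the sample values are made up for illustration, and the sketch says nothing about how MLflow itself aggregates rows with missing scores.

```python
import math


def summarize_scores(scores):
    """Report the mean over scored rows together with how many rows are missing."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    missing = len(scores) - len(valid)
    mean = sum(valid) / len(valid) if valid else float("nan")
    coverage = len(valid) / len(scores) if scores else 0.0
    return {"mean": mean, "scored": len(valid), "missing": missing, "coverage": coverage}


# Made-up per-row scores; NaN marks a row the judge could not evaluate.
example_scores = [5.0, float("nan"), 4.0, float("nan"), 3.0]
summary = summarize_scores(example_scores)
print(summary)
```

A run whose mean rests on only part of the rows is hard to compare with a run that was fully scored, which is why removing the empty entries matters.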

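Looking at the default values in the MLflow code below, the default of 10 workers seems to be the cause. In addition, max tokens defaults to 200, but the reference answers are about 400 characters long, so max tokens should also be raised to 400.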

https://github.com/mlflow/mlflow/blob/v2.9.2/mlflow/metrics/genai/genai_metric.py
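To see why the worker count matters for a rate-limited API, the sketch below simulates judge calls with a thread pool and measures how many are in flight at once. It is an illustration only: `fake_judge_call` just sleeps in place of a real request, `peak_concurrency` is a hypothetical helper, and it assumes the workers are threads drawn from a pool of the configured size.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(num_calls: int, max_workers: int, call_seconds: float = 0.02) -> int:
    """Run num_calls simulated requests and return the highest number in flight."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_judge_call(_):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(call_seconds)  # stand-in for the network round trip
        with lock:
            state["active"] -= 1

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so every call has finished before returning.
        list(pool.map(fake_judge_call, range(num_calls)))
    return state["peak"]


peak_default = peak_concurrency(num_calls=20, max_workers=10)
peak_reduced = peak_concurrency(num_calls=20, max_workers=2)
print("peak in-flight calls with 10 workers:", peak_default)
print("peak in-flight calls with 2 workers:", peak_reduced)
```

Fewer workers means fewer simultaneous requests against the quota, at the cost of a longer total run.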

Creating a custom metric

To adjust the number of workers and max tokens, the metric is created as a custom metric. It can be written by referring to the original prompt below.

mlflow/mlflow/metrics/genai/prompts/v1.py at v2.9.2 · mlflow/mlflow · GitHub

A custom metric can be created with code like the following. The examples are omitted.

from mlflow.metrics.genai import make_genai_metric

# custom_correctness_example_score_2 / _score_4 are graded examples defined elsewhere (omitted).
custom_correctness = make_genai_metric(
    name="custom_correctness",
    definition=(
        "Answer correctness is evaluated on the accuracy of the provided output based on the "
        "provided targets, which is the ground truth. Scores can be assigned based on the degree "
        "of semantic similarity and factual correctness of the provided output to the provided "
        "targets, where a higher score indicates higher degree of accuracy."
    ),
    grading_prompt=(
        "Answer Correctness: Below are the details for different scores:\n"
        "- Score 1: The output is completely incorrect. It is completely different from or "
        "contradicts the provided targets.\n"
        "- Score 2: The output demonstrates some degree of semantic similarity and includes "
        "partially correct information. However, the output still has significant discrepancies "
        "with the provided targets or inaccuracies.\n"
        "- Score 3: The output addresses a couple of aspects of the input accurately, aligning "
        "with the provided targets. However, there are still omissions or minor inaccuracies.\n"
        "- Score 4: The output is mostly correct. It provides mostly accurate information, but "
        "there may be one or more minor omissions or inaccuracies.\n"
        "- Score 5: The output is correct. It demonstrates a high degree of accuracy and "
        "semantic similarity to the targets."
    ),
    examples=[custom_correctness_example_score_2, custom_correctness_example_score_4],
    model="openai:/gpt-4",
    grading_context_columns=["targets"],
    parameters={"max_tokens": 400},  # raised from the default of 200
    aggregations=["mean", "variance"],
    greater_is_better=True,
    max_workers=2,  # lowered from the default of 10 to stay under the API rate limit
)

Looking at it again, the prompt for LLM as a Judge turns out to be quite long.
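The length comes from the fact that every judged row carries the metric definition, the full rubric, and the graded examples in addition to the row itself. The sketch below shows one way such a prompt could be assembled. The layout is hypothetical and is not MLflow's actual template; `build_judge_prompt` and the sample example are illustrative names and values.

```python
def build_judge_prompt(definition, grading_prompt, examples, question, output, target):
    """Assemble a judge prompt from a definition, a rubric, graded examples and one row."""
    parts = [
        "Task: score the model output from 1 to 5 and justify the score.",
        f"Metric definition:\n{definition}",
        f"Grading rubric:\n{grading_prompt}",
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\n"
            f"Input: {ex['input']}\n"
            f"Output: {ex['output']}\n"
            f"Target: {ex['target']}\n"
            f"Score: {ex['score']}\n"
            f"Justification: {ex['justification']}"
        )
    # The row to be judged comes last, leaving the score for the LLM to fill in.
    parts.append(f"Input: {question}\nOutput: {output}\nTarget: {target}\nScore:")
    return "\n\n".join(parts)


DEFINITION = "Answer correctness is evaluated on the accuracy of the provided output based on the provided targets."
RUBRIC = "- Score 1: The output is completely incorrect.\n- Score 5: The output is correct."
# Hypothetical graded example; the real examples are omitted in this post.
EXAMPLES = [
    {
        "input": "What is a VLAN?",
        "output": "A kind of cable.",
        "target": "A logical partition of a layer-2 network.",
        "score": 1,
        "justification": "The output contradicts the target.",
    }
]

prompt = build_judge_prompt(DEFINITION, RUBRIC, EXAMPLES, "What is RAG?", "<model answer>", "<reference answer>")
prompt_without_examples = build_judge_prompt(DEFINITION, RUBRIC, [], "What is RAG?", "<model answer>", "<reference answer>")
print(len(prompt), len(prompt_without_examples))
```

Every graded example adds to the prompt sent for each evaluated row, so the examples also add to the token usage of the judge model.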

Evaluation with the custom metrics

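This time, results from GPT-3.5 and GPT-4 without RAG are also included. With RAG, all three metrics improve over plain GPT-3.5 and GPT-4. The best values came from chunk size 512 with 4 documents and chunk size 1024 with 2 documents.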

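ROUGE-L and ROUGE-1, which are commonly used in natural language processing, are also computed, and by those metrics a chunk size of 2048 gave the best result. The ranking of the top three configurations changes depending on the metric.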

The detailed view shows that the empty entries are gone. With the number of workers set to 2, however, the evaluation took roughly 2 to 3 times as long.

Impressions

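Using the LLM as a Judge functionality, I checked how the evaluation changes under different conditions. It is often said that RAG systems produce better results when the chunk size is small and several passages are supplied, and the same tendency was observed here. The results also confirmed that, when the correct passage exists in the retrieval target, RAG can generate better text than GPT-4 or GPT-3.5 alone.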

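Note: The content of this article reflects the author's personal views and does not represent the views of the organization the author belongs to.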
