Text Generation (KerasNLP)

Large language models are hugely popular. At their core, these models predict the next word or token in a sentence, a task commonly referred to as causal language modeling (causal LM) pretraining. LLMs are complex to build and expensive to train from scratch; fortunately, pretrained LLMs are available. KerasNLP provides a large number of pretrained checkpoints, so you can experiment with SOTA models without training them yourself. For example, you can load a GPT-2 model by calling GPT2CausalLM's from_preset method. Beyond GPT-2, many other pretrained models are available, such as OPT, ALBERT, and RoBERTa.

from keras_nlp.models import GPT2CausalLM, GPT2CausalLMPreprocessor

# Load the preprocessor and model from a pretrained checkpoint.
preprocessor = GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
model = GPT2CausalLM.from_preset("gpt2_base_en", preprocessor=preprocessor)

# Fine-tune on a dataset of your choice, e.g. CNN/DailyMail articles.
model.compile(...)
model.fit(cnn_dailymail_dataset)

# Generate text from a prompt.
model.generate("Snowfall in Buffalo", max_length=40)
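The same from_preset pattern applies to the other architectures. As a rough sketch (the opt_125m_en preset name is an assumption and may differ across KerasNLP versions):

from keras_nlp.models import OPTCausalLM

# Hypothetical preset name; check the published presets for your version.
opt_lm = OPTCausalLM.from_preset("opt_125m_en")
opt_lm.generate("Snowfall in Buffalo", max_length=40)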

Now you can call the generate method to produce text. The quality is decent, but we can improve it through fine-tuning. Before fine-tuning, though, let's look at the overall architecture. Like the BERT classifier we discussed last time, the GPT2CausalLM model also has a preprocessor, a tokenizer, and a backbone, all of which are loaded by the simple from_preset call. For fine-tuning we will use the Reddit TIFU dataset, so that the output follows Reddit's writing style.
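As a quick illustration of that structure, the sub-components can be inspected directly once the model is loaded (a minimal sketch using the task model's public backbone and preprocessor attributes):

# Inspect the pieces that from_preset wires together.
print(model.preprocessor)            # GPT2CausalLMPreprocessor
print(model.preprocessor.tokenizer)  # GPT2Tokenizer (byte-pair encoding)
print(model.backbone)                # GPT2Backbone (the transformer itself)
model.backbone.summary()             # layer-by-layer view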

Here is an example of the training data:

import os

# Set the backend before importing keras.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_nlp
import keras
import tensorflow as tf
import time
import tensorflow_datasets as tfds

keras.mixed_precision.set_global_policy("mixed_float16")

# Load the pretrained GPT-2 model and its preprocessor.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break

# Next-word prediction only needs the document feature.
train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes long; take 500 batches and
# run 1 epoch for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

start = time.time()

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

# 500/500 ━━━━━━━━━━━━━━━━━━━━ 75s 120ms/step - accuracy: 0.3189 - loss: 3.3653
# so i go to the
# TOTAL TIME ELAPSED: 21.13s

Since we are doing next-word prediction in a language model, we only need the document feature here. Next we define a custom learning-rate schedule and start fine-tuning with the fit method, which takes quite a lot of time and GPU memory. Once it finishes, though, the generated text is much closer to Reddit's writing style, and the generated length is also closer to the length we preset in training. Another thing you can do is convert the model to TensorFlow Lite and run it on an Android device. KerasNLP provides several sampling methods, such as greedy search, top-k sampling, and beam search. When compiling the model, you can set the sampler simply by passing a string identifier; by default, the GPT-2 model uses top-k sampling. Alternatively, you can pass in a sampler instance.

# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself.
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# GPT-2 output:
# I like basketball, and this is a pretty good one.
# so i was playing basketball at my local high school, and i was playing with my friends.
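If you need more control than the string identifier gives, the sampler classes take parameters. A minimal sketch (assuming the k and temperature arguments of keras_nlp.samplers.TopKSampler, which may vary by version):

# Top-k sampling with a smaller k and a lower temperature.
top_k_sampler = keras_nlp.samplers.TopKSampler(k=5, temperature=0.7)
gpt2_lm.compile(sampler=top_k_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print(output)

For the TensorFlow Lite path mentioned above, the standard Keras conversion API is the starting point. This is only a sketch: it assumes the TensorFlow backend, and a production export typically wraps the generation loop in a tf.function with fixed input shapes, as the official on-device examples do.

# Sketch only: convert the model graph with TF select ops enabled.
converter = tf.lite.TFLiteConverter.from_keras_model(gpt2_lm)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # standard TFLite ops
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops the LM needs
]
tflite_model = converter.convert()
with open("gpt2.tflite", "wb") as f:
    f.write(tflite_model)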

We can also fine-tune GPT-2 on a Chinese dataset: let's fine-tune it on a Chinese poetry dataset to teach our model to be a poet! Because GPT-2 uses a byte-pair encoder and its original pretraining data contains some Chinese characters, we can reuse the original vocabulary to fine-tune on a Chinese dataset. We load the text from JSON files, using only the 全唐诗 (Complete Tang Poems) collection for demonstration:

# Load the chinese poetry dataset.
!git clone https://github.com/chinese-poetry/chinese-poetry.git
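Before parsing the JSON, you can sanity-check the byte-pair claim above: the tokenizer falls back to byte-level tokens for characters without a dedicated vocabulary entry, so Chinese text round-trips through the original vocabulary (a minimal sketch):

tokenizer = gpt2_lm.preprocessor.tokenizer
token_ids = tokenizer("昨夜雨疏风骤")
print(token_ids)                        # byte-level token ids
print(tokenizer.detokenize(token_ids))  # recovers the original characters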
import os
import json

# Collect every poet JSON file from the 全唐诗 folder.
poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = "%s/%s" % ("chinese-poetry/全唐诗", file)
    with open(full_filename, "r") as f:
        content = json.load(f)
    poem_collection.extend(content)

paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]

# As with the Reddit example, convert to a TF dataset and use only part
# of the data for training.
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes long; take `500` batches and
# run 1 epoch for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)

# 昨夜雨疏风骤,爲臨江山院短靜。石淡山陵長爲羣,臨石山非處臨羣。美陪河埃聲爲羣,漏漏漏邊陵塘
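generate also accepts a batch of prompts, which is convenient for sampling several poems at once (a minimal sketch; the second prompt is simply another famous opening line):

prompts = ["昨夜雨疏风骤", "床前明月光"]
outputs = gpt2_lm.generate(prompts, max_length=200)
for poem in outputs:
    print(poem)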