Using ChatGPT to generate summaries of all my blog posts

January 11, 2023

I stumbled on a tweet that floored me, especially one of its replies.

The author used ChatGPT to generate a summary of all his posts. And to do that… he asked ChatGPT to write the code -_-. Mind blown, it's so meta.

John: currently getting openai to write summaries of my 170 substack posts. it is doing an OK job!
[…]

Russel: Pretty cool! How did you automate that?

John: I exported my posts out of Substack

The HTML files are sitting in a folder

And I asked ChatGPT to write the code lol

After reading the initial tweet, I could have written the code myself, and so could the author, but he chose to let the AI write something correct.

Well, almost.

Let's reproduce the experiment on my blog. It's Jekyll/markdown, but the idea is the same.

Reproducing the experiment

I asked ChatGPT to generate the code, as John did. I had to refine my prompt to handle markdown, then again to tell it to store the result in a directory, but I was the one who had been a bit vague, and overall ChatGPT does very well. In 3 prompts, I had something convincing.

you are a python software developper.  You can find all my blog posts in the "_posts" directory. Can you write the python code to ask ChatGPT for a summary of this posts, and print the output?

Can you modify the piece of code to use only the markdown files, their file extension is ".md" ?

Now can this piece of code save the output with the same file names in the "output" directory (it may not exist) ?

import openai_secret_manager
import openai
import os

# Use the openai_secret_manager library to easily fetch your API key
secrets = openai_secret_manager.get_secret("openai")
openai.api_key = secrets["api_key"]

# The directory containing the markdown files
input_directory = "_posts"
output_directory = "output"

# Create output directory if it doesn't exist
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

# Iterate through all files in the directory
for filename in os.listdir(input_directory):
    if filename.endswith(".md"):
        filepath = os.path.join(input_directory, filename)
        with open(filepath, "r") as f:
            text = f.read()

        # Generate a summary of the text using the OpenAI API
        response = openai.Completion.create(engine="text-davinci-002", prompt=(f"summarize: {text}"))
        summary = response["choices"][0]["text"]
        print(f"Summary for {filename}: {summary}")

        # Save the summary to a file in the output directory
        output_filepath = os.path.join(output_directory, filename)
        with open(output_filepath, "w") as f:
            f.write(summary)

This code looks correct, but there are two problems.

openai_secret_manager

To make the API call, you need an API key, and ChatGPT suggests using openai_secret_manager. Fine.

The problem is that the openai_secret_manager library does not exist.

$ pip install openai_secret_manager
ERROR: Could not find a version that satisfies the requirement openai_secret_manager (from versions: none)
ERROR: No matching distribution found for openai_secret_manager

It's not a naming mistake; the library is nowhere to be found in the docs, in OpenAI's repositories, on Google, or on PyPI.

I'm not the only one who got fooled.

Yet when you ask ChatGPT how to install it, it persists and suggests the pip install openai_secret_manager command above.
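Before trusting an import suggested by a chatbot, a quick local check can help. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name, and it only tells you whether a module is installed in the current environment, not whether it exists on PyPI.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found locally."""
    return importlib.util.find_spec(module_name) is not None


# Check a standard-library module and the module ChatGPT suggested.
for name in ("os", "openai_secret_manager"):
    status = "found" if is_importable(name) else "not installed"
    print(f"{name}: {status}")
```

A "not installed" result followed by a failing pip install is a strong hint that the library was made up.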

You can solve the problem with openai.api_key = "xy", or better, with dotenv.
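As a rough sketch of the dotenv approach: the helper below reads the key from an environment variable, after loading a `.env` file when python-dotenv is installed. The `get_api_key` name and the `env_file` parameter are assumptions made for this illustration, not part of any library.

```python
import os

try:
    # Provided by the python-dotenv package; optional in this sketch.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def get_api_key(var_name: str = "OPENAI_API_KEY", env_file: str = ".env") -> str:
    """Read the API key from the environment, loading a .env file first if possible."""
    if load_dotenv is not None:
        # Loads KEY=value pairs from env_file; does nothing if the file is missing.
        load_dotenv(dotenv_path=env_file)
    key = os.getenv(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it or add it to {env_file}")
    return key
```

The script can then do `openai.api_key = get_api_key()`, which keeps the secret out of the source file.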

Generating the summary

The key piece of code for generating the summary, as provided by ChatGPT, is this one:

# Generate a summary of the text using the OpenAI API
response = openai.Completion.create(engine="text-davinci-002", prompt=(f"summarize: {text}"))
summary = response["choices"][0]["text"]
print(f"Summary for {filename}: {summary}")

The problem is that it's too simple. If that's all you do, you'll only get the beginning of each answer.

$ python generate-summary.py 
Summary for 2021-05-12-one-makefile-to-rule-them-all.md: 
This person switched to using a Makefile in order to centralize all of
Summary for 2014-12-01-les-calendriers-de-lavent-du-web.md: 

It's December 1st, which means it's the first day of
Summary for 2020-04-21-how-to-remember-your-life.md: 
The author suggests taking pictures, maintaining a yearly journal, and a gratitude journal
Summary for 2022-08-19-batch-resizing-files-with-nautilus.md: 

The text explains how to create a script that will resize images and store

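ChatGPT provides its answers as several "events"; to get the whole answer, you have to concatenate them. The Completion endpoint also caps the answer length with max_tokens, which defaults to 16 tokens when omitted, so that parameter has to be set explicitly as well. I knew this because I had already used the API. Without that knowledge, I probably wouldn't have found the source of the problem.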
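To make the concatenation concrete, here is a minimal sketch that runs without any API call. The events are hand-written stand-ins that mimic the response["choices"][0]["text"] shape used in this post, and `join_event_texts` is a hypothetical helper name.

```python
from typing import Iterable, Mapping


def join_event_texts(events: Iterable[Mapping]) -> str:
    """Concatenate the text fragment carried by each streamed event."""
    parts = []
    for event in events:
        # Each event has the same shape as a regular completion response,
        # but only carries a fragment of the answer.
        parts.append(event["choices"][0]["text"])
    return "".join(parts)


# Hand-written stand-ins for streamed events (not real API output).
fake_events = [
    {"choices": [{"text": "This person switched "}]},
    {"choices": [{"text": "to using a Makefile."}]},
]
print(join_event_texts(fake_events))
```

Reading only the first fragment would give "This person switched ", which is the kind of cut-off output shown above.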

I didn't manage to get ChatGPT to produce a correct refactoring that uses the events:

In the last piece of python code you wrote, can you use openAI events in order to extract the complete chatgpt answer?

The answer is better, but it still only uses the beginning of the response (response["choices"][0]["text"]) and not the concatenation of the events:

#[...]
# Generate a summary of the text using the OpenAI API
response = openai.Completion.create(engine="text-davinci-002", prompt=(f"summarize: {text}"),temperature=0.5,stop=None,log_level='info',log_probs=0,max_tokens=2000,min_confidence=0.5)
summary = response["choices"][0]["text"]
print(f"Summary for {filename}: {summary}")
#[...]

In the end, I rewrote that piece of code myself, based on the documentation.

import time

import openai


def ask_chatgpt(prompt: str) -> tuple[str, float]:
    """
    Send the 'prompt' to the OpenAI completion API.
    Return the answer (streamed as multiple events) and the completion time in seconds.
    """
    start_time = time.time()

    # request a streamed completion
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=200,
        temperature=0,
        stream=True,  # this time, we set stream=True
    )

    completion_text = ""
    # iterate through the stream of events
    for event in response:
        completion_text += event["choices"][0]["text"]  # append the event's text

    # total time, measured once the last event has been received
    return completion_text, time.time() - start_time

After these changes, and using the following prompt

prompt = f"Write a short two sentence summary of this text:\n{text}"

…I got fairly convincing summaries in both of the languages I have been writing in since 2011.

Summary for 2015-01-24-faire-une-video-timelapse-dun-ecran-sous-ubuntu.md (3.06s): 

Ce script permet de créer une vidéo timelapse d'un écran sous Ubuntu en récupérant les packages nécessaires, en écrivant un script shell et en assemblant les images avec mencoder.

Summary for 2023-01-07-utiliser-just-plutot-que-make.md (1.80s):  

Utiliser Just plutôt que Make est une solution élégante pour gérer des commandes et des dépendances, avec une syntaxe proche de Make et des avantages supplémentaires.

Summary for 2021-05-21-batch-resizing-images.md (1.82s): 

This text provides a one-liner command to resize all .jpg files in a directory and store the result in the `small` directory, as well as other options such as creating a cheatsheet and using the tldr command.

In summary:

ChatGPT is very good at generating a first draft of a moderately complex piece of code that mixes several APIs. It's great for kick-starting an automation script, but this really isn't insurmountable code, and you should expect quite a bit of tweaking.
Beware, it can talk nonsense and persist in its error. It took me a while to understand that openai_secret_manager wasn't the idiomatic library but something that doesn't exist.
You'd better have solid domain knowledge, otherwise you'll let things slip through.

In the end, I still saved time; ChatGPT's processing time isn't a problem, you just let the export run in the background.

And as John pointed out, without a starting point I probably wouldn't have bothered.

Below is the final version of the code (pip install python-dotenv "openai<1", since openai.Completion was removed in version 1.0 of the package). One could imagine a more general-purpose ETL, but that's for another time.

import openai
import os
import time
from dotenv import load_dotenv
from typing import Optional


def ask_chatgpt(prompt: str) -> tuple[str, float]:
    """
    Send the 'prompt' to the OpenAI completion API.
    Return the answer (streamed as multiple events) and the completion time in seconds.
    """
    start_time = time.time()

    # request a streamed completion
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=200,
        temperature=0,
        stream=True,  # this time, we set stream=True
    )

    completion_text = ""
    # iterate through the stream of events
    for event in response:
        completion_text += event["choices"][0]["text"]  # append the event's text

    # total time, measured once the last event has been received
    return completion_text, time.time() - start_time


def generate_summaries(
    input_directory: str, output_directory: str, count: Optional[int] = 4
):
    """
    Take all the files from the input_directory, and write their summary in the
    output_directory. Stop after 'count' summaries (None means no limit).
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_directory, exist_ok=True)
    # Iterate through all files in the directory
    for filename in os.listdir(input_directory):
        output_filepath = os.path.join(output_directory, filename)
        if count == 0:
            break
        # Generate only for markdown files that have
        # not been handled already
        if filename.endswith(".md") and not os.path.exists(output_filepath):
            # Read blog post
            filepath = os.path.join(input_directory, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                text = f.read()
            # Skip long posts: the model has a limited context window
            # (4097 tokens for text-davinci-003, prompt and answer combined);
            # the character count is only a rough proxy for the token count
            if len(text) > 4097:
                continue

            # Generate a summary of the text using the OpenAI API
            prompt = f"Write a short two sentence summary of this text:\n{text}"
            summary, duration = ask_chatgpt(prompt)
            print(f"Summary for {filename} ({duration:.2f}s): {summary}")

            # Save the summary to a file in the output directory
            with open(output_filepath, "w", encoding="utf-8") as f:
                f.write(summary)

            # Decrease the remaining number of files we want to have to deal with
            if count is not None:
                count -= 1


if __name__ == "__main__":
    load_dotenv()
    # Load the API key from a dotfile
    openai.api_key = os.getenv("OPENAI_API_KEY")

    # The directory containing the markdown files
    input_directory = "_posts"
    # Where we want to store our summaries
    output_directory = "summaries"
    generate_summaries(input_directory, output_directory, None)
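The script above targets the pre-1.0 openai package. For readers on a 1.x version, the streaming loop could look roughly like the sketch below. It is an adaptation under assumptions, not the code I ran: `ask_completion` is a hypothetical name, `client` is assumed to be an `openai.OpenAI()` instance, and the default model name `gpt-3.5-turbo-instruct` is an assumption you should check against the models available to your account.

```python
import time


def ask_completion(
    client, prompt: str, model: str = "gpt-3.5-turbo-instruct"
) -> tuple[str, float]:
    """
    Stream a completion through a 1.x-style client and return (text, seconds).
    `client` is assumed to be an `openai.OpenAI()` instance.
    """
    start_time = time.time()
    stream = client.completions.create(
        model=model,  # assumed default, adjust to a model you can access
        prompt=prompt,
        max_tokens=200,
        temperature=0,
        stream=True,
    )
    completion_text = ""
    for chunk in stream:
        # In the 1.x client, chunks are objects rather than dictionaries.
        if chunk.choices:
            completion_text += chunk.choices[0].text
    return completion_text, time.time() - start_time
```

With a real client, the call would be `ask_completion(OpenAI(), prompt)`, the client reading OPENAI_API_KEY from the environment by default.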