
WIP: Prospective demo #22

Open · wants to merge 13 commits into main from prospective_demo
Conversation


@picaultj picaultj commented Feb 3, 2025

TODO

  • Bug (Tab 3, analysis): mismatch of topic id between title+summary and topic evolution+analysis
  • Tab 2 (model configuration): parametrize the LLM-based analysis (choice of criterion; stored in a config file but not yet used)
  • Harmonize the storage location of a user's config files (feeds, models)
  • Tab 3 (analysis): in the info about weak and strong signals, add links to the document sources (URLs) to better explore the results
  • Tab 3 (analysis): split the LLM information into tables for better display
  • Tab 4 (report generation): generate a newsletter based on a selection of topics; add the possibility to edit/download/send by mail/subscribe
  • Bug?: choice of reference_timestamp in reference_new_data
  • Add a feature to replay the past in addition to live data (useful for regenerating models from historical data)
  • Intermittent bug in process_new_data with certain data when saving the weak/strong DataFrames to parquet: ArrowTypeError: ("Expected bytes, got a 'tuple' object", 'Conversion failed for column Documents with type object')
  • Refactor the LLM call for detailed analysis: format the output as JSON to avoid passing the HTML template for formatting
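The ArrowTypeError above typically occurs because pyarrow cannot infer a single schema for an object column that mixes tuples with strings or lists. A minimal workaround sketch (the `make_parquet_safe` helper is hypothetical; only the `Documents` column name comes from the error message) is to normalize the column before saving:

```python
import pandas as pd

def make_parquet_safe(df: pd.DataFrame, column: str = "Documents") -> pd.DataFrame:
    """Normalize an object column so pyarrow can infer a uniform type.

    pyarrow raises ArrowTypeError ("Expected bytes, got a 'tuple' object")
    when a column mixes tuples with other value types; converting every
    tuple to a list gives the column a parquet-friendly shape.
    """
    out = df.copy()
    out[column] = out[column].map(
        lambda v: list(v) if isinstance(v, tuple) else v
    )
    return out
```

The copy keeps the original DataFrame untouched, so the fix can be applied only at serialization time (e.g. `make_parquet_safe(df_weak).to_parquet(path)`).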

@picaultj force-pushed the prospective_demo branch 4 times, most recently from a421abe to 2ffd01e on February 7, 2025 at 09:38
@picaultj (Collaborator, Author) commented:

    # FIXME/TODO
    #   - self.topic_models should not be an attribute of BERTrend -- too much memory consumption after a few iterations
    #   - what we did so far:
    #       * create topic models for each period, store them in self.topic_models
    #       * merge the data after preprocessing of each model
    #   - instead, modify the functions as follows:
    #       * train_topic_models: do a combined operation
    #           no need to store anything other than the last topic model (at least temporarily)
    #           for each period:
    #               combine the operations of training the new topic model and merging
    #               optionally store the newly created model (as a BERTopic serialization, using the function "save_topic_model")
    #               merge the new one with the previous data
    #           that way: no need to store BERTopic models inside the BERTrend instance (memory saving)
    #                     we can serialize BERTrend objects simply as a .dill and restore them the same way
    #   - in the demo, modify how the different states of BERTrend (timestamps) are checked: use the BERTrend object's attributes
    #       (e.g. keys of self.doc_groups)
    #       instead of looking on disk for available topic models
