seanpedrickcase committed
Commit 55f0ce3
1 Parent(s): 04a15c5

Can split passages into sentences. Improved embedding and LLM representation models; improved zero-shot capabilities.

.dockerignore ADDED
@@ -0,0 +1,24 @@
+ *.pyc
+ *.ipynb
+ *.zip
+ *.npz
+ *.csv
+ *.xlsx
+ *.xls
+ *.pkl
+ *.parquet
+ *.png
+ *.safetensors
+ *.json
+ *.html
+ *.log
+ *.spec
+ *.bin
+ .ipynb_checkpoints/*
+ old_code/*
+ model/*
+ output_model/*
+ data/*
+ build_deps/*
+ dist/*
+ build/*
.gitignore CHANGED
@@ -1,5 +1,6 @@
  *.pyc
  *.ipynb
+ *.zip
  *.npz
  *.csv
  *.xlsx
@@ -12,6 +13,7 @@
  *.html
  *.log
  *.spec
+ *.bin
  .ipynb_checkpoints/*
  old_code/*
  model/*
README.md CHANGED
@@ -14,8 +14,8 @@ license: apache-2.0
  
  Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
  
- Uses fast TF-IDF-based embeddings by default, which are fast but not very performant in terms of cluster. Change to [Mixedbread large v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions, 8 bit quantisation) on the options page for topics of much higher quality, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
+ Uses fast TF-IDF-based embeddings by default, which are fast but do not produce high-quality clusters. Change to the higher-quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results at the cost of slower processing. If you have an embeddings .npz file previously made using this model, you can load it in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload it as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as the maximum number of topics allowed and the minimum number of documents per topic. Topic representation with LLMs is currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
  
  For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
  
- I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
+ I suggest the [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool; choose the passages.parquet file for download.
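For readers new to the zero-shot option mentioned above, a minimal sketch of the kind of candidate-topics csv the app expects (the header and topic names here are illustrative, not taken from the app):

```python
# Illustrative only: the app reads zero-shot topics from the first column
# of an uploaded csv; the header name and topics below are made up.
import pandas as pd

candidate_topics = pd.DataFrame({"topic": ["economy", "healthcare", "transport"]})
candidate_topics.to_csv("candidate_topics.csv", index=False)
```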
app.py CHANGED
@@ -6,10 +6,14 @@ import gradio as gr
  import pandas as pd
  import numpy as np
  
- from funcs.topic_core_funcs import pre_clean, extract_topics, reduce_outliers, represent_topics, visualise_topics, save_as_pytorch_model
- from funcs.helper_functions import initial_file_load, custom_regex_load
+ from funcs.topic_core_funcs import pre_clean, optimise_zero_shot, extract_topics, reduce_outliers, represent_topics, visualise_topics, save_as_pytorch_model, change_default_vis_col
+ from funcs.helper_functions import initial_file_load, custom_regex_load, ensure_output_folder_exists, output_folder, get_connection_params
  from sklearn.feature_extraction.text import CountVectorizer
  
+ min_word_occurence_slider_default = 0.01
+ max_word_occurence_slider_default = 0.95
+
+ ensure_output_folder_exists()
  
  # Gradio app
  
@@ -17,6 +21,7 @@ block = gr.Blocks(theme = gr.themes.Base())
  
  with block:
  
+     original_data_state = gr.State(pd.DataFrame())
      data_state = gr.State(pd.DataFrame())
      embeddings_state = gr.State(np.array([]))
      embeddings_type_state = gr.State("")
@@ -26,18 +31,20 @@ with block:
      docs_state = gr.State()
      data_file_name_no_ext_state = gr.State()
      label_list_state = gr.State(pd.DataFrame())
-     vectoriser_state = gr.State(CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=0.1, max_df=0.95))
+     vectoriser_state = gr.State(CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=min_word_occurence_slider_default, max_df=max_word_occurence_slider_default))
+
+     session_hash_state = gr.State("")
  
      gr.Markdown(
      """
      # Topic modeller
      Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!
  
-     Uses fast TF-IDF-based embeddings by default, which are fast but not very performant in terms of cluster. Change to [Mixedbread large v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions, 8 bit quantisation) on the options page for topics of much higher quality, but slower processing time. If you have an embeddings .npz file previously made using this model, you can load this in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available under the 'Options' tab. Topic representation with LLMs currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
+     Uses fast TF-IDF-based embeddings by default, which are fast but do not produce high-quality clusters. Change to the higher-quality [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model embeddings (512 dimensions) for better results at the cost of slower processing. If you have an embeddings .npz file previously made using this model, you can load it in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload it as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as the maximum number of topics allowed and the minimum number of documents per topic. Topic representation with LLMs is currently based on [Phi-3-mini-128k-instruct-GGUF](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
  
      For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
  
-     I suggest [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool here, choose passages.parquet.
+     I suggest the [Wikipedia mini dataset](https://huggingface.co/datasets/rag-datasets/mini_wikipedia/tree/main/data) for testing the tool; choose the passages.parquet file for download.
      """)
  
      with gr.Tab("Load files and find topics"):
@@ -48,23 +55,34 @@ with block:
  
          with gr.Accordion("Clean data", open = False):
              with gr.Row():
-                 clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Clean data - remove html, numbers with > 1 digits, emails, postcodes (UK), custom regex.")
-                 drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 50 char strings. May make old embedding files incompatible due to differing lengths.")
-                 anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective. This is slow!")
-                 split_sentence_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Split open text into sentences. Useful for small datasets.")
+                 clean_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove html, > 1 digit nums, emails, postcodes (UK).")
+                 drop_duplicate_text = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Remove duplicate text, drop < 50 character strings.")
+                 anonymise_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Anonymise data on file load. Personal details are redacted - not 100% effective and slow!")
+                 split_sentence_drop = gr.Dropdown(value = "No", choices=["Yes", "No"], multiselect=False, label="Split text into sentences. Useful for small datasets.")
              with gr.Row():
-                 custom_regex = gr.UploadButton(label="Import custom regex file", file_count="multiple")
-                 gr.Markdown("""Import custom regex - csv table with one column of regex patterns with no header. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
+                 custom_regex = gr.UploadButton(label="Import custom regex removal file", file_count="multiple")
+                 gr.Markdown("""Import custom regex - csv table with one column of regex patterns with no header. Strings matching this pattern will be removed. Example pattern: (?i)roosevelt for case insensitive removal of this term.""")
                  custom_regex_text = gr.Textbox(label="Custom regex load status")
              clean_btn = gr.Button("Clean data")
  
          with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
              candidate_topics = gr.File(label="Input topics from file (csv). File should have at least one column with a header and topic keywords in cells below. Topics will be taken from the first column of the file. Currently not compatible with low-resource embeddings.")
-             zero_shot_similarity = gr.Slider(minimum = 0.5, maximum = 1, value = 0.65, step = 0.001, label = "Minimum similarity value for document to be assigned to zero-shot topic.")
+
+             with gr.Row():
+                 zero_shot_similarity = gr.Slider(minimum = 0.2, maximum = 1, value = 0.55, step = 0.001, label = "Minimum similarity value for document to be assigned to zero-shot topic. You may need to set this very low to get documents assigned to your topics!", scale=2)
+                 zero_shot_optimiser_btn = gr.Button("Optimise settings to keep only zero-shot topics", scale=1)
  
          with gr.Row():
-             min_docs_slider = gr.Slider(minimum = 2, maximum = 1000, value = 5, step = 1, label = "Minimum number of similar documents needed to make a topic.")
-             max_topics_slider = gr.Slider(minimum = 2, maximum = 500, value = 50, step = 1, label = "Maximum number of topics")
+             with gr.Accordion("Topic modelling settings - change documents per topic, max topics, frequency of terms", open = False):
+
+                 with gr.Row():
+                     min_docs_slider = gr.Slider(minimum = 2, maximum = 1000, value = 3, step = 1, label = "Minimum number of similar documents needed to make a topic.")
+                     max_topics_slider = gr.Slider(minimum = 2, maximum = 500, value = 100, step = 1, label = "Maximum number of topics")
+                 with gr.Row():
+                     min_word_occurence_slider = gr.Slider(minimum = 0.001, maximum = 0.9, value = min_word_occurence_slider_default, step = 0.001, label = "Keep terms that appear in this minimum proportion of documents. Avoids creating topics with very uncommon words.")
+                     max_word_occurence_slider = gr.Slider(minimum = 0.1, maximum = 1.0, value = max_word_occurence_slider_default, step = 0.01, label = "Keep terms that appear in less than this maximum proportion of documents. Avoids very common words in topic names.")
+
+             quality_mode_drop = gr.Dropdown(label = "Use high-quality transformers-based embeddings (slower)", value="No", choices=["Yes", "No"])
  
          with gr.Row():
              topics_btn = gr.Button("Extract topics", variant="primary")
@@ -78,12 +96,12 @@ with block:
              representation_type = gr.Dropdown(label = "Method for generating new topic labels", value="Default", choices=["Default", "MMR", "KeyBERT", "LLM"])
              represent_llm_btn = gr.Button("Change topic labels")
          with gr.Row():
-             reduce_outliers_btn = gr.Button("Reduce outliers")
+             reduce_outliers_btn = gr.Button("Reduce outliers (will create new topic labels)")
              save_pytorch_btn = gr.Button("Save model in Pytorch format")
  
      with gr.Tab("Visualise"):
          with gr.Row():
-             visualisation_type_radio = gr.Radio(label="Visualisation type", choices=["Topic document graph", "Hierarchical view"])
+             visualisation_type_radio = gr.Radio(label="Visualisation type", choices=["Topic document graph", "Hierarchical view"], value="Topic document graph")
              in_label = gr.Dropdown(choices=["Choose a column"], multiselect = True, label="Select column for labelling documents in output visualisations.")
              sample_slide = gr.Slider(minimum = 0.01, maximum = 1, value = 0.1, step = 0.01, label = "Proportion of data points to show on output visualisations.")
              legend_label = gr.Textbox(label="Custom legend column (optional, any column from the topic details output)", visible=False)
@@ -98,36 +116,43 @@ with block:
      with gr.Tab("Options"):
          with gr.Accordion("Data load and processing options", open = True):
              with gr.Row():
-                 seed_number = gr.Number(label="Random seed to use for dimensionality reduction.", minimum=0, step=1, value=42, precision=0)
+                 seed_number = gr.Number(label="Random seed to use in processing", minimum=0, step=1, value=42, precision=0)
                  calc_probs = gr.Dropdown(label="Calculate all topic probabilities", value="No", choices=["Yes", "No"])
              with gr.Row():
-                 low_resource_mode_opt = gr.Dropdown(label = "Use low resource (TF-IDF) embeddings and processing.", value="Yes", choices=["Yes", "No"])
-                 embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp for smaller files with less accuracy.", value="No", choices=["Yes", "No"])
-             with gr.Row():
+                 embedding_super_compress = gr.Dropdown(label = "Round embeddings to three dp: smaller files but lower quality.", value="No", choices=["Yes", "No"])
                  return_intermediate_files = gr.Dropdown(label = "Return intermediate processing files from file preparation.", value="Yes", choices=["Yes", "No"])
                  save_topic_model = gr.Dropdown(label = "Save topic model to BERTopic format pkl file.", value="No", choices=["Yes", "No"])
  
      # Load in data. Update column names dropdown when file uploaded
-     in_files.upload(fn=initial_file_load, inputs=[in_files], outputs=[in_colnames, in_label, data_state, output_single_text, topic_model_state, embeddings_state, data_file_name_no_ext_state, label_list_state])
+     in_files.upload(fn=initial_file_load, inputs=[in_files], outputs=[in_colnames, in_label, data_state, output_single_text, topic_model_state, embeddings_state, data_file_name_no_ext_state, label_list_state, original_data_state])
+
+     # When topic modelling column is chosen, change the default visualisation column to the same
+     in_colnames.change(fn=change_default_vis_col, inputs=[in_colnames],outputs=[in_label])
  
      # Clean data
      custom_regex.upload(fn=custom_regex_load, inputs=[custom_regex], outputs=[custom_regex_text, custom_regex_state])
-     clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop, split_sentence_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state], api_name="clean")
+     clean_btn.click(fn=pre_clean, inputs=[data_state, in_colnames, data_file_name_no_ext_state, custom_regex_state, clean_text, drop_duplicate_text, anonymise_drop, split_sentence_drop], outputs=[output_single_text, output_file, data_state, data_file_name_no_ext_state, embeddings_state], api_name="clean")
+
+     # Optimise for keeping only zero-shot topics
+     zero_shot_optimiser_btn.click(fn=optimise_zero_shot, outputs=[quality_mode_drop, min_docs_slider, max_topics_slider, min_word_occurence_slider, max_word_occurence_slider, zero_shot_similarity])
  
      # Extract topics
-     topics_btn.click(fn=extract_topics, inputs=[data_state, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext_state, label_list_state, return_intermediate_files, embedding_super_compress, low_resource_mode_opt, save_topic_model, embeddings_state, embeddings_type_state, zero_shot_similarity, seed_number, calc_probs, vectoriser_state], outputs=[output_single_text, output_file, embeddings_state, embeddings_type_state, data_file_name_no_ext_state, topic_model_state, docs_state, vectoriser_state, assigned_topics_state], api_name="topics")
+     topics_btn.click(fn=extract_topics, inputs=[data_state, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext_state, label_list_state, return_intermediate_files, embedding_super_compress, quality_mode_drop, save_topic_model, embeddings_state, embeddings_type_state, zero_shot_similarity, calc_probs, vectoriser_state, min_word_occurence_slider, max_word_occurence_slider, split_sentence_drop, seed_number], outputs=[output_single_text, output_file, embeddings_state, embeddings_type_state, data_file_name_no_ext_state, topic_model_state, docs_state, vectoriser_state, assigned_topics_state], api_name="topics")
  
      # Reduce outliers
-     reduce_outliers_btn.click(fn=reduce_outliers, inputs=[topic_model_state, docs_state, embeddings_state, data_file_name_no_ext_state, assigned_topics_state, vectoriser_state, save_topic_model], outputs=[output_single_text, output_file, topic_model_state], api_name="reduce_outliers")
+     reduce_outliers_btn.click(fn=reduce_outliers, inputs=[topic_model_state, docs_state, embeddings_state, data_file_name_no_ext_state, assigned_topics_state, vectoriser_state, save_topic_model, split_sentence_drop, data_state], outputs=[output_single_text, output_file, topic_model_state], api_name="reduce_outliers")
  
      # Re-represent topic labels
-     represent_llm_btn.click(fn=represent_topics, inputs=[topic_model_state, docs_state, data_file_name_no_ext_state, low_resource_mode_opt, save_topic_model, representation_type, vectoriser_state], outputs=[output_single_text, output_file, topic_model_state], api_name="represent_llm")
+     represent_llm_btn.click(fn=represent_topics, inputs=[topic_model_state, docs_state, data_file_name_no_ext_state, quality_mode_drop, save_topic_model, representation_type, vectoriser_state, split_sentence_drop, data_state], outputs=[output_single_text, output_file, topic_model_state], api_name="represent_llm")
  
      # Save in Pytorch format
      save_pytorch_btn.click(fn=save_as_pytorch_model, inputs=[topic_model_state, data_file_name_no_ext_state], outputs=[output_single_text, output_file], api_name="pytorch_save")
  
      # Visualise topics
-     plot_btn.click(fn=visualise_topics, inputs=[topic_model_state, data_state, data_file_name_no_ext_state, low_resource_mode_opt, embeddings_state, in_label, in_colnames, legend_label, sample_slide, visualisation_type_radio, seed_number], outputs=[vis_output_single_text, out_plot_file, plot, plot_2], api_name="plot")
+     plot_btn.click(fn=visualise_topics, inputs=[topic_model_state, data_state, data_file_name_no_ext_state, quality_mode_drop, embeddings_state, in_label, in_colnames, legend_label, sample_slide, visualisation_type_radio, seed_number], outputs=[vis_output_single_text, out_plot_file, plot, plot_2], api_name="plot")
+
+     # Get session hash from connection parameters
+     block.load(get_connection_params, inputs=None, outputs=[session_hash_state])
  
  # Launch the Gradio app
  if __name__ == "__main__":
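The new min/max word-occurrence sliders feed the CountVectorizer min_df/max_df arguments seen above; a minimal sketch of the standard scikit-learn behaviour they control, with made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df drops terms appearing in less than the given proportion of documents;
# max_df drops terms appearing in more than the given proportion.
docs = ["buses are late", "buses are crowded", "trains are fine"]
vectoriser = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=0.01, max_df=0.95)
vectoriser.fit(docs)
print(vectoriser.get_feature_names_out())
```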
funcs/anonymiser.py CHANGED
@@ -46,7 +46,7 @@ from presidio_anonymizer.entities import OperatorConfig
  # Function to Split Text and Create DataFrame using SpaCy
  def expand_sentences_spacy(df, colname, nlp=nlp):
      expanded_data = []
-     df = df.reset_index(names='index')
+     df = df.drop('index', axis = 1, errors="ignore").reset_index(names='index')
      for index, row in df.iterrows():
          doc = nlp(row[colname])
          for sent in doc.sents:
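The one-line change above guards against a name collision: pandas raises if reset_index tries to insert an 'index' column that already exists. A minimal sketch of the failure mode (assumes pandas >= 1.5 for the names= argument; the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"text": ["First sentence. Second sentence."]})
df = df.reset_index(names="index")  # first call adds an 'index' column

# A second bare reset_index(names='index') would raise
# ValueError: cannot insert index, already exists.
# Dropping any stale 'index' column first, as in the fixed code, avoids this:
df = df.drop("index", axis=1, errors="ignore").reset_index(names="index")
print(df)
```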
funcs/bertopic_vis_documents.py CHANGED
@@ -22,7 +22,8 @@ from tqdm import tqdm
  import itertools
  import numpy as np
  
- # Shamelessly taken and adapted from Bertopic original implementation here (Maarten Grootendorst): https://github.com/MaartenGr/BERTopic/blob/master/bertopic/plotting/_documents.py
+
+ # Following adapted from Bertopic original implementation here (Maarten Grootendorst): https://github.com/MaartenGr/BERTopic/blob/master/bertopic/plotting/_documents.py
  
  def visualize_documents_custom(topic_model,
                                 docs: List[str],
@@ -168,16 +169,23 @@
      df["y"] = embeddings_2d[:, 1]
  
      # Prepare text and names
+     trace_name_char_length = 60
      if isinstance(custom_labels, str):
          names = [[[str(topic), None]] + topic_model.topic_aspects_[custom_labels][topic] for topic in unique_topics]
          names = ["_".join([label[0] for label in labels[:4]]) for labels in names]
          names = [label if len(label) < 30 else label[:27] + "..." for label in names]
      elif topic_model.custom_labels_ is not None and custom_labels:
-         print("Using custom labels: ", topic_model.custom_labels_)
-         names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
+         #print("Using custom labels: ", topic_model.custom_labels_)
+         #names = [topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics]
+         # Limit label length to trace_name_char_length (60) characters
+         names = [label[:trace_name_char_length] for label in (topic_model.custom_labels_[topic + topic_model._outliers] for topic in unique_topics)]
+
      else:
-         print("Not using custom labels")
-         names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]
+         #print("Not using custom labels")
+         # Limit label length to trace_name_char_length (60) characters
+         names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3])[:trace_name_char_length] for topic in unique_topics]
+
+         #names = [f"{topic} " + ", ".join([word for word, value in topic_model.get_topic(topic)][:3]) for topic in unique_topics]
  
      #print(names)
  
funcs/clean_funcs.py CHANGED
@@ -23,19 +23,27 @@ def initial_clean(texts, custom_regex, progress=gr.Progress()):
      text = text.str.replace_all(email_pattern_regex, ' ')
      text = text.str.replace_all(nums_two_more_regex, ' ')
      text = text.str.replace_all(postcode_pattern_regex, ' ')
+     text = text.str.replace_all(multiple_spaces_regex, ' ')
+
+     text = text.to_list()
+
+     return text
+
+ def regex_clean(texts, custom_regex, progress=gr.Progress()):
+     texts = pl.Series(texts).str.strip_chars()
  
      # Allow for custom regex patterns to be removed
      if len(custom_regex) > 0:
          for pattern in custom_regex:
              raw_string_pattern = r'{}'.format(pattern)
              print("Removing regex pattern: ", raw_string_pattern)
-             text = text.str.replace_all(raw_string_pattern, ' ')
+             texts = texts.str.replace_all(raw_string_pattern, ' ')
  
-     text = text.str.replace_all(multiple_spaces_regex, ' ')
+     texts = texts.str.replace_all(multiple_spaces_regex, ' ')
  
-     text = text.to_list()
+     texts = texts.to_list()
  
-     return text
+     return texts
  
  def remove_hyphens(text_text):
      return re.sub(r'(\w+)-(\w+)-?(\w)?', r'\1 \2 \3', text_text)
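A simplified, self-contained sketch of the newly split-out regex_clean (the gr.Progress argument is dropped, and multiple_spaces_regex is assumed here to be a plain whitespace-run pattern standing in for the module-level one):

```python
import polars as pl

multiple_spaces_regex = r"\s{2,}"  # assumed stand-in for the module-level pattern

def regex_clean(texts, custom_regex):
    # Strip leading/trailing whitespace, remove user-supplied patterns, collapse space runs
    texts = pl.Series(texts).str.strip_chars()
    for pattern in custom_regex:
        texts = texts.str.replace_all(r"{}".format(pattern), " ")
    texts = texts.str.replace_all(multiple_spaces_regex, " ")
    return texts.to_list()

print(regex_clean(["President Roosevelt spoke.", "No match here."], [r"(?i)roosevelt"]))
# -> ['President spoke.', 'No match here.']
```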
funcs/embeddings.py CHANGED
@@ -1,15 +1,41 @@
  import time
  import numpy as np
- from torch import cuda
+ from torch import cuda, backends, version
  
- random_seed = 42
+ # Check for torch cuda
+ # If you want to disable cuda for testing purposes
+ #os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
  
+ print("Is CUDA enabled? ", cuda.is_available())
+ print("Is a CUDA device available on this computer?", backends.cudnn.enabled)
  if cuda.is_available():
      torch_device = "gpu"
+     print("Cuda version installed is: ", version.cuda)
+     high_quality_mode = "Yes"
+     #os.system("nvidia-smi")
  else:
      torch_device = "cpu"
+     high_quality_mode = "No"
  
+ print("Device used is: ", torch_device)
+
- def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, low_resource_mode_opt):
+ def make_or_load_embeddings(docs: list, file_list: list, embeddings_out: np.ndarray, embedding_model, embeddings_super_compress: str, high_quality_mode_opt: str) -> np.ndarray:
+     """
+     Create or load embeddings for the given documents.
+
+     Args:
+         docs (list): List of documents to embed.
+         file_list (list): List of file names to check for existing embeddings.
+         embeddings_out (np.ndarray): Array to store the embeddings.
+         embedding_model: Model used to generate embeddings.
+         embeddings_super_compress (str): Option to super compress embeddings ("Yes" or "No").
+         high_quality_mode_opt (str): Option for high quality mode ("Yes" or "No").
+
+     Returns:
+         np.ndarray: The generated or loaded embeddings.
+     """
  
      # If no embeddings found, make or load in
      if embeddings_out.size == 0:
@@ -32,7 +58,7 @@ def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, em
  
      # Custom model
      # If on CPU, don't resort to embedding models
-     if low_resource_mode_opt == "Yes":
+     if high_quality_mode_opt == "No":
          print("Creating simplified 'sparse' embeddings based on TfIDF")
  
          # Fit the pipeline to the text data
@@ -41,13 +67,10 @@ def make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, em
          # Transform text data to embeddings
          embeddings_out = embedding_model.transform(docs)
  
-         #embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32)
-
-     elif low_resource_mode_opt == "No":
+     elif high_quality_mode_opt == "Yes":
          print("Creating dense embeddings based on transformers model")
  
-         #embeddings_out = embedding_model.encode(sentences=docs, max_length=1024, show_progress_bar = True, batch_size = 32) # For Jina # #
-         embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32, precision="int8") # For large
+         embeddings_out = embedding_model.encode(sentences=docs, show_progress_bar = True, batch_size = 32)#, precision="int8") # For large
  
      toc = time.perf_counter()
      time_out = f"The embedding took {toc - tic:0.1f} seconds"
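The README mentions loading a previously saved embeddings .npz file to skip this step on later runs; a minimal numpy round-trip sketch (the file name, array shape, and key handling are illustrative):

```python
import numpy as np

embeddings_out = np.random.rand(100, 512).astype(np.float32)  # stand-in for real document embeddings

# Save once after the first run...
np.savez_compressed("my_data_embeddings.npz", embeddings_out)

# ...then reload instead of re-embedding (positional arrays are stored under 'arr_0').
with np.load("my_data_embeddings.npz") as data:
    embeddings_out = data[data.files[0]]
print(embeddings_out.shape)
```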
funcs/helper_functions.py CHANGED
@@ -10,33 +10,70 @@ import numpy as np
  from bertopic import BERTopic
  from datetime import datetime
  
+ from typing import List, Tuple
+
  today = datetime.now().strftime("%d%m%Y")
  today_rev = datetime.now().strftime("%Y%m%d")
  
- # Log terminal output: https://github.com/gradio-app/gradio/issues/2362
- class Logger:
-     def __init__(self, filename):
-         self.terminal = sys.stdout
-         self.log = open(filename, "w")
-
-     def write(self, message):
-         self.terminal.write(message)
-         self.log.write(message)
-
-     def flush(self):
-         self.terminal.flush()
-         self.log.flush()
-
-     def isatty(self):
-         return False
-
- #sys.stdout = Logger("output.log")
-
- # def read_logs():
- #     sys.stdout.flush()
- #     with open("output.log", "r") as f:
- #         return f.read()
+ def get_or_create_env_var(var_name:str, default_value:str) -> str:
+     # Get the environment variable if it exists
+     value = os.environ.get(var_name)
+
+     # If it doesn't exist, set it to the default value
+     if value is None:
+         os.environ[var_name] = default_value
+         value = default_value
+
+     return value
+
+ # Retrieving or setting output folder
+ env_var_name = 'GRADIO_OUTPUT_FOLDER'
+ default_value = 'output/'
+
+ output_folder = get_or_create_env_var(env_var_name, default_value)
+ print(f'The value of {env_var_name} is {output_folder}')
+
+ def ensure_output_folder_exists():
+     """Checks if the 'output/' folder exists, creates it if not."""
+
+     folder_name = "output/"
+
+     if not os.path.exists(folder_name):
+         # Create the folder if it doesn't exist
+         os.makedirs(folder_name)
+         print(f"Created the 'output/' folder.")
+     else:
+         print(f"The 'output/' folder already exists.")
+
+ def get_connection_params(request: gr.Request):
+     '''
+     Get connection parameter values from request object.
+     '''
+     if request:
+
+         # print("Request headers dictionary:", request.headers)
+         # print("All host elements", request.client)
+         # print("IP address:", request.client.host)
+         # print("Query parameters:", dict(request.query_params))
+         print("Session hash:", request.session_hash)
+
+         if 'x-cognito-id' in request.headers:
+             out_session_hash = request.headers['x-cognito-id']
+             base_folder = "user-files/"
+             #print("Cognito ID found:", out_session_hash)
+         else:
+             out_session_hash = request.session_hash
+             base_folder = "temp-files/"
+             #print("Cognito ID not found. Using session hash as save folder.")
+
+         output_folder = base_folder + out_session_hash + "/"
+         #print("S3 output folder is: " + "s3://" + bucket_name + "/" + output_folder)
+
+         return out_session_hash
+     else:
+         print("No session parameters found.")
+         return ""
  
  def detect_file_type(filename):
      """Detect the file type based on its extension."""
@@ -130,7 +167,7 @@ def initial_file_load(in_file):
  
      #The np.array([]) at the end is for clearing the embedding state when a new file is loaded
-     return gr.Dropdown(choices=concat_choices), gr.Dropdown(choices=concat_choices), df, output_text, topic_model, embeddings, data_file_name_no_ext, custom_labels
+     return gr.Dropdown(choices=concat_choices), gr.Dropdown(choices=concat_choices), df, output_text, topic_model, embeddings, data_file_name_no_ext, custom_labels, df
  
  def custom_regex_load(in_file):
      '''
@@ -157,8 +194,6 @@ def custom_regex_load(in_file):
  
      return output_text, custom_regex
  
-
-
  def get_file_path_end(file_path):
      # First, get the basename of the file (e.g., "example.txt" from "/path/to/example.txt")
      basename = os.path.basename(file_path)
@@ -177,15 +212,7 @@ def get_file_path_end_with_ext(file_path):
  
      return filename_end
  
- def dummy_function(in_colnames):
-     """
-     A dummy function that exists just so that dropdown updates work correctly.
-     """
-     return None
-
  # Zip the above to export file
-
-
  def zip_folder(folder_path, output_zip_file):
      # Create a ZipFile object in write mode
      with zipfile.ZipFile(output_zip_file, 'w', zipfile.ZIP_DEFLATED) as zipf:
@@ -215,59 +242,121 @@ def delete_files_in_folder(folder_path):
      except Exception as e:
          print(f"Failed to delete {file_path}. Reason: {e}")
  
-
- def save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, progress=gr.Progress()):
+ def save_topic_outputs(topic_model: BERTopic, data_file_name_no_ext: str, output_list: List[str], docs: List[str], save_topic_model: bool, prepared_docs: pd.DataFrame, split_sentence_drop: str, output_folder: str = output_folder, progress: gr.Progress = gr.Progress()) -> Tuple[List[str], str]:
+     """
+     Save the outputs of a topic model to specified files.
+
+     Args:
+         topic_model (BERTopic): The topic model object.
+         data_file_name_no_ext (str): The base name of the data file without extension.
+         output_list (List[str]): List to store the output file names.
+         docs (List[str]): List of documents.
+         save_topic_model (bool): Flag to save the topic model.
+         prepared_docs (pd.DataFrame): DataFrame containing prepared documents.
+         split_sentence_drop (str): Option to split sentences ("Yes" or "No").
+         output_folder (str, optional): Folder to save the output files. Defaults to output_folder.
+         progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress().
+
+     Returns:
+         Tuple[List[str], str]: A tuple containing the list of output file names and a status message.
+     """
  
      progress(0.7, desc= "Checking data")
  
      topic_dets = topic_model.get_topic_info()
  
      if topic_dets.shape[0] == 1:
-         topic_det_output_name = "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+         topic_det_output_name = output_folder + "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
          topic_dets.to_csv(topic_det_output_name)
          output_list.append(topic_det_output_name)
  
          return output_list, "No topics found, original file returned"
  
      progress(0.8, desc= "Saving output")
  
-     topic_det_output_name = "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+     topic_det_output_name = output_folder + "topic_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
      topic_dets.to_csv(topic_det_output_name)
      output_list.append(topic_det_output_name)
  
-     doc_det_output_name = "doc_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
-     doc_dets = topic_model.get_document_info(docs)[["Document", "Topic", "Name", "Probability", "Representative_document"]]
-     doc_dets.to_csv(doc_det_output_name)
-     output_list.append(doc_det_output_name)
+     doc_det_output_name = output_folder + "doc_details_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+
+     ## Check that the following columns exist in the dataframe, keep only the ones that exist
+     columns_to_check = ["Document", "Topic", "Name", "Probability", "Representative_document"]
+
+     columns_found = [column for column in columns_to_check if column in topic_model.get_document_info(docs).columns]
+     doc_dets = topic_model.get_document_info(docs)[columns_found]
+
+     # If you have created a 'sentence split' dataset from the cleaning options, map these sentences back to the original document.
+     try:
+         if split_sentence_drop == "Yes":
+             doc_dets = doc_dets.merge(prepared_docs[['document_index']], how = "left", left_index=True, right_index=True)
+             doc_dets = doc_dets.rename(columns={"document_index": "parent_document_index"}, errors='ignore')
+
+             # 1. Group by Parent Document Index:
+             grouped = doc_dets.groupby('parent_document_index')
+
+             # 2. Aggregate Topics and Probabilities:
+             def aggregate_topics(group):
+                 original_text = ' '.join(group['Document'])
+                 topics = group['Topic'].tolist()
+
+                 if 'Name' in group.columns:
+                     topic_names = group['Name'].tolist()
+                 else:
+                     topic_names = None
+
+                 if 'Probability' in group.columns:
+                     probabilities = group['Probability'].tolist()
+                 else:
+                     probabilities = None # Or any other default value you prefer
+
+                 return pd.Series({'Document':original_text, 'Topic numbers': topics, 'Topic names': topic_names, 'Probabilities': probabilities})
+
+             #result_df = grouped.apply(aggregate_topics).reset_index()
+             doc_det_agg = grouped.apply(lambda x: aggregate_topics(x)).reset_index()
+
+             # Join back original text
+             #doc_det_agg = doc_det_agg.merge(original_data[[in_colnames_list_first]], how = "left", left_index=True, right_index=True)
+
+             doc_det_agg_output_name = output_folder + "doc_details_agg_" + data_file_name_no_ext + "_" + today_rev + ".csv"
+             doc_det_agg.to_csv(doc_det_agg_output_name)
+             output_list.append(doc_det_agg_output_name)
+
+     except Exception as e:
+         print("Creating aggregate document details failed, error:", e)
+
+     # Save document details to file
+     doc_dets.to_csv(doc_det_output_name)
+     output_list.append(doc_det_output_name)
  
      if "CustomName" in topic_dets.columns:
          topics_text_out_str = str(topic_dets["CustomName"])
      else:
          topics_text_out_str = str(topic_dets["Name"])
      output_text = "Topics: " + topics_text_out_str
  
      # Save topic model to file
      if save_topic_model == "Yes":
          print("Saving BERTopic model in .pkl format.")
  
-         folder_path = "output_model/"
-
-         if not os.path.exists(folder_path):
-             # Create the folder
-             os.makedirs(folder_path)
-
-         topic_model_save_name_pkl = folder_path + data_file_name_no_ext + "_topics_" + today_rev + ".pkl"# + ".safetensors"
+         #folder_path = output_folder #"output_model/"
+
+         #if not os.path.exists(folder_path):
+         #    # Create the folder
+         #    os.makedirs(folder_path)
+
+         topic_model_save_name_pkl = output_folder + data_file_name_no_ext + "_topics_" + today_rev + ".pkl"# + ".safetensors"
          topic_model_save_name_zip = topic_model_save_name_pkl + ".zip"
  
          # Clear folder before replacing files
         #delete_files_in_folder(topic_model_save_name_pkl)
  
          topic_model.save(topic_model_save_name_pkl, serialization='pickle', save_embedding_model=False, save_ctfidf=False)
  
          # Zip file example
          #zip_folder(topic_model_save_name_pkl, topic_model_save_name_zip)
          output_list.append(topic_model_save_name_pkl)
  
      return output_list, output_text
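A short usage sketch of the new GRADIO_OUTPUT_FOLDER handling above: setting the variable before the module is imported redirects outputs, otherwise the 'output/' default applies (the override value here is made up):

```python
import os

# Illustrative override; must happen before funcs.helper_functions is imported,
# since output_folder is resolved at import time.
os.environ["GRADIO_OUTPUT_FOLDER"] = "runs/2024-06-01/"

from funcs.helper_functions import output_folder
print(output_folder)  # -> runs/2024-06-01/
```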
funcs/representation_model.py CHANGED
@@ -3,29 +3,26 @@ from bertopic.representation import LlamaCPP
  from llama_cpp import Llama
  from pydantic import BaseModel
  import torch.cuda
- from huggingface_hub import hf_hub_download, snapshot_download
+ from huggingface_hub import hf_hub_download
  
  from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, BaseRepresentation
- from funcs.prompts import capybara_prompt, capybara_start, open_hermes_prompt, open_hermes_start, stablelm_prompt, stablelm_start, phi3_prompt, phi3_start
-
- random_seed = 42
+ from funcs.embeddings import torch_device
+ from funcs.prompts import phi3_prompt, phi3_start
  
  chosen_prompt = phi3_prompt #open_hermes_prompt # stablelm_prompt
  chosen_start_tag = phi3_start #open_hermes_start # stablelm_start
  
+ random_seed = 42
  
  # Currently set n_gpu_layers to 0 even with cuda due to persistent bugs in implementation with cuda
- if torch.cuda.is_available():
-     torch_device = "gpu"
+ print("torch device for representation functions:", torch_device)
+ if torch_device == "gpu":
      low_resource_mode = "No"
-     n_gpu_layers = 100
- else:
-     torch_device = "cpu"
+     n_gpu_layers = -1 # i.e. all
+ else: # torch_device = "cpu"
      low_resource_mode = "Yes"
      n_gpu_layers = 0
  
- #low_resource_mode = "No" # Override for testing
-
  #print("Running on device:", torch_device)
  n_threads = torch.get_num_threads()
  print("CPU n_threads:", n_threads)
@@ -37,7 +34,7 @@ top_p: float = 1
  repeat_penalty: float = 1.1
  last_n_tokens_size: int = 128
  max_tokens: int = 500
- seed: int = 42
+ seed: int = random_seed
  reset: bool = True
  stream: bool = False
  n_threads: int = n_threads
@@ -83,15 +80,25 @@ llm_config = LLamacppInitConfigGpu(last_n_tokens_size=last_n_tokens_size,
      trust_remote_code=trust_remote_code)
  
  ## Create representation model parameters ##
- # KeyBERT
  keybert = KeyBERTInspired(random_state=random_seed)
- # MMR
  mmr = MaximalMarginalRelevance(diversity=0.5)
-
  base_rep = BaseRepresentation()
  
  # Find model file
- def find_model_file(hf_model_name, hf_model_file, search_folder, sub_folder):
+ def find_model_file(hf_model_name: str, hf_model_file: str, search_folder: str, sub_folder: str) -> str:
+     """
+     Finds the specified model file within the given search folder and subfolder.
+
+     Args:
+         hf_model_name (str): The name of the Hugging Face model.
+         hf_model_file (str): The specific file name of the model to find.
+         search_folder (str): The base folder to start the search.
+         sub_folder (str): The subfolder within the search folder to look into.
+
+     Returns:
+         str: The path to the found model file, or None if the file is not found.
+     """
+
      hf_loc = search_folder #os.environ["HF_HOME"]
      hf_sub_loc = search_folder + sub_folder #os.environ["HF_HOME"]
  
@@ -116,17 +123,27 @@ def find_model_file(hf_model_name, hf_model_file, search_folder, sub_folder):
  
      return found_file
  
- def create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, low_resource_mode):
+ def create_representation_model(representation_type: str, llm_config: dict, hf_model_name: str, hf_model_file: str, chosen_start_tag: str, low_resource_mode: bool) -> dict:
+     """
+     Creates a representation model based on the specified type and configuration.
+
+     Args:
+         representation_type (str): The type of representation model to create (e.g., "LLM", "KeyBERT").
+         llm_config (dict): Configuration settings for the LLM model.
+         hf_model_name (str): The name of the Hugging Face model.
+         hf_model_file (str): The specific file name of the model to find.
+         chosen_start_tag (str): The start tag to use for the model.
+         low_resource_mode (bool): Whether to enable low resource mode.
+
+     Returns:
+         dict: A dictionary containing the created representation model.
+     """
  
      if representation_type == "LLM":
          print("Generating LLM representation")
          # Use llama.cpp to load in model
  
-         # del os.environ["HF_HOME"]
-
          # Check for HF_HOME environment variable and supply a default value if it's not found (typical location for huggingface models)
-         # Get HF_HOME environment variable or default to "~/.cache/huggingface/hub"
          base_folder = "model" #"~/.cache/huggingface/hub"
          hf_home_value = os.getenv("HF_HOME", base_folder)
  
@@ -158,9 +175,10 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
  
          print("Loading representation model with", llm_config.n_gpu_layers, "layers allocated to GPU.")
  
+         #llm_config.n_gpu_layers
          llm = Llama(model_path=found_file, stop=chosen_start_tag, n_gpu_layers=llm_config.n_gpu_layers, n_ctx=llm_config.n_ctx,seed=seed) #**llm_config.model_dump())# rope_freq_scale=0.5,
          #print(llm.n_gpu_layers)
-         print("Chosen prompt:", chosen_prompt)
+         #print("Chosen prompt:", chosen_prompt)
          llm_model = LlamaCPP(llm, prompt=chosen_prompt)#, **gen_config.model_dump())
  
          # All representation models
@@ -180,15 +198,6 @@ def create_representation_model(representation_type, llm_config, hf_model_name,
      else:
          print("Generating default representation type")
          representation_model = {"Default":base_rep}
-
-     # Deprecated example using CTransformers. This package is not really used anymore
-     #model = AutoModelForCausalLM.from_pretrained('NousResearch/Nous-Capybara-7B-V1.9-GGUF', model_type='mistral', model_file='Capybara-7B-V1.9-Q5_K_M.gguf', hf=True, **vars(llm_config))
-     #tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B-V1.9")
-     #generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
-
-     # Text generation with Llama 2
-     #mistral_capybara = TextGeneration(generator, prompt=capybara_prompt)
-     #mistral_hermes = TextGeneration(generator, prompt=open_hermes_prompt)
  
      return representation_model
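A hedged sketch of wiring create_representation_model into BERTopic. The model names mirror those used elsewhere in this commit; passing the returned dictionary straight to BERTopic's representation_model argument (which accepts a dict of aspect models) is an assumption about usage, not something this diff shows:

```python
from bertopic import BERTopic
from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, low_resource_mode

hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"
hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"

# "KeyBERT" avoids the slow LLM path; "LLM" would download and load the gguf model.
representation_model = create_representation_model("KeyBERT", llm_config, hf_model_name, hf_model_file, chosen_start_tag, low_resource_mode)

topic_model = BERTopic(representation_model=representation_model)
```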
funcs/topic_core_funcs.py CHANGED
@@ -8,12 +8,17 @@ import numpy as np
8
  import time
9
  from bertopic import BERTopic
10
 
11
- from funcs.clean_funcs import initial_clean
 
 
 
12
  from funcs.anonymiser import expand_sentences_spacy
13
- from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs
14
- from funcs.embeddings import make_or_load_embeddings
15
  from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
 
16
 
 
17
 
18
  from sentence_transformers import SentenceTransformer
19
  from sklearn.pipeline import make_pipeline
@@ -22,27 +27,10 @@ from sklearn.feature_extraction.text import TfidfVectorizer
22
  import funcs.anonymiser as anon
23
  from umap import UMAP
24
 
25
- from torch import cuda, backends, version
26
-
27
- # Default seed, can be changed in number selection on options page
28
- random_seed = 42
29
-
30
- # Check for torch cuda
31
- # If you want to disable cuda for testing purposes
32
- #os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
33
-
34
- print("Is CUDA enabled? ", cuda.is_available())
35
- print("Is a CUDA device available on this computer?", backends.cudnn.enabled)
36
- if cuda.is_available():
37
- torch_device = "gpu"
38
- print("Cuda version installed is: ", version.cuda)
39
- low_resource_mode = "No"
40
- #os.system("nvidia-smi")
41
- else:
42
- torch_device = "cpu"
43
- low_resource_mode = "Yes"
44
-
45
- print("Device used is: ", torch_device)
46
 
47
  today = datetime.now().strftime("%d%m%Y")
48
  today_rev = datetime.now().strftime("%Y%m%d")
@@ -54,7 +42,35 @@ embeddings_name = "mixedbread-ai/mxbai-embed-large-v1" #"BAAI/large-small-en-v1.
54
  hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
55
  hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
56
 
57
- def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text, drop_duplicate_text, anonymise_drop, sentence_split_drop, progress=gr.Progress(track_tqdm=True)):
58
 
59
  output_text = ""
60
  output_list = []
@@ -64,7 +80,7 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
64
  if not in_colnames:
65
  error_message = "Please enter one column name to use for cleaning and finding topics."
66
  print(error_message)
67
- return error_message, None, data_file_name_no_ext, None, None
68
 
69
  all_tic = time.perf_counter()
70
 
@@ -77,17 +93,23 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
77
  clean_tic = time.perf_counter()
78
  print("Starting data clean.")
79
 
80
- data_file_name_no_ext = data_file_name_no_ext + "_clean"
81
 
82
- if not custom_regex.empty:
83
- data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], custom_regex.iloc[:, 0].to_list())
84
- else:
85
- data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], [])
86
 
87
  clean_toc = time.perf_counter()
88
  clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
89
  print(clean_time_out)
90
 
91
  if drop_duplicate_text == "Yes":
92
  progress(0.3, desc= "Drop duplicates - remove short texts")
93
 
@@ -104,7 +126,8 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
104
  if anonymise_drop == "Yes":
105
  progress(0.6, desc= "Anonymising data")
106
 
107
- data_file_name_no_ext = data_file_name_no_ext + "_anon"
 
108
 
109
  anon_tic = time.perf_counter()
110
 
@@ -120,17 +143,19 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
120
  if sentence_split_drop == "Yes":
121
  progress(0.6, desc= "Splitting text into sentences")
122
 
123
- data_file_name_no_ext = data_file_name_no_ext + "_split"
 
124
 
125
  anon_tic = time.perf_counter()
126
 
127
  data = expand_sentences_spacy(data, in_colnames_list_first)
128
- data = data[data[in_colnames_list_first].str.len() >= 5] # Keep only rows with at least 5 characters
 
129
 
130
  anon_toc = time.perf_counter()
131
  time_out = f"Splitting text into sentences took {anon_toc - anon_tic:0.1f} seconds"
132
 
133
- out_data_name = data_file_name_no_ext + "_" + today_rev + ".csv"
134
  data.to_csv(out_data_name)
135
  output_list.append(out_data_name)
136
 
@@ -140,14 +165,84 @@ def pre_clean(data, in_colnames, data_file_name_no_ext, custom_regex, clean_text
140
 
141
  output_text = "Data clean completed."
142
 
143
- return output_text, output_list, data, data_file_name_no_ext
144
-
145
- def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slider, candidate_topics, data_file_name_no_ext, custom_labels_df, return_intermediate_files, embeddings_super_compress, low_resource_mode, save_topic_model, embeddings_out, embeddings_type_state, zero_shot_similarity, random_seed, calc_probs, vectoriser_state, progress=gr.Progress(track_tqdm=True)):
146
-
147
  all_tic = time.perf_counter()
148
 
149
  progress(0, desc= "Loading data")
150
 
 
 
151
  output_list = []
152
  file_list = [string.name for string in in_files]
153
 
@@ -170,10 +265,9 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
170
  # Check if embeddings are being loaded in
171
  progress(0.2, desc= "Loading/creating embeddings")
172
 
173
- print("Low resource mode: ", low_resource_mode)
174
 
175
- if low_resource_mode == "No":
176
- print("Using high resource embedding model")
177
 
178
  # Define a list of possible local locations to search for the model
179
  local_embeddings_locations = [
@@ -205,7 +299,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
205
  embeddings_type_state = "large"
206
 
207
  # UMAP model uses Bertopic defaults
208
- umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=False, random_state=random_seed)
209
 
210
  else:
211
  print("Choosing low resource TF-IDF model.")
@@ -223,9 +317,9 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
223
 
224
  #umap_model = TruncatedSVD(n_components=5, random_state=random_seed)
225
  # UMAP model uses Bertopic defaults
226
- umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=True, random_state=random_seed)
227
 
228
- embeddings_out = make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, low_resource_mode)
229
 
230
  # This is saved as a Gradio state object
231
  vectoriser_model = vectoriser_state
@@ -250,7 +344,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
250
 
251
  if calc_probs == True:
252
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
253
- topics_probs_out_name = "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
254
  topics_probs_out.to_csv(topics_probs_out_name)
255
  output_list.append(topics_probs_out_name)
256
 
@@ -258,20 +352,24 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
258
  print(error)
259
  print(fail_error_message)
260
 
261
- return fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
 
 
262
 
263
 
264
  # Do this if you have pre-defined topics
265
  else:
266
- if low_resource_mode == "Yes":
267
- error_message = "Zero shot topic modelling currently not compatible with low-resource embeddings. Please change this option to 'No' on the options tab and retry."
268
- print(error_message)
269
 
270
- return error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
271
 
272
  zero_shot_topics = read_file(candidate_topics.name)
273
  zero_shot_topics_lower = list(zero_shot_topics.iloc[:, 0].str.lower())
274
 
 
 
275
 
276
  try:
277
  topic_model = BERTopic( embedding_model=embedding_model, #embedding_model_pipe, # for Jina
@@ -288,7 +386,7 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
288
 
289
  if calc_probs == True:
290
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
291
- topics_probs_out_name = "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
292
  topics_probs_out.to_csv(topics_probs_out_name)
293
  output_list.append(topics_probs_out_name)
294
 
@@ -296,14 +394,14 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
296
  print("An exception occurred:", error)
297
  print(fail_error_message)
298
 
299
- return fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
 
 
300
 
301
  # For some reason, zero topic modelling exports assigned topics as a np.array instead of a list. Converting it back here.
302
  if isinstance(assigned_topics, np.ndarray):
303
  assigned_topics = assigned_topics.tolist()
304
 
305
-
306
-
307
  # Zero shot modelling is a model merge, which wipes the c_tf_idf part of the resulting model completely. To get hierarchical modelling to work, we need to recreate this part of the model with the CountVectorizer options used to create the initial model. Since with zero shot we are merging two models that have exactly the same set of documents, the vocabulary should be the same, so recreating the c_tf_idf component in this way shouldn't be a problem. Discussion here, with the code below based on Maarten's suggestion: https://github.com/MaartenGr/BERTopic/issues/1700
308
 
309
  # Get document info
@@ -312,16 +410,12 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
312
  documents_per_topic = doc_dets.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
313
 
314
  # Assign CountVectorizer to merged model
315
-
316
  topic_model.vectorizer_model = vectoriser_model
317
 
318
  # Re-calculate c-TF-IDF
319
  c_tf_idf, _ = topic_model._c_tf_idf(documents_per_topic)
320
  topic_model.c_tf_idf_ = c_tf_idf
321
 
322
- ###
323
-
324
-
325
  # Check we have topics
326
  if not assigned_topics:
327
  return "No topics found.", output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model,[]
@@ -329,8 +423,14 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
329
  print("Topic model created.")
330
 
331
  # Tidy up topic label format a bit to have commas and spaces by default
332
- new_topic_labels = topic_model.generate_topic_labels(nr_words=3, separator=", ")
333
- topic_model.set_topic_labels(new_topic_labels)
 
 
 
 
 
 
334
 
335
  # Replace current topic labels if new ones loaded in
336
  if not custom_labels_df.empty:
@@ -342,18 +442,18 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
342
  print("Custom topics: ", topic_model.custom_labels_)
343
 
344
  # Outputs
345
- output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
346
 
347
  # If you want to save your embedding files
348
  if return_intermediate_files == "Yes":
349
  print("Saving embeddings to file")
350
- if low_resource_mode == "Yes":
351
- embeddings_file_name = data_file_name_no_ext + '_' + 'tfidf_embeddings.npz'
352
  else:
353
  if embeddings_super_compress == "No":
354
- embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings.npz'
355
  else:
356
- embeddings_file_name = data_file_name_no_ext + '_' + 'large_embeddings_compress.npz'
357
 
358
  np.savez_compressed(embeddings_file_name, embeddings_out)
359
 
@@ -365,7 +465,25 @@ def extract_topics(data, in_files, min_docs_slider, in_colnames, max_topics_slid
365
 
366
  return output_text, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model, assigned_topics
367
 
368
- def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, assigned_topics, vectoriser_model, save_topic_model, progress=gr.Progress(track_tqdm=True)):
369
 
370
  progress(0, desc= "Preparing data")
371
 
@@ -373,13 +491,9 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
373
 
374
  all_tic = time.perf_counter()
375
 
376
- # This step not necessary?
377
- #assigned_topics, probs = topic_model.fit_transform(docs, embeddings_out)
378
-
379
  if isinstance(assigned_topics, np.ndarray):
380
  assigned_topics = assigned_topics.tolist()
381
 
382
-
383
  # Reduce outliers if required, then update representation
384
  progress(0.2, desc= "Reducing outliers")
385
  print("Reducing outliers.")
@@ -397,20 +511,9 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
397
 
398
  print("Finished reducing outliers.")
399
 
400
- #progress(0.7, desc= "Replacing topic names with LLMs if necessary")
401
-
402
- #topic_dets = topic_model.get_topic_info()
403
-
404
- # # Replace original labels with LLM labels
405
- # if "LLM" in topic_model.get_topic_info().columns:
406
- # llm_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["LLM"].values()]
407
- # topic_model.set_topic_labels(llm_labels)
408
- # else:
409
- # topic_model.set_topic_labels(list(topic_dets["Name"]))
410
-
411
  # Outputs
412
  progress(0.9, desc= "Saving to file")
413
- output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
414
 
415
  all_toc = time.perf_counter()
416
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
@@ -418,16 +521,35 @@ def reduce_outliers(topic_model, docs, embeddings_out, data_file_name_no_ext, as
418
 
419
  return output_text, output_list, topic_model
420
 
421
- def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode, save_topic_model, representation_type, vectoriser_model, progress=gr.Progress(track_tqdm=True)):
422
- from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag
423
 
424
  output_list = []
425
 
426
  all_tic = time.perf_counter()
427
 
428
- progress(0.1, desc= "Loading model and creating new representation")
 
 
429
 
430
- representation_model = create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, low_resource_mode)
431
 
432
  progress(0.3, desc= "Updating existing topics")
433
  topic_model.update_topics(docs, vectorizer_model=vectoriser_model, representation_model=representation_model)
@@ -439,7 +561,7 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
439
  llm_labels = [label[0].split("\n")[0] for label in topic_dets["LLM"]]
440
  topic_model.set_topic_labels(llm_labels)
441
 
442
- label_list_file_name = data_file_name_no_ext + '_llm_topic_list_' + today_rev + '.csv'
443
 
444
  llm_labels_df = pd.DataFrame(data={"Label":llm_labels})
445
  llm_labels_df.to_csv(label_list_file_name, index=None)
@@ -452,7 +574,7 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
452
 
453
  # Outputs
454
  progress(0.8, desc= "Saving outputs")
455
- output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model)
456
 
457
  all_toc = time.perf_counter()
458
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
@@ -460,11 +582,51 @@ def represent_topics(topic_model, docs, data_file_name_no_ext, low_resource_mode
460
 
461
  return output_text, output_list, topic_model
462
 
463
- def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode, embeddings_out, in_label, in_colnames, legend_label, sample_prop, visualisation_type_radio, random_seed, progress=gr.Progress(track_tqdm=True)):
464
 
465
  progress(0, desc= "Preparing data for visualisation")
466
 
467
  output_list = []
 
468
  vis_tic = time.perf_counter()
469
 
470
 
@@ -500,30 +662,37 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
500
  topic_model.set_topic_labels(labels)
501
 
502
  # Pre-reduce embeddings for visualisation purposes
503
- if low_resource_mode == "No":
504
- reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=random_seed).fit_transform(embeddings_out)
505
  else:
506
  reduced_embeddings = TruncatedSVD(2, random_state=random_seed).fit_transform(embeddings_out)
507
 
508
- progress(0.5, desc= "Creating visualisation (this can take a while)")
509
  # Visualise the topics:
510
 
511
- print("Creating visualisation")
512
-
513
- # "Topic document graph", "Hierarchical view"
514
 
515
  if visualisation_type_radio == "Topic document graph":
516
- topics_vis = visualize_documents_custom(topic_model, docs, hover_labels = label_list, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True, sample = sample_prop, width= 1200, height = 750)
 
517
 
518
- topics_vis_name = data_file_name_no_ext + '_' + 'vis_topic_docs_' + today_rev + '.html'
519
- topics_vis.write_html(topics_vis_name)
520
- output_list.append(topics_vis_name)
 
 
 
 
521
 
522
- topics_vis_2 = topic_model.visualize_heatmap(custom_labels=True, width= 1200, height = 1200)
 
523
 
524
- topics_vis_2_name = data_file_name_no_ext + '_' + 'vis_heatmap_' + today_rev + '.html'
525
- topics_vis_2.write_html(topics_vis_2_name)
526
- output_list.append(topics_vis_2_name)
 
 
 
527
 
528
  elif visualisation_type_radio == "Hierarchical view":
529
 
@@ -532,7 +701,7 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
532
  # Print topic tree - may get encoding errors, so doing try except
533
  try:
534
  tree = topic_model.get_topic_tree(hierarchical_topics, tight_layout = True)
535
- tree_name = data_file_name_no_ext + '_' + 'vis_hierarchy_tree_' + today_rev + '.txt'
536
 
537
  with open(tree_name, "w") as file:
538
  # Write the string to the file
@@ -540,59 +709,71 @@ def visualise_topics(topic_model, data, data_file_name_no_ext, low_resource_mode
540
 
541
  output_list.append(tree_name)
542
 
543
- except Exception as error:
544
- print("An exception occurred when making topic tree document, skipped:", error)
 
 
545
 
546
 
547
  # Save new hierarchical topic model to file
548
- hierarchical_topics_name = data_file_name_no_ext + '_' + 'vis_hierarchy_topics_dist_' + today_rev + '.csv'
549
- hierarchical_topics.to_csv(hierarchical_topics_name, index = None)
550
- output_list.append(hierarchical_topics_name)
551
-
552
-
553
- #try:
554
- topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, hide_annotations=True, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
555
- topics_vis_2 = visualize_hierarchy_custom(topic_model, hierarchical_topics=hierarchical_topics, width= 1200, height = 750)
 
 
 
 
556
 
557
  # Write hierarchical topics levels to df
558
- hierarchy_df_name = data_file_name_no_ext + '_' + 'hierarchy_topics_df_' + today_rev + '.csv'
559
  hierarchy_df.to_csv(hierarchy_df_name, index = None)
560
  output_list.append(hierarchy_df_name)
561
 
562
  # Write hierarchical topics names to df
563
- hierarchy_topic_names_name = data_file_name_no_ext + '_' + 'hierarchy_topics_names_' + today_rev + '.csv'
564
  hierarchy_topic_names.to_csv(hierarchy_topic_names_name, index = None)
565
  output_list.append(hierarchy_topic_names_name)
566
 
567
- #except:
568
- # error_message = "Visualisation preparation failed. Perhaps you need more topics to create the full hierarchy (more than 10)?"
569
- # return error_message, output_list, None, None
570
 
571
- topics_vis_name = data_file_name_no_ext + '_' + 'vis_hierarchy_topic_doc_' + today_rev + '.html'
572
  topics_vis.write_html(topics_vis_name)
573
  output_list.append(topics_vis_name)
574
 
575
- topics_vis_2_name = data_file_name_no_ext + '_' + 'vis_hierarchy_' + today_rev + '.html'
576
  topics_vis_2.write_html(topics_vis_2_name)
577
  output_list.append(topics_vis_2_name)
578
 
579
  all_toc = time.perf_counter()
580
- time_out = f"Creating visualisation took {all_toc - vis_tic:0.1f} seconds"
581
- print(time_out)
 
 
582
 
583
- return time_out, output_list, topics_vis, topics_vis_2
 
 
584
 
585
- def save_as_pytorch_model(topic_model, data_file_name_no_ext , progress=gr.Progress(track_tqdm=True)):
 
 
 
 
 
 
 
586
 
587
  if not topic_model:
588
- return "No Pytorch model found.", None
 
589
 
590
  progress(0, desc= "Saving topic model in Pytorch format")
591
 
592
- output_list = []
593
-
594
-
595
- topic_model_save_name_folder = "output_model/" + data_file_name_no_ext + "_topics_" + today_rev# + ".safetensors"
596
  topic_model_save_name_zip = topic_model_save_name_folder + ".zip"
597
 
598
  # Clear folder before replacing files
@@ -600,9 +781,10 @@ def save_as_pytorch_model(topic_model, data_file_name_no_ext , progress=gr.Progr
600
 
601
  topic_model.save(topic_model_save_name_folder, serialization='pytorch', save_embedding_model=True, save_ctfidf=False)
602
 
603
- # Zip file example
604
-
605
  zip_folder(topic_model_save_name_folder, topic_model_save_name_zip)
606
  output_list.append(topic_model_save_name_zip)
607
 
608
- return "Model saved in Pytorch format.", output_list
 
 
 
8
  import time
9
  from bertopic import BERTopic
10
 
11
+ from typing import List, Type, Union
12
+ PandasDataFrame = Type[pd.DataFrame]
13
+
14
+ from funcs.clean_funcs import initial_clean, regex_clean
15
  from funcs.anonymiser import expand_sentences_spacy
16
+ from funcs.helper_functions import read_file, zip_folder, delete_files_in_folder, save_topic_outputs, output_folder
17
+ from funcs.embeddings import make_or_load_embeddings, torch_device
18
  from funcs.bertopic_vis_documents import visualize_documents_custom, visualize_hierarchical_documents_custom, hierarchical_topics_custom, visualize_hierarchy_custom
19
+ from funcs.representation_model import create_representation_model, llm_config, chosen_start_tag, random_seed
20
 
21
+ from sklearn.feature_extraction.text import CountVectorizer
22
 
23
  from sentence_transformers import SentenceTransformer
24
  from sklearn.pipeline import make_pipeline
 
27
  import funcs.anonymiser as anon
28
  from umap import UMAP
29
 
30
+ # Default UMAP options - these can be changed via the number selection components on the options page
31
+ umap_n_neighbours = 15
32
+ umap_min_dist = 0.0
33
+ umap_metric = 'cosine'
 
34
 
35
  today = datetime.now().strftime("%d%m%Y")
36
  today_rev = datetime.now().strftime("%Y%m%d")
 
42
  hf_model_name = "QuantFactory/Phi-3-mini-128k-instruct-GGUF"#'second-state/stablelm-2-zephyr-1.6b-GGUF' #'TheBloke/phi-2-orange-GGUF' #'NousResearch/Nous-Capybara-7B-V1.9-GGUF'
43
  hf_model_file = "Phi-3-mini-128k-instruct.Q4_K_M.gguf"#'stablelm-2-zephyr-1_6b-Q5_K_M.gguf' # 'phi-2-orange.Q5_K_M.gguf' #'Capybara-7B-V1.9-Q5_K_M.gguf'
44
 
45
+ # When topic modelling column is chosen, change the default visualisation column to the same
46
+ def change_default_vis_col(in_colnames:List[str]):
47
+ '''
48
+ When topic modelling column is chosen, change the default visualisation column to the same
49
+ '''
50
+ if in_colnames:
51
+ return gr.Dropdown(value=in_colnames[0])
52
+ else:
53
+ return gr.Dropdown()
54
+
55
+ def pre_clean(data: pd.DataFrame, in_colnames: list, data_file_name_no_ext: str, custom_regex: pd.DataFrame, clean_text: str, drop_duplicate_text: str, anonymise_drop: str, sentence_split_drop: str, embeddings_state: dict, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
56
+ """
57
+ Pre-processes the input data by cleaning text, removing duplicates, anonymizing data, and splitting sentences based on the provided options.
58
+
59
+ Args:
60
+ data (pd.DataFrame): The input data to be cleaned.
61
+ in_colnames (list): List of column names to be used for cleaning and finding topics.
62
+ data_file_name_no_ext (str): The base name of the data file without extension.
63
+ custom_regex (pd.DataFrame): Custom regex patterns for initial cleaning.
64
+ clean_text (str): Option to clean text ("Yes" or "No").
65
+ drop_duplicate_text (str): Option to drop duplicate text ("Yes" or "No").
66
+ anonymise_drop (str): Option to anonymize data ("Yes" or "No").
67
+ sentence_split_drop (str): Option to split text into sentences ("Yes" or "No").
68
+ embeddings_state (dict): State of the embeddings.
69
+ progress (gr.Progress, optional): Progress tracker for the cleaning process.
70
+
71
+ Returns:
72
+ tuple: A tuple containing the output message, the list of output files, the cleaned data, the updated file name, and a reset (empty) embeddings array.
73
+ """
74
 
75
  output_text = ""
76
  output_list = []
 
80
  if not in_colnames:
81
  error_message = "Please enter one column name to use for cleaning and finding topics."
82
  print(error_message)
83
+ return error_message, None, data, data_file_name_no_ext, embeddings_state
84
 
85
  all_tic = time.perf_counter()
86
 
 
93
  clean_tic = time.perf_counter()
94
  print("Starting data clean.")
95
 
96
+ data[in_colnames_list_first] = initial_clean(data[in_colnames_list_first], [])
97
 
98
+ if '_clean' not in data_file_name_no_ext:
99
+ data_file_name_no_ext = data_file_name_no_ext + "_clean"
 
 
100
 
101
  clean_toc = time.perf_counter()
102
  clean_time_out = f"Cleaning the text took {clean_toc - clean_tic:0.1f} seconds."
103
  print(clean_time_out)
104
 
105
+ # Clean custom regex if exists
106
+ if not custom_regex.empty:
107
+ data[in_colnames_list_first] = regex_clean(data[in_colnames_list_first], custom_regex.iloc[:, 0].to_list())
108
+
109
+ if '_clean' not in data_file_name_no_ext:
110
+ data_file_name_no_ext = data_file_name_no_ext + "_clean"
111
+
112
+
113
  if drop_duplicate_text == "Yes":
114
  progress(0.3, desc= "Drop duplicates - remove short texts")
115
 
 
126
  if anonymise_drop == "Yes":
127
  progress(0.6, desc= "Anonymising data")
128
 
129
+ if '_anon' not in data_file_name_no_ext:
130
+ data_file_name_no_ext = data_file_name_no_ext + "_anon"
131
 
132
  anon_tic = time.perf_counter()
133
 
 
143
  if sentence_split_drop == "Yes":
144
  progress(0.6, desc= "Splitting text into sentences")
145
 
146
+ if '_split' not in data_file_name_no_ext:
147
+ data_file_name_no_ext = data_file_name_no_ext + "_split"
148
 
149
  anon_tic = time.perf_counter()
150
 
151
  data = expand_sentences_spacy(data, in_colnames_list_first)
152
+ data = data[data[in_colnames_list_first].str.len() >= 25] # Keep only rows with at least 25 characters
153
+ data.reset_index(inplace=True, drop=True)
154
 
155
  anon_toc = time.perf_counter()
156
  time_out = f"Splitting text into sentences took {anon_toc - anon_tic:0.1f} seconds"
157
 
158
+ out_data_name = output_folder + data_file_name_no_ext + "_" + today_rev + ".csv"
159
  data.to_csv(out_data_name)
160
  output_list.append(out_data_name)
161
 
 
165
 
166
  output_text = "Data clean completed."
167
 
168
+ # Overwrite existing embeddings as they will likely have changed
169
+ return output_text, output_list, data, data_file_name_no_ext, np.array([])
170
+
171
+ def optimise_zero_shot():
172
+ """
173
+ Return options that optimise the topic model to keep only zero-shot topics as the main topics
174
+ """
175
+ return gr.Dropdown(value="Yes"), gr.Slider(value=2), gr.Slider(value=2), gr.Slider(value=0.01), gr.Slider(value=0.95), gr.Slider(value=0.55)
176
+
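A sketch of how the six returned component updates are expected to be wired in the Gradio app; the button name here is a hypothetical placeholder, and the output ordering is an assumption based on the values returned above:

# optimise_btn.click(
#     optimise_zero_shot,
#     inputs=None,
#     outputs=[split_sentence_drop, min_docs_slider, max_topics_slider,
#              min_word_occurence_slider, max_word_occurence_slider,
#              zero_shot_similarity],
# )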
177
+ def extract_topics(
178
+ data: pd.DataFrame,
179
+ in_files: list,
180
+ min_docs_slider: int,
181
+ in_colnames: list,
182
+ max_topics_slider: int,
183
+ candidate_topics: list,
184
+ data_file_name_no_ext: str,
185
+ custom_labels_df: pd.DataFrame,
186
+ return_intermediate_files: str,
187
+ embeddings_super_compress: str,
188
+ high_quality_mode: str,
189
+ save_topic_model: str,
190
+ embeddings_out: np.ndarray,
191
+ embeddings_type_state: str,
192
+ zero_shot_similarity: float,
193
+ calc_probs: str,
194
+ vectoriser_state: CountVectorizer,
195
+ min_word_occurence_slider: float,
196
+ max_word_occurence_slider: float,
197
+ split_sentence_drop: str,
198
+ random_seed: int = random_seed,
199
+ output_folder: str = output_folder,
200
+ umap_n_neighbours:int = umap_n_neighbours,
201
+ umap_min_dist:float = umap_min_dist,
202
+ umap_metric:str = umap_metric,
203
+ progress: gr.Progress = gr.Progress(track_tqdm=True)
204
+ ) -> tuple:
205
+ """
206
+ Extract topics from the given data using various parameters and settings.
207
+
208
+ Args:
209
+ data (pd.DataFrame): The input data.
210
+ in_files (list): List of input files.
211
+ min_docs_slider (int): Minimum number of similar documents needed to make a topic.
212
+ in_colnames (list): List of column names to use for cleaning and finding topics.
213
+ max_topics_slider (int): Maximum number of topics.
214
+ candidate_topics (list): List of candidate topics.
215
+ data_file_name_no_ext (str): Data file name without extension.
216
+ custom_labels_df (pd.DataFrame): DataFrame containing custom labels.
217
+ return_intermediate_files (str): Whether to return intermediate files.
218
+ embeddings_super_compress (str): Whether to round embeddings to three decimal places.
219
+ high_quality_mode (str): Whether to use high quality (transformers based) embeddings.
220
+ save_topic_model (str): Whether to save the topic model.
221
+ embeddings_out (np.ndarray): Output embeddings.
222
+ embeddings_type_state (str): State of the embeddings type.
223
+ zero_shot_similarity (float): Zero-shot similarity threshold.
224
+ random_seed (int): Random seed for reproducibility.
225
+ calc_probs (str): Whether to calculate all topic probabilities.
226
+ vectoriser_state (CountVectorizer): Vectorizer state.
227
+ min_word_occurence_slider (float): Minimum word occurrence slider value.
228
+ max_word_occurence_slider (float): Maximum word occurrence slider value.
229
+ split_sentence_drop (str): Whether to split open text into sentences.
230
+ output_folder (str, optional): Output folder. Defaults to output_folder.
232
+ umap_n_neighbours (int): Nearest neighbours value for UMAP.
233
+ umap_min_dist (float): Minimum distance for UMAP.
234
+ umap_metric (str): Metric for UMAP.
235
+ progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress(track_tqdm=True).
236
+
237
+ Returns:
238
+ tuple: A tuple containing the output text, list of output files, embeddings, embeddings type state, data file name without extension, topic model, documents, vectoriser model, and assigned topics.
239
+ """
240
  all_tic = time.perf_counter()
241
 
242
  progress(0, desc= "Loading data")
243
 
244
+ vectoriser_state = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=min_word_occurence_slider, max_df=max_word_occurence_slider)
245
+
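Worth noting: sklearn reads float min_df/max_df values as proportions of documents (integers would be absolute counts), so the slider values act as proportions here. A small self-contained illustration:

from sklearn.feature_extraction.text import CountVectorizer

demo_docs = ["the delivery was late", "late delivery again", "great service overall"]
# min_df=0.4: keep terms appearing in at least 40% of documents (here, 2 of 3);
# max_df=0.95: drop terms appearing in more than 95% of documents.
demo_vectoriser = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=0.4, max_df=0.95)
demo_vectoriser.fit(demo_docs)
print(sorted(demo_vectoriser.vocabulary_))  # ['delivery', 'late']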
246
  output_list = []
247
  file_list = [string.name for string in in_files]
248
 
 
265
  # Check if embeddings are being loaded in
266
  progress(0.2, desc= "Loading/creating embeddings")
267
 
 
268
 
269
+ if high_quality_mode == "Yes":
270
+ print("Using high quality embedding model")
271
 
272
  # Define a list of possible local locations to search for the model
273
  local_embeddings_locations = [
 
299
  embeddings_type_state = "large"
300
 
301
  # UMAP model uses Bertopic defaults
302
+ umap_model = UMAP(n_neighbors=umap_n_neighbours, n_components=5, min_dist=umap_min_dist, metric=umap_metric, low_memory=False, random_state=random_seed)
303
 
304
  else:
305
  print("Choosing low resource TF-IDF model.")
 
317
 
318
  #umap_model = TruncatedSVD(n_components=5, random_state=random_seed)
319
  # UMAP model uses Bertopic defaults
320
+ umap_model = UMAP(n_neighbors=umap_n_neighbours, n_components=5, min_dist=umap_min_dist, metric=umap_metric, low_memory=True, random_state=random_seed)
321
 
322
+ embeddings_out = make_or_load_embeddings(docs, file_list, embeddings_out, embedding_model, embeddings_super_compress, high_quality_mode)
323
 
324
  # This is saved as a Gradio state object
325
  vectoriser_model = vectoriser_state
 
344
 
345
  if calc_probs == True:
346
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
347
+ topics_probs_out_name = output_folder + "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
348
  topics_probs_out.to_csv(topics_probs_out_name)
349
  output_list.append(topics_probs_out_name)
350
 
 
352
  print(error)
353
  print(fail_error_message)
354
 
355
+ out_fail_error_message = '\n'.join([fail_error_message, str(error)])
356
+
357
+ return out_fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
358
 
359
 
360
  # Do this if you have pre-defined topics
361
  else:
362
+ #if high_quality_mode == "No":
363
+ # error_message = "Zero shot topic modelling currently not compatible with low-resource embeddings. Please change this option to 'No' on the options tab and retry."
364
+ # print(error_message)
365
 
366
+ # return error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
367
 
368
  zero_shot_topics = read_file(candidate_topics.name)
369
  zero_shot_topics_lower = list(zero_shot_topics.iloc[:, 0].str.lower())
370
 
371
+ print("Zero shot topics are:", zero_shot_topics_lower)
372
+
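The candidate topics file is read with read_file and only its first column is used, lower-cased. A minimal illustrative zero_shot_topics.csv (the file name and topic values are examples, not from this repo):

# zero_shot_topics.csv
# topic
# Housing
# Transport
# Waste collection

import pandas as pd
zero_shot_topics = pd.DataFrame({"topic": ["Housing", "Transport", "Waste collection"]})
zero_shot_topics_lower = list(zero_shot_topics.iloc[:, 0].str.lower())
print(zero_shot_topics_lower)  # ['housing', 'transport', 'waste collection']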
373
 
374
  try:
375
  topic_model = BERTopic( embedding_model=embedding_model, #embedding_model_pipe, # for Jina
 
386
 
387
  if calc_probs == True:
388
  topics_probs_out = pd.DataFrame(topic_model.probabilities_)
389
+ topics_probs_out_name = output_folder + "topic_full_probs_" + data_file_name_no_ext + "_" + today_rev + ".csv"
390
  topics_probs_out.to_csv(topics_probs_out_name)
391
  output_list.append(topics_probs_out_name)
392
 
 
394
  print("An exception occurred:", error)
395
  print(fail_error_message)
396
 
397
+ out_fail_error_message = '\n'.join([fail_error_message, str(error)])
398
+
399
+ return out_fail_error_message, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, None, docs, vectoriser_model, []
400
 
401
  # For some reason, zero topic modelling exports assigned topics as a np.array instead of a list. Converting it back here.
402
  if isinstance(assigned_topics, np.ndarray):
403
  assigned_topics = assigned_topics.tolist()
404
 
 
 
405
  # Zero shot modelling is a model merge, which wipes the c_tf_idf part of the resulting model completely. To get hierarchical modelling to work, we need to recreate this part of the model with the CountVectorizer options used to create the initial model. Since with zero shot we are merging two models that have exactly the same set of documents, the vocabulary should be the same, so recreating the c_tf_idf component in this way shouldn't be a problem. Discussion here, with the code below based on Maarten's suggestion: https://github.com/MaartenGr/BERTopic/issues/1700
406
 
407
  # Get document info
 
410
  documents_per_topic = doc_dets.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
411
 
412
  # Assign CountVectorizer to merged model
 
413
  topic_model.vectorizer_model = vectoriser_model
414
 
415
  # Re-calculate c-TF-IDF
416
  c_tf_idf, _ = topic_model._c_tf_idf(documents_per_topic)
417
  topic_model.c_tf_idf_ = c_tf_idf
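An illustrative sanity check for the recreated matrix (not part of the commit):

# One row per topic group (including the outlier topic -1) means downstream
# hierarchical functions that read topic_model.c_tf_idf_ stay aligned.
assert topic_model.c_tf_idf_.shape[0] == len(documents_per_topic)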
418
 
 
 
 
419
  # Check we have topics
420
  if not assigned_topics:
421
  return "No topics found.", output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model,[]
 
423
  print("Topic model created.")
424
 
425
  # Tidy up topic label format a bit to have commas and spaces by default
426
+ if not candidate_topics:
427
+ print("Zero shot topics found, so not renaming")
428
+ new_topic_labels = topic_model.generate_topic_labels(nr_words=3, separator=", ")
429
+ topic_model.set_topic_labels(new_topic_labels)
430
+ if candidate_topics:
431
+ print("Custom labels:", topic_model.custom_labels_)
432
+ print("Topic labels:", topic_model.topic_labels_)
433
+ topic_model.set_topic_labels(topic_model.topic_labels_)
434
 
435
  # Replace current topic labels if new ones loaded in
436
  if not custom_labels_df.empty:
 
442
  print("Custom topics: ", topic_model.custom_labels_)
443
 
444
  # Outputs
445
+ output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
446
 
447
  # If you want to save your embedding files
448
  if return_intermediate_files == "Yes":
449
  print("Saving embeddings to file")
450
+ if high_quality_mode == "Yes":
451
+ embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'tfidf_embeddings.npz'
452
  else:
453
  if embeddings_super_compress == "No":
454
+ embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'large_embeddings.npz'
455
  else:
456
+ embeddings_file_name = output_folder + data_file_name_no_ext + '_' + 'large_embeddings_compress.npz'
457
 
458
  np.savez_compressed(embeddings_file_name, embeddings_out)
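An illustrative snippet (assumed usage, not in the commit) showing how a saved file can be reloaded later to skip the embedding step, as described in the README:

import numpy as np

with np.load(embeddings_file_name) as npz:
    embeddings_out = npz["arr_0"]  # np.savez_compressed stores positional arrays under "arr_0"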
459
 
 
465
 
466
  return output_text, output_list, embeddings_out, embeddings_type_state, data_file_name_no_ext, topic_model, docs, vectoriser_model, assigned_topics
467
 
468
+ def reduce_outliers(topic_model: BERTopic, docs: List[str], embeddings_out: np.ndarray, data_file_name_no_ext: str, assigned_topics: Union[np.ndarray, List[int]], vectoriser_model: CountVectorizer, save_topic_model: str, split_sentence_drop: str, data: PandasDataFrame, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
469
+ """
470
+ Reduce outliers in the topic model and update the topic representation.
471
+
472
+ Args:
473
+ topic_model (BERTopic): The BERTopic topic model to be used.
474
+ docs (List[str]): List of documents.
475
+ embeddings_out (np.ndarray): Output embeddings.
476
+ data_file_name_no_ext (str): Data file name without extension.
477
+ assigned_topics (Union[np.ndarray, List[int]]): Assigned topics.
478
+ vectoriser_model (CountVectorizer): Vectorizer model.
479
+ save_topic_model (str): Whether to save the topic model.
480
+ split_sentence_drop (str): Dropdown result indicating whether sentences have been split.
481
+ data (PandasDataFrame): The input dataframe
482
+ progress (gr.Progress, optional): Progress tracker. Defaults to gr.Progress(track_tqdm=True).
483
+
484
+ Returns:
485
+ tuple: A tuple containing the output text, output list, and the updated topic model.
486
+ """
487
 
488
  progress(0, desc= "Preparing data")
489
 
 
491
 
492
  all_tic = time.perf_counter()
493
 
 
 
 
494
  if isinstance(assigned_topics, np.ndarray):
495
  assigned_topics = assigned_topics.tolist()
496
 
 
497
  # Reduce outliers if required, then update representation
498
  progress(0.2, desc= "Reducing outliers")
499
  print("Reducing outliers.")
 
511
 
512
  print("Finished reducing outliers.")
513
 
514
  # Outputs
515
  progress(0.9, desc= "Saving to file")
516
+ output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
517
 
518
  all_toc = time.perf_counter()
519
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
 
521
 
522
  return output_text, output_list, topic_model
523
 
524
+ def represent_topics(topic_model: BERTopic, docs: List[str], data_file_name_no_ext: str, high_quality_mode: str, save_topic_model: str, representation_type: str, vectoriser_model: CountVectorizer, split_sentence_drop: str, data: PandasDataFrame, progress: gr.Progress = gr.Progress(track_tqdm=True)) -> tuple:
525
+ """
526
+ Represents topics using the specified representation model and updates the topic labels accordingly.
527
+
528
+ Args:
529
+ topic_model (BERTopic): The topic model to be used.
530
+ docs (List[str]): List of documents to be processed.
531
+ data_file_name_no_ext (str): The base name of the data file without extension.
532
+ high_quality_mode (str): Whether to use high quality (transformers based) embeddings.
533
+ save_topic_model (str): Whether to save the topic model.
534
+ representation_type (str): The type of representation model to be used.
535
+ vectoriser_model (CountVectorizer): The vectorizer model to be used.
536
+ split_sentence_drop (str): Dropdown result indicating whether sentences have been split.
537
+ data (PandasDataFrame): The input dataframe
538
+ progress (gr.Progress, optional): Progress tracker for the process. Defaults to gr.Progress(track_tqdm=True).
539
+
540
+ Returns:
541
+ tuple: A tuple containing the output text, output list, and the updated topic model.
542
+ """
543
 
544
  output_list = []
545
 
546
  all_tic = time.perf_counter()
547
 
548
+ # Load in representation model
549
+
550
+ progress(0.1, desc= "Loading model and creating new topic representation")
551
 
552
+ representation_model = create_representation_model(representation_type, llm_config, hf_model_name, hf_model_file, chosen_start_tag, high_quality_mode)
553
 
554
  progress(0.3, desc= "Updating existing topics")
555
  topic_model.update_topics(docs, vectorizer_model=vectoriser_model, representation_model=representation_model)
 
561
  llm_labels = [label[0].split("\n")[0] for label in topic_dets["LLM"]]
562
  topic_model.set_topic_labels(llm_labels)
563
 
564
+ label_list_file_name = output_folder + data_file_name_no_ext + '_llm_topic_list_' + today_rev + '.csv'
565
 
566
  llm_labels_df = pd.DataFrame(data={"Label":llm_labels})
567
  llm_labels_df.to_csv(label_list_file_name, index=None)
 
574
 
575
  # Outputs
576
  progress(0.8, desc= "Saving outputs")
577
+ output_list, output_text = save_topic_outputs(topic_model, data_file_name_no_ext, output_list, docs, save_topic_model, data, split_sentence_drop)
578
 
579
  all_toc = time.perf_counter()
580
  time_out = f"All processes took {all_toc - all_tic:0.1f} seconds"
 
582
 
583
  return output_text, output_list, topic_model
584
 
585
+ def visualise_topics(
586
+ topic_model: BERTopic,
587
+ data: pd.DataFrame,
588
+ data_file_name_no_ext: str,
589
+ high_quality_mode: str,
590
+ embeddings_out: np.ndarray,
591
+ in_label: List[str],
592
+ in_colnames: List[str],
593
+ legend_label: str,
594
+ sample_prop: float,
595
+ visualisation_type_radio: str,
596
+ random_seed: int = random_seed,
597
+ umap_n_neighbours: int = umap_n_neighbours,
598
+ umap_min_dist: float = umap_min_dist,
599
+ umap_metric: str = umap_metric,
600
+ progress: gr.Progress = gr.Progress(track_tqdm=True)
601
+ ) -> tuple:
602
+ """
603
+ Visualize topics using the provided topic model and data.
604
+
605
+ Args:
606
+ topic_model (BERTopic): The topic model to be used for visualization.
607
+ data (pd.DataFrame): The input data containing the documents.
608
+ data_file_name_no_ext (str): The base name of the data file without extension.
609
+ high_quality_mode (str): Whether to use high quality mode for embeddings.
610
+ embeddings_out (np.ndarray): The output embeddings.
611
+ in_label (List[str]): List of labels for the input data.
612
+ in_colnames (List[str]): List of column names in the input data.
613
+ legend_label (str): The label to be used in the legend.
614
+ sample_prop (float): The proportion of data to sample for visualization.
615
+ visualisation_type_radio (str): The type of visualization to be used.
616
+ random_seed (int, optional): Random seed for reproducibility. Defaults to random_seed.
617
+ umap_n_neighbours (int, optional): Number of neighbors for UMAP. Defaults to umap_n_neighbours.
618
+ umap_min_dist (float, optional): Minimum distance for UMAP. Defaults to umap_min_dist.
619
+ umap_metric (str, optional): Metric for UMAP. Defaults to umap_metric.
620
+ progress (gr.Progress, optional): Progress tracker for the process. Defaults to gr.Progress(track_tqdm=True).
621
+
622
+ Returns:
623
+ tuple: A tuple containing the output message, the list of output files, and the two visualisation figures.
624
+ """
625
 
626
  progress(0, desc= "Preparing data for visualisation")
627
 
628
  output_list = []
629
+ output_message = []
630
  vis_tic = time.perf_counter()
631
 
632
 
 
662
  topic_model.set_topic_labels(labels)
663
 
664
  # Pre-reduce embeddings for visualisation purposes
665
+ if high_quality_mode == "Yes":
666
+ reduced_embeddings = UMAP(n_neighbors=umap_n_neighbours, n_components=2, min_dist=umap_min_dist, metric=umap_metric, random_state=random_seed).fit_transform(embeddings_out)
667
  else:
668
  reduced_embeddings = TruncatedSVD(2, random_state=random_seed).fit_transform(embeddings_out)
669
 
670
+ progress(0.3, desc= "Creating visualisations")
671
  # Visualise the topics:
672
 
673
+ print("Creating visualisations")
 
 
674
 
675
  if visualisation_type_radio == "Topic document graph":
676
+ try:
677
+ topics_vis = visualize_documents_custom(topic_model, docs, hover_labels = label_list, reduced_embeddings=reduced_embeddings, hide_annotations=True, hide_document_hover=False, custom_labels=True, sample = sample_prop, width= 1200, height = 750)
678
 
679
+ topics_vis_name = output_folder + data_file_name_no_ext + '_' + 'vis_topic_docs_' + today_rev + '.html'
680
+ topics_vis.write_html(topics_vis_name)
681
+ output_list.append(topics_vis_name)
682
+ except Exception as e:
683
+ print(e)
684
+ output_message = str(e)
685
+ return output_message, output_list, None, None
686
 
687
+ try:
688
+ topics_vis_2 = topic_model.visualize_heatmap(custom_labels=True, width= 1200, height = 1200)
689
 
690
+ topics_vis_2_name = output_folder + data_file_name_no_ext + '_' + 'vis_heatmap_' + today_rev + '.html'
691
+ topics_vis_2.write_html(topics_vis_2_name)
692
+ output_list.append(topics_vis_2_name)
693
+ except Exception as e:
694
+ print(e)
695
+ output_message.append(str(e))
696
 
697
  elif visualisation_type_radio == "Hierarchical view":
698
 
 
701
  # Print topic tree - may get encoding errors, so doing try except
702
  try:
703
  tree = topic_model.get_topic_tree(hierarchical_topics, tight_layout = True)
704
+ tree_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_tree_' + today_rev + '.txt'
705
 
706
  with open(tree_name, "w") as file:
707
  # Write the string to the file
 
709
 
710
  output_list.append(tree_name)
711
 
712
+ except Exception as e:
713
+ new_out_message = "An exception occurred when making topic tree document, skipped:" + str(e)
714
+ output_message.append(str(new_out_message))
715
+ print(new_out_message)
716
 
717
 
718
  # Save new hierarchical topic model to file
719
+ try:
720
+ hierarchical_topics_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_topics_dist_' + today_rev + '.csv'
721
+ hierarchical_topics.to_csv(hierarchical_topics_name, index = None)
722
+ output_list.append(hierarchical_topics_name)
723
+
724
+ topics_vis, hierarchy_df, hierarchy_topic_names = visualize_hierarchical_documents_custom(topic_model, docs, label_list, hierarchical_topics, hide_annotations=True, reduced_embeddings=reduced_embeddings, sample = sample_prop, hide_document_hover= False, custom_labels=True, width= 1200, height = 750)
725
+ topics_vis_2 = visualize_hierarchy_custom(topic_model, hierarchical_topics=hierarchical_topics, width= 1200, height = 750)
726
+ except Exception as e:
727
+ new_out_message = "An exception occurred when making hierarchical topic visualisation:" + str(e) + ". Maybe your model doesn't have enough topics to create a hierarchy?"
728
+ output_message.append(str(new_out_message))
729
+ print(new_out_message)
730
+ return new_out_message, output_list, None, None
731
 
732
  # Write hierarchical topics levels to df
733
+ hierarchy_df_name = output_folder + data_file_name_no_ext + '_' + 'hierarchy_topics_df_' + today_rev + '.csv'
734
  hierarchy_df.to_csv(hierarchy_df_name, index = None)
735
  output_list.append(hierarchy_df_name)
736
 
737
  # Write hierarchical topics names to df
738
+ hierarchy_topic_names_name = output_folder + data_file_name_no_ext + '_' + 'hierarchy_topics_names_' + today_rev + '.csv'
739
  hierarchy_topic_names.to_csv(hierarchy_topic_names_name, index = None)
740
  output_list.append(hierarchy_topic_names_name)
741
 
 
 
 
742
 
743
+ topics_vis_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_topic_doc_' + today_rev + '.html'
744
  topics_vis.write_html(topics_vis_name)
745
  output_list.append(topics_vis_name)
746
 
747
+ topics_vis_2_name = output_folder + data_file_name_no_ext + '_' + 'vis_hierarchy_' + today_rev + '.html'
748
  topics_vis_2.write_html(topics_vis_2_name)
749
  output_list.append(topics_vis_2_name)
750
 
751
  all_toc = time.perf_counter()
752
+ output_message.append(f"Creating visualisation took {all_toc - vis_tic:0.1f} seconds")
753
+ print(output_message)
754
+
755
+ return '\n'.join(output_message), output_list, topics_vis, topics_vis_2
756
 
757
+ def save_as_pytorch_model(topic_model: BERTopic, data_file_name_no_ext:str, progress=gr.Progress(track_tqdm=True)):
758
+ """
759
+ Save the topic model in Pytorch format and zip the output folder.
760
 
761
+ Args:
762
+ topic_model (BERTopic): The BERTopic topic model to be used.
763
+ data_file_name_no_ext (str): Document file name.
764
+ Returns:
765
+ tuple: A tuple containing the output text and output list.
766
+ """
767
+ output_list = []
768
+ output_message = ""
769
 
770
  if not topic_model:
771
+ output_message = "No Pytorch model found."
772
+ return output_message, None
773
 
774
  progress(0, desc= "Saving topic model in Pytorch format")
775
 
776
+ topic_model_save_name_folder = output_folder + data_file_name_no_ext + "_topics_" + today_rev# + ".safetensors"
 
 
 
777
  topic_model_save_name_zip = topic_model_save_name_folder + ".zip"
778
 
779
  # Clear folder before replacing files
 
781
 
782
  topic_model.save(topic_model_save_name_folder, serialization='pytorch', save_embedding_model=True, save_ctfidf=False)
783
 
784
+ # Zip file example
 
785
  zip_folder(topic_model_save_name_folder, topic_model_save_name_zip)
786
  output_list.append(topic_model_save_name_zip)
787
 
788
+ output_message = "Model saved in Pytorch format."
789
+
790
+ return output_message, output_list
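zip_folder comes from funcs.helper_functions and is not shown in this diff; a minimal stand-in with the same two-argument shape (an assumption about its behaviour, not the committed helper):

import shutil

def zip_folder_sketch(folder_path: str, output_zip_path: str) -> None:
    # shutil.make_archive appends ".zip" itself, so strip the extension first.
    base_name = output_zip_path.removesuffix(".zip")
    shutil.make_archive(base_name, "zip", root_dir=folder_path)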
requirements.txt CHANGED
@@ -1,8 +1,7 @@
1
- gradio
2
  transformers==4.41.2
3
  accelerate==0.26.1
4
  torch==2.3.1
5
- llama-cpp-python==0.2.79
6
  bertopic==0.16.2
7
  spacy==3.7.4
8
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
@@ -13,4 +12,6 @@ presidio_analyzer==2.2.354
13
  presidio_anonymizer==2.2.354
14
  scipy==1.11.4
15
  polars==0.20.6
16
- numpy==1.26.4
 
 
 
1
+ gradio # Version not pinned due to an interaction with spacy - reinstall the latest version after installing from requirements.txt
2
  transformers==4.41.2
3
  accelerate==0.26.1
4
  torch==2.3.1
 
5
  bertopic==0.16.2
6
  spacy==3.7.4
7
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
 
12
  presidio_anonymizer==2.2.354
13
  scipy==1.11.4
14
  polars==0.20.6
15
+ sentence-transformers==3.0.1
16
+ llama-cpp-python==0.2.79 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
17
+ numpy==1.26.4
requirements_gpu.txt CHANGED
@@ -1,7 +1,6 @@
1
- gradio
2
  transformers==4.41.2
3
  accelerate==0.26.1
4
- torch==2.3.1
5
  bertopic==0.16.2
6
  spacy==3.7.4
7
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
@@ -15,3 +14,4 @@ polars==0.20.6
15
  torch --index-url https://download.pytorch.org/whl/cu121
16
  llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
17
  numpy==1.26.4
 
 
1
+ gradio # Version not pinned due to an interaction with spacy - reinstall the latest version after installing from requirements.txt
2
  transformers==4.41.2
3
  accelerate==0.26.1
 
4
  bertopic==0.16.2
5
  spacy==3.7.4
6
  en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
 
14
  torch --index-url https://download.pytorch.org/whl/cu121
15
  llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
16
  numpy==1.26.4
17
+ sentence-transformers==3.0.1