BassemE committed on
Commit
50466ce
·
1 Parent(s): 5309dde
.DS_Store ADDED
Binary file (6.15 kB).
 
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model/cluster_assignments.json filter=lfs diff=lfs merge=lfs -text
+ model/interactive_som_map.html filter=lfs diff=lfs merge=lfs -text
+ model/som_model.pkl filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,158 @@
+ # Self-Organizing Map (SOM) Model for Document Clustering
+
+ A trained Self-Organizing Map model for clustering and visualizing high-dimensional document embeddings. This model was trained on technical documentation and can be used for document similarity analysis, topic discovery, and semantic clustering.
+
+ ## 📊 Model Details
+
+ - **Model Type**: Self-Organizing Map (SOM)
+ - **Training Data**: 11,412 records
+ - **Embedding Dimension**: 3,072 (OpenAI Large Embedding Model)
+ - **Number of Clusters**: 625
+ - **Grid Size**: 25x25
+ - **Learning Rate**: 0.1
+ - **Sigma**: 1.0
+
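The 625 clusters in the README's Model Details appear to be the nodes of the 25x25 grid, one cluster per node. The sketch below makes that arithmetic explicit and converts between grid coordinates and a flat cluster index. The helper names `bmu_to_cluster_id` and `cluster_id_to_bmu` are illustrative assumptions, not part of the published model.

```python
# Grid dimensions and record count as stated in the README.
GRID_ROWS, GRID_COLS = 25, 25
N_RECORDS = 11_412


def bmu_to_cluster_id(row, col, n_cols=GRID_COLS):
    """Flatten a (row, col) grid position into a single integer id (hypothetical helper)."""
    return row * n_cols + col


def cluster_id_to_bmu(cluster_id, n_cols=GRID_COLS):
    """Inverse of bmu_to_cluster_id: recover (row, col) from the flat id."""
    return divmod(cluster_id, n_cols)


n_clusters = GRID_ROWS * GRID_COLS          # one cluster per grid node
avg_docs_per_node = N_RECORDS / n_clusters  # mean occupancy if every node were used

print(n_clusters, round(avg_docs_per_node, 2))
```

With roughly 18 documents per node on average, many nodes will be small or empty, which is worth keeping in mind when reading the cluster-level metrics.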
+ ## 🎯 Use Cases
+
+ - **Document Clustering**: Group similar documents based on semantic similarity
+ - **Topic Discovery**: Identify common themes and topics in large document collections
+ - **Semantic Search**: Find related documents through vector similarity
+ - **Data Visualization**: Interactive visualization of document relationships
+ - **Knowledge Organization**: Structure and organize large knowledge bases
+
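The "Semantic Search" use case above relies on vector similarity. A minimal pure-Python sketch of cosine-similarity ranking follows. It assumes a hypothetical `doc_embeddings` mapping from document id to vector, which the README does not describe, and uses toy 3-dimensional vectors in place of the 3,072-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, doc_embeddings, top_k=5):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data; real inputs would be 3,072-dimensional embeddings.
toy_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print(ranking)
```

In practice the candidate set could first be narrowed to the documents in the query's SOM cluster and then ranked this way.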
+ ## 📁 Model Files
+
+ - `som_model.pkl`: Trained SOM model weights and parameters
+ - `cluster_assignments.json`: Document-to-cluster assignments for all 11,412 records
+ - `cluster_analysis.json`: Detailed analysis of each cluster including keywords and topics
+ - `interactive_som_map.html`: Interactive visualization of the SOM grid with cluster information
+
+ ## 🚀 Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install numpy scikit-learn minisom matplotlib plotly pandas
+ ```
+
+ ### Loading and Using the Model
+
+ ```python
+ import pickle
+ import json
+ import numpy as np
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ # Load the trained SOM model. Only unpickle files from a source you trust;
+ # the package that defines the pickled object (e.g. minisom) must be installed.
+ with open('som_model.pkl', 'rb') as f:
+     som_model = pickle.load(f)
+
+ # Load cluster assignments
+ with open('cluster_assignments.json', 'r') as f:
+     cluster_assignments = json.load(f)
+
+ # Load cluster analysis
+ with open('cluster_analysis.json', 'r') as f:
+     cluster_analysis = json.load(f)
+
+ # Example: Get cluster for a new document embedding
+ def get_cluster_for_embedding(embedding, som_model):
+     """Get the cluster assignment for a new document embedding"""
+     # Find the best matching unit (BMU)
+     bmu = som_model.winner(embedding)
+     return f"{bmu[0]},{bmu[1]}"
+
+ # Example: Find similar documents
+ def find_similar_documents(embedding, cluster_assignments, top_k=5):
+     """Find similar documents based on cluster membership"""
+     # Uses the module-level som_model loaded above
+     cluster = get_cluster_for_embedding(embedding, som_model)
+
+     # Get all documents in the same cluster
+     cluster_docs = [doc for doc, doc_cluster in cluster_assignments.items()
+                     if doc_cluster == cluster]
+
+     # The first top_k members are returned as-is, not ranked by similarity
+     return cluster_docs[:top_k]
+ ```
+
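The `winner` call in the README example returns the grid position of the best matching unit (BMU). The sketch below shows what such a lookup computes, assuming Euclidean distance and a nested-list weight grid. `find_bmu` and the toy weights are illustrative assumptions and do not reproduce the pickled model's internals.

```python
import math


def find_bmu(weights, embedding):
    """Return (row, col) of the grid node whose weight vector is closest to the embedding.

    `weights` is a nested list indexed as weights[row][col] -> weight vector.
    Distance is Euclidean, which is an assumption about the trained model.
    """
    best_pos, best_dist = None, math.inf
    for i, row in enumerate(weights):
        for j, node_weights in enumerate(row):
            dist = math.dist(node_weights, embedding)
            if dist < best_dist:
                best_pos, best_dist = (i, j), dist
    return best_pos


def cluster_key(bmu):
    """Format a BMU position as the "row,col" string used in the README example."""
    return f"{bmu[0]},{bmu[1]}"


# Hypothetical 2x2 grid with 2-dimensional weights, for illustration only.
toy_weights = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
bmu = find_bmu(toy_weights, [0.9, 0.1])
print(cluster_key(bmu))
```

A new embedding must have the same dimensionality as the node weights (3,072 for the published model) for the distance to be defined.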
+ ### Interactive Visualization
+
+ Open `interactive_som_map.html` in a web browser to explore the SOM grid interactively. The visualization shows:
+
+ - Cluster sizes and distributions
+ - Top keywords for each cluster
+ - Topic analysis
+ - Document counts per cluster
+
+ ## 📈 Model Performance
+
+ Based on the cluster analysis:
+
+ - **Total Documents**: 11,412
+ - **Total Clusters**: 625 (25x25 grid)
+ - **Silhouette Score**: -0.0078 (range -1 to 1; a value near 0 indicates overlapping clusters)
+ - **Calinski-Harabasz Score**: 13.69 (higher is better)
+ - **Davies-Bouldin Score**: 2.33 (lower is better)
+
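To make the silhouette figure above easier to interpret, here is a small pure-Python sketch of the mean silhouette coefficient. It assumes Euclidean distance (the README does not say which metric was used) and is quadratic in the number of points, so it is meant for illustration rather than for 11,412 embeddings.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight groups far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette_score(toy_points, [0, 0, 1, 1])
mixed_up = silhouette_score(toy_points, [0, 1, 0, 1])
print(well_separated, mixed_up)
```

Labels that match the geometry score close to 1, while labels that cut across it go negative. A score near zero, as reported in the README, sits between these two cases.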
+ ## 🔍 Cluster Analysis
+
+ The model identifies meaningful clusters with distinct topics. For example, one of the largest clusters (659 documents) focuses on:
+
+ - **Keywords**: connector, anypoint, mule, studio, connectors
+ - **Topics**: Configuration, API integration, MuleSoft platform usage
+
+ ## 🛠️ Advanced Usage
+
+ ### Custom Clustering
+
+ ```python
+ # Train a new SOM with different parameters
+ from minisom import MiniSom
+
+ def train_custom_som(embeddings, grid_size=(20, 20), sigma=1.0, learning_rate=0.1):
+     """Train a SOM on a 2-D array of embeddings (n_samples x n_features)"""
+     som = MiniSom(grid_size[0], grid_size[1], embeddings.shape[1],
+                   sigma=sigma, learning_rate=learning_rate, random_seed=42)
+     # 100 iterations draws only 100 random samples; raise this for larger datasets
+     som.train_random(embeddings, 100)
+     return som
+ ```
+
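The `sigma` and `learning_rate` parameters in the training example control how far and how strongly each sample pulls the grid. The sketch below shows a single SOM update step, assuming a Gaussian neighbourhood and no decay of either parameter. `som_update` is an illustrative name, and real implementations such as MiniSom typically decay both values over the course of training.

```python
import math


def som_update(weights, x, bmu, learning_rate=0.1, sigma=1.0):
    """Apply one in-place SOM update: move every node toward x, scaled by grid distance to the BMU.

    weights[row][col] is a mutable list holding that node's weight vector.
    """
    for i, row in enumerate(weights):
        for j, node_weights in enumerate(row):
            grid_dist_sq = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2
            # Gaussian neighbourhood: 1.0 at the BMU, shrinking with grid distance
            influence = math.exp(-grid_dist_sq / (2 * sigma ** 2))
            for k in range(len(node_weights)):
                node_weights[k] += learning_rate * influence * (x[k] - node_weights[k])
    return weights


# Hypothetical 1x2 grid with 1-dimensional weights.
toy_grid = [[[0.0], [0.0]]]
som_update(toy_grid, x=[1.0], bmu=(0, 0), learning_rate=0.1, sigma=1.0)
print(toy_grid)
```

With `sigma=1.0` the immediate neighbour receives about 61% of the BMU's update, so on a 25x25 grid each sample only affects a small patch of nodes.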
+ ### Cluster Analysis
+
+ ```python
+ def analyze_cluster(cluster_key, cluster_analysis):
+     """Get detailed information about a specific cluster"""
+     for cluster in cluster_analysis['top_clusters']:
+         if cluster['cluster_key'] == cluster_key:
+             return {
+                 'size': cluster['size'],
+                 'keywords': cluster['keywords'],
+                 'topics': cluster['topics']
+             }
+     return None
+ ```
+
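When many clusters are looked up, the linear scan in `analyze_cluster` can be replaced by a dictionary built once. The sketch below assumes the same `top_clusters` layout the README example reads. The sample record reuses the size and keywords quoted in the README, but its `cluster_key` value of `"0,0"` is a made-up placeholder.

```python
def build_cluster_index(cluster_analysis):
    """Map each cluster_key to its record so later lookups are O(1)."""
    return {
        cluster["cluster_key"]: cluster
        for cluster in cluster_analysis.get("top_clusters", [])
    }


# Hypothetical sample mirroring the fields used by analyze_cluster.
sample_analysis = {
    "top_clusters": [
        {
            "cluster_key": "0,0",  # placeholder key, not taken from the real file
            "size": 659,
            "keywords": ["connector", "anypoint", "mule", "studio", "connectors"],
            "topics": ["Configuration", "API integration", "MuleSoft platform usage"],
        }
    ]
}

index = build_cluster_index(sample_analysis)
hit = index.get("0,0")
miss = index.get("24,24")  # keys absent from the index return None
print(hit["size"], miss)
```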
+ ## 📚 Dependencies
+
+ - `numpy`: Numerical computations
+ - `scikit-learn`: Machine learning utilities
+ - `minisom`: Self-Organizing Map implementation
+ - `matplotlib`: Static plotting
+ - `plotly`: Interactive visualizations
+ - `pandas`: Data manipulation
+
+ ## 🤝 Contributing
+
+ This model is part of a larger document processing and clustering pipeline. For questions or contributions, please refer to the main project repository.
+
+ ## 📄 License
+
+ This model is provided for research and educational purposes. Please ensure compliance with the original data source licenses when using this model.
+
+ ## 🔗 Related Resources
+
+ - [Self-Organizing Maps Tutorial](https://en.wikipedia.org/wiki/Self-organizing_map)
+ - [MiniSom Documentation](https://github.com/JustGlowing/minisom)
+ - [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)
+
+ ---
+
+ **Note**: This model was trained on technical documentation and may be most effective for similar types of content. For best results, use input documents from the same domain, embedded as 3,072-dimensional vectors with the same embedding model, or retrain the SOM on your own data.
model/cluster_analysis.json ADDED
The diff for this file is too large to render.
 
model/cluster_assignments.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:18ac9901c5764461b5319be87acb55d8717776bc6a562b0f1f5bdea625436fc6
+ size 27423872
model/interactive_som_map.html ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9cd81ecc973da9c56e886184a0edecd1795604573c81625249fbd21e86d25f45
+ size 6325326
model/som_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8e5549124a1c24006f10a59a468bf7513f0027ee2f68320d32c4f547bc67d77a
+ size 317672167
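The three files above are committed as Git LFS pointers, so the repository itself stores only a version line, a SHA-256 object id, and a byte size. The sketch below parses such a pointer and converts the size to MiB. `parse_lfs_pointer` is an illustrative helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents for model/som_model.pkl, copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:8e5549124a1c24006f10a59a468bf7513f0027ee2f68320d32c4f547bc67d77a
size 317672167
"""

info = parse_lfs_pointer(pointer_text)
size_mib = info["size_bytes"] / (1024 * 1024)
print(info["hash_algo"], f"{size_mib:.1f} MiB")
```

The pickled model is by far the largest artifact at roughly 303 MiB, so a clone without Git LFS installed will contain only these small pointer files and `pickle.load` will fail on them.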