Module: wordcloud
NOTE: Brand new module! This has lots of random parameters that seem to make a good picture.
The hashtag tends to dominate the graph. I like that because it serves as a title or anchoring word. But some folks want to see the cloud without the hashtag itself dominating. So there's a config option, hashtag_fix, that takes one of 3 values: as-is, remove, or reduce. (The default if omitted is as-is.) In this section, I show the same data set from Kung-Fu Saturday, 7 December 2024, visualized 3 different ways.
Alt Text Generation
As of version 1.2.0, the wordcloud module now automatically generates descriptive alt text for each wordcloud image. This alt text includes:
- The hashtag and date of the analysis
- Total number of unique words in the wordcloud
- Top 10 most frequent words with their counts
- Information about hashtag treatment method (as-is, remove, reduce)
- List of any custom stop words that were used
The alt text is saved to a text file with the same name as the wordcloud image but with a .txt extension. For example, if the wordcloud is saved as wordcloud/wordcloud-monsterdon-20250409-as-is.png, the alt text will be saved as wordcloud/wordcloud-monsterdon-20250409-as-is.txt.
This feature makes the wordclouds more accessible and provides a quick summary of the key words from the visualization.
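Here is a minimal sketch of the naming rule and the top-10 summary described above. The helpers `alt_text_path` and `summarize_words` are hypothetical names made up for illustration; they are not functions in mastoscore, and the real alt text is worded differently.

```python
import os
from collections import Counter


def alt_text_path(image_path: str) -> str:
    """Return the image path with its extension swapped for .txt."""
    root, _extension = os.path.splitext(image_path)
    return root + ".txt"


def summarize_words(words: list[str], top_n: int = 10) -> str:
    """Build a short summary: the unique-word count plus the top_n words."""
    counts = Counter(word.lower() for word in words)
    top = ", ".join(f"{word}: {count}" for word, count in counts.most_common(top_n))
    return f"There were {len(counts)} unique words. Most frequent: {top}"


txt_path = alt_text_path("wordcloud/wordcloud-monsterdon-20250409-as-is.png")
summary = summarize_words(["Godzilla", "godzilla", "train"])
```

Counting on lowercased words keeps "Godzilla" and "godzilla" from showing up as two separate entries in the summary.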
Custom Stop Words
You can exclude specific words from appearing in your wordcloud by adding a stop_words parameter to the [wordcloud] section of your INI file. This is particularly useful for filtering out common words that aren't meaningful to your analysis.
To use this feature:
- Add a stop_words parameter to the [wordcloud] section of your INI file
- Provide a comma-separated list of words to exclude
For example:
[wordcloud]
graph_title = Wordcloud
font = /path/to/font.otf
size_x = 1280
size_y = 960
hashtag_fix = remove
stop_words = movie, film, watching, watch, tonight, scene, scenes, actor, actors
These words will be excluded from the wordcloud in addition to the default stop words and any other configured exclusions. This is especially useful for event-specific hashtags where certain common words might dominate the visualization without adding meaningful information.
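To make the parsing rule concrete, here is a small sketch of reading a comma-separated stop_words value with the standard library's ConfigParser. The helper name `read_stop_words` and the sample INI text are assumptions for illustration; the module itself does this inline.

```python
from configparser import ConfigParser

# Sample INI content, made up for this example.
INI_TEXT = """
[wordcloud]
hashtag_fix = remove
stop_words = movie, film, Watching, , scene
"""


def read_stop_words(config: ConfigParser) -> set[str]:
    """Return the custom stop words, lowercased, with blank entries dropped."""
    raw = config.get("wordcloud", "stop_words", fallback="")
    return {word.strip().lower() for word in raw.split(",") if word.strip()}


config = ConfigParser()
config.read_string(INI_TEXT)
custom_stop_words = read_stop_words(config)
```

Stripping and lowercasing each entry means that spacing and capitalization in the INI file don't matter, and a stray double comma is harmless.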
as-is
Leave the hashtag alone.

remove
Remove all instances of the hashtag.

reduce
Remove most (currently hard-coded at 90%) occurrences of the hashtag. It will still be popular enough to be quite large, but it won't dominate. In this example, "KungFuSat" is near the top right, in a dark purple.
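The reduce idea can be sketched as random thinning: every other word is kept, and each occurrence of the hashtag survives with a small probability. This is an illustration, not the module's code. The function name `reduce_hashtag`, the `keep_fraction` parameter, and the seeded generator are assumptions; the module hard-codes the ratio and uses random.randint().

```python
import random


def reduce_hashtag(words, hashtag, keep_fraction=0.1, rng=None):
    """Keep every non-hashtag word; keep each hashtag occurrence with
    probability keep_fraction."""
    rng = rng or random.Random()
    target = hashtag.lower()
    return [
        word for word in words
        # The random draw only happens when the word is the hashtag.
        if word.lower() != target or rng.random() < keep_fraction
    ]


words = ["KungFuSat"] * 1000 + ["kick"] * 50
thinned = reduce_hashtag(words, "kungfusat", rng=random.Random(42))
```

Because the thinning is random, the exact number of surviving hashtags varies from run to run unless the generator is seeded.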

Synopsis
mastoscore --debug=info ini/monsterdon-20241201.ini wordcloud
Creates an image named wordcloud/wordcloud-{hashtag}-{YYYYMMDD}-{hashtag_fix}.png and a matching .txt file containing its alt text.
A Word about Emoji
While it is possible to make a word cloud that includes emoji, it's a bit complicated. See, it really boils down to the font and matplotlib's support for fonts. I think a lot of fancy word processing systems use multiple fonts (one for text, one for rendering symbols like emoji). But matplotlib needs a single font that has everything you want in it. The only one I have found like that is Symbola, which is OK, but the words themselves look pretty terrible. I think the right answer is probably to build emoji support into word_cloud itself to give it some emoji awareness and then use a different font for emojis. For now, I'm just dropping all emojis and punctuation.
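Dropping emoji and punctuation falls out of the word pattern the module hands to WordCloud: it only matches runs of word characters and apostrophes. A small sketch with the standard re module shows the effect. The helper name `plain_words` is made up for this example.

```python
import re

# Equivalent to the pattern the module passes to WordCloud as `regexp`:
# one word character followed by one or more word characters or apostrophes.
NORMAL_WORD = re.compile(r"\w[\w']+")


def plain_words(text: str) -> list[str]:
    """Return only the word-like tokens, dropping emoji and punctuation."""
    return NORMAL_WORD.findall(text)


tokens = plain_words("It's alive 🎬 wow!! a")
# tokens == ["It's", "alive", "wow"]
```

The pattern needs at least two characters, so single-letter words like "a" disappear along with the emoji.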
Examples

Code Reference
Module to take the data in from analysis and produce graph files.
write_wordcloud(config)
This is the only function, for now. It invokes get_toots_df() to get the DataFrame. Then it discards basically everything other than the content column.
I post-process to remove some weird things (there's lots of emoji-like things). Depending on hashtag_fix, I also remove or thin out the hashtag itself, because otherwise it's obviously gonna have the highest frequency.
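The post-processing step can be sketched with the standard library alone. This assumes the HTML has already been stripped (the module uses BeautifulSoup for that), and the function name `tokenize` is hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split plain text into candidate words for the cloud."""
    # Drop URLs that are followed by a space.
    text = re.sub(r"http[^ ]+ ", " ", text)
    # Split on runs of spaces, '#', ',', '!' and '-', then filter.
    return [
        word for word in re.split(r"[ #,!-]+", text)
        if len(word) >= 3
        and not word.startswith("http")
        and not word.startswith("@")
    ]


tokens = tokenize("Watching #Monsterdon with @friend, so good! https://example.com/x now")
# tokens == ["Watching", "Monsterdon", "with", "good", "now"]
```

Splitting on '#' is what turns "#Monsterdon" into the plain word "Monsterdon", which is why the hashtag ends up counted like any other word.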
Parameters
- config: A ConfigParser object from the config module
Config Parameters Used
| Option | Description |
| ------- | ------- |
| graph:journalfile | Filename that forms the base of the graph's filename. |
| graph:journaldir | Directory where we will write the graph file |
| fetch:hashtag | Hashtag to search for |
| wordcloud:debug | Logging level for this module, as an integer |
| wordcloud:font | Path to a font file, such as Symbola |
| wordcloud:hashtag_fix | What to do with the main hashtag? 'reduce', 'remove', or 'as-is' |
| wordcloud:size_x | Width of the image in pixels. Default 1280 |
| wordcloud:size_y | Height of the image in pixels. Default 960 |
| wordcloud:stop_words | Comma-separated list of words to exclude |
| mastoscore:event_year | Year of the event (YYYY) |
| mastoscore:event_month | Month of the event (MM) |
| mastoscore:event_day | Day of the event (DD) |
Returns
None
Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png
Writes alt text description to wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
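As a sketch of how the two output names fit together, the following builds both paths from the config options listed above. The helper `output_paths` and the sample INI text are assumptions for illustration only.

```python
import os
from configparser import ConfigParser

# Sample INI content, made up for this example.
INI_TEXT = """
[fetch]
hashtag = monsterdon

[mastoscore]
event_year = 2025
event_month = 04
event_day = 09

[wordcloud]
hashtag_fix = as-is
"""


def output_paths(config: ConfigParser, out_dir: str = "wordcloud") -> tuple[str, str]:
    """Return (image_path, alt_text_path) for the configured event."""
    hashtag = config.get("fetch", "hashtag")
    hashtag_fix = config.get("wordcloud", "hashtag_fix", fallback="as-is")
    # The date parts are used as strings, so zero padding comes from the INI file.
    date_str = "".join(
        config.get("mastoscore", key)
        for key in ("event_year", "event_month", "event_day")
    )
    stem = f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}"
    return os.path.join(out_dir, stem + ".png"), os.path.join(out_dir, stem + ".txt")


config = ConfigParser()
config.read_string(INI_TEXT)
image_path, text_path = output_paths(config)
```

Since the date parts are concatenated as-is, a month written as 4 instead of 04 would produce a different filename.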
Source code in mastoscore/wordcloud.py
# Imports this function relies on; in the module they sit above the function.
import logging
import os
import random
import re
from collections import Counter
from datetime import datetime

import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from wordcloud import STOPWORDS, WordCloud

from mastoscore.analyse import get_toots_df


def write_wordcloud(config):
    """
    This is the only function, for now. It invokes [get_toots_df()](module-analyse.md#mastoscore.analyse.get_toots_df)
    to get the DataFrame. Then it discards basically everything other than the `content` column.
    I post-process to remove some weird things (there's lots of emoji-like things). Depending on
    `hashtag_fix`, I also remove or thin out the hashtag itself, because otherwise it's obviously
    gonna have the highest frequency.
    # Parameters
    - **config**: A ConfigParser object from the [config](module-config.md) module
    # Config Parameters Used
    | Option | Description |
    | ------- | ------- |
    | `graph:journalfile` | Filename that forms the base of the graph's filename. |
    | `graph:journaldir` | Directory where we will write the graph file |
    | `fetch:hashtag` | Hashtag to search for |
    | `wordcloud:debug` | Logging level for this module, as an integer |
    | `wordcloud:font` | Path to a font file, such as Symbola |
    | `wordcloud:hashtag_fix` | What to do with the main hashtag? 'reduce', 'remove', or 'as-is' |
    | `wordcloud:size_x` | Width of the image in pixels. Default 1280 |
    | `wordcloud:size_y` | Height of the image in pixels. Default 960 |
    | `wordcloud:stop_words` | Comma-separated list of words to exclude |
    | `mastoscore:event_year` | Year of the event (YYYY) |
    | `mastoscore:event_month` | Month of the event (MM) |
    | `mastoscore:event_day` | Day of the event (DD) |
    # Returns
    None
    Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png
    Writes alt text description to wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
    """
    journalfile = config.get('graph', 'journalfile')
    journaldir = config.get('graph', 'journaldir')
    hashtag = config.get('fetch', 'hashtag')
    debug = config.getint('wordcloud', 'debug')
    size_x = config.getint('wordcloud', 'size_x')
    size_y = config.getint('wordcloud', 'size_y')
    font = config.get('wordcloud', 'font')
    hashtag_fix = config.get('wordcloud', 'hashtag_fix', fallback='as-is')

    # Set up logging first, so the error handling below can use the logger.
    logger = logging.getLogger(__name__)
    logging.basicConfig(format='%(levelname)s\t%(message)s')
    logger.setLevel(debug)

    # Get date components for filename
    try:
        year = config.get('mastoscore', 'event_year')
        month = config.get('mastoscore', 'event_month')
        day = config.get('mastoscore', 'event_day')
    except Exception as e:
        logger.error(f"Failed to get date components from config: {e}")
        logger.error("Falling back to current date")
        # Fill in all three parts, because the alt text needs them later.
        year, month, day = datetime.now().strftime("%Y %m %d").split()
    date_str = f"{year}{month}{day}"

    # import the stop words list from the WordCloud package. Add a few specifics to it
    stop_words = STOPWORDS.copy()

    # Add custom stop words from config file if they exist
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        added = 0
        # Split by commas and strip whitespace from each word
        for word in [w.strip().lower() for w in custom_stop_words.split(',')]:
            if word:  # Only add non-empty words
                stop_words.add(word)
                added += 1
                logger.debug(f"Added custom stop word: '{word}'")
        logger.info(f"Added {added} custom stop words from config")

    df = get_toots_df(config)
    worddata = list(df['content'])
    # all we care about is the content data, so we delete the whole dataframe. :)
    del df

    # Strip the HTML, drop URLs, and split the text into candidate words.
    allwords = ' '.join(worddata)
    bswords = BeautifulSoup(allwords, features="html.parser")
    just_text = bswords.get_text()
    just_text = re.sub(r"http[^ ]+ ", " ", just_text)
    just_words = [word for word in re.split(r"[ #,!-]+", just_text) if len(word) >= 3
                  and not word.startswith('http')
                  and not word.startswith('@')]

    if hashtag_fix == 'remove':
        stop_words.add(hashtag.lower())
    elif hashtag_fix == 'reduce':
        # This is a cheesy way to implement "keep about 10%". For each word that
        # matches the hashtag, I pick a random number between 0 and 9 and keep the
        # word only if the number is higher than 8. Words that don't match the
        # hashtag are always kept.
        old_len = len(just_words)
        just_words = [word for word in just_words if word.lower()
                      != hashtag.lower() or random.randint(0, 9) > 8]
        new_len = len(just_words)
        logger.debug(
            f"Removing {hashtag} removed {old_len-new_len} words, leaving {new_len}")
    just_words = ' '.join(just_words)

    # The regex used to detect words: 2+ consecutive word characters
    # (apostrophes are allowed after the first one), e.g. It's
    normal_word = r"(?:\w[\w']+)"
    font_path = font
    wc = WordCloud(font_path=font_path, width=size_x, height=size_y, max_font_size=360,
                   max_words=200, regexp=normal_word, scale=1.4, prefer_horizontal=0.60,
                   font_step=1, relative_scaling=0.1, repeat=False, stopwords=stop_words,
                   margin=4, min_word_length=3).generate(just_words)

    # Create graphs directory if it doesn't exist
    graphs_dir = "wordcloud"
    try:
        os.makedirs(graphs_dir, exist_ok=True)
        logger.debug(f"Ensured wordcloud directory exists: {graphs_dir}")
    except Exception as e:
        logger.error(f"Failed to create graphs directory: {e}")
        raise

    # Create the wordcloud filename with wordcloud-hashtag-YYYYMMDD-hashtag_fix pattern
    graph_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.png")
    alt_text_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.txt")

    # Generate alt text description
    just_words = [word.lower() for word in just_words.split(' ') if word.lower() not in stop_words]
    word_counts = Counter(just_words)
    total_unique_words = len(word_counts)
    top_words = word_counts.most_common(10)
    event_date = datetime.fromisoformat(f"{year}-{month}-{day}")
    # Note: %e (space-padded day of month) is a platform-specific strftime
    # extension and is not available on Windows.
    nice_date = datetime.strftime(event_date, "%A, %e %b %Y")

    # Format the alt text
    alt_text = f"The word cloud for {nice_date}. Words are larger the more frequently "\
        "they appeared in posts.\n"
    alt_text += f"There were {total_unique_words} unique words posted, and " \
        f"the wordcloud shows the {len(wc.words_.keys())} most frequent.\n"
    alt_text += "Top 10 most frequent words were:\n"
    for word, count in top_words:
        alt_text += f"{word}: {count}, "

    # Add information about custom stop words if any
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        if custom_stop_words.strip():
            alt_text += "\nThese words were excluded from the word cloud: \n"
            for word in [w.strip() for w in custom_stop_words.split(',')]:
                if word:
                    alt_text += f"{word}, "
            alt_text += f"and the hashtag {hashtag}."

    plt.style.use('dark_background')
    plt.figure(figsize=(13, 9))
    plt.axis("off")
    plt.gca().set_position([0, 0, 1, 1])
    plt.imshow(wc, interpolation='bilinear')
    try:
        plt.savefig(graph_file_name, format="png")
        # Save the alt text to a file
        with open(alt_text_file_name, 'w') as alt_file:
            alt_file.write(alt_text)
        logger.info(f"Saved alt text to {alt_text_file_name}")
    except Exception as e:
        logger.error(f"Failed to save {graph_file_name}")
        logger.error(e)
        raise
    else:
        logger.info(f"Saved {graph_file_name}")