Module: wordcloud¶
NOTE Brand new module! This has lots of random parameters that seem to make a good picture.
The hashtag tends to dominate the graph. I like that because it serves as a title or anchoring word, but some folks want to see the cloud without the hashtag dominating. So there's a config option, hashtag_fix, that takes one of 3 values: as-is, remove, or reduce. (The default, if omitted, is as-is.) In this section, I show the same data set from Kung-Fu Saturday, 7 December 2024, visualized 3 different ways.
Alt Text Generation¶
As of version 1.2.0, the wordcloud module now automatically generates descriptive alt text for each wordcloud image. This alt text includes:
- The hashtag and date of the analysis
- Total number of unique words in the wordcloud
- Top 10 most frequent words with their counts
- Information about hashtag treatment method (as-is, remove, reduce)
- List of any custom stop words that were used
The alt text is saved to a text file with the same name as the wordcloud image but with a .txt extension. For example, if the wordcloud is saved as wordcloud/wordcloud-monsterdon-20250409-as-is.png, the alt text will be saved as wordcloud/wordcloud-monsterdon-20250409-as-is.txt.
This feature makes the wordclouds more accessible and provides a quick summary of the key words from the visualization.
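As a rough illustration of how such a summary can be assembled, here is a minimal sketch using only the standard library. The helper name build_alt_text, its parameters, and the sample words are made up for this example; the module's real wording and layout may differ.

```python
from collections import Counter


def build_alt_text(words, hashtag, nice_date, stop_words=(), top_n=10):
    """Hypothetical helper: summarize word frequencies as alt text."""
    excluded = {w.lower() for w in stop_words}
    # Count case-insensitively, skipping anything on the exclusion list
    counts = Counter(w.lower() for w in words if w.lower() not in excluded)
    top = ", ".join(f"{word}: {count}" for word, count in counts.most_common(top_n))
    lines = [
        f"The word cloud for #{hashtag} on {nice_date}.",
        f"There were {len(counts)} unique words posted.",
        f"Top {top_n} most frequent words were: {top}",
    ]
    if excluded:
        lines.append("Excluded words: " + ", ".join(sorted(excluded)))
    return "\n".join(lines)


# Made-up sample data, not real posts
sample = ["shark", "Shark", "boat", "movie", "boat", "shark"]
alt_text = build_alt_text(sample, "monsterdon", "9 April 2025", stop_words=["movie"])
print(alt_text)
```

Writing the returned string to a .txt file next to the PNG gives the pairing described above.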
Custom Stop Words¶
You can exclude specific words from appearing in your wordcloud by adding a stop_words parameter to the [wordcloud] section of your INI file. This is particularly useful for filtering out common words that aren't meaningful to your analysis.
To use this feature:
- Add a stop_words parameter to the [wordcloud] section of your INI file
- Provide a comma-separated list of words to exclude
For example:
[wordcloud]
graph_title = Wordcloud
font = /path/to/font.otf
size_x = 1280
size_y = 960
hashtag_fix = remove
stop_words = movie, film, watching, watch, tonight, scene, scenes, actor, actors
These words will be excluded from the wordcloud in addition to the default stop words and any other configured exclusions. This is especially useful for event-specific hashtags where certain common words might dominate the visualization without adding meaningful information.
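The parsing side of this can be sketched as follows. This is an illustration under assumptions, not the module's exact code: read_stop_words and the sample INI text are invented here, and the base set stands in for the default stop words.

```python
from configparser import ConfigParser

# Sample INI text for illustration; note the mixed case and the empty entry
INI_TEXT = """
[wordcloud]
hashtag_fix = remove
stop_words = movie, film, Watching, , scene
"""


def read_stop_words(config, base=frozenset()):
    """Hypothetical helper: merge custom stop words into a base set."""
    stop_words = set(base)
    raw = config.get("wordcloud", "stop_words", fallback="")
    for word in raw.split(","):
        word = word.strip().lower()
        if word:  # skip empty entries such as ", ,"
            stop_words.add(word)
    return stop_words


config = ConfigParser()
config.read_string(INI_TEXT)
stop_words = read_stop_words(config, base={"the", "and"})
print(sorted(stop_words))
```

Lowercasing and stripping each entry means "Watching" and " watching" in the INI file both end up as the same stop word.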
as-is¶
Leave the hashtag alone.

remove¶
Remove all instances of the hashtag.

reduce¶
Remove most (currently hard-coded at 90%) occurrences of the hashtag. It will still be popular enough to be quite large, but it won't dominate. In this example, "KungFuSat" is near the top right, in a dark purple.
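The "keep roughly 1 in 10" idea can be sketched like this. The function name reduce_hashtag, the keep_one_in parameter, and the sample word list are assumptions for illustration; the seeded Random is only there to make the demo repeatable.

```python
from random import Random


def reduce_hashtag(words, hashtag, keep_one_in=10, rng=None):
    """Hypothetical helper: keep about 1 in `keep_one_in` hashtag occurrences."""
    if rng is None:
        rng = Random()
    target = hashtag.lower()
    # Non-hashtag words always pass; hashtag words pass only on a lucky roll
    return [w for w in words if w.lower() != target or rng.randrange(keep_one_in) == 0]


# Made-up data: 1000 hashtag occurrences plus 100 ordinary words
words = ["KungFuSat"] * 1000 + ["fight", "sword"] * 50
reduced = reduce_hashtag(words, "kungfusat", rng=Random(42))
kept = sum(1 for w in reduced if w.lower() == "kungfusat")
print(f"kept {kept} of 1000 hashtag occurrences")
```

Because the thinning is random, two runs over the same data will produce slightly different clouds unless the random generator is seeded.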

Synopsis¶
mastoscore --debug=info ini/monsterdon-20241201.ini wordcloud
Creates a file named wordcloud/wordcloud-{hashtag}-{YYYYMMDD}-{hashtag_fix}.png, plus a matching .txt file containing the alt text.
A Word about Emoji¶
While it is possible to make a word cloud that includes emoji, it's a bit complicated. See, it really boils down to the font and matplotlib's support for fonts. I think a lot of fancy word processing systems use multiple fonts (one for text, one for rendering symbols like emoji). But matplotlib needs a single font that has everything you want in it. The only one I have found like that is Symbola, which is OK, but the words themselves look pretty terrible. I think the right answer is probably to build emoji support into word_cloud itself to give it some emoji awareness and then use a different font for emojis. For now, I'm just dropping all emojis and punctuation.
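To see why emoji and punctuation fall away, here is a small sketch built on the same kind of word pattern the module passes to the word cloud generator (two or more word characters, apostrophes allowed after the first). The extract_words helper and the sample sentence are made up for this example.

```python
import re

# A word is one word character followed by one or more word characters or apostrophes.
# Emoji and punctuation are not word characters, so they never match.
WORD_RE = re.compile(r"\w[\w']+")


def extract_words(text):
    """Hypothetical helper: return only the plain words found in text."""
    return WORD_RE.findall(text)


print(extract_words("It's a 🦈 shark!!! 🎬 Best movie ever 👍"))
# → ["It's", 'shark', 'Best', 'movie', 'ever']
```

Single-character words such as "a" are dropped as well, since the pattern requires at least two characters.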
Examples¶

Code Reference¶
Module to take the data in from analysis and produce graph files.
write_wordcloud(config)¶
This is the only function, for now. It invokes get_toots_df() to get the DataFrame. Then it discards basically everything other than the content column.
I post-process to remove some weird things (there's lots of emoji-like things). Depending on hashtag_fix, I also remove or thin out the hashtag itself, because left alone it's obviously gonna have the highest frequency.
Parameters¶
- config: A ConfigParser object from the config module
Config Parameters Used¶
| Option | Description |
|---|---|
| graph:journalfile | Filename that forms the base of the graph's filename. |
| graph:journaldir | Directory where we will write the graph file |
| fetch:hashtag | Hashtag to search for |
| wordcloud:debug | Integer logging level passed to the module's logger |
| wordcloud:font | Path to a font file, like Symbola |
| wordcloud:hashtag_fix | What to do with the main hashtag? 'reduce', 'remove', or 'as-is' |
| wordcloud:size_x | Width of the image in pixels. Default 1280 |
| wordcloud:size_y | Height of the image in pixels. Default 960 |
| wordcloud:stop_words | Comma-separated list of words to exclude |
| mastoscore:event_year | Year of the event (YYYY) |
| mastoscore:event_month | Month of the event (MM) |
| mastoscore:event_day | Day of the event (DD) |
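For orientation, the options above could be read with the standard ConfigParser roughly as follows. The sample INI values are invented, and the fallback arguments for size_x and size_y are an assumption made here to mirror the documented defaults; the module itself may require those options to be present.

```python
from configparser import ConfigParser

# Invented sample configuration; size_x and size_y are deliberately omitted
SAMPLE_INI = """
[fetch]
hashtag = monsterdon

[mastoscore]
event_year = 2025
event_month = 04
event_day = 09

[wordcloud]
font = /path/to/font.otf
hashtag_fix = reduce
"""

config = ConfigParser()
config.read_string(SAMPLE_INI)

settings = {
    "hashtag": config.get("fetch", "hashtag"),
    "font": config.get("wordcloud", "font"),
    "hashtag_fix": config.get("wordcloud", "hashtag_fix", fallback="as-is"),
    # Fallbacks here are an assumption that mirrors the documented defaults
    "size_x": config.getint("wordcloud", "size_x", fallback=1280),
    "size_y": config.getint("wordcloud", "size_y", fallback=960),
    "date": "".join(
        config.get("mastoscore", key)
        for key in ("event_year", "event_month", "event_day")
    ),
}
print(settings)
```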
Returns¶
None
Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png
Writes alt text description to wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
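The naming pattern can be sketched as a tiny helper. The function output_paths and its arguments are hypothetical, and the directory name is passed in rather than looked up, since how the module chooses its output directory is not covered by this sketch.

```python
import os


def output_paths(directory, hashtag, year, month, day, hashtag_fix="as-is"):
    """Hypothetical helper: build the PNG and alt-text paths from one shared stem."""
    stem = f"wordcloud-{hashtag}-{year}{month}{day}-{hashtag_fix}"
    return os.path.join(directory, stem + ".png"), os.path.join(directory, stem + ".txt")


png_path, txt_path = output_paths("wordcloud", "monsterdon", "2025", "04", "09")
print(png_path)
print(txt_path)
```

Sharing one stem keeps the image and its alt text next to each other in a directory listing.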
Source code in mastoscore/wordcloud.py
# Imports used below. The module header is not part of this listing, so these
# are reconstructed from the names the function uses.
import logging
import os
import re
from collections import Counter
from datetime import datetime
from random import choice, randint

import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from wordcloud import STOPWORDS, WordCloud

from mastoscore.analyse import get_toots_df
# create_journal_directory is a mastoscore helper; its module is not shown here.


def write_wordcloud(config):
    """
    This is the only function, for now. It invokes [get_toots_df()](module-analyse.md#mastoscore.analyse.get_toots_df)
    to get the DataFrame. Then it discards basically everything other than the `content` column.
    I post-process to remove some weird things (there's lots of emoji-like things). Depending on
    `hashtag_fix`, I also remove or thin out the hashtag itself, because left alone it's obviously
    gonna have the highest frequency.

    # Parameters
    - **config**: A ConfigParser object from the [config](module-config.md) module

    # Config Parameters Used
    | Option | Description |
    | ------- | ------- |
    | `graph:journalfile` | Filename that forms the base of the graph's filename. |
    | `graph:journaldir` | Directory where we will write the graph file |
    | `fetch:hashtag` | Hashtag to search for |
    | `wordcloud:debug` | Integer logging level passed to the module's logger |
    | `wordcloud:font` | Path to a font file, like Symbola |
    | `wordcloud:hashtag_fix` | What to do with the main hashtag? 'reduce', 'remove', or 'as-is' |
    | `wordcloud:size_x` | Width of the image in pixels. Default 1280 |
    | `wordcloud:size_y` | Height of the image in pixels. Default 960 |
    | `wordcloud:stop_words` | Comma-separated list of words to exclude |
    | `mastoscore:event_year` | Year of the event (YYYY) |
    | `mastoscore:event_month` | Month of the event (MM) |
    | `mastoscore:event_day` | Day of the event (DD) |

    # Returns
    None

    Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png
    Writes alt text description to wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
    """
    journalfile = config.get('graph', 'journalfile')
    journaldir = config.get('graph', 'journaldir')
    hashtag = config.get('fetch', 'hashtag')
    debug = config.getint('wordcloud', 'debug')
    size_x = config.getint('wordcloud', 'size_x')
    size_y = config.getint('wordcloud', 'size_y')
    font = config.get('wordcloud', 'font')
    hashtag_fix = config.get('wordcloud', 'hashtag_fix', fallback='as-is')

    # Set up logging first so the error handling below can use the logger
    logger = logging.getLogger(__name__)
    logging.basicConfig(format='%(levelname)s\t%(message)s')
    logger.setLevel(debug)

    # Get date components for filename
    try:
        year = config.get('mastoscore', 'event_year')
        month = config.get('mastoscore', 'event_month')
        day = config.get('mastoscore', 'event_day')
    except Exception as e:
        logger.error(f"Failed to get date components from config: {e}")
        logger.error("Falling back to current date")
        # Keep year/month/day defined; the alt text below needs them too
        year, month, day = datetime.now().strftime("%Y %m %d").split()
    date_str = f"{year}{month}{day}"

    # import the stop words list from the WordCloud package. Add a few specifics to it
    stop_words = STOPWORDS.copy()

    # Add custom stop words from config file if they exist
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        # Split by commas and strip whitespace from each word
        for word in [w.strip().lower() for w in custom_stop_words.split(',')]:
            if word:  # Only add non-empty words
                stop_words.add(word)
                logger.debug(f"Added custom stop word: '{word}'")
        logger.info(f"Added {len(custom_stop_words.split(','))} custom stop words from config")

    df = get_toots_df(config)
    worddata = list(df['content'])
    # all we care about is the content data, so we delete the whole dataframe. :)
    del df

    # Strip HTML, then drop URLs, mentions, and anything shorter than 3 characters
    allwords = ' '.join(worddata)
    bswords = BeautifulSoup(allwords, features="html.parser")
    just_text = bswords.get_text()
    just_text = re.sub("http[^ ]+ ", " ", just_text)
    just_words = [word for word in re.split(r"[ #,!-]+", just_text) if len(word) >= 3
                  and not word.startswith('http')
                  and not word.startswith('@')]

    if hashtag_fix == 'remove':
        stop_words.add(hashtag.lower())
    elif hashtag_fix == 'reduce':
        # This is a cheesy way to implement "keep X0%". I just pick a bunch of
        # random numbers between 0 and 9 and check if it comes up higher than X.
        # I go through the words and the ones that don't match the hashtag are just
        # kept. If it's the hashtag, then I roll a die to see if I keep it.
        old_len = len(just_words)
        just_words = [word for word in just_words if word.lower()
                      != hashtag.lower() or randint(0, 9) > 8]
        new_len = len(just_words)
        logger.debug(
            f"Removing {hashtag} removed {old_len-new_len} words, leaving {new_len}")
    just_words = ' '.join(just_words)

    # the regex used to detect words is a combination of normal words, ascii art, and emojis
    # 2+ consecutive letters (also include apostrophes), e.g. It's
    normal_word = r"(?:\w[\w']+)"
    font_path = font
    avail_cmaps = ['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2',
                   'Set1', 'Set2', 'Set3', 'tab10', 'tab20', 'tab20b',
                   'tab20c']
    cmap = choice(avail_cmaps)
    logger.info(f"Chose colormap: {cmap}")
    wc = WordCloud(font_path=font_path, width=size_x, height=size_y, max_font_size=360,
                   max_words=200, regexp=normal_word, scale=1.4, prefer_horizontal=0.60,
                   font_step=1, relative_scaling=0.1, repeat=False, stopwords=stop_words,
                   margin=4, min_word_length=4, colormap=cmap).generate(just_words)

    # Graphs go in the journal directory now
    graphs_dir = create_journal_directory(config)
    # Create the wordcloud filename with wordcloud-hashtag-YYYYMMDD-hashtag_fix pattern
    graph_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.png")
    alt_text_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.txt")

    # Generate alt text description
    just_words = [word.lower() for word in just_words.split(' ') if word.lower() not in stop_words]
    word_counts = Counter(just_words)
    total_unique_words = len(word_counts)
    top_words = word_counts.most_common(10)
    event_date = datetime.fromisoformat(f"{year}-{month}-{day}")
    # Note: %e (space-padded day) is a platform extension and is not available on Windows
    nice_date = event_date.strftime("%A, %e %b %Y")

    # Format the alt text
    alt_text = f"The word cloud for {nice_date}. Words are larger the more frequently "\
        "they appeared in posts.\n"
    alt_text += f"There were {total_unique_words} unique words posted, and " \
        f"the wordcloud shows the {len(wc.words_.keys())} most frequent.\n"
    alt_text += "Top 10 most frequent words were:\n"
    for word, count in top_words:
        alt_text += f"{word}: {count}, "

    # Add information about custom stop words if any
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        if custom_stop_words.strip():
            alt_text += "\nThese words were excluded from the word cloud: \n"
            for word in [w.strip() for w in custom_stop_words.split(',')]:
                if word:
                    alt_text += f"{word}, "
            alt_text += f"and the hashtag {hashtag}."

    # Render the cloud edge to edge on a dark background, with no axes
    plt.style.use('dark_background')
    plt.figure(figsize=(13, 9))
    plt.axis("off")
    plt.gca().set_position([0, 0, 1, 1])
    plt.imshow(wc, interpolation='bilinear')
    try:
        plt.savefig(graph_file_name, format="png")
        # Save the alt text to a file
        with open(alt_text_file_name, 'w') as alt_file:
            alt_file.write(alt_text)
        logger.info(f"Saved alt text to {alt_text_file_name}")
    except Exception as e:
        logger.error(f"Failed to save {graph_file_name}")
        logger.error(e)
        raise
    else:
        logger.info(f"Saved {graph_file_name}")