Module: wordcloud

NOTE Brand new module! This has lots of random parameters that seem to make a good picture.

The hashtag tends to dominate the graph. I like that because it serves as a title or anchoring word. But some folks want to see the wordcloud without the hashtag itself dominating. So there's a config option, hashtag_fix, that takes one of 3 values: as-is, remove, or reduce. (The default, if omitted, is as-is.) The as-is, remove, and reduce sections below show the same data set from Kung-Fu Saturday, 7 December 2024, visualized 3 different ways.

Alt Text Generation

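As of version 1.2.0, the wordcloud module automatically generates descriptive alt text for each wordcloud image. This alt text includes:

  • The date of the event
  • The number of unique words posted and the number of words shown in the wordcloud
  • The 10 most frequent words, with their counts
  • Any custom stop words that were excluded, along with the hashtag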

The alt text is saved to a text file with the same name as the wordcloud image but with a .txt extension. For example, if the wordcloud is saved as wordcloud/wordcloud-monsterdon-20250409-as-is.png, the alt text will be saved as wordcloud/wordcloud-monsterdon-20250409-as-is.txt.

This feature makes the wordclouds more accessible and provides a quick summary of the key words from the visualization.
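As a rough illustration of how such a summary can be built, here is a minimal sketch using collections.Counter. The function name summarize_words and its parameters are hypothetical and only mirror the idea; the real module builds a longer description.

```python
from collections import Counter


def summarize_words(words, stop_words, top_n=10):
    """Count words that are not stop words and describe the most common ones."""
    # Lowercase everything so "Kaiju" and "kaiju" are counted together
    kept = [w.lower() for w in words if w and w.lower() not in stop_words]
    counts = Counter(kept)
    top = ", ".join(f"{word}: {count}" for word, count in counts.most_common(top_n))
    return (
        f"There were {len(counts)} unique words posted.\n"
        f"Top {top_n} most frequent words were:\n{top}"
    )


summary = summarize_words(
    ["Kaiju", "kaiju", "the", "rubber", "suit", "kaiju", "suit"],
    stop_words={"the"},
    top_n=2,
)
print(summary)
```

Counting after lowercasing and after stop word removal keeps the summary consistent with what a reader sees in the picture.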

Custom Stop Words

You can exclude specific words from appearing in your wordcloud by adding a stop_words parameter to the [wordcloud] section of your INI file. This is particularly useful for filtering out common words that aren't meaningful to your analysis.

To use this feature:

  1. Add a stop_words parameter to the [wordcloud] section of your INI file
  2. Provide a comma-separated list of words to exclude

For example:

[wordcloud]
graph_title  = Wordcloud
font         = /path/to/font.otf
size_x       = 1280
size_y       = 960
hashtag_fix  = remove
stop_words   = movie, film, watching, watch, tonight, scene, scenes, actor, actors

These words will be excluded from the wordcloud in addition to the default stop words and any other configured exclusions. This is especially useful for event-specific hashtags where certain common words might dominate the visualization without adding meaningful information.
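The parsing of stop_words can be sketched as follows. This is a simplified, standalone version: DEFAULT_STOP_WORDS is a hypothetical stand-in for the default stop word list, and load_stop_words is an illustrative name, not a function in the module.

```python
from configparser import ConfigParser

# Hypothetical stand-in for the package's built-in stop word list.
DEFAULT_STOP_WORDS = {"the", "and", "for"}

INI_TEXT = """
[wordcloud]
stop_words = movie, Film, watching, , tonight
"""


def load_stop_words(config):
    """Merge the defaults with the comma-separated stop_words option, if present."""
    stop_words = set(DEFAULT_STOP_WORDS)
    raw = config.get("wordcloud", "stop_words", fallback="")
    for word in raw.split(","):
        word = word.strip().lower()
        if word:  # skip empty entries such as the one left by ", ,"
            stop_words.add(word)
    return stop_words


config = ConfigParser()
config.read_string(INI_TEXT)
stop_words = load_stop_words(config)
print(sorted(stop_words))
```

Words are lowercased before they are added, so the capitalization used in the INI file does not matter.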

as-is

Leave the hashtag alone.

kungfu saturday as-is

remove

Remove all instances of the hashtag.

kungfu saturday remove

reduce

Remove most (currently hard-coded at 90%) occurrences of the hashtag. It will still be popular enough to be quite large, but it won't dominate. In this example, "KungFuSat" is near the top right, in a dark purple.

kungfu saturday reduce
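The three modes can be sketched as a single filter over the word list. This is a simplified illustration: apply_hashtag_fix, keep_fraction, and rng are hypothetical names, and the sketch filters the list directly in remove mode, whereas the module adds the hashtag to its stop words instead.

```python
import random


def apply_hashtag_fix(words, hashtag, mode="as-is", keep_fraction=0.1, rng=None):
    """Return words with the hashtag left alone, removed, or thinned out."""
    if rng is None:
        rng = random.Random()
    tag = hashtag.lower()
    if mode == "as-is":
        return list(words)
    if mode == "remove":
        return [w for w in words if w.lower() != tag]
    if mode == "reduce":
        # Other words are always kept; each hashtag occurrence survives
        # with probability keep_fraction (10% by default).
        return [w for w in words if w.lower() != tag or rng.random() < keep_fraction]
    raise ValueError(f"unknown hashtag_fix value: {mode!r}")


words = ["KungFuSat"] * 1000 + ["kick", "punch"]
reduced = apply_hashtag_fix(words, "kungfusat", "reduce", rng=random.Random(42))
print(len(words), len(reduced))
```

Because the thinning is random, two runs over the same data can give slightly different pictures in reduce mode unless the random generator is seeded.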

Synopsis

mastoscore --debug=info ini/monsterdon-20241201.ini wordcloud

Creates a file named wordcloud/wordcloud-{hashtag}-{YYYYMMDD}-{hashtag_fix}.png, plus a matching .txt file that holds the alt text.

A Word about Emoji

While it is possible to make a word cloud that includes emoji, it's a bit complicated. See, it really boils down to the font and matplotlib's support for fonts. I think a lot of fancy word processing systems use multiple fonts (one for text, one for rendering symbols like emoji). But matplotlib needs a single font that has everything you want in it. The only one I have found like that is Symbola, which is OK, but the words themselves look pretty terrible. I think the right answer is probably to build emoji support into word_cloud itself to give it some emoji awareness and then use a different font for emojis. For now, I'm just dropping all emojis and punctuation.
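To show what dropping emoji and punctuation looks like in practice, here is a small sketch built on the same word pattern the module passes to WordCloud (two or more word characters, apostrophes allowed). The helper name plain_words is hypothetical.

```python
import re

# Two or more word characters, allowing apostrophes after the first one.
NORMAL_WORD = re.compile(r"\w[\w']+")


def plain_words(text):
    """Return only the word-like tokens, dropping emoji and punctuation."""
    return NORMAL_WORD.findall(text)


# \U0001F996 is a T-Rex emoji and \U0001F3AC is a clapper board
print(plain_words("Godzilla \U0001F996 roars!!! It's great \U0001F3AC"))
# → ['Godzilla', 'roars', "It's", 'great']
```

Emoji are not word characters, so they never match the pattern and simply fall out of the result.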

Examples

Example Monsterdon

Code Reference

Module to take the data in from analysis and produce graph files.

write_wordcloud(config)

This is the only function, for now. It invokes get_toots_df() to get the DataFrame. Then it discards basically everything other than the content column. I post-process to strip HTML, links, mentions, and punctuation (there's lots of emoji-like things). Depending on hashtag_fix, I also remove or thin out the hashtag itself, because it's obviously gonna have the highest frequency.
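The word-splitting step of that post-processing can be sketched on its own. The split pattern and the filters below match the ones in the source listing; the wrapper function tokenize and its min_length parameter are hypothetical, and the HTML stripping that happens beforehand is left out.

```python
import re


def tokenize(text, min_length=3):
    """Split on spaces and a few punctuation marks, then drop short words, links, and mentions."""
    return [
        word
        for word in re.split(r"[ #,!-]+", text)
        if len(word) >= min_length
        and not word.startswith("http")
        and not word.startswith("@")
    ]


print(tokenize("Watching #KungFuSat with @friend https://example.com/x now!"))
# → ['Watching', 'KungFuSat', 'with', 'now']
```

Note that "#" is one of the split characters, so the hashtag arrives in the word list without its leading "#".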

Parameters

  • config: A ConfigParser object from the config module

Config Parameters Used

Option                   Description
graph:journalfile        Filename that forms the base of the graph's filename.
graph:journaldir         Directory where we will write the graph file
fetch:hashtag            Hashtag to search for
wordcloud:font           Path to a font file, such as Symbola
wordcloud:hashtag_fix    What to do with the main hashtag? 'reduce', 'remove', or 'as-is'
wordcloud:size_x         Width of the image in pixels. Default 1280
wordcloud:size_y         Height of the image in pixels. Default 960
wordcloud:stop_words     Comma-separated list of words to exclude
mastoscore:event_year    Year of the event (YYYY)
mastoscore:event_month   Month of the event (MM)
mastoscore:event_day     Day of the event (DD)

Returns

None

Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png

Writes alt text description to wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
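The naming pattern can be illustrated with a short sketch. The helper output_paths and its arguments are hypothetical; only the resulting pattern comes from the module.

```python
from pathlib import Path


def output_paths(hashtag, year, month, day, hashtag_fix="as-is", out_dir="wordcloud"):
    """Build the image path and the matching alt text path."""
    stem = f"wordcloud-{hashtag}-{year}{month}{day}-{hashtag_fix}"
    image = Path(out_dir) / f"{stem}.png"
    # The alt text sits next to the image, with only the extension changed
    return image, image.with_suffix(".txt")


image, alt_text = output_paths("monsterdon", "2025", "04", "09")
print(image.as_posix())
print(alt_text.as_posix())
```

With these inputs the two paths are wordcloud/wordcloud-monsterdon-20250409-as-is.png and wordcloud/wordcloud-monsterdon-20250409-as-is.txt.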

Source code in mastoscore/wordcloud.py
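# Imports this excerpt relies on; the exact import lines in the module may differ
import logging
import os
import random
import re
from collections import Counter
from datetime import datetime

import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from wordcloud import STOPWORDS, WordCloud

from mastoscore.analyse import get_toots_df


def write_wordcloud(config):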
    """
    This is the only function, for now. It invokes [get_toots_df()](module-analyse.md#mastoscore.analyse.get_toots_df)
    to get the DataFrame. Then it discards basically everything other than the `content` column.
    I post-process to strip HTML, links, mentions, and punctuation (there's lots of emoji-like things).
    Depending on `hashtag_fix`, I also remove or thin out the hashtag itself, because it's obviously
    gonna have the highest frequency.

    # Parameters
    - **config**: A ConfigParser object from the [config](module-config.md) module

    # Config Parameters Used

    | Option | Description |
    | ------- | ------- |
    | `graph:journalfile` | Filename that forms the base of the graph's filename. |
    | `graph:journaldir` | Directory where we will write the graph file |
    | `fetch:hashtag` | Hashtag to search for |
    | `wordcloud:font` | Path to a font file, such as Symbola |
    | `wordcloud:hashtag_fix` | What to do with the main hashtag? 'reduce', 'remove', or 'as-is' |
    | `wordcloud:size_x` | Size in pixels for the image. Default 1280 |
    | `wordcloud:size_y` | Size in pixels for the image. Default 960 |
    | `wordcloud:stop_words` | Comma-separated list of words to exclude |
    | `mastoscore:event_year` | Year of the event (YYYY) |
    | `mastoscore:event_month` | Month of the event (MM) |
    | `mastoscore:event_day` | Day of the event (DD) |

    # Returns

    None

    Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png
    Writes alt text description to   wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
    """
    journalfile = config.get('graph', 'journalfile')
    journaldir = config.get('graph', 'journaldir')
    hashtag = config.get('fetch', 'hashtag')
    debug = config.getint('wordcloud', 'debug')
    size_x = config.getint('wordcloud', 'size_x')
    size_y = config.getint('wordcloud', 'size_y')
    font = config.get('wordcloud', 'font')
    hashtag_fix = config.get('wordcloud', 'hashtag_fix', fallback='as-is')

    # Set up logging first so the error handling below can use it
    logger = logging.getLogger(__name__)
    logging.basicConfig(format='%(levelname)s\t%(message)s')
    logger.setLevel(debug)

    # Get date components for filename
    try:
        year = config.get('mastoscore', 'event_year')
        month = config.get('mastoscore', 'event_month')
        day = config.get('mastoscore', 'event_day')
    except Exception as e:
        logger.error(f"Failed to get date components from config: {e}")
        logger.error("Falling back to current date")
        # Keep year, month, and day defined as well
        year, month, day = datetime.now().strftime("%Y %m %d").split()
    date_str = f"{year}{month}{day}"

    # import the stop words list from the WordCloud package. Add a few specifics to it
    stop_words = STOPWORDS.copy()

    # Add custom stop words from config file if they exist
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        # Split by commas, strip whitespace, and skip empty entries
        custom_words = [w.strip().lower() for w in custom_stop_words.split(',') if w.strip()]
        for word in custom_words:
            stop_words.add(word)
            logger.debug(f"Added custom stop word: '{word}'")
        logger.info(f"Added {len(custom_words)} custom stop words from config")

    df = get_toots_df(config)
    worddata = list(df['content'])
    # all we care about is the content data, so we delete the whole dataframe. :)
    del df
    allwords = ' '.join(worddata)
    bswords = BeautifulSoup(allwords, features="html.parser")
    just_text = bswords.get_text()
    just_text = re.sub("http[^ ]+ ", " ", just_text)
    just_words = [word for word in re.split(r"[ #,!-]+", just_text) if len(word) >= 3
                  and not word.startswith('http')
                  and not word.startswith('@')]
    if hashtag_fix == 'remove':
        stop_words.add(hashtag.lower())
    elif hashtag_fix == 'reduce':
        # This is a cheesy way to implement "keep 10%". Words that aren't the
        # hashtag are always kept. For each occurrence of the hashtag, I roll a
        # ten-sided die (0 through 9) and keep it only on a 9, so about 90% of
        # its occurrences are dropped.
        old_len = len(just_words)
        just_words = [word for word in just_words if word.lower()
                      != hashtag.lower() or random.randint(0, 9) > 8]
        new_len = len(just_words)
        logger.debug(
            f"Removing {hashtag} removed {old_len-new_len} words, leaving {new_len}")
    just_words = ' '.join(just_words)
    # the regex used to detect words is a combination of normal words, ascii art, and emojis
    # 2+ consecutive letters (also include apostrophes), e.x It's
    normal_word = r"(?:\w[\w']+)"
    font_path = font
    wc = WordCloud(font_path=font_path, width=size_x, height=size_y, max_font_size=360,
                   max_words=200, regexp=normal_word, scale=1.4, prefer_horizontal=0.60,
                   font_step=1, relative_scaling=0.1, repeat=False, stopwords=stop_words,
                   margin=4, min_word_length=3).generate(just_words)

    # Create graphs directory if it doesn't exist
    graphs_dir = "wordcloud"
    try:
        os.makedirs(graphs_dir, exist_ok=True)
        logger.debug(f"Ensured wordcloud directory exists: {graphs_dir}")
    except Exception as e:
        logger.error(f"Failed to create graphs directory: {e}")
        raise

    # Create the wordcloud filename with wordcloud-hashtag-YYYYMMDD-hashtag_fix pattern
    graph_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.png")
    alt_text_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.txt")

    # Generate alt text description
    just_words =  [word.lower() for word in just_words.split(' ') if word.lower() not in stop_words]
    word_counts = Counter(just_words)
    total_unique_words = len(word_counts)
    top_words = word_counts.most_common(10)

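    # Parse the date already assembled for the filename (YYYYMMDD)
    event_date = datetime.strptime(date_str, "%Y%m%d")
    # Avoid %e, which is not available on every platform
    nice_date = f"{event_date:%A}, {event_date.day} {event_date:%b %Y}"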

    # Format the alt text
    alt_text =  f"The word cloud for {nice_date}. Words are larger the more frequently "\
        "they appeared in posts.\n"
    alt_text += f"There were {total_unique_words} unique words posted, and " \
                f"the wordcloud shows the {len(wc.words_.keys())} most frequent.\n"
    alt_text += "Top 10 most frequent words were:\n"
    for word, count in top_words:
        alt_text += f"{word}: {count}, "

    # Add information about custom stop words if any
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        if custom_stop_words.strip():
            alt_text += "\nThese words were excluded from the word cloud: \n"
            for word in [w.strip() for w in custom_stop_words.split(',')]:
                if word:
                    alt_text += f"{word}, "
    alt_text += f"and the hashtag {hashtag}."
    plt.style.use('dark_background')
    plt.figure(figsize=(13, 9))
    plt.axis("off")
    plt.gca().set_position([0, 0, 1, 1])
    plt.imshow(wc, interpolation='bilinear')
    try:
        plt.savefig(graph_file_name, format="png")

        # Save the alt text to a file
        with open(alt_text_file_name, 'w') as alt_file:
            alt_file.write(alt_text)
            logger.info(f"Saved alt text to {alt_text_file_name}")

    except Exception as e:
        logger.error(f"Failed to save {graph_file_name}")
        logger.error(e)
        raise
    else:
        logger.info(f"Saved {graph_file_name}")