Module: wordcloud¶

NOTE: This module is brand new. Many of its parameter values are arbitrary; I picked them because they seem to make a good picture.

The hashtag tends to dominate the graph. I like that because it serves as a title or anchoring word, but some folks want to see the cloud without the hashtag dominating. So there's a config option, hashtag_fix, that takes one of 3 values: as-is, remove, or reduce. (The default, if omitted, is as-is.) In this section, I show the same data set from Kung-Fu Saturday, 7 December 2024, visualized 3 different ways.

Alt Text Generation¶

As of version 1.2.0, the wordcloud module automatically generates descriptive alt text for each wordcloud image. This alt text includes:

The hashtag and date of the analysis
Total number of unique words in the wordcloud
Top 10 most frequent words with their counts
Information about hashtag treatment method (as-is, remove, reduce)
List of any custom stop words that were used

The alt text is saved to a text file with the same name as the wordcloud image but with a .txt extension. For example, if the wordcloud is saved as wordcloud/wordcloud-monsterdon-20250409-as-is.png, the alt text will be saved as wordcloud/wordcloud-monsterdon-20250409-as-is.txt.

This feature makes the wordclouds more accessible and provides a quick summary of the key words from the visualization.
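
As a rough illustration of the kind of summary described above, the sketch below counts words and formats the most frequent ones. The helper name `summarize_words` and the sample data are hypothetical; the module builds its own alt text inside `write_wordcloud()`.

```python
from collections import Counter


def summarize_words(words, top_n=10):
    """Return (unique_word_count, summary_line) for a list of words.

    Hypothetical helper for illustration; not part of mastoscore.
    """
    counts = Counter(word.lower() for word in words)
    top = counts.most_common(top_n)
    summary = ", ".join(f"{word}: {count}" for word, count in top)
    return len(counts), summary


sample = ["Godzilla", "godzilla", "tokyo", "godzilla", "tokyo", "train"]
unique, summary = summarize_words(sample, top_n=2)
print(unique)   # number of distinct words, ignoring case
print(summary)  # the two most frequent words with their counts
```

Lower-casing before counting keeps "Godzilla" and "godzilla" from being reported as two different words.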

Custom Stop Words¶

You can exclude specific words from appearing in your wordcloud by adding a stop_words parameter to the [wordcloud] section of your INI file. This is particularly useful for filtering out common words that aren't meaningful to your analysis.

To use this feature:

Add a stop_words parameter to the [wordcloud] section of your INI file
Provide a comma-separated list of words to exclude

For example:

[wordcloud]
graph_title  = Wordcloud
font         = /path/to/font.otf
size_x       = 1280
size_y       = 960
hashtag_fix  = remove
stop_words   = movie, film, watching, watch, tonight, scene, scenes, actor, actors

These words will be excluded from the wordcloud in addition to the default stop words and any other configured exclusions. This is especially useful for event-specific hashtags where certain common words might dominate the visualization without adding meaningful information.
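
A minimal sketch of how a comma-separated stop_words value can be turned into a set of words, assuming the standard library `configparser`. The INI text and the variable names here are illustrative only.

```python
import configparser

# Illustrative INI content; note the mixed case and the empty entry.
INI_TEXT = """
[wordcloud]
hashtag_fix = remove
stop_words  = movie, film, Watching, , scene
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

# Split on commas, trim whitespace, lower-case, and drop empty entries.
raw = config.get("wordcloud", "stop_words", fallback="")
custom_stop_words = {w.strip().lower() for w in raw.split(",") if w.strip()}
print(sorted(custom_stop_words))
```

Dropping empty entries means a stray trailing comma in the INI file does no harm.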

`as-is`¶

Leave the hashtag alone.

kungfu saturday as-is

`remove`¶

Remove all instances of the hashtag.

kungfu saturday remove

`reduce`¶

Remove most (currently hard-coded at 90%) occurrences of the hashtag. It will still be popular enough to be quite large, but it won't dominate. In this example, "KungFuSat" is near the top right, in a dark purple.

kungfu saturday reduce
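
The idea behind `reduce` can be sketched as a random filter: every other word is kept, and each occurrence of the hashtag survives with a small probability. This is an illustration under assumptions, not the module's code: `reduce_hashtag` and `keep_fraction` are hypothetical names, and it uses `random.random()` rather than the module's integer die roll.

```python
import random


def reduce_hashtag(words, hashtag, keep_fraction=0.1, rng=None):
    """Keep every non-hashtag word; keep each hashtag occurrence with
    probability keep_fraction. Hypothetical helper for illustration."""
    rng = rng or random.Random()
    target = hashtag.lower()
    return [w for w in words
            if w.lower() != target or rng.random() < keep_fraction]


words = ["KungFuSat"] * 1000 + ["sword", "master", "temple"] * 50
reduced = reduce_hashtag(words, "kungfusat", rng=random.Random(7))
kept = sum(1 for w in reduced if w.lower() == "kungfusat")
print(f"kept {kept} of 1000 hashtag occurrences")
```

Because the filter is random, the exact number of surviving hashtags varies from run to run unless the random generator is seeded.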

Synopsis¶

mastoscore --debug=info ini/monsterdon-20241201.ini wordcloud

Creates an image file named wordcloud/wordcloud-{hashtag}-{YYYYMMDD}-{hashtag_fix}.png, plus a .txt file with the same base name that contains the alt text.

A Word about Emoji¶

While it is possible to make a word cloud that includes emoji, it's a bit complicated. It really boils down to the font and matplotlib's support for fonts. I think a lot of fancy word processing systems use multiple fonts (one for text, another for symbols like emoji), but matplotlib needs a single font that has everything you want in it. The only one I have found like that is Symbola, which is OK, but the words themselves look pretty terrible. I think the right answer is probably to build emoji awareness into word_cloud itself so that it can use a different font for emoji. For now, I'm just dropping all emoji and punctuation.
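
To show what dropping emoji and punctuation looks like in practice, here is a small sketch using the same word pattern that the source code below passes to WordCloud. The sample text and the name `WORD_PATTERN` are made up for illustration.

```python
import re

# Two or more characters: a word character followed by word characters
# or apostrophes. Emoji and punctuation are not word characters.
WORD_PATTERN = re.compile(r"\w[\w']+")

text = "It's a great movie 🎬🦖!!!"
words = WORD_PATTERN.findall(text)
print(words)
```

The single-letter word "a" is dropped too, because the pattern requires at least two characters.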

Examples¶

Example Monsterdon

Code Reference¶

Module that takes the data from analysis and produces wordcloud image files.

`write_wordcloud(config)` ¶

This is the only function, for now. It invokes get_toots_df() to get the DataFrame. Then it discards basically everything other than the content column. I post-process to remove some weird things (there's lots of emoji-like things). Depending on hashtag_fix, I also remove or thin out the hashtag itself, because it's obviously going to have the highest frequency.

Parameters¶

config: A ConfigParser object from the config module

Config Parameters Used¶

Option	Description
`graph:journalfile`	Filename that forms the base of the graph's filename.
`graph:journaldir`	Directory where we will write the graph file
`fetch:hashtag`	Hashtag to search for
`wordcloud:font`	Path to a font file, such as Symbola
`wordcloud:hashtag_fix`	What to do with the main hashtag? 'reduce', 'remove', or 'as-is'
`wordcloud:size_x`	Size in pixels for the image. Default 1280
`wordcloud:size_y`	Size in pixels for the image. Default 960
`wordcloud:stop_words`	Comma-separated list of words to exclude
`mastoscore:event_year`	Year of the event (YYYY)
`mastoscore:event_month`	Month of the event (MM)
`mastoscore:event_day`	Day of the event (DD)
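
The sketch below shows one way to read a few of these options with the standard library `configparser`. It is written under assumptions: the fallback values mirror the defaults documented in the table, the validation of hashtag_fix is purely illustrative, and the real module may obtain its defaults elsewhere.

```python
import configparser

# Illustrative INI content that omits size_x and size_y on purpose.
INI_TEXT = """
[fetch]
hashtag = monsterdon

[wordcloud]
font = /path/to/font.otf
hashtag_fix = reduce
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

hashtag = config.get("fetch", "hashtag")
# Assumed fallbacks, taken from the defaults listed in the table above.
size_x = config.getint("wordcloud", "size_x", fallback=1280)
size_y = config.getint("wordcloud", "size_y", fallback=960)
hashtag_fix = config.get("wordcloud", "hashtag_fix", fallback="as-is")

# Illustrative check only: reject values the documentation does not list.
if hashtag_fix not in ("as-is", "remove", "reduce"):
    raise ValueError(f"Unknown hashtag_fix value: {hashtag_fix!r}")
print(hashtag, size_x, size_y, hashtag_fix)
```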

Returns¶

None

Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png
Writes alt text description to wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
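
The naming pattern can be sketched as follows. `output_paths` is a hypothetical helper for illustration, not a function in mastoscore; it uses `PurePosixPath` so the result looks the same on every platform.

```python
from pathlib import PurePosixPath


def output_paths(hashtag, year, month, day, hashtag_fix, out_dir="wordcloud"):
    """Return (image_path, alt_text_path) for one wordcloud run.

    Hypothetical helper; year, month, and day are zero-padded strings.
    """
    stem = f"wordcloud-{hashtag}-{year}{month}{day}-{hashtag_fix}"
    image = PurePosixPath(out_dir) / f"{stem}.png"
    # The alt text file shares the image's name, with a .txt extension.
    return image, image.with_suffix(".txt")


image, alt = output_paths("monsterdon", "2025", "04", "09", "as-is")
print(image)
print(alt)
```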

Source code in mastoscore/wordcloud.py

# Imports that write_wordcloud() relies on
import logging
import os
import re
from collections import Counter
from datetime import datetime
from random import choice, randint

import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from wordcloud import STOPWORDS, WordCloud

from mastoscore.analyse import get_toots_df

def write_wordcloud(config):
    """
    This is the only function, for now. It invokes [get_toots_df()](module-analyse.md#mastoscore.analyse.get_toots_df)
    to get the DataFrame. Then it discards basically everything other than the `content` column.
    I post-process to remove some weird things (there's lots of emoji-like things). Depending on
    `hashtag_fix`, I also remove or thin out the hashtag itself, because it's obviously going to
    have the highest frequency.

    # Parameters
    - **config**: A ConfigParser object from the [config](module-config.md) module

    # Config Parameters Used

    | Option | Description |
    | ------- | ------- |
    | `graph:journalfile` | Filename that forms the base of the graph's filename. |
    | `graph:journaldir` | Directory where we will write the graph file |
    | `fetch:hashtag` | Hashtag to search for |
    | `wordcloud:font` | Path to a font file, such as Symbola |
    | `wordcloud:hashtag_fix` | What to do with the main hashtag? 'reduce', 'remove', or 'as-is' |
    | `wordcloud:size_x` | Size in pixels for the image. Default 1280 |
    | `wordcloud:size_y` | Size in pixels for the image. Default 960 |
    | `wordcloud:stop_words` | Comma-separated list of words to exclude |
    | `mastoscore:event_year` | Year of the event (YYYY) |
    | `mastoscore:event_month` | Month of the event (MM) |
    | `mastoscore:event_day` | Day of the event (DD) |

    # Returns

    None

    Writes the graph to a file named wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.png
    Writes alt text description to   wordcloud/wordcloud-hashtag-YYYYMMDD-hashtag_fix.txt
    """
    journalfile = config.get('graph', 'journalfile')
    journaldir = config.get('graph', 'journaldir')
    hashtag = config.get('fetch', 'hashtag')
    debug = config.getint('wordcloud', 'debug')
    size_x = config.getint('wordcloud', 'size_x')
    size_y = config.getint('wordcloud', 'size_y')
    font = config.get('wordcloud', 'font')
    hashtag_fix = config.get('wordcloud', 'hashtag_fix', fallback='as-is')

    # Set up logging first so the error handling below can use the logger
    logger = logging.getLogger(__name__)
    logging.basicConfig(format='%(levelname)s\t%(message)s')
    logger.setLevel(debug)

    # Get date components for filename
    try:
        year = config.get('mastoscore', 'event_year')
        month = config.get('mastoscore', 'event_month')
        day = config.get('mastoscore', 'event_day')
    except Exception as e:
        logger.error(f"Failed to get date components from config: {e}")
        logger.error("Falling back to current date")
        # Keep year, month, and day defined; the alt text needs them later
        year, month, day = datetime.now().strftime("%Y %m %d").split()
    date_str = f"{year}{month}{day}"

    # import the stop words list from the WordCloud package. Add a few specifics to it
    stop_words = STOPWORDS.copy()

    # Add custom stop words from config file if they exist
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        # Split by commas, strip whitespace, and drop empty entries
        custom_words = [w.strip().lower() for w in custom_stop_words.split(',') if w.strip()]
        for word in custom_words:
            stop_words.add(word)
            logger.debug(f"Added custom stop word: '{word}'")
        # Count only the words actually added, not empty entries
        logger.info(f"Added {len(custom_words)} custom stop words from config")

    df = get_toots_df(config)
    worddata = list(df['content'])
    # all we care about is the content data, so we delete the whole dataframe. :)
    del df
    allwords = ' '.join(worddata)
    bswords = BeautifulSoup(allwords, features="html.parser")
    just_text = bswords.get_text()
    just_text = re.sub("http[^ ]+ ", " ", just_text)
    just_words = [word for word in re.split(r"[ #,!-]+", just_text) if len(word) >= 3
                  and not word.startswith('http')
                  and not word.startswith('@')]
    if hashtag_fix == 'remove':
        stop_words.add(hashtag.lower())
    elif hashtag_fix == 'reduce':
        # This is a cheesy way to implement "keep 10%". Words that don't match
        # the hashtag are always kept. For each occurrence of the hashtag, I
        # roll a die: pick a random number between 0 and 9 and keep the word
        # only if it comes up higher than 8.
        old_len = len(just_words)
        just_words = [word for word in just_words if word.lower()
                      != hashtag.lower() or randint(0, 9) > 8]
        new_len = len(just_words)
        logger.debug(
            f"Removing {hashtag} removed {old_len-new_len} words, leaving {new_len}")
    just_words = ' '.join(just_words)
    # the regex used to detect words: 2+ consecutive word characters
    # (apostrophes are allowed after the first character), e.g. It's
    normal_word = r"(?:\w[\w']+)"
    font_path = font
    avail_cmaps =  ['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2',
                    'Set1', 'Set2', 'Set3', 'tab10', 'tab20', 'tab20b',
                    'tab20c']
    cmap = choice(avail_cmaps)
    logger.info(f"Chose colormap: {cmap}")
    wc = WordCloud(font_path=font_path, width=size_x, height=size_y, max_font_size=360,
                   max_words=200, regexp=normal_word, scale=1.4, prefer_horizontal=0.60,
                   font_step=1, relative_scaling=0.1, repeat=False, stopwords=stop_words,
                   margin=4, min_word_length=4,colormap=cmap).generate(just_words)

    # Create graphs directory if it doesn't exist
    graphs_dir = "wordcloud"
    try:
        os.makedirs(graphs_dir, exist_ok=True)
        logger.debug(f"Ensured wordcloud directory exists: {graphs_dir}")
    except Exception as e:
        logger.error(f"Failed to create graphs directory: {e}")
        raise

    # Create the wordcloud filename with wordcloud-hashtag-YYYYMMDD-hashtag_fix pattern
    graph_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.png")
    alt_text_file_name = os.path.join(graphs_dir, f"wordcloud-{hashtag}-{date_str}-{hashtag_fix}.txt")

    # Generate alt text description
    just_words =  [word.lower() for word in just_words.split(' ') if word.lower() not in stop_words]
    word_counts = Counter(just_words)
    total_unique_words = len(word_counts)
    top_words = word_counts.most_common(10)

    event_date = datetime.fromisoformat(f"{year}-{month}-{day}")
    # Build the day number by hand: "%e" is not available on every platform
    nice_date = f"{event_date:%A}, {event_date.day} {event_date:%b %Y}"

    # Format the alt text
    alt_text =  f"The word cloud for {nice_date}. Words are larger the more frequently "\
        "they appeared in posts.\n"
    alt_text += f"There were {total_unique_words} unique words posted, and " \
                f"the wordcloud shows the {len(wc.words_.keys())} most frequent.\n"
    alt_text += "Top 10 most frequent words were:\n"
    alt_text += ", ".join(f"{word}: {count}" for word, count in top_words) + ".\n"

    # List everything that was deliberately excluded: custom stop words from
    # the config and, when hashtag_fix is 'remove', the hashtag itself.
    excluded = []
    if config.has_option('wordcloud', 'stop_words'):
        custom_stop_words = config.get('wordcloud', 'stop_words')
        excluded = [w.strip() for w in custom_stop_words.split(',') if w.strip()]
    if hashtag_fix == 'remove':
        excluded.append(f"the hashtag {hashtag}")
    if excluded:
        alt_text += "These words were excluded from the word cloud: "
        alt_text += ", ".join(excluded) + "."
    plt.style.use('dark_background')
    plt.figure(figsize=(13, 9))
    plt.axis("off")
    plt.gca().set_position([0, 0, 1, 1])
    plt.imshow(wc, interpolation='bilinear')
    try:
        plt.savefig(graph_file_name, format="png")

        # Save the alt text to a file
        with open(alt_text_file_name, 'w') as alt_file:
            alt_file.write(alt_text)
            logger.info(f"Saved alt text to {alt_text_file_name}")

    except Exception as e:
        logger.error(f"Failed to save {graph_file_name}")
        logger.error(e)
        raise
    else:
        logger.info(f"Saved {graph_file_name}")
