Methodology

This doc describes how we go about fetching toots for a hashtag. A lot of programs will just query their local server, get whatever their local server saw, and stop. This software queries as many servers as it can find, including servers that are not directly connected to the starting server.

The goal is to look at all the toots on Mastodon (or compatible ActivityPub servers) related to a specific hashtag. You might want to brush up on how ActivityPub works if you're unfamiliar. The software uses the Mastodon.py library to call the Mastodon API, and does not use any other APIs at this time. So even though some other services are more or less compatible with ActivityPub and Mastodon (like Lemmy and Friendica), this software does not handle them yet.

1. Start somewhere

First we start with whatever server is configured in fetch:api_base_url using fetch:cred_file credentials. That is, we are logging into our home server with our home identity. We call the timelines/tag API to get all the toots on that hashtag that our home server knows about. We use fetch:max to limit how many we might fetch, and we use fetch:start_date and fetch:end_date to limit ourselves to a certain date range.
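
In Mastodon.py terms, this first fetch looks roughly like the sketch below. The function name and the pagination loop are illustrative, not the exact code; the real values come from fetch:api_base_url, fetch:cred_file, fetch:max, fetch:start_date, and fetch:end_date in the configuration.

```python
from mastodon import Mastodon

def fetch_home_toots(api_base_url, cred_file, hashtag, max_toots, start_date, end_date):
    """Illustrative sketch: pull the hashtag timeline from the home server."""
    mastodon = Mastodon(access_token=cred_file, api_base_url=api_base_url)
    toots = []
    page = mastodon.timeline_hashtag(hashtag, limit=40)
    while page and len(toots) < max_toots:
        # keep only toots inside the configured date window
        toots.extend(t for t in page if start_date <= t.created_at <= end_date)
        page = mastodon.fetch_next(page)  # follow pagination until the server runs out
    return toots[:max_toots]
```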

2. Identify remote servers that we have discovered

Now we look at all the toots we got from our home server. We parse the uri attributes of the toots and pull out the server names. A toot that is delivered to us from its originating server is called a local toot in the code, and all others are non-local. Note that server A sends us copies of toots from server B that reflect the state of each toot as of the last time server A fetched it. We run through all the toots that we got from server A and note down servers B, C, D, etc.
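
Pulling the server names out of the uri attribute is just URL parsing. Something like this hypothetical helper:

```python
from urllib.parse import urlparse

def servers_from_toots(toots, home_server):
    """Collect the set of originating servers named in a batch of toots."""
    servers = set()
    for toot in toots:
        server = urlparse(toot["uri"]).netloc  # e.g. "social.example.net"
        if server and server != home_server:
            servers.add(server)
    return servers
```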

3. Connect to all remote servers we have discovered

As we discover servers, we keep track of them on a list of servers to do. One by one we contact them and try to use the public timelines/tag API on each server. If a server will talk to us, we record its toots. If a server has activated authorized_fetch, we will not get any toots from it. That is a pretty clear indication that it would not like us to use its data, so we stop there. It would be possible to use RSS or HTML implementations to fetch the content, but we want to respect the wishes of the servers that say "go away."
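
The attempt to talk to a remote server is unauthenticated, so a refusal (including an authorized_fetch server answering 401) just surfaces as an API error. A sketch, with the function name and error handling as assumptions:

```python
from mastodon import Mastodon, MastodonError

def try_remote_server(server, hashtag):
    """Try the public hashtag timeline on a remote server, anonymously.
    Returns a list of toots, or None if the server refuses or is unreachable."""
    try:
        remote = Mastodon(api_base_url=f"https://{server}", request_timeout=30)
        return remote.timeline_hashtag(hashtag, limit=40)
    except MastodonError:
        return None  # the caller puts this server on the failed list
```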

If a server refuses us, it's added to a failed server list. If a server allows us to pull the hashtag off its public APIs, we fetch all the toots they will give us. Again, bound by fetch:max, fetch:start_date, and fetch:end_date.

Eventually, this converges: we stop hearing about new servers, and we have tried every server we've heard of exactly once.
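
Put together, the crawl is a simple worklist loop. Reusing the hypothetical helpers sketched above:

```python
def crawl_servers(seed_toots, home_server, hashtag):
    """Contact newly discovered servers until no new ones turn up.
    Each server is tried exactly once; refusals go on the failed list."""
    all_toots = list(seed_toots)
    done = {home_server}
    failed = set()
    todo = servers_from_toots(seed_toots, home_server)
    while todo:
        server = todo.pop()
        done.add(server)
        toots = try_remote_server(server, hashtag)
        if toots is None:
            failed.add(server)
            continue
        all_toots.extend(toots)
        # servers we have not yet tried go onto the worklist
        todo |= servers_from_toots(toots, home_server) - done
    return all_toots, failed
```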

At this point we have tons of duplicates and tons of columns of data we don't need.

4. Discard unneeded content

For the most part, we care only about the author, the URL of the toot, the content, the date, and the number of boosts, favourites, and replies. Our dataframe starts with dozens of columns that we discard, and different servers send different columns, too. All this discarding is done at fetch time in toots2df().
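
A rough pandas equivalent of what toots2df() does is below. The exact column list and the flattening of the account record are assumptions of this sketch:

```python
import pandas as pd

KEEP_COLUMNS = ["uri", "url", "account", "content", "created_at",
                "reblogs_count", "favourites_count", "replies_count"]

def toots_to_df(toots):
    """Flatten toot dicts into a dataframe with only the columns we need.
    reindex() tolerates servers that omit some of these fields."""
    df = pd.DataFrame(toots)
    # reduce the nested account record to just the handle
    df["account"] = df["account"].apply(
        lambda a: a.get("acct") if isinstance(a, dict) else a)
    return df.reindex(columns=KEEP_COLUMNS)
```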

The fetch module creates a series of JSON files with the pared down set of data. It looks like:

journaldir/
├── YYYY/
│   ├── MM/
│   │   ├── DD/
│   │   │   ├── hashtag-YYYYMMDD-mastodon.example.social.json
│   │   │   ├── hashtag-YYYYMMDD-social.example.net.json
│   │   │   └── ...
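
Building one of those per-server journal files might look like this. The path layout follows the tree above; the function names and the serialization details are assumptions:

```python
from pathlib import Path

def journal_path(journaldir, hashtag, day, server):
    """journaldir/YYYY/MM/DD/hashtag-YYYYMMDD-server.json"""
    directory = Path(journaldir) / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{hashtag}-{day:%Y%m%d}-{server}.json"

def write_journal(journaldir, hashtag, day, server, df):
    """Write one server's slice of the day's toots as a JSON journal file."""
    df.to_json(journal_path(journaldir, hashtag, day, server),
               orient="records", date_format="iso")
```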

5. Deduplicate

We separate the toots into 2 piles: local and non-local. We discard any copies of toots from server B that we got from servers A, C, or D: if we got the authoritative one, we don't need other copies. Then we sort all the copies of non-local toots by URI and keep one copy of each. Generally speaking, we keep the one with the most reported boosts, favourites, and replies.
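
The dedup logic boils down to something like the sketch below. The fetched_from column (recording which server supplied each row) and the combined engagement tie-breaker are assumptions of this sketch, not the exact code:

```python
from urllib.parse import urlparse

def dedupe_toots(df):
    """Keep one row per toot URI: the authoritative local copy if we fetched
    one, otherwise the copy that reports the most engagement."""
    df = df.copy()
    df["is_local"] = df.apply(
        lambda row: urlparse(row["uri"]).netloc == row["fetched_from"], axis=1)
    df["engagement"] = (df["reblogs_count"] + df["favourites_count"]
                        + df["replies_count"])
    df = df.sort_values(["is_local", "engagement"], ascending=False)
    return (df.drop_duplicates(subset="uri", keep="first")
              .drop(columns=["is_local", "engagement"]))
```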

I routinely see 2000-2400 unique toots for #Monsterdon, but I end up fetching 80,000-90,000 toots from 100-120 servers. So we fetch a ton of duplicates and then discard down to the unique set. All this deduplication is done in get_toots_df().

After we deduplicate, we also have to eliminate self-replies. Many people start a top-level post, make it public, and then make all their future posts replies to that one. This gives the person's followers 2 ways to mute all the posts: option 1 is to mute the entire hashtag; option 2 is to mute just the "thread" of that root post. It also means, however, that the toot with the "most replies" might simply belong to a person who replied to themselves a lot. So we try to eliminate self-replies before determining which toot had the most replies.
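
Dropping self-replies only needs two fields from the original toot record. A sketch over the raw toot dicts, assuming those fields are still available at this stage:

```python
def drop_self_replies(toots):
    """Remove toots that are replies to their own author, so long
    reply-to-self threads don't skew the 'most replies' ranking."""
    return [t for t in toots
            if t.get("in_reply_to_account_id") is None
            or t["in_reply_to_account_id"] != t["account"]["id"]]
```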

6. Find the highest in each category

After that, we sort all the toots by number of "reblogs" (or "boosts"). We remove the top N most-boosted toots so that they can't also be the most favourited or most replied to. Then we figure out the top N most-favourited toots and remove them. Finally, we figure out the most replied-to toots. All of these are returned in a data structure from the analyse module.
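
The category selection is a repeated "take the top N, then remove them" pass over the dataframe. The value of N and the return shape here are illustrative:

```python
def top_toots(df, n=3):
    """Pick the top n toots per category, removing each category's winners
    before scoring the next so no toot appears in more than one list."""
    remaining = df.copy()
    results = {}
    for column in ("reblogs_count", "favourites_count", "replies_count"):
        winners = remaining.nlargest(n, column)
        results[column] = winners
        remaining = remaining.drop(winners.index)
    return results
```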

7. Do things with those results

There are currently 3 modules that take that analysis and do something. The post module takes the data and writes 4 toots: the first has general stats, and the next 3 are the most-boosted, most-favourited, and most-replied-to toots.
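
With Mastodon.py that amounts to four status_post() calls. Whether the 4 toots are threaded as replies to each other is an assumption in this sketch:

```python
def post_results(mastodon, summaries):
    """Post the stats toot followed by the three category winners,
    each as a reply to the previous toot (threading is an assumption)."""
    last = None
    for text in summaries:
        last = mastodon.status_post(text, in_reply_to_id=last)
    return last
```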

The graph module creates a PNG file of a graph with its corresponding alt text, but it doesn't post it.

Likewise, the wordcloud module creates a PNG file of a word cloud with its corresponding alt text, but it doesn't post it.
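
One way the wordcloud module could do this is with the wordcloud package; the package choice, the alt-text sidecar file, and its wording are all assumptions of this sketch:

```python
from wordcloud import WordCloud

def make_wordcloud(texts, png_path, alt_path):
    """Render a word cloud PNG from the toot texts and write its alt text
    next to it; posting happens later in the postgraphs action."""
    cloud = WordCloud(width=1200, height=675, background_color="white")
    cloud.generate(" ".join(texts)).to_file(png_path)
    with open(alt_path, "w") as f:
        f.write("Word cloud of the most common words in the hashtag's toots.")
```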

After those images and text files are created, the postgraphs action can be called to post them, assigning the alt text to the images. Right now, the postgraphs module doesn't correctly post them as a reply to the last toot; that's a bug that needs fixing.
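
Posting the images with their alt text is a media_post() plus status_post() pair, and passing in_reply_to_id is where the reply-threading fix would go. The function name and status text here are illustrative:

```python
def post_graphs(mastodon, images, in_reply_to=None):
    """Upload each PNG with its alt text, then attach them to one status.
    images is a list of (png_path, alt_text) pairs."""
    media = [mastodon.media_post(png, description=alt) for png, alt in images]
    return mastodon.status_post("Graphs for this hashtag",
                                media_ids=media,
                                in_reply_to_id=in_reply_to)
```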