Decide how you want to classify data

<aside> ➡️

A classifier adds classes or labels to data you have gathered. You do not have to create a classifier for Phoenix to work! You can skip this step and go straight to exploring your data. This section explains why you might want to create classifiers, and links to other sections with details on how to apply three different types of classifiers.

</aside>

Sometimes the information you have about each post or comment will not be enough to make sense of trends or to search for things relevant to your problem statement. In this case, you might want to classify your data. A classifier adds classes (or labels) to the data you have gathered: in other words, it adds another column to your data table containing information that is useful to you.

Classify authors to understand trends in who posts

If you are interested in understanding trends in who posts, then you will want to classify the accounts (or authors) in your data. For example, you might want to classify accounts into categories relating to their profession, such as “academic”, “media”, or “politician”. Or you may want to organise them into categories relating to their position on an issue, such as “pro-army” or “anti-government”.

Note that you will add these classes from your own knowledge; the system just applies the information you provide it. In other words, the system doesn’t tell you who is “academic” or “media”, but rather you tell the system what accounts are “academic” or “media”, and then that class (or label) is applied to the data you have gathered.

Author classes are added as a column to every row of the standard table. This means you will be able to see, for example, how many posts come from each class of author, such as whether there are more posts by pro-army or anti-government authors.
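To make the idea concrete, here is a minimal sketch in Python of what applying author classes amounts to. It is not how Phoenix is implemented; the account names, the `AUTHOR_CLASSES` lookup, the `author_class` column name and the "unclassified" default are all hypothetical and only serve to illustrate the principle that you supply the classes and they are copied onto each row.

```python
from collections import Counter

# Hypothetical lookup that you maintain from your own knowledge:
# account name -> class.
AUTHOR_CLASSES = {
    "prof_amina": "academic",
    "daily_herald": "media",
    "senator_k": "politician",
}

# Hypothetical rows standing in for the posts you have gathered.
posts = [
    {"author": "prof_amina", "text": "New paper out today."},
    {"author": "daily_herald", "text": "Council vote delayed."},
    {"author": "daily_herald", "text": "Weather warning issued."},
    {"author": "unknown_user", "text": "Hello world."},
]


def add_author_class(rows, lookup, default="unclassified"):
    """Return copies of the rows with an extra 'author_class' column."""
    return [
        {**row, "author_class": lookup.get(row["author"], default)}
        for row in rows
    ]


labelled = add_author_class(posts, AUTHOR_CLASSES)

# Count how many posts come from each class of author.
counts = Counter(row["author_class"] for row in labelled)
print(counts)
```

Accounts you have not classified fall into the default class here, which makes it easy to see how much of the data your list of accounts actually covers.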

<aside> ➡️ Jump to: post author classifier to learn how to classify authors in Phoenix.

</aside>

Classify text to understand trends in what is posted

If you are interested in understanding trends in what is posted, then you will want to classify the text of posts or comments (or both).

Keyword text classifier

One way to classify what is posted is to create classifications based on specific words or phrases that appear in the post or comment text. This kind of classification works well when you have a small list of keywords and phrases that are almost always associated with the class you are interested in. For example, it is useful when you want to classify text that contains hate speech, hashtags, group identifiers, specific events, or places.

A keyword text classifier will almost certainly produce some false positives (posts labelled with a class they do not belong to) and miss some relevant posts. In some contexts these issues will be overwhelming and the classification won’t be useful (e.g. some hate terms have multiple uses, including in common parlance).
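The sketch below illustrates keyword matching and its two typical failure modes. It is an assumption-laden illustration in Python, not Phoenix's own matching logic: the `LEXICON` contents, the class names and the whole-word, case-insensitive matching rule are all hypothetical choices made for the example.

```python
import re

# Hypothetical lexicon: class name -> keywords or phrases.
LEXICON = {
    "election": ["ballot", "polling station", "#vote2024"],
    "flooding": ["flood", "river burst"],
}


def classify_text(text, lexicon):
    """Return the sorted list of classes whose keywords appear in the text."""
    lowered = text.lower()
    matched = []
    for label, keywords in lexicon.items():
        for keyword in keywords:
            # Match the keyword as a whole word or phrase, ignoring case.
            pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
            if re.search(pattern, lowered):
                matched.append(label)
                break  # one matching keyword is enough for this class
    return sorted(matched)


# A correct match: the phrase really is about an election.
hit = classify_text("Queues at the polling station this morning", LEXICON)

# A false positive: "flood" is used figuratively, but still matches.
false_positive = classify_text("A flood of complaints about the bus timetable", LEXICON)

# A missed post: it is about flooding but uses none of the keywords.
missed = classify_text("Water is rising fast near the bridge", LEXICON)

print(hit, false_positive, missed)
```

Reviewing a sample of matched and unmatched posts in this way is one route to deciding whether a keyword list is precise enough to be useful, or whether a term has too many everyday meanings to keep.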

Building the list of keywords and phrases for each class will be the most time-consuming part of this process, and may require several iterations. As a starting point, you can try applying a “lexicon” that someone else has already built.

<aside> ➡️ Jump to: keyword text classifier to learn how to classify posts and comments using keywords in Phoenix.

</aside>

Complex (large language) models

One key limitation of keyword-based classifiers is that they don’t understand meaning in context. For example, the same word can mean very different things depending on how it’s used, which a simple keyword system will miss. To address this, you can use more advanced tools called language models. These models are trained on vast amounts of text to learn how words, phrases, and ideas typically relate to each other.

Put simply, for our use case, language models are much better than keyword classifiers at classifying complicated things such as broad topics (“politics” or “gender”) or tone (“intimidation”, “negative sentiment”, or “toxicity”).

<aside> ➡️ Phoenix makes available a few complex models that you can apply to your data. Jump to: apply a complex model.

</aside>