Sometimes the information you have about each post or comment you gathered is not enough to make sense of trends or to find content that is relevant to your problem statement. In that case, you can add a classification to your data – essentially, another column containing information that is useful to you.
If you are interested in understanding trends in who posts, then you will want to classify the accounts in your data. For example, you might want to classify accounts into categories relating to their profession, such as “academic,” “media,” or “politician.” Or you may want to organise them into categories relating to their position on an issue, such as “pro-army” or “anti-government”.
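As a rough illustration of what an author classification adds to your data, the sketch below attaches an extra field to each post based on who posted it. It is not how Phoenix does this internally; the account handles, the `AUTHOR_CLASSES` lookup, and the `classify_author` helper are all hypothetical names made up for this example.

```python
# Hypothetical hand-labelled lookup: account handle -> author class.
# These handles and classes are illustrative only.
AUTHOR_CLASSES = {
    "@dr_amina": "academic",
    "@daily_herald": "media",
    "@mp_jones": "politician",
}


def classify_author(handle, lookup, default="unclassified"):
    """Return the class for an account, ignoring case and surrounding spaces."""
    return lookup.get(handle.strip().lower(), default)


# A few example rows, as they might look after gathering posts.
rows = [
    {"author": "@Daily_Herald", "text": "Parliament votes on the new budget today"},
    {"author": "@dr_amina", "text": "Our new paper on rainfall patterns is out"},
    {"author": "@someone_else", "text": "Good morning everyone"},
]

# Add the classification as a new "column" on every row.
for row in rows:
    row["author_class"] = classify_author(row["author"], AUTHOR_CLASSES)

for row in rows:
    print(row["author"], "->", row["author_class"])
```

Accounts that are not in the lookup fall back to `unclassified`, which makes it easy to see how much of your data still needs a label.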
<aside> ➡️ In Phoenix, we call these author classifications. Jump to: classify authors.
</aside>
If you are interested in understanding trends in what is posted, then you will want to classify the text of posts or comments (or both).
One way to classify what is posted is to create classifications based on specific words or phrases that appear in the post or comment text. This kind of classification works well where we have a small list of keywords and phrases that are almost always associated with the class we are interested in. For example, this is useful when you want to classify text that contains hate speech, hashtags, group identifiers, specific events, and places.
A keyword text classifier will almost certainly produce some false positives (posts that contain a keyword but do not belong to the class) and will miss some relevant posts (posts that belong to the class but use none of your keywords). In some contexts these issues will be overwhelming and the classification won’t be useful, for example where hate terms have multiple uses, including in common parlance.
Building the list of keywords and features for each class will be the most time-consuming part of this process, and may require several iterations. As a starting point, you can try applying a “lexicon” that someone else has already built.
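To make the idea of keyword-based classification concrete, here is a minimal sketch that assigns classes to post text when any keyword or phrase from a small lexicon appears in it. It is an illustration under stated assumptions, not Phoenix's implementation: the `LEXICON` contents and the `build_patterns` and `classify_text` helpers are hypothetical, and whole-word, case-insensitive matching is just one reasonable choice.

```python
import re

# Hypothetical lexicon: class name -> keywords or phrases (illustrative only).
LEXICON = {
    "election": ["ballot", "polling station", "#vote2024"],
    "flooding": ["flood", "river burst"],
}


def build_patterns(lexicon):
    """Compile one case-insensitive pattern per class, matching whole words or phrases."""
    patterns = {}
    for label, terms in lexicon.items():
        alternatives = "|".join(re.escape(term) for term in terms)
        # The lookarounds stop a term matching inside a longer word.
        patterns[label] = re.compile(
            r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE
        )
    return patterns


def classify_text(text, patterns):
    """Return the sorted list of classes whose keywords appear in the text."""
    return sorted(label for label, pattern in patterns.items() if pattern.search(text))


PATTERNS = build_patterns(LEXICON)

posts = [
    {"id": 1, "text": "Long queues at the polling station this morning"},
    {"id": 2, "text": "A flood of complaints about the new bus routes"},
    {"id": 3, "text": "Nothing to report today"},
]

# Add the classification as a new field on every post.
for post in posts:
    post["classes"] = classify_text(post["text"], PATTERNS)
    print(post["id"], post["classes"])
```

Post 2 is labelled `flooding` even though it is about bus routes, which is the kind of false positive described above. Conversely, a post that only says “floods” would be missed by this sketch, because whole-word matching does not cover plurals unless you add them to the lexicon.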
<aside> ➡️ In Phoenix, we call these keyword text classifiers. Jump to: classify text features.
</aside>
The main limitation of a text classification that relies only on a few keywords or features is that it does not account for context. We can overcome this limitation to a certain extent by applying a classification that uses a large language model (LLM). LLMs are trained on large amounts of data to learn how language works. They can then use this knowledge to perform a variety of natural language processing (NLP) tasks, including classifying text.
Put simply, for our use case: you can train an LLM to recognise when a post or comment is saying a particular kind of thing. LLMs are much better than keyword classifiers at classifying complicated things such as broad topics (“politics” or “gender”). They can also classify tone, for example identifying “intimidation”, “negative sentiment” or “toxicity”.
<aside> ➡️ Phoenix has a few available models that you can re-use and an option to work with you to build your own model. Jump to: apply a complex model.
</aside>