Classify text features

If you are interested in understanding trends in what is posted, then you will want to classify the text of posts or comments (or both).

One way to classify what is posted is to create classifications based on specific words or phrases that appear in the post or comment text. This kind of classification works well where we have a small list of keywords and phrases that are almost always associated with the class we are interested in. For example, this is useful when you want to classify text that contains hate speech, hashtags, group identifiers, specific events, and places.

On the Phoenix platform, we call this a keyword text classifier. A keyword text classifier will automatically add classes to a post and/or comment when its text contains certain keywords.

<aside> 📌

A keyword text classifier will almost certainly produce some false positives (posts given a class they do not belong to) and miss some relevant posts. In some contexts these issues will be so frequent that the classification is not useful (e.g. some hate terms have multiple uses, including in common parlance). In these cases, you may want to consider applying a more complex model.

</aside>
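
To make these two failure modes concrete, here is a minimal Python sketch. It assumes whole-word, case-insensitive matching, which may not be exactly how Phoenix matches keywords. The function name `contains_keyword`, the keyword “attack”, and the example posts are all hypothetical.

```python
import re


def contains_keyword(text: str, keyword: str) -> bool:
    """Whole-word, case-insensitive match (an assumed matching rule)."""
    return keyword.lower() in re.findall(r"\w+", text.lower())


# Hypothetical posts for a class about violent incidents, keyword "attack".
posts = [
    "Militants attack the village at dawn",   # relevant and matched
    "I had a panic attack before my exam",    # not relevant, but matched (false positive)
    "The village was attacked at dawn",       # relevant, but "attacked" is not "attack" (missed)
]

results = [contains_keyword(post, "attack") for post in posts]
```

Under these assumptions, the second post is a false positive and the third is a missed post, even though the keyword looks like a reasonable choice.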

If a keyword text classifier seems appropriate, you can create one in three steps:

First, create the list of classes you want. You need to do this before you can add keywords to each class.
Next, add keywords to each class. You can do this in two ways:
1. If you add them together in one text box, separated by spaces, there will be an “AND” operator between them. This means a post is classified only if its text contains all of these words. For example, if you enter “war on terror”, the classifier will match posts that contain the word “war”, the word “on”, and the word “terror”, but not a post that contains only the word “war”.
2. If you add them separately, using the “+” to add a new text box, there will be an “OR” operator between them. For example, if you enter “war” and “terrorism” in two separate boxes, the classifier will match posts that contain the word “war”, the word “terrorism”, or both.
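
The AND/OR behaviour described above can be sketched in a few lines of Python. This is an illustration of the logic only, not Phoenix's implementation. The names `tokenize`, `box_matches`, and `matches_class` are hypothetical, and the sketch assumes case-insensitive whole-word matching in which word order does not matter.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"\w+", text.lower()))


def box_matches(words: set, box: str) -> bool:
    """AND: every word typed into one text box must appear in the post."""
    box_words = tokenize(box)
    # An empty box never matches; otherwise all box words must be present.
    return bool(box_words) and box_words <= words


def matches_class(text: str, keyword_boxes: list) -> bool:
    """OR: the class applies if at least one text box matches the post."""
    words = tokenize(text)
    return any(box_matches(words, box) for box in keyword_boxes)


# One box with three words: all three must be present (AND).
and_hit = matches_class("The war on terror continues", ["war on terror"])
and_miss = matches_class("The war is over", ["war on terror"])

# Two separate boxes: either word is enough (OR).
or_hit = matches_class("The war is over", ["war", "terrorism"])
or_miss = matches_class("Peace talks resume", ["war", "terrorism"])
```

Here `and_hit` and `or_hit` are `True`, while `and_miss` and `or_miss` are `False`. Because this sketch ignores word order, it would also match a post such as “Terror on the eve of war” against the single box “war on terror”.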

<aside> 💡

Building the list of keywords and features for each class will be the most time-consuming part of this process and may require iteration. As a starting point, you can try applying a “lexicon” that someone else has already built.

</aside>

Finally, once you have finished adding keywords to the classes, you can run the classifier. It will apply these classes to every post or comment that has already been gathered, and to data gathered in the future.