Skip to main content

Predicting sentiment on internet pain forum users

This is an extension of a previous analysis in which I visualized text sentiment associated with different pain treatments in a large online chronic pain forum. I wanted to use some machine learning to determine whether the commenters would get better over time based on their comment frequency on treatment types, times of their comments, etc. Below is a feature list of my data, with a quick description of each:
chronic pain -- how often they commented in the 'chronic pain' forum
lower back pain -- same for 'lower back pain' forum
neck pain -- same for 'neck pain' forum
upper back pain -- same for 'upper back pain' forum
therapy -- how often they mentioned some sort of physical therapy, exercise, etc. in the first half of their comments
over the counter -- comment frequency of counter drugs (acetaminophen, naproxen, ibuprofen, etc)
opiod -- comment frequency of opiod drugs
muscle relaxant -- comment frequency of muscle relaxant drugs
benzodiazepine -- comment frequency of benzodiazepine drugs
hypnotics -- comment frequency of hypnotic drugs
anticonvulsants -- comment frequency of anticonvulsant drugs
antidepressants -- comment frequency of anti-depressants
steroids -- comment frequency of steroids
invasives -- comment frequency of invasive treatment (surgery)
numComments -- total number of comments in the first half of their comment history
numDiscusions -- total number of different discussion threads they commented in the first half of their history
1am-4am -- comment frequency between 1:00am to 4:59am (same for other time windows) ...
Sunday -- comment frequency on Sunday (same for other days) ...
Jan-Mar -- comment frequency between January 1st - March 31st (same for other month windows) ...
initSent -- averge sentiment score of the first half of their comment history
For a commenter to be included in the analysis, he/she must have had at least 4 comments separated by 3 days, and they must have made mention of at least 1 of the treatment categories listed above. Additionally, anyone with an anonymous handle was removed because I had trouble determining which commenter was unique. Finally, the average sentiment score of the first half of their comments had to be below -0.3 because I felt this may be a decent surrogate for a pain measurement. In other words, I was interested mostly in people who started out in pain and wanted to see if they would get better (as reflected by the sentiment in their comments). The features were calculated on the first half of their comments, and the task was to predict whether their average sentiment improved later in time, in the second half of comments.
Improvement ('improvement') was positive if their initSent was below -0.3 (so their comments are fairly negative), and the average of the last half of their comments increased by at least 0.3 (so the average of the last half of their comments was greater than 0). This is an arbitrary choice, but the reasoning is consistent with clinical measures of improvment, which is often determined by an average of 20 to 30% improvement in whatever is being measured.
Below is a printout of all the feature labels, and the outcome label, 'improvement':

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv("/Users/alexbaria/Desktop/Insight/Forum/ForumMLSheet_improvement.csv",
                  index_col=False, header=0)
#name of headers
print(list(data.columns))
features = data.ix[:,'chronic pain':'initSentiment']
X = features.values
y = data['improvement']
['chronic pain', 'lower back pain', 'neck pain', 'upper back pain', 'therapy', 'over the counter', 'opioid', 'muscle relaxant', 'benzodiazepine', 'hypnotics', 'anticonvulsants', 'antiedepressants', 'steroids', 'invasives', 'numComments', 'numDiscussions', '1am - 4am', '5am - 8am', '9am - 12pm', '1pm - 4pm', '5pm - 8pm', '9pm - 12am', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Jan-Mar', 'Apr-Jun', 'Jul-Sep', 'Oct-Dec', 'initSentiment', 'improvement']

Here's a quick look at what the data looks like:



data[0:9]
Out[2]:
chronic painlower back painneck painupper back paintherapyover the counteropioidmuscle relaxantbenzodiazepinehypnotics...WednesdayThursdayFridaySaturdayJan-MarApr-JunJul-SepOct-DecinitSentimentimprovement
00010001010...11000110-0.9024001
10100000000...11103100-0.6863251
20010410000...01000002-0.9554001
30100222222...11003000-0.8131331
41110001100...12101040-0.5447601
50010111111...00010200-0.9896001
60100444444...00000020-0.8649501
70100000000...01000020-0.7348501
81100400000...12000330-0.5460331
9 rows × 35 columns

So after whittling down all my data of over 100k comments, it has been reduced to commenters that made at least 4 comments, at least 3 days apart, totaling 440 commenters with at least 4 comments. That's something I'd like to work on later, to keep more data, because this is a serious degradation of the data set. Nonetheless, here's what it boils down to, in terms of positive and negative samples:



print(data.improvement.value_counts())
1    255
0    185
Name: improvement, dtype: int64

I also wrote a quick function for balancing positive and negative samples, just in case I want to check how it influences the results, as there are about 1.5 times more positive than negative samples in the data set:



def balanced_subsample(x,y):
    y = np.asarray(y)
    uy = np.unique(y)
    numElems = []
    # find minimum number of classes
    for yi in uy:
        idx = np.where(y == yi)
        idx = idx[0]
        numElems.append(len(idx))
    # determine what is the number of elements to return for each condition    
    min_elems = np.min(numElems)
    xbal = []
    ybal = []
    for yi in uy:
        # find all entries of this class
        idx = np.where(y == yi)
        idx = idx[0]
        # make a new array of only this class and shuffle it
        this_x = x[idx][:]
        this_y = y[idx]
        np.random.shuffle(this_x)
        # take only the first min_elems from this array
        this_x = this_x[0:min_elems][:]
        this_y = this_y[0:min_elems]
        # append to balanced data sets
        xbal.append(this_x)
        ybal.append(this_y)
    ybal = np.asarray(np.concatenate(ybal))
    xbal = np.asarray(np.concatenate(xbal))
    
    return xbal,ybal

Here comes the machine-learning part. I'm still trying to familiarize myself with all of Python's great ML tools, so here I'm starting with something simple, logistic regression. Below I've randomly split the data set into one that will be used to train the model, and on that can be used to test it. I've also normalized the feature set:



from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
#split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
#normalize the data
minmax_scaler = preprocessing.MinMaxScaler()
X_train_scaled = minmax_scaler.fit_transform(X_train)
X_test_scaled = minmax_scaler.fit_transform(X_test)

And here's the result. The percent of samples that were correctly classified in the training set was 65.4%, and the % correct in the test set (based on the model generated by the training set) was ~61.8% (not great, but not absolutely terrible for determining pain outcomes!):



logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
print('accuracy of training set is: ' + str(logreg.score(X_train_scaled, y_train)))
print('accuracy of test set is: ' + str(logreg.score(X_test_scaled, y_test)))
accuracy of training set is: 0.654545454545
accuracy of test set is: 0.618181818182

Let's see if this slightly above 50% accuracy has to do with the unbalanced data set. Remember there were about 1.5x more positive than negative samples. Here I've removed more data in the test set by making sure that the total number of positive and negative samples are equal by applying the 'balanced_subsample' function written above:



x0bal,y0bal = balanced_subsample(X_test_scaled,y_test)
print('accuracy is: ' + str(logreg.score(x0bal,y0bal)))
print('total number of samples in the test set after balancing: ' + str(y0bal.shape[0]))
print('total number of positive samples in this balanced set: ' + str(sum(y0bal)))
accuracy is: 0.630434782609
total number of samples in the test set after balancing: 92
total number of positive samples in this balanced set: 46


You can see that this reduced the data set to 92 total samples, 46 positive and 46 negative. The accuracy of the score doesn't change that much, so I'm going to keep working with the unbalanced set. A more suitable accuracy measure for unbalanced data sets is the f1-score, which is the harmonic mean of precision and recall (precision is a ratio of true positives to predicted positives, and recall is the ratio of true positives to actual positives). Below you can see the the average precision, recall, and f1-score are all around 0.6, which again isn't terrible but also not great:


pred_train_logreg = logreg.predict(X_train_scaled)
print('logistic regression, train: ')
print(classification_report(y_train,pred_train_logreg,
                           target_names=["no change","improved"]))
pred_test_logreg = logreg.predict(X_test_scaled)
print('logistic regression, test: ')
print(classification_report(y_test,pred_test_logreg,
                          target_names=["no change","improved"]))
logistic regression, train: 
             precision    recall  f1-score   support

  no change       0.63      0.44      0.52       139
   improved       0.67      0.81      0.73       191

avg / total       0.65      0.65      0.64       330

logistic regression, test: 
             precision    recall  f1-score   support

  no change       0.55      0.46      0.50        46
   improved       0.65      0.73      0.69        64

avg / total       0.61      0.62      0.61       110


Here are the theta weights of the model, printed as a bar chart, where the greater the magnitude of the bar indicates that feature contributed more to determining the outcome. Negative weights contribute more to no change (and / or decrease in future comment sentiment), while positive values contribute more to increased comment sentiment.


y1 = np.transpose(logreg.coef_)
x = np.arange(len(y1))
plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
ylab = data.columns[0:-1]
plt.barh(x,y1)
plt.yticks(x+0.5,ylab)
plt.ylabel("feature")
plt.xlabel("theta")
Out[9]:
<matplotlib.text.Text at 0x11665b438>

While the predictive power of the model is not that great, the feature weights make some sense in the realm of chronic pain research. Here, the greatest predictor for future decreased sentiment for a user is the initial sentiment ('initSentiment'). In the same vein, in clinical pain, one of the more reliable predictors of improvement is the initial reported pain (someone who reports higher pain early on is less likely to recover from their pain completely). Here, someone who has initial negative sentiment will continue to have negative sentiment in the later comments.
Additionally, here, frequency of opioid mentions ('opioid') is the strongest treatment indication that future sentiment will either not change, or will get worse. In general, those who are mentioning opioids are expressing negative sentiments which will not improve in future posts. Opioid use for chronic pain is currently one of the most pressing issues in the field due to patients suffering from it's addictive properties. That negativity seems to be expressed in this forum very clearly.
Some other interesting (albeit weaker) results (to me) are that people are less likely to have higher future sentiment scores the more frequently they post during the week, rather than the weekend (are these people who may be out of work due to their pain?). Also, the number of discussions they engage in (and not so much the total number of comments) is a relatively strong predictor that their sentiment will not improve (so perhaps they have many co-morbidities with their pain?). And finally, it appears that people who posted in the lower back pain forum more frequently had a better chance of increased future sentiment, relative to the other conditions (chronic, upper back, and neck pain).
Let's see if regularization can increase the performance of the model. I like to think of regularization as putting a sort of 'low-pass filter' around the decision boundary in the feature space, to reduce overfiting. My conceptualization of this may not be entirely accurate, but it is, in a sense, a way to reduce noise in the model, so I'm going to stick with this conceptuatlization for right now.
Anyway, overfitting is usually evident if you get really great model performance on the training set, and not-so-great performance on the test set. In my case, neither data set had stellar performance, so I don't think the model was overfit. But just to see what happens, let's go ahead and apply regularization. In Python, this can be performed with 'ridge classification'. The alpha parameter (ranging from 0 to 1) adjusts regularization, with higher values effectively increasing the width of the filter window. Here I got moderate improvement of the model by setting it to 0.5:


#try Ridge regression (Regularization)
from sklearn.linear_model import RidgeClassifier
ridge = RidgeClassifier(alpha=0.5,normalize=True).fit(X_train_scaled,y_train)
pred_ridge_train = ridge.predict(X_train_scaled)
pred_ridge_test = ridge.predict(X_test_scaled)
# ridge regression accuracy on test data
print("ridge regression, train: ")
print(classification_report(y_train, pred_ridge_train,
                          target_names=["no change","improved"]))
# ridge regression accuracy on test data
print("ridge regression, test: ")
print(classification_report(y_test, pred_ridge_test,
                          target_names=["no change","improved"]))
ridge regression, train: 
             precision    recall  f1-score   support

  no change       0.69      0.40      0.50       139
   improved       0.66      0.87      0.75       191

avg / total       0.67      0.67      0.65       330

ridge regression, test: 
             precision    recall  f1-score   support

  no change       0.57      0.46      0.51        46
   improved       0.66      0.75      0.70        64

avg / total       0.62      0.63      0.62       110


You can see that the f1-score increased by 0.01 for both training and test sets, so not a great improvement. Here are the theta weights of the ridge model:


y1 = np.transpose(ridge.coef_)
x = np.arange(len(y1))
plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
ylab = data.columns[0:-1]
plt.barh(x,y1)
plt.yticks(x+0.5,ylab)
plt.ylabel("feature")
plt.xlabel("theta")
Out[11]:
<matplotlib.text.Text at 0x119101cc0>

As expected, this looks very similar to the logistic regression model, although now the number of discussions (numDiscussions) edged out initial sentiment (initSentiment) to become the number one predictor for no improvement in future sentiment. Additionally, frequency of Friday / Saturday posting (as opposed to Sunday - Thursday) edged out steroids and posting frequency in the lower back pain forum, to become the greatest predictor of improved sentiment. I don't want to over-interpret the results, but even though the classification is not overwhelmingly impressive, the model itself makes some sense.
I'd like to make improvements in the near future by taking into account that commenters may have misspelled some treatment words, which I hope would increase my sample size. I also removed anonymous commenters -- there were thousands of anonymous comments. Finding a way to identify the unique ones would also increase my sample size. And of course, I'd like to engineer a few more features out of the data set if possible -- in some cases I might be able to glean sex of the commenter by their username, although this may be presumptuous. I'd also like to work determining the model cost as a function of number of samples to determine whether I need to extract more features or add more data overall, as suggested in Andrew Ng's machine learning course on Coursera.

Comments