I currently work in a lab which studies the interaction between chronic pain and brain function. So I figured a semi-productive way to merge my current work with the aims of this blog would be to practice extracting clinical-pain-related data from the internet. Specifically, because I regularly talk to patients who have tried multiple treatments for their chronic pain, I wanted to get an idea of what the internet thought of these treatments.
I used BeautifulSoup in Python to scrape user comments from the first chronic pain forum listed on Google (www.spine-health.com). I scraped over 100k comments from forums related to chronic pain, lower back pain, upper back pain, and neck/cervical pain.
Interested in seeing which treatments were associated with the most positive (or negative) comments, I assessed the tone of forum comments using sentiment analysis in Python (the VADER analyzer in NLTK). Some examples of comments with their associated sentiment scores are listed below (scores range from -1 to 1, where -1 is most negative):
----------------------------------
sentiment = -0.88
"I've been seeing a chiropractor for 3 months and though I've seen a good improvement, I now have lost my job and I'm looking for new insurance to help with the cost of recovery. The doctor said I have degenerative disorder in c3-c7. 3 months ago my right arm went numb and I couldn't stand the pain. Now three months later I have recovered a lot but thinking I should've had surgery before now. My question today is do I keep doing what the chiropractor says or do I see a surgeon to speed things up? The doctor said that I may wind up paralyzed if I have surgery. Can that happen? I'm only 45 years old and I'm looking to get back to work. I still have difficulty doing nothing but even if do housework (I.e vacuuming, dishes, cleaning) I find myself sore for a couple days. Looking for advice on what to do as the depression of not working and having to think of a career change at this point in my life is getting worse."
sentiment = 0.37
"I had fusion with a plate on 3 discs and then the ADR on c3-4, if I didn't do it, I'm sure I'd be paralyzed.....recovery is not bad for me, uncomfortable foe a couple of months....the main thing is that LIFE has changed, I have to always take it easy and deal with things as they come. I'm 41 and the implant was called a P-Zero. hope that helps. Mark"
sentiment = 0.98
"Lala, you are so right. This year has been incredibly hard for me emotionally, and I really did get wrapped up into a pity party. But, really, I have lost so little. No, I don't have a well working body, but I get around. I have a husband who loves me and two wonderful children, and a roof over my head and food on the table. And if that's all I have right now (it isn't, but even if) it's a lot more than many. You know, last year during the summer, my dad got viral encephalitis. It reminded me of a friend of my mom's who got bacterial meningitis. The difference? My dad spent 3 months in the hospital and went home normal. My mom's friend spent 3 months in the hospital and went home with an IQ of 80- she started out a physicist and went home borderline mentally retarded. It was so devastating. I'm kind of babbling, but these are the kinds of things I think about- the sudden things that can hit you and change your life. For us, we fell off the horse and, voila our lives changed. But we still have our thoughts, our families. We can still walk. It's just pain, a change in our career path, etc. It didn't steal our life or our mind. Someone is interested in buying my horse, so I'm extra sappy today. I know this is the right thing still, but I'll miss that big goofball."
---------------------------------------
Additionally, as many forum users made more than one comment, I was interested to know whether the tone of their comments changed over time. To calculate this, I simply measured the slope of a linear fit to each user's sentiment scores over time.
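To make the slope idea concrete, here is a minimal sketch of an ordinary least-squares slope of sentiment score against time, written with the standard library only. The function name `sentiment_slope`, the choice of days as the time unit, and the example data are my own assumptions for illustration, not necessarily how the original analysis was coded.

```python
from datetime import datetime

def sentiment_slope(timestamps, scores):
    """Least-squares slope of sentiment score versus time, in score units per day."""
    if len(timestamps) != len(scores) or len(scores) < 2:
        raise ValueError("need at least two (time, score) pairs")
    t0 = min(timestamps)
    # Express each comment time as days elapsed since the user's first comment
    days = [(t - t0).total_seconds() / 86400.0 for t in timestamps]
    mean_x = sum(days) / len(days)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all comments share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, scores))
    return sxy / sxx

# Made-up example: one user's three comments, ten days apart, getting more negative
times = [datetime(2016, 1, 1), datetime(2016, 1, 11), datetime(2016, 1, 21)]
scores = [0.5, 0.0, -0.5]
slope = sentiment_slope(times, scores)
print(round(slope, 3))  # -0.05 (score units per day)
```

A negative slope means the user's comments became more negative over time. Users with a single comment have no slope and would need to be skipped.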
A few more notes before I get into the results:
1) I calculated sentiment scores from both raw-text comments and comments that had been cleaned using NLTK in Python (tokenization, removal of stop-words, and lemmatization). Scores were very similar before and after cleaning. I'm showing scores from the cleaned data below.
2) I removed comments from any commenter who had more than 50 comments, assuming these were admin comments. The threshold of 50 was semi-arbitrary: a quick perusal of commenters indicated it wasn't unusual for a user to have many dozens of comments, but most had fewer than 10.
3) I only analyzed comments that contained the name of a specific treatment. I searched for a total of 104 treatments (see below), so this is clearly not an exhaustive list, but it's a decent start.
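Notes 2 and 3 can be sketched roughly as follows. This is an illustrative sketch only: the dictionary layout of a comment, the helper names, and the three-word treatment list are assumptions (the full 104-term list is not reproduced here), and whole-word matching is just one reasonable way to search for treatment names.

```python
import re
from collections import Counter

MAX_COMMENTS = 50  # threshold mentioned in note 2
TREATMENTS = ["yoga", "surgery", "ibuprofen"]  # illustrative subset, not the real 104-term list

def drop_prolific_authors(comments, max_comments=MAX_COMMENTS):
    """Remove every comment by authors who posted more than max_comments times."""
    counts = Counter(c["author"] for c in comments)
    return [c for c in comments if counts[c["author"]] <= max_comments]

def treatments_mentioned(text, treatments=TREATMENTS):
    """Return the treatment names that appear as whole words in the text."""
    lowered = text.lower()
    return [t for t in treatments
            if re.search(r"\b" + re.escape(t) + r"\b", lowered)]

# Made-up data: 51 comments from one account, one comment from a regular user
comments = [{"author": "admin", "text": "Welcome to the forum"}] * 51
comments.append({"author": "pat", "text": "Yoga helped more than surgery did."})

kept = drop_prolific_authors(comments)
found = treatments_mentioned(kept[0]["text"])
print(len(kept), found)  # 1 ['yoga', 'surgery']
```

Comments for which `treatments_mentioned` returns an empty list would simply be left out of the analysis.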
An example of the sentiment scores in the general chronic pain section of the forum is shown below. Black data points are the sentiment scores of the comment containing the treatment word, with the median score of the treatment shown as the large red dot:
To make this more viewable, I grouped treatment words into 10 overall groups (therapy, over-the-counter, opioids, muscle relaxants, benzodiazepines, hypnotics, anti-convulsants, anti-depressants, steroids, and invasives). These are shown in the upper-left panels below. The right panels show the sentiment scores for single users (lines) over time, and the bottom-left panels show the linear slope of these sentiment scores over time:
Here are the comments (N = 15239 comments) from the general "Chronic pain" forum:
The "lower back pain" forum (N = 15262 comments):
The "upper back pain" forum (N = 3795 comments):
And the "neck pain" forum (N = 14497 comments):
The most obvious result is that the median sentiment scores are overwhelmingly negative. This is perhaps expected from people who are suffering from chronic pain and are seeking help. While negative overall, median scores were generally highest for therapy (physical therapy, yoga, exercise, etc.), invasives (surgery), and benzodiazepines (tranquilizers, typically prescribed short-term for anxiety-related disorders). Without a more complete patient history, however, it's difficult to put these scores in context. For example, people who find yoga to be a suitable treatment likely did not have an initial pain intensity that would have prompted more aggressive and risky treatments.
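For reference, the per-group medians discussed above could be computed along these lines. The mapping from treatment word to group, the function name, and the scores are hypothetical; the actual assignment of the 104 treatment words to the 10 groups is not shown in this post.

```python
from collections import defaultdict
from statistics import median

# Hypothetical word-to-group mapping, for illustration only
TREATMENT_GROUP = {
    "yoga": "therapy",
    "exercise": "therapy",
    "surgery": "invasives",
    "valium": "benzodiazepines",
}

def median_by_group(scored_mentions, mapping=TREATMENT_GROUP):
    """Collect sentiment scores per treatment group and return each group's median."""
    grouped = defaultdict(list)
    for treatment, score in scored_mentions:
        group = mapping.get(treatment)
        if group is not None:  # ignore words that have no group assigned
            grouped[group].append(score)
    return {group: median(scores) for group, scores in grouped.items()}

# Made-up (treatment word, sentiment score) pairs
mentions = [("yoga", 0.5), ("exercise", -0.25), ("yoga", -0.75),
            ("surgery", -0.5), ("surgery", 0.25)]
medians = median_by_group(mentions)
print(medians)  # {'therapy': -0.25, 'invasives': -0.125}
```

The median is less sensitive than the mean to the occasional strongly positive or negative comment, which is presumably why it works well as the summary dot in the figures.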
Additionally, sentiment scores tended to decrease over time overall, suggesting that people were perhaps getting worse on all the treatments. Two caveats apply, of course: 1) It's difficult to know whether a single user was commenting on the same experience over time; it's possible that they had multiple issues and commented on each separately on different days. 2) A linear fit may not have been appropriate for all forum users, and this estimate of the rate of treatment-related improvement could certainly itself be improved.
Below is the code I used to scrape the data and save it to a CSV file. Have a look and try modifying it for websites that interest you!
In [ ]:
# Import BeautifulSoup, the sentiment analyzer, and related tools
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv
import requests
import time
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The VADER lexicon must be available, e.g. via nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

# Initialize lists to hold the following information:
Authors = []      # Commenter names
Times = []        # Comment dates and times
Messages = []     # Comment text
Discussions = []  # Forum discussion link (different discussions for different topics)
Scores = []       # Compound polarity scores from SentimentIntensityAnalyzer

# At the time of writing this code, there were 158 pages of comments on the forum
endpage = 159

# Loop through each page, then through each discussion, then through each comment,
# and scrape the text
BASE_URL = "http://www.spine-health.com"
for page in range(1, endpage):  # page loop
    # this is the page link
    if page == 1:
        go_here_url = BASE_URL + "/forum/categories/chronic-pain"
    else:
        go_here_url = BASE_URL + "/forum/categories/chronic-pain/p" + str(page)
    # extract HTML code for the page link
    response = requests.get(go_here_url)
    soup = BeautifulSoup(response.text, "html.parser")
    for discussion_link in soup.select("td > div > a"):  # discussion loop
        # this is the link to the discussion topic
        link = urljoin(BASE_URL, discussion_link['href'])
        # Many elements matched by td > div > a are not links to a discussion;
        # we're only interested in those with 'discussion' included in the link
        if 'discussion' in link:
            # I like to have some output on the screen to know that the code is still
            # doing something, so I print the link for the discussion topic
            print(link)
            # If I didn't allow some sleep time between text extractions, I missed many
            # comments. I played around with different lengths of sleep, but 0.35 s
            # worked quickly enough for my purposes, without missing any comments
            time.sleep(0.35)
            # extract HTML code for the discussion topic
            discussion_response = requests.get(link)
            discussion_soup = BeautifulSoup(discussion_response.text, "html.parser")
            # extract time of each comment
            for whenyo in discussion_soup.select(".Permalink > time"):
                # append times to the Times list
                Times.append(str(whenyo.get("datetime")))
                # append the discussion link to the Discussions list
                Discussions.append(link)
            # extract author of each comment
            for author in discussion_soup.select(".Author > .Username"):
                # strip author information from HTML and append to Authors list
                Authors.append(author.text.strip())
            # extract the comment itself and get the sentiment score
            for message in discussion_soup.select(".Message"):
                # strip message from HTML code
                comment = message.text.strip()
                # append to the Messages list
                Messages.append(comment)
                # measure the sentiment of the comment and append to the Scores list
                ss = sid.polarity_scores(comment)
                Scores.append(ss["compound"])

# now we write the data to a CSV file called 'output.csv'
with open('output.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    # add a title row to the top of each column
    writer.writerow(['Discussions', 'Times', 'Authors', 'Messages', 'Scores'])
    # loop through each list simultaneously, and append the elements, columnwise, to the csv file
    for discussion, when, author, message, score in zip(Discussions, Times, Authors, Messages, Scores):
        writer.writerow([discussion, when, author, message, score])
# the with block closes the file automatically, so no explicit close() is needed