The most important things about a player are his basketball skills and his physical tools. But setting those aside, there are still other important factors that influence a player's success: work ethic, character, goals, etc. It's not clear whether those things can be reliably predicted using only publicly available information accessed in a mostly automated manner, but it's worth a try. So that's the goal: come up with metrics for a player's character that can be generated through automated analysis of public data.
The challenge is, the data is messy. My current job involves working with clinical trial data, which comes from highly trained professionals in standardized formats - and still I complain that it's too messy. For this project, I certainly didn't anticipate finding any grand conclusions, and I didn't. But it was still fun and could lead to some interesting insights down the road.
For the list of players to analyze, I used DraftExpress's list of top 100 prospects accessed on September 20, 2014.
I had a lot of ideas for different types of analyses that could be run. The main problem for all of them is the data: how do we get it and how do we make a computer understand it? Due to time constraints, I decided to focus on data from Twitter because the data comes in a structured format and there is an easy API that allows access to a relatively large amount of data.
First, with the help of a Python script to provide some hints, I collected the Twitter usernames of the top 100 prospects. Almost all of the very top prospects had public Twitter accounts (Emmanuel Mudiay is the exception, as he seems to have recently deleted his). In total, 78/100 prospects had public Twitter accounts that I could find. Twitter's API makes it difficult to retrieve more than ~3000 Tweets per user, so I decided to just download the ~3000 most recent Tweets for each of the 78 prospects on Twitter. Most users had around 3000 Tweets, but some only lightly use Twitter. In total, I constructed a database of 164,546 Tweets.
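For what it's worth, the download loop is mostly pagination plus a cap. The Tweepy call is sketched in a comment; a fake page source stands in for the network here so the capping logic itself is runnable.

```python
from itertools import islice

# With Tweepy (the library I used), the paged download looks roughly like:
#   for tweet in tweepy.Cursor(api.user_timeline, screen_name=name, count=200).items(3000):
#       save(tweet)
# Twitter stops serving a user's history after roughly the 3000 most recent
# Tweets, hence the cap below.

def fetch_recent(pages, cap=3000):
    """Drain paged results (newest first) until the cap or the pages run out."""
    def flatten():
        for page in pages:
            yield from page
    return list(islice(flatten(), cap))

# Hypothetical heavy user: 3500 archived Tweets served in pages of up to 200.
fake_pages = ([f"tweet {i}" for i in range(j, min(j + 200, 3500))]
              for j in range(0, 3500, 200))
tweets = fetch_recent(fake_pages)
print(len(tweets))  # 3000
```

Light users simply run out of pages before hitting the cap, which is why some players contributed far fewer than 3000 Tweets.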
Karl Towns (@KATis32) retweeted this in 2012:
Piscataway looks awful . I feel like I'm in a movie . It's sad and cool at the same time
I guess that wasn't a good sign for my alma mater in the recruiting game. But his other Tweets made me realize that he's actually from Piscataway, and his other Piscataway-related Tweets are positive. Yet another reminder that "RT" doesn't mean "I agree that...", which is one of many factors complicating analysis of Twitter data.
Top basketball prospects form a pretty highly connected community. They very frequently Tweet to each other. There is probably some network analysis to be done here, but my eyes have glazed over at far too many protein-protein interaction networks for me to want to do that in my spare time. Instead, here is a quick summary of some common words, hashtags, and Twitter usernames Tweeted by the top 100 prospects. And no, I didn't know what all of those hashtags meant without looking them up :)
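A tally like that takes only a few lines with `collections.Counter`. The Tweets below are made up for illustration, and the hashtag/mention regexes are simplified (real Tweet entities are messier):

```python
import re
from collections import Counter

# Made-up Tweets standing in for rows from the database.
tweets = [
    "Great win tonight #blessed @teammate",
    "Workout done #grind #blessed",
    "@teammate see you at practice #grind",
]

# Count hashtags and @-mentions with simple (approximate) regexes.
hashtags = Counter(tag.lower() for t in tweets for tag in re.findall(r"#\w+", t))
mentions = Counter(m.lower() for t in tweets for m in re.findall(r"@\w+", t))

# Strip tags/mentions before counting plain words.
plain = [re.sub(r"[#@]\w+", " ", t).lower() for t in tweets]
words = Counter(w for t in plain for w in re.findall(r"[a-z']+", t))

print(hashtags.most_common(2))  # [('#blessed', 2), ('#grind', 2)]
```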
Sentiment analysis. It's a thing. I've heard about it. That was the extent of my knowledge going into this project, and I decided to change that.
Sentiment analysis is determining how positive or negative statements are, with the results typically combined to provide some broader insight (e.g. what percentage of people feel positively about Pepsi?). It can be done manually or automatically. Obviously "manual" sounds old and boring, so I decided to try automatic sentiment analysis. My goal was to find out whether certain players can be deemed more "positive" than others based solely on the contents of their Tweets.
To do automated sentiment analysis, I used TextBlob, a very user-friendly Python library. Basically I just pump in the data and it gives me two scores for each Tweet: polarity (how negative or positive the text is) and subjectivity (how objective or subjective it is).
I ran every Tweet through it and averaged the scores for each player, in the hope that some signal would be revealed when the data was aggregated.
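To make the aggregation step concrete, here is its shape. The real scores come from TextBlob (`TextBlob(text).sentiment` returns polarity in [-1, 1] and subjectivity in [0, 1]); the crude keyword scorer and the player data below are made-up stand-ins so the averaging is runnable on its own.

```python
from statistics import mean

# Real pipeline scores each Tweet with TextBlob:
#   from textblob import TextBlob
#   polarity, subjectivity = TextBlob(text).sentiment
# The stub below is a toy keyword lookup, NOT TextBlob's actual model.
def score(text):
    positive = {"great", "love", "blessed"}
    negative = {"awful", "sad", "hate"}
    words = text.lower().split()
    return (sum(w in positive for w in words)
            - sum(w in negative for w in words)) / max(len(words), 1)

# Made-up players and Tweets, shaped like the real database.
tweets_by_player = {
    "Player A": ["Great practice today", "Love this team"],
    "Player B": ["That loss was awful", "sad night"],
}

avg_polarity = {p: mean(score(t) for t in ts)
                for p, ts in tweets_by_player.items()}
```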
I guess it's not surprising that averaging together the sentiments of a ton of tweets tends to create one big cluster in the middle. The biggest outlier is Guillem Vives who Tweets in Spanish, so that is probably just noise (same for Ilimane Diop). Dorian Finney-Smith is a bit more positive and subjective than normal. Cliff Alexander is a bit more negative and objective than normal. The rest seems to be a blob.
But is there value in that blob, if we zoom in? I'm not so sure. Looking at the most negative Tweets finds things like, "LaMarcus Aldridge is too nasty smh!" (which is actually praise, in basketball slang). But in general, the classifier seems to get more right than wrong. So maybe this could be used as a metric of... something... but I would need to do some more research into studies on the value of this type of analysis. What I did read was mostly people saying that automated analysis is not accurate enough to be particularly useful.
A few years back, after a couple rather unfortunate incidents, Grant Hill and Jared Dudley starred in a commercial attempting to decrease the usage of anti-gay slurs as slang. What does our Tweet database say about that?
I was surprised to see that only 18 out of 164,546 Tweets contained the word "fag" or "faggot". Is that because the culture has changed and it's no longer seen as acceptable to use those words? Or is it that even mid-level 17-year-old NBA prospects know better than to say the wrong words in public? You need more than an SQL query to answer that question.
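The query itself really is a one-liner against the sqlite3 database. Here it is against a handful of made-up rows; note that `LIKE '%fag%'` also matches "faggot", so one pattern covers both words (at the cost of also matching any longer word containing that substring).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (player TEXT, text TEXT)")
# Made-up rows standing in for the real 164,546-row table.
conn.executemany("INSERT INTO tweets VALUES (?, ?)", [
    ("Player A", "great win tonight"),
    ("Player B", "that ref is a fag smh"),
    ("Player C", "practice was rough"),
])

# Substring match; case handling and word boundaries are glossed over here.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM tweets WHERE text LIKE '%fag%'"
).fetchone()
print(count)  # 1
```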
If automated sentiment analysis isn't good enough, what are the alternatives? Manual annotation is one. In addition to being more accurate, a critical advantage of manual sentiment analysis is that you can actually expand it to things besides just sentiment. You can categorize text into arbitrary categories. This broader analysis is called "content analysis".
Content analysis has been applied to athletes' Tweets previously, although the results don't seem particularly insightful. Tweets were divided into categories like "Tweets about their own team", "interactions with fans", "marketing/promotion", etc. Through this, some information can be gained about the priorities, motivations, and interests of athletes.
I decided to do something similar. After some thought, I narrowed my analysis down to four categories:
The problem with manual content analysis is that it's a ton of work to manually categorize 164,546 Tweets. Amazon Mechanical Turk could make short work of it, but I don't really want to spend money on this project. So I took a three-pronged approach to the annotation problem:
Let's look at an example. Going to my web interface, you might see something like this:
That Tweet you would categorize as "Basketball".
But there is trouble. "23054 completed, 141492 remaining"?? Even though I sat through an entire long boring wedding categorizing Tweets on my phone? That is not very good. With more time/money/effort, more Tweets could be categorized. But for now, 23054 is actually a lot of data.
One big problem with the categorization: a significant number were done by method #1 above, automatic bulk categorization. That could bias the results. The manually annotated ones were all sampled randomly, so you could argue that those form an unbiased sample, but I didn't track which Tweets were annotated manually and which automatically. Oh well. All I have to work with now is the pooled, biased categorizations.
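To illustrate the bias, here is a keyword-matching sketch of what a bulk pass like this can look like (the categories shown and the keywords are illustrative stand-ins, not the exact rules I used). Only Tweets containing obvious trigger words get swept up, and those Tweets are systematically different from the rest.

```python
# Illustrative keyword rules; NOT the actual rules used for the bulk pass.
KEYWORDS = {
    "Basketball": {"game", "practice", "hoops", "ballin"},
    "Girls": {"girl", "girls", "shawty"},
}

def bulk_categorize(text):
    """Return a category if an obvious trigger word appears, else None
    (None means the Tweet stays in the manual-annotation queue)."""
    words = set(text.lower().split())
    for category, triggers in KEYWORDS.items():
        if words & triggers:
            return category
    return None

print(bulk_categorize("Big game tonight"))    # Basketball
print(bulk_categorize("lol random thought"))  # None
```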
There are many other problems too. Different players use Twitter differently. Tweet frequencies differ, so 3000 Tweets might span a month for one player and years for another. Some players Tweet in foreign languages. And if anything too controversial is said, Twitter accounts can be deleted (I wasn't even looking for this, but I couldn't help but notice that Jahlil Okafor, Emmanuel Mudiay, Dakari Johnson, and Chris Walker all deleted their accounts at some point - probably others did too).
But let's ignore all those drawbacks for now and look at the results of the content analysis. Unsurprisingly, basketball was a popular topic for Tweeting amongst the top 100 prospects.
However, players exhibited heterogeneity in their propensities to Tweet about different subjects. After filtering out the players with fewer than 100 categorized Tweets, I was left with 52 players with decent sample sizes, an average of 426 categorized Tweets/player. The players with the highest percentages of Tweets falling into each category are shown in the following tables and scatter plots.
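The filtering and percentage computation is simple; here it is with made-up counts standing in for the real annotations:

```python
# Made-up per-player category counts (the real data has 78 players and four categories).
counts = {
    "Player A": {"Basketball": 90, "Girls": 20, "Other": 40},  # 150 categorized
    "Player B": {"Basketball": 10, "Girls": 5, "Other": 15},   # only 30
}

percentages = {}
for player, cats in counts.items():
    total = sum(cats.values())
    if total < 100:  # drop players with small samples
        continue
    percentages[player] = {cat: 100 * n / total for cat, n in cats.items()}

print(percentages["Player A"]["Basketball"])  # 60.0
```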
Just from glancing at the tables, it seems that there are some big names near the top for Basketball %. Is there a correlation between prospect rank and Tweet content, maybe due to self-censoring or self-promotion? Actually, no. The highest correlation coefficient between prospect ranks and one of these percentages is 0.26 between rank and Other %, which is not very large. The correlation for rank and Basketball % is slightly negative.
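For the curious, the correlation check is just Pearson's r over the per-player table. The ranks and percentages below are made up for illustration; the real numbers come from the 52-player table.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, in plain Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up prospect ranks vs. Other %.
ranks = [1, 5, 12, 30, 44, 60]
other_p = [10.0, 8.0, 14.0, 12.0, 18.0, 16.0]
print(round(pearson(ranks, other_p), 2))
```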
Speaking of correlations, Andrew and Aaron Harrison have almost identical Tweet category percentages. Which I guess should be expected.
Especially in light of Fun Tidbit #2, I can't help but wonder if the NBA's Michael Sam could be found through techniques like this. I imagine his Girls % would be quite low. I'm not sure how that makes me feel.
It is difficult to derive much insight from the type of analysis presented here because there is no validation. To build a predictive model, you need some type of "gold standard" for comparison, so you can confirm that your model actually works. For instance, in my current job I am trying to automatically detect errors in clinical trial data, so my gold standard is datasets before and after they were manually checked for errors by domain experts. If I can reproduce those results, then I have something. Rather than just gathering some data and jumping in with some generic analysis like I presented here, the real challenge is to define how we can be confident that any of these techniques are actually working and providing value. After that, I could imagine myriad sophisticated analytical techniques that could be applied. But a solid grounding in reality is the most important factor, and that is likely one of the largest challenges you face in data analysis.
One idea along those lines would be to look at historical data. That might not be available, of course. Twitter is pretty new. But at a minimum, you could start gathering data now on current prospects, which could then serve as historical data in the future. Maybe something like "percentage of Tweets about basketball" correlates well with how much a player improves after they enter the NBA. The only way to know is to do the grunt work of collecting, annotating, and (finally) analyzing that data.
I did almost all my analysis in Python because it's an incredibly elegant language and it makes me happy. The libraries I used are TextBlob (sentiment analysis), Beautiful Soup (screen scraping), Requests (downloading crap), Tweepy (easy Twitter API access), Flask (content analysis web UI), and sqlite3 (plenty fast for data storage and manipulation at this scale).
For this document, I used D3 to make the charts. I'm pretty new to D3, but somehow I managed to scrape something together.