Jeremy's Draft Prospect Analysis

jdscheff@gmail.com
???-???-????
dumbmatter.com

by Jeremy Scheff

October 12, 2014

The Big Picture

The most important things about a player are his basketball skills and his physical tools. But beyond those areas, there are still some important factors that influence a player's success: work ethic, character, goals, etc. It's not clear whether those things can be reliably predicted using just publicly available information accessed in a mostly automated manner, but it's worth a try. So that's the goal: come up with metrics for a player's character that can be generated through automated analysis of public data.

The challenge is, the data is messy. My current job involves working with clinical trial data, which comes from highly trained professionals in standardized formats - and still I complain that it's too messy. For this project, I certainly didn't expect to find any grand conclusions, and I didn't. But it was still fun and could lead to some interesting insights down the road.

Data

For the list of players to analyze, I used DraftExpress's list of top 100 prospects accessed on September 20, 2014.

I had a lot of ideas for different types of analyses that could be run. The main problem for all of them is the data: how do we get it and how do we make a computer understand it? Due to time constraints, I decided to focus on data from Twitter because the data comes in a structured format and there is an easy API that allows access to a relatively large amount of data.

First, with the help of a Python script to provide me with some hints, I collected the Twitter usernames of the top 100 prospects. Almost all of the very top prospects had public Twitter accounts (Emmanuel Mudiay is the exception, as he seems to have recently deleted his). In total, 78/100 prospects had public Twitter accounts that I could find. Twitter's API makes it difficult to retrieve more than ~3000 Tweets per user, so I decided to just download the ~3000 most recent Tweets for the 78 prospects on Twitter. Most users had around 3000 Tweets, but some use Twitter only lightly. In total, I constructed a database of 164,546 Tweets.
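A minimal sketch of the download step, assuming a Tweepy-style `api` object whose `user_timeline` method accepts `screen_name`, `count`, and `max_id`. The function names and the `tweets` table layout are illustrative assumptions, not the schema actually used for this project.

```python
import sqlite3


def fetch_recent_tweets(api, screen_name, limit=3000, page_size=200):
    """Page backwards through a user's timeline until `limit` Tweets are collected."""
    tweets = []
    max_id = None
    while len(tweets) < limit:
        kwargs = {"screen_name": screen_name, "count": page_size}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = api.user_timeline(**kwargs)
        if not page:
            break  # the API has nothing older to return
        tweets.extend(page)
        # Ask for Tweets strictly older than the oldest one seen so far.
        max_id = min(tweet.id for tweet in page) - 1
    return tweets[:limit]


def store_tweets(conn, screen_name, tweets):
    """Save Tweets to a (hypothetical) `tweets` table, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, screen_name TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
        [(tweet.id, screen_name, tweet.text) for tweet in tweets],
    )
    conn.commit()
```

Paging with `max_id` stops on its own once the API runs out of older Tweets, so lightly active users simply end up with fewer rows.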

#    Name                     Twitter
1    Jahlil Okafor            @BigJah15
2    Emmanuel Mudiay          None
3    Karl Towns               @KATis32
4    Kelly Oubre              @K_Ctmd22
5    Cliff Alexander          @HumbleKid_2
6    Mario Hezonja            @MarioHezonja
7    Kristaps Porzingis       @kporzee
8    Justise Winslow          @Chief_Justise
9    Stanley Johnson          @StanMan_5
10   Willie Cauley-Stein      @THEwillieCS15
11   Sam Dekker               @samdek1
12   Montrezl Harrell         @MONSTATREZZ
13   Chris Walker             @kingsky23
14   Tyus Jones               @Tyusjones06
15   Rondae Hollis-Jefferson  @RondaeHJ23
16   Marc Garcia              None
17   Myles Turner             @Original_Turner
18   Caris LeVert             @CarisLeVert
19   Ilimane Diop             @eliimane
20   Bobby Portis             @BPortistime
21   Dakari Johnson           @SafariDakari44
22   Egemen Guven             @egemengven1
23   Jarell Martin            @JarellMartin22
24   Frank Kaminsky           @FSKPart3
25   Rashad Vaughn            @ShowtimeMr1
26   Amida Brimah             @amidabrimah
27   Wayne Selden             @WayneSeldenJr
28   Andrew Harrison          @DrewRoc5
29   Norman Powell            None
30   Marcus Lee               @SuperKingMe
31   Brice Johnson            @bjohnson_23
32   R.J. Hunter              @RJH_22
33   Aaron Harrison           @AaronICE2
34   Jabari Bird              None
35   Delon Wright             None
36   Mouhammadou Jaiteh       @mamjaiteh14
37   Theo Pinson              @tpinsonn
38   Kevon Looney             @Loon_Rebel5
39   Devin Booker             @DevinBook
40   Justin Jackson           @JJacks_44
41   Trey Lyles               @TreyMambaLyles
42   Devin Robinson           None
43   Alex Poythress           @AlexTheGreat22
44   Branden Dawson           @219MadeMe
45   Marcus Paige             @marcuspaige5
46   Domantas Sabonis         @Dsabonis11
47   Damian Jones             None
48   Shawn Long               None
49   E.C. Matthews            None
50   Jordan Mickey            @Jmickey_02
51   Nigel Williams-Goss      @nigelpah
52   Moses Kingsley           @KingMoses_
53   Michael Qualls           @Mr_WALKONAIR
54   Terry Rozier             @GodsGift_3
55   Sindarius Thornwell      @Sin_City_803
56   Marcus Foster            @MF2_KSU
57   Juwan Staten             @JuwanStaten3
58   Andzejs Pasecniks        @AnzejsP
59   Timothe Luwanu           None
60   Isaiah Taylor            @Zay_Ctmd11
61   Austin Nichols           @a_nichols33
62   Terran Petteway          None
63   Guillermo Hernangomez    None
64   Nikola Milutinov         @NMilutinov
65   A.J. Hammons             None
66   Kenan Sipahi             @KenanSpahi
67   Michael Frazier          @mfrazier20
68   Boris Dallo              @BorisDalloidol
69   Joel James               None
70   Ron Baker                @RDB_sh31ox
71   Buddy Hield              @buddyhield
72   Georges Niang            @GeorgesNiang20
73   Yogi Ferrell             @YogiFerrell
74   Kennedy Meeks            @Ksmoove03
75   Nedim Buza               None
76   Rysheed Jordan           @RysheedJordan
77   Moussa Diagne            None
78   Kaleb Tarczewski         @Tarczewski35
79   Aleksandar Vezenkov      None
80   Cedi Osman               @cedi_osman
81   Briante Weber            @VCU_Bandit2
82   Fred VanVleet            @CooFredVanVleet
83   Rasmus Larsen            @rasmusglarbjerg
84   Guillem Vives            @Vives16
85   LeBryan Nash             @LotBuckets02
86   Brandon Ashley           @_Bash21
87   Perry Ellis              @PElliz
88   Jerian Grant             @ThatGrant22
89   Joseph Young             @JoeyBuckets3
90   Cameron Ridley           @cam_ctmd55
91   Alan Williams            @alantwilliams
92   Adin Vrabac              None
93   Winston Shepard          @WinnShepard_35
94   Mike Tobey               @miketobey10
95   Josh Scott               None
96   Dejan Todorovic          @ToDeki94
97   Dorian Finney-Smith      @doedoe_10
98   Zak Irvin                @zirvin21
99   Karlo Zganec             None
100  Emircan Kosut            None

Top 100 Prospects

Players without Twitter accounts not shown.

Fun Tidbit #1

Karl Towns (@KATis32) retweeted this in 2012:

Piscataway looks awful . I feel like I'm in a movie . It's sad and cool at the same time

I guess that wasn't a good sign for my alma mater in the recruiting game. But his other Tweets made me realize that he's actually from Piscataway, and his other Piscataway-related Tweets are positive. Yet another reminder that "RT" doesn't mean "I agree that...", which is one of many factors complicating analysis of Twitter data.

Common Words, Hashtags, and Usernames

Top basketball prospects form a pretty highly connected community. They very frequently Tweet to each other. There is probably some network analysis to be done here, but my eyes have glazed over at far too many protein-protein interaction networks for me to want to do that in my spare time. Instead, here is a quick summary of some common words, hashtags, and Twitter usernames Tweeted by the top 100 prospects. And no, I didn't know what all of those hashtags meant without looking them up :)

Most Common Words
  • thanks
  • tonight
  • birthday
  • people
  • congrats
  • really
  • better
  • morning
  • school
  • follow

Well that is boring...

Most Common Hashtags
  • #blessed
  • #tbt
  • #1
  • #wps
  • #riseandgrind
  • #beardown
  • #oomf
  • #unc
  • #respect
  • #chap

No #whitegirlwednesday? Marshall Henderson would have made this more interesting. I hope he's doing okay these days.

Most Common Usernames
  • @tyusjones06
  • @bigjah22
  • @showtimemr
  • @aaronice2
  • @k_ctmd22
  • @drewroc5
  • @cwalkertime23
  • @tpinsonn
  • @dslowmotion22
  • @a_nichols33

The most common username not in the top 100 prospects list belongs to Gary Harris, who was recently drafted.
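Counts like the ones above can be produced with a few regular expressions and `collections.Counter`. This is only a sketch: the `min_word_length` filter is an assumed stand-in for whatever stop-word handling was actually used, and `top_tokens` is a made-up name.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
WORD_RE = re.compile(r"[a-z']+")
# Hashtags, mentions, and URLs are removed before counting plain words.
NON_WORD_RE = re.compile(r"[#@]\w+|https?://\S+")


def top_tokens(tweets, n=10, min_word_length=6):
    """Return the n most common words, hashtags, and mentions in `tweets`."""
    words, hashtags, mentions = Counter(), Counter(), Counter()
    for text in tweets:
        lowered = text.lower()
        hashtags.update(HASHTAG_RE.findall(lowered))
        mentions.update(MENTION_RE.findall(lowered))
        plain = NON_WORD_RE.sub(" ", lowered)
        # Assumption: dropping short words is a crude substitute for a stop-word list.
        words.update(w for w in WORD_RE.findall(plain) if len(w) >= min_word_length)
    return {
        "words": words.most_common(n),
        "hashtags": hashtags.most_common(n),
        "mentions": mentions.most_common(n),
    }
```

Lowercasing everything first is why the lists above show handles like @tyusjones06 rather than @Tyusjones06.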

Automated Sentiment Analysis

Sentiment analysis. It's a thing. I've heard about it. That was the extent of my knowledge going into this project, and I decided to change that.

Sentiment analysis means determining how positive or negative individual statements are; the results are then typically combined to provide some insight (e.g. what percentage of people feel positively about Pepsi?). It can be done manually or automatically. Obviously "manual" sounds old and boring, so I decided to try automatic sentiment analysis. My goal was to find out whether certain players can be deemed more "positive" than others based solely on the contents of their Tweets.

To do automated sentiment analysis, I used TextBlob, a very user-friendly Python library. Basically I just pump in the data and it gives me two scores for each tweet:

  1. Polarity: -1 to 1, negative to positive
  2. Subjectivity: 0 to 1, objective to subjective

I ran every Tweet through there and averaged across players, in the hope that some signal would be revealed when the data was aggregated.
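Roughly, the scoring and averaging could look like the sketch below. `TextBlob(text).sentiment` is the library's real interface and returns a polarity and a subjectivity; the function names and the `(player, text)` input shape are assumptions made for illustration.

```python
from collections import defaultdict


def textblob_scores(text):
    """Return (polarity, subjectivity) for one Tweet using TextBlob."""
    # Imported lazily so the averaging code below also works with another scorer.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def average_sentiment(tweets, score=textblob_scores):
    """tweets: iterable of (player, text). Returns {player: (mean polarity, mean subjectivity)}."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for player, text in tweets:
        polarity, subjectivity = score(text)
        entry = totals[player]
        entry[0] += polarity
        entry[1] += subjectivity
        entry[2] += 1
    return {
        player: (polarity_sum / count, subjectivity_sum / count)
        for player, (polarity_sum, subjectivity_sum, count) in totals.items()
    }
```

Each player then becomes a single (polarity, subjectivity) point, which is why heavy Tweeters tend to drift toward the middle of the cloud.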

[Scatter plot of average Tweet polarity and subjectivity, one point per player]

I guess it's not surprising that averaging together the sentiments of a ton of Tweets tends to create one big cluster in the middle. The biggest outlier is Guillem Vives, who Tweets in Spanish, so that is probably just noise (same for Ilimane Diop). Dorian Finney-Smith is a bit more positive and subjective than normal. Cliff Alexander is a bit more negative and objective than normal. The rest seems to be a blob.

But is there value in that blob, if we zoom in? I'm not so sure. Looking at the most negative Tweets turns up things like, "LaMarcus Aldridge is too nasty smh!", which is really a compliment. But in general, the classifier seems to get more right than wrong. So maybe this could be used as a metric of... something... but I would need to do some more research into studies on the value of this type of analysis. What I did read was mostly people saying that automated analysis is not accurate enough to be particularly useful.

Fun Tidbit #2

Did this actually work?

A few years back, after a couple rather unfortunate incidents, Grant Hill and Jared Dudley starred in a commercial attempting to decrease the usage of anti-gay slurs as slang. What does our Tweet database say about that?

I was surprised to see that only 18 out of 164,546 Tweets contained the word "fag" or "faggot". Is that because the culture has changed and it's no longer seen as acceptable to use those words? Or is it that even mid-level 17-year-old NBA prospects know better than to say the wrong words in public? You need more than an SQL query to answer that question.
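For reference, 18 out of 164,546 works out to roughly 0.01% of Tweets. A query of this kind might look like the sketch below, which assumes a hypothetical `tweets` table with a `text` column and uses neutral search terms in the usage example.

```python
import sqlite3


def count_tweets_with_terms(conn, terms):
    """Count Tweets whose text contains any of `terms` (case-insensitive substring match).

    `terms` must be non-empty. Returns (matching Tweets, total Tweets).
    """
    where = " OR ".join("LOWER(text) LIKE ?" for _ in terms)
    params = ["%" + term.lower() + "%" for term in terms]
    matched = conn.execute(
        "SELECT COUNT(*) FROM tweets WHERE " + where, params
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    return matched, total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany(
        "INSERT INTO tweets (text) VALUES (?)",
        [("Rise and GRIND",), ("game tonight",)],
    )
    matched, total = count_tweets_with_terms(conn, ["grind"])
    print("%d of %d Tweets (%.2f%%)" % (matched, total, 100.0 * matched / total))
```

A substring match also catches longer words that merely contain a search term, so the matching Tweets are worth reading by hand before quoting a number.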

Manual Content Analysis

If automated sentiment analysis isn't good enough, what are the alternatives? Manual annotation is one. In addition to being more accurate, a critical advantage of manual annotation is that you can expand it to things besides just sentiment. You can categorize text into arbitrary categories. This broader analysis is called "content analysis".

Content analysis has been applied to athletes' Tweets previously [1], although the results don't seem particularly insightful. Tweets were divided into categories like "Tweets about their own team", "interactions with fans", "marketing/promotion", etc. Through this, some information can be gained about the priorities, motivations, and interests of athletes.

I decided to do something similar. After some thought, I narrowed my analysis down to four categories:

  • Basketball
  • Girls
  • Other
  • N/A

The problem with manual content analysis is that it's a ton of work to manually categorize 164,546 Tweets. Amazon Mechanical Turk could make short work of it, but I don't really want to spend money on this project. So I took a three-pronged approach to the annotation problem:

  1. I bulk-categorized several thousand Tweets if they had easy-to-identify categories (such as Tweets with a hashtag about the McDonald's All American game).
  2. I built a web interface I could use to easily categorize Tweets.
  3. Since I am semi-famous in a very small basketball-related part of the Internet, I recruited some of my acolytes to help categorize Tweets.
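The first prong, bulk categorization, could be as simple as a hashtag lookup. In this sketch the rule table, the `#mcdaag` hashtag, and the "bulk" source tag are all illustrative assumptions rather than the rules actually used.

```python
import re

# Hypothetical rules mapping a lowercase hashtag to a category.
BULK_RULES = {
    "#mcdaag": "Basketball",  # assumed hashtag for the McDonald's All American game
}


def bulk_categorize(tweets, rules=BULK_RULES):
    """tweets: iterable of (tweet_id, text).

    Returns {tweet_id: (category, "bulk")} for Tweets matching a rule;
    everything else is left for manual annotation.
    """
    labels = {}
    for tweet_id, text in tweets:
        hashtags = {tag.lower() for tag in re.findall(r"#\w+", text)}
        for tag, category in rules.items():
            if tag in hashtags:
                labels[tweet_id] = (category, "bulk")
                break
    return labels
```

Storing a source tag next to each label is a design choice that keeps rule-based labels separable from randomly sampled manual ones later on.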

Let's look at an example. Going to my web interface, you might see something like this:

RT @23the_King: Rewatching the Howard Pulley vs. BABC game, @Tyusjones06 is too smooth! #FavPG

You would categorize that Tweet as "Basketball".

But there is trouble. "23054 completed, 141492 remaining"?? Even though I sat through an entire long boring wedding categorizing Tweets on my phone? That is not very good. With more time/money/effort, more Tweets could be categorized. But for now, 23054 is actually a lot of data.

One big problem with the categorization: a significant number were done by method #1 above, automatic bulk categorization. That could bias the results. The manually annotated ones were all sampled randomly, so you could argue that they form an unbiased sample, but I didn't track which ones were annotated manually and which automatically. Oh well. All I have to work with now is the pooled, biased categorizations.

There are many other problems too. Different players use Twitter differently. Tweet frequencies are different, so for some 3000 Tweets might be a month and for others it might be years. Some players use foreign languages. And if anything too controversial is said, Twitter accounts can be deleted (I wasn't even looking for this, but I couldn't help but notice that Jahlil Okafor, Emmanuel Mudiay, Dakari Johnson, and Chris Walker all deleted their accounts at some point - probably others did too).

But let's ignore all those drawbacks for now and look at the results of the content analysis. Unsurprisingly, basketball was a popular topic for Tweeting amongst the top 100 prospects.

Category    Count
Basketball  12563
Girls       2416
Other       7630
N/A         445

There was a 5:1 ratio of Tweets about basketball to Tweets about girls.

However, players exhibited heterogeneity in their propensities to Tweet about different subjects. After filtering out the players with fewer than 100 categorized Tweets, I was left with 52 players with decent sample sizes, an average of 426 categorized Tweets/player. The players with the highest and lowest percentages of Tweets falling into each category are shown in the following tables and scatter plots.

Highest Basketball %

Name                     Basketball %
Justin Jackson           83.1%
Theo Pinson              80.5%
Stanley Johnson          77.0%
Dakari Johnson           76.7%
Karl Towns               74.1%

Highest Girls %

Name                     Girls %
Montrezl Harrell         52.0%
Briante Weber            50.2%
Buddy Hield              37.3%
Terry Rozier             34.8%
Marcus Lee               30.8%

Highest Other %

Name                     Other %
Austin Nichols           59.3%
Amida Brimah             57.1%
Kennedy Meeks            54.4%
Zak Irvin                52.4%
Brice Johnson            52.3%

Lowest Basketball %

Name                     Basketball %
Briante Weber            21.7%
Montrezl Harrell         26.5%
Terry Rozier             27.6%
Rondae Hollis-Jefferson  30.2%
Amida Brimah             34.3%

Lowest Girls %

Name                     Girls %
Justin Jackson           0.3%
Stanley Johnson          0.6%
Dakari Johnson           0.6%
Theo Pinson              1.1%
Tyus Jones               1.3%

Lowest Other %

Name                     Other %
Karl Towns               13.8%
Justin Jackson           15.7%
Theo Pinson              18.1%
Montrezl Harrell         21.2%
Aaron Harrison           21.9%

[Scatter plots of per-player Tweet category percentages]
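The per-player percentages can be computed along these lines. This sketch assumes the labels arrive as `(player, category)` pairs and that N/A Tweets stay in the denominator, which would be consistent with the three listed percentages for a player summing to slightly under 100%. The function name is made up.

```python
from collections import Counter, defaultdict

CATEGORIES = ("Basketball", "Girls", "Other")


def category_percentages(labels, min_tweets=100):
    """labels: iterable of (player, category). Returns {player: {category: percent}}.

    Players with fewer than `min_tweets` categorized Tweets are dropped.
    """
    counts = defaultdict(Counter)
    for player, category in labels:
        counts[player][category] += 1

    result = {}
    for player, counter in counts.items():
        total = sum(counter.values())  # assumption: includes N/A Tweets
        if total < min_tweets:
            continue
        result[player] = {c: 100.0 * counter[c] / total for c in CATEGORIES}
    return result
```

Sorting the result by one category, for example `sorted(result, key=lambda p: result[p]["Basketball"], reverse=True)[:5]`, gives a top-five list like the ones in the tables.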

Just from glancing at the tables, it seems that there are some big names near the top for Basketball %. Is there a correlation between prospect rank and Tweet content, maybe due to self-censoring or self-promotion? Actually, no. The highest correlation coefficient between prospect ranks and one of these percentages is 0.26 between rank and Other %, which is not very large. The correlation for rank and Basketball % is slightly negative.
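The correlation check is a plain Pearson coefficient between each player's prospect rank and one of the percentages. Here is a dependency-free sketch; the `pearson` helper is written out for illustration and is not necessarily how the numbers above were computed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)
```

Because rank 1 is the best prospect, a positive coefficient between rank and Other % means lower-ranked players Tweet somewhat more about other things.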

Speaking of correlations, Andrew and Aaron Harrison have almost identical Tweet category percentages, which I guess should be expected.

Especially in light of Fun Tidbit #2, I can't help but wonder if the NBA's Michael Sam could be found through techniques like this. I imagine his Girls % would be quite low. I'm not sure how that makes me feel.

If I Had More Time...

It is difficult to derive much insight from the type of analysis presented here because there is no validation. To build a predictive model, you need some type of "gold standard" for comparison, so you can confirm that your model actually works. For instance, in my current job I am trying to automatically detect errors in clinical trial data, so my gold standard is datasets before and after they were manually checked for errors by domain experts. If I can reproduce those results, then I have something. Rather than just gathering some data and jumping in with some generic analysis like I presented here, the real challenge is to define how we can be confident that any of these techniques are actually working and providing value. After that, I could imagine myriad sophisticated analytical techniques that could be applied. But a solid grounding in reality is the most important factor, and that is likely one of the largest challenges you face in data analysis.
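As a concrete illustration of checking against a gold standard, the sketch below compares automated labels with a hand-checked set and reports overall accuracy plus per-category recall. The dictionary inputs and the `evaluate` name are assumptions; no such validation was run for this project.

```python
from collections import Counter


def evaluate(predicted, gold):
    """predicted, gold: dicts mapping tweet_id -> label.

    Only Tweets present in both are scored.
    Returns (accuracy, {label: recall}).
    """
    shared = predicted.keys() & gold.keys()
    if not shared:
        raise ValueError("no overlapping Tweets to evaluate")
    correct = Counter()
    total = Counter()
    for tweet_id in shared:
        total[gold[tweet_id]] += 1
        if predicted[tweet_id] == gold[tweet_id]:
            correct[gold[tweet_id]] += 1
    accuracy = sum(correct.values()) / len(shared)
    recall = {label: correct[label] / count for label, count in total.items()}
    return accuracy, recall
```

Per-category recall matters here because a classifier can look accurate overall while missing most of a rare category.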

One idea along those lines would be to look at historical data. That might not be available, of course. Twitter is pretty new. But at a minimum, you could start gathering data now on current prospects, which could then serve as historical data in the future. Maybe something like "percentage of Tweets about basketball" correlates well with how much a player improves after they enter the NBA. The only way to know is to do the grunt work of collecting, annotating, and (finally) analyzing that data.

Conclusions

Basketball

  • The vast majority of prospects have a rich history of tweets publicly available
  • Tweet content varies a lot across players and might deserve some more attention

Technical

  • Automated sentiment analysis might not be too useful
  • Manual content analysis is better, but more time-consuming

Overall

  • I kind of feel like a stalker

Boring Programming Details

I did almost all my analysis in Python because it's an incredibly elegant language and it makes me happy. The libraries I used are TextBlob (sentiment analysis), Beautiful Soup (screen scraping), Requests (downloading crap), Tweepy (easy Twitter API access), Flask (content analysis web UI), and sqlite3 (plenty fast for data storage and manipulation at this scale).

For this document, I used D3 to make the charts. I'm pretty new to D3, but somehow I managed to scrape something together.