Jeremy's Draft Prospect Analysis

jdscheff@gmail.com
???-???-????
dumbmatter.com

by Jeremy Scheff

October 12, 2014

The Big Picture

The most important things about a player are his basketball skills and his physical tools. But ignoring those areas, there still are some important factors that influence a player's success: work ethic, character, goals, etc. It's not clear if those things can be reliably predicted using just publicly available information accessed in a mostly automated manner, but it's worth a try. So that's the goal: come up with metrics for a player's character that can be generated through automated analysis of public data.

The challenge is, the data is messy. My current job involves working with clinical trial data, which comes from highly trained professionals in standardized formats - and still I complain that it's too messy. For this project, I certainly didn't expect to find any grand conclusions, and I didn't. But it was still fun and could lead to some interesting insights down the road.

Data

For the list of players to analyze, I used DraftExpress's list of top 100 prospects accessed on September 20, 2014.

I had a lot of ideas for different types of analyses that could be run. The main problem for all of them is the data: how do we get it and how do we make a computer understand it? Due to time constraints, I decided to focus on data from Twitter because the data comes in a structured format and there is an easy API that allows access to a relatively large amount of data.

First, with the help of a Python script to provide me with some hints, I collected the Twitter usernames of the top 100 prospects. Almost all of the very top prospects had public Twitter accounts (Emmanuel Mudiay is the exception, as he seems to have recently deleted his). In total, 78/100 prospects had public Twitter accounts that I could find. Twitter's API makes it difficult to retrieve more than ~3000 Tweets per user, so I decided to just download the ~3000 most recent Tweets for the 78 prospects on Twitter. Most users had around 3000 tweets but there are some who only lightly use Twitter. In total, I constructed a database of 164,546 Tweets.
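
The download step can be sketched roughly as follows. This is a minimal illustration, not the script actually used: `fetch_page` is a hypothetical callable that returns one page of Tweets older than `max_id` (with Tweepy it could wrap `API.user_timeline`), and the `tweets` table layout is an assumption.

```python
import sqlite3
from typing import Callable, Dict, List, Optional

# Assumed callable: fetch_page(screen_name, max_id) -> list of {"id", "text"}
# dicts, newest first. With Tweepy this could wrap API.user_timeline.
FetchPage = Callable[[str, Optional[int]], List[Dict]]


def download_timeline(screen_name: str, fetch_page: FetchPage, limit: int = 3000) -> List[Dict]:
    """Page backwards through a timeline until `limit` Tweets or no more results."""
    tweets: List[Dict] = []
    max_id: Optional[int] = None
    while len(tweets) < limit:
        page = fetch_page(screen_name, max_id)
        if not page:
            break
        tweets.extend(page)
        # Ask next for Tweets strictly older than the oldest one seen so far
        max_id = min(t["id"] for t in page) - 1
    return tweets[:limit]


def save_tweets(conn: sqlite3.Connection, screen_name: str, tweets: List[Dict]) -> None:
    """Store Tweets in a hypothetical `tweets` table, skipping duplicates by id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, screen_name TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id, screen_name, text) VALUES (?, ?, ?)",
        [(t["id"], screen_name, t["text"]) for t in tweets],
    )
    conn.commit()
```

Passing the fetcher in as a parameter keeps the paging logic separate from API credentials and rate limits, so it can be tried against canned data first.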

#	Name	Twitter
1	Jahlil Okafor	@BigJah15
2	Emmanuel Mudiay	None
3	Karl Towns	@KATis32
4	Kelly Oubre	@K_Ctmd22
5	Cliff Alexander	@HumbleKid_2
6	Mario Hezonja	@MarioHezonja
7	Kristaps Porzingis	@kporzee
8	Justise Winslow	@Chief_Justise
9	Stanley Johnson	@StanMan_5
10	Willie Cauley-Stein	@THEwillieCS15
11	Sam Dekker	@samdek1
12	Montrezl Harrell	@MONSTATREZZ
13	Chris Walker	@kingsky23
14	Tyus Jones	@Tyusjones06
15	Rondae Hollis-Jefferson	@RondaeHJ23
16	Marc Garcia	None
17	Myles Turner	@Original_Turner
18	Caris LeVert	@CarisLeVert
19	Ilimane Diop	@eliimane
20	Bobby Portis	@BPortistime
21	Dakari Johnson	@SafariDakari44
22	Egemen Guven	@egemengven1
23	Jarell Martin	@JarellMartin22
24	Frank Kaminsky	@FSKPart3
25	Rashad Vaughn	@ShowtimeMr1
26	Amida Brimah	@amidabrimah
27	Wayne Selden	@WayneSeldenJr
28	Andrew Harrison	@DrewRoc5
29	Norman Powell	None
30	Marcus Lee	@SuperKingMe
31	Brice Johnson	@bjohnson_23
32	R.J. Hunter	@RJH_22
33	Aaron Harrison	@AaronICE2
34	Jabari Bird	None
35	Delon Wright	None
36	Mouhammadou Jaiteh	@mamjaiteh14
37	Theo Pinson	@tpinsonn
38	Kevon Looney	@Loon_Rebel5
39	Devin Booker	@DevinBook
40	Justin Jackson	@JJacks_44
41	Trey Lyles	@TreyMambaLyles
42	Devin Robinson	None
43	Alex Poythress	@AlexTheGreat22
44	Branden Dawson	@219MadeMe
45	Marcus Paige	@marcuspaige5
46	Domantas Sabonis	@Dsabonis11
47	Damian Jones	None
48	Shawn Long	None
49	E.C. Matthews	None
50	Jordan Mickey	@Jmickey_02
51	Nigel Williams-Goss	@nigelpah
52	Moses Kingsley	@KingMoses_
53	Michael Qualls	@Mr_WALKONAIR
54	Terry Rozier	@GodsGift_3
55	Sindarius Thornwell	@Sin_City_803
56	Marcus Foster	@MF2_KSU
57	Juwan Staten	@JuwanStaten3
58	Andzejs Pasecniks	@AnzejsP
59	Timothe Luwanu	None
60	Isaiah Taylor	@Zay_Ctmd11
61	Austin Nichols	@a_nichols33
62	Terran Petteway	None
63	Guillermo Hernangomez	None
64	Nikola Milutinov	@NMilutinov
65	A.J. Hammons	None
66	Kenan Sipahi	@KenanSpahi
67	Michael Frazier	@mfrazier20
68	Boris Dallo	@BorisDalloidol
69	Joel James	None
70	Ron Baker	@RDB_sh31ox
71	Buddy Hield	@buddyhield
72	Georges Niang	@GeorgesNiang20
73	Yogi Ferrell	@YogiFerrell
74	Kennedy Meeks	@Ksmoove03
75	Nedim Buza	None
76	Rysheed Jordan	@RysheedJordan
77	Moussa Diagne	None
78	Kaleb Tarczewski	@Tarczewski35
79	Aleksandar Vezenkov	None
80	Cedi Osman	@cedi_osman
81	Briante Weber	@VCU_Bandit2
82	Fred VanVleet	@CooFredVanVleet
83	Rasmus Larsen	@rasmusglarbjerg
84	Guillem Vives	@Vives16
85	LeBryan Nash	@LotBuckets02
86	Brandon Ashley	@_Bash21
87	Perry Ellis	@PElliz
88	Jerian Grant	@ThatGrant22
89	Joseph Young	@JoeyBuckets3
90	Cameron Ridley	@cam_ctmd55
91	Alan Williams	@alantwilliams
92	Adin Vrabac	None
93	Winston Shepard	@WinnShepard_35
94	Mike Tobey	@miketobey10
95	Josh Scott	None
96	Dejan Todorovic	@ToDeki94
97	Dorian Finney-Smith	@doedoe_10
98	Zak Irvin	@zirvin21
99	Karlo Zganec	None
100	Emircan Kosut	None

Top 100 Prospects

Players without Twitter accounts not shown.

Fun Tidbit #1

Karl Towns (@KATis32) retweeted this in 2012:

Piscataway looks awful . I feel like I'm in a movie . It's sad and cool at the same time

I guess that wasn't a good sign for my alma mater in the recruiting game. But his other Tweets made me realize that he's actually from Piscataway, and his other Piscataway-related Tweets are positive. Yet another reminder that "RT" doesn't mean "I agree that...", which is one of many factors complicating analysis of Twitter data.

Common Words, Hashtags, and Usernames

Top basketball prospects form a pretty highly connected community. They very frequently Tweet to each other. There is probably some network analysis to be done here, but my eyes have glazed over at far too many protein-protein interaction networks for me to want to do that in my spare time. Instead, here is a quick summary of some common words, hashtags, and Twitter usernames Tweeted by the top 100 prospects. And no, I didn't know what all of those hashtags meant without looking them up :)

Most Common Words

thanks
tonight
birthday
people
congrats
really
better
morning
school
follow

Well that is boring...

Most Common Hashtags

#blessed
#tbt
#1
#wps
#riseandgrind
#beardown
#oomf
#unc
#respect
#chap

No #whitegirlwednesday? Marshall Henderson would have made this more interesting. I hope he's doing okay these days.

Most Common Usernames

@tyusjones06
@bigjah22
@showtimemr
@aaronice2
@k_ctmd22
@drewroc5
@cwalkertime23
@tpinsonn
@dslowmotion22
@a_nichols33

The most common username not in the top 100 prospects list belongs to Gary Harris, who was recently drafted.
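
Tallies like the ones above can be sketched with a few regular expressions and `collections.Counter`. The tokenization rules here are assumptions, not the ones actually used; in particular, the six-letter minimum for words is a guess based on the fact that every word in the list above has at least six letters.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
WORD = re.compile(r"[a-z']+")
# Hashtags, mentions, and URLs are removed before counting plain words
NOT_A_WORD = re.compile(r"[#@]\w+|https?://\S+")


def top_tokens(tweets, n=10, min_word_length=6):
    """Return the n most common (words, hashtags, mentions) across Tweet texts."""
    words, hashtags, mentions = Counter(), Counter(), Counter()
    for text in tweets:
        lowered = text.lower()
        hashtags.update(HASHTAG.findall(lowered))
        mentions.update(MENTION.findall(lowered))
        stripped = NOT_A_WORD.sub(" ", lowered)
        # min_word_length is an assumed stand-in for a stop-word filter
        words.update(w for w in WORD.findall(stripped) if len(w) >= min_word_length)
    return words.most_common(n), hashtags.most_common(n), mentions.most_common(n)
```

Lowercasing everything first is why the usernames come out in lowercase rather than in their original capitalization.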

Automated Sentiment Analysis

Sentiment analysis. It's a thing. I've heard about it. That was the extent of my knowledge going into this project, and I decided to change that.

Sentiment analysis means determining how positive or negative statements are; the results are then typically aggregated to provide some insight (e.g. what percentage of people feel positively about Pepsi?). It can be done manually or automatically. Obviously "manual" sounds old and boring, so I decided to try automatic sentiment analysis. My goal was to find out whether certain players can be deemed more "positive" than others based solely on the contents of their Tweets.

To do automated sentiment analysis, I used TextBlob, a very user-friendly Python library. Basically I just pump in the data and it gives me two scores for each tweet:

Polarity: -1 to 1, negative to positive
Subjectivity: 0 to 1, objective to subjective

I ran every Tweet through there and averaged across players, in the hope that some signal would be revealed when the data was aggregated.
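
In rough outline, the scoring and averaging might look like the sketch below. `TextBlob(text).sentiment` is the library's real interface, but the `(player, text)` input shape and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def textblob_score(text):
    """Return (polarity, subjectivity) for one Tweet using TextBlob."""
    # Imported lazily so the averaging logic can be tried without TextBlob installed
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def average_sentiment(tweets, score=textblob_score):
    """tweets: iterable of (player, text). Returns {player: (mean polarity, mean subjectivity)}."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # polarity sum, subjectivity sum, count
    for player, text in tweets:
        polarity, subjectivity = score(text)
        entry = totals[player]
        entry[0] += polarity
        entry[1] += subjectivity
        entry[2] += 1
    return {p: (t[0] / t[2], t[1] / t[2]) for p, t in totals.items()}
```

Each player ends up as a single (polarity, subjectivity) point, which is the kind of data a per-player scatter plot needs.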

[Scatter plot of each player's average polarity and subjectivity]

I guess it's not surprising that averaging together the sentiments of a ton of tweets tends to create one big cluster in the middle. The biggest outlier is Guillem Vives, who Tweets in Spanish, so that is probably just noise (same for Ilimane Diop). Dorian Finney-Smith is a bit more positive and subjective than normal. Cliff Alexander is a bit more negative and objective than normal. The rest seems to be a blob.

But is there value in that blob, if we zoom in? I'm not so sure. The most negative Tweets include things like, "LaMarcus Aldridge is too nasty smh!", where "nasty" is presumably praise that the classifier reads as negative. But in general, the classifier seems to get more right than wrong. So maybe this could be used as a metric of... something... but I would need to do some more research into studies on the value of this type of analysis. What I did read was mostly people saying that automated analysis is not accurate enough to be particularly useful.

Fun Tidbit #2

Did this actually work?

A few years back, after a couple rather unfortunate incidents, Grant Hill and Jared Dudley starred in a commercial attempting to decrease the usage of anti-gay slurs as slang. What does our Tweet database say about that?

I was surprised to see that only 18 out of 164,546 Tweets contained the word "fag" or "faggot". Is that because the culture has changed and it's no longer seen as acceptable to use those words? Or is it that even mid-level 17-year-old NBA prospects know better than to say the wrong words in public? You need more than an SQL query to answer that question.

Manual Content Analysis

If automated sentiment analysis isn't good enough, what are the alternatives? Manual annotation is one. In addition to being more accurate, a critical advantage of manual sentiment analysis is that you can actually expand it to things besides just sentiment. You can categorize text into arbitrary categories. This broader analysis is called "content analysis".

Content analysis has been applied to athletes' Tweets previously [1], although the results don't seem particularly insightful. Tweets were divided into categories like "Tweets about their own team", "interactions with fans", "marketing/promotion", etc. Through this, some information can be gained about the priorities, motivations, and interests of athletes.

I decided to do something similar. After some thought, I narrowed my analysis down to four categories:

Basketball: Anything remotely basketball-related
Girls: Tweets to or about women
Other: Everything else
N/A: For completely unintelligible Tweets, such as foreign language Tweets

The problem with manual content analysis is that it's a ton of work to manually categorize 164,546 Tweets. Amazon Mechanical Turk could make short work of it, but I don't really want to spend money on this project. So I took a three-pronged approach to the annotation problem:

1. I bulk-categorized several thousand Tweets if they had easy-to-identify categories (such as Tweets with a hashtag about the McDonald's All American game).
2. I built a web interface I could use to easily categorize Tweets.
3. Since I am semi-famous in a very small basketball-related part of the Internet, I recruited some of my acolytes to help categorize Tweets.
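
The bulk and manual steps could be sketched against sqlite3 roughly as follows. The table layout, the example hashtag, and especially the `method` column are assumptions for illustration; recording how each label was assigned is simply one way to keep the randomly sampled manual labels separable from the bulk ones.

```python
import sqlite3


def bulk_categorize(conn, hashtag, category):
    """Label every still-uncategorized Tweet containing `hashtag`; return rows changed."""
    cur = conn.execute(
        "UPDATE tweets SET category = ?, method = 'bulk' "
        "WHERE category IS NULL AND text LIKE ?",
        (category, "%" + hashtag + "%"),  # LIKE is case-insensitive for ASCII in SQLite
    )
    conn.commit()
    return cur.rowcount


def next_uncategorized(conn):
    """Pick one uncategorized Tweet at random for manual annotation, or None."""
    return conn.execute(
        "SELECT id, text FROM tweets WHERE category IS NULL "
        "ORDER BY RANDOM() LIMIT 1"
    ).fetchone()


def categorize_manually(conn, tweet_id, category):
    """Record a label chosen by a human annotator."""
    conn.execute(
        "UPDATE tweets SET category = ?, method = 'manual' WHERE id = ?",
        (category, tweet_id),
    )
    conn.commit()
```

A web interface would then only need to call `next_uncategorized` to show a Tweet and `categorize_manually` when a button is pressed.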

Let's look at an example. Going to my web interface, you might see something like this:

RT @23the_King: Rewatching the Howard Pulley vs. BABC game, @Tyusjones06 is too smooth! #FavPG

That Tweet you would categorize as "Basketball".

But there is trouble. "23054 completed, 141492 remaining"?? Even though I sat through an entire long boring wedding categorizing Tweets on my phone? That is not very good. With more time/money/effort, more Tweets could be categorized. But for now, 23054 is actually a lot of data.

One big problem with the categorization: a significant number were done by method #1 above, automatic bulk categorization. That could bias the results. The manually annotated ones were all sampled randomly so you could argue that it is an unbiased sample, but I didn't track which ones were manually versus automatically annotated. Oh well. All I have to work with now is the pooled, biased categorizations.

There are many other problems too. Different players use Twitter differently. Tweet frequencies are different, so for some 3000 Tweets might be a month and for others it might be years. Some players use foreign languages. And if anything too controversial is said, Twitter accounts can be deleted (I wasn't even looking for this, but I couldn't help but notice that Jahlil Okafor, Emmanuel Mudiay, Dakari Johnson, and Chris Walker all deleted their accounts at some point - probably others did too).

But let's ignore all those drawbacks for now and look at the results of the content analysis. Unsurprisingly, basketball was a popular topic for Tweeting amongst the top 100 prospects.

Category	Count
Basketball	12563
Girls	2416
Other	7630
N/A	445

There was roughly a 5:1 ratio of Tweets about basketball to Tweets about girls (12563 to 2416).

However, players exhibited heterogeneity in their propensities to Tweet about different subjects. After filtering out the players with fewer than 100 categorized Tweets, I was left with 52 players with decent sample sizes, an average of 426 categorized Tweets/player. The players with the highest and lowest percentages of Tweets falling into each category are shown in the following tables and scatter plots.

Name	Basketball % (highest)
Justin Jackson	83.1%
Theo Pinson	80.5%
Stanley Johnson	77.0%
Dakari Johnson	76.7%
Karl Towns	74.1%

Name	Girls % (highest)
Montrezl Harrell	52.0%
Briante Weber	50.2%
Buddy Hield	37.3%
Terry Rozier	34.8%
Marcus Lee	30.8%

Name	Other % (highest)
Austin Nichols	59.3%
Amida Brimah	57.1%
Kennedy Meeks	54.4%
Zak Irvin	52.4%
Brice Johnson	52.3%

Name	Basketball % (lowest)
Briante Weber	21.7%
Montrezl Harrell	26.5%
Terry Rozier	27.6%
Rondae Hollis-Jefferson	30.2%
Amida Brimah	34.3%

Name	Girls % (lowest)
Justin Jackson	0.3%
Stanley Johnson	0.6%
Dakari Johnson	0.6%
Theo Pinson	1.1%
Tyus Jones	1.3%

Name	Other % (lowest)
Karl Towns	13.8%
Justin Jackson	15.7%
Theo Pinson	18.1%
Montrezl Harrell	21.2%
Aaron Harrison	21.9%

[Scatter plots of each player's Tweet category percentages]
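
Per-player percentages like these can be computed along the following lines. This is a sketch under assumptions: the `(player, category)` input shape is invented for illustration, and N/A Tweets are kept in the denominator, which is a guess based on the table rows for a given player summing to slightly under 100%.

```python
from collections import Counter, defaultdict

CATEGORIES = ("Basketball", "Girls", "Other", "N/A")


def category_percentages(labels, min_tweets=100):
    """labels: iterable of (player, category). Returns {player: {category: percent}}."""
    counts = defaultdict(Counter)
    for player, category in labels:
        counts[player][category] += 1

    result = {}
    for player, per_category in counts.items():
        total = sum(per_category.values())
        if total < min_tweets:
            continue  # drop players with too few categorized Tweets
        result[player] = {c: 100.0 * per_category[c] / total for c in CATEGORIES}
    return result
```

The same arithmetic on the pooled counts gives 12563 / 2416, or about 5.2 basketball Tweets for every Tweet about girls.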

Just from glancing at the tables, it seems that there are some big names near the top for Basketball %. Is there a correlation between prospect rank and Tweet content, maybe due to self-censoring or self-promotion? Actually, no. The highest correlation coefficient between prospect ranks and one of these percentages is 0.26 between rank and Other %, which is not very large. The correlation for rank and Basketball % is slightly negative.
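
For reference, a correlation coefficient of this kind can be computed as below. This sketch assumes the Pearson coefficient, which may not be the variant actually used, and the numbers in the usage lines are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


# Hypothetical values: prospect ranks and each player's Other %
ranks = [3, 9, 21, 37, 40]
other_pct = [20.0, 35.0, 25.0, 40.0, 30.0]
r = pearson(ranks, other_pct)
```

A value near 0 means rank tells you little about the percentage; values near 1 or -1 would indicate a strong linear relationship.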

Speaking of correlations, Andrew and Aaron Harrison have almost identical Tweet category percentages. Which I guess should be expected.

Especially in light of Fun Tidbit #2, I can't help but wonder if the NBA's Michael Sam could be found through techniques like this. I imagine his Girls % would be quite low. I'm not sure how that makes me feel.

If I Had More Time...

It is difficult to derive much insight from the type of analysis presented here because there is no validation. To build a predictive model, you need some type of "gold standard" for comparison, so you can confirm that your model actually works. For instance, in my current job I am trying to automatically detect errors in clinical trial data, so my gold standard is datasets before and after they were manually checked for errors by domain experts. If I can reproduce those results, then I have something. Rather than just gathering some data and jumping in with some generic analysis like I presented here, the real challenge is to define how we can be confident that any of these techniques are actually working and providing value. After that, I could imagine myriad sophisticated analytical techniques that could be applied. But a solid grounding in reality is the most important factor, and that is likely one of the largest challenges you face in data analysis.

One idea along those lines would be to look at historical data. That might not be available, of course. Twitter is pretty new. But at a minimum, you could start gathering data now on current prospects, which could then serve as historical data in the future. Maybe something like "percentage of Tweets about basketball" correlates well with how much a player improves after they enter the NBA. The only way to know is to do the grunt work of collecting, annotating, and (finally) analyzing that data.
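
Checking a method against a gold standard can start out very simply. The sketch below compares automated labels to trusted labels for the same items; the label names and function are hypothetical, and a real evaluation would need held-out data and more careful metrics.

```python
from collections import Counter


def evaluate(predicted, gold):
    """Return (accuracy, confusion) where confusion counts (gold, predicted) pairs."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    confusion = Counter(zip(gold, predicted))
    return correct / len(gold), confusion


# Hypothetical labels: an automated classifier vs. manual annotation
accuracy, confusion = evaluate(
    predicted=["pos", "neg", "pos", "neg"],
    gold=["pos", "pos", "pos", "neg"],
)
```

The confusion counts show which kinds of mistakes dominate, which is usually more informative than the single accuracy number.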

Conclusions

Basketball

The vast majority of prospects have a rich history of tweets publicly available
Tweet content varies a lot across players and might deserve some more attention

Technical

Automated sentiment analysis might not be too useful
Manual content analysis is better, but more time consuming

Overall

I kind of feel like a stalker

Boring Programming Details

I did almost all my analysis in Python because it's an incredibly elegant language and it makes me happy. The libraries I used are TextBlob (sentiment analysis), Beautiful Soup (screen scraping), Requests (downloading crap), Tweepy (easy Twitter API access), Flask (content analysis web UI), and sqlite3 (plenty fast for data storage and manipulation at this scale).

For this document, I used D3 to make the charts. I'm pretty new to D3, but somehow I managed to scrape something together.