Generating fake Woj and Shams tweets with AI

Introduction

When it comes to NBA news, Adrian Wojnarowski and Shams Charania dominate the field. Often, they get the first break on many of the top stories in the league. They report rumors, trades, player signings, and everything in between.

Woj and Shams tweets give us a concise history of recent moves in the NBA. Furthermore, many of their tweets follow the same format. There are only so many ways to report a signing (“player A is signing with team B on an X-year/$Y deal”). So, looking at the Twitter history of Woj and Shams, we can model fake NBA news.

Natural language generation

To model Woj and Shams tweets, we’ll take a huge data set of their historical tweets. We’ll use this corpus, or body of text (tweets in this case), to train a model that creates fake tweets on a given topic. We call this process natural language generation (NLG).

NLG is much more common than we may realize. Any form of machine that takes data and outputs natural language uses NLG. This includes everything from chatbots to phone text suggestions. Chatbots transform your answers into data, then determine the most fitting response. Your phone looks at the previous words you used and your writing tendencies. Combining this information, your phone tries to predict the most probable next word.

One transparent case of NLG is Reddit’s Subreddit Simulator (r/SubredditSimulator). The subreddit consists of bots, where each bot represents a subreddit. By looking at text from a subreddit, each bot generates titles that match text on that subreddit. So, an NBA Subreddit Simulator will create fake posts from r/NBA. On r/SubredditSimulator, the bots use Markov chains to do this.

Markov chains present the most intuitive form of NLG. To construct a Markov chain, we split our corpus into pairs of words that follow each other. So, if our corpus included “he went to”, we would split this up into two samples. The first sample would have “he” as the current word, and “went” as the next word. The second sample would have “went” as the current word, and “to” as the next word. Over a large dataset, this lets us create dependencies between words. By looking at what percent of the time the word “went” follows the word “he”, we can build a probabilistic model of text. For example, “he she” is not a common phrase. Our corpus likely contains no sample of “he she”, so we would never generate text with that phrase. But, “he went” might be a common phrase, meaning that we could generate text with “he went”. The model learns all these word dependencies. Then, we give the model a starting word. The model predicts the next word by drawing from the probability distribution of the word dependencies. Then, the model uses this predicted word to predict the next word, and so on (this is where the chain comes in).
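This pairing-and-sampling process fits in a few lines of Python. Here is a minimal bigram chain over a toy corpus; the corpus and the `generate` helper are illustrative, not the code r/SubredditSimulator actually runs:

```python
import random
from collections import defaultdict

# Map each word to the list of words observed immediately after it.
corpus = "he went to the store and he went home".split()

transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

def generate(start, length, seed=0):
    """Walk the chain: repeatedly sample a next word from the observed followers."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        followers = transitions.get(words[-1])
        if not followers:  # dead end: this word was never seen mid-sentence
            break
        words.append(rng.choice(followers))
    return " ".join(words)

print(generate("he", 5))
```

Because duplicates stay in each follower list, sampling from the list reproduces the observed frequencies: a word that follows “he” 80% of the time in the corpus is drawn 80% of the time here.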

More recently, users created a new Subreddit Simulator (r/SubredditSimulatorGPT2) that uses AI. Instead of using Markov chains, the bots here use OpenAI’s GPT-2 models. OpenAI’s model is among the most popular public NLG models. It’s trained on a huge corpus of internet text to predict the next word. The results are shockingly accurate (in both OpenAI’s tests and on the subreddit). Our approach to modeling Woj and Shams is closer to this than to the Markov chains.

This AI model has several advantages over a Markov chain. Consider the following example: “he went to the store and” is a reasonable phrase. Assume our Markov chain says the most probable word after “and” is “he”. Also assume that, starting from “he”, the chain again produces this same phrase as its most probable continuation. In this case, our model creates a loop of the same phrase. It starts with “he went to the store and.” Because “he” follows “and,” we now have “he went to the store and he.” But now, we’re back at the probable phrase about going to the store. So, our Markov chain might predict the following: “he went to the store and he went to the store and he went to the store…”

Machine learning-based NLG models fix this issue. By considering the entire previous string of words, they create more reasonable sentences. Instead of considering only the last word, they consider all words leading up to the one we’re predicting. (Note that in our model, some loops will still occur. We’ll see this happen often with the word “sources.” “Sources” starts or ends most Woj and Shams tweets, so it will sometimes create loops.) For example, suppose we have two phrases:

  1. The players that played yesterday
  2. The player that played yesterday

A reasonable next word for (1) is “are”, as “players” is plural. For (2), we might expect “is”, given the singular “player.” A simple Markov chain predicts the same word for (1) and (2), thereby getting the grammatical number wrong in one of the two cases. Meanwhile, an ML model looks back at the start of the sentence and sees that the verb must agree with the noun’s number.

Now that we have a baseline understanding of NLG, we can dive into our specific methods.

Methods

Note that we created two separate models: one for Woj tweets and one for Shams tweets. Both models have identical structure and methods. Only the data is different, meaning that the results turn out to be different.

First, we collected the past ~3200 tweets from both Woj and Shams. We collected the data on 2/6/2020 at 4:00 PM EST. This number of Woj tweets takes us to 4/26/2018. Meanwhile, this number of Shams tweets brings us back to 8/8/2015. (So Woj wins this battle as the more active tweeter.)

After collecting the data, we performed some basic cleaning. This includes removing any words containing links and special characters. Most importantly, this includes making all words lowercase. For example, “Sources” and “sources” are the same, but count as two different words if we don’t match their case.
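A minimal sketch of this cleaning step might look like the following. The exact rules (such as dropping any token containing “http”, and stripping everything except letters and digits) are illustrative guesses at the ones described, not the post’s actual code:

```python
import re

def clean_tweet(text):
    """Lowercase, drop link tokens, and strip special characters (a rough sketch)."""
    words = []
    for token in text.split():
        if "http" in token:  # drop any word containing a link
            continue
        token = re.sub(r"[^a-zA-Z0-9]", "", token)  # strip punctuation/special chars
        if token:
            words.append(token.lower())
    return " ".join(words)

print(clean_tweet("Sources: Kawhi Leonard is OUT tonight. https://t.co/abc"))
# → "sources kawhi leonard is out tonight"
```

After this pass, “Sources” and “sources” map to the same token, which is exactly the case-matching problem described above.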

We then converted each tweet of cleaned data into a sequence of numbers. However, each tweet isn’t just one sample in the data set. Because we’re predicting the next word, we can split each tweet into multiple samples, one for each word after the first.

This sounds complicated in theory. However, in practice it’s simple. Suppose we have the following tweet: “Kawhi Leonard is out.” Suppose for simplicity that we assign the words to numbers sequentially. So, Kawhi = 1, Leonard = 2, is = 3, and out = 4. From this 4-word tweet, we generate 3 samples. They are:

  1. Kawhi -> Leonard. So, 1 -> 2. (the words on the left of “->” are the given words, and the word on the right is the next word that we’re trying to predict)
  2. Kawhi Leonard -> is. [1, 2] -> 3.
  3. Kawhi Leonard is -> out. [1, 2, 3] -> 4.

We do this for each tweet. We try to predict word n + 1 following a sequence of n words (this is called an n-gram model).
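The sample construction above can be sketched in one short function, using the same hypothetical word-to-number mapping (Kawhi = 1, Leonard = 2, is = 3, out = 4):

```python
# Build training samples from one tokenized tweet: every prefix predicts the next word.
def make_samples(token_ids):
    """Return (input_sequence, target_word) pairs for one tweet."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# "Kawhi Leonard is out" -> Kawhi=1, Leonard=2, is=3, out=4
samples = make_samples([1, 2, 3, 4])
print(samples)  # [([1], 2), ([1, 2], 3), ([1, 2, 3], 4)]
```

A tweet of n words yields n − 1 samples, so the training set is much larger than the raw tweet count.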

In technical terms, our model uses a single-layer LSTM with word embeddings. An LSTM, or long short-term memory network, is a neural network that processes sequences of data. So, it lets us predict the next word given the entire sequence of previous words in the tweet.

Word embeddings allow us to represent our words in far fewer dimensions. Without word embeddings, our data consists of huge vectors. This is because each word is represented with a one-hot vector. In this vector, we have 0s everywhere except for a single 1. This 1 appears at the location of the word’s index. (So, if Kawhi = 132, Kawhi’s word vector would have 0s everywhere and a 1 at location 132.) We have to do this so that our predictions assign a probability to every word in our vocabulary. The problem with this process is that we have a lot of words, so our vectors are huge. Word embeddings transform our vectors into lower-dimensional vectors. They maintain similarity between words in the transformation. So, the word embedding for “player” will be similar to the word embedding for “players.” This makes our model much faster without sacrificing accuracy.
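A toy comparison makes the size difference concrete. The 5,000-word vocabulary here is a hypothetical placeholder, and the index 132 is the Kawhi example from above:

```python
# Compare per-word input sizes with and without embeddings (hypothetical vocabulary).
vocab_size = 5000
embedding_dim = 16

def one_hot(index, size):
    """A vector of 0s with a single 1 at the word's index."""
    vec = [0] * size
    vec[index] = 1
    return vec

kawhi = one_hot(132, vocab_size)  # if Kawhi = 132, as in the example above
print(len(kawhi))       # 5000 numbers per word without embeddings
print(embedding_dim)    # only 16 numbers per word with an embedding layer
```

The embedding layer learns those 16 numbers per word during training, which is where the “similar words get similar vectors” property comes from.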

For some more technical stuff, we used one LSTM layer with 256 units and a word embedding layer to transform our vectors into 16 dimensions. We trained the models using categorical cross-entropy loss with 100 epochs. Also, given that not every tweet is the same length, we padded our sequences (meaning we fill a bunch of 0s before the actual data so that all our tweets have the same length).
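The setup just described can be sketched in Keras. The framework choice is an assumption (the post doesn’t name one); the hyperparameters are the ones stated in the text, while the vocabulary size and maximum tweet length are hypothetical placeholders:

```python
# Sketch of the described architecture: embeddings -> one LSTM layer -> softmax.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import pad_sequences

vocab_size = 5000  # hypothetical vocabulary size
max_len = 30       # hypothetical longest tweet length, in words

model = Sequential([
    Embedding(vocab_size, 16),                # 16-dimensional word embeddings
    LSTM(256),                                # one LSTM layer with 256 units
    Dense(vocab_size, activation="softmax"),  # probability for every word
])
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Pad shorter sequences with leading 0s so every input has the same length.
X = pad_sequences([[1], [1, 2], [1, 2, 3]], maxlen=max_len)
print(X.shape)  # (3, 30)
# model.fit(X, y, epochs=100)  # y: one-hot next-word targets (not shown)
```

The softmax output layer is what produces a probability for every word in the vocabulary, matching the one-hot targets discussed above.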

Results

With the trained models, we created fake Woj and Shams tweets. We give the models a set of starting words. The models convert the words to the format we described before, then predict the next word. Then, they add the predicted word to the set of “given words” to predict the next word. They continue doing this for however long we want.

Note that this process creates nonsensical tweets as we predict more words, for two reasons. First, most tweets are not that long. A long sequence of words is unfamiliar to the model, meaning it won’t perform well. Second, several tweets have some type of circularity to them. Most tweets from Woj and Shams either start or end with “sources”. So, if we predict some string of words that ends with “sources”, it practically ends the sentence. But, because “sources” both starts and ends tweets, the model starts a new tweet. This new tweet is sensible, but completely unrelated to the previous one.

For this reason, even though we generate long sequences of words, we cut them off at natural points where the tweet should end.
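The generation loop described above can be sketched independently of the network. `predict_next_word` is a stand-in for the trained LSTM (here a toy lookup table, so the loop runs end to end); a real run would query the model and cut the output at a natural stopping point:

```python
# Sketch of the generation loop. The toy lookup stands in for the trained LSTM.
toy_model = {
    ("kawhi",): "leonard",
    ("kawhi", "leonard"): "is",
    ("kawhi", "leonard", "is"): "out",
}

def predict_next_word(words):
    return toy_model.get(tuple(words))

def generate_tweet(seed_words, max_words):
    words = list(seed_words)
    for _ in range(max_words):
        next_word = predict_next_word(words)
        if next_word is None:  # no confident prediction: stop here
            break
        words.append(next_word)  # the prediction becomes part of the next input
    return " ".join(words)

print(generate_tweet(["kawhi"], 10))  # → "kawhi leonard is out"
```

The key step is appending each prediction to the input before predicting again, which is exactly how the models extend a seed phrase into a full tweet.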

Also note that several of these tweets involve real news, and in some cases, real tweets. Suppose, for example, that we start a tweet with “Steph Curry.” All Curry-related news in the past several months relates to his broken hand. So, given “Steph Curry”, the model fills in the rest of a real tweet saying he broke his hand.

We created tweets for players and teams. First, we gave the models seed text (i.e. the initial text we feed the models) of some popular players. Our seed text for player tweets differed between the two models. For Woj, we used “player_name.” For Shams, we used “Sources: player_name.” Team tweets use the same format; Woj uses only “team_name”, while Shams uses “Sources: team_name.” This format gave us the best respective results.

Player tweets

The following is a list of some fake Woj tweets for a set of top players. Note that dollar signs, decimal points, colons, commas, and periods are not present in the true predicted words. However, in the most obvious cases, I added them for readability. Also note that in the true predictions, all words are lowercase (discussed earlier). We could fix this in the code by making sentence starters and names uppercase. Instead of that, we’ll just write things in the right case.

  • Giannis Antetokounmpo Bryant is finalizing a deal to send guard Matthew Dellavedova to Cleveland league sources tell ESPN
  • LeBron James has agreed to 4 year $154M deal with Lakers Klutch Sports says
  • James Harden release its national TV schedule for the first week of the regular season, Christmas, and MLK day on Wednesday at 2 PM
  • Kawhi Leonard is expected to introduce brand as player in deal source tells ESPN
  • Kevin Durant is planning to sign with the Brooklyn Nets league sources tell ESPN
  • Stephen Curry is focused on UNC’s Nassir Little with no. 25 league source tells ESPN
  • Ben Simmons has been searching for an impact point guard and UNC’s Coby White is available on the board at no. 7 league sources tell ESPN
  • Kemba Walker plans to be in Boston on Sunday to finalize a formal agreement with the Celtics league sources tell ESPN
  • Anthony Davis is waiving his $4M trade kicker league sources tell ESPN
  • Russell Westbrook is trading for Markelle Fultz league source tells ESPN
  • Damian Lillard Bryant is finalizing a deal to send guard Matthew Dellavedova to Cleveland league sources tell ESPN
  • Kyrie Irving is likely to make his return to the Nets lineup on Sunday against Atlanta league sources tell ESPN
  • Luka Doncic Bryant is finalizing a deal to send guard Matthew Dellavedova to Cleveland league sources tell ESPN
  • Zion Williamson has surgery today to repair torn meniscus in right knee and is expected to miss six to eight weeks Pels say
  • Trae Young guard JR Smith clears waivers the Los Angeles Lakers are an unlikely destination league sources tell ESPN

The following are some Shams tweets for a set of top players.

  • Sources: Giannis Antetokounmpo is signing a four year $142M maximum contract with the Miami Heat on a 4 year deal sources tell RealGM sources tell The Vertical
  • Sources: LeBron James has cleared waivers setting up the 10 time All Star to sign free agent deal with the Houston Rockets source Tells Yahoo
  • Sources: James Harden has agreed to a contract buyout with the Hawks clearing way for him to sign with the 76ers as a free agent
  • Sources: Kawhi Leonard is out tonight for the Golden State Warriors
  • Sources: Kevin Durant calf has agreed to a one year $2.4M Deal to return to the Philadelphia 76ers league sources tell The Vertical
  • Sources: Stephen Curry has suffered a broken left hand Warriors say. Brutal loss for Golden State Warriors
  • Sources: Ben Simmons projected top pick in 2016 NBA draft is nearing agreement on a significant multiyear deal with Nike
  • Sources: Kemba Walker plans to sign one year $786K veteran’s minimum deal with the Toronto Raptors
  • Sources: Anthony Davis and Utah star Donovan Mitchell’s left pinky toe returned negative
  • Sources: Russell Westbrook create sharpshooter Davis has been diagnosed with dislocation
  • Sources: Damian Lillard compensation to New York in Kristaps Porzingis deal: two future first round picks league sources tell The Vertical
  • Sources: Kyrie Irving has agreed to a four year $142M deal with Brooklyn
  • Sources: Luka Doncic has agreed to a three year $42M deal to return to the Clippers league sources tell The Vertical
  • Sources: Zion Williamson has left selected to the no. 58 pick in the NBA draft
  • Sources: Trae Young who will work out for the Lakers on Tuesday has a workout scheduled with the Golden State Warriors

Analysis of player results

The Woj tweets are much worse than the Shams tweets. Some of them are quite odd. Also, we see that Giannis, Doncic, and Lillard all have the exact same tweets. This is a result of the embeddings layer. Woj’s tweets only go back to 2018, meaning that there’s likely not much coverage of these players. So, the embeddings transform them into words that are similar to each other. Apparently, these players were also linked to Kobe, so they all have the same link. We also see some of Woj’s famous tweets from the 2018 draft. Woj promised not to reveal a team’s pick before it’s made. So, he pulled out his thesaurus to tell fans the picks. This included some tweets like “Boston is tantalized by Robert Williams” and “The Lakers are unlikely to resist Mo Wagner.”

In contrast to Woj’s odd tweets, almost all the Shams tweets seem reasonable. Some of them are in fact real or nearly identical to real tweets (Curry’s broken hand as discussed earlier and Simmons signing with Nike). Almost all the others are realistic tweets; they make sense and describe an actual NBA occurrence. Only Anthony Davis and Russell Westbrook have particularly odd tweets.

One theory for this difference may be that Shams has a more focused set of tweets. Perhaps he only tweets big signings and trades, which explains why 3200 Shams tweets take us back to 2015, while the same number of Woj tweets only take us to 2018. (Shams being a newer reporter explains part of this, but in general Shams simply tweets less often than Woj.)

Team tweets

The following is a set of interesting team tweets from the Woj model:

  • The Brooklyn Nets are discussing several trade possibilities with the 29th pick league sources tell ESPN
  • The Charlotte Hornets are hiring former Suns coach Igor Kokoskov as an assistant coach league source tells ESPN
  • The Chicago Bulls coach Nate McMillan is finalizing a contract extension league sources tell ESPN
  • The Denver Nuggets extension with coach Michael Malone takes him through the 2022 2023 season league source tells ESPN
  • The Golden State Warriors F Jaylen Brown has agreed to a four year $115M million contract extension agent Jason Glushon tells ESPN
  • The Philadelphia 76ers have agreed to trade Carmelo Anthony and cash to the Chicago Bulls league sources tell ESPN
  • The San Antonio Spurs have agreed to trade guard Anthony Davis league sources tell ESPN

The following is a set of interesting team tweets from the Shams model:

  • Sources: the Boston Celtics have waived guard Brandon Jennings league sources tell Yahoo Sports
  • Sources: the Denver Nuggets have selected Michael Porter Jr. with the no. 14 pick in the NBA draft
  • Sources: the Los Angeles Clippers have extended a qualifying offer to center Ivica Zubac making him a restricted free agent
  • Sources: the Los Angeles Lakers and forward Luol Deng are finalizing a contract buyout as part of waive and stretch provision
  • Sources: the Milwaukee Bucks have selected Donte Divincenzo with the no. 17 pick in the NBA draft league sources tell The Vertical
  • Sources: the New York Knicks and guard Trey Burke of G League affiliate Westchester have agreed on a deal for the remainder of the season league sources tell the Vertical
  • Sources: the Washington Wizards have claimed former Lakers forward Thomas Bryant off waivers league sources tell The Vertical

Analysis of team results

Almost all the Shams team-related tweets are real. Several factors may contribute to this. For example, it’s possible that Shams starts his tweets by discussing the player involved instead of the team. With only a small sample of tweets that start with team names, the model may simply reproduce the real tweets it has seen.

Full interactive results

The lists above include some selections of results we found interesting. However, the models predict tweets for any starting phrase. To create fake tweets from any seed text of any length, go to the link below:

http://dribbleanalytics.shinyapps.io/woj-shams-twitter-ai

You can input any seed text and desired length, and the models will generate fake Woj and Shams tweets. Take note of the predicted text when there is no seed text. When you give the models some completely unfamiliar seed text, they will likely return that seed text plus the default text they produce when they have no input.

Conclusion

With a simple LSTM, we can generate fake Woj and Shams tweets. Though several tweets are nonsensical, no model is perfect. One way to improve performance might be to create a character-based LSTM. Instead of predicting the next word, a character-based LSTM predicts the next character. This gives us a huge sequence of data to work with and lets us generate more nuanced texts. But, these models are very computationally heavy; I was unable to train them on my computer.

One other way to improve model performance may be to limit our subset of tweets we use to train the models. Several Shams and Woj tweets involve player names but discuss minor things. So, for example, if we only cared about signings and trades, we would only consider tweets discussing those topics. However, this might lead to overfitting, and there would be no objective way to do this subsetting. As such, our current structure seems to be the best fit for the problem.
