
TED Talk Generator

In this article, I will walk you through the procedure to generate your own TED Talk. We will first create a dataset of TED Talk transcripts and then train a character-level LSTM, which we will use to generate our own talk. For a primer on character-level language models, LSTMs, and the inspiration for this article, please check out Andrej Karpathy's amazing blog post, The Unreasonable Effectiveness of Recurrent Neural Networks.

(Prerequisites: Python 2.7, PyTorch, CUDA 9.0, Ubuntu, pandas) 

The data and scripts are in the following GitHub repository: 
https://github.com/samlanka/TED-Talks-Generator

Download all TED Talk Transcripts

All TED Talks to date are regularly cataloged in this file.
I downloaded the file and renamed it talks.csv. The first column of the file contains the public URL for each talk.
Using pandas, I appended '/transcript.json?language=en' to every URL in the first column to point at the JSON object of each transcript, and saved these URLs to a new CSV file - talks_url.csv.
The code is in the file json_url.py.
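Here is a minimal sketch of what that step boils down to (the column handling is an assumption on my part; adjust it to whatever headers talks.csv actually has):

```python
# A minimal sketch of the json_url.py step (the exact column handling is an
# assumption; adapt it to the real headers in talks.csv).
import pandas as pd

talks = pd.read_csv('talks.csv')
url_col = talks.columns[0]  # the first column holds each talk's public URL

# Point every URL at the JSON object of its transcript.
talks_url = talks[[url_col]].copy()
talks_url[url_col] = talks_url[url_col].str.rstrip('/') + '/transcript.json?language=en'

talks_url.to_csv('talks_url.csv', index=False)
```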

Using wget, I looped through all the links in talks_url.csv to scrape the TED website and download the JSON object transcripts to a folder TED_json/. (This was a lot faster than using BeautifulSoup, which quickly ran into rate limits.) The bash command is available in script_get.sh.
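For reference, here is a rough Python stand-in for that loop (the real bash command lives in script_get.sh; this assumes talks_url.csv has a header row and one URL per line):

```python
# A rough Python stand-in for the wget loop in script_get.sh; assumes
# talks_url.csv has a header row followed by one transcript URL per line.
import csv
import subprocess

with open('talks_url.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # -P saves each JSON object into TED_json/ under its URL-derived file name
        subprocess.call(['wget', '-P', 'TED_json/', row[0]])
```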

Next, I extracted the transcript text from the JSON objects. I renamed the first file in TED_json/ from '...language=en' to '...language=en.0' to match the format of the remaining files in the folder and to include it in the loop sequence of the next script. The actual transcript text is stored under a nested dictionary key 'text'; using pandas, I parsed each JSON file, extracted the transcript text, and stored it as a text file in a new folder TED_transcripts/ to build the final dataset. The code is available in json2txt.py.
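Here is a sketch of the idea behind json2txt.py (not the exact code; the precise nesting of the transcript JSON is glossed over by walking the whole object and collecting every value stored under a 'text' key):

```python
# A sketch of the json2txt.py step. The exact nesting of the transcript JSON is
# an assumption, so this walks the whole object and collects every 'text' value.
import io
import json
import os

def collect_text(node, pieces):
    """Recursively gather every value stored under a 'text' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == 'text':
                pieces.append(value)
            else:
                collect_text(value, pieces)
    elif isinstance(node, list):
        for item in node:
            collect_text(item, pieces)

if not os.path.isdir('TED_transcripts/'):
    os.makedirs('TED_transcripts/')

for i, name in enumerate(sorted(os.listdir('TED_json/'))):
    with open(os.path.join('TED_json/', name)) as f:
        transcript = json.load(f)
    pieces = []
    collect_text(transcript, pieces)
    with io.open(os.path.join('TED_transcripts/', 'talk_%d.txt' % i), 'w', encoding='utf-8') as out:
        out.write(u' '.join(pieces))
```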

Summary:

* json_url.py: talks.csv -> talks_url.csv
* script_get.sh: talks_url.csv -> TED_json/
* json2txt.py: TED_json/ -> TED_transcripts/

Train a character-level RNN 

The collected transcript text files in TED_transcripts/ serve as the training dataset. At the time I downloaded the transcripts, the dataset was a whopping 24.8 MB across 2557 text files (this will grow as more talks are added to the catalog file).
I collated all the text files into one giant string and used a 0.98-0.02 train-validation split. I used PyTorch to define a 2-layer LSTM with 128 hidden units (in hindsight, I probably could have used more hidden units).
My batch size was 64, and each training sample was a sequence of 128 characters in one-hot representation. The corresponding target was the same sequence shifted forward by one character, i.e., the next letter at each position.
During training and validation, I generated the batches by randomly sampling sequence start positions, which is equivalent to shuffling the training and validation sets before each epoch.
The loss function was cross-entropy and the network parameters were optimized with Adam.
The whole network was trained over many hours until I got tired at the 150th epoch, and the adventure concluded with a final validation loss of 1.26.
For the curious: each epoch consisted of ~2900 iterations.
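
To make the setup concrete, here is a minimal sketch of the model and batch sampling described above (not the exact code in trainTED.py; it assumes a reasonably recent PyTorch and that the transcripts have already been collated into data, a 1-D LongTensor of character indices):

```python
# A minimal sketch of the training setup: 2-layer LSTM, 128 hidden units,
# one-hot inputs, randomly sampled 128-character sequences, batch size 64.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super(CharLSTM, self).__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len, vocab_size) one-hot characters
        out, hidden = self.lstm(x, hidden)
        return self.fc(out), hidden  # logits over the next character at each step

def random_batch(data, vocab_size, batch_size=64, seq_len=128):
    """Sample random start positions; targets are the inputs shifted by one character."""
    starts = torch.randint(0, len(data) - seq_len - 1, (batch_size,))
    chunks = torch.stack([data[int(s):int(s) + seq_len + 1] for s in starts])
    inputs = F.one_hot(chunks[:, :-1], vocab_size).float()
    targets = chunks[:, 1:]
    return inputs, targets

# One training step, roughly:
# model = CharLSTM(vocab_size).cuda()
# optimizer = torch.optim.Adam(model.parameters())
# criterion = nn.CrossEntropyLoss()
# inputs, targets = random_batch(train_data, vocab_size)
# logits, _ = model(inputs.cuda())
# loss = criterion(logits.reshape(-1, vocab_size), targets.cuda().reshape(-1))
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```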

The compressed data folder is TED_transcripts.zip, and the training code is available in trainTED.py.

Deliver a TED Talk

I saved the final model as winner.pth.
The IPython notebook speech.ipynb contains the code to generate the talk. I set a speech length of 1000 characters. The input prompt to the network is "The next big invention", and the network generates the rest of the speech letter by letter.
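Here is a sketch of what such a generation loop looks like (char2idx and idx2char are placeholder names for the character-index mappings, CharLSTM refers to the sketch above rather than the exact model class in the repo, and winner.pth is assumed to hold the state_dict):

```python
# A sketch of the generation loop: warm up on the prompt, then sample the next
# character from the temperature-scaled softmax, one letter at a time.
import torch
import torch.nn.functional as F

def generate(model, prompt, char2idx, idx2char, length=1000, temperature=0.55):
    model.eval()
    vocab_size = len(char2idx)
    hidden = None
    text = prompt

    with torch.no_grad():
        # Warm up the hidden state on the prompt, one character at a time.
        for ch in prompt:
            x = F.one_hot(torch.tensor([[char2idx[ch]]]), vocab_size).float()
            logits, hidden = model(x, hidden)

        for _ in range(length):
            # Higher temperature flattens the distribution and makes the output more random.
            probs = F.softmax(logits[0, -1] / temperature, dim=0)
            next_idx = torch.multinomial(probs, 1).item()
            text += idx2char[next_idx]
            x = F.one_hot(torch.tensor([[next_idx]]), vocab_size).float()
            logits, hidden = model(x, hidden)
    return text

# model = CharLSTM(vocab_size)
# model.load_state_dict(torch.load('winner.pth'))
# print(generate(model, "The next big invention", char2idx, idx2char, temperature=0.75))
```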

Using the temperature parameter, we can modulate the output. Setting a high temperature leads to more randomness in the generated text - which contributes to greater variance in the words, at times to a nonsensical extent. Here is a sample snippet generated with a temperature of 0.75: 

The next big inventionmoney and controlled seff that used child here. One of our to the outside, right, olper does we eat this was the capabilitical closing scriptions that engineered with these point in the paint — the I asked me all the same of the other passage. I bel no ling years in the world snow. Make you some people that I was returned alternative in the lady building station is white to be favors get sort of enough to something that is content. And these clinic to do some of the thing from the one house to paper — you know, the land by design electricity tested to Mirmany and a started the same of the same day in the mind happinarchine industry of those difference, the was it t became our patted here. So it’s not a big airing, way there. I’ll learn a chemical isolate to keep the Story. And I had never standing out the moon fs that in language, and then made this country of the people to the forces and forms that you’re explained that it ratfer out that will put it higher pandemic if that it’s a

Yeah, a lot of the words are imaginary (happinarchine, capabilitical), and there is almost no cohesive train of thought.

Let's compare this with a low temperature of 0.2. Here, the network will generate more probable text:
 

The next big inventionof the start to the fact that the story of the most important that the story of the other one way to be able to see if you don’t have the other second of the world of the particular people who have to do that in the way that we can see that the second problem of the point of the way that we can see the same thing that we can see that the story of the way that the part of the best thing that we have to do it in the planet to the other second that the start the left in the world that is the thing that we can do this and the car some of the reasons that we can see it is a lot of the car in the most important that the most important that the problems that we have to do that. And the way that the same thing that we have the problems of the world is that the thing that we can see that the rest of the part of the most interesting that we can do it and the same of the most interesting that we can see the only people who say that the answer is the same time that we have to do is a few of the sa

Here we see that the network just repeats itself: no imagined words, but it still doesn't make sense. And the sentences are almost five lines long!

Now with a moderate temperature of 0.55, we get something like this:

The next big inventionof the plastic of the result, and there are a short times. And the digital responsibility the same time. (Laughter) One of the world may be the long time that the crisis, the first month that since that I started simple interesting sticker to powerful that we have to solve in this thing. It was all happening, that we’re seeing a lot more part of our studio is what the first time to hearing the life and it personal process of the carbon business that we don’t know it’s the beginning really an artishment in the planet. The international construction is something that the part of our lower and get this small change of the connecting in the world’s going to go to you. And I was a different than around the economic thing can do is so what was the game stone problems of my second political hour in this area of a fact that is the health care the context is going to the tendency, a lot of the world when I was a little bit of day and it’s a progress in the communication. And the collective star

So, a speech about plastic, the planet and... health care? It even experiments with humor in the second line. We see greater variation in sentences and subjects, and fewer mistakes in language.

Of course, this model is not perfect, and perhaps with some hyperparameter tuning and a longer training time it would generate a more engaging speech. I'm excited to see what other language modelling advances lie in the future. Thank you for reading!

Thank you to TED Conferences for spreading great ideas.