Save and Load your RNN model

In this blog, we have tried different kinds of machine learning projects so far. Our projects included stock price prediction, handwriting image recognition, NLP comment classification and others. They all had one thing in common: we spent a long time training a model. It is okay to spend several hours on model training in a research project. But what if we want to use the model as a production service? It doesn’t make sense to tell our clients, “please wait a few more hours, we are training the model”. Don’t worry, we have a solution. Do you remember how we handled word embedding with 600 billion tokens? Yes, we didn’t train it ourselves, we used a pre-trained model. So this is our answer: we train a model once, save it, then load it and use it in production.

Train and Save RNN model

Before we use a pre-trained model, we need to train a model. Let’s use the toxic comment classification project that we did last time as our material. There, we used a Recurrent Neural Network (RNN) and word embedding to find toxic comments. For technical details on RNN and word embedding, please read our posts: NLP and Python Part 1 and Part 2. In this post, we focus on how to save and load the RNN model.

Before we start training and saving our model, remember to:

enable GPU
get fastText vector file
get GloVe word file

When we are ready, we can load the training dataset and pre-process the text.

import pickle
import numpy as np
import pandas as pd
from nltk.tokenize.treebank import TreebankWordTokenizer
from keras.models import Model, model_from_json
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, add, concatenate, Dropout
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing import text, sequence
from keras.callbacks import LearningRateScheduler

# preprocess() is our text-cleaning helper, defined in the complete source (link at the end of this post)
train_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
train_df['comment_text'] = train_df['comment_text'].apply(preprocess)

After that, we fit a tokenizer with the training text and save it as a pickle. Pickle is a Python module that serializes a Python object into a byte stream. So we can store the tokenizer in a file, i.e. saved_tokenizer.pickle in the code below.

x_train = train_df[TEXT_COLUMN].astype(str)
tokenizer = text.Tokenizer(filters="")
tokenizer.fit_on_texts(list(x_train))

with open('saved_tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
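
A quick way to convince ourselves that the pickle file is usable is to load it back right after saving and compare. The sketch below is a minimal, self-contained illustration of that round trip. It uses a plain dictionary (word_index, a made-up stand-in for the fitted tokenizer) and a temporary folder, so the names and values are assumptions for demonstration only.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted tokenizer: any picklable Python object works.
word_index = {"the": 1, "comment": 2, "toxic": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "saved_tokenizer.pickle")

    # Save: serialize the object into a byte stream on disk.
    with open(path, "wb") as handle:
        pickle.dump(word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load: rebuild an equal (but separate) object from the byte stream.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == word_index)
```

One thing to keep in mind: unpickling can run arbitrary code, so only load pickle files that you created yourself or that come from a source you trust.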

Firstly, we run the word embedding process. Please note that we only embed the training text because, in a save and load scenario, we never know in advance what the testing text will be.

x_train = tokenizer.texts_to_sequences(x_train)
x_train = sequence.pad_sequences(x_train, maxlen=MAX_LEN)

embedding_matrix = np.concatenate(
    [build_matrix(tokenizer.word_index, f) for f in EMBEDDING_FILES], axis=-1)
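
To see why the saved tokenizer matters, here is a rough pure-Python sketch of what turning text into padded sequences looks like. It is not the Keras implementation: to_sequence and pad are hypothetical helpers and the vocabulary is made up. The padding mirrors the Keras defaults (zeros are added, and extra tokens are cut, at the front), and words that were not seen during fitting are simply skipped, which is what the Keras tokenizer does when no out-of-vocabulary token is set.

```python
# Hypothetical vocabulary learnt from the training text (index 0 is reserved for padding).
word_index = {"you": 1, "are": 2, "nice": 3, "people": 4}

def to_sequence(sentence, word_index):
    """Map known words to their indices; unknown words are dropped."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]

def pad(seq, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` tokens."""
    seq = seq[-maxlen:] if maxlen > 0 else []
    return [0] * (maxlen - len(seq)) + seq

train_like = pad(to_sequence("You are nice", word_index), maxlen=5)
unseen = pad(to_sequence("You are awesome people", word_index), maxlen=5)

print(train_like)  # [0, 0, 1, 2, 3]
print(unseen)      # [0, 0, 1, 2, 4]  ("awesome" was never seen, so it is skipped)
```

The same word must map to the same index at training time and at prediction time, which is why we save the tokenizer together with the model instead of fitting a new one on the test text.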

Secondly, we build the RNN model (you can find the complete source at the end of this post) and save the model’s structure to a JSON file.

model = build_model(embedding_matrix, y_aux_train.shape[-1])
model_json = model.to_json()
with open("saved_model.json", "w") as json_file:
    json_file.write(model_json)

Thirdly, what time is it? It’s training time! After around 2 hours of training, we save our model weights to an HDF5 file.

for global_epoch in range(EPOCHS):
    model.fit(
        x_train,
        [y_train, y_aux_train],
        batch_size=BATCH_SIZE,
        verbose=2,
        sample_weight=[sample_weights.values, np.ones_like(sample_weights)],
        callbacks=[
            LearningRateScheduler(lambda _: 1e-3 * (0.4 ** global_epoch))
        ]
    )
model.save_weights("saved_model.h5")
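
The LearningRateScheduler callback in the loop above shrinks the learning rate by a factor of 0.4 after every outer epoch. The sketch below simply prints that schedule so we can see how quickly it decays. EPOCHS = 4 is an assumed value for illustration (the actual setting lives in the complete source), and learning_rate is a hypothetical helper that repeats the formula from the lambda.

```python
EPOCHS = 4  # assumed value, for illustration only

def learning_rate(global_epoch, base_lr=1e-3, decay=0.4):
    """Same formula as the lambda passed to LearningRateScheduler."""
    return base_lr * (decay ** global_epoch)

schedule = [learning_rate(epoch) for epoch in range(EPOCHS)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: learning rate = {lr:.6f}")
```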

Actually, we can save both the model’s structure and its weights into a single file, but it is more flexible to separate them into 2 files.

Load and Run RNN model

Since we have saved our RNN model, it is time to load the pre-trained model. But before that, let’s get the test dataset and the tokenizer we saved previously.

KERAS_PATH = "keras-93487"  # fill in the path to your saved model/tokenizer
TEXT_COLUMN = 'comment_text'
MAX_LEN = 220

test_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')
with open(f"../input/{KERAS_PATH}/saved_tokenizer.pickle", 'rb') as handle:
    tokenizer = pickle.load(handle)

Then we apply the same pre-processing and tokenizing, using the saved tokenizer.

test_df[TEXT_COLUMN] = test_df[TEXT_COLUMN].apply(preprocess)
x_test = test_df[TEXT_COLUMN].astype(str)
x_test = tokenizer.texts_to_sequences(x_test)
x_test = sequence.pad_sequences(x_test, maxlen=MAX_LEN)

Once we have prepared our testing data, we can use the saved model to predict the outcome.

# rebuild the model structure from the JSON file
with open(f"../input/{KERAS_PATH}/saved_model.json", 'r') as json_file:
    model_json = json_file.read()
model = model_from_json(model_json)
# load the trained weights into the rebuilt model
model.load_weights(f"../input/{KERAS_PATH}/saved_model.h5")
model.compile(loss='binary_crossentropy', optimizer='adam')
# the model has two outputs; the first one is the main toxicity prediction
prediction = model.predict(x_test, batch_size=2048)[0].flatten()
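
Because loading depends on three separate files, it can save some head-scratching to check that they are all in place before rebuilding the model. The sketch below is one possible way to do that with the standard library only. missing_artifacts is a hypothetical helper, and the demo runs against a temporary folder rather than the real Kaggle input path.

```python
import os
import tempfile

# File names written by the training code in this post.
REQUIRED_FILES = ["saved_tokenizer.pickle", "saved_model.json", "saved_model.h5"]

def missing_artifacts(folder):
    """Return the required files that are not present in `folder`."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(folder, name))]

# Demo with a temporary folder that only contains the JSON structure file.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "saved_model.json"), "w") as f:
        f.write("{}")
    missing = missing_artifacts(demo_dir)

print(missing)  # ['saved_tokenizer.pickle', 'saved_model.h5']
```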

Let’s make a dataframe to store our outcome.

validation = pd.DataFrame.from_dict({
    'id': test_df.id,
    'comment_text': test_df.comment_text,
    'prediction': prediction
})

def print_sample(df, column, value, sample_size):
    # positive value: sample rows scoring above it; negative value: sample rows scoring below abs(value)
    if value > 0:
        df = df[df[column] > value].sample(sample_size)
    else:
        df = df[df[column] < abs(value)].sample(sample_size)
    for index, row in df.iterrows():
        print(f"{row['id']} | {row['prediction']:6.4f} | {row['comment_text']} \n")

Then, let’s take a look at those non-toxic comments first.

print_sample(validation,'prediction', -0.5, 5)

7024785 | 0.0049 | Agreed regarding government . But the BofC just controls overnight rates - short term . Market forces dictate the steepness of the yield curve . And the yield on the GoC 5 year has spiked 15 % in the past 2 days . That rate affects mortgage rates . Inflation is well above the stated 1 . 7 % , and historically the 5 Year yield has been above the rate of inflation . If the market dictates , I ca n't see the B of C containing mid to long term rates . 

7037094 | 0.0384 | Trump s nominee for Secretary of Health wants to get rid of Obamacare and privatize Medicare . My wife , who is in very poor health , may well lose her health insurance . Thank you , all who voted for Trump , especially utilitas . 

7018832 | 0.1073 | Standard and normal language for a politician engaged in Dog Whistle Politics . Many - though certainly not all - of his supporters seem to need to hear this sort of thing , it s why he talked in his inauguration speech about gangs and drugs and carnage and old factories dotting the landscape like tombstones . He s a real piece of work , unfortunately . 

7050923 | 0.0124 | Do they have English - immersion schools in Quebec . ? From what I have heard about the language laws in Quebec , English seems a very poor cousin . In the rest of Canada French immersion is very big and sought after . As one who taught Spanish in a high school in B . C . , I found French immersion students were my best students with the head start they had in dealing with the concept of other languages . 

7051534 | 0.2933 | You talking about the most destructive Mayors in the history of Hawaii .

I can say this is what we expected. So what about the toxic comments?

print_sample(validation,'prediction', 0.5, 5)

7051483 | 0.5751 | `` My grandmother was a typical white person . `` They cling to their religion and their guns `` He worshipped for years with racist friend and America hater Jeremiah Wright . He mocked special Olympics on Letterman . `` The cops acted stupidly `` `` ISIS is JV `` [ that alone recruited more terrorists for ISIS than Trump s comments ] 

7013218 | 0.5867 | Best of luck to you Emers ! You are doing god s work . One day , I firmly believe , the manifold beneficial properties of cannabis will be so revealed that we will only wonder at our stupidity in not fully embracing it sooner . Got me off years of pain medication that had increasingly deleterious effects . 

7092639 | 0.7496 | Stop making dumb comments . Please . 

7053226 | 0.8995 | What a bunch of idiots . You are being paid by CNN to say that ! 

7092559 | 0.5864 | This much we know : Trump is a bully who uses money as a club with which to beat his creditors into submission ; they ca n't afford to litigate their grievances , and he can litigate per omnia saecula saeculorum .  And we know he s lying about his money : He ’ s lying about how much money he has . He ’ s lying about where it comes from . He ’ s lying about where is goes - - especially about his miserly contributions to charity . He ’ s lying about , and covering up , his foreign entanglements - - whom he owes and who owes him .

In conclusion, we can see that comments with negative words are scored 0.5 or above. Once again, the moral standard of machine learning is just a bit high for me :]] .
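
If a production service needs a yes/no answer rather than a score, one simple option is to cut the prediction at 0.5, the same boundary we used for sampling above. The sketch below applies that idea to a few scores copied from the samples. is_toxic is a hypothetical helper, and the 0.5 threshold is an assumption that you may want to tune for your own use case.

```python
THRESHOLD = 0.5  # assumed cut-off; tune it for your own use case

def is_toxic(score, threshold=THRESHOLD):
    """Hypothetical helper: flag a comment when its score is above the threshold."""
    return score > threshold

# A few (id, score) pairs copied from the sample output above.
samples = [(7024785, 0.0049), (7051534, 0.2933), (7092639, 0.7496), (7053226, 0.8995)]

flags = {comment_id: is_toxic(score) for comment_id, score in samples}
for comment_id, flagged in flags.items():
    print(comment_id, "toxic" if flagged else "ok")
```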

After that, by submitting the prediction to Kaggle, we got a score of 0.934x, which is an improvement on our previous submission :]] .

Benefits of using Save and Load Model

You may ask, what do we gain by saving and loading a model?

Time and Portability.

We skip the long training time, so we only need to handle the incoming test data. In our case above, it took only 70 seconds to handle 97.3k records.

Besides, since we have saved our model in files, we can use the model on other machines. It gives us more room to think about our strategy in resource planning, knowledge sharing and regional support.
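
To put those numbers in perspective, here is a back-of-the-envelope calculation. It only uses the figures quoted in this post (97.3k records, 70 seconds, around 2 hours of training), so treat the result as a rough estimate rather than a benchmark.

```python
records = 97_300  # 97.3k test records, as quoted above
seconds = 70      # handling time, as quoted above

throughput = records / seconds
print(f"{throughput:.0f} records per second")

training_seconds = 2 * 60 * 60  # "around 2 hours" of training
ratio = training_seconds / seconds
print(f"training took roughly {ratio:.0f}x longer than handling the test set")
```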

What have we learnt in Save and Load model?

How to save a Keras RNN model
How to load a Keras RNN model
Benefits of using save and load model

(Complete sources can be found at https://www.kaggle.com/codeastar/save-keras-rnn-model for model saving and https://www.kaggle.com/codeastar/load-keras-rnn-model for model loading)
