4C. Title Generation - Developing the Model - Beam Search

This post looks at developing our initial models to include state of the art features to improve results.

To recap:

We have two models: the Ludwig model and the Chollet/Brownlee model.
Performance so far has been fairly poor but the Ludwig model appears to give better results out of the box.
Each model had slightly different characteristics - the Ludwig model produced better formed output but seemed to simply memorise and repeat titles, the Chollet/Brownlee model had a lower loss and appeared to memorise less but produced more nonsensical outputs.

In this post we will look at implementing a beam search decoder as an improvement to our previous greedy decoder.

Beam Search Decoding

This video from the Udacity Deep Learning Course provides a good explanation of beam search.

Basically, when you generate a sequence on a token-by-token basis, each token has an associated probability, where the decoder outputs a vector indicative of probabilities for all possible tokens (in our model at least).

A naive way to select the tokens for the sequence is just to select the largest probability at each step in the sequence. Many decoder structures that are described on the Internet do this. However, this can lead to problems: if you have several tokens with similar probabilities, you might select one token that actually leads to a less likely sequence overall.

To remedy this you might want to keep track of all possible output sequences. However, the number of possible sequences grows exponentially. (Even with a small vocabulary of 2500 tokens you can end up the same number of sequences as there are atoms in the universe!) Beam search makes this manageable by keeping track of only k top sequences.

Machine Learning Mastery has an article here that describes how to implement a beam search in Python. However, in the comments this is criticised for being too simplistic. On review, I think this criticism is warranted: the function does not look at different, branching trees of tokens; instead, the probabilities are "given" in a pre-generated array.

Here is another example of a beam search for keras and here is an example slide showing the branching trees.

To implement, the beam search algorithm, we need a function that performs the following steps:

Initialise probabilities to 1 for start token;
Predict the list of probabilities for the start token;
Update scores as score * -log(list of probs);
Select the k top scores;
Build k sequences for each of the k top probabilities = start token + prediction for 0 to k-1; and
Iterate by predicting the next token for each sequence and updating the scores and sequences.

We can also amend our methods for printing examples to print the top k sequences and their probabilities.

With reference to our models, we can implement beam search as a custom _predict_from_seq() method.

We need a data structure to store, for our k top sequences:

scores;
token sequences.

Our input will be:

claim text (as for our previous decoder); and
a value for k, the number of sequences to track.

We want to output:

the k sequences with top scores, where sequences stop when the stop token is predicted or when the maximum number of output tokens is reached.

Ludwig Model - Custom Decoding Function

Let's start by working on our custom decoding function. This will have the form of _predict_from_seq() as previously developed.

We can first try this out and debug as a separate method that takes a pre-existing model object as self. Once we get something working we can then easily create a custom class that inherits from a previous object definition that has the custom beam search decoding method.

from ludwig_model import LudwigModel

/home/ben/anaconda3/envs/tf_gpu_source/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.

# Load our data as before
import pickle
import os

PIK = "claim_and_title.data"

if not os.path.isfile(PIK):
    # Download file
    !wget https://benhoyle.github.io/notebooks/title_generation/claim_and_title.data

with open(PIK, "rb") as f:
    print("Loading data")
    data = pickle.load(f)
    print("{0} samples loaded".format(len(data)))

print("\n\nAdding start and stop tokens to output")
data = [(c, "startseq {0} stopseq".format(t)) for c, t in data]

print("\n\nAn example title:", data[0][1])
print("----")
print("An example claim:", data[0][0])

Loading data
30000 samples loaded


Adding start and stop tokens to output


An example title: startseq System and method for session restoration at geo-redundant gateways stopseq
----
An example claim: 
1. A method for managing a backup service gateway (SGW) associated with a primary SGW, the method comprising:
periodically receiving from the primary SGW at least a portion of corresponding UE session state information, the received portion of session state information being sufficient to enable the backup SGW to indicate to an inquiring management entity that UEs having an active session supported by the primary SGW are in a live state; and
in response to a failure of the primary SGW, the backup SGW assuming management of IP addresses and paths associated with said primary SGW and transmitting a Downlink Data Notification (DDN) toward a Mobility Management Entity (MME) for each of said UEs having an active session supported by the failed primary SGW to detach from the network and reattach to the network, wherein each DDN causes the MME to send a detach request with a reattach request code to the respective UE.

machine = LudwigModel(
    encoder_texts=[d[0] for d in data],
    decoder_texts=[d[1] for d in data],
    encoder_seq_length=300,
    decoder_seq_length=22,
    num_encoder_tokens=2500,
    num_decoder_tokens=2500,
    latent_dim=128,
    weights_file="LW_model.hdf5",
    training_set_size=250
)

Fitting tokenizers
Our input data has shape (30000, 300) and our output data has shape (30000, 22)
Generating training and test data
Building model
Loading GloVe 100d embeddings from file
Found 400000 word vectors.
Building embedding matrix
Compiling model
Loaded weights

# Imports 
import math
import random
import numpy as np


def sample(preds, temperature=1.0):
    """ Helper function to sample an index from a probability array. """
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return probas

def _predict_from_seq(self, input_data, k, temp=1.0):
    """ Predict output text from input text. """
    input_seq = input_data.reshape(1, self.encoder_seq_length)
    ndt = self.num_decoder_tokens  # Just shorten the variable name 
    output_sequences = []

    ans_partial = np.zeros((1, self.decoder_seq_length), dtype=np.int16)
    ans_partial[0, -1] = self.output_dictionary["startseq"]

    # Initialise data storage - char sequence, state sequence, scores
    sequences = [[ans_partial, 0.0]]

    for _ in range(self.decoder_seq_length):
        # Create an empty array to hold the scores in each pass
        new_scores = np.zeros(shape=(len(sequences)*ndt))
        temp_states = []
        #print([s[0] for s in sequences])
        for i, seq in enumerate(sequences):
            #print(i, seq[0])
            # Get most recent score
            prev_score = seq[1]
            # Get current token sequence
            ans_partial = seq[0]
            # Predict token probabilities using previous state and token for sequence
            yhat = self.infdec.predict([input_seq, ans_partial])
            # Unpack yhat array of probabilities
            #yhat = sample(yhat[0, 0, :])
            yhat = yhat[0,:]
            # print(yhat)
            # As we are taking the log do we not sum the scores?
            new_scores[i*ndt:(i+1)*ndt] = prev_score+-np.log(yhat) 
            # print(new_scores)
            # Actually I only need to look at the top k for each sequence, then further distill these down to the top k
            # across all sequences - maybe one for optimising later

        # Outside of loop we want to pick the k highest scores
        # Each sequence has a set of scores equal in size to num_decoder_tokens- i.e. k*n_d_t scores in total
        # We want to select the k highest scores across all scores in total - but then we need to know
        # We just modulo k on the index to find the indice and int(index/num_decoder_tokens) to find k

        # Then update the sequences to reflect these k highest scores

        # select top k scores -as we are minimising these are at bottom of list
        top_k_indices = np.argsort(new_scores)[:k].tolist()
        #print(new_scores[top_k_indices])
        #print(top_k_indices)
        new_sequences = []
        seqs_to_delete = []
        for index in top_k_indices:
            seq_select = int(index/ndt)
            #print("Selected seq: ", seq_select)
            #print("Length of sequences: ", len(sequences))
            new_token = index % ndt # this is the token index
            # This is equivalent to argmax - but should we actually be sampling?
            # Update the partial answer
            new_ans_partial = np.zeros((1, self.decoder_seq_length), dtype=np.int16)
            new_ans_partial[0, 0:-1] = sequences[seq_select][0][0, 1:]
            # This then adds the newly decoded word onto the end of ans_partial
            new_ans_partial[0, -1] = new_token
            #print(new_scores[index])
            entry = (new_ans_partial, new_scores[index])
            # If predicted token is end token
            if new_token == self.output_dictionary["stopseq"]:
                # Add data for output
                output_sequences.append(entry)
                # Reduce k by 1
                k -= 1
            else:
                # Add to list of new sequences to use
                new_sequences.append(entry)
        sequences = new_sequences
        if k == 0:
            break

    # Sort list in reverse "score" order
    output_sequences.sort(key=lambda x: x[1])
    return output_sequences

def print_examples(self, number, k=5):
    num_test_titles = len(self.input_test_data)
    indices = random.sample(range(0, num_test_titles), number)
    for i in indices:
        input_sample = self.input_test_data[i]
        output_sample = self.output_test_data[i]
        seqs = _predict_from_seq(self, input_sample, k)
        output_sample_text = self._seq2text(output_sample)
        claim_text = self._seq2text(input_sample, output=False)
        print("-----------------")
        print("Sample of claim text: {}".format(claim_text[0:200]))
        print("-----------------")
        print("\nActual title is: {} \n---".format(output_sample_text))
        for seq, score in seqs:
            o_string = (
                "Predicted title is: {0} with score {1}"
            )

            print(o_string.format(self._seq2text(seq[0].tolist()), score))

# Let's test without Beam Search decoding
machine.example_output(5)

------------------------------------------
Sample of claim text: 1 a computer implemented system for providing users with an interface comprising temporal information the system comprising a computer system the computer system comprising at least one processor and

Predicted title is: method and system for providing a user interface  
Actual title is: providing temporal information to users  
---
Sample of claim text: blocks being disposed against the hot key patterns the first electrode blocks being disposed against the key group pattern the second sensing block being disposed against the cursor control pattern th

Predicted title is: integrated circuit with a power supply  
Actual title is: integrated input apparatus  
---
Sample of claim text: 1 a mobile product inventory management device comprising a display a camera a processor and a recognition module executable on the processor and that the processor to capture image data via the camer

Predicted title is: method and system for providing a product  
Actual title is: virtual management systems and methods  
---
Sample of claim text: 1 a method comprising receiving a two dimensional 2d image frame receiving three dimensional 3d image data corresponding to the 2d image frame using the 3d image data corresponding to the 2d image fra

Predicted title is: method and apparatus for three dimensional image  
Actual title is: using a combination of 2d and 3d image data to determine hand features information  
---
Sample of claim text: 1 a non transitory computer readable medium having computer executable instructions for performing a method comprising collecting performance data for different domains in a system by activating a plu

Predicted title is: method and system for managing data  
Actual title is: system and method to monitor performance of different domains associated with a computer or network  
---

self = machine
print_examples(self, 5)

-----------------
Sample of claim text: 1 a display control apparatus for controlling a display content of a display section of a peripheral that is connected to the display control apparatus the display section being included by the periph
-----------------

Actual title is: display control apparatus communicating with a peripheral to present operational information to users  
---
Predicted title is: display apparatus  with score 4.325193405151367
Predicted title is: information processing apparatus information processing method and recording medium  with score 8.561894416809082
Predicted title is: method and apparatus for portable electronic device  with score 8.68794059753418
Predicted title is: information processing apparatus information processing method and electronic device  with score 10.179014205932617
Predicted title is: information processing apparatus information processing method and computer readable medium  with score 10.28657341003418
-----------------
Sample of claim text: and memory configured for connection with a plurality of components of the aircraft to retrieve data from one or more modules of the aircraft components a configuration file in said portable computer 
-----------------

Actual title is: configuration management apparatus and related methods  
---
Predicted title is: system and method for data processing  with score 8.64958381652832
Predicted title is: system and method for managing data  with score 9.201067924499512
Predicted title is: system and method for data processing in a virtual environment  with score 12.002704620361328
Predicted title is: system and method for data processing in a virtual machine  with score 13.135804176330566
Predicted title is: system and method for data processing in a web page  with score 13.158220291137695
-----------------
Sample of claim text: 1 a computer implemented method for processing a selection associated with a media content received in an electronic broadcast transmission the media content presented by a receiving device and the me
-----------------

Actual title is: broadcast response method and system  
---
Predicted title is: system and method for data  with score 6.181354522705078
Predicted title is: system and method for data processing  with score 7.070134162902832
Predicted title is: system and method for data processing in an electronic device  with score 11.689477920532227
Predicted title is: system and method for data processing in an electronic system  with score 11.725565910339355
Predicted title is: system and method for data processing in a digital media system  with score 12.572397232055664
-----------------
Sample of claim text: 1 a method of detecting human objects in a video comprising determining pixels of a video image are foreground pixels the group of foreground pixels a foreground set of one or more foreground for each
-----------------

Actual title is: 3d human pose and shape modeling  
---
Predicted title is: method and apparatus for creating images  with score 7.900638103485107
Predicted title is: method and apparatus for creating and displaying images  with score 11.06941032409668
Predicted title is: method and apparatus for creating images of images  with score 11.170570373535156
Predicted title is: method and apparatus for creating and displaying video  with score 12.05427360534668
Predicted title is: method and apparatus for creating and displaying video images  with score 12.832809448242188
-----------------
Sample of claim text: 1 in a computing system environment a method of of files stored on one or more computing devices each file having a plurality of symbols representing an underlying data stream of original bits of data
-----------------

Actual title is: grouping and volumes of files  
---
Predicted title is: system and method for creating data  with score 9.60139274597168
Predicted title is: system and method for creating a data stream  with score 11.702018737792969
Predicted title is: systems and methods for creating a data stream  with score 12.231966018676758
Predicted title is: system and method for creating data in a data stream  with score 14.472331047058105
Predicted title is: system and method for creating data from a data stream  with score 14.49394416809082

print_examples(self, 5, k=10)

-----------------
Sample of claim text: 1 a computer implemented method of reducing risk in a payment based transaction wherein payment is made from an account holder to a using a payment bank system operated by a payment bank the method co
-----------------

Actual title is: reducing risk in a payment based transaction based upon at least one user supplied risk parameter including a payment limit  
---
Predicted title is: payment system  with score 5.297518253326416
Predicted title is: payment payment system  with score 5.336144924163818
Predicted title is: system and method for payment transactions  with score 5.893258571624756
Predicted title is: systems and methods for payment transactions  with score 6.355759620666504
Predicted title is: method and system for payment transactions  with score 6.844301223754883
Predicted title is: systems and methods for payment payment transactions  with score 7.168615818023682
Predicted title is: system and method for payment payment transactions  with score 7.247190952301025
Predicted title is: system and method for payment processing  with score 7.810699939727783
Predicted title is: system and method for payment payment  with score 7.991192817687988
Predicted title is: methods and systems for payment payment transactions  with score 8.089160919189453
-----------------
Sample of claim text: 1 a method performed by a system that supports services between a plurality of and a plurality of in a communication network the method comprising delivering a seller interface via the communication n
-----------------

Actual title is: system supporting creation  
---
Predicted title is: method and system for providing a product  with score 13.070158004760742
Predicted title is: methods and systems for providing a product  with score 13.112994194030762
Predicted title is: system and method for providing a product  with score 13.368507385253906
Predicted title is: method and apparatus for providing a product  with score 13.528003692626953
Predicted title is: methods and systems for providing a product item  with score 15.519207954406738
Predicted title is: method and system for providing a product item  with score 15.596986770629883
Predicted title is: method and apparatus for providing real time messaging  with score 15.767982482910156
Predicted title is: method and system for providing real time messaging  with score 15.802291870117188
Predicted title is: method and system for providing real time party based on the internet  with score 24.388505935668945
Predicted title is: method and system for providing real time party based on a plurality of resources  with score 27.679676055908203
-----------------
Sample of claim text: 1 a method of providing consistent user experience in the method comprising receiving an indication of a target level of user experience for an application computing a target application performance c
-----------------

Actual title is: cache management in application  
---
Predicted title is: system and method for managing application  with score 11.204858779907227
Predicted title is: system and method for managing application development  with score 12.491059303283691
Predicted title is: system and method for providing a document  with score 13.075180053710938
Predicted title is: system and method for providing a secure document  with score 13.515348434448242
Predicted title is: system and method for generating a secure document  with score 14.135204315185547
Predicted title is: system and method for providing a user interface  with score 14.223642349243164
Predicted title is: system and method for providing a user space  with score 14.36276912689209
Predicted title is: system and method for providing a cache based on user preferences  with score 19.699052810668945
Predicted title is: system and method for providing a cache based on user space  with score 20.409120559692383
Predicted title is: system and method for providing a cache based on user preferences of a user  with score 28.28386688232422
-----------------
Sample of claim text: or more programmable processors a plurality of data objects associated with data stored in a database at a persistent storage the providing for execution of a service called by an application storing 
-----------------

Actual title is: for objects  
---
Predicted title is: system and method for data collection  with score 9.759042739868164
Predicted title is: system and method for data processing  with score 9.763725280761719
Predicted title is: system and method for managing data  with score 10.533380508422852
Predicted title is: system and method for creating a database  with score 11.950915336608887
Predicted title is: system and method for data processing in a database  with score 13.33000373840332
Predicted title is: system and method for data processing in a cluster  with score 13.828896522521973
Predicted title is: system and method for data collection in a database  with score 13.97393798828125
Predicted title is: system and method for data processing in a database system  with score 14.517136573791504
Predicted title is: system and method for data processing in a cloud  with score 14.586236000061035
Predicted title is: system and method for data processing in a cloud computing environment  with score 14.967475891113281
-----------------
Sample of claim text: 1 a method for enabling a product offering received by a client from a vendor the method comprising in a computer system at the client providing to the vendor a plurality of strings computing an index
-----------------

Actual title is: secure product over channels with bandwidth  
---
Predicted title is: method and system for providing financial transactions  with score 10.173721313476562
Predicted title is: method and system for providing a product  with score 10.914628982543945
Predicted title is: system and method for providing a product  with score 10.918341636657715
Predicted title is: methods and systems for providing a product  with score 11.22200870513916
Predicted title is: system and method for providing a secure product  with score 12.802101135253906
Predicted title is: method and system for providing a secure product  with score 12.827587127685547
Predicted title is: system and method for providing a product item  with score 13.25611686706543
Predicted title is: method and system for providing a product item  with score 13.26268196105957
Predicted title is: method and system for providing a product development  with score 13.407301902770996
Predicted title is: system and method for providing a product development  with score 13.513633728027344

Comments

Using the beam search decoder has two advantages: - it improves the quality of our predicted titles: the highest scoring titles in these examples appear to reflect fairly accurately the most relevant titles; and - it allows us to visualise how our models are working by outputing the k highest scoring entries.

The downside is that decoding time is increased. But this doesn't matter that much for our toy project.

It seems worthwhile folding beam search into our models. One option is to maybe create a mix-in class to add the beam search functionality.