  • #20. Dinosaur Island - Character-Level Language Modeling

    - How to store text data for processing using an RNN

    - How to synthesize data by sampling a prediction at each time-step and passing it to the next RNN-cell unit

    - How to build a character-level text-generation RNN

    - Why clipping the gradients is important

     

    * Problem Statement

    (1) Dataset and Preprocessing

    - Read the dataset and build a list of its unique characters.

    - In this example there are 27 characters: the letters a-z plus the newline character \n (see the sketch below).
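
    A minimal sketch of this preprocessing step. It assumes dinos.txt is in the working directory and builds char_to_ix / ix_to_char mappings like the ones used later by model and sample:

    data = open('dinos.txt', 'r').read()
    data = data.lower()
    chars = sorted(list(set(data)))        # unique characters: '\n' plus 'a'-'z'
    vocab_size = len(chars)                # 27 in this example
    char_to_ix = {ch: i for i, ch in enumerate(chars)}   # character -> index
    ix_to_char = {i: ch for i, ch in enumerate(chars)}   # index -> character
    print(vocab_size, char_to_ix['\n'])    # 27 0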

     

    (2) Overview of the model

    - Initialize parameters

    - Run the optimization loop

        - Forward propagation to compute the loss function

        - Backward propagation to compute the gradients with respect to the loss function

        - Clip the gradients to avoid exploding gradients

        - Using the gradients, update your parameters with the gradient descent update rule

    - Return the learned parameters

    - At each time-step, the RNN predicts the next character given the characters that came before it.

    - The dataset X = (x<1>, x<2>, ..., x<Tx>) is the list of characters in the training set, and Y = (y<1>, y<2>, ..., y<Tx>) is the same list shifted one position to the left, so that y<t> = x<t+1> at every time-step t (an example follows below).
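
    For illustration, here is how one training word could be turned into an (X, Y) pair; the tiny char_to_ix mapping and the word 'abc' are stand-ins, not the real vocabulary or dataset:

    char_to_ix = {'\n': 0, 'a': 1, 'b': 2, 'c': 3}    # toy vocabulary for illustration only
    example = 'abc'
    X = [None] + [char_to_ix[ch] for ch in example]   # [None, 1, 2, 3]; None stands for the zero input at t=1
    Y = X[1:] + [char_to_ix['\n']]                    # [1, 2, 3, 0]; Y is X shifted left, ending with '\n'
    print(X, Y)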

     

     

    * Building blocks of the model

    - Gradient clipping: to avoid exploding gradients

    - Sampling: a technique used to generate characters

     

    (1) Clipping the gradients in the optimization loop

    - Gradient clipping is applied so that the gradients never take on excessively large values and explode.

    - There are several ways to clip gradients; here, simple element-wise clipping is used.

        - Every element of each gradient vector is forced into the range [-N, N].

        - Set a maxValue; any element above it is reset to maxValue, and any element below -maxValue is reset to -maxValue.

    ### GRADED FUNCTION: clip
    
    def clip(gradients, maxValue):
        '''
        Clips the gradients' values between minimum and maximum.
        
        Arguments:
        gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
        maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
        
        Returns: 
        gradients -- a dictionary with the clipped gradients.
        '''
        
        dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
       
        # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
        for gradient in [dWax, dWaa, dWya, db, dby]:
            np.clip(gradient, -maxValue, maxValue, out = gradient)
        
        gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
        
        return gradients
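
    A quick sanity check of clip on random gradients (the shapes and maxValue below are arbitrary illustrative choices, not values from the notebook):

    import numpy as np

    np.random.seed(3)
    gradients = {"dWax": np.random.randn(5, 3) * 10, "dWaa": np.random.randn(5, 5) * 10,
                 "dWya": np.random.randn(2, 5) * 10, "db": np.random.randn(5, 1) * 10,
                 "dby": np.random.randn(2, 1) * 10}
    clipped = clip(gradients, maxValue=10)
    print(np.max(np.abs(clipped["dWaa"])))   # never exceeds 10 after clipping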

     

    (2) Sampling

    # GRADED FUNCTION: sample
    
    def sample(parameters, char_to_ix, seed):
        """
        Sample a sequence of characters according to a sequence of probability distributions output of the RNN
    
        Arguments:
        parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
        char_to_ix -- python dictionary mapping each character to an index.
        seed -- used for grading purposes. Do not worry about it.
    
        Returns:
        indices -- a list of length n containing the indices of the sampled characters.
        """
        
        # Retrieve parameters and relevant shapes from "parameters" dictionary
        Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
        vocab_size = by.shape[0]
        n_a = Waa.shape[1]
        
        # Step 1: Create the one-hot vector x for the first character (initializing the sequence generation). (≈1 line)
        x = np.zeros((vocab_size, 1))
        # Step 1': Initialize a_prev as zeros (≈1 line)
        a_prev = np.zeros((n_a, 1))
        
        # Create an empty list of indices, this is the list which will contain the list of indices of the characters to generate (≈1 line)
        indices = []
        
        # Idx is a flag to detect a newline character, we initialize it to -1
        idx = -1 
        
        # Loop over time-steps t. At each time-step, sample a character from a probability distribution and append 
        # its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well 
        # trained model), which helps debugging and prevents entering an infinite loop. 
        counter = 0
        newline_character = char_to_ix['\n']
        
        while (idx != newline_character and counter != 50):
            
            # Step 2: Forward propagate x using the equations (1), (2) and (3)
            a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
            z = np.dot(Wya, a) + by
            y = softmax(z)
            
            # for grading purposes
            np.random.seed(counter+seed) 
            
            # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
            idx = np.random.choice(list(range(vocab_size)), p = y.ravel())
    
            # Append the index to "indices"
            indices.append(idx)
            
            # Step 4: Overwrite the input character as the one corresponding to the sampled index.
            x = np.zeros((vocab_size, 1))
            x[idx] = 1
            
            # Update "a_prev" to be "a"
            a_prev = a
            
            # for grading purposes
            seed += 1
            counter +=1
            
    
        if (counter == 50):
            indices.append(char_to_ix['\n'])
        
        return indices
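
    sample relies on a softmax helper provided elsewhere in the assignment; if the snippet is run outside the notebook, a minimal stand-in could look like this:

    def softmax(x):
        # Shift by the max for numerical stability before exponentiating.
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)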

     

    * Building the language model

    (1) Gradient Descent

    - In this example, a function performing one step of stochastic gradient descent (with clipped gradients) is implemented.

    - The steps of a common optimization loop for an RNN:

        1. Forward propagate through the RNN to compute the loss

        2. Backward propagate through time to compute the gradients of the loss with respect to the parameters

        3. Clip the gradients if necessary

        4. Update your parameters using gradient descent

     

    # GRADED FUNCTION: optimize
    
    def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
        """
        Execute one step of the optimization to train the model.
        
        Arguments:
        X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
        Y -- list of integers, exactly the same as X but shifted one index to the left.
        a_prev -- previous hidden state.
        parameters -- python dictionary containing:
                            Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                            Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                            Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                            b --  Bias, numpy array of shape (n_a, 1)
                            by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
        learning_rate -- learning rate for the model.
        
        Returns:
        loss -- value of the loss function (cross-entropy)
        gradients -- python dictionary containing:
                            dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                            dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                            dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                            db -- Gradients of bias vector, of shape (n_a, 1)
                            dby -- Gradients of output bias vector, of shape (n_y, 1)
        a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
        """
        
        ### START CODE HERE ###
        
        # Forward propagate through time (≈1 line)
        loss, cache = rnn_forward(X, Y, a_prev, parameters)
        
        # Backpropagate through time (≈1 line)
        gradients, a = rnn_backward(X, Y, parameters, cache)
        
        # Clip your gradients between -5 (min) and 5 (max) (≈1 line)
        gradients = clip(gradients, maxValue = 5)
        
        # Update parameters (≈1 line)
        parameters = update_parameters(parameters, gradients, learning_rate)
        
        ### END CODE HERE ###
        
        return loss, gradients, a[len(X)-1]

    def update_parameters(parameters, gradients, lr):
    
        parameters['Wax'] += -lr * gradients['dWax']
        parameters['Waa'] += -lr * gradients['dWaa']
        parameters['Wya'] += -lr * gradients['dWya']
        parameters['b']  += -lr * gradients['db']
        parameters['by']  += -lr * gradients['dby']
        return parameters
    
    def rnn_forward(X, Y, a0, parameters, vocab_size = 27):
        
        # Initialize x, a and y_hat as empty dictionaries
        x, a, y_hat = {}, {}, {}
        
        a[-1] = np.copy(a0)
        
        # initialize your loss to 0
        loss = 0
        
        for t in range(len(X)):
            
            # Set x[t] to be the one-hot vector representation of the t'th character in X.
            # if X[t] == None, we just have x[t]=0. This is used to set the input for the first timestep to the zero vector. 
            x[t] = np.zeros((vocab_size,1)) 
            if (X[t] != None):
                x[t][X[t]] = 1
            
            # Run one step forward of the RNN
            a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])
            
            # Update the loss: subtracting log(y_hat) of the true character adds this time-step's cross-entropy term.
            loss -= np.log(y_hat[t][Y[t],0])
            
        cache = (y_hat, a, x)
            
        return loss, cache
    
    def rnn_backward(X, Y, parameters, cache):
        # Initialize gradients as an empty dictionary
        gradients = {}
        
        # Retrieve from cache and parameters
        (y_hat, a, x) = cache
        Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
        
        # each one should be initialized to zeros of the same dimension as its corresponding parameter
        gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
        gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
        gradients['da_next'] = np.zeros_like(a[0])
        
        ### START CODE HERE ###
        # Backpropagate through time
        for t in reversed(range(len(X))):
            dy = np.copy(y_hat[t])
            dy[Y[t]] -= 1
            gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
        ### END CODE HERE ###
        
        return gradients, a
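
    optimize above calls rnn_step_forward and rnn_step_backward, which the assignment provides in its utility code. A sketch of what they could look like, consistent with the equations used in sample and rnn_forward (treat this as an assumption, not the notebook's exact code):

    def rnn_step_forward(parameters, a_prev, x):
        Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
        a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)   # next hidden state
        p_t = softmax(np.dot(Wya, a_next) + by)                      # prediction y_hat<t>
        return a_next, p_t

    def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
        gradients['dWya'] += np.dot(dy, a.T)
        gradients['dby'] += dy
        da = np.dot(parameters['Wya'].T, dy) + gradients['da_next']  # backprop into the hidden state
        daraw = (1 - a * a) * da                                     # through the tanh nonlinearity
        gradients['db'] += daraw
        gradients['dWax'] += np.dot(daraw, x.T)
        gradients['dWaa'] += np.dot(daraw, a_prev.T)
        gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
        return gradients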
    

     

    (2) Training the model

    - During training, a few dinosaur names are sampled at regular intervals (every 2000 iterations in the code below) to check how the algorithm is doing.

    # GRADED FUNCTION: model
    
    def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
        """
        Trains the model and generates dinosaur names. 
        
        Arguments:
        data -- text corpus
        ix_to_char -- dictionary that maps the index to a character
        char_to_ix -- dictionary that maps a character to an index
        num_iterations -- number of iterations to train the model for
        n_a -- number of units of the RNN cell
        dino_names -- number of dinosaur names you want to sample at each iteration. 
        vocab_size -- number of unique characters found in the text, size of the vocabulary
        
        Returns:
        parameters -- learned parameters
        """
        
        # Retrieve n_x and n_y from vocab_size
        n_x, n_y = vocab_size, vocab_size
        
        # Initialize parameters
        parameters = initialize_parameters(n_a, n_x, n_y)
        
        # Initialize loss (this is required because we want to smooth our loss, don't worry about it)
        loss = get_initial_loss(vocab_size, dino_names)
        
        # Build list of all dinosaur names (training examples).
        with open("dinos.txt") as f:
            examples = f.readlines()
        examples = [x.lower().strip() for x in examples]
        
        # Shuffle list of all dinosaur names
        np.random.seed(0)
        np.random.shuffle(examples)
        
        # Initialize the hidden state of your RNN
        a_prev = np.zeros((n_a, 1))
        
        # Optimization loop
        for j in range(num_iterations):
            
            ### START CODE HERE ###
            
            # Use the hint above to define one training example (X,Y) (≈ 2 lines)
            index = j % len(examples)
            X = [None] + [char_to_ix[ch] for ch in examples[index]] 
            Y = X[1:] + [char_to_ix["\n"]]
            
            # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
            # Choose a learning rate of 0.01
            curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate=0.01)
            
            ### END CODE HERE ###
            
            # Use a latency trick to keep the loss smooth. It happens here to accelerate the training.
            loss = smooth(loss, curr_loss)
    
            # Every 2000 iterations, generate "n" names using sample() to check whether the model is learning properly
            if j % 2000 == 0:
                
                print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
                
                # The number of dinosaur names to print
                seed = 0
                for name in range(dino_names):
                    
                    # Sample indices and print them
                    sampled_indices = sample(parameters, char_to_ix, seed)
                    print_sample(sampled_indices, ix_to_char)
                    
                    seed += 1  # To get the same result for grading purposes, increment the seed by one.
          
                print('\n')
            
        return parameters
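
    Putting it together: a sketch of how model could be invoked. The preprocessing mirrors the earlier sketch; initialize_parameters, get_initial_loss, smooth, and print_sample are the helpers the code above already calls, but the module name "utils" and the import form are assumptions:

    # Module name "utils" is an assumption; the notebook provides these helpers itself.
    from utils import initialize_parameters, get_initial_loss, smooth, print_sample

    data = open('dinos.txt', 'r').read().lower()
    chars = sorted(list(set(data)))
    char_to_ix = {ch: i for i, ch in enumerate(chars)}
    ix_to_char = {i: ch for i, ch in enumerate(chars)}

    # Train and sample; a handful of generated dinosaur names is printed every 2000 iterations.
    parameters = model(data, ix_to_char, char_to_ix, num_iterations=35000, n_a=50)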

     
