VISUALIZING AND UNDERSTANDING RECURRENT NEURAL NETWORKS
Presented By: Collin Watts
Written By: Andrej Karpathy, Justin Johnson, Li Fei-Fei

PLAN OF ATTACK
What we're going to cover:
Overview

Some Definitions
Experimental Analysis
Lots of Results
The Implications of the Results
Case Studies
Meta-Analysis

SO, WHAT WOULD YOU SAY YOU DO HERE...
This paper set out to analyze both the most efficient implementation of an RNN (we'll get there) and the internal mechanisms by which RNNs achieve their results.
Chose 3 different variants of RNNs: basic RNNs, LSTM RNNs, and GRU RNNs.
Did character-level language modelling as their test problem, as it is apparently strongly representative of other analyses.

DEFINITIONS
RECURRENT NEURAL NETWORK
Subset of Artificial Neural Networks.
Still use feedforward and backpropagation.
Allows nodes to form cycles, creating the potential for storage of information within the network.
Used in applications such as handwriting analysis, video analysis, translation, and other interpretation of various human tasks.
Difficult to train.

DEFINITIONS
RECURRENT NEURAL NETWORK (Cont.)
Uses a two-dimensional node setup, with time as one axis and depth of the nodes as the other.
Hidden vectors are referred to as h_t^l, with l = 0 being the input nodes and l = L being the output nodes.
Intermediate vectors are calculated as a function of both the previous time step and the previous layer. This results in the following recurrence (for the vanilla RNN):

    h_t^l = tanh( W^l [ h_t^{l-1} ; h_{t-1}^l ] )
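To make the recurrence concrete, here is a minimal NumPy sketch of one stacked-RNN update. This is my own illustration, not the authors' code; the sizes and the bias term are assumptions.

    import numpy as np

    def rnn_step(W_l, b_l, h_below, h_prev):
        # Vanilla RNN update for layer l at time t:
        # h_t^l = tanh(W^l [h_t^{l-1}; h_{t-1}^l] + b^l)
        stacked = np.concatenate([h_below, h_prev])
        return np.tanh(W_l @ stacked + b_l)

    # Toy usage: 2 layers, hidden size 4 (sizes are arbitrary).
    H = 4
    rng = np.random.default_rng(0)
    W = [rng.normal(size=(H, 2 * H)) for _ in range(2)]
    b = [np.zeros(H) for _ in range(2)]
    h = [np.zeros(H) for _ in range(2)]  # h_{t-1}^l for each layer
    h_below = rng.normal(size=H)         # h_t^0: the input vector
    for l in range(2):
        h[l] = rnn_step(W[l], b[l], h_below, h[l])
        h_below = h[l]                   # this layer feeds the next one up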

MORE DEFINITIONS!
LONG SHORT-TERM MEMORY VARIANT
Variant of the RNN designed to mitigate problems with backpropagation within an RNN.
Adds a memory vector to each node. At every time step, an LSTM can choose to read from, write to, or reset the memory vector, via a series of gating mechanisms.
Has the effect of preserving gradients across memory cells for long periods.
i, f, and o are the gates that control whether the memory cell is written, reset, or read, respectively, while g provides the additive candidate update to the memory cell.

HALF A DEFINITION...
GATED RECURRENT UNIT
Not well elaborated on in the paper...
The given explanation is that "the GRU has the interpretation of computing a candidate hidden vector and then smoothly interpolating towards it, as gated by z."
My interpretation: rather than having explicit access and control gates, this follows a more analog approach.
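A minimal NumPy sketch of both cell updates, following the standard LSTM/GRU equations. This is my own illustration; the weight layout (one stacked matrix for the LSTM gates) and the omission of biases are assumptions made for brevity.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(W, h_below, h_prev, c_prev):
        # W: (4n, 2n), maps [h_t^{l-1}; h_{t-1}^l] to the gate pre-activations.
        x = np.concatenate([h_below, h_prev])
        n = h_prev.size
        i = sigmoid(W[0*n:1*n] @ x)  # input gate: write to the memory cell?
        f = sigmoid(W[1*n:2*n] @ x)  # forget gate: keep or reset the memory cell?
        o = sigmoid(W[2*n:3*n] @ x)  # output gate: read the memory cell out?
        g = np.tanh(W[3*n:4*n] @ x)  # candidate: the additive update
        c = f * c_prev + i * g       # gated, additive memory update
        h = o * np.tanh(c)           # gated read of the memory
        return h, c

    def gru_step(Wr, Wz, Wh, h_below, h_prev):
        # Wr, Wz, Wh: each (n, 2n).
        x = np.concatenate([h_below, h_prev])
        r = sigmoid(Wr @ x)          # reset gate
        z = sigmoid(Wz @ x)          # update gate
        h_cand = np.tanh(Wh @ np.concatenate([h_below, r * h_prev]))
        # Smooth interpolation toward the candidate, gated by z:
        return (1.0 - z) * h_prev + z * h_cand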

EXPERIMENTAL ANALYSIS (SCIENCE!)
As previously stated, the researchers used character-level language modelling as the basis of comparison.
Trained each network to predict the following character in a sequence, using a Softmax classifier at each time step.
Each character is fed to the network as a fixed-size vector, and the hidden vectors in the last layer are mapped to one output per possible next character. These outputs represent the log probabilities of each character being the next character in the sequence.
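A sketch of that setup: characters go in one-hot encoded, and a Softmax over the last layer's output yields log probabilities for the next character. This is my own illustration; the toy vocabulary and output projection are assumptions.

    import numpy as np

    vocab = sorted(set("war and peace"))  # toy character vocabulary
    K = len(vocab)
    char_to_ix = {ch: i for i, ch in enumerate(vocab)}

    def one_hot(ch):
        # One-of-K input encoding for a single character.
        v = np.zeros(K)
        v[char_to_ix[ch]] = 1.0
        return v

    def next_char_log_probs(W_out, h_last):
        # Softmax over the last hidden layer -> log P(next character).
        scores = W_out @ h_last
        scores -= scores.max()  # for numerical stability
        return scores - np.log(np.exp(scores).sum())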

EXPERIMENTAL ANALYSIS (SCIENCE!)
Rejected the use of two other datasets (the Penn Treebank dataset and the Hutter Prize 100MB Wikipedia dataset) on the basis that they contain both standard English and markup.
The stated reason was to use a controlled setting for all types of neural networks, rather than compete for the best results on those datasets.
Decided on Leo Tolstoy's War and Peace, consisting of 3,258,246 characters, and the source code of the Linux Kernel (randomized across files and then concatenated into a single 6,206,996-character file).

EXPERIMENTAL ANALYSIS (SCIENCE!)
War and Peace was split 80/10/10 for training/validation/testing.
The Linux Kernel was split 90/5/5 for training/validation/testing.

Tested the following properties for each of the 3 RNN variants:
Number of layers (1, 2, or 3)
Number of parameters (64, 128, 256, or 512 cells per layer)

RESULTS (AND THE WINNER IS...)
Test-set cross-entropy loss: [results tables and figures shown in the original slides]
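For reference (the meta-analysis later asks "what is test-set cross-entropy loss?"): it is the mean negative log probability the model assigns to the character that actually comes next, averaged over the test set. A sketch of the computation, my own illustration:

    import numpy as np

    def cross_entropy_loss(log_probs_seq, target_ixs):
        # log_probs_seq: (T, K) log P(next char) at each time step
        # target_ixs:    (T,)  index of the character that actually came next
        return -np.mean([log_probs_seq[t][ix] for t, ix in enumerate(target_ixs)])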

IMPLICATIONS OF RESULTS (BUT WHY...)
The researchers paid attention to several characteristics beyond just the results of their findings. One of their stated goals was to arrive at why these emergent properties exist.
Interpretable, long-range LSTM cells have been theorized to exist, but never proven. They proved them.
Truncated backpropagation (used for performance gains as well as for combating overfitting) prevents the training signal from directly capturing dependencies more than X characters away, where X is the truncation depth of the backpropagation. These LSTM cells were able to overcome that challenge while retaining performance and fitting characteristics.
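A sketch of what truncation to X characters means in practice: the text is processed in chunks of length X, the hidden state is carried across chunk boundaries, but gradients are cut at each boundary. The model object and its methods here are hypothetical, purely to show the shape of the loop.

    X = 100  # truncation length, in characters

    def train_epoch(model, text_ixs, X):
        h = model.initial_state()
        for start in range(0, len(text_ixs) - 1, X):
            chunk = text_ixs[start:start + X + 1]
            # Forward through the chunk, carrying h across boundaries...
            loss, h = model.forward(chunk, h)
            # ...but detach h so no gradient flows past the boundary:
            # nothing more than X characters back can be directly credited.
            h = model.detach(h)
            model.backward_and_update(loss)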

VISUALIZATIONS OF RESULTS (BUT WHY...)
Text color is a visualization of tanh(c), where -1 is red and +1 is blue.
[cell activation visualizations shown across several original slides]

IMPLICATIONS OF RESULTS (BUT WHY...)
Also paid attention to gate activations in LSTMs (remember, the gates are what cause interactions with the memory cell).
Defined the ideas of left-saturated and right-saturated:
Left-saturated: the gate's activation is less than 0.1.
Right-saturated: the gate's activation is more than 0.9.
Of particular note:
There are right-saturated forget-gate cells (cells remembering values for long stretches).
There are no left-saturated forget-gate cells (no cells acting purely feed-forward).
Found that activations in the first layer are diffuse (the researchers could not explain this, but found it very strange).

VISUALIZATIONS OF RESULTS (BUT WHY... LSTMS / GRUS)
[gate saturation scatter plots shown in the original slides]
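The saturation bookkeeping is easy to state in code: for each gate unit, measure the fraction of time steps its activation falls below 0.1 (left) or above 0.9 (right). A sketch, my own illustration:

    import numpy as np

    def saturation_fractions(gate_acts):
        # gate_acts: (T, N) activations of N gate units over T time steps
        left  = (gate_acts < 0.1).mean(axis=0)  # fraction of steps spent < 0.1
        right = (gate_acts > 0.9).mean(axis=0)  # fraction of steps spent > 0.9
        return left, right

    # A forget-gate unit whose right fraction is near 1.0 acts as a pure
    # memory cell; one whose left fraction is near 1.0 would be purely
    # feed-forward (the paper finds none of the latter).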

ERROR ANALYSIS OF RESULTS
Compared the LSTM against two standard n-gram models to analyze its effectiveness.
An error was defined as the model assigning a probability of less than 0.5 to the character that actually came next.
Found that while the models shared many of the same errors, there were distinct segments that each one failed on differently.

ERROR ANALYSIS OF RESULTS
[error breakdown figures for the Linux Kernel and War and Peace shown in the original slides]
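That error criterion in code, as my own illustration:

    import numpy as np

    def error_mask(probs_seq, target_ixs):
        # probs_seq:  (T, K) predicted P(next char) at each step
        # target_ixs: (T,)   index of the true next character
        p_true = np.array([probs_seq[t][ix] for t, ix in enumerate(target_ixs)])
        return p_true < 0.5  # True wherever the model "errs" on a character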

ERROR ANALYSIS OF RESULTS
Found that the LSTM has significant advantages over standard n-gram models when computing the probability of special characters.
In the Linux Kernel model, brackets and whitespace are predicted significantly better than in the n-gram model, because of the LSTM's ability to keep track of the relationship between opening and closing brackets.
Similarly, in War and Peace, the LSTM was able to predict carriage returns more correctly, that relationship lying outside the n-gram model's effective range of prediction.

CASE STUDY { LOOK, BRACES! }
When it comes specifically to closing brackets (}) in the Linux Kernel, the researchers were able to analyze the performance of the LSTM versus the n-gram models.
Found that the LSTM did better than the n-gram model for distances of up to 60 characters between braces. After that, the performance gains levelled off.
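The case study hinges on measuring how far each closing brace sits from its matching opener. A sketch of that measurement, my own illustration rather than the authors' code:

    def brace_distances(source):
        # Distance, in characters, from each '}' back to its matching '{'.
        stack, distances = [], []
        for pos, ch in enumerate(source):
            if ch == '{':
                stack.append(pos)
            elif ch == '}' and stack:
                distances.append(pos - stack.pop())
        return distances

    # Bucketing these distances and comparing per-bucket error rates is what
    # shows the LSTM's advantage growing out to roughly 60 characters.
    print(brace_distances("if (x) { y = {1, 2}; }"))  # -> [5, 14]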

META-ANALYSIS (THE GOOD)
The researchers were able to very effectively capture and elucidate their point via their visualizations and implications.
They seem to have proven several ideas about how RNNs work in data analysis that had, until now, only been theorized.

META-ANALYSIS (THE BAD)
I would have appreciated a more in-depth explanation of why they rejected the standard competitive ANN datasets. It would seem to follow that those would be a truer measure of the capabilities, which is why they are chosen in the first place.
There wasn't a lot of explanation as to why the parameters for each RNN were chosen, or what the parameters for evaluation were. (What is test-set cross-entropy loss?)
The data was split differently across the two texts, so that the total counts for validation and testing were the same. I don't see what this offers. If anything, you would want the count of training data to be the same.

META-ANALYSIS (THE UGLY)
This paper does not ease the reader into understanding the ideas involved.
It required reading several additional papers to grasp the implications of things the authors assumed the reader knew.
Some ideas were not clearly explained even after researching the related works.

FINAL SLIDE
Questions? Comments? Concerns? Corrections?
