The challenge of representing time has long been a central problem in artificial intelligence and cognitive science. Many important human behaviors, like understanding language or planning actions, unfold sequentially. In a landmark 1990 paper, “Finding Structure in Time,” Jeffrey L. Elman introduced a powerful solution that changed the field:
simple recurrent networks. This work demonstrated how a neural network could develop a rich understanding of structure by simply processing sequences one step at a time.
Before Elman’s work, a common approach was to represent time spatially. This meant converting a sequence into one large, static snapshot and presenting it to a network all at once. However, Elman pointed out that this method has serious flaws. It imposes rigid limits on the length of sequences and struggles with patterns that shift in time. As he wrote, “such an approach does not easily distinguish relative temporal position from absolute temporal position”. Two identical patterns occurring at different moments would look completely distinct to the network.
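To make the flaw concrete, here is a minimal sketch (not from the paper) of the spatial approach: the same short pattern placed at two different offsets of a fixed input window produces encodings that share nothing, so the network would treat them as unrelated inputs.

```python
import numpy as np

# A "spatial" representation of time: the whole sequence is flattened
# into one fixed-length input vector, one slot per time step.
WINDOW = 8

def spatialize(pattern, offset, window=WINDOW):
    """Place a short pattern at a given offset inside a fixed window."""
    vec = np.zeros(window)
    vec[offset:offset + len(pattern)] = pattern
    return vec

pattern = [1.0, 0.0, 1.0]          # the same temporal pattern...
early = spatialize(pattern, 0)     # ...occurring at time step 0
late = spatialize(pattern, 4)      # ...occurring at time step 4

# Identical patterns at different absolute positions have zero overlap,
# so the network sees two completely different inputs.
print(early)                       # [1. 0. 1. 0. 0. 0. 0. 0.]
print(late)                        # [0. 0. 0. 0. 1. 0. 1. 0.]
print(np.dot(early, late))         # 0.0
```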
A New Architecture with Memory
Elman proposed a different way forward. Instead of representing time explicitly, he suggested representing it “implicitly by its effects on processing”. The solution was to give the network a form of memory.
The architecture of these simple recurrent networks, now often called “Elman networks,” is elegant. In addition to input, hidden, and output layers, the network includes a set of “context units.” After each time step, the activation pattern of the hidden layer is copied to these context units. These context units then feed back into the hidden layer at the next time step, along with the new external input.
This simple recurrent loop means the network’s internal state at any moment is a function of both the current input and its own previous state. This gives the network a dynamic memory, allowing it to process information in the context of what came before. Elman explains that “the internal representations which develop thus reflect task demands in the context of prior internal states”.
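As a rough illustration, here is a minimal numpy sketch of one such step. The layer sizes, weight names, and sigmoid activations are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HIDDEN, N_OUT = 31, 150, 31               # illustrative sizes only
W_in = rng.normal(0, 0.1, (N_HIDDEN, N_IN))       # input   -> hidden
W_ctx = rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN))  # context -> hidden
W_out = rng.normal(0, 0.1, (N_OUT, N_HIDDEN))     # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x_t, context):
    """One time step: the hidden layer sees the current input plus the
    context units, which hold a copy of the previous hidden state."""
    hidden = sigmoid(W_in @ x_t + W_ctx @ context)
    output = sigmoid(W_out @ hidden)
    return output, hidden        # the hidden pattern becomes the next context

# Process a sequence: the context starts at a neutral value and is
# overwritten with the hidden activations after every step.
context = np.full(N_HIDDEN, 0.5)
for x_t in rng.integers(0, 2, (5, N_IN)).astype(float):
    y_t, context = elman_step(x_t, context)
```

In the paper, training proceeds with ordinary backpropagation on the prediction error at each step, with the context units treated simply as additional inputs.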
From Simple Patterns to Complex Grammar
To test the power of simple recurrent networks, Elman designed a series of simulations where the network’s task was always the same: predict the next item in a sequence.
The experiments started with simple patterns, like a temporal version of the classic XOR problem, and moved on to more complex letter sequences. In one task, the network was fed a stream of “letters” (represented as vectors) that formed “syllables” according to specific rules, such as the consonant ‘d’ always being followed by two ‘i’ vowels. The network successfully learned these rules and could accurately predict the vowels following a given consonant. It even learned that after a sequence of vowels a consonant was likely to appear next, even though it could not know which one.
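A sketch of how such a letter stream could be generated and turned into prediction targets; the ‘d’ rule follows the description above, while the rules for ‘b’ and ‘g’ are illustrative stand-ins built on the same pattern.

```python
import random

# Each consonant is always followed by a fixed vowel pattern: 'd' -> 'ii'
# as described above; the 'b' and 'g' entries are stand-ins for this sketch.
RULES = {"b": "ba", "d": "dii", "g": "guuu"}

def make_stream(n_syllables, seed=0):
    """Concatenate randomly chosen syllables into one unbroken letter stream."""
    rng = random.Random(seed)
    return "".join(rng.choice(list(RULES.values())) for _ in range(n_syllables))

stream = make_stream(1000)

# The prediction task: given the letters up to position t, predict the
# letter at t + 1. After a consonant the continuation is fully determined
# ('d' must be followed by 'i', 'i'); after a syllable ends, only the
# class of the next letter (a consonant) is predictable.
pairs = [(stream[t], stream[t + 1]) for t in range(len(stream) - 1)]
print(stream[:20])
print(pairs[:5])
```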
The most compelling simulation involved discovering the structure of language. The network was trained on a continuous stream of thousands of words generated from simple sentence templates like “man eats cookie” or “dragon chases woman”. Crucially, the input vectors for each word were arbitrary and gave no clue about their meaning or grammatical category. The network’s only task was to predict the next word.
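The sketch below shows how such a training stream might be assembled; the templates and the tiny lexicon are placeholders rather than Elman’s actual grammar, and the one-hot encoding stands in for the arbitrary word vectors described above.

```python
import random

# Placeholder sentence templates and lexicon of the kind described above.
TEMPLATES = [
    ("NOUN_HUM", "VERB_EAT", "NOUN_FOOD"),
    ("NOUN_ANIM", "VERB_CHASE", "NOUN_HUM"),
]
LEXICON = {
    "NOUN_HUM": ["man", "woman"],
    "NOUN_ANIM": ["dragon", "cat"],
    "NOUN_FOOD": ["cookie", "sandwich"],
    "VERB_EAT": ["eats"],
    "VERB_CHASE": ["chases"],
}

def word_stream(n_sentences, seed=0):
    """Generate sentences from the templates and concatenate them into one
    unbroken word stream, with no sentence boundaries marked."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_sentences):
        template = rng.choice(TEMPLATES)
        words.extend(rng.choice(LEXICON[slot]) for slot in template)
    return words

stream = word_stream(10000)
vocab = sorted(set(stream))

# One-hot inputs: the index assigned to each word is arbitrary, so the
# vector itself carries no clue about meaning or grammatical category.
one_hot = {w: [1.0 if j == i else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}
print(stream[:9])
```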
Discovering Nouns and Verbs from Scratch
After training, an analysis of the network’s internal hidden unit patterns revealed something remarkable. The network had spontaneously organized the words based on their grammatical and semantic properties.
Without any explicit instruction, the network learned to group words into abstract categories. A hierarchical cluster analysis of the hidden unit activations showed a clean split between nouns and verbs. Within the noun category, further divisions emerged between animate and inanimate nouns. The animate nouns were even subdivided into humans and animals, while inanimate nouns were grouped into categories like “breakable” and “edible”.
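One way to reproduce this kind of analysis is sketched below: record the hidden-state vector produced each time a word is the current input, average those vectors per word, and cluster the averages hierarchically. The hidden_by_word dictionary here is filled with random placeholders standing in for states recorded from a trained network.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Placeholder: map each word to the hidden-state vectors recorded whenever
# that word was the network's input (random values so the sketch runs alone).
rng = np.random.default_rng(0)
words = ["man", "woman", "dragon", "cat", "cookie", "eats", "chases"]
hidden_by_word = {w: rng.normal(size=(20, 150)) for w in words}

# Average each word's hidden patterns over all the contexts it appeared in,
# then build a hierarchical clustering of the per-word averages.
means = np.stack([hidden_by_word[w].mean(axis=0) for w in words])
Z = linkage(means, method="average", metric="euclidean")
tree = dendrogram(Z, labels=words, no_plot=True)   # plot or inspect the tree
```

With hidden states from a trained network, the resulting tree is where splits such as noun versus verb, or animate versus inanimate, become visible.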
This demonstrated that the complex categorical structure of language is implicitly present in the sequential statistics of word order, and that simple recurrent networks are powerful enough to discover it. The network learned that words like “man” and “woman” behave similarly because they tend to appear in similar contexts and precede a similar range of verbs.
The Importance of Context
This approach also provides an elegant solution to the “type-token” distinction, a classic problem in representation. A “type” is an abstract category (the concept of a ‘boy’), while a “token” is a specific instance (‘boy’ in the sentence “the boy runs”).
Because the network’s internal state is always influenced by prior context, its representation for ‘boy’ was never exactly the same twice. The specific pattern of activation for ‘boy’ was subtly different depending on the sentence it appeared in. Yet, all these different “tokens” of ‘boy’ were more similar to each other than to any token of ‘girl’ or ‘cat’, thus forming a coherent “type”. This allows the representations to be both context-dependent and generalizable. As Elman notes, this method provides an “alternative account of how such distinctions can be made without indexing or binding”.
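A rough sketch of how this could be checked numerically is shown below; the recorded hidden states are random placeholders here, so the expected relationship would only emerge with states taken from a trained network.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two hidden-state vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder hidden states recorded when 'boy' or 'girl' was the current
# input, each in a different sentence context.
rng = np.random.default_rng(1)
boy_tokens = [rng.normal(size=150) for _ in range(3)]    # 'boy' in 3 contexts
girl_tokens = [rng.normal(size=150) for _ in range(3)]   # 'girl' in 3 contexts

# Tokens of 'boy' differ from one another because context leaves its mark,
# yet after training they should remain closer to each other than to any
# token of 'girl', which is what makes 'boy' a coherent type.
within = [cosine(a, b) for i, a in enumerate(boy_tokens)
          for b in boy_tokens[i + 1:]]
between = [cosine(a, b) for a in boy_tokens for b in girl_tokens]
print(np.mean(within), np.mean(between))
```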
The legacy of Elman’s work is profound. Simple recurrent networks laid the foundation for more advanced recurrent architectures like LSTMs and GRUs, which drove natural language processing for years, and the paper’s central task of predicting the next item in a sequence is the same objective that trains today’s large language models. The paper showed that with the right architecture, a learning system does not need to be given abstract rules. Instead, it can discover them on its own, finding deep structure hidden in the flow of time.