# The Substitution Cipher

Perhaps the oldest and one of the simplest method of encrypting a message is to use the substitution cipher. What this cipher does is, as its name suggests, to simply substitute each character in the message with the character it is mapped to. In this article, we will examine substitution ciphers specifically meant for encrypting English texts.

Firstly, we will create our very own substitution cipher: for each alphabet, we will take it’s position (‘A’ = 0, ‘B’ = 1, …), add $4$ to it (a randomly choosen number), and lookup the alphabet whose position is the result. Of course, if the result goes beyond the range, it will “wrap around” and continue from the start. For example, ‘A’ will map to ‘E’, ‘G’ will map to ‘K’ and ‘Y’ will map to ‘C’. This will define our encoding.

Mathematically, our encoding will simply be an addition in the ring of integers modulo 26 (because the alphabet has 26 letters), i.e.

$E(x) = x + 4 \mod 26$

Our decoding using this cipher would then be the reverse of this operation: a subtraction in the ring of integers modulo 26, i.e.

$D(x) = x - 4 \mod 26$

# Our Substituition Cipher in action

Now, for some action! We will use one of Aesop’s Fables – “The Ant and the Grasshopper” as our example. The fable goes like this:

In a field one summer’s day a Grasshopper was hopping about, chirping and singing to its heart’s content. An Ant passed by, bearing along with great toil an ear of corn he was taking to the nest.

Why not come and chat with me, said the Grasshopper, instead of toiling and moiling in that way?

I am helping to lay up food for the winter, said the Ant, and recommend you to do the same.

Why bother about winter? said the Grasshopper; we have got plenty of food at present.But the Ant went on its way and continued its toil.

When the winter came the Grasshopper found itself dying of hunger, while it saw the ants distributing, every day, corn and grain from the stores they had collected in the summer.

Then the Grasshopper knew : It is best to prepare for the days of necessity

To keep this article simple, I have choose to illustrate substitution ciphers for the letters of the English alphabet only. As such, we would have to omit everything else (spaces, fullstops, commas, etc.) other than the letters. After retaining only the letters, we have the following plaintext:

INAFIELDONESUMMERSDAYAGRASSHOPPERWASHOPPINGABOUTCHIRPINGANDSINGINGTOITSHEARTSCONTENTANANTPASSEDBYBEARINGALONGWITHGREATTOILANEAROFCORNHEWASTAKINGTOTHENESTWHYNOTCOMEANDCHATWITHMESAIDTHEGRASSHOPPERINSTEADOFTOILINGANDMOILINGINTHATWAYIAMHELPINGTOLAYUPFOODFORTHEWINTERSAIDTHEANTANDRECOMMENDYOUTODOTHESAMEWHYBOTHERABOUTWINTERSAIDTHEGRASSHOPPERWEHAVEGOTPLENTYOFFOODATPRESENTBUTTHEANTWENTONITSWAYANDCONTINUEDITSTOILWHENTHEWINTERCAMETHEGRASSHOPPERFOUNDITSELFDYINGOFHUNGERWHILEITSAWTHEANTSDISTRIBUTINGEVERYDAYCORNANDGRAINFROMTHESTORESTHEYHADCOLLECTEDINTHESUMMERTHENTHEGRASSHOPPERKNEWITISBESTTOPREPAREFORTHEDAYSOFNECESSITY

Now, we encrypt every letter in our plaintext using our encoding function:

$E(x) = x + 4 \mod 26$

With this, we get $E(\text{'I'}) = \text{'M'}$, $E(\text{'N'}) = \text{'R'}$, $E(\text{'A'}) = \text{'E'}$… and if we put together everything, we will obtain our ciphertext:

MREJMIPHSRIWYQQIVWHECEKVEWWLSTTIVAEWLSTTMRKEFSYXGLMVTMRKERHWMRKMRKXSMXWLIEVXWGSRXIRXERERXTEWWIHFCFIEVMRKEPSRKAMXLKVIEXXSMPERIEVSJGSVRLIAEWXEOMRKXSXLIRIWXALCRSXGSQIERHGLEXAMXLQIWEMHXLIKVEWWLSTTIVMRWXIEHSJXSMPMRKERHQSMPMRKMRXLEXAECMEQLIPTMRKXSPECYTJSSHJSVXLIAMRXIVWEMHXLIERXERHVIGSQQIRHCSYXSHSXLIWEQIALCFSXLIVEFSYXAMRXIVWEMHXLIKVEWWLSTTIVAILEZIKSXTPIRXCSJJSSHEXTVIWIRXFYXXLIERXAIRXSRMXWAECERHGSRXMRYIHMXWXSMPALIRXLIAMRXIVGEQIXLIKVEWWLSTTIVJSYRHMXWIPJHCMRKSJLYRKIVALMPIMXWEAXLIERXWHMWXVMFYXMRKIZIVCHECGSVRERHKVEMRJVSQXLIWXSVIWXLICLEHGSPPIGXIHMRXLIWYQQIVXLIRXLIKVEWWLSTTIVORIAMXMWFIWXXSTVITEVIJSVXLIHECWSJRIGIWWMXC

Similarly, we can decode this cipher text back into our plain text using our decoding function,

$D(x) = x - 4 \mod 26$

Till here, we see how it is possible to use a simple substitution cipher to encrypt and to decrypt a message. But what if we want to decrypt a message encrypted using such a simple substitution cipher, despite having no knowledge of how the cipher is defined? Is it possible?

# Frequency Analysis

Imagine having obtained an encrypted message:

MREJMIPHSRIWYQQIVWHECEKVEWWLSTTIVAEWLSTTMRKEFSYXGLMVTMRKERHWMRKMRKXSMXWLIEVXWGSRXIRXERERXTEWWIHFCFIEVMRKEPSRKAMXLKVIEXXSMPERIEVSJGSVRLIAEWXEOMRKXSXLIRIWXALCRSXGSQIERHGLEXAMXLQIWEMHXLIKVEWWLSTTIVMRWXIEHSJXSMPMRKERHQSMPMRKMRXLEXAECMEQLIPTMRKXSPECYTJSSHJSVXLIAMRXIVWEMHXLIERXERHVIGSQQIRHCSYXSHSXLIWEQIALCFSXLIVEFSYXAMRXIVWEMHXLIKVEWWLSTTIVAILEZIKSXTPIRXCSJJSSHEXTVIWIRXFYXXLIERXAIRXSRMXWAECERHGSRXMRYIHMXWXSMPALIRXLIAMRXIVGEQIXLIKVEWWLSTTIVJSYRHMXWIPJHCMRKSJLYRKIVALMPIMXWEAXLIERXWHMWXVMFYXMRKIZIVCHECGSVRERHKVEMRJVSQXLIWXSVIWXLICLEHGSPPIGXIHMRXLIWYQQIVXLIRXLIKVEWWLSTTIVORIAMXMWFIWXXSTVITEVIJSVXLIHECWSJRIGIWWMXC

Now, we assume a few things: we have no prior knowledge of the (unencrypted) message sent. We know that the message sent was written in English. We also know that it was encrypted using a simple substitution cipher. Is it then possible to decrypt this message? The answer is : probably yes, using frequency analysis.

Frequency analysis is the analysis of the frequency of each letter appearing in a piece of text. Below shows a frequency analysis on our cipher text, and separately on the letters in a typical English text. The data on the letter frequencies in a typical English text was taken from Robert Lewand’s Cryptological Mathematics.

TYPICAL ENGLISH TEXT    CIPHER TEXT
A : 0.08167              A : 0.02786
B : 0.01492              B : 0.00000
C : 0.02782              C : 0.02459
D : 0.04253              D : 0.00000
E : 0.12702              E : 0.08360
F : 0.02280              F : 0.01311
G : 0.02015              G : 0.01967
H : 0.06094              H : 0.04096
I : 0.06966              I : 0.10983
J : 0.01530              J : 0.02131
K : 0.07720              K : 0.03442
L : 0.04025              L : 0.06229
M : 0.02406              M : 0.07049
N : 0.06749              N : 0.00000
O : 0.07507              O : 0.00327
P : 0.01929              P : 0.02131
Q : 0.00950              Q : 0.02131
R : 0.05987              R : 0.08524
S : 0.06327              S : 0.07540
T : 0.09056              T : 0.03278
U : 0.02758              U : 0.00000
V : 0.00978              V : 0.05737
W : 0.02360              W : 0.06557
X : 0.00150              X : 0.10819
Y : 0.01974              Y : 0.01803
Z : 0.00074              Z : 0.00327

After scrutinizing the numbers, we notice something interesting : if we “shift” the entire column of the frequency of our cipher text up, we get the following table.

TYPICAL ENGLISH TEXT    CIPHER TEXT (shifted up by 4)
A : 0.08167              E : 0.08360
B : 0.01492              F : 0.01311
C : 0.02782              G : 0.01967
D : 0.04253              H : 0.04096
E : 0.12702              I : 0.10983
F : 0.02280              J : 0.02131
G : 0.02015              K : 0.03442
H : 0.06094              L : 0.06229
I : 0.06966              M : 0.07049
J : 0.01530              N : 0.00000
K : 0.07720              O : 0.00327
L : 0.04025              P : 0.02131
M : 0.02406              Q : 0.02131
N : 0.06749              R : 0.08524
O : 0.07507              S : 0.07540
P : 0.01929              T : 0.03278
Q : 0.00950              U : 0.00000
R : 0.05987              V : 0.05737
S : 0.06327              W : 0.06557
T : 0.09056              X : 0.10819
U : 0.02758              Y : 0.01803
V : 0.00978              Z : 0.00327
W : 0.02360              A : 0.02786
X : 0.00150              B : 0.00000
Y : 0.01974              C : 0.02459
Z : 0.00074              D : 0.00000

The relative frequencies for each column seems to match. Hey, pretty neat!

If a simple substitution cipher defines a one-to-one correspondence between the pre-encrypted letters and post-encrypted letters, then the frequency of the post-encrypted letters in the cipher text and pre-encrypted letters in the plain text must be equal. Could it be? Have we found the cipher’s mapping used for the encrypted text? We guess this mapping based on the table ($D(\text{'A'}) = \text{'W'}, D(\text{'B'}) = 'X', D(\text{'C'}) = \text{'Y'}, ...$) to attempt to map the encrypted message back into its plain text, and we get…

INAFIELDONESUMMERSDAYAGRASSHOPPERWASHOPPINGABOUTCHIRPINGANDSINGINGTOITSHEARTSCONTENTANANTPASSEDBYBEARINGALONGWITHGREATTOILANEAROFCORNHEWASTAKINGTOTHENESTWHYNOTCOMEANDCHATWITHMESAIDTHEGRASSHOPPERINSTEADOFTOILINGANDMOILINGINTHATWAYIAMHELPINGTOLAYUPFOODFORTHEWINTERSAIDTHEANTANDRECOMMENDYOUTODOTHESAMEWHYBOTHERABOUTWINTERSAIDTHEGRASSHOPPERWEHAVEGOTPLENTYOFFOODATPRESENTBUTTHEANTWENTONITSWAYANDCONTINUEDITSTOILWHENTHEWINTERCAMETHEGRASSHOPPERFOUNDITSELFDYINGOFHUNGERWHILEITSAWTHEANTSDISTRIBUTINGEVERYDAYCORNANDGRAINFROMTHESTORESTHEYHADCOLLECTEDINTHESUMMERTHENTHEGRASSHOPPERKNEWITISBESTTOPREPAREFORTHEDAYSOFNECESSITY

… which is more or less comprehensible (in English) if you go through it slowly, and not some gibberish message. Cool! It seems that we have successfully decrypted/cracked the encrypted message received. This entire process of attempting to decrypt a cipher text is what is know as a frequency analysis attack, because we reversed a cipher text by comparing the frequency analysis of the cipher text with the frequency analysis of a typical English text.

# Conclusion

In this article we described a possible way to crack encrypted messages by making use of frequency analysis. It is effective if the encryption used is constant, i.e. it always produces the same output for the same input. This was the case of the substitution cipher we used to illustrate a frequency analysis attack in our article.

Another implicit assumption we made was that the sample size (length of encrypted message) has to be large enough for a frequency analysis to be accurate. If the encrypted message is too short (e.g. < 100 words), then it would be difficult for an analysis attack to have any significant implications (unless of course, the possible number of inputs are small as well).

It should be noted that most modern encryption are well-defended against frequency analysis attacks. They are non-constant : they do not always produce the same input for a given input. In fact, most of them are a lot more complex and has more parameters as well. As such, a frequency analysis attack might be effective only on naive and poorly-protected encryption, but would generally fail for almost all well-protected encryption algorithms today, such as AES.