Project 3: A Magical Code

You run the spy agency for a country at war. You need to send messages behind enemy lines, and have several secret agents who are willing to take the risk to deliver those messages. Those agents could be stopped and searched at any time, making anything that looks like a code suspicious. For that reason, and to avoid being tracked, agents do not carry electronic devices.

The agency has come up with a cool idea: encode a message in a deck of 52 regular playing cards. To avoid suspicion, these decks would be carried by agents trained as magicians. Agents carry many decks, only one of which contains the message. The message is presumably encoded in the sequence of cards.

In the event that an agent is searched and asked about the cards, agents are trained to respond as follows: ``Officer, I'm a magician and these decks of cards are the tools of my trade. They cannot contain secret information. Look, I'll take a card from the top of the deck and insert it randomly back into the deck. You can even choose where to insert it. I'll do that several more times for you on each of the decks. Surely that convinces you that any hypothetical message in the cards is now scrambled, and that therefore these cards are innocuous. By the way, I think there's a rabbit in your hat...'' It is possible that an agent is stopped more than once on their journey, in which case the procedure is applied each time in the same way.

What is not entirely obvious here is that even a fair number of top-to-random insertions don't really scramble the sequence enough to prevent information transfer. An analysis of top-to-random shuffling shows that even after many card insertions, the entropy remaining in the deck is high. You can use that entropy to encode your message.

Your job is to design a scheme for encoding short messages into card sequences. A message is a sequence of English words or character strings that may not be words (e.g., names, numbers) separated by spaces and other regular punctuation. We'll discuss in class a subset of compact messages that avoids the use of stop words, capitalization and punctuation to reduce the number of bits needed to get the message across. You will need to write functions encode() and decode() to convert between messages and card sequences and back. We'll let s() be a randomized function that moves the top card in the deck to a random position. The encode() and decode() functions are decided before the agent goes into the field, so they cannot depend on the actual messages or the true number of top-to-random insertions. The true number of top-to-random insertions encountered during transit is unknown to both the encoder and decoder of the message.

The basic requirements are:

For any small message m, m=decode(encode(m)).
For any small message m and sufficiently small n, m=decode(s(encode(m))).
For any random card permutation p, Pr(decode(p)=NULL)1. In other words, your decode function is smart enough to notice that a deck of cards does not contain a message, and says so. Because there is a tiny chance that a random deck matches an encoded message, we can't say with certainty that the decode() function always returns NULL on random decks.

The more advanced goals are:

Given a message size, determine a scheme for which m=decode(s(encode(m))) for as large an n as possible, for messages m having that message size.
Given an a priori bound on n, determine a scheme for which m=decode(s(encode(m))) for as long a message m as possible.
If during decoding you recover a partial message, then output that partial message rather than NULL. The simulator will allow you to annotate your return value as a partial message.

We'll provide a simulator for you to interact with. You supply the simulator with the encode() and decode() functions, and the simulator will test them using messages from a user-specified file. The randomness induced by the top-to-random insertions will be determined using a random number seed, so that runs are repeatable, and so that different groups can be run with the same deck permutations. Details of the simulator will be provided later. At the end of the project we'll run tournaments to test the capabilities of your schemes under various conditions.

We'll also talk in class about how to measure the success of your schemes. For example, given a list of messages, the simulator can run a large number of trials for each message with various values of n, to see how often your program succeeds for each n. The simulator can also interleave these requests with random deck permutations to see how often your decode function correctly identifies the deck as not containing a message. We might also bound n by some value, and the simulator can try longer and longer messages to see how far your scheme can go. What is a good way to score a partially decoded message? Another dimension is the domain of messages. Simple messages are just sequences of one or more English words, e.g., ``activate device''. More complex messages might include dates, times, latitudes, longitudes, names, numbers, etc. Knowing the message domain might allow you choose a suitable encoding.

Some things to think about:

What is a good measure of message size? Even without limiting to compact messages as described above, what might you do to minimize the number of bits needed to represent a given message?
There is some interesting mathematics underlying this problem, as described in the paper by Stark and the references therein. While understanding the math is not a pre-requisite for this project, those of you who are mathematically inclined might be able to gain some insight into the structure of the problem by reading these papers.

Ken Ross 2022-09-12