Project 3: Mutation

DNA sequences can mutate in response to certain environmental exposures. For example, UV-light is a common mutagen of skin cells, and it has a very particular pattern of mutations. UV mutation (unlike most other mutagens) often gives rise to a double-substitution of TT for CC within a DNA sequence. A variety of mutational signatures have been observed in human cells. Given a menu of mutagens and the spectrum of mutations that each generates, it is sometimes possible to trace the origins of mutations in cancer, e.g., to attribute a skin cancer to UV exposure.

We'll assume an idealized organism containing a circular genome of 1000 bases, where each base is either a, c, g, or t. (We'll use lower case, because it's easier to see the difference between c and g.)

A mutagen acts on a genome by choosing a target location at random, and then scanning the genome from that point on for the first instance of a pattern. Scanning always goes from left to right, and the genome wraps at the right end. The pattern is a local template that is evaluated against a segment of up to 10 contiguous bases. For example, the pattern could be ac;actg;g which would match any 3-base sequence that starts with a or c, and ends with g.

Each mutagen has an action that can mutate up to 10 bases simultaneously, starting at the leftmost base in the match. Deletions or insertions are not allowed. For example, the action might be "a11" which says that base at offset 0 is changed to a, and the base at offset 2 is changed to be identical to the original base at offset 1. The intervening base at offset 1 is unchanged. The combination of the pattern ac;actg;g and action a11 would lead to the following possible changes with one application of the rule:

acgggggg to accggggg; acgactgg to acgaattg; gagtcgtg has no matches

A mutagen is a collection of pattern-action pairs, and the complexity of the mutagen is the number of such pairs. Your goal is to learn the pattern-action pairs as quickly as possible, with or without information about the complexity of the mutagen.

To learn about mutagens, you will write a program that (a) designs experiments and (b) infers the mutational signatures from the results of those experiments. The sequence of interactions with the simulator is as follows:

After some (large) number of failed guesses, the simulator will time-out. A that point, it will compute the similarity of the actual mutagen and the last guess as follows:

For complex mutagens that are hard to guess, this Jaccard score will at least give some way to measure/rank how close you got.

To satisfy the simulator your guess has to be exact. If the rule is ``PATTERN=ac, ACTION=g'' then the guess ``PATTERN=a, ACTION=g'' is implied by the rule, but is not an exact match. When a mutagen is defined by more than one rule, you need to guess the set of rules exactly right. Some sets of rules are degenerate, in that they are equivalent to a set of rules of smaller cardinality, or to rules with more specific patterns:

The mutagen representation given to the simulator (both the simulated mutagen itself and the guessed rules) should be non-degenerate. (Is there an algorithm for finding an equivalent nondegenerate set for any set of input rules? Is there always a unique nondegenerate representation? What about for specific subclasses, e.g., rule sets of cardinality 1, rule sets with provably disjoint action patterns, etc.?)

One mutation in a sequence may enable a subsequent mutation that would not have been legal on the original sequence. (Can you think of an example?) You won't see the intermediate because only the final mutated sequence is output at the end of the experiment. Also, some non-trivial rules may lead to non-observable effects on some sequences (e.g., what would be such a sequence for { PATTERN=cg, ACTION=c0 } ?).

Prior to the first class for this project, groups will be asked to come up with some interesting mutagens to use for in-class simulations and for testing your players.

For the tournaments we'll run situations with varying m and different kinds of mutagens, with various levels of complexity. For the tournament, the instructor and TAs will choose some mutagens you haven't seen before, as well as some discussed in class. Groups will also each submit one mutagen (unseen by other groups) for the tournament. The goal for submitted tournament mutagens is that they be easy enough to be solved by some groups, but difficult enough that they are not solved by all groups.

Some initial things to think about:

Ken Ross 2019-10-21