Project 3: Mutation

DNA sequences can mutate in response to certain environmental exposures. For example, UV light is a common mutagen of skin cells, and it produces a very particular pattern of mutations. Unlike most other mutagens, UV often gives rise to a double substitution in which CC is replaced by TT within a DNA sequence. A variety of mutational signatures have been observed in human cells. Given a menu of mutagens and the spectrum of mutations that each generates, it is sometimes possible to trace the origins of mutations in cancer, e.g., to attribute a skin cancer to UV exposure.

We'll assume an idealized organism containing a circular genome of 1000 bases, where each base is either a, c, g, or t. (We'll use lower case, because it's easier to see the difference between c and g.)

A mutagen acts on a genome by choosing a target location at random, and then scanning the genome from that point on for the first instance of a pattern. Scanning always goes from left to right, and the genome wraps at the right end. The pattern is a local template that is evaluated against a segment of up to 10 contiguous bases. For example, the pattern could be ac;actg;g, which would match any 3-base sequence that starts with a or c and ends with g. (Semicolons separate positions; the letters listed at each position are the alternatives allowed there.)

Each mutagen has an action that can mutate up to 10 bases simultaneously, starting at the leftmost base in the match. Deletions and insertions are not allowed. For example, the action might be "a11", which says that the base at offset 0 is changed to a, and the base at offset 2 is changed to be identical to the original base at offset 1. The intervening base at offset 1 is unchanged, since the action copies it onto itself. The combination of the pattern ac;actg;g and action a11 would lead to the following possible changes with one application of the rule:

acgggggg to accggggg; acgactgg to acgaattg; gagtcgtg has no matches
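
The matching and action semantics described above can be sketched in a few lines of Python. This is only an illustration, not the simulator's code: the function names (parse_pattern, matches_at, apply_action, mutate_once) are hypothetical, and the 8-base strings from the example are treated as tiny circular genomes.

```python
def parse_pattern(pattern):
    """Split 'ac;actg;g' into one set of allowed bases per offset."""
    return [set(part) for part in pattern.split(";")]

def matches_at(genome, pos, pattern):
    """True if the pattern matches the circular genome starting at pos."""
    n = len(genome)
    return all(genome[(pos + i) % n] in allowed
               for i, allowed in enumerate(parse_pattern(pattern)))

def apply_action(genome, pos, action):
    """Apply an action such as 'a11' at pos.

    A letter writes that base; a digit d copies the ORIGINAL base at offset d.
    """
    n = len(genome)
    out = list(genome)
    for i, ch in enumerate(action):
        new_base = genome[(pos + int(ch)) % n] if ch.isdigit() else ch
        out[(pos + i) % n] = new_base
    return "".join(out)

def mutate_once(genome, start, pattern, action):
    """Scan left to right (with wraparound) from start; mutate at the first match."""
    n = len(genome)
    for step in range(n):
        pos = (start + step) % n
        if matches_at(genome, pos, pattern):
            return apply_action(genome, pos, action)
    return genome  # no match anywhere: the genome is unchanged

print(mutate_once("acgggggg", 0, "ac;actg;g", "a11"))  # accggggg
print(mutate_once("acgactgg", 1, "ac;actg;g", "a11"))  # acgaattg
print(mutate_once("gagtcgtg", 0, "ac;actg;g", "a11"))  # gagtcgtg
```

Which change you observe depends on the random starting point: scanning acgactgg from position 0 would instead hit the match at acg and produce accactgg.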

A mutagen is a collection of pattern-action pairs, and the complexity of the mutagen is the number of such pairs. Your goal is to learn the pattern-action pairs as quickly as possible, with or without information about the complexity of the mutagen.

To learn about mutagens, you will write a program that (a) designs experiments and (b) infers the mutational signatures from the results of those experiments. The sequence of interactions with the simulator is as follows:

You design an experimental organism, i.e., a DNA sequence of 1000 bases.
The simulator executes the mutational effect of the (unknown) mutagen on your sequence. The number of (attempted) mutations per experiment is m, where m>0 is a parameter. If a mutagen has multiple rules, the simulator will choose one of the rules at random for each mutation, with equal probability. The simulator returns just the final result of the sequence of mutations. Each of the m mutations occurs at an independently generated random site in the genome, and not in a sequential fashion along the genome. If a rule generates an action that does not change the base sequence at the given position in the genome, then this rule does not count towards the m mutations. You know m at the beginning.
Based on the results of the simulation so far, your player makes a guess as to the most likely rule-set it thinks generated the observed mutations.
If the rule set guess is correct (and complete) the simulator stops and your score is the number of experiments you needed, smaller being better.
If the guess is incorrect, the simulator will repeat the process by asking for a new experimental organism and looping once more.
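
For local testing, one experiment might be mimicked roughly as follows. This is a sketch under stated assumptions, not the actual simulator: run_experiment and its helpers are hypothetical names, rules are (pattern, action) string pairs, a non-changing attempt is assumed to be retried, and the max_attempts cap is an invented safeguard so that a rule which can never change the genome does not loop forever.

```python
import random

def first_match(genome, start, pattern):
    """Position of the first match when scanning circularly from start, or None."""
    n = len(genome)
    parts = pattern.split(";")
    for step in range(n):
        pos = (start + step) % n
        if all(genome[(pos + i) % n] in part for i, part in enumerate(parts)):
            return pos
    return None

def apply_action(genome, pos, action):
    """Letters write a base; a digit d copies the original base at offset d."""
    n = len(genome)
    out = list(genome)
    for i, ch in enumerate(action):
        out[(pos + i) % n] = genome[(pos + int(ch)) % n] if ch.isdigit() else ch
    return "".join(out)

def run_experiment(genome, rules, m, rng, max_attempts=None):
    """Apply m effective mutations; attempts that change nothing are not counted."""
    if max_attempts is None:
        max_attempts = 100 * m  # assumed cap, not part of the project description
    done = 0
    for _ in range(max_attempts):
        if done == m:
            break
        pattern, action = rng.choice(rules)          # each rule equally likely
        pos = first_match(genome, rng.randrange(len(genome)), pattern)
        if pos is None:
            continue                                 # pattern occurs nowhere
        mutated = apply_action(genome, pos, action)
        if mutated != genome:
            genome = mutated
            done += 1
    return genome                                    # only the final sequence is visible

rng = random.Random(1)
genome = "".join(rng.choice("acgt") for _ in range(1000))
result = run_experiment(genome, [("ac;actg;g", "a11")], m=10, rng=rng)
print(sum(x != y for x, y in zip(genome, result)), "bases differ")
```

Because later mutations act on the already mutated sequence, the number of differing bases need not be a simple multiple of m.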

After some (large) number of failed guesses, the simulator will time out. At that point, it will compute the similarity of the actual mutagen and the last guess as follows:

A random organism of size 1,000,000 is generated, and the positions of all possible base changes according to the mutagen are computed, together with the base generated for that position, without actually mutating the sequence (so sequential mutations are not generated, just single mutations). Call this set of positions/bases A.
The positions/bases of all possible base changes according to the final guess are computed. Call this set B.
The Jaccard similarity of A and B represents how close the guess was to the actual mutagen.
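
The similarity computation above could be approximated like this. It is a minimal sketch with hypothetical names (changes, jaccard); it assumes the random organism is circular like the experimental genome, uses a shorter organism than 1,000,000 bases to keep the run quick, and treats two empty sets as identical, none of which is confirmed by the description.

```python
import random

def changes(genome, rules):
    """Set of (position, new_base) pairs produced by single rule applications."""
    n = len(genome)
    found = set()
    for pattern, action in rules:
        parts = pattern.split(";")
        for pos in range(n):
            if all(genome[(pos + i) % n] in part for i, part in enumerate(parts)):
                for i, ch in enumerate(action):
                    new = genome[(pos + int(ch)) % n] if ch.isdigit() else ch
                    target = (pos + i) % n
                    if new != genome[target]:        # only real base changes count
                        found.add((target, new))
    return found

def jaccard(a, b):
    """|A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # assumption: two empty change sets are treated as identical
    return len(a & b) / len(a | b)

rng = random.Random(0)
genome = "".join(rng.choice("acgt") for _ in range(10_000))
truth = [("ac", "g")]   # actual mutagen: a or c becomes g
guess = [("a", "g")]    # incomplete guess: only a becomes g
print(jaccard(changes(genome, truth), changes(genome, guess)))
```

Here the guess captures only the changes at positions holding a, so on a uniformly random organism the score should land near one half.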

For complex mutagens that are hard to guess, this Jaccard score will at least give some way to measure/rank how close you got.

To satisfy the simulator your guess has to be exact. If the rule is "PATTERN=ac, ACTION=g" then the guess "PATTERN=a, ACTION=g" is implied by the rule, but is not an exact match. When a mutagen is defined by more than one rule, you need to guess the set of rules exactly right. Some sets of rules are degenerate, in that they are equivalent to a set of rules of smaller cardinality, or to rules with more specific patterns:

{ PATTERN=a, ACTION=c; PATTERN=g, ACTION=c } is equivalent to { PATTERN=ag, ACTION=c }. (Remember, "ag" in a pattern means disjunction, not concatenation.)
{ PATTERN=c, ACTION=c } is equivalent to the empty set.
{ PATTERN=[anything], ACTION=0 } is equivalent to the empty set. (The leftmost position is at offset 0, not 1.)
{ PATTERN=cg, ACTION=c } is equivalent to { PATTERN=g, ACTION=c }.
{ PATTERN=cg, ACTION=c0 } is not equivalent to { PATTERN=g, ACTION=c0 }. (ca → cc)
{ PATTERN=g;ac, ACTION=c1 } is not equivalent to { PATTERN=g, ACTION=c }. (gt → ct)
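
One way to check claims like these is brute force: enumerate every possible window of bases and compare what each rule set can turn it into. The sketch below uses hypothetical names (span, outcomes, equivalent) and one particular notion of equivalence, namely identical sets of reachable outcomes with no-ops ignored. It disregards how often each outcome occurs, and it is not necessarily the test the simulator applies.

```python
from itertools import product

def span(rules):
    """Widest window touched by any pattern, action, or digit reference."""
    width = 1
    for pattern, action in rules:
        digits = [int(ch) + 1 for ch in action if ch.isdigit()]
        width = max(width, len(pattern.split(";")), len(action), *digits)
    return width

def outcomes(window, rules):
    """Windows reachable by applying any rule at offset 0, excluding no-ops."""
    results = set()
    for pattern, action in rules:
        parts = pattern.split(";")
        if all(window[i] in part for i, part in enumerate(parts)):
            out = list(window)
            for i, ch in enumerate(action):
                out[i] = window[int(ch)] if ch.isdigit() else ch
            if "".join(out) != window:
                results.add("".join(out))
    return results

def equivalent(rules_a, rules_b):
    """True if both rule sets produce the same outcomes on every window."""
    width = max(span(rules_a), span(rules_b))
    windows = ("".join(p) for p in product("acgt", repeat=width))
    return all(outcomes(w, rules_a) == outcomes(w, rules_b) for w in windows)

print(equivalent([("a", "c"), ("g", "c")], [("ag", "c")]))  # True
print(equivalent([("c", "c")], []))                         # True
print(equivalent([("acgt", "0")], []))                      # True
print(equivalent([("cg", "c")], [("g", "c")]))              # True
print(equivalent([("cg", "c0")], [("g", "c0")]))            # False
print(equivalent([("g;ac", "c1")], [("g", "c")]))           # False
```

Since rules span at most 10 bases, exhaustive enumeration is cheap for short rules but grows as 4 to the power of the window width, so it is best suited to small test cases.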

The mutagen representation given to the simulator (both the simulated mutagen itself and the guessed rules) should be non-degenerate. (Is there an algorithm for finding an equivalent non-degenerate set for any set of input rules? Is there always a unique non-degenerate representation? What about for specific subclasses, e.g., rule sets of cardinality 1, rule sets with provably disjoint action patterns, etc.?)

One mutation in a sequence may enable a subsequent mutation that would not have been legal on the original sequence. (Can you think of an example?) You won't see the intermediate sequences, because only the final mutated sequence is output at the end of the experiment. Also, some non-trivial rules may lead to non-observable effects on some sequences (e.g., what would be such a sequence for { PATTERN=cg, ACTION=c0 } ?).

Prior to the first class for this project, groups will be asked to come up with some interesting mutagens to use for in-class simulations and for testing your players.

For the tournaments we'll run situations with varying m and different kinds of mutagens, with various levels of complexity. For the tournament, the instructor and TAs will choose some mutagens you haven't seen before, as well as some discussed in class. Groups will also each submit one mutagen (unseen by other groups) for the tournament. The goal for submitted tournament mutagens is that they be easy enough to be solved by some groups, but difficult enough that they are not solved by all groups.

Some initial things to think about:

How many measurements are needed before you can be confident about a mutagen? What features of a rule (or rule set) make many measurements necessary to identify it/them correctly?
If your initial guesses fail, how would you design the next experiment to focus on specific sequences in a way that leverages the earlier generated data?

Ken Ross 2019-10-21