New Heuristics and exact algorithms for the planted DNA (l, d)-Motif finding problem

Julieta Q. Nabos

Abstract

This study explored some approaches that can improve algorithms for finding (l,d)-motifs in DNA sequences. Two new exact algorithms, EMS-PG and EMS-GT that employ pattern-based approach and instance-based approach, respectively, were developed. Number mapping for l-mer and XOR-based Hamming distance computation employing bitwise operation were introduced to enhance more the efficiency of EMS-GT. Moreover, heuristics called GS-SANS and GS-SANS-RP were also developed to improve searching from combinatorial solutions. The new heuristic and algorithms were tested on both synthetic data sets and real data sets. The run times of the new algorithms and accuracies of the new heuristics were compared to some state of the art algorithms. The pattern growth approach of the EMS-PG was able to prune the large solution space, while a combination of selection by hash-based elimination and selection by probing techniques provided EMS-GT the means to surpass the efficiency of some state-of-the art exact algorithms. On the other hand, GS-SANS-RP outperformed some existing Random Projection enhanced algorithms in some performance metrics in (l,d)-motif instances tested. Finally, it was observed that the proposed exact algorithms were able to determine motifs involving short l-mers. However, it takes a significant amount of time in finding long l-mers. Thus, heuristics may be used to search for commonly occurring long patterns. However, there is no guarantee that the best scoring pattern (motif) is always returned.