Optimization - Fast(er) algorithm for the length of the Longest Common Subsequence (LCS)
The problem: I need the length of the LCS between two strings. The strings are at most 100 characters long, over the usual DNA alphabet of 4 characters, "acgt". The plain dynamic programming approach is not quick enough.

My problem is that I'm dealing with lots and lots of pairs (on the order of hundreds of millions, as far as I can see). I believe I have already decreased the number of calls to the lcs_length function to the minimum possible, so the only other way to make the program run faster is to have a more efficient lcs_length function.

I started off by implementing the usual dynamic programming approach. It gives the correct answer, and I hope I have implemented it properly:
```c
#define ARRAYLENGTHMACRO(a) (strlen(a) + 1)
#define MAX_STRING 101

static int maxlength(int lengtha, int lengthb);

/*
 * The 2 strings are compared following the dynamic programming
 * LCS-table algorithm. Since I only require the length of the LCS,
 * I can do it rather easily.
 */
int lcs_length(char *a, char *b)
{
    int lengtha = ARRAYLENGTHMACRO(a), lengthb = ARRAYLENGTHMACRO(b),
        lcs = 0, i, j, max_len, board[MAX_STRING][MAX_STRING];

    max_len = maxlength(lengtha, lengthb);
    //printf("%d %d\n", lengtha, lengthb);
    for (i = 0; i < max_len - 1; i++) {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengtha; i++) {
        for (j = 1; j < lengthb; j++) {
            /* If a match is found, put the number at (i-1, j-1),
             * incremented by 1, into position (i, j). */
            if (a[i - 1] == b[j - 1]) {
                board[i][j] = board[i - 1][j - 1] + 1;
                if (lcs < board[i][j]) {
                    lcs++;
                }
            } else {
                if (board[i - 1][j] > board[i][j - 1]) {
                    board[i][j] = board[i - 1][j];
                } else {
                    board[i][j] = board[i][j - 1];
                }
            }
        }
    }
    return lcs;
}
```
That should be O(mn) (hopefully).
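Since only the length is needed, the full MAX_STRING × MAX_STRING table is not actually required: two rolling rows suffice, which cuts memory from O(mn) to O(n) and tends to be kinder to the cache. A sketch of that variant (same O(mn) time; function name mine):

```c
#include <string.h>

/* LCS length with two rolling rows instead of the full table.
   Assumes both strings are at most 100 characters long. */
int lcs_length_rows(const char *a, const char *b)
{
    int m = (int)strlen(a), n = (int)strlen(b), i, j;
    int prev[101] = {0}, curr[101] = {0};

    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            if (a[i - 1] == b[j - 1])
                curr[j] = prev[j - 1] + 1;
            else
                curr[j] = prev[j] > curr[j - 1] ? prev[j] : curr[j - 1];
        }
        memcpy(prev, curr, sizeof(prev));   /* current row becomes previous */
    }
    return prev[n];
}
```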
Then, in search of speed, I found a post on the longest common subsequence that pointed to the O(ND) paper by Myers. I tried it: the relevant relation connects the LCS to the shortest edit script (SES). The relation is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings, and L is the LCS length. But this isn't coming out correct in my implementation. I give my implementation here (which I think is correct):
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRAYLENGTHMACRO(a) strlen(a)

int lcs_length (char *a, char *b);
int minlength (int a, int b);
int max (int a, int b);
int snake (int k, int max, char *a, char *b, int lengtha, int lengthb);

int main(void)
{
    int l;
    char a[] = "tomato", b[] = "potato"; // must give lcs = 4
    l = lcs_length(a, b);
    printf("lcs: %d\n", l);
    char c[] = "gtcgttcggaatgccgttgctctgtaaa",
         d[] = "accggtcgagtgcgcggaagccggccgaa"; // must give lcs = 20
    l = lcs_length(c, d);
    printf("lcs: %d\n", l);
    char e[] = "human", f[] = "chimpanzee"; // must give lcs = 4
    l = lcs_length(e, f);
    printf("lcs: %d\n", l);
    char g[] = "howareyou", h[] = "whoareyou"; // lcs = 8
    l = lcs_length(g, h);
    printf("lcs: %d\n", l);
    char i[] = "ttctttcggtaacgcctactttatgaagaggttacattgcaatcgggtaaattaaccaacaagtaatggtagttcctagtagagaaaccctcccgctcac",
         j[] = "gcacgcgcctgttgctacgctctgtccatcctttgtgtgccgggtactcagaccggtaactcgagttgctatcgcggttatcaggatcatttacggactc"; // 61
    l = lcs_length(i, j);
    printf("lcs: %d\n", l);
    return 0;
}

int lcs_length(char *a, char *b)
{
    int d, lengtha = ARRAYLENGTHMACRO(a), lengthb = ARRAYLENGTHMACRO(b),
        max, *v_, *v, i, k, x, y;

    max = lengtha + lengthb;
    v_ = malloc(sizeof(int) * (max + 1));
    if (v_ == NULL) {
        fprintf(stderr, "Failed to allocate memory for LCS\n");
        exit(1);
    }
    v = v_ + lengtha;
    v[1] = 0;

    for (d = 0; d < max; d++) {
        for (k = -d; k <= d; k = k + 2) {
            if ((k == -d && v[k - 1] < v[k + 1]) ||
                (k != d && v[k - 1] < v[k + 1])) {
                x = v[k + 1];
            } else {
                x = v[k - 1] + 1;
            }
            y = x - k;
            while ((x < lengthb) && (y < lengtha) && (a[x + 1] == b[y + 1])) {
                x++;
                y++;
            }
            v[k] = x;
            if ((x >= lengthb) && (y >= lengtha)) {
                return (lengtha + lengthb - d) / 2;
            }
        }
    }
    return (lengtha + lengthb - d) / 2;
}
```
There are examples in main. E.g. "tomato" and "potato" (don't comment), which have an LCS length of 4. My implementation finds 5, since the SES (d in the code) comes out as 2 instead of the 4 I'd expect (delete "t", insert "p", delete "m", insert "t"), and (6 + 6 - 2) / 2 = 5. I'm inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I'm not sure about this.
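If the indexing is the culprit rather than the algorithm itself, a corrected sketch of the greedy O(ND) routine would look like the following (function name mine). The differences from the version above: the V array gets 2*max+1 slots and is centered at max rather than lengtha, so v[k] is valid for all k in [-d, d]; the k == -d case takes v[k+1] unconditionally instead of comparing against an uninitialized v[k-1]; and the snake comparison is 0-indexed, a[x] == b[y], not a[x+1] == b[y+1].

```c
#include <stdlib.h>
#include <string.h>

/* Greedy O(ND) LCS length following Myers's shortest-edit-script
   algorithm, using the relation d = n + m - 2 * LCS. */
int lcs_length_ond(const char *a, const char *b)
{
    int n = (int)strlen(a), m = (int)strlen(b);
    int max = n + m, d, k, x, y, lcs;
    int *v_, *v;

    if (max == 0)
        return 0;
    v_ = malloc(sizeof(int) * (2 * max + 1));
    if (v_ == NULL)
        return -1;
    v = v_ + max;               /* center: v[k] valid for k in [-max, max] */

    v[1] = 0;
    for (d = 0; d <= max; d++) {
        for (k = -d; k <= d; k += 2) {
            if (k == -d || (k != d && v[k - 1] < v[k + 1]))
                x = v[k + 1];   /* extend downward (insertion) */
            else
                x = v[k - 1] + 1;   /* extend rightward (deletion) */
            y = x - k;
            while (x < n && y < m && a[x] == b[y]) {   /* follow the snake */
                x++;
                y++;
            }
            v[k] = x;
            if (x >= n && y >= m) {
                lcs = (n + m - d) / 2;   /* d = n + m - 2 * LCS */
                free(v_);
                return lcs;
            }
        }
    }
    free(v_);
    return 0; /* not reached: d = n + m always terminates the search */
}
```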
Any implementable approach is welcome (I don't have immense programming skills), especially if you know how to take advantage of the small alphabet, for example.
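On the small alphabet: for strings of at most 100 characters there is the bit-parallel LCS recurrence of Allison and Dix (later analysed by Hyyrö), which processes an entire row of the DP table with a handful of word operations. A sketch, assuming GCC/Clang's unsigned __int128 extension so that one word covers up to 128 characters (function name mine):

```c
#include <string.h>

typedef unsigned __int128 u128;

/* Bit-parallel LCS length (Allison-Dix recurrence).  Assumes the
   GCC/Clang unsigned __int128 extension and strlen(a) <= 128. */
int lcs_length_bp(const char *a, const char *b)
{
    int m = (int)strlen(a), n = (int)strlen(b), i, j, lcs = 0;
    u128 match[256] = {0}, v = ~(u128)0, u;

    /* match[c] has bit i set iff a[i] == c */
    for (i = 0; i < m; i++)
        match[(unsigned char)a[i]] |= (u128)1 << i;

    /* one constant number of word operations per character of b */
    for (j = 0; j < n; j++) {
        u = v & match[(unsigned char)b[j]];
        v = (v + u) | (v - u);
    }

    /* LCS length = number of zero bits among the first m bits of v */
    for (i = 0; i < m; i++)
        if (!((v >> i) & 1))
            lcs++;
    return lcs;
}
```

With a 4-letter alphabet, the four match vectors could also be precomputed once per string and cached, so repeated comparisons only pay for the inner loop.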
EDIT: I think it would be useful to say, on top of everything else, that I use the LCS function between two strings at a time. It's not one string s1 compared to a few million others; it might be s200 with s1000, then s0 with s10000, then s250 with s100000. I'm not able to track which strings were used, either. I require the exact LCS length, not an approximation, since I'm already implementing an approximation algorithm and don't want to add extra error.

EDIT: I ran callgrind. For an input of 10,000 strings I seem to call the lcs function about 50,000,000 times, for different pairs of strings (which matches all pairs: 10000 × 9999 / 2 ≈ 5 × 10^7). 10,000 strings is the lowest I want to run, and the max is 1 million (if feasible).
There are several ways to make the computation faster:

- Instead of plain dynamic programming, you can use A* search (with the heuristic that a partial alignment at (i, j) must have at least |i - j| deletions or insertions in it).
- If you're comparing one sequence to a whole bunch of others, you can save time by computing the dynamic programming table (or the A* search state) for the part covering a shared prefix and re-using that part of the computation. If you stick with the dynamic programming table, sort the library of strings into lexicographic order and throw away only the part that changes (e.g. if you have 'banana' and want to compare it to 'panama' and 'panamericanism', you can reuse the part of the table that covers 'panam').
- If most of the strings are quite similar, you can save time by finding the common prefix and excluding it from the computation (e.g. the LCS of "panama" and "panamericanism" is the common prefix "panam" plus the LCS of "a" and "ericanism").
- If the strings are quite dissimilar, you can use symbol counts to get a lower bound on the number of edits (e.g., "aaab" to "abbb" needs at least 2 edits because there are 3 a's in one and only 1 in the other). This can also be used as a heuristic in the A* search.
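The prefix/suffix and symbol-count points above can be sketched as follows (function names mine; lcs_core is a minimal stand-in for whatever routine does the real work on explicit (pointer, length) slices). Note that via d = m + n - 2l from the question, an upper bound on the LCS is the same information as a lower bound on the number of edits.

```c
#include <string.h>

/* Minimal O(mn) core; assumes slices of at most 100 characters. */
static int lcs_core(const char *a, int m, const char *b, int n)
{
    int prev[101] = {0}, curr[101] = {0}, i, j;

    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++)
            curr[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1
                    : (prev[j] > curr[j - 1] ? prev[j] : curr[j - 1]);
        memcpy(prev, curr, sizeof(prev));
    }
    return prev[n];
}

/* Strip a shared prefix and suffix before the expensive part:
   LCS(xs, xt) = |x| + LCS(s, t), and symmetrically for a suffix. */
int lcs_length_trimmed(const char *a, const char *b)
{
    int m = (int)strlen(a), n = (int)strlen(b), pre = 0, suf = 0;

    while (pre < m && pre < n && a[pre] == b[pre])
        pre++;
    while (m - suf > pre && n - suf > pre && a[m - 1 - suf] == b[n - 1 - suf])
        suf++;
    return pre + suf +
           lcs_core(a + pre, m - pre - suf, b + pre, n - pre - suf);
}

/* Upper bound on the LCS from symbol counts alone: the LCS cannot use a
   symbol more often than it occurs in either string.  With a 4-letter
   alphabet this is cheap, and dissimilar pairs can be rejected without
   running the full algorithm. */
int lcs_upper_bound(const char *a, const char *b)
{
    int ca[256] = {0}, cb[256] = {0}, i, bound = 0;

    for (i = 0; a[i]; i++) ca[(unsigned char)a[i]]++;
    for (i = 0; b[i]; i++) cb[(unsigned char)b[i]]++;
    for (i = 0; i < 256; i++)
        bound += ca[i] < cb[i] ? ca[i] : cb[i];
    return bound;
}
```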
EDIT: For the comparing-to-the-same-set-of-strings case, one person suggested the BK-tree data structure in "Efficient way of calculating likeness scores of strings when sample size is large?".