# Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings

### Morihiro Hayashida, Hitoshi Koyano

#### Abstract

We address the problems of finding median and center strings for a probability distribution on a set of strings under the Levenshtein distance, which are known to be NP-hard in a special case. These problems have applications in various research fields, for instance finding functional motifs in protein amino acid sequences and recognizing shapes and characters in image processing. In this paper, we propose novel integer linear programming-based methods for solving these problems. Furthermore, by restricting several variables in the formulation to a region near the diagonal, we also propose integer linear programming-based methods for finding approximate median and center strings. To evaluate the proposed methods, we perform several computational experiments and show that the restricted formulation reduces the execution time.
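To make the two objectives concrete, here is a minimal brute-force sketch in Python. It is not the integer linear programming formulation of the paper. It assumes that a median string minimizes the expected Levenshtein distance under the distribution and that a center string minimizes the maximum distance to any string with positive probability; these definitions, the function names, and the `band` parameter are illustrative assumptions. The optional band fills only dynamic-programming cells near the diagonal, which is loosely analogous to the restricted formulation, and yields an upper bound on the distance.

```python
from itertools import product


def levenshtein(s, t, band=None):
    """Levenshtein distance via the Wagner-Fischer dynamic program.

    If band is given, only cells with |i - j| <= band are filled, so the
    result is an upper bound on the true distance (possibly infinity).
    """
    inf = float("inf")
    m, n = len(s), len(t)
    prev = [j if band is None or j <= band else inf for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [inf] * (n + 1)
        if band is None or i <= band:
            cur[0] = i
        for j in range(1, n + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell lies outside the band around the diagonal
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]


def candidates(alphabet, max_len):
    """Enumerate every string over alphabet with length at most max_len."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def median_and_center(dist, alphabet, max_len, band=None):
    """Exhaustive search; dist maps each string to its probability.

    Returns ((expected_distance, median), (max_distance, center)).
    Assumed objectives: expected distance for the median, maximum distance
    over the support for the center.
    """
    support = [(s, p) for s, p in dist.items() if p > 0]
    best_median = best_center = None
    for c in candidates(alphabet, max_len):
        d = [(levenshtein(c, s, band), p) for s, p in support]
        expected = sum(p * x for x, p in d)
        radius = max(x for x, _ in d)
        if best_median is None or expected < best_median[0]:
            best_median = (expected, c)
        if best_center is None or radius < best_center[0]:
            best_center = (radius, c)
    return best_median, best_center


if __name__ == "__main__":
    # Hypothetical toy distribution on three strings.
    dist = {"ab": 0.5, "abc": 0.25, "b": 0.25}
    median, center = median_and_center(dist, "abc", max_len=3)
    print(median)  # (0.5, 'ab')
    print(center)  # (1, 'ab')
```

The search space grows exponentially with `max_len`, which is why the sketch is only useful for tiny instances and why exact methods such as integer linear programming are of interest.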

#### Paper Citation

#### in Harvard Style

Hayashida M. and Koyano H. (2016). **Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings**. In *Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)*, ISBN 978-989-758-170-0, pages 35-41. DOI: 10.5220/0005666400350041

#### in Bibtex Style

@conference{bioinformatics16,

author={Morihiro Hayashida and Hitoshi Koyano},

title={Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings},

booktitle={Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)},

year={2016},

pages={35-41},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0005666400350041},

isbn={978-989-758-170-0},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)

TI - Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings

SN - 978-989-758-170-0

AU - Hayashida M.

AU - Koyano H.

PY - 2016

SP - 35

EP - 41

DO - 10.5220/0005666400350041