Algorithms in Bioinformatics

by Martin Frith, Christian NÝrgaard Storm Pedersen

Algorithms in Bioinformatics This book constitutes the refereed proceedings of the 16th International Workshop on Algorithms in Bioinformatics WABI 2016 held in Aarhus Denmark The 25 full papers together with 2 invited talks presented were carefully reviewed and selected from 54 submissions The selected papers cover a wide range of topics from networks tophylogenetic studies sequence and genome analysis comparative genomics and mass spectrometry data analysis

Publisher : Springer International Publishing

Author : Martin Frith, Christian NÝrgaard Storm Pedersen (eds.)

ISBN : 9783319436807

Year : 2016

Language: en

File Size : 13.47 MB

Category : Computers Technology

LNBI 9838

Martin Frith
Christian N√łrgaard Storm Pedersen (Eds.)

Algorithms
in Bioinformatics
16th International Workshop, WABI 2016
Aarhus, Denmark, August 22‚Äď24, 2016
Proceedings

123

Lecture Notes in Bioinformatics

9838

Subseries of Lecture Notes in Computer Science

LNBI Series Editors
Sorin Istrail
Brown University, Providence, RI, USA
Pavel Pevzner
University of California, San Diego, CA, USA
Michael Waterman
University of Southern California, Los Angeles, CA, USA

LNBI Editorial Board
S√łren Brunak
Technical University of Denmark, Kongens Lyngby, Denmark
Mikhail S. Gelfand
IITP, Research and Training Center on Bioinformatics, Moscow, Russia
Thomas Lengauer
Max Planck Institute for Informatics, Saarbr√ľcken, Germany
Satoru Miyano
University of Tokyo, Tokyo, Japan
Eugene Myers
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden,
Germany
Marie-France Sagot
Université Lyon 1, Villeurbanne, France
David Sankoff
University of Ottawa, Ottawa, Canada
Ron Shamir
Tel Aviv University, Ramat Aviv, Tel Aviv, Israel
Terry Speed
Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany
W. Eric Wong
University of Texas at Dallas, Richardson, TX, USA

More information about this series at http://www.springer.com/series/5381

Martin Frith
Christian N√łrgaard Storm Pedersen (Eds.)

Algorithms
in Bioinformatics
16th International Workshop, WABI 2016
Aarhus, Denmark, August 22‚Äď24, 2016
Proceedings

123

Editors
Martin Frith
AIST and University of Tokyo
Tokyo
Japan

ISSN 0302-9743
Lecture Notes in Bioinformatics
ISBN 978-3-319-43680-7
DOI 10.1007/978-3-319-43681-4

Christian N√łrgaard Storm Pedersen
Aarhus University
Aarhus
Denmark

ISSN 1611-3349

(electronic)

ISBN 978-3-319-43681-4

(eBook)

Library of Congress Control Number: 2016945963
LNCS Sublibrary: SL8 ‚Äď Bioinformatics
© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, speciÔ¨Ācally the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microÔ¨Ālms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a speciÔ¨Āc statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland

Preface

This proceedings volume contains papers presented at the 16th Workshop on Algorithms in Bioinformatics (WABI 2016) that was held at Aarhus University, Aarhus,
Denmark, August 22‚Äď24, 2016. WABI 2016 was one of eight conferences that were
organized as part of ALGO 2016. The Workshop on Algorithms in Bioinformatics was
established in 2001, and is an annual conference on all aspects of algorithmic work in
bioinformatics, computational biology, and systems biology. The emphasis is on discrete algorithms and machine-learning methods that address important problems in
molecular biology, that are founded on sound models, that are computationally efÔ¨Ācient, and that have been implemented and tested in simulations and on real datasets.
The goal is to present recent research results, including signiÔ¨Ācant work-in-progress,
and to identify and explore directions of future research. WABI 2016 was sponsored by
the European Association for Theoretical Computer Science (EATCS).
In 2016, a total of 56 manuscripts were submitted to WABI from which 27 were
selected for presentation at the conference. Among them, 25 are included in this
proceedings volume as full papers presenting novel results not previously published in
journals, and two are included as short abstracts of papers that are in the process of
being published simultaneously in journals. The 27 papers were selected based on
thorough reviewing, usually involving three independent reviewers per submitted
paper, followed by discussions in the WABI Program Committee. The selected papers
cover a wide range of topics from networks, to phylogenetic studies, sequence and
genome analysis, comparative genomics, and mass spectrometry data analysis.
Extended versions of selected papers will be published in a thematic series in the
journal Algorithms for Molecular Biology (AMB), published by BioMed Central.
We thank all the authors of submitted papers and the members of the WABI
Program Committee and their reviewers for their efforts that made this conference
possible, and the WABI Steering Committee for their help and advice. We also thank
all the conference participants and speakers. In particular, we are indebted to the
keynote speaker of the conference, Kiyoshi Asai, for his presentation. Finally, we thank
Gerth St√łlting Brodal and the local ALGO Organizing Committee for their hard work.
June 2016

Martin Frith
Christian N. S. Pedersen

Organization

Program Chairs
Martin Frith
Christian N.S. Pedersen

AIST and University of Tokyo, Japan
Aarhus University, Denmark

Program Committee
Tatsuya Akutsu
Timothy L. Bailey
Jan Baumbach
Anne Bergeron
Paola Bonizzoni
Alessandra Carbone
Rita Casadio
Nadia El-Mabrouk
Anna Gambin
Raffaele Giancarlo
Michiaki Hamada
Thomas Hamelryck
Fereydoun Hormozdiari
Katharina Huber
Carl Kingsford
Hisanori Kiryu
Gregory Kucherov
Timo Lassmann
Ming Li
Zsuzsanna Liptak
Stefano Lonardi
Gerton Lunter
Thomas Mailund
Paul Medvedev
Daniel Merkle
István Miklós
Bernard Moret
Burkhard Morgenstern
Vincent Moulton
Veli Mäkinen
Luay Nakhleh
William Noble

Kyoto University, Japan
University of Queensland, Australia
University of Southern Denmark, Denmark
Université du Québec à Montréal, Canada
Università di Milano-Bicocca, Italy
Université Pierre et Marie Curie, France
UNIBO, Italy
University of Montreal, Canada
Warsaw University, Poland
Università di Palermo, Italy
Waseda University, Japan
University of Copenhagen, Denmark
University of Washington, USA
University of East Anglia, UK
Carnegie Mellon University, USA
University of Tokyo, Japan
CNRS/LIGM, France
Telethon Kids, Australia
University of Waterloo, Canada
University of Verona, Italy
University of California at Riverside, USA
University of Oxford, UK
Aarhus University, Denmark
Pennsylvania State University, USA
University of Southern Denmark, Denmark
Rényi Institute, Hungary
EPFL, Switzerland
University of Göttingen, Germany
University of East Anglia, UK
University of Helsinki, Finland
Rice University, USA
University of Washington, USA

VIII

Organization

Nadia Pisanti
Mihai Pop
Teresa Przytycka
Sven Rahmann
Marie-France Sagot
Kengo Sato
Michael Schatz
Russell Schwartz
Kana Shimizu
Anish Man Singh
Shrestha
Peter F. Stadler
Jens Stoye
Krister Swenson
Hélène Touzet
Lusheng Wang
Siu Ming Yiu
Louxin Zhang
Michal Ziv-Ukelson

Università di Pisa, Italy, and Inria, France
University of Maryland, USA
NIH, USA
University of Duisburg-Essen, Germany
Inria, France
Keio University, Japan
Cold Spring Harbor Laboratory, USA
Carnegie Mellon University, USA
Waseda University, Japan
University of Tokyo, Japan
University of Leipzig, Germany
Bielefeld University, Germany
CNRS, Université de Montpellier, France
CNRS, University of Lille and Inria, France
City University of Hong Kong, China
University of Hong Kong, SAR China
National University of Singapore, Singapore
Ben-Gurion University of the Negev, Israel

WABI Steering Committee
Bernard Moret
Vincent Moulton
Jens Stoye
Tandy Warnow

EPFL, Switzerland
University of East Anglia, UK
Bielefeld University, Germany
University of Illinois at Urbana-Champaign, USA

ALGO Organizing Committee
Gerth St√łlting Brodal (Chair)
Marianne Dammand Iversen

Trine Ji Holmgaard
Katrine √ėsterlund Rasmussen

Additional Reviewers
Nicolas Alcaraz
Hind Alhakami
Eloi Araujo
Matthias Bernt
Karel Brinda
Laurent Bulteau
Victoria Cepeda
Daniel Cooke
Phuong Dao
Gianluca Della Vedova
Alex Di Genova

Pietro Di Lena
Daniel Doerr
Norbert Dojer
Mikhail Dubov
Oliver Eulenstein
Pedro Feijao
Jay Ghurye
Krzysztof Gogolewski
Roberto Grossi
Laurent Gueguen
Marc Hellmuth

Organization

Donna Henderson
Farhad Hormozdiari
Jeff Howbert
Alex Hu
Laurent Jacob
Mateusz Krzysztof ŇĀńÖcki
Manuel Lafond
Cong Ma
Guillaume Marcais
Damon May
Blazej Miasojedow
Ibrahim Numanagic
Rachid Ounit
Solon Pissis

Raffaella Rizzi
Abbas Roayaei Ardakany
Giovanna Rosone
Yutaka Saito
Guillaume Scholz
Marcel Schulz
Celine Scornavacca
Mingfu Shao
Grzegorz SkoraczyŇĄski
Bianca Stöcker
Peng Sun
Hao Wang
Lusheng Wang
Martin Weigt

IX

Abstracts

Mass Graphs and Their Applications
in Top-Down Proteomics
Qiang Kou1, Si Wu2, Nikola Tolińá3, Yunlong Liu4,5,
Ljiljana PaŇ°a-Tolińá3 and Xiaowen Liu1,5(&)
1

Department of BioHealth Informatics,
Indiana University-Purdue University Indianapolis, Indianapolis, Indiana
2
Department of Chemistry and Biochemistry,
University of Oklahoma, Norman, Oklahoma
3
Biological Science Division, PaciÔ¨Āc Northwest National Laboratory,
Richland, USA
4
Department of Medical and Molecular Genetics,
Indiana University School of Medicine, Indianapolis, Indiana
5
Center for Computational Biology and Bioinformatics,
Indiana University School of Medicine, Indianapolis, Indiana
[email protected]
Abstract. Although proteomics has rapidly developed in the past decade,
researchers are still in the early stage of exploring the world of complex proteoforms, which are protein products with various primary structure alterations
resulting from gene mutations, alternative splicing, post-translational modiÔ¨Ācations, and other biological processes. Proteoform identiÔ¨Ācation is essential to
mapping proteoforms to their biological functions as well as discovering novel
proteoforms and new protein functions. Top-down mass spectrometry is the
method of choice for identifying complex proteoforms because it provides a
‚Äúbird's eye view‚ÄĚ of intact proteoforms. Fragment ion series in top-down tandem
mass spectra provide essential information for identifying primary sequence
alterations in proteoforms. Extended proteoform databases and spectral alignment are the two main approaches for proteoform identiÔ¨Ācation. However, due
to the combinatorial explosion of various alterations on a protein and the limitations of available spectral alignment algorithms, the proteoform identiÔ¨Ācation
problem has still not been fully solved.

We propose a new data structure, called the mass graph, for efÔ¨Ācient representation of
proteoforms of a protein with variable post-translational modiÔ¨Ācations and/or terminal
truncations. The proteoform identiÔ¨Ācation problem is transformed to the mass graph
alignment problem, and dynamic programming algorithms are proposed for a restricted
version of the problem. The proposed algorithms were tested on two top-down tandem
mass spectrometry data sets. Experimental results showed that the proposed algorithms
were efÔ¨Ācient in identifying proteoforms with variable post-translational modiÔ¨Ācations
and outperformed MS-Align-E in running time and sensitivity for identifying complex
proteoforms, especially those with terminal truncations.
Acknowledgement. The research was supported by the National Institute of General Medical
Sciences, National Institutes of Health (NIH) through Grant R01GM118470.

Safely Filling Gaps with Partial Solutions
Common to all Solutions

Leena Salmela and Alexandru I. Tomescu
Department of Computer Science, Helsinki Institute for Information
Technology HIIT, University of Helsinki, Helsinki, 00014, Finland
{leena.salmela,tomescu}@cs.helsinki.Ô¨Ā
Abstract. Gap Ô¨Ālling has emerged as a natural sub-problem of many de novo
genome assembly projects (e.g., Ô¨Ālling gaps in scaffolds). Several methods have
addressed it, but only few have focused on strategies for dealing with multiple
gap Ô¨Ālling solutions and for guaranteeing reliable results. Such strategies include
reporting only unique solutions, or exhaustively enumerating all Ô¨Ālling solutions
and heuristically creating their consensus.

The gap Ô¨Ālling problem is usually formulated as Ô¨Ānding an s-t path in the assembly
graph, whose length matches the gap length estimate. In this paper we address it with
the ‚Äúsafe and complete‚ÄĚ framework proposed in [Tomescu and Medvedev, RECOMB
2016] for the contig assembly problem. In terms of gap Ô¨Ālling, a safe solution is a path
of the assembly graph that is a sub-path of all possible s-t paths whose length matches
the gap length estimate.
We give an efÔ¨Ācient safe algorithm for the gap Ô¨Ālling problem, running in time O
(dm), where d is the gap length estimate and m is the number of edges of the assembly
graph. To transform the safe paths into a single Ô¨Ālling sequence usable in downstream
analysis, we Ô¨Āll the gap with an arbitrary Ô¨Ālling path, in which we mark the safe
subsequences. Experiments on the GAGE bacterial datasets show that our method
retrieves over 90 % more safe and correct bases as compared to previous methods
differentiating between ambiguous and unambiguous positions, with a precision similar
to the one of previous methods.
We implemented this method as version 2.0 of our gap Ô¨Āller of scaffolds, Gap2Seq,
available at www.cs.helsinki.Ô¨Ā/u/lmsalmel/Gap2Seq/.

Fig. 1. A de Bruijn graph (k = 31) built on S.aureus data. We represent unary paths by
numbers indicating their length. The estimated gap length is d = 3774, and there are
9216 different s-t paths of length d. The safe sub-paths (in black) have length 3337 and
the precision of our method on these sub-paths is 99.9 %. Notice that most of the
bubbles of this graph are caused by SNPs.

Contents

Optimal Computation of Avoided Words . . . . . . . . . . . . . . . . . . . . . . . . . .
Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao,
Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis,
and Dimitris Polychronopoulos
A Biclique Approach to Reference Anchored Gene Blocks
and Its Applications to Pathogenicity Islands . . . . . . . . . . . . . . . . . . . . . . .
Arnon Benshahar, Vered Chalifa-Caspi, Danny Hermelin,
and Michal Ziv-Ukelson
An Efficient Branch and Cut Algorithm to Find Frequently Mutated
Subnetworks in Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Anna Bomersbach, Marco Chiarandini, and Fabio Vandin
Isometric Gene Tree Reconciliation Revisited . . . . . . . . . . . . . . . . . . . . . . .
BroŇąa Brejov√°, Askar Gafurov, Dana Pardubsk√°, Michal Sabo,
and Tom√°Ň° VinaŇô

1

14

27
40

Further Improvement in Approximating the Maximum Duo-Preservation
String Mapping Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Brian Brubach

52

SpecTrees: An Efficient Without a Priori Data Structure for MS/MS Spectra
Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Matthieu David, Guillaume Fertin, and Dominique Tessier

65

Predicting Core Columns of Protein Multiple Sequence Alignments
for Improved Parameter Advising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Dan DeBlasio and John Kececioglu

77

Fast Compatibility Testing for Phylogenies with Nested Taxa . . . . . . . . . . . .
Yun Deng and David Fern√°ndez-Baca

90

The Gene Family-Free Median of Three . . . . . . . . . . . . . . . . . . . . . . . . . .
Daniel Doerr, Pedro Feij√£o, Metin Balaban, and Cedric Chauve

102

Correction of Weighted Orthology and Paralogy Relations - Complexity
and Algorithmic Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Riccardo Dondi, Nadia El-Mabrouk, and Manuel Lafond

121

XVI

Contents

Copy-Number Evolution Problems: Complexity and Algorithms . . . . . . . . . .
Mohammed El-Kebir, Benjamin J. Raphael, Ron Shamir, Roded Sharan,
Simone Zaccaria, Meirav Zehavi, and Ron Zeira

137

Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support . . . .
Marius Erbert, Steffen Rechner, and Matthias M√ľller-Hannemann

150

Genome Rearrangements on Both Gene Order and Intergenic Regions. . . . . .
Guillaume Fertin, Géraldine Jean, and Eric Tannier

162

Better Identification of Repeats in Metagenomic Scaffolding . . . . . . . . . . . .
Jay Ghurye and Mihai Pop

174

A Better Scoring Model for De Novo Peptide Sequencing:
The Symmetric Difference Between Explained and Measured Masses . . . . . .
Ludovic Gillet, Simon Rösch, Thomas Tschager, and Peter Widmayer

185

StreAM-Tg : Algorithms for Analyzing Coarse Grained RNA Dynamics
Based on Markov Models of Connectivity-Graphs. . . . . . . . . . . . . . . . . . . .
Sven Jager, Benjamin Schiller, Thorsten Strufe, and Kay Hamacher

197

Solving Generalized Maximum-Weight Connected Subgraph Problem
for Network Enrichment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Alexander A. Loboda, Maxim N. Artyomov, and Alexey A. Sergushichev

210

A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform
to Enable Mapping and Genome Inference . . . . . . . . . . . . . . . . . . . . . . . . .
Sorina Maciuca, Carlos del Ojo Elias, Gil McVean, and Zamin Iqbal

222

Inferring Population Genetic Parameters: Particle Filtering, HMM, Ripley’s
K-Function or Runs of Homozygosity? . . . . . . . . . . . . . . . . . . . . . . . . . . .
Svend V. Nielsen, Simon Simonsen, and Asger Hobolth

234

A Graph Extension of the Positional Burrows-Wheeler Transform and Its
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Adam M. Novak, Erik Garrison, and Benedict Paten

246

Compact Universal k-mer Hitting Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir,
and Carl Kingsford

257

A New Approximation Algorithm for Unsigned Translocation Sorting . . . . . .
Lianrong Pu, Daming Zhu, and Haitao Jiang

269

Independent Component Analysis to Remove Batch Effects from Merged
Microarray Datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Emilie Renard, Samuel Branders, and P.-A. Absil

281

Contents

A Linear Time Approximation Algorithm for the DCJ Distance
for Genomes with Bounded Number of Duplicates . . . . . . . . . . . . . . . . . . .
Diego P. Rubert, Pedro Feijão, Marília D.V. Braga, Jens Stoye,
and F√°bio V. Martinez

XVII

293

A Hybrid Parameter Estimation Algorithm for Beta Mixtures
and Applications to Methylation State Classification . . . . . . . . . . . . . . . . . .
Christopher Schröder and Sven Rahmann

307

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

321

© 2018-2019 uberlabel.com. All rights reserved