[Figure 7 plot omitted. Y-axis: seconds. X-axis: Shared Input File Size (Mb), with values 2048, 1024 and 512. Series: Classic List Scheduling (NFS); MWS DA (Local Disk); MWF DA (Local Disk + RamDisk).]
Figure 7: Makespan of 50 Workflows on 128 cores and different shared input file sizes.
The experiment gives us grounds to conclude that our algorithm performs better for workflows that share data files at different levels of the memory hierarchy. Without using local storage, our gain is about 11% for 50 workflows running at the same time on an 8-core cluster and almost 20% on 128 cores, as we can see in Figure 6.
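To make the arithmetic behind such percentages explicit, the following is a minimal sketch that assumes the gain is the relative reduction in makespan with respect to the baseline scheduler. The function name `makespan_gain` and the sample makespans are hypothetical and are not measurements from the experiments.

```python
def makespan_gain(baseline_seconds: float, improved_seconds: float) -> float:
    """Relative makespan reduction, as a percentage of the baseline makespan."""
    if baseline_seconds <= 0:
        raise ValueError("baseline makespan must be positive")
    return 100.0 * (baseline_seconds - improved_seconds) / baseline_seconds


# Hypothetical makespans in seconds, chosen only to illustrate the formula.
print(round(makespan_gain(1000.0, 890.0), 1))  # 11.0
```

Under this assumed definition, a positive value means the data-aware schedule finished earlier than the baseline, and a negative value means it finished later.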
5 CONCLUSIONS AND OPEN LINES
We have studied the state of the art of schedulers for multiworkflows and their taxonomies, and then focused our work on data-aware policies for clusters. We concentrated our efforts on studying disk I/O bottlenecks in clusters. We characterized bioinformatics applications, some of which use the same data files as input. Techniques such as shared input files are desirable because they prevent multiple reads of the same file and improve the I/O performance of the system.
We have considered a list of options for data replacement policies in ramdisk or local disk. To further increase the efficiency of the policies, we should consider a better technique for predicting how many nodes, processors and cores are needed.
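As an illustration of what one such replacement policy could look like, the sketch below models a size-bounded storage tier (ramdisk or local disk) with least-recently-used (LRU) eviction. LRU is only an assumed example; the text does not state which policies were evaluated. The class `LruFileCache`, the file names and the sizes are all hypothetical.

```python
from collections import OrderedDict


class LruFileCache:
    """Size-bounded LRU cache modelling a ramdisk or local-disk tier (hypothetical)."""

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        self.files = OrderedDict()  # file name -> size in MB, least recent first
        self.shared_reads = 0       # reads that had to go to shared storage

    def access(self, name: str, size_mb: int) -> bool:
        """Return True on a cache hit, False if the file came from shared storage."""
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return True
        self.shared_reads += 1
        if size_mb > self.capacity_mb:
            return False  # too large for this tier, so it is never cached
        # Evict least recently used files until the new file fits.
        while self.used_mb + size_mb > self.capacity_mb:
            _, evicted_size = self.files.popitem(last=False)
            self.used_mb -= evicted_size
        self.files[name] = size_mb
        self.used_mb += size_mb
        return False


# Hypothetical access trace: three tasks share genome.fa as input.
cache = LruFileCache(capacity_mb=2048)
trace = [("genome.fa", 1024), ("reads_a.fq", 512), ("genome.fa", 1024),
         ("reads_b.fq", 1024), ("genome.fa", 1024)]
hits = sum(cache.access(name, size) for name, size in trace)
print(hits, cache.shared_reads)  # 2 3
```

In this trace the shared file stays in the tier because each reuse refreshes its position, so only the first access to it reaches shared storage. A different policy could make a different eviction choice when `reads_b.fq` arrives.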
Looking forward, this scheduler is ready to be integrated into a real scientific workflow manager such as Galaxy (Goecks et al., 2010), a web-based workflow manager widely used in the bioinformatics community.
ACKNOWLEDGEMENT
This work has been supported by project number TIN2014-53234-C2-1-R of the Spanish Ministerio de Ciencia y Tecnología (MICINN). This work is co-funded by the EGI-Engage project (Horizon 2020) under Grant number 654142.
REFERENCES
Afrati, F., Papadimitriou, C. H., and Papageorgiou, G. (1988). Scheduling dags to minimize time and communication. In VLSI Algorithms and Architectures, pages 134–138. Springer.
Ananthanarayanan, G., Ghodsi, A., Wang, A., Borthakur, D., Kandula, S., Shenker, S., and Stoica, I. (2012). Pacman: coordinated memory caching for parallel jobs. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 20–20. USENIX Association.
Barbosa, J. and Monteiro, A. P. (2008). A list scheduling algorithm for scheduling multi-user jobs on clusters. In High Performance Computing for Computational Science-VECPAR 2008, pages 123–136. Springer.
Bittencourt, L. F. and Madeira, E. R. (2010). Towards the scheduling of multiple workflows on computational grids. Journal of Grid Computing, 8(3):419–441.
Bolze, R., Desprez, F., and Insard, B. Evaluation of online multi-workflow heuristics based on list scheduling methods. Technical report, Gwendia ANR-06-MDCA-009.
Cerezo, N., Montagnat, J., and Blay-Fornarino, M. (2013). Computer-assisted scientific workflow design. Journal of Grid Computing, 11(3):585–612.
Costa, L. B., Yang, H., Vairavanathan, E., Barros, A., Maheshwari, K., Fedak, G., Katz, D., Wilde, M., Ripeanu, M., and Al-Kiswany, S. (2015). The case for workflow-aware storage: An opportunity study. Journal of Grid Computing, 13(1):95–113.
Goecks, J., Nekrutenko, A., Taylor, J., et al. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8):R86.
Gu, Y. and Wu, Q. (2010). Optimizing distributed computing workflows in heterogeneous network environments. In Distributed Computing and Networking, pages 142–154. Springer.
Hönig, U. and Schiffmann, W. (2006). A meta-algorithm for scheduling multiple dags in homogeneous system environments. In Proceedings of the eighteenth IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS’06).
Ilavarasan, E. and Thambidurai, P. (2007). Low complexity performance effective task scheduling algorithm for heterogeneous computing environments. Journal of Computer sciences, 3(2):94–103.
Kwok, Y.-K. and Ahmad, I. (1999). Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys (CSUR), 31(4):406–471.
Mandal, A., Kennedy, K., Koelbel, C., Marin, G., Mellor-Crummey, J., Liu, B., and Johnsson, L. (2005). Scheduling strategies for mapping application workflows onto the grid. In High Performance Distributed Computing, 2005. HPDC-14. Proceedings. 14th IEEE International Symposium on, pages 125–134. IEEE.
Meswani, M. R., Laurenzano, M. A., Carrington, L., and Snavely, A. (2010). Modeling and predicting disk i/o
A Data-Aware MultiWorkflow Cluster Scheduler