Improving Pipelining Tools for Pre-processing Data
- María Novo-Lourés 1
- Yeray Lage 1
- Reyes Pavón 1
- Rosalía Laza 1
- David Ruano-Ordás 1
- José Ramón Méndez 1
- 1 Universidade de Vigo
ISSN: 1989-1660
Year of publication: 2022
Volume: 7
Issue: 4
Pages: 214-224
Type: Article
More publications in: IJIMAI
Abstract
The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Currently, the use of pipelining schemes is the most reliable way of analysing data and, as a result, several important companies now offer this kind of service. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, early error detection techniques and developer support mechanisms are very limited in these frameworks. In this context, this study introduces different improvements, such as the design of different types of constraints for the early detection of errors, the creation of functions to facilitate debugging of concrete tasks included in a pipeline, the invalidation of erroneous instances and/or the introduction of the burst-processing scheme. By adding these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features.
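The improvements listed in the abstract can be illustrated with a minimal Java sketch. It is not the BDP4J API: every class and method name below (PipelineSketch, Instance, Task, checkTypes, runBurst) is hypothetical and chosen only for illustration. The sketch assumes that a type constraint means each task declares the data type it consumes and produces, that an erroneous instance is invalidated by a flag rather than by an exception, and that burst processing means a whole batch passes through one task before the next task starts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; all names are hypothetical and not taken from BDP4J.
class PipelineSketch {

    /** A unit of data flowing through the pipeline; it can be marked invalid. */
    static final class Instance {
        Object data;
        boolean valid = true;
        Instance(Object data) { this.data = data; }
    }

    /** A task declares the type it consumes and the type it produces. */
    interface Task {
        Class<?> inputType();
        Class<?> outputType();
        void process(Instance instance);
    }

    /** Trims text and invalidates instances that are blank after trimming. */
    static final class TrimTask implements Task {
        public Class<?> inputType() { return String.class; }
        public Class<?> outputType() { return String.class; }
        public void process(Instance i) {
            String s = ((String) i.data).trim();
            if (s.isEmpty()) { i.valid = false; } else { i.data = s; }
        }
    }

    /** Replaces the text of an instance with its length. */
    static final class LengthTask implements Task {
        public Class<?> inputType() { return String.class; }
        public Class<?> outputType() { return Integer.class; }
        public void process(Instance i) { i.data = ((String) i.data).length(); }
    }

    /** Early error detection: adjacent tasks must agree on types before anything runs. */
    static void checkTypes(List<Task> tasks) {
        for (int k = 1; k < tasks.size(); k++) {
            Class<?> produced = tasks.get(k - 1).outputType();
            Class<?> expected = tasks.get(k).inputType();
            if (!expected.isAssignableFrom(produced)) {
                throw new IllegalStateException("Task " + k + " expects "
                        + expected.getSimpleName() + " but receives " + produced.getSimpleName());
            }
        }
    }

    /** Burst processing (as assumed here): the whole batch finishes one task before the next. */
    static List<Instance> runBurst(List<Task> tasks, List<Instance> batch) {
        checkTypes(tasks);
        for (Task task : tasks) {
            for (Instance i : batch) {
                if (i.valid) { task.process(i); } // invalidated instances are skipped
            }
        }
        List<Instance> survivors = new ArrayList<>();
        for (Instance i : batch) {
            if (i.valid) { survivors.add(i); }
        }
        return survivors;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(new TrimTask(), new LengthTask());
        List<Instance> batch = Arrays.asList(
                new Instance("  hello "), new Instance("   "), new Instance("ok"));
        for (Instance i : runBurst(tasks, batch)) {
            System.out.println(i.data);
        }
    }
}
```

Because the type check runs before any instance is processed, a pipeline assembled in the wrong order (for example, LengthTask followed by TrimTask) fails immediately instead of failing partway through a long run.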
Bibliographic References
- I. M. Dunham, “Big Data: A Revolution That Will Transform How We Live, Work, and Think”, The AAG Review of Books, vol. 3, no. 1, pp. 19–21, Jan. 2015.
- Q. Qi, F. Tao, “Digital Twin and Big Data Towards Smart Manufacturing and Industry 4.0: 360 Degree Comparison”, IEEE Access, vol. 6, pp. 3585–3593, 2018.
- V. Kalavri, V. Vlassov, “MapReduce: Limitations, Optimizations and Open Issues,” in 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 2013, pp. 1031–1038.
- D. Miner, A. Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly & Associates Inc, 2012.
- Apache Software Foundation, “Apache Hadoop.” 2018.
- Amazon, “Amazon Elastic MapReduce.” 2019.
- Disco Project, “DisCo MapReduce.” 2014.
- S. Papadimitriou, J. Sun, “DisCo: Distributed Co-Clustering with MapReduce: A Case Study towards Petabyte-Scale End-to-End Mining,” in 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 512–521.
- Apache Software Foundation, “Apache Spark - Unified Analytics Engine for Big Data.” 2018.
- J. Zeng, B. Plale, “Data Pipeline in MapReduce,” in 2013 IEEE 9th International Conference on e-Science, 2013, pp. 164–171.
- P. O’Donovan, K. Leahy, K. Bruton, D. T. J. O’Sullivan, “An Industrial Big Data Pipeline for Data-Driven Analytics Maintenance Applications in Large-Scale Smart Manufacturing Facilities”, Journal of Big Data, vol. 2, no. 1, p. 25, Dec. 2015.
- P. Di Tommaso, “Awesome Pipeline: A Curated List of Awesome Pipeline Toolkits.” 2018.
- Amazon, “AWS Data Pipeline.” 2019.
- Snaplogic, “SnapLogic Intelligent Integration Platform,” 2019. [Online]. Available: https://www.snaplogic.com/products/intelligent-integrationplatform. [Accessed: 21-Jun-2020].
- Alooma, “Alooma Enterprise Data Pipeline.” 2019.
- S. G. Ahmad, C. S. Liew, M. M. Rafique, E. U. Munir, “Optimization of Data-Intensive Workflows in Stream-Based Data Processing Models”, The Journal of Supercomputing, vol. 73, no. 9, pp. 3901–3923, Sep. 2017.
- G. Kougka, A. Gounaris, A. Simitsis, “The Many Faces of Data-Centric Workflow Optimization: A Survey”, International Journal of Data Science and Analytics, vol. 6, no. 2, pp. 81–107, Sep. 2018.
- J. Leipzig, “A Review of Bioinformatic Pipeline Frameworks”, Briefings in Bioinformatics, p. bbw020, Mar. 2016.
- P. A. Ewels et al., “The Nf-Core Framework for Community-Curated Bioinformatics Pipelines”, Nature Biotechnology, vol. 38, no. 3, pp. 276–278, Mar. 2020.
- M. Bourgey et al., “GenPipes: An Open-Source Framework for Distributed and Scalable Genomic Analyses”, GigaScience, vol. 8, no. 6, Jun. 2019.
- D. Swersky, “Top 43 Programming Languages: When and How to Use Them,” 2018. [Online]. Available: https://raygun.com/blog/programminglanguages/. [Accessed: 21-Jun-2020].
- E. Frank, M. A. Hall, I. H. Witten, The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques,” 4th ed. Morgan Kaufmann Publishers Inc., 2016.
- A. Moro, R. Navigli, “Babelfy.” 2014.
- Y. Lage, J. R. Méndez, M. Novo-Lourés, “Big Data Pre-Processing For Java (BDP4J).” 2018.
- F. Lordan et al., “ServiceSs: An Interoperable Programming Framework for the Cloud”, Journal of Grid Computing, vol. 12, no. 1, pp. 67–91, Mar. 2014.
- R. M. Badia et al., “COMP Superscalar: An Interoperable Programming Framework”, SoftwareX, vol. 3–4, pp. 32–36, Dec. 2015.
- T. Burdett, N. Kurbatova, D. Hastings, E. Faulconbridge, A. Mapleson, R. Davey, “Conan2 Lightweight Workflow Manager.” 2019.
- J. Bingham, S. Davis, N. Deflaux, “Dockerflow: A Workflow Runner That Uses Dataflow to Run a Series of Tasks in Docker with the Pipelines API,” 2017. [Online]. Available: https://github.com/googlegenomics/dockerflow.
- Google Inc, “Cloud Dataflow Documentation,” 2019. [Online]. Available: https://cloud.google.com/dataflow/docs/?hl=es-419. [Accessed: 21-Jun-2020].
- Netflix, “Suro: Netflix Distributed Data Pipeline.” 2012.
- J. M. Wozniak, M. Wilde, I. T. Foster, “Language Features for Scalable Distributed-Memory Dataflow Computing,” in Fourth Workshop on DataFlow Execution Models for Extreme Scale Computing, 2014, pp. 50–53.
- J. M. Wozniak, M. Wilde, I. T. Foster, “Swift Tutorial for Running on Localhost,” 2014. [Online]. Available: http://swift-lang.org/tutorials/localhost/tutorial.html. [Accessed: 21-Jun-2019].
- M. Hategan et al., “Swift-Lang, Swift-K,” 2019. [Online]. Available: https://github.com/swift-lang/swift-k. [Accessed: 21-Jun-2019].
- H. López-Fernández, O. Graña-Castro, A. Nogueira-Rodríguez, M. Reboiro-Jato, D. Glez-Peña, “Compi: A Framework for Portable and Reproducible Pipelines”, PeerJ Computer Science, vol. 7, p. e593, Jun. 2021.
- Broad Institute, “Cromwell: Workflow Management System Geared towards Scientific Workflows.” 2019.
- A. Malloy et al., “Drake.” 2015.
- S. Fong, Y. Zhuang, J. Li, R. Khoury, “Sentiment Analysis of Online News Using MALLET,” in 2013 International Symposium on Computational and Business Intelligence, 2013, pp. 301–304.
- A. K. McCallum, “MALLET: A Machine Learning for Language Toolkit.” 2002.
- Apache Software Foundation, “Apache Spark: ML Pipelines,” 2018. [Online]. Available: https://spark.apache.org/docs/latest/ml-pipeline.html. [Accessed: 21-Jun-2020].
- A. Liu, Apache Spark Machine Learning Blueprints, 1st ed. Birmingham, UK: PACKT Publishing Ltd., 2016.
- F. Long, D. Mohindra, R. C. Seacord, D. F. Sutherland, D. Svoboda, Java Coding Guidelines: 75 Recommendations for Reliable and Secure Programs. Addison-Wesley, 2013.
- Google LLC, “AutoService: A Collection of Source Code Generators for Java.” 2013.
- L. Breiman, “Random Forests”, Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
- M. R. G. Alder, D. Benson, “Jgraph/Jgraphx.” 2014.
- E. P. S. J.M. Gómez Hidalgo, “SMS Spam Corpus v.0.1,” 2011.
- A. Pérez, P. Larrañaga, I. Inza, “Bayesian Classifiers Based on Kernel Density Estimation: Flexible Classifiers”, International Journal of Approximate Reasoning, vol. 50, no. 2, pp. 341–362, Feb. 2009.
- M. Novo-Lourés, Y. Lage, R. Pavón, R. Laza, D. Ruano-Ordás, J. R. Méndez, “Benchmarking Code for Pipeline-Based Frameworks.” 2021.