Commit bb0b6302 authored by Omran Saleh

submitted version

parent 9ba88b4d
To provide a higher abstraction level, we have developed a Web application called \AutoStudio~\cite{DBIS:RaneSaleh2015}, shown in Fig. \ref{fig:auto}. It is a user-friendly, browser-based application that runs cross-platform and is built with HTML5, Draw2D touch\footnote{\url{http://www.draw2d.org/draw2d/index.html}}, Highcharts\footnote{\url{http://www.highcharts.com}}, and Node.js\footnote{\url{http://www.nodejs.org}}. \AutoStudio offers several functionalities:
\begin{itemize}
\item It enables users to easily build a \PipeFlow program graphically via a collection of operators (represented by icons) which can simply be ``dragged and dropped'' onto a drawing canvas. The user can group the operators together in order to create the dataflow graph for that program, and from this graph, an equivalent \PipeFlow script can be generated. It also provides options for saving the generated scripts or flow designs for future use as well as loading a saved script and executing it whenever needed.
\item It contacts the \PipeFlow system to generate the equivalent program (e.g., for Storm, Spark Streaming, or PipeFabric) based upon the user's selection of the engine from the dashboard page. This way, the user does not need to know the syntax and constructs of any stream-processing language, including \PipeFlow. Through the application, the user can trigger the execution of the script by having the \PipeFlow system call the respective engine. Moreover, it handles real-time statistics, including execution and performance results sent by the \PipeFlow system while the program is executing. When the execution is complete, the application can send an email to the user.
\item It can serve as an evaluation tool for the generated programs when the user wants to compare and evaluate the performance of various stream-processing systems in terms of throughput, latency, and resource consumption such as CPU and memory usage. The evaluation can be performed online using dynamic figures or offline using static figures.
\end{itemize}
The client side of \AutoStudio is developed using suitable Web technologies, including HTML5, JavaScript, Cascading Style Sheets (CSS), jQuery (and helper libraries), Twitter Bootstrap, and Hogan.js. These technologies are used for building the graphical user interface, performing Ajax requests, uploading and downloading files, etc. In addition, Draw2D touch is used for creating the graphical view of \PipeFlow programs by creating and manipulating operators and connections, and the Highcharts library is utilized to create interactive charts for our application. On the server side, we needed a web server suitable for data-intensive real-time applications; therefore, Node.js and its supporting modules such as Nodemailer and Socket.IO were employed.
\begin{figure}[hb]
\centering
\includegraphics[width=3.5in, height = 2in]{AutoStudio.eps}
\caption
{\AutoStudio: the \PipeFlow web application}
\label{fig:auto}
\end{figure}
During the demo session, we will bring a laptop running our system and demonstrate its capabilities and supported services by employing real queries and datasets, which can be interactively explored by the demo audience. Since the front-end of our system is a web-based application, the whole demonstration can be carried out via the laptop's browser. We will prepare some sample scripts expressed in \PipeFlow. The users will create these scripts graphically by dragging and dropping operators and connections as well as specifying the operators' parameters. The audience can switch between the different generated programs (i.e., for different engines) without changing the original dataflow script and observe the differences between these programs' constructs. To show real-time results, the generated programs will be compiled and executed on the respective engines via our system. Moreover, the audience will observe real-time performance measurements for the engines through the static and dynamic figures on the dashboard page.
A wide range of (near) real-time applications, including financial data analysis, traffic management, telecommunication monitoring, environmental monitoring, smart grids, weather forecasting, and social media analysis, process stream-based data. These applications focus mainly on finding useful information and patterns on-the-fly as well as deriving valuable higher-level information from lower-level data in continuously incoming streams in order to report and monitor the progress of certain activities. In the last few years, several systems for processing streams of information have been proposed, each offering its own processing solution. This line of work was pioneered by academic systems such as Aurora and Borealis~\cite{Abadi:2003:ANM:950481.950485} and commercial systems like IBM InfoSphere Streams or StreamBase. Recently, some novel distributed stream computing platforms have been developed based on data parallelization approaches, which try to support scalable operation in cluster environments for processing massive data streams. Examples of these platforms are Storm~\cite{storm}, Spark Streaming~\cite{spark}, and Flink~\cite{flink}. Although these engines provide abstractions for processing (possibly) infinite streams of data, they lack support for higher-level declarative languages. Some of these engines provide only programming interfaces where operators and topologies have to be implemented in a programming language like Java or Scala. Moreover, to build a particular program (i.e., query) in these systems, users must be experts with deep knowledge of the syntax and programming constructs of the language, especially if the system supports multiple languages. Therefore, no time and effort savings can be achieved, as the user needs to write each programming statement correctly. To make life easier, the current trend in data analytics should be to adopt the ``Write once, run anywhere'' slogan. This slogan was coined by Sun Microsystems to illustrate that Java code can be developed on any platform and be expected to run on any platform equipped with a Java Virtual Machine. In general, the development of various stream-processing platforms raises the question of whether we can provide a unified programming model or a standard language in which the user writes one stream-processing script and can expect to execute it on any stream-processing engine. Bringing all these things together, we provide a demonstration of our solution called \PipeFlow. In our \PipeFlow system, we address the following issues:
\begin{itemize}
\item Developing a scripting language that provides most of the features of other stream-processing scripting languages, e.g., those of Storm and Spark Streaming. Therefore, we have chosen a dataflow language called \PipeFlow. Initially, this language was intended to be used in conjunction with a stream-processing engine called PipeFabric \cite{DBIS:SalBetSat14year2014,DBIS:SalSat14}; afterwards, it was extended to be used with other engines. The source script written in the \PipeFlow language is parsed and compiled, and then a target program (i.e., for Spark Streaming and Storm as well as PipeFabric) is generated based upon the user's selection. This target program is equivalent in its functionality to the original \PipeFlow script. Once the target program is generated, the user can execute it on the specific platform.
\item Implementing all \PipeFlow operators in each platform, directly or indirectly. \PipeFlow contains a set of pre-defined operators; thus, mapping or translating a \PipeFlow script into other programs requires that each \PipeFlow operator be implemented in the target platform.
\item Providing a flexible architecture that allows users to extend the system with additional platforms as well as new operators. These extensions should be integrated into the system smoothly.
\item Developing a front-end web application to enable users who have little experience in the \PipeFlow language to express the program and its associated processing algorithm and data pipeline graphically.
\end{itemize}
One useful application of this approach is helping developers evaluate various stream-processing systems. Instead of manually writing several programs that perform the same task for different systems, writing a single script in our approach helps obtain the same result faster and more efficiently.
The remainder of the paper is structured as follows: In Sect.~\ref{sec:pipeflow}, we introduce the \PipeFlow language, the system architecture, and an example of translating a \PipeFlow script into programs for different platforms. Next, in Sect.~\ref{sec:app}, we describe our front-end application and give details about its design and provided functionalities. Finally, the planned demonstration is described in Sect.~\ref{sec:demo}.
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{scrextend}
\lstset{language=Java,
showspaces=false,
showtabs=false,
breaklines=true,
showstringspaces=false,
breakatwhitespace=true,
basicstyle=\ttfamily\small,
}
\newtheorem{mydef}{Definition}
\newcommand{\todo}[1]{\textcolor[rgb]{1,0,0}{#1}}
\newcommand{\PipeFlow}{\textsc{PipeFlow}\xspace}
\newcommand{\AutoStudio}{\textsc{AutoStudio}\xspace}
\newcommand{\op}[1]{\textbf{\texttt{#1}}}
\newcommand{\SymbReg}{\textsuperscript{\textregistered}\xspace}
\newcommand{\CC}{C\nolinebreak\hspace{-.05em}\raisebox{.4ex}{\tiny\bf +}\nolinebreak\hspace{-.10em}\raisebox{.4ex}{\tiny\bf +}\xspace}
\clubpenalty=10000
\widowpenalty=10000
\begin{document}
\frenchspacing
\sloppy
\conferenceinfo{DEBS'15,} {June 29 - July 3, 2015, Oslo, Norway.}
\copyrightetc{Copyright 2015 ACM \the\acmcopyr}
\crdata{978-1-4503-3286-6/15/06\ ...\$15.00.\\
http://dx.doi.org/10.1145/2675743.2776774}
\maketitle
\begin{abstract}
Recently, some distributed stream computing platforms, such as Storm and Spark Streaming, have been developed for processing massive data streams. However, these platforms lack support for higher-level declarative languages and provide only programming interfaces. Moreover, users have to be well versed in the syntax and programming constructs of each language on these platforms. In this paper, we demonstrate our \PipeFlow system, in which the user can write a stream-processing script (i.e., query) using a higher-level dataflow language. This script can be translated into different stream-processing programs that run on the corresponding engines. In this case, the user only needs to know a single language and can thus write one stream-processing script and expect to execute it on different engines.
\end{abstract}
\category{H.2.4}{Database Management}{Systems - Query Processing}
\category{H.4}{Information Systems Applications}{Miscellaneous}
\keywords{Data stream processing, AutoStudio, Query processing, PipeFabric, Spark Streaming, Storm, Flink, PipeFlow}
\section{Introduction}
\label{sec:intro}
\section{Demonstration}
\label{sec:demo}
\input{demo}
\section{Acknowledgment}
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant no. GRK 1487.
%
% The following two commands are all you need in the
In this section, we provide a description of our \PipeFlow language and the system architecture. Additionally, we give an example of how our system maps a \PipeFlow script into PipeFabric and Storm programs.
\subsection{\PipeFlow Language}
The \PipeFlow language is a dataflow language inspired by Hadoop's Pig Latin. In general, a \PipeFlow script describes a directed acyclic graph of dataflow operators, which are connected by named pipes. A single statement is given by:
\begin{alltt}
\$out := op(\$in1, \$in2, \dots) ... clause ...;
\end{alltt}
where \texttt{\$out} denotes a pipe variable referring to the typed output stream of operator \texttt{op} and \texttt{\$in\emph{i}} refers to input streams. By using the output pipe of one operator as an input pipe of another operator, a dataflow graph is formed. Dataflow operators can be further parameterized by the clauses described below.
\begin{itemize}
\item \texttt{by} clause: This clause specifies a boolean expression that has to be satisfied by each output tuple. It is used, for instance, for filter operators, grouping, or joins. Expressions are formulated in standard C notation as follows:
\vspace*{-0.5cm}
\begin{alltt}
\begin{center}
\$res := filter(\$in) by x > 42;
\end{center}
\end{alltt}
\vspace*{-0.5cm}
where \texttt{\$res} is the output pipe to which the filter operator publishes its result stream, \texttt{\$in} is the input pipe from which the operator receives its input tuples (whose schema contains the \texttt{x} attribute), and \texttt{x > 42} is a predicate used to discard incoming tuples accordingly.
\item \texttt{using} clause: This clause passes operator-specific parameters to an operator, e.g., the \texttt{filename} parameter of the file reader operator, which specifies the input file. These parameters are given as a list of key-value pairs with the following syntax:
\vspace*{-0.5cm}
\begin{alltt}
\begin{center}
using (param1 = value1, param2 = value2, ...);
\end{center}
\end{alltt}
\vspace*{-0.5cm}
\item \texttt{with} clause: This clause is used to explicitly specify the schema associated with the output pipe of the operator. It is only required for some operators such as \texttt{file\_source}. An example of this clause is:
\vspace*{-0.5cm}
\begin{alltt}
\begin{center}
\$out := file_source() using (filename =
"input.csv") with (x int, y int, z string);
\end{center}
\end{alltt}
\vspace*{-0.5cm}
\item \texttt{generate} clause: This clause specifies how an output tuple of the operator is constructed. For this purpose, a comma-separated list of expressions is given; optionally, a new field name can be specified using the keyword \texttt{as}:
\vspace*{-0.5cm}
\begin{alltt}
\begin{center}
generate x, y, (z * 2) as res;
\end{center}
\end{alltt}
\vspace*{-0.5cm}
\item \texttt{on} clause: This clause is used to specify a list of fields from the input pipe(s) used for grouping, joining, or sorting. For example, ordering tuples on the \texttt{x} attribute in descending order can be specified as follows:
\vspace*{-0.5cm}
\begin{alltt}
\begin{center}
$out1 := sort($res) on x using (limit = 10);
\end{center}
\end{alltt}
where the \texttt{limit} parameter limits the number of results returned.
\end{itemize}
\PipeFlow provides a large set of pre-defined operators to efficiently support various query domains and provide a higher degree of expressiveness. A script in \PipeFlow is a collection of these operators. The language supports stream-based variants of the well-known relational operators from relational database systems (e.g., filter and projection, streaming and relation joins, grouping and aggregations) as well as source operators (file and database readers, network socket readers, \dots), etc.
\subsection{\PipeFlow Architecture}
\begin{figure}[hb]
\centering
\includegraphics[width=3.5in, height = 2in]{architecture.eps}
\caption{The \PipeFlow architecture}
\label{fig:arch}
\end{figure}
In our approach, we have adopted the automated code translation (ACT) technique: an input source script written in the \PipeFlow language is taken and converted into another output program. In general, the \PipeFlow system is written in Java and depends heavily on the ANTLR\footnote{\url{http://www.antlr.org}} and StringTemplate\footnote{\url{http://www.stringtemplate.org}} libraries. The former generates a \PipeFlow language parser that can build and walk parse trees, whereas the latter generates code using pre-defined templates. Basically, the following components are used to translate a \PipeFlow source script into equivalent target programs (PipeFabric, Spark Streaming, or Storm): \emph{(1)} the parser, \emph{(2)} the flow graph, \emph{(3)} the template file, and \emph{(4)} the code generator. The latter two components are specific to the target programs (i.e., engines) and differ from each other depending on the target code to be generated. Therefore, a separate template file and code generator are created for every target.
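As a rough illustration of this front end, the following Java sketch shows how an ANTLR-generated parser and a tree-walking listener could be wired together. It is only a sketch under assumed names: \texttt{PipeFlowLexer}, \texttt{PipeFlowParser}, the start rule \texttt{script}, and the listener \texttt{FlowGraphBuilder} (as well as the \texttt{FlowGraph} type, sketched after the component list below) are hypothetical stand-ins, not the actual \PipeFlow sources.
\begin{lstlisting}[caption={Illustrative sketch of an ANTLR-based front end (hypothetical names)}]
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class PipeFlowFrontEnd {
  // PipeFlowLexer, PipeFlowParser, and the base listener would be generated
  // by ANTLR from the PipeFlow grammar; "script" is an assumed start rule.
  public static FlowGraph buildFlowGraph(String scriptFile) throws Exception {
    PipeFlowLexer lexer =
        new PipeFlowLexer(CharStreams.fromFileName(scriptFile));
    PipeFlowParser parser =
        new PipeFlowParser(new CommonTokenStream(lexer));
    ParseTree tree = parser.script();   // parse tree of the whole script
    // Hypothetical listener that creates a flow node per operator statement
    // and a pipe per named pipe variable while the tree is traversed.
    FlowGraphBuilder builder = new FlowGraphBuilder();
    ParseTreeWalker.DEFAULT.walk(builder, tree);
    return builder.getFlowGraph();
  }
}
\end{lstlisting}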
\textbf{Role of Components}: The roles and functionalities of each of the above-mentioned components are described below and shown in Fig. \ref{fig:arch}.
\begin{description}
\item[Parser:] This component performs the lexical and syntactic analysis by parsing the input script written in the \PipeFlow language and identifying the dataflow graph instances in this script. Initially, ANTLR automatically generates a parser for the \PipeFlow language based on the \PipeFlow grammar. This parser creates a parse tree, which is the data structure representing how the grammar matches the \PipeFlow script. Additionally, ANTLR automatically generates a tree walker interface which can be used to visit the nodes of the parse tree. A new listener, which implements this generated interface, visits the nodes in order to construct the corresponding flow node instances during the tree traversal. From these flow node instances, the flow graph object is created.
\item[Flow graph:] This component represents the logical view of the input script. It consists of nodes (flow nodes) and edges (pipes). A flow node instance represents an operator in the dataflow script, whereas a pipe instance represents an edge between two nodes of the dataflow graph. Each pipe therefore references its input node (the node producing tuples which are sent via the pipe) and its output node (the node consuming these tuples from the pipe). This component is mainly used to generate target code for a specific engine and remains the same irrespective of the target program to be generated.
\item[Template file:] This component is defined as a string template group file (STG). We can imagine this file as a string with holes that have to be filled by the code generator. Inside this file, rules with their arguments are defined to specify how the operators' code is formatted and which parts of it are rendered. Some parts are rendered as is, whereas other parts contain place-holders that are replaced with the provided arguments. The template file differs for each target, since the format and syntax of each target program to be generated are different. Each file contains the library files to be included, the packages to be imported, the code blocks for the operators, etc.
\item[Code generator:] This component generates the target code based on the flow graph built from the input \PipeFlow source script. The code generator takes a template file and the flow graph produced by the parser, processes all nodes and pipes iteratively, and creates the equivalent code using the StringTemplate library (a minimal illustrative sketch follows this list). As previously mentioned, there is a separate code generator for each target engine.
\end{description}
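To make the interplay of the flow graph, the template file, and the code generator more concrete, the following is a minimal, hypothetical Java sketch rather than the actual \PipeFlow implementation; for brevity it uses inline StringTemplate templates instead of a full \texttt{.stg} group file, and all class names are our own illustrations.
\begin{lstlisting}[caption={A hypothetical flow-graph and code-generator sketch using StringTemplate}]
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.stringtemplate.v4.ST;

// A flow node stands for one operator statement of the PipeFlow script.
class FlowNode {
  final String id;        // e.g. "op2"
  final String operator;  // e.g. "filter"
  final Map<String, String> params = new LinkedHashMap<>();
  FlowNode(String id, String operator) {
    this.id = id; this.operator = operator;
  }
}

// A pipe is a directed edge from the producing to the consuming flow node.
class Pipe {
  final FlowNode input, output;
  Pipe(FlowNode input, FlowNode output) {
    this.input = input; this.output = output;
  }
}

class FlowGraph {
  final List<FlowNode> nodes = new ArrayList<>();
  final List<Pipe> pipes = new ArrayList<>();
}

// A toy PipeFabric-style generator: every node and pipe is rendered by
// filling the place-holders of a small template with values from the graph.
class ToyCodeGenerator {
  String generate(FlowGraph graph) {
    StringBuilder code = new StringBuilder();
    for (FlowNode n : graph.nodes) {
      ST t = new ST("auto <id>_ = new <op>(<args>);");
      t.add("id", n.id);
      t.add("op", n.operator);
      t.add("args", String.join(", ", n.params.values()));
      code.append(t.render()).append('\n');
    }
    for (Pipe p : graph.pipes) {
      ST t = new ST("makeLink(<in>_, <out>_);");
      t.add("in", p.input.id);
      t.add("out", p.output.id);
      code.append(t.render()).append('\n');
    }
    return code.toString();
  }
}
\end{lstlisting}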
\subsection{An Example of Translation}
Consider below a simple\footnote{\label{note1}Because of the limitation in the number of pages.} sample script written in \PipeFlow that needs to be translated into PipeFabric and Storm programs by our system. This script reads a stream that contains the \texttt{x} and \texttt{y} fields. The stream is then filtered on the \texttt{x} field and aggregated to compute the sum of all \texttt{y} values for each \texttt{x}.
\begin{alltt}
\$in := file_source() using (filename =
"input.csv") with (x int, y int);
\$1 := filter(\$in) by x > 18;
\$2 := aggregate(\$1) on x
generate x, sum(y) as counter;
\end{alltt}
The following code fragments show portions\footref{note1} of the generated Storm and PipeFabric programs, respectively. The \PipeFlow script and both programs provide the same functionality but with different syntax and programming constructs. By counting the number of lines, we can see that the \PipeFlow script is more compact than the corresponding platform programs.
\vspace*{-0.2cm}
\begin{lstlisting}[caption=A portion of the generated Storm program, label="storm"]
.....
public class Op2Filter extends BaseFilter {
public boolean isKeep(TridentTuple tuple) {
return tuple.getIntegerByField("x") > 18;
}
}
stream.each(new Fields("x","y"), new MyFilter())
FileSource op1= new FileSource("input.csv", new Fields("x","y"));
TridentTopology topology = new TridentTopology();
Stream stream = topology.newStream("op1", op1);
stream.each(new Fields("x","y"), new Op2Filter());
stream.groupBy(new Fields("x"))
.aggregate(new Fields("y"), new Sum(), new Fields("counter"));
.....
\end{lstlisting}
\vspace*{-0.2cm}
\begin{lstlisting}[caption=A portion of the generated PipeFabric program, label="pipefabric"]
.....
typedef pfabric::Tuple<int, int>
TupleType1_;
auto op1_ = new FileSource<TupleType1_>("input.csv");
auto op2_ = new Filter<TupleType1_>([&](TupleType1_ tp) { return std::get<0>(*tp) > 18; });
makeLink(op1_, op2_);
auto op3_ = new GroupedAggregation<TupleType1_, TupleType1_>( ... );
makeLink(op2_, op3_);
.....
\end{lstlisting}
@Incollection{DBIS:SalBetSat14year2014,
Author = {Saleh, Omran and Betz, Heiko and Sattler, Kai-Uwe},
Booktitle = {New Trends in Database and Information Systems II},
ISBN = {978-3-319-10517-8},
Pages = {185--197},
Title = {Partitioning for Scalable Complex Event Processing on Data Streams},
Volume = {312},
Publisher = {Springer International Publishing},
Year = {2014}
}
@Incollection{DBIS:SalSat14,
Author = {Saleh, Omran and Sattler, Kai-Uwe},
Booktitle = {OTM 2014},
ISBN = {978-3-662-45562-3},
Pages = {700--717},
Title = {On Efficient Processing of Linked Stream Data},
Volume = {8841},
Year = {2014},
}
@Inproceedings{DBIS:RaneSaleh2015,
Author = {Rane, Nikhil-Kishor and Saleh, Omran},
Booktitle = {BTW,
16. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme"
(DBIS) Proceedings},
Month = {March},
Pages = {655--658},
Title = {{AutoStudio}: A Generic Web Application for Transforming Dataflow
Programs into Action},
Year = {2015}
}