To provide a higher level of abstraction, we have developed a web application called \textbf{AutoStudio}. It is a user-friendly, easy-to-use, cross-platform web application built with HTML5, Draw2D touch\footnote{\url{http://www.draw2d.org/draw2d/index.html}}, and node.js\footnote{\url{http://www.nodejs.org}}. AutoStudio provides several functionalities:
\begin{itemize}
\item It enables users to leverage the emerging \PipeFlow language graphically via a collection of operators (represented by icons) which can simply be ``dragged and dropped'' onto a drawing canvas. The user can assemble the operators into a dataflow graph in a logical way that visually shows how they are related, and from this graph an equivalent \PipeFlow script can be generated. By clicking on an operator icon, a pop-up window appears that lets the user specify the required parameters of the \PipeFlow operator. Moreover, the user can display the help contents for each operator.
\item Contacting the \PipeFlow system to generate the right script (e.g., a Storm, Spark, or PipeFabric script) based upon the user's selection of language on the dashboard page; a sketch of this interaction follows the list. This frees the user from having to know the syntax and constructs of any stream-processing language, including \PipeFlow. Through this application, the user can also trigger the execution of the script by the \PipeFlow system via the respective engine. Moreover, it handles real-time statistics, including execution and performance results, sent by the \PipeFlow system while the script is executing. When the execution is complete, the application can send an email to the user.
\item It provides options for saving the generated scripts or flow designs for future reference, loading a saved script, and executing it whenever required.
\item It offers an evaluation tool for the generated scripts for users interested in comparing and evaluating the performance of stream-processing systems in terms of throughput, latency, and resource consumption such as CPU and memory. The evaluation can be performed online using dynamic figures or offline using static figures.
\end{itemize}
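The dashboard's interaction with the \PipeFlow system is a plain request/response exchange. The following minimal Java sketch shows how such a script-generation request might look; the endpoint, host, and payload format are assumptions for illustration only (the actual AutoStudio front end is JavaScript-based):
\begin{verbatim}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GenerateScriptClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: ask the PipeFlow system to translate a
        // PipeFlow script into the target language chosen on the dashboard.
        String body = "{\"target\": \"storm\", "
                    + "\"script\": \"$in := file_source() ...\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the generated target script
    }
}
\end{verbatim}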
......
A wide range of (near) real-time applications process stream-based data including
financial data analysis, traffic management, telecommunication monitoring, environmental monitoring, the smart grid, weather forecasting, and social media analysis. These applications focus mainly on finding useful information and patterns on the fly, as well as deriving valuable higher-level information from lower-level data in continuously incoming streams, in order to report on and monitor the progress of some activity. In the last few years, several systems for processing streams of information have been proposed, each offering its own processing solution. The field was pioneered by academic systems such as Aurora and Borealis~\cite{Abadi:2003:ANM:950481.950485} and by commercial systems like IBM InfoSphere Streams or StreamBase. Recently, some novel distributed stream computing platforms have been developed based on data-parallelization approaches, which try to support scalable operation in cluster environments for processing massive data streams. Examples of these platforms are Storm~\cite{storm}, Spark~\cite{spark}, and Flink~\cite{flink}. Although these stream processing engines (SPEs) provide abstractions for processing (possibly) infinite streams of data, they lack support for higher-level declarative languages. Some of these engines provide only a programming interface, where operators and topologies have to be implemented in a programming language like Java or Scala. Moreover, to build a particular program (i.e., a query) in these systems, users must be experts and have deep knowledge of the syntax and programming constructs of the language, especially if the system supports multiple languages. Therefore, no savings in time and effort can be achieved, as the user needs to write each programming statement correctly. To make life easier, the current trend in data analytics should be to adopt the ``Write once, run anywhere'' slogan. This slogan was first coined by Sun Microsystems to illustrate that Java code can be developed on any platform and be expected to run on any platform equipped with a Java virtual machine (JVM). In general, the development of various stream processing engines raises the question of whether we can provide a unified programming model or a standard language in which the user writes one stream-processing script and can expect to execute this script on any stream-processing engine. Bringing all these things together, we provide a demonstration of our solution, called \PipeFlow. In our \PipeFlow system, we address the following issues:
\begin{itemize}
\item Developing a scripting language that provides most of the features of stream-processing scripting languages such as those of Storm and Spark. To this end, we have designed a dataflow language called \PipeFlow. Initially, this language was intended to be used in conjunction with a stream processing engine called PipeFabric~\cite{DBIS:SalBetSat14year2014,DBIS:SalSat14}; later, it was extended to be used with other engines. The source script written in the \PipeFlow language is parsed and compiled, and then a target program (i.e., for Spark and Storm as well as PipeFabric) is generated based on the user's selection. This target script is equivalent in its functionality to the original \PipeFlow program. Once the target program is generated, the user can execute it on the specific engine.
\item Mapping or translating a \PipeFlow script into other scripts requires each \PipeFlow operator to be implemented in the target engine. Since \PipeFlow contains a set of pre-defined operators, all of these operators have been implemented, directly or indirectly, in each target engine.
......
\category{H.4}{Information Systems Applications}{Miscellaneous}
\category{I.5}{Pattern Recognition}{Miscellaneous}
\keywords{Data stream processing, Autostudio, Query processing, Spark, Storm, Flink, PipeFlow}
\section{Introduction}
\label{sec:intro}
......
\end{center}
\end{alltt}
\vspace*{-0.5cm}
where \texttt{\$res} is the output pipe to which the filter operator publishes its result stream, \texttt{\$in} is the input pipe from which the operator receives its input tuples, which carry the \texttt{x} attribute, and \texttt{x > 42} is a predicate used to discard incoming tuples accordingly.
\item \texttt{with} clause: The \texttt{with} clause is used to explicitly specify the schema associated with the output pipe of the operator. This is only required for some operators, such as \texttt{file\_source}. An example of this clause is:
\vspace*{-0.5cm}
\begin{alltt}
......
$out1 := sort($res) on x using (limit = 10);
where the \texttt{limit} parameter limits the number of results returned.
\end{itemize}
\PipeFlow provides a large set of pre-defined operators to efficiently support various query domains and provide a high degree of expressiveness. A script in \PipeFlow is a collection of these operators. The language supports stream-based variants of the relational operators well known from relational database systems (e.g., filter and projection, streaming and relational joins, grouping and aggregation) as well as source operators (file and database readers, network socket readers, \dots).
\subsection{\PipeFlow Architecture}
......
\caption{\PipeFlow architecture}
\label{fig:arch}
\end{figure}
In our approach, we have adopted an automated code translation (ACT) technique: an input source script written in the \PipeFlow language is taken and converted into an output source script in another language. The \PipeFlow system is written in Java and depends heavily on the ANTLR\footnote{\url{http://www.antlr.org}} and StringTemplate\footnote{\url{http://www.stringtemplate.org}} libraries. The former generates a \PipeFlow language parser that can build and walk parse trees, whereas the latter generates code using pre-defined templates. Basically, the following components are used to achieve the translation of a \PipeFlow source script into equivalent target scripts (PipeFabric, Spark, or Storm): \emph{(1)} parser, \emph{(2)} flow graph, \emph{(3)} template file, and \emph{(4)} code generator. The latter two components are specific to the target scripts and differ from each other depending on the target code to be generated; thus, for every target language we need to create a separate template file and code generator.
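To make the interplay of these four components concrete, the following Java sketch outlines a hypothetical top-level translation driver. The class names (\texttt{PipeFlowLexer}, \texttt{PipeFlowParser}, \texttt{FlowGraphBuilder}, \texttt{CodeGenerator}) and the \texttt{program} start rule are illustrative assumptions rather than the actual \PipeFlow API, although the ANTLR calls themselves are the standard ones:
\begin{verbatim}
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class Translator {
    public static String translate(String script, String target) {
        // (1) Parser: ANTLR-generated lexer and parser build the parse tree.
        PipeFlowLexer lexer = new PipeFlowLexer(CharStreams.fromString(script));
        PipeFlowParser parser = new PipeFlowParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.program(); // assumed start rule of the grammar

        // (2) Flow graph: a listener (extending the generated base listener)
        // walks the tree and builds the flow nodes and pipes.
        FlowGraphBuilder builder = new FlowGraphBuilder();
        ParseTreeWalker.DEFAULT.walk(builder, tree);
        FlowGraph graph = builder.getGraph();

        // (3)+(4) Template file and code generator are target-specific.
        CodeGenerator generator = CodeGenerator.forTarget(target); // "storm", ...
        return generator.generate(graph);
    }
}
\end{verbatim}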
\textbf{Role of Components}: The roles and functionalities of each of the above-mentioned components are described below and shown in Fig.~\ref{fig:arch}.
\begin{description}
\item[Parser:] This component does the lexical and syntactic analysis by parsing the input program written in \PipeFlow and identifying the dataflow graph instances in the program. Initially, ANTLR generates a parser for the \PipeFlow language automatically based on the \PipeFlow grammar. This parser creates a parse tree, which is the data structure representing how the grammar matches the \PipeFlow script. Additionally, ANTLR automatically generates a tree walker interface which can be used to visit the nodes of the parse tree. A listener implementing this interface visits the nodes in order to construct the corresponding flow node instances during the tree traversal (see the sketch after this list). From the flow node instances, the flow graph object can be created.
\item[Flow graph:] A logical representation of the input program which comprises nodes (flow nodes) and edges (pipes). A flow node instance represents an operator in the dataflow program; it is the intermediary between the parser and the code generator, used to build up the graph of nodes and to generate the target code, respectively. A pipe instance represents an edge between two nodes of the dataflow graph; therefore, each pipe records its input node (the node producing the tuples sent via the pipe) and its output node (the node consuming these tuples from the pipe). This component is mainly used to generate the target code in a specific language and, irrespective of the target language program to be generated, it remains the same.
......
\item[Template file:] Also called a string template group file (\texttt{.stg}), this file can be imagined as a string with holes which have to be filled by the code generator. Inside this file, rules with their arguments are defined to specify how to format each operator's code and which parts of it to render. Some parts are rendered as-is, whereas other parts containing place-holders are replaced with the provided arguments. The template file is different for each target code to be generated, since the format and syntax of each target language differ. It contains the library files to be included, the packages to be imported, the code blocks for the operators, etc.
\end{description}
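As referenced in the \textbf{Parser} item above, the sketch below illustrates, under the same naming assumptions, how flow nodes and pipes might be modeled and how a code generator could render an operator through the StringTemplate API. \texttt{STGroupFile}, \texttt{getInstanceOf}, \texttt{add}, and \texttt{render} are the real StringTemplate calls; the surrounding classes are hypothetical:
\begin{verbatim}
import org.stringtemplate.v4.ST;
import org.stringtemplate.v4.STGroup;
import org.stringtemplate.v4.STGroupFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A flow node represents one operator of the dataflow program.
class FlowNode {
    String operator;                              // e.g. "filter"
    Map<String, String> params = new HashMap<>(); // 'using' parameters
    List<Pipe> inputs = new ArrayList<>();
    List<Pipe> outputs = new ArrayList<>();
}

// A pipe is an edge between a producing and a consuming flow node.
class Pipe {
    String name;        // e.g. "$in", "$res"
    FlowNode producer;  // node publishing tuples into the pipe
    FlowNode consumer;  // node consuming tuples from the pipe
}

// A target-specific generator fills the holes of its template group file.
class StormCodeGenerator {
    private final STGroup templates = new STGroupFile("storm.stg");

    String render(FlowNode node) {
        ST st = templates.getInstanceOf(node.operator); // one rule per operator
        st.add("params", node.params);                  // fill the place-holders
        return st.render();
    }
}
\end{verbatim}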
\subsection{An Example of Translation}
Consider below a simple\footnote{Due to the limited number of pages.} sample script written in \PipeFlow that is to be translated to PipeFabric and Storm by our system. This script reads a stream that contains the \texttt{x} and \texttt{y} fields. The stream is then filtered on the \texttt{x} field and aggregated to find the sum of all \texttt{y} fields for a particular \texttt{x}. Note that the \PipeFlow constructs are simpler than those of the other engines.
\begin{alltt}
\$in := file_source() using (filename = "input.csv") with (x int, y int);
\$1 := filter(\$in) by x > 18;
......
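For the filter step above, the code emitted from the Storm template might look roughly like the following bolt. This is a hedged sketch of plausible generated code, not the verbatim output of our generator:
\begin{verbatim}
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Storm counterpart of the PipeFlow statement: $1 := filter($in) by x > 18;
public class FilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        int x = input.getIntegerByField("x");
        if (x > 18) { // the 'by' predicate of the filter operator
            collector.emit(new Values(x, input.getIntegerByField("y")));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("x", "y"));
    }
}
\end{verbatim}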