Commit 3bb2e34b authored by Omran Saleh

first commit
To provide a higher level of abstraction, we have developed a web application called \textbf{AutoStudio}. It is a user-friendly, cross-platform web application built with HTML5, Draw2D touch\footnote{\url{http://www.draw2d.org/draw2d/index.html}}, and node.js\footnote{\url{http://www.nodejs.org}}. AutoStudio offers the following functionality:
\begin{itemize}
\item It enables users to leverage the emerging \PipeFlow language graphically via a collection of operators (represented by icons) that can be dragged and dropped onto a drawing canvas. The user assembles the operators into a dataflow graph that shows visually how they are related, and an equivalent \PipeFlow script can be generated from this graph. Clicking on an operator icon opens a pop-up window in which the user specifies the required parameters of the \PipeFlow operator. Moreover, the user can display the help contents for each operator.
\item It contacts the \PipeFlow system to generate the right script (e.g., a Storm or Spark script) based on the language that the user selects on the dashboard page. The user therefore does not need to know the syntax and constructs of any stream-processing language, including \PipeFlow.
\item It can execute a script by calling the respective engine, display the execution results in real time, and email the user when the execution is complete. The application also provides options for saving the generated scripts or flow designs for future reference, and for loading a saved script and executing it whenever required.
\item It provides an evaluation tool for the generated scripts, aimed at users who are interested in comparing and evaluating the performance of stream-processing systems in terms of throughput, latency, and resource consumption such as CPU and memory. The evaluation can be performed online using dynamic figures or offline using static figures.
\end{itemize}
AutoStudio relies heavily on open-source software and frameworks. The client side uses HTML5, JavaScript, Cascading Style Sheets (CSS), jQuery (and helper libraries), Twitter Bootstrap, and hogan.js to build the graphical user interface, perform Ajax requests, upload and download files, etc. AutoStudio makes extensive use of pre-compiled Hogan templates: the data returned from the server is simply passed to these templates for quick rendering. In addition, Draw2D touch is used to build the diagram editor in the browser, in which operators and connections are created and manipulated. \\
%The second part is the server side which consists of the node.js web server.
A wide range of (near) real-time applications process stream-based data, for instance, financial data analysis, traffic management, telecommunication monitoring, environmental monitoring, the smart grid, weather forecasting, and social media analysis. These applications focus mainly on finding useful information and patterns on the fly, and on deriving valuable higher-level information from lower-level data in continuously incoming streams, in order to report on and monitor the progress of some activity. Stream processing has therefore emerged as an appropriate approach to support this kind of application. In the last few years, several systems for processing streams of information have been proposed, each offering its own solution for (near) real-time processing. The field was pioneered by academic systems such as Aurora and Borealis~\cite{Abadi:2003:ANM:950481.950485,Borealis} and STREAM~\cite{stream}, and by commercial systems like IBM InfoSphere Streams and StreamBase. Recently, novel distributed stream computing platforms have been developed based on data parallelization approaches, which aim to support scalable operation in cluster environments for processing massive data streams. Examples of these platforms are Storm~\cite{storm}, Spark~\cite{spark}, and Flink~\cite{flink}. Although these stream processing engines (SPEs) provide abstractions for processing (possibly) infinite streams of data, they provide only a programming interface and lack support for higher-level declarative languages. This means that operators and topologies have to be implemented in a programming language like Java or Scala. Moreover, to build a particular program or query in these systems, users must be well versed in the syntax and programming constructs of the language, especially if the system supports multiple languages. Rapid development is therefore hard to achieve, as the user has to write every programming statement correctly.
To make life much easier, the current trend in data analytics should be to adopt the ``Write once, run anywhere'' slogan. This slogan was coined by Sun Microsystems to illustrate that Java code can be developed on any platform and be expected to run on any platform equipped with a Java virtual machine (JVM). In general, the development of various stream processing engines raises the question of whether we can provide a unified programming model or a standard language in which the user can write one stream-processing script and expect to execute it on any stream-processing engine. Bringing all these things together, we provide a demonstration of our solution called \PipeFlow. In our \PipeFlow system, we address the following issues:
\begin{itemize}
\item Developing a scripting language (i.e., a standard one) that provides most of the features of stream-processing scripting languages, mainly those of Storm and Spark. To this end, a dataflow specification language has been proposed that compiles into an equivalent script for one of those engines.
\item Mapping or compiling a \PipeFlow script into other scripts requires that each \PipeFlow operator also exists in the target engine. Since \PipeFlow contains a set of pre-defined operators, all of these operators have already been implemented in each target engine, directly or indirectly.
\item Providing a flexible architecture that lets users extend the system with support for more engines as well as new operators. New operators, in which custom processing can be defined, should integrate smoothly into the system.
\item Developing a web application as a front-end that enables users with less experience in \PipeFlow to express their problem, its associated processing algorithm, and its data pipeline graphically.
\end{itemize}
The remainder of the paper is structured as follows: In Sect.~\ref{sec:pipeflow}, we introduce the \PipeFlow language, the system architecture, and an example of the mapping between scripts. Next, in Sect.~\ref{sec:app}, we describe our front-end application and give details about its design and the functionality it provides. Finally, the planned demonstration is described in Sect.~\ref{sec:demo}.
\documentclass{sig-alternate}
\usepackage[lined,ruled,commentsnumbered,linesnumbered]{algorithm2e}
\usepackage{url}
\usepackage{enumerate}
\usepackage{alltt}
\usepackage{color}
\usepackage{graphicx}
\usepackage{multirow}
\usepackage{xspace}
\newtheorem{mydef}{Definition}
\newcommand{\todo}[1]{\textcolor[rgb]{1,0,0}{#1}}
\newcommand{\PipeFlow}{\textsc{PipeFlow}\xspace}
\newcommand{\op}[1]{\textbf{\texttt{#1}}}
\newcommand{\SymbReg}{\textsuperscript{\textregistered}\xspace}
\newcommand{\CC}{C\nolinebreak\hspace{-.05em}\raisebox{.4ex}{\tiny\bf +}\nolinebreak\hspace{-.10em}\raisebox{.4ex}{\tiny\bf +}\xspace}
\begin{document}
\frenchspacing
\sloppy
%
% --- Author Metadata here ---
%\conferenceinfo{DEBS}{'14 May 26th-29th 2014, IIT Bombay, Mumbai, India}
%\CopyrightYear{2007} % Allows default copyright year (20XX) to be over-ridden - IF NEED BE.
%\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
% --- End of Author Metadata ---
\title{The PipeFlow Approach: Write once, Run in Different Stream-processing Engines}
\numberofauthors{2} % in this sample file, there are *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%
\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
\alignauthor
Omran Saleh\\
\affaddr{Department of Computer Science and Automation}\\
\affaddr{Technische Universit{\"a}t Ilmenau, Germany}\\
\email{omran.saleh@tu-ilmenau.de}
% 2nd. author
\alignauthor Kai-Uwe Sattler\\
\affaddr{Department of Computer Science and Automation}\\
\affaddr{Technische Universit{\"a}t Ilmenau, Germany}\\
\email{kus@tu-ilmenau.de}
}
\maketitle
\begin{abstract}
\end{abstract}
% % A category with the (minimum) three required fields
\category{H.2.4}{Database Management}{Systems - Query Processing}
\category{H.4}{Information Systems Applications}{Miscellaneous}
\category{I.5}{Pattern Recognition}{Miscellaneous}
\keywords{Data stream processing, AutoStudio, Query processing, Spark, Storm, Flink, PipeFlow}
\section{Introduction}
\label{sec:intro}
\input{intro}
\section{The PipeFlow System}
\label{sec:pipeflow}
\input{pipeflow}
\section{The Front-end Application}
\label{sec:app}
\input{autostudio}
%\section{Solutions Correctness}
%\label{sec:correctness}
%\input{correctness}
%describe the main idea of the queries and the details of the PipeFlow implementation
%\subsection{Query 1: Load Prediction}
%\subsection{Query 2: Outliers}
%\section{Demonstration}
%\label{sec:demo}
%\input{demo}
%
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{sigproc} % sigproc.bib is the name of the Bibliography
\end{document}
In this section, we describe the \PipeFlow language and the system architecture, and give an example of mapping a \PipeFlow script to Spark and Storm scripts.
\subsection{\PipeFlow Language}
The \PipeFlow language is a dataflow language inspired by Hadoop's Pig Latin~\cite{Olston2008}. In general, a \PipeFlow script describes a directed acyclic graph of dataflow operators, which are connected by named pipes. A single statement has the form:
\begin{alltt}
\$out := op(\$in1, \$in2, \dots) ... clause ... clause;
\end{alltt}
where \texttt{\$out} denotes a pipe variable referring to the typed output stream of operator \texttt{op} and \texttt{\$in\emph{i}} refers to the input streams. By using the output pipe of one operator as the input pipe of another operator, a dataflow graph is formed. Dataflow operators can be further parametrized by the clauses described below.
\begin{itemize}
\item \texttt{by} clause: The by clause specifies a Boolean expression that each output tuple has to satisfy. It is used, for instance, for filter operators, grouping, or joins. Expressions are formulated in standard C notation, for example:
\begin{alltt}
$res := filter($in) by x > 42;
\end{alltt}
\item \texttt{with} clause: The with clause is used to explicitly specify the schema associated with the output pipe of the operator. This is only required for some operators such as \texttt{file\_source}. The syntax for the with clause is:
\begin{alltt}
with (fieldname fieldtype, fieldname fieldtype, ...)
\end{alltt}
\item \texttt{using} clause: The using clause passes operator-specific parameters, e.g., the \texttt{filename} parameter that tells the file reader operator which input file to read. The parameters are given as a list of key-value pairs with the following syntax:
\begin{alltt}
using (param1 = value1, param2 = value2, ...)
\end{alltt}
\item \texttt{generate} clause: The generate clause specifies how an output tuple of the operator is constructed. For this purpose, a comma-separated list of expressions is given; optionally, a new field name can be assigned to an expression using the keyword \texttt{as}:
\begin{alltt}
generate x, y, (z * 2) as res
\end{alltt}
\item \texttt{on} clause: The on clause is used to specify a list of fields from the input pipe(s) used for grouping or joining.
\end{itemize}
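The clauses can be combined within one script. The following two statements are a minimal sketch built only from the constructs introduced above; the empty argument list of \texttt{file\_source}, the quoted file name, the field type \texttt{int}, and the order of the clauses are assumptions made for illustration and may differ from the actual \PipeFlow syntax:
\begin{alltt}
\$in  := file_source() using (filename = "input.csv") with (x int, y int);
\$res := filter(\$in) by x > 42;
\end{alltt}
Here the pipe \texttt{\$in} produced by the first statement serves as the input of the second statement, so the two operators form a small dataflow graph.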
\PipeFlow provides a large set of pre-defined operators to efficiently support various query domains and to provide a high degree of expressiveness. A script in \PipeFlow is a collection of these operators. Our system enables sophisticated applications and queries to be constructed easily from only a few dataflow operators. It supports stream-based variants of the well-known operators of relational database systems (e.g., filter and projection, streaming and relation joins, grouping and aggregation) as well as source operators (file and database readers, network socket readers, \dots).
\subsection{\PipeFlow Architecture}
\subsection{An Example of Mapping}
@inproceedings{Olston2008,
author = {Olston, Christopher and Reed, Benjamin and Srivastava, Utkarsh and Kumar, Ravi and Tomkins, Andrew},
booktitle = {SIGMOD},
title = {{Pig latin: a not-so-foreign language for data processing}},
year = {2008}
}
@misc{stream,
author = {{Stanford University}},
title = {{Stanford Stream Data Manager}},
howpublished = {\url{http://infolab.stanford.edu/stream/}},
}
@misc{spark,
author = {{Apache Spark}},
title = {{Spark: Lightning-fast cluster computing}},
howpublished = {\url{https://spark.apache.org}},
}
@misc{flink,
author = {{Apache Flink}},
title = {{Fast and reliable large-scale data processing engine}},
howpublished = {\url{https://flink.apache.org}},
}
@misc{storm,
author = {{Apache Storm}},
title = {{Distributed realtime computation system}},
howpublished = {\url{https://storm.apache.org}},
}
@article{Abadi:2003:ANM:950481.950485,
author = {Abadi, D. J. and Carney, D. and \c{C}etintemel, U. and Cherniack, M. and Convey, C. and Lee, S. and Stonebraker, M. and Tatbul, N. and Zdonik, S.},
title = {Aurora: a new model and architecture for data stream management},
journal = {The VLDB Journal},
issue_date = {August 2003},
volume = {12},
number = {2},
month = aug,
year = {2003},
issn = {1066-8888},
pages = {120--139},
numpages = {20},
acmid = {950485},
publisher = {Springer-Verlag New York, Inc.},
address = {Secaucus, NJ, USA},
keywords = {Continuous queries, Data stream management, Database triggers, Quality-of-service, Real-time systems},
}
@INPROCEEDINGS{Borealis,
author = {Daniel J. Abadi and Yanif Ahmad and Magdalena Balazinska and Ugur \c{C}etintemel and Mitch Cherniack and Jeong-Hyon Hwang and Wolfgang Lindner and Anurag S. Maskey and Alexander Rasin and Esther Ryvkina and Nesime Tatbul and Ying Xing and Stan Zdonik},
title = {The design of the {B}orealis stream processing engine},
booktitle = {CIDR},
year = {2005},
pages = {277--289}
}