1) One of the key benefits is low latency for executing SQL queries on top of Hadoop. Part of this comes from bypassing the MapReduce infrastructure, which involves significant overhead, especially when starting and stopping JVMs.
2) Cloudera also claims performance improvements of several orders of magnitude compared to executing the same SQL queries using Hive.
3) Another benefit is that if we want to look under the hood at what Cloudera has provided in Impala, or tinker with the code, the source code is available to access and download.
Impala is a SQL query system for Hadoop from Cloudera. Cloudera positions Impala as a "real-time" query engine for Hadoop. By "real-time" they mean that, rather than running batch-oriented jobs as with MapReduce, we can get much faster results for certain types of queries by using Impala through a SQL-based front end.
It does not rely on the MapReduce infrastructure of Hadoop. Instead, Impala implements a completely separate engine for processing queries: a specialized distributed query engine similar to what you can find in some commercial parallel relational databases. In essence, it bypasses MapReduce.
1) The SQL syntax that Hive supports is quite restrictive. For example, we are not allowed to use sub-queries, which are very common in the SQL world. There are no windowed aggregates, and ANSI joins are not allowed. There are also many other joins that SQL developers are used to which we cannot use with Hive.
2) The other restriction that is quite limiting is the set of supported data types. For example, Hive is severely lacking when it comes to Varchar or Decimal support.
3) When it comes to client support, the JDBC and ODBC drivers are quite limited, and there are concurrency problems when accessing Hive through these client drivers.
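To make the constructs named in point 1 concrete, here is a minimal sketch of a sub-query and an ANSI-style join. It runs against Python's built-in SQLite engine purely as a stand-in. The `customers` and `orders` tables and their contents are hypothetical, and the example says nothing about which Hive versions accept or reject these statements.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (assumption:
# this is not Hive; it just shows what the constructs look like).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 70.0), (12, 2, 20.0);
""")

# Sub-query: customers who placed at least one order above the average amount.
subquery_rows = conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders
                 WHERE amount > (SELECT AVG(amount) FROM orders))
    ORDER BY name
""").fetchall()

# ANSI join syntax: an explicit LEFT OUTER JOIN ... ON clause, so customers
# without orders still appear in the result.
join_rows = conn.execute("""
    SELECT c.name, COUNT(o.id), COALESCE(SUM(o.amount), 0)
    FROM customers c
    LEFT OUTER JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(subquery_rows)
print(join_rows)
```

Windowed aggregates are left out of the sketch because they depend on the SQLite version bundled with Python.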
1) Impala isn't a GA offering yet. As a beta offering, it has several limitations in functionality and capability; for example, several data sources and file formats aren't yet supported.
2) Also, ODBC is currently the only client driver available, so if we have JDBC applications we are not able to use them directly yet.
3) Another Impala drawback is that it's only available for use with Cloudera's distribution of Hadoop, that is, CDH 4.1.
IBM claims that Big SQL provides robust SQL support for the Hadoop ecosystem:
- it has a scalable architecture;
- it supports SQL and data types available in SQL '92, plus it has some additional capabilities;
- it supports JDBC and ODBC client drivers;
- it has efficient handling of "point queries" (and we'll get to what that means);
- it supports a wide variety of data sources and file formats for HDFS and HBase;
- and although it is not open source, it interoperates well with the open source ecosystem within Hadoop.
A Data Wizard is someone who can consistently derive money out of data, e.g. working as an employee, consultant or in another capacity, by providing value to clients or extracting value for himself out of data. Even a guy who designs statistical models for sports bets, and uses his strategies for himself alone, is a data wizard. Rather than knowledge, what makes a data wizard successful is craftsmanship, intuition and vision, which let him compete with peers who share the same knowledge but lack these other skills.
Impala is a SQL query system for Hadoop from Cloudera. It is currently in beta. It has been open-sourced, and its source can be downloaded from GitHub. It supports the same SQL syntax, ODBC driver and user interface (Beeswax) as Apache Hive. We can use it to query data whether it is stored in HDFS or Apache HBase, and we can do selects, joins and aggregate functions.
The Big SQL engine analyzes incoming queries. It separates the portions to execute at the server from the portions to be executed by the cluster. It rewrites queries if necessary for improved performance, determines the appropriate storage handler for the data, produces the execution plan, and executes and coordinates the query.
IBM architected Big SQL with the goal that existing queries should run with few or no modifications and that queries should be executed as efficiently as the chosen storage mechanisms allow. Rather than build a separate query execution infrastructure, they made Big SQL rely heavily on Hive, so much of the data manipulation language, the data definition language syntax, and the general concepts of Big SQL are similar to Hive. Big SQL also shares catalogs with Hive via the Hive metastore. Hence each can query the other's tables.
Big SQL is a culmination of numerous research and development projects at IBM. IBM has taken the work from these various projects and released it as a technology preview called Big SQL.
Big SQL is based on a multi-threaded architecture, so it is good for performance. Scalability in a Big SQL environment essentially depends on the Hadoop cluster itself, that is, its size and scheduling.
Hive can be thought of as a data warehouse infrastructure for providing summarization, query and analysis of data that is managed by Hadoop. Hive provides a SQL interface for data that is stored in Hadoop. It implicitly converts queries into MapReduce jobs so that the programmer can work at a higher level than he or she would when writing MapReduce jobs in Java. Hive is an integral part of the Hadoop ecosystem that was initially developed at Facebook and is now an active Apache open source project.
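As a rough illustration of what "converting a query into MapReduce jobs" means, the sketch below expresses a query like SELECT page, COUNT(*) FROM visits GROUP BY page as map, shuffle and reduce steps in plain Python. This is a conceptual toy, not Hive's actual planner or generated code. The `visits` rows and all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (key, 1) pair per input row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input standing in for a "visits" table stored in Hadoop.
visits = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]

counts = reduce_phase(shuffle(map_phase(visits)))
print(counts)
```

With Hive, the programmer writes only the one-line query. Steps like these are what would otherwise have to be hand-written as MapReduce code in Java.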