Cascading

by robin · Published June 30, 2014 · Updated June 30, 2014

Source: http://www.cascading.org

Cascading is a proven application development platform for building data applications on Apache Hadoop. Whether you are solving simple or complex data problems, Cascading balances an optimal level of abstraction with the necessary degrees of freedom through a computation engine, a systems integration framework, and data processing and scheduling capabilities.

Java API

Cascading is a Java library and does not require installation. It fits directly into a standard development process: to use it, you only need to call its APIs.

Data Processing API

The data processing API defines data processing flows. It provides a rich set of operations, such as sort, average, filter, and merge, that let you think in terms of the data and the business problem.
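As a rough illustration of thinking in terms of the data rather than the execution engine, the sketch below chains a filter, a grouping, an average, and a sort over a small in-memory list. It uses only the JDK, not Cascading's own classes; the `Sale` and `FlowSketch` names and the sample fields are hypothetical. In Cascading itself, comparable steps would be declared as a pipe assembly and run as a flow.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical record type for illustration; not a Cascading class.
final class Sale {
    final String region;
    final double amount;

    Sale(String region, double amount) {
        this.region = region;
        this.amount = amount;
    }
}

final class FlowSketch {
    // Filter: drop non-positive amounts.
    // Group and average: mean amount per region.
    // Sort: the TreeMap keeps regions in alphabetical order.
    static SortedMap<String, Double> averageByRegion(List<Sale> sales) {
        return sales.stream()
                .filter((Sale s) -> s.amount > 0)
                .collect(Collectors.groupingBy(
                        (Sale s) -> s.region,
                        TreeMap::new,
                        Collectors.averagingDouble((Sale s) -> s.amount)));
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
                new Sale("west", 10.0),
                new Sale("east", 4.0),
                new Sale("west", 20.0),
                new Sale("east", -1.0)); // removed by the filter
        System.out.println(averageByRegion(sales));
    }
}
```

The method describes what should happen to the data and says nothing about where the data comes from or how the work is executed, which is the separation the paragraph above describes.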

Data Integration API

The data integration API allows you to isolate your integration dependencies from your business logic. You can read data from a variety of external systems into Hadoop, and then write the results to another system.

Scheduler API

The Scheduler API can schedule work from third-party applications. The Process Scheduler, coupled with the Riffle life-cycle annotations, allows Cascading to schedule units of work from any third-party application.

Process Planner

Cascading’s physical planner automatically creates MapReduce jobs ready for processing on your cluster.
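To give a feel for the kind of decision such a planner makes, here is a deliberately simplified toy in plain Java. It assumes, as a simplification and not as a description of Cascading's actual planner, that each grouping step needs its own shuffle, so the flow is cut into a new job whenever a second grouping appears. The step names and the `ToyPlanner` class are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ToyPlanner {
    // Splits an ordered list of step names into jobs.
    // Assumption for this toy: a job may contain at most one step
    // whose name starts with "group".
    static List<List<String>> plan(List<String> steps) {
        List<List<String>> jobs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean currentHasGrouping = false;

        for (String step : steps) {
            boolean isGrouping = step.startsWith("group");
            if (isGrouping && currentHasGrouping) {
                // A second grouping cannot share the job: close it and start a new one.
                jobs.add(current);
                current = new ArrayList<>();
                currentHasGrouping = false;
            }
            current.add(step);
            currentHasGrouping = currentHasGrouping || isGrouping;
        }
        if (!current.isEmpty()) {
            jobs.add(current);
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<String> steps = Arrays.asList(
                "filter", "groupByRegion", "average", "groupByAverage", "count");
        // prints [[filter, groupByRegion, average], [groupByAverage, count]]
        System.out.println(plan(steps));
    }
}
```

The point of the sketch is only that the developer lists logical steps and something else decides the job boundaries; a real planner also has to handle joins, branching, and intermediate storage.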

Taps and Schemes

Taps and Schemes enable reading from and writing to any source, in any format. Cascading comes with several pre-built taps and schemes and also lets you quickly build your own.
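The division of labor can be sketched in plain Java: one object knows where the data lives, another knows how a record is laid out, and the business logic sees neither. The classes below are hypothetical stand-ins written against the JDK only; Cascading's real `Tap` and `Scheme` classes have richer APIs that are not reproduced here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a "scheme": how one line maps to fields.
final class DelimitedFormat {
    private final String delimiter;

    DelimitedFormat(String delimiter) {
        this.delimiter = delimiter;
    }

    List<String> parse(String line) {
        // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields.
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    String format(List<String> fields) {
        return String.join(delimiter, fields);
    }
}

// Hypothetical stand-in for a "tap": where the data lives (here, a local file).
final class FileEndpoint {
    private final Path path;
    private final DelimitedFormat format;

    FileEndpoint(Path path, DelimitedFormat format) {
        this.path = path;
        this.format = format;
    }

    List<List<String>> readAll() {
        try {
            List<List<String>> records = new ArrayList<>();
            for (String line : Files.readAllLines(path, StandardCharsets.UTF_8)) {
                records.add(format.parse(line));
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void writeAll(List<List<String>> records) {
        try {
            List<String> lines = new ArrayList<>();
            for (List<String> record : records) {
                lines.add(format.format(record));
            }
            Files.write(path, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class CopyJob {
    // The logic sees only records; location and format are supplied by the caller.
    static void copy(FileEndpoint source, FileEndpoint sink) {
        sink.writeAll(source.readAll());
    }
}
```

With this split, converting a comma-delimited file to a tab-delimited one means constructing two endpoints with different formats and calling `CopyJob.copy`; the copying logic itself does not change.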

Standard Relational Operations

Many common operations used in relational environments, such as regular expression operations, Java expression operations, XML operations, and logical filter operations, are available in Cascading.

Scriptable Interface

Any Java-compatible scripting language can import and instantiate Cascading classes, create pipe assemblies and flows, and execute those flows. Users can also create their own DSLs to handle common idioms.

Local mode / In-Memory mode

On a single node, Cascading’s local mode lets you efficiently test code and process local files before deploying to a cluster. This built-in testability allows debugging before production deployment.

Dynamic Programming Languages

The Cascading community has built dynamic programming languages on top of the Java API for greater productivity. There are several to choose from: Lingual (ANSI SQL), Pattern (PMML), Scalding (Scala), Cascalog (Clojure) and more!

Hadoop Support

Cascading runs on all popular Hadoop distributions and Hadoop-as-a-service providers. We ensure that Cascading can run on-premise or in the cloud to meet your deployment needs.

WHO IS CASCADING FOR?

Enterprise Development

Cascading was designed to fit into any enterprise development environment. With a clear separation between “data processing” and “data integration”, its clean Java API, and JUnit testing framework, Cascading can easily be tested and deployed at any scale.

Data Science

Because Cascading is Java-based, it works naturally with JVM-based languages like Scala, Clojure, JRuby, Jython, and Groovy. In many of these languages, the Cascading community has created scripting and query languages that simplify ad hoc and production-ready analytics as well as machine learning applications.

WHAT ARE TYPICAL USE CASES FOR CASCADING?

Typical use cases for Cascading range from the complex (e.g. data processing/ETL applications) to the cutting-edge (e.g. geolocation and genomics).

For a list, see USE CASES

For specific examples from leading companies, check out our CASE STUDIES

HOW IS CASCADING DIFFERENT FROM PIG AND HIVE?

When it comes to application development, both Pig and Hive have shortcomings when dealing with complexity and testing, while Cascading applications are built to scale. With Pig, easy problems can be solved easily, but harder problems become quite complicated. With Hive, the language is not compliant with ANSI SQL standards, which makes its applicability and interoperability with existing SQL systems challenging. Furthermore, Hive is non-deterministic, which makes its behavior difficult to predict compared with Cascading’s deterministic planner.

Both Pig and Hive will generally require significantly more code for integration and for incorporating business logic. Cascading separates business logic from integration logic and has system integration capabilities built in. With Pig and Hive it is also notoriously difficult to build complex data workflows, and equally hard to troubleshoot your data applications. Cascading allows developers to write unit tests and follow test-driven deployment best practices at scale.

Feature                            Cascading  Pig  Hive
Ad hoc queries                     ✔          ✔    ✔
Complex workflows                  ✔          ✔
Create unit tests                  ✔
Application portability            ✔
Pluggable data sources             ✔
Extensible into other languages    ✔
Support for non-Hadoop platforms   ✔
No installation required           ✔

Use Cases

BUILD MISSION-CRITICAL APPLICATIONS WITH CASCADING

Enterprise IT

Extract Transform Load
Log File Analysis
Systems Integration
Operations Analysis

Corporate Apps

HR Analytics
Employee Behavioral Analysis
Customer Support | eCRM
Business Reporting

Telecom

Data processing of Open Data
Geospatial Indexing
Consumer Mobile Apps
Location-based services

Marketing / Retail

Mobile, Social, Search Analytics
Funnel analysis
Revenue attribution
Customer experiments
Ad Optimization
Retail recommenders

Consumer / Entertainment

Music Recommendation
Comparison Shopping
Restaurant Rankings
Real Estate
Rental Listings
Travel Search & Forecast

Finance

Fraud and Anomaly Detection
Fraud Experiments
Customer Analytics
Insurance Risk Metric

Health / Biotech

Aggregate metrics for government
Person biometrics
Veterinary diagnostics
Next-Gen Genomics
Agronomics
Environmental Maps

URLs
http://www.cascading.org/projects/cascading/
http://www.cascading.org/use-cases/
