Being Productive with HDInsight

This post is a running list of miscellaneous tools and tips for working with HDInsight.

Build Tools

1. Apache Ant (http://ant.apache.org/manual/install.html)

Extract the archive to c:\ant\, then set ANT_HOME and add Ant's bin folder to the PATH:

set ANT_HOME=c:\ant

set PATH=%PATH%;%ANT_HOME%\bin

2. Apache Ivy (http://ant.apache.org/ivy/history/latest-milestone/install.html)

  • Copy the Ivy JAR into Ant's lib folder (%ANT_HOME%\lib)

3. Git Client (http://git-scm.com/downloads)
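
Once the three build tools are installed, a quick sanity check can save a failed build later. The following is a minimal sketch in Python, not part of any of these tools: the function name check_build_tools is made up for illustration, and it assumes the Ivy JAR file name starts with "ivy" and that the file sits in the lib folder under ANT_HOME.

```python
import glob
import os
import shutil


def check_build_tools(ant_home=None):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    ant_home = ant_home or os.environ.get("ANT_HOME")

    # Ant and Git must be reachable through PATH.
    for tool in ("ant", "git"):
        if shutil.which(tool) is None:
            problems.append(tool + " not found on PATH")

    # Assumption: the Ivy JAR is named ivy*.jar and lives in ANT_HOME\lib.
    if not ant_home:
        problems.append("ANT_HOME is not set")
    else:
        lib_dir = os.path.join(ant_home, "lib")
        if not glob.glob(os.path.join(lib_dir, "ivy*.jar")):
            problems.append("no ivy*.jar in " + lib_dir)

    return problems


if __name__ == "__main__":
    for problem in check_build_tools():
        print(problem)
```

Run it from a new command prompt so that the PATH and ANT_HOME changes are visible; no output means nothing was flagged.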

Data Preparation/Research Tools

1. cURL (http://curl.haxx.se/download.html)

2. Cygwin (http://cygwin.com)

3. Enthought Python Distribution (EPD) (http://www.enthought.com/products/epd.php)

4. GNU Parallel (ftp://ftp.gnu.org/gnu/parallel/)

PiggyBank

Community-contributed user-defined functions (UDFs) for Pig

  • Retrieve source from Git:
    git clone https://github.com/apache/pig.git
    
    cd pig
    
    git checkout -b branch-0.9 remotes/origin/branch-0
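  • Build Pig and then PiggyBank using Ant: run ant in the pig folder, then run ant again in contrib\piggybank\java, which produces piggybank.jar in that folder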
  • Pig Script:
    -- myscript.pig
    REGISTER C:\Users\Administrator\pig\contrib\piggybank\java\piggybank.jar;
    
    A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
    
    -- UPPER ships in PiggyBank's evaluation.string package
    B = FOREACH A GENERATE org.apache.pig.piggybank.evaluation.string.UPPER(name);
    
    DUMP B;
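
To try the script you need a student_data file. Because the LOAD statement names no load function, Pig falls back to PigStorage, which expects tab-separated fields. The following is a small sketch for generating test input with Python; the helper names and the sample records are made up for illustration and do not come from Pig or PiggyBank.

```python
def format_rows(rows):
    """Render (name, age, gpa) tuples as tab-separated lines, one record per line."""
    return "".join("\t".join(str(value) for value in row) + "\n" for row in rows)


def write_student_data(path, rows):
    """Write the records to path in the layout the LOAD statement expects."""
    with open(path, "w", newline="") as f:
        f.write(format_rows(rows))


def expected_names(rows):
    """Names as the script should DUMP them, assuming UPPER upper-cases its input."""
    return [name.upper() for name, _age, _gpa in rows]


# Hypothetical sample records, for illustration only.
SAMPLE_ROWS = [("john", 18, 4.0), ("mary", 19, 3.8), ("bill", 20, 3.9)]
```

Calling write_student_data("student_data", SAMPLE_ROWS) creates the file in the current folder; copy it to wherever the script's LOAD path resolves on your cluster, then compare the DUMP output with expected_names(SAMPLE_ROWS).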