Tuesday 17 December 2013

Talend (big data edition) integration with Hive on Hadoop – Part#1 (Write data into Hive)
In this post I will show how to use Talend to connect to a Hive database running on Hadoop, create a table, and insert/load data into that table.

Pre-requisites –
1)    Hadoop + Hive installed – I am using the Cloudera QuickStart VM (Oracle VirtualBox VM).
2)    Any source DB from which you want to source data and push to Hive – I am using the dellstore2 sample database on PostgreSQL 9.3.
3)    Talend big data edition – I am using TOS 5.4.

Overall objective of the job – create an external table “customers_ext” in Hive, read data from the dellstore2 database on PostgreSQL, and load this data into the “customers_ext” table in Hive.
Follow the steps below -
1)    As a first step I create the external table in Hive using tHiveCreateTable – see the settings below
(note the important settings, including the option to create an EXTERNAL table).

(Screenshot: tHiveCreateTable component settings)
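For reference, the DDL behind this step is roughly the following HiveQL. This is a minimal sketch, assuming a subset of the dellstore2 customers columns and a hypothetical HDFS location /user/cloudera/customers_ext; both must match your actual component settings:

    -- EXTERNAL: dropping the table removes only the metadata, not the HDFS files
    CREATE EXTERNAL TABLE customers_ext (
      customerid INT,
      firstname  STRING,
      lastname   STRING,
      city       STRING,
      state      STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','      -- must match the tHDFSOutput field delimiter
    LINES TERMINATED BY '\n'      -- must match the tHDFSOutput row separator
    LOCATION '/user/cloudera/customers_ext';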
2) Write the output in file format to HDFS using the tHDFSOutput component. To insert data into Hive we simply create a flat file first and then load the data into the table from that file using Hive commands; the Talend tHiveLoad component does this load for us.

(Screenshot: tHDFSOutput component settings)
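The source side of this step is just a SELECT against the dellstore2 database. A minimal sketch of a query you might put in the tPostgresqlInput component, assuming the same subset of customers columns as in the table sketch above (the column list must match the Hive schema):

    SELECT customerid, firstname, lastname, city, state
    FROM customers;

tHDFSOutput then writes these rows as delimited text to a staging path (for example the hypothetical /user/cloudera/stage/customers.txt), using the same field delimiter and row separator declared on the Hive table.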
3) tHiveLoad - to load data into the Hive table from the flat file. See the settings for this component below.

(Screenshot: tHiveLoad component settings)
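Under the covers, tHiveLoad issues a Hive LOAD DATA statement. A rough HiveQL equivalent, assuming the hypothetical staging path from the previous step:

    -- Moves the staged file into the table's location, making it queryable
    LOAD DATA INPATH '/user/cloudera/stage/customers.txt'
    INTO TABLE customers_ext;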
4) Finally, when we run the job, the following happens: we first create the external table in Hive and associate it with a file location and structure. Next we read data from the PostgreSQL database and write it out using the tHDFSOutput component, matching the output file format with respect to the field delimiter and row separator. Once the file has been created in HDFS, we call tHiveLoad to load the data from this file into the Hive table.
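Once the job completes, you can verify the load from the Hive shell (or with a tHiveInput component), for example:

    SELECT COUNT(*) FROM customers_ext;
    SELECT * FROM customers_ext LIMIT 10;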
