Shell Script for HDFS Ingestion Using JDBC

Peers,

I am in the process of building a script that connects to Salesforce using JDBC, pulls the data using Spark, and processes it into a Hive table. During this process I have run into a problem: a variable assigned from a hadoop command that lists files in Azure Data Lake is not picking up the value; it comes back null, and in fact the hadoop command lists local files instead. Is there a workaround? Has someone faced a similar situation?

spark@hn0-xyz:~$ hadoop fs -ls adl://ayz.xyz12345.net/hdfs/DataWareHouse/salesforce_jars/DataWareHouse.jar

-rwxrwx---+  1 spark  spark      98011 2018-10-23 12:07 adl://ayz.xyz12345.net/hdfs/DataWareHouse/salesforce_jars/DataWareHouse.jar

DataWareHouse.jar is the result I was looking for. What I am trying to do is capture that listing in a variable, roughly like the sketch below (illustrative, not my exact code):
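#Illustrative only: capture the listing in a shell variable via command substitution
JAR_LISTING=$(hadoop fs -ls adl://ayz.xyz12345.net/hdfs/DataWareHouse/salesforce_jars/DataWareHouse.jar)
echo "$JAR_LISTING"

Below is the script that picks up a few similar jars and passes them to a spark-submit job.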

#!/bin/bash

#Identification for Salesforce Class and Object
now () {
 date -d "4 hours" "+Date: %Y-%m-%d Time: %H:%M:%S"
}
echo " -----------------Job Run Time------------------------------"
echo " `now` "
echo " Spark Job For Salesforce Account Object %n Class for Account ${SALESFORCE_ACCOUNT_CLASS} "
echo " -----------------------------------------------------------"

#Check All Maven Build Configs

set MVN_LIB_PATH = "adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/DataWareHouse.jar"
mvnClass () {
hadoop fs -ls $MVN_LIB_PATH
}
mvnClass 
if [ echo $? == 0]
  then
    echo "Finding Maven Build Jar is Successful"
elif
    echo " Aborting the Job Process"
fi

#Checking for Driver and Executor Class

set JDBC_LIB_PATH = "adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/sforce.jar"
jdbcClass () {
hadoop fs -ls $JDBC_LIB_PATH 
}
jdbcClass
if [ echo $? == 0]
then
    echo " Finding JDBC Driver and Executor Jar is Successfull"
elif
    echo " Aborting the Job Process"
fi

#Compiling Spark Submit for Spark API


set SALESFORCE_ACCOUNT_CLASS = "--class com.yxzar.property.SalesForceAccount"
set ENVIRONMENT = "--master yarn"
set DEPLOY_MODE = "--deploy-mode client"
set EXECUTOR_CLASS = "--conf "spark.executor.extraClassPath=adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/sforce.jar""
set DRIVER_CLASS = "--conf "spark.driver.extraClassPat=adl://ayz.xyz123.net/emaardevhdfs/DataWareHouse/salesforce_jars/sforce.jar""
set CONNECTOR_JAR = "--jars adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/sforce.jar"
set MVN_JAR = "--verbose adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/DataWareHouse.jar"

sparkSubmit (){
 spark-submit ${SALESFORCE_ACCOUNT_CLASS} ${ENVIRONMENT} ${DEPLOY_MODE} ${EXCUTOR_CLASS} ${DRIVER_CLASS} ${MVN_JAR}
}

sparkSubmit 2>~/stdout


From the above code, both mvnClass and jdbcClass result in invalid output. Any help would be appreciated :slight_smile:

Welcome to the forum.

I can't speak to hadoop or jdbc and such, but as you seem to be using bash, I can comment on a few syntax errors in your script:

  • set MVN_LIB_PATH = "adl://ayz...jar" : that's not bash; make it VAR="value" , with no set and no spaces around the = .
  • if [ echo $? == 0] : no echo needed; make it [ $? == 0 ] (with all the spaces!), or use "command substitution" like [ $(echo $?) == 0 ] (less effective).
  • elif needs a condition and a then of its own. Methinks a plain else would do in this case... see the sketch right after this list.
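
Putting those fixes together, the first check might look like this (a minimal sketch; the ADL path is copied from your script, and the exit 1 is my guess at what "Aborting the Job Process" should do):

#Plain bash assignment: no set, no spaces around =
MVN_LIB_PATH="adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/DataWareHouse.jar"

hadoop fs -ls "$MVN_LIB_PATH"
if [ $? == 0 ]
then
    echo "Finding Maven Build Jar is Successful"
else
    #else, not a condition-less elif; the one fi still closes the whole if
    echo "Aborting the Job Process"
    exit 1
fi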

Aside: while functions are a very valuable tool for scripting / programming, I can't see the benefit in the above, as they are all one-liners, each called exactly once.
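
If you do keep the functions, passing the path as an argument would at least let one definition serve both checks. A sketch (checkJar is a name I made up; it relies on hadoop fs -ls returning a non-zero exit status when the file is missing):

#One definition, called once per jar; "$1" is the path passed in
checkJar () {
    if hadoop fs -ls "$1"
    then
        echo "Found $1"
    else
        echo "Aborting the Job Process: $1 not found"
        exit 1
    fi
}

checkJar "adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/DataWareHouse.jar"
checkJar "adl://ayz.xyz123.net/hdfs/DataWareHouse/salesforce_jars/sforce.jar"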

Correct your errors, run the script and report back.


Thanks, RudiC.

Your suggestion worked. I was able to achieve this.