Downloading an HDFS file to local UNIX through a UNIX script

Hi All,
I am very new to UNIX scripting. I am aware of UNIX commands but have never put them together at the script level. If anyone can offer technical guidance on the scenario below, it will be highly beneficial.

Data has already been migrated from a mainframe to the Hadoop file system (HDFS). The HDFS server is hosted on a UNIX box, and the HDFS file is just a .txt file. We currently download the HDFS file to the local UNIX file system with an HDFS command and then add a sequence number to the local UNIX file. We want to automate this process for any file.

Currently we are using the steps below:

  1. Downloading the HDFS file to the local UNIX system with:
    hdfs dfs -copyToLocal (HDFS file path) (Local directory path)
  2. Adding a sequence-generated number:
    awk '{printf "%06d,",NR} 1' File.txt >File_Output.txt

We need to automate the above process through a UNIX script for any user-provided input file. If anyone can provide a technical approach or block of code to automate the above scenario, that will be a real help. Thanks in advance!

---------- Post updated at 08:24 AM ---------- Previous update was at 03:52 AM ----------

Hi All,
I have written the below script, which works for a single file. I have hardcoded the HDFS path in the script itself, and the script creates a pipe-delimited, sequence-numbered output file in the local directory. We just need to modify this script to take any user-provided input file. Can anyone suggest how to generalize this script so that the input file is not hardcoded? Any help in this regard will be highly appreciated. Thanks.

#! /bin/bash
#Downloading HDFS file to Local Unix & Reformatting

hdfs dfs -copyToLocal /user/target/file.txt .

awk '{printf "%06d|",NR} 1' file.txt >output.txt

What exactly would they be inputting?

We need to provide the directory name and file name. Suppose a file called input.txt is located in the HDFS directory /user/target. We can pass the full file path like /user/target/input.txt, or
<input_directory> <sourcefile> as two separate parameters.

case "$#" in
2)
        hdfs dfs -copyToLocal "$1"/"$2" .
        FILE="$2"
        ;;
1)
        hdfs dfs -copyToLocal "$1" .
        OLDIFS="$IFS"
        # Split $1="a/b/filename" into $1="a", $2="b", $3="filename"
        IFS="/"
                set -- $1
        IFS="$OLDIFS"

        # Get rid of "a", "b"
        shift "$(( $# - 1 ))"

        FILE="$1"
        ;;
*)
        echo "Usage:  $0 path file"
        echo "Alternate usage:  $0 path/file"
        exit 1
        ;;
esac

awk '{printf "%06d|",NR} 1' "$FILE" >output.txt

Hi Corona688 ,

Thanks a lot for your reply. As I'm new to shell scripting, I am not clear on the block of code you have provided, so please bear with me; I kindly request an explanation.
Which case does each block of code handle:
case 1: /user/target/input.txt as a single parameter, or
case 2: <input_directory> <sourcefile> as two separate parameters?

If you kindly explain your block of code, that will be really helpful for me. Thanks!

$# is a special variable meaning "number of arguments". It gets fed into a 'case' statement to run different code for different values.

When there's 2 arguments, it does

        hdfs dfs -copyToLocal "$1"/"$2" .
        FILE="$2"

When there's 1 argument, it does

        hdfs dfs -copyToLocal "$1" .
        OLDIFS="$IFS"
        # Split $1="a/b/filename" into $1="a", $2="b", $3="filename"
        IFS="/"
                set -- $1
        IFS="$OLDIFS"

        # Get rid of "a", "b"
        shift "$(( $# - 1 ))"

        FILE="$1"
        ;;

IFS is a special variable used by the shell to control splitting, and set sets the $1 $2 ... arguments to what you tell it to, used together it splits a "/path/to/file" string into "path", "to", "file".

The shift gets rid of the first $# - 1 arguments to leave the last one.
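As a standalone illustration of that splitting trick (the sample path here is made up, and no HDFS is involved):

```shell
#!/bin/bash
# Split a path on "/" using IFS + set, then keep only the last component.
path="user/target/input.txt"

OLDIFS="$IFS"
IFS="/"
set -- $path            # now $1="user", $2="target", $3="input.txt"
IFS="$OLDIFS"

shift "$(( $# - 1 ))"   # discard all arguments except the last one
FILE="$1"
echo "$FILE"            # prints: input.txt
```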

Then it runs awk '{printf "%06d|",NR} 1' "$FILE" >output.txt and exits.

P.S. I might change that awk into

awk '{printf "%06d|",NR} 1 ; END { printf "\n" }' "$FILE" >output.txt

...to add a newline to the end of the file. Text files that don't end in a newline can confuse certain programs.
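A quick standalone check, using a made-up input file whose last line is missing its newline (note that awk's print has already terminated the last record, so the END printf contributes one extra, unnumbered newline):

```shell
# Sample input deliberately written without a trailing newline.
printf 'alpha\nbeta' > in.txt

awk '{printf "%06d|",NR} 1 ; END { printf "\n" }' in.txt > out.txt

head -n 1 out.txt
# 000001|alpha
```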

Hi Corona688 ,

Thanks for your explanation. We need to modify this script. We used this block of code for one file in a directory, with the corresponding reformatting for that one file. Could you kindly suggest how to automate this for multiple files in a directory? Suppose 10 files are in the target directory (/user/target), all of them .txt files; how do we automate this for all 10 files so that it creates 10 reformatted (sequence-numbered) files, each under a different name? If you kindly advise me how to automate the above scenario, it will be really beneficial. Thanks.

#! /bin/bash
#Downloading HDFS file to Local Unix & Reformatting


hdfs dfs -copyToLocal "$1"/"$2" .

FILE="$2"

awk '{printf "%06d,",NR} 1 ; END { printf "\n" }' "$FILE" >output.txt

Sorry for missing this.

How about this:

#! /bin/bash
#Downloading HDFS file to Local Unix & Reformatting

N=0

DIR="$1" ; shift

while [ "$#" -gt 0 ]
do
    hdfs dfs -copyToLocal "$DIR"/"$1" .

    FILE="$1"

    awk '{printf "%06d,",NR} 1 ; END { printf "\n" }' "$FILE" >output${N}.txt
    let N=N+1
    shift
done

Run it like ./script.sh folder file1 file2 file3 file4 ...

Otherwise explain in detail what you really need.


Hi Corona688 ,
Thanks a lot for your reply. When I run the below script (./script.sh /user/target file1 file2 ) for multiple files, I get an error like "no such file or directory", although the files are present in the directory. It would also be highly beneficial if you could kindly explain this script. Thanks!

Hi Corona688 ,
Whenever I run the script you provided (./script.sh /user/target file1 file2 ) for multiple files, I get the error below, although file1 and file2 are present in the /user/target directory.

-bash: ./Script.sh: /bin/bash^M: bad interpreter: No such file or directory

Could you kindly look into this? It would be really helpful if you could assist me in this regard. If possible, could you please explain the block of code you provided for multiple files. Thanks!

UNIX and Linux system tools use the <newline> character as the line terminator; not the Windows <carriage-return><newline> character pair. So, you should get this error no matter how many file operands you pass to this script. Remove the <carriage-return> characters from your script and it will probably work as expected. However, in theory, there should be no space before the interpreter name in the first line of the script, so I would change the first line in the script to:

#!/bin/bash

in case your operating system is picky about this issue.
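One way to strip the carriage returns is with tr (dos2unix, where installed, does the same job; the file created below is a made-up stand-in for the failing script):

```shell
# Simulate a script saved with Windows CRLF line endings.
printf '#!/bin/bash\r\necho hello\r\n' > Script.sh

# Delete every carriage-return character, then replace the original file.
tr -d '\r' < Script.sh > Script.tmp && mv Script.tmp Script.sh

head -n 1 Script.sh     # now exactly "#!/bin/bash", with no trailing ^M
```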


Hi Don/Corona688 ,
Whenever I use the below script for a single file, it works fine for me.

#! /bin/bash
#Downloading HDFS file to Local Unix & Reformatting

hdfs dfs -copyToLocal "$1"/"$2" .

FILE="$2"

awk '{printf "%06d,",NR} 1 ; END { printf "\n" }' "$FILE" >output.txt

But whenever I run the below script for multiple files, it throws an error.

#! /bin/bash
#Downloading HDFS file to Local Unix & Reformatting

N=0

DIR="$1" ; shift

while [ "$#" -gt 0 ]
do
    hdfs dfs -copyToLocal "$DIR"/"$1" .

    FILE="$1"

    awk '{printf "%06d,",NR} 1 ; END { printf "\n" }' "$FILE" >output${N}.txt
    let N=N+1
    shift
done

The error is shown below:

./Script1.sh: /bin/bash^M: bad interpreter: No such file or directory

It would be really beneficial if you could kindly help me in this regard. Thanks!

Look at message #12 in this thread. I told you exactly what to do. You didn't do what I told you to do, or you wouldn't be getting the error saying that it can't find /bin/bash<carriage-return> (AKA /bin/bash^M ).

Get rid of all of the <carriage-return> characters in your failing script and get rid of the space in the 1st line in that script. Then your script will at least have a chance to start running.

Hi Don/Corona688 ,
Thanks a lot for your reply. As you suggested, after removing the carriage returns from the script, it works fine for me. As I am new to shell scripting, the script is not totally understandable to me. If you kindly explain the below block of code, that will be really helpful for me.

#!/bin/bash
#Downloading HDFS file to Local Unix & Reformatting

N=0

DIR="$1" ; shift

while [ "$#" -gt 0 ]
do
    hdfs dfs -copyToLocal "$DIR"/"$1" .

    FILE="$1"

    awk '{printf "%06d,",NR} 1 ; END { printf "\n" }' "$FILE" >output${N}.txt
    let N=N+1
    shift
done

I am not sure why shift is being used in the above code. And if the DIR variable holds the value of "$1", then why does the FILE variable again hold the value of "$1"? Kindly bear with me, and if you please explain the above code block, that will be really beneficial for my understanding. Thanks!

When you have working code and you don't understand what it does, the man pages on your system are always a good starting point. Since the shift utility is a shell built-in, it might have its own man page ( man shift ) or it might be described on the man page for your shell ( man bash ). I have reformatted your script and added comments. If the man pages and the following comments don't clear it up for you, please ask more detailed questions about the parts you don't understand...

#!/bin/bash
# Usage: scriptname directory file...

#Downloading HDFS file to Local Unix & Reformatting
N=0		# Initialize output file counter.
DIR="$1"	# Save 1st command line operand as source directory for files to
		# be downloaded.
shift		# Discard 1st command line argument and renumber remaining
		# arguments.
while [ "$#" -gt 0 ]	# While there are any file operands left to process...
do
    FILE="$1"		# Set FILE to the name of the next file to download
    hdfs dfs -copyToLocal "$DIR/$FILE" .	# Download the file

    awk '				# Use awk to read downloaded file
	{    printf "%06d,",NR		# For each line read, print a six digit
	}				# leading 0 filled line number
	1				# Followed by the contents of the line
	END {printf "\n"		# Add an empty line (without a line
					# number) to the end of the file
	}' "$FILE" > output${N}.txt	# Read input from the current file and
					# redirect the output to the next
					# numbered output file.
    let N=N+1		# Increment the output file counter
    shift		# Discard the current file argument and renumber
    			# remaining arguments.
done			# End the while loop.

If I was writing this code, I might use a simpler script:

#!/bin/bash
# Usage: scriptname directory file...

#Downloading HDFS file to Local Unix & Reformatting
N=0
DIR="$1"
shift
for FILE in "$@"
do
	hdfs dfs -copyToLocal "$DIR/$FILE" .
	nl -w 6 -n rz -ba -s "" "$FILE" > output$N.txt
	N=$((N + 1))
done

Other than the hdfs utility that your script was using, this script only uses constructs required by the POSIX standards and the Single UNIX Specifications, so it should work with any shell that supports basic POSIX shell requirements (such as bash , ksh , ash , dash , zsh , and several others).
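For reference, here is that nl invocation run on a made-up two-line sample. With -s "" the number and text are joined directly; -s "," would reproduce the comma that the awk version inserted:

```shell
printf 'first\nsecond\n' > sample.txt

# -w 6: six-character number field; -n rz: right-justified, zero-filled;
# -ba: number every body line; -s: string separating the number from the text.
nl -w 6 -n rz -ba -s ""  sample.txt   # 000001first ...
nl -w 6 -n rz -ba -s "," sample.txt   # 000001,first ...
```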

PS I forgot to mention that the above replacement does not add the unnumbered empty line to the end of the output files that the awk script produces. Do you really want/need to add an empty line to the output files?