Extract file names based on a pattern

Hello All,
I have multiple files in a Hadoop /tmp/cloudera directory.
The filenames are as follows:

ABC_DATA_BAD5A_RO_F_20161104.CSV
ABC_DATA_BAD6C_VR_F_20161202.CSV
ABC_DATA_BAD7A_TR_F_20162104.CSV
ABC_DATA_BAD2A_BR_F_20161803.CSV
ABC_DATA_BAD3T_KT_F_20160106.CSV

I just need the filenames changed in the output directory.
I want the filenames to be as below:

BAD5A_RO
BAD6C_VR
BAD7A_TR
BAD2A_BR
BAD3T_KT

The logic is: the command should look for "DATA_" and pick the rest of the filename up to "_F".

I am looking for a grep or egrep command, or some other code.
I'm still trying to figure it out.
I need a few suggestions.

You can do that with a simple variable expansion:

filename="ABC_DATA_BAD5A_RO_F_20161104.CSV"
ftmp="${filename##*DATA_}"         # gives "BAD5A_RO_F_20161104.CSV"
result="${ftmp%%_F*}"              # gives "BAD5A_RO"

You can get the same result with many other text filters in UNIX: sed, awk, ... All these methods will be far slower than the variable expansion, though, even though the expansion takes an intermediate step. It is possible to put it all into one step, but that would be ugly and cumbersome, while this remains readable and understandable.
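
For illustration only, a rough sed equivalent of the two expansions above (which spawns an extra process, exactly the overhead just mentioned) might look like this:

printf '%s\n' "$filename" | sed 's/.*DATA_//; s/_F.*//'     # prints "BAD5A_RO"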

I hope this helps.

bakunin

Thanks Bakunin...
The solution you provided is for one file, so if I have multiple files it will be a tedious job.
Can it be done with one command/script, putting those file names into some other file?
I just need the file names.

You could try:

for filename in *DATA_*_F*		# for filenames like ABC_DATA_BAD5A_RO_F_20161104.CSV
do	ftmp=${filename##*DATA_}	# gives "BAD5A_RO_F_20161104.CSV"
	result=${ftmp%%_F*}		# gives "BAD5A_RO"
	printf '%s\n' "$result"
done > list.txt

This extends bakunin's suggestion to work on all of the files in the current working directory that match the filename pattern you specified, and puts the results in a file named list.txt in the same directory.

Of course, all of this assumes that you are using a shell that meets the POSIX standard's requirements for the shell. In the future, when asking questions like this, please tell us what shell and what operating system you're using so we don't have to make so many assumptions.

In case you want to try something different, issue the following command in /tmp/cloudera:

ls | perl -MFile::Copy=move -anlF'_' -e 'move $_, "$F[2]_$F[3]" if /_DATA_/ && @F > 3'
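
If you only want to print the new names instead of renaming the files, a print-only variant of the same one-liner (an untested sketch) would be:

ls | perl -anlF'_' -e 'print "$F[2]_$F[3]" if /_DATA_/ && @F > 3'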

Thanks Don Cragun.
I tested your code and it ran successfully for the files available in the current directory.

The challenge is that my files are at a Hadoop location, and I access those files from my bash prompt using the command below:

hdfs dfs -ls /user/cloudera/prod/SMS

I want the code to run for the files available at this Hadoop location (hdfs dfs -ls /user/cloudera/prod/SMS).
I am trying to figure out a solution for this.

As a first wild guess, I would try:

cd /user/cloudera/prod/SMS
for filename in *DATA_*_F*
do	ftmp=${filename##*DATA_}	# gives "BAD5A_RO_F_20161104.CSV"
	result=${ftmp%%_F*}		# gives "BAD5A_RO"
	printf '%s\n' "$result"
done > list.txt

and if that doesn't work, and assuming that the command:

hdfs dfs -ls /user/cloudera/prod/SMS

gives you a list of filenames separated by sequences of spaces, tabs, and/or newline characters and that none of your filenames contain any space, tab, or newline characters, I would also try:

for filename in $(hdfs dfs -ls /user/cloudera/prod/SMS/*DATA_*_F*)
do	ftmp=${filename##*DATA_}	# gives "BAD5A_RO_F_20161104.CSV"
	result=${ftmp%%_F*}		# gives "BAD5A_RO"
	printf '%s\n' "$result"
done > list.txt

I have absolutely no experience with hadoop filesystems or utilities, so I have no confidence that either of these will work.
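
If neither of those works (for example, because the loop over the command substitution also picks up the permission, size, and date fields of the listing), one more thing worth trying, again only a sketch, and assuming that hdfs dfs -ls prints one listing line per file with the full path as its last whitespace-separated field, is to let awk pull the names out of the listing itself:

hdfs dfs -ls /user/cloudera/prod/SMS |
awk -F/ '/DATA_.*_F/ {			# keep only listing lines whose path matches the pattern
	name = $NF			# the last "/"-separated field is the filename
	sub(/.*DATA_/, "", name)	# drop everything up to and including "DATA_"
	sub(/_F.*/, "", name)		# drop "_F" and everything after it
	print name
}' > list.txt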
