Script to read file and extract data by matching pattern

pradeepmacha · June 2, 2011, 2:10am

Hello,
I have a file (say file1) that has lines like the ones below.

xxxx:xxxx,yyyy,1234,efgh
zzzz:zzzz,kkkk,pppp,1234,xxxx,uuuu,oooo
dddd:dddd

Here the word before ":" (i.e. xxxx) is a file name, and the strings after ":" are also file names, separated by ",".
In the first line, xxxx depends on the files xxxx,yyyy,1234,efgh.
In the second line, zzzz depends on the files kkkk,pppp,1234,xxxx,uuuu,oooo.

Each file can have no dependencies (like line 3) or several, and the first name in the dependency list is always the file itself.
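
To make the format concrete, here is a minimal Perl sketch of how such lines could be read into a hash. The sub name parse_deps is made up for illustration, and dropping the self-entry from each list is only an assumption about what is wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse lines of the form "name:dep1,dep2,..." into a hash of array refs.
# The file's own name is removed from its dependency list (assumption).
sub parse_deps {
    my @lines = @_;
    my %deps;
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /^([^:]+):(.*)$/;
        my ($name, $rest) = ($1, $2);
        my @list = grep { $_ ne $name } split /,/, $rest;
        $deps{$name} = \@list;
    }
    return %deps;
}

my %deps = parse_deps(
    "xxxx:xxxx,yyyy,1234,efgh",
    "zzzz:zzzz,kkkk,pppp,1234,xxxx,uuuu,oooo",
    "dddd:dddd",
);
print "$_ -> @{ $deps{$_} }\n" for sort keys %deps;
```

A file without dependencies, like dddd, simply ends up with an empty list.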

I have another file (say file2) which holds only the master file names with their paths, and looks like this:

/data/testing2/zzzz
/data/testing1/xxxx
/data/test/dddd
/data/test3/fffff

Now I want to create another file (say file3, the output) which lists, on separate lines and with the path if possible, all dependent files for each of the files in file2.
Given file1 and file2 as input, file3 should look something like this once the script has run:

/data/testing1/kkkk
/data/test1/pppp
/data/testing1/1234
/data/test1/xxxx
/data/test1/uuuu
/data/testing3/oooo
/data/testing1/yyyy
/data/test/1234
/data/post/efgh

I would be happy if someone could help me with a script to process this.

itkamaraj · June 2, 2011, 2:39am

As per your input files, this line of the output is wrong:

/data/testing1/kkkk

it must be

/data/testing2/kkkk, as kkkk depends on zzzz and the path of zzzz is /data/testing2/

Please state your input and output formats clearly.

I am confused.

pravin27 · June 2, 2011, 2:41am

Is this what you are looking for?

 awk -F"/" 'NR==FNR{for(i=1;i<NF;i++){a[$NF]=a[$NF] $i FS}} a[$1]{for(i=2;i<=NF;i++){print a[$1] $i}}' file2 FS="[:,]" file1

pradeepmacha · June 2, 2011, 3:08am

@itkamaraj: the dependent file can be anywhere. Basically the script has to find the dependent file under /data; it can be /data/testing2/kkkk or /data/testing1/kkkk.
Sorry if I'm confusing you more. Let me know if you need more clarification.

---------- Post updated at 12:38 PM ---------- Previous update was at 12:25 PM ----------

Pravin, thanks for the quick reply. The script works fine, but it gives each dependent file the same path as its master file. In my case the dependent files can be in any location under /data. How do we go about this?
Alternatively, I can maintain another file (say file4) which has the names and paths of all files. After getting a dependent file name from file1, I can look it up in file4 and print the matching line, path included, as output.
Kindly suggest.
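
A minimal sketch of that file4 idea, assuming file4 holds one full path per line. The sub names build_index and paths_for are hypothetical, and the paths in the demo are taken from the sample output earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);

# Build a name -> [paths] index from path lines (the proposed file4).
sub build_index {
    my @paths = @_;
    my %index;
    for my $path (@paths) {
        chomp $path;
        next if $path eq '';
        push @{ $index{ basename($path) } }, $path;
    }
    return \%index;
}

# Return every known path for each requested name; unknown names are skipped.
sub paths_for {
    my ($index, @names) = @_;
    my @found;
    for my $name (@names) {
        push @found, @{ $index->{$name} } if exists $index->{$name};
    }
    return @found;
}

my $index = build_index(
    "/data/testing1/yyyy",
    "/data/test/1234",
    "/data/post/efgh",
);
print "$_\n" for paths_for($index, qw(yyyy 1234 efgh missing));
```

A name that occurs in several directories would yield several paths, so the caller would still have to decide which one is meant.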

pravin27 · June 2, 2011, 4:47am

If you only want to find the files, then what is the use of file2? Could the code below help you?

perl -nle '@flds=split(",",substr($_,index($_,":")+1));for($i=0;$i<=$#flds;$i++){system("find /data -name $flds[$i]");}' file1

pradeepmacha · June 2, 2011, 5:03am

Thanks Pravin. file1 lists all files and their dependencies (it is a kind of master file), and file2 has only the subset of files from file1 that I need.
For instance, file1 has 100 files (lines) with their dependencies, but only the 10 files mentioned in file2, and their dependencies, need to be found and listed.

pravin27 · June 2, 2011, 5:34am

Try this,
FindFile.pl

#!/usr/bin/perl

($file1,$file2)=($ARGV[0],$ARGV[1]);
open (FH1,"<","$file1") or die "Fail- $!\n";
open (FH2,"<","$file2") or die "Fail- $!\n";

# remember the base name of every path listed in file2
while (<FH2>) {
        chomp;
        @flds=split(/\//);
        $lookup{$flds[$#flds]}++;
}

# for each file1 entry that is named in file2, look up every dependency under /data
while (<FH1>) {
        chomp;
        if(/^(.+?):/) {
                if ( exists $lookup{$1}) {
                        @flds=split(/,/,substr($_,index($_,":")+1));
                        for($i=0;$i<=$#flds;$i++) { system("find /data -name $flds[$i]"); }
                }
        }
}

close(FH1);
close(FH2);

Invocation

perl FindFile.pl file1 file2
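
A side note on the approach: the script starts one external find for every dependency, and each of them walks /data again. If that turns out to be slow, a single pass with the core File::Find module might look like the sketch below. The sub name locate is hypothetical and not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Walk $root once and return the full path of every plain file whose
# base name is in @names.
sub locate {
    my ($root, @names) = @_;
    my %want = map { $_ => 1 } @names;
    my @hits;
    # Inside the callback, $_ is the base name and $File::Find::name the full path.
    find(sub { push @hits, $File::Find::name if -f $_ && $want{$_} }, $root);
    my @sorted = sort @hits;
    return @sorted;
}

# Example call with the names from the first line of file1.
if (-d '/data') {
    print "$_\n" for locate('/data', qw(xxxx yyyy 1234 efgh));
}
```

Unlike find -name, this compares names literally, so shell wildcards in a name are not expanded.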

pradeepmacha · June 2, 2011, 5:46am

Thanks Pravin, the script works fine. Thank you so much.

pradeepmacha · June 3, 2011, 8:01am

Hey Pravin, sorry to come back and reopen the old thread. I have one more requirement on the same topic.
I forgot to mention that I need to find the dependent files recursively.
For example, in file1 line 2, zzzz has oooo as a dependent file. oooo might in turn have dependent files of its own (not shown in the example), so I need those files as well, and so on down to the last level. A file with no dependent files is listed with only its own name (like uuuu:uuuu).
Another case: zzzz has xxxx as a dependent file, but the first line has already resolved that dependency.
Please let me know if I'm confusing you, and I'll try to frame my question in a better way.
I would appreciate it if anyone else could help me with this as well.

---------- Post updated at 05:31 PM ---------- Previous update was at 05:22 PM ----------

for example
-------------------------
file1

A:A,B,C,D
B:B,E,F
E:E
F:F,G,H
G:G
H:H
I:I
J:J

-------------------------
file2

/data/testing2/A
/data/test/B

------------------------
Now the script should list (with paths, of course):

/data/testing2/A
/data/test/B
/data/testg2/C
/data/testing3/D
/data/testing1/E
/data/testing/F
/data/testing/G
/data/test2/H

pravin27 · June 3, 2011, 11:41pm

I hope this will resolve the issue.

#!/usr/bin/perl

sub seen_files {
        $filename=shift;
        @seenfld=split(" ",$seen{$filename});
        for($k=1;$k<=$#seenfld;$k++){
                @res=grep($_ eq $seenfld[$k],@old_array);   # exact comparison instead of a regex match
                unless (scalar @res > 0 ) {
                        system("find /data -name $seenfld[$k]");
                        @old_array=(@old_array,$seenfld[$k]);
                }
        }
}


($file1,$file2)=($ARGV[0],$ARGV[1]);
open (FH1,"<","$file1") or die "Fail- $!\n";
open (FH2,"<","$file2") or die "Fail- $!\n";

while (<FH2>) {
        chomp;
        @flds=split(/\//);
        $lookup{$flds[$#flds]}++;
}

open (FH3,"<","$file1") or die "Fail- $!\n";
while (<FH3>) {
        chomp;
        if(/^(.+?):/) {
                @fld=split(/,/,substr($_,index($_,":")+1));
                $seen{$1}="@fld";
        }
}

while (<FH1>) {
        chomp;
        if(/^(.+?):/) {
                if ( exists $lookup{$1}) {
                        @flds=split(/,/,substr($_,index($_,":")+1));
                        for($i=0;$i<=$#flds;$i++) {
                                # exact comparison: an already listed "xxxxy" must not hide "xxxx"
                                @res=grep($_ eq $flds[$i],@old_array);
                                unless (scalar @res > 0 ) {
                                        system("find /data -name $flds[$i]");
                                        @old_array=(@old_array,$flds[$i]);
                                        if (exists $seen{$flds[$i]} &&  $i > 0 ) {
                                                seen_files($flds[$i]);
                                                @seenfld_old=@seenfld;
                                                for($p=1;$p<=$#seenfld_old;$p++){seen_files($seenfld_old[$p]);}
                                        }
                                }
                        }
                }
        }
}

close(FH1);
close(FH2);
close(FH3);
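
One caveat: the script above follows dependencies only a fixed number of levels deep, because seen_files is called for the direct dependencies and then once more for theirs. If longer chains are possible, a queue-based traversal is one way to follow them to any depth. This is only a sketch: the sub name resolve is hypothetical, %deps holds the A..J example from this thread (assuming the F line is meant to read F:F,G,H), and it prints names only, leaving the find call to the caller.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the requested names plus everything they depend on, at any depth.
# %$deps maps a name to an array ref of its direct dependencies.
sub resolve {
    my ($deps, @wanted) = @_;
    my (%seen, @order);
    my @queue = @wanted;
    while (@queue) {
        my $name = shift @queue;
        next if $seen{$name}++;     # exact-match check, also stops cycles
        push @order, $name;
        push @queue, @{ $deps->{$name} || [] };
    }
    return @order;
}

my %deps = (
    A => [qw(A B C D)], B => [qw(B E F)], E => ['E'], F => [qw(F G H)],
    G => ['G'], H => ['H'], I => ['I'], J => ['J'],
);
print join(' ', resolve(\%deps, qw(A B))), "\n";   # prints: A B C D E F G H
```

The %seen hash does the job of @old_array, but with a constant-time lookup and no risk of partial matches.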

pradeepmacha · June 6, 2011, 12:40am

Hello Pravin, thanks, the script is working perfectly fine. Could you please explain the script to me so that I can add error handling?

pradeepmacha · June 7, 2011, 6:58am

Hello All,
I have a slight modification in my input files and need your help again. The logic of the script remains the same, except that the input files file1 and file2 now look like this:

File1:

Testing~xxxx~null:Testing~xxxx~null,Testing1~abcd~null,test~yyyy~null
Testing1~abcd~Null:Testing1~abcd~null,test~oooo~null,test1~zzzz~null
test1~zzzz:test1~zzzz~null,test~1234~null

File2:

Testing~xxxx~Null
test1~zzzz~null
test~oooo~
/data/Testing1/abcd~null
/data/test/yyyy.xml~null

Here the first word (test*) in file1 and file2 is the directory in which the file exists, e.g. file xxxx is under /data/Testing.

So I want to modify the script to:

1. Search for each file in its specific directory.
2. Remove ~null and ~, which are junk, in both files (sometimes the "N" is a capital, ~Null, and sometimes it is just ~).
3. Handle file2, which is a mix of formats: the first 3 lines give only the directory in which the file exists, whereas the last lines have the complete path, with or without a file extension. The file names are unique, so the extension can be dropped.

I understand this is very confusing, but any help would be appreciated.
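
To show what I mean by cleaning the entries, here is a minimal sketch that turns one entry into a directory and a bare file name. The sub name normalize is made up, and it assumes that tilde entries such as Testing~xxxx~null always live under /data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one entry into (directory, bare file name).
# Assumption: entries that do not start with "/" are relative to /data.
sub normalize {
    my ($entry) = @_;
    $entry =~ s/\s+$//;
    $entry =~ s/~(?:null)?$//i;     # drop a trailing ~null, ~Null or ~
    $entry =~ s{~}{/}g;             # Testing~xxxx -> Testing/xxxx
    $entry = "/data/$entry" unless $entry =~ m{^/};
    my ($dir, $name) = $entry =~ m{^(.*)/([^/]+)$};
    return unless defined $name;    # nothing usable on this line
    $name =~ s/\.[^.]+$//;          # names are unique, so drop the extension
    return ($dir, $name);
}

for my $entry ('Testing~xxxx~Null', 'test~oooo~', '/data/test/yyyy.xml~null') {
    my ($dir, $name) = normalize($entry);
    print "$dir $name\n";
}
```

With the directory known, the search could then be limited to that directory instead of all of /data.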

pankaj80 · June 7, 2011, 7:23am

Can anyone help me here..

I am new to Unix and am having some difficulty preparing a small shell script. I want the script to sort all the files given by the user as input (either the exact full name of a file, or a pattern such as files with names matching ZEB*). The script should find all files matching the criteria and write the sorted version of each file with the extension .sort appended, while retaining the original file. It should not sort files in sub-directories, only those in the top-level directory it is run from.
This piece of code is working for me, but it sorts all the files present in that location.

find * -prune -type f | grep -v '\.sort$' | while IFS= read -r file
do
  sort "$file" > "$file".sort
done

I want a user-friendly shell script which asks the user to enter the path and the file name, and works only on the files entered by the user.

getmmg · June 7, 2011, 7:29am

If you have a new question, please post it in a separate thread.

pankaj80 · June 7, 2011, 7:32am

Can anyone please tell me how to post a query in a new thread? Either a link to post a new thread or directions for navigating the forum will work. Thanks in advance.

getmmg · June 7, 2011, 7:38am

The UNIX and Linux Forums - FAQ: General Forum Usage

pradeepmacha · June 7, 2011, 7:38am

Hello guys, someone has posted his query in my thread, so I'm reposting my earlier question.
Kindly help.

pravin27 · June 8, 2011, 12:50am

I am really confused by your last requirement. I have made the changes in the script as per my understanding. I hope this will resolve your issue.

#!/usr/bin/perl

sub seen_files {
        $filename=shift;
        @seenfld=split(" ",$seen{$filename});
        for($k=1;$k<=$#seenfld;$k++){
                @res=grep($_ eq $seenfld[$k],@old_array);   # exact comparison instead of a regex match
                unless (scalar @res > 0 ) {
                        #print "FILENAME - $seenfld[$k]\n";
                        system("find /data -name $seenfld[$k]");
                        @old_array=(@old_array,$seenfld[$k]);
                }
        }
}

($file1,$file2)=($ARGV[0],$ARGV[1]);
open (FH1,"<","$file1") or die "Fail- $!\n";
open (FH2,"<","$file2") or die "Fail- $!\n";

while (<FH2>) {
        chomp;
        if(/^\//) {
                @flds=split(/\//);
                $filename=$flds[$#flds];
                if($filename=~/(.+?)[\.~](.+?)/) { $filename=$1;} #print $filename,"\n"; }
                $lookup{$filename}++;
        }
}
open (FH3,"<","$file1") or die "Fail- $!\n";
while (<FH3>) {
        chomp;
        if(/^(.+?):/) {
                # key %seen by the bare file name, the same form used for the lookups below
                $key=((split(/\~/,$1))[1]);
                @fld=split(/,/,substr($_,index($_,":")+1));
                @fld_new=();    # start from an empty list for every line
                foreach (@fld) {
                        @fld_new=(@fld_new,((split(/\~/))[1]));
                }
                $seen{$key}="@fld_new";
        }
}
while (<FH1>) {
        chomp;
        if(/^(.+?):/) {
                $fname=((split(/\~/,$1))[1]);
                if ( exists $lookup{$fname}) {
                        @flds=split(/,/,substr($_,index($_,":")+1));
                        @flds_new=();   # start from an empty list for every line
                        foreach (@flds) {
                                @flds_new=(@flds_new,((split(/\~/))[1]));
                        }

                        for($i=0;$i<=$#flds_new;$i++) {
                                # exact comparison: an already listed "xxxxy" must not hide "xxxx"
                                @res=grep($_ eq $flds_new[$i],@old_array);
                                unless (scalar @res > 0 ) {
                                        #print "FILENAME - $flds_new[$i]\n";
                                        system("find /data -name $flds_new[$i]");
                                        @old_array=(@old_array,$flds_new[$i]);
                                        if (exists $seen{$flds_new[$i]} &&  $i > 0 ) {
                                                seen_files($flds_new[$i]);
                                                @seenfld_old=@seenfld;
                                                for($p=1;$p<=$#seenfld_old;$p++){seen_files($seenfld_old[$p]);}
                                        }
                                }
                        }
                }
        }
}
close(FH1);
close(FH2);
close(FH3);

pradeepmacha · June 8, 2011, 3:19am

Hello Pravin,
Thanks, it looks good and works fine. Regarding the last point, let me try to clarify it.
file2 can look like
-------
test~yyyy
testing2~zzzz
/data/test/dddd
/data/test3/ffff
---------------
So before reading the values, the script has to convert every "~" to "/" and, if required, add a "/" at the beginning, so that each line looks like a path:
----------
/test/yyyy
/testing2/zzzz
/data/test/dddd
/data/test3/ffff
-----------------
My input files, which are created by another script, are in bad shape. I have to see if I can fix those as well.
But if I can get a solution to this, then I'm very close to what I want.

pravin27 · June 9, 2011, 1:31am

#!/usr/bin/perl

sub seen_files {
        $filename=shift;
        @seenfld=split(" ",$seen{$filename});
        for($k=1;$k<=$#seenfld;$k++){
                @res=grep($_ eq $seenfld[$k],@old_array);   # exact comparison instead of a regex match
                unless (scalar @res > 0 ) {
                        #print "FILENAME - $seenfld[$k]\n";
                        system("find /data -name $seenfld[$k]");
                        @old_array=(@old_array,$seenfld[$k]);
                }
        }
}
($file1,$file2)=($ARGV[0],$ARGV[1]);
open (FH1,"<","$file1") or die "Fail- $!\n";
open (FH2,"<","$file2") or die "Fail- $!\n";


open (FW,">","file2_new") or die "Fail- $!\n";
while (<FH2>) {
        chomp;
         s/(\~[Nn]ull|\~$)//g;
        if(!/^\//) {$_="/".$_;s/\~/\//g;}
        print FW $_,"\n";
        if(/^\//) {
                @flds=split(/\//);
                $filename=$flds[$#flds];
                if($filename=~/(.+?)[\.~]/) { $filename=$1;} #print $filename,"\n"; }
                $lookup{$filename}++;
        }
}

print "-------------------------------------------\n\n";
print "file2 newly created with name file2_new in current directory\n";
print "-------------------------------------------\n\n";
open (FH3,"<","$file1") or die "Fail- $!\n";
while (<FH3>) {
        chomp;
        if(/^(.+?):/) {
                # key %seen by the bare file name, the same form used for the lookups below
                $key=((split(/\~/,$1))[1]);
                @fld=split(/,/,substr($_,index($_,":")+1));
                @fld_new=();    # start from an empty list for every line
                foreach (@fld) {
                        @fld_new=(@fld_new,((split(/\~/))[1]));
                }
                $seen{$key}="@fld_new";
        }
}
while (<FH1>) {
        chomp;
        if(/^(.+?):/) {
                $fname=((split(/\~/,$1))[1]);
                if ( exists $lookup{$fname}) {
                        @flds=split(/,/,substr($_,index($_,":")+1));
                        @flds_new=();   # start from an empty list for every line
                        foreach (@flds) {
                                @flds_new=(@flds_new,((split(/\~/))[1]));
                        }

                        for($i=0;$i<=$#flds_new;$i++) {
                                # exact comparison: an already listed "xxxxy" must not hide "xxxx"
                                @res=grep($_ eq $flds_new[$i],@old_array);
                                unless (scalar @res > 0 ) {
                                        #print "FILENAME - $flds_new[$i]\n";
                                        system("find /data -name $flds_new[$i]");
                                        @old_array=(@old_array,$flds_new[$i]);
                                        if (exists $seen{$flds_new[$i]} &&  $i > 0 ) {
                                                seen_files($flds_new[$i]);
                                                @seenfld_old=@seenfld;
                                                for($p=1;$p<=$#seenfld_old;$p++){seen_files($seenfld_old[$p]);}
                                        }
                                }
                        }
                }
        }
}
close(FH1);
close(FH2);
close(FH3);
close(FW);