lurak
January 9, 2012, 5:01am
1
Hi all, can anyone help me. I want to compare values within column in one file using awk.
Below are the sample input file(1st file):
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||
below are the comparison files (2nd file):
20735704|20735505|20735604|700333|700666|
What I want to get is, if the 1st file column 2,4,6,8,10,..42 (even column under 43) have data from 2nd file so it will output all the data from 1st row + multiple on 2nd file data.
Below are the sample output
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666|
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||No Multiple data|
Can you guys help me? really appreciate the help, and FYI the data that I'm handling is around 20,000,000 rows.
birei
January 9, 2012, 6:58am
2
Hi lurak,
This is my try with awk, but last line gives me one coincidence:
$ cat file1
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||
$ cat file2
20735704|20735505|20735604|700333|700666|
$ cat script.awk
BEGIN {
FS = "|"
if ( ARGC != 3 ) {
print "Usage: awk -f script.awk <file1> <file2>"
exit 0
}
dummy = ARGV[1]
ARGV[1] = ARGV[2]
ARGV[2] = dummy
}
## Process 'file2' of input parameters.
FNR == NR {
for ( i = 1; i <= NF; i++ ) {
if ( length( $i ) > 0 ) {
comparison[ $i ] = 1
}
}
}
## Process 'file1' of input parameters.
FNR < NR {
for ( i = 2; i <= 42; i += 2 ) {
if ( comparison[ $i ] ) {
multiples = multiples " " $i
}
}
if ( length( multiples ) > 0 ) {
multiples = "Multiple on" multiples "|"
} else {
multiples = "No multiple data|"
}
print $0 multiples
multiples = ""
}
$ awk -f script.awk file1 file2 | cat -n
1 8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139|||||||||||||||||||||||||||Multiple on 20735704 20735505|
2 8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248|||||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333|
3 8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101|||||||||||||||||||||||||||||Multiple on 700333 700666 20735604|
4 83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704|||||||||||||||||||||||||||||Multiple on 20735505 20735604 700333 700666|
5 83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|20735704|20111011141108|400501|20111011141108|||||||||||||||||||||||||||||||Multiple on 20735704|
Regards,
Birei
i am not good at awk... anyway here is the normal code... since you mentioned 20,00,000 rows... execution time might be more..
file1 has 1st file content file2 has 'multiples'.. file3 is the output file
rm -r file3
while read line
do
counter=0
finalNo=$( echo $line | awk '{ split($0,a,"|"); print length(a) - 1 }' )
line="$line Multiple on: "
for (( i = 2; i <= $finalNo; i=`expr $i + 2` )) ### Outer for loop ###
do
column1=$( echo $line | cut -d'|' -f$i )
if [[ "$column1" != "" ]]; then
noPresent=$( cat file2 | grep -w "$column1" )
if [[ "$noPresent" != "" ]] ; then
line="$line $column1"
counter=`expr $counter + 1`
fi
fi
done
if [ $counter -ne 0 ]; then
line="$line |"
echo $line >> file3
else
line="$line |No Multiple found|"
echo $line >> file3
fi
done < file1
cat file3
8388014643|20735704|20100713140734|20735505|20100713152501|700111|20110226085427|400501|20100329165139|300046|20100329165139|400400|20100329165139|400000|20100329165139||||||||||||||||||||||||||| Multiple on: 20735704 20735505 |
8388040001|20735505|20110916121248|20735604|20110916121248|700333|20110916121248|400400|20110916121248|500510|20110916121248||||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 |
8388084999|700333|20101018171805|700666|20100329175101|20735604|20100329175101|400501|20100329175101|300001|20100329175101|500510|20100329175101||||||||||||||||||||||||||||| Multiple on: 700333 700666 20735604 |
83875488100|20735505|20111101130130|20735604|20111011140704|700333|20111011140704|400400|20111011140704|500510|20111011140704|700666|20111011140704||||||||||||||||||||||||||||| Multiple on: 20735505 20735604 700333 700666 |
83875426992|400400|20111011141108|400000|20111011141108|500510|20111011141108|207357|20111011141108|400501|20111011141108||||||||||||||||||||||||||||||| Multiple on: |No Multiple found|
i have modifed last line in file1 since last line has 20735704 which is a multiple
mirni
January 9, 2012, 7:19am
4
Try this out:
awk -F\| '
NR==FNR{
for(i=1;i<=NF;i++)
a[$i]=$i;
next}
{m="";
for(i=2;i<=NF; i+=2)
if($i && $i in a)
m=m " "$i
app = m ? "Multiple on " m "|" : "No multiple data |" ;
print $0 app
}' file2.txt file1.txt
But I agree with birei, that in your sample output, the last line should find 1 match.
lurak
January 10, 2012, 12:06am
5
Thanks for the big help and already tried the code myself and work like charm. The only problem is, actually the last line is correct. There is no multiple because there is only one data of 20735704 that is why it show no multiple.
As you can see below are the 2nd file.
20735704|20735505|20735604|700333|700666|
if there is only one data from the 2nd file there are no multiple. But if there are, let's say 20735704 and 700333 in one line, there are multiple.
Pardon my bad explanation and bad english. Can you guys help me again?
birei
January 10, 2012, 6:20am
6
So there must exists at least two coincidences to consider a line has multiples, in case of zero or one coincidence it is assumed there are not coincidences. Is like this?
Regards,
Birei
mirni
January 10, 2012, 6:36am
7
Just put a counter in there:
awk -F\| '
NR==FNR{
for(i=1;i<=NF;i++)
a[$i]=$i;
next
}
{m=""; cnt=0;
for(i=2;i<=NF; i+=2)
if($i && $i in a) {
m=m " "$i
cnt++
}
app = cnt > 1 ? "Multiple on " m "|" : "No multiple data |" ;
print $0 app
}' file2.txt file1.txt
Or, you can try to count the spaces in variable 'm' and test for that:
awk -F\| '
NR==FNR{
for(i=1;i<=NF;i++)
a[$i]=$i;
next
}
{m="";
for(i=2;i<=NF; i+=2)
if($i && $i in a)
m=m " "$i
app = gsub(" "," ",m) > 1 ? "Multiple on " m "|" : "No multiple data |" ;
print $0 app
}' file2.txt file1.txt
Makes the code shorter, but a little bit more obscure. gsub returns the number of substitutions made.