concatenate all duplicate line in a file.

Hi All,

i have a zip file like the format

794051400123|COM|24|0|BD|R|99.98

794051413727|COM|11|0|BD|R|28.99

794051415622|COM|23|0|BD|R|28.99

883929004676|COM|0|0|BD|R|28.99
794051400123|MOM|62|0|BD|R|99.98

794051413727|MOM|4|0|BD|R|28.99

794051415622|MOM|80|0|BD|R|28.99

883929004676|MOM|0|0|BD|R|28.99

883929017164|MOM|0|0|BD|R|39.99
794051400123|RNO|73|0|BD|R|99.98

794051413727|RNO|8|0|BD|R|28.99

794051415622|RNO|84|0|BD|R|28.99

883929004676|RNO|0|0|BD|R|28.99
794051400123|SOM|25|0|BD|R|99.98

794051415622|SOM|80|0|BD|R|28.99

883929004676|SOM|0|0|BD|R|28.99

883929017164|SOM|0|0|BD|R|39.99
.................................

i need concate all duplicate line like

794051400123|COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8
............
...

the file size is nearly 30 MB.So it takes lot of time.I have to do it with in 15 min.

please help me.

Thanks!
vaskar

What happened to the dollar amount at the end of the line?

Tested and working.

#!/usr/bin/env python

import sys

input = open(sys.argv[1], 'r')

dict = {}

for line in input:

    line = line.rstrip()

    if line.count("|"):
        line = line.split("|")
        if dict.has_key(line[0]):
            dict[line[0]].extend(line[1:3])
        else:
            dict[line[0]] = line[1:3]
            
input.close()

for key, value in dict.items():

    print("%s|%s" % (key, "|".join(value)))

I don't know python script.Is it possible write simple shell script? I have done it but it takes lot of time.

Gift horse and all that ...

Assuming you mean take all values with an identical first field, and paste together fields 2 and 3 from all those lines, something like

sort file |
awk -F '|' '$1 == prev { collect=collect "|" $2 "|" $3; next }
{ if (collect) print collect; prev = $1; collect = $1 "|" $2 "|" $3; next }
END { if (collect) print collect }'

The sort might kill you if it's a very small or old system, but having sorted input makes the awk script very simple.

Thanks!! It is working....:stuck_out_tongue:

Hi,

I am facing a big problem...It gives me wrong result

like

|COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8

pls help me.

Thanks!
vaskar

Check this, might help u.

for i in `cat filename | cut -d"|" -f1`
do
if [ `grep "$i" filename | wc -l` -gt "1" ]
then
tmp=`grep "$i" filename`
echo $tmp | sed 's/ '"$i"'//g'
fi
done

manosubsulo: I'm not sure how that's an improvement; certainly it's a lot less efficient.

vaskarbasak: What is the output on the first line supposed to be? Can you provide sample input for which you get this result?

I think some problem in line
'$1 == prev { collect=collect "|" $2 "|" $3; next }
If i have changed the code
$1 == prev { if(collect) collect=collect "|" $2 "|" $3; next }
it works..

.

for line in `sort FILENAME_HERE`
do
str=`echo $line|cut -d"|" -f1`
if [ "$prev" -eq "$str" ]
then
str2=`echo $line | sed 's/'"$str"'//g'`
echo "$str2\c"
else
echo "\n$line\c"
fi
prev=$str
done

vaskarbasak: sounds like you have input lines which have an empty first field, in which case the output (as per your definition) is correct.

Such kind of problem occurs only first line of any files.

Correct

794051400123|COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8

Wrong

COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8

I have tested the era's solution, it works fine for me :

sort vaskar.dat |
awk -F '|' '
$1 == prev {
   collect=collect "|" $2 "|" $3;
   next
}
{
   if (collect) print collect;
   prev = $1;
   collect = $1 "|" $2 "|" $3;
   next
}
END {
   if (collect) print collect
}
'

Input file (vaskar.dat) :

794051400123|COM|24|0|BD|R|99.98
794051413727|COM|11|0|BD|R|28.99
794051415622|COM|23|0|BD|R|28.99
883929004676|COM|0|0|BD|R|28.99
794051400123|MOM|62|0|BD|R|99.98
794051413727|MOM|4|0|BD|R|28.99
794051415622|MOM|80|0|BD|R|28.99
883929004676|MOM|0|0|BD|R|28.99
883929017164|MOM|0|0|BD|R|39.99
794051400123|RNO|73|0|BD|R|99.98
794051413727|RNO|8|0|BD|R|28.99
794051415622|RNO|84|0|BD|R|28.99
883929004676|RNO|0|0|BD|R|28.99
794051400123|SOM|25|0|BD|R|99.98
794051415622|SOM|80|0|BD|R|28.99
883929004676|SOM|0|0|BD|R|28.99
883929017164|SOM|0|0|BD|R|39.99

Output:

794051400123|COM|24|MOM|62|RNO|73|SOM|25
794051413727|COM|11|MOM|4|RNO|8
794051415622|COM|23|MOM|80|RNO|84|SOM|80
883929004676|COM|0|MOM|0|RNO|0|SOM|0
883929017164|MOM|0|SOM|0

Empty input lines can be a problem resulting with a line containing only '|'.
If your file contains empty lines, try this modifified version of era's solution :

sort file |
awk -F '|' 'NF==0 { next } $1 == prev { collect=collect "|" $2 "|" $3; next }
{ if (collect) print collect; prev = $1; collect = $1 "|" $2 "|" $3; next }
END { if (collect) print collect }'

Jean-Pierre.

Thanks All....
it is working....
but i am facing another problem

while i am trying to change the format

like

794051400123|99.98|COM|24|MOM|62|RNO|73|SOM|25

it gives me result like

794051400123|99.98
|COM|24|MOM|62|RNO|73|SOM|25

i.e brake 1 line into 2 line...

Please post a sample of input lines (preferably less than five lines) which allows us to reproduce this problem. With the input you have posted, the code works. We can't know what the special cases are if you don't provide representative input samples.

here i past my code:-

awk -F '|' 'NF==0 { next } $1 == prev {collect=collect","$2"="$3; next }
{ if (collect) print collect; prev = $1; collect = $1","$5","$6","$7","$2"="$3; next }
END { if (collect) print collect}'

Input:-

794051400123|COM|24|0|BD|R|99.98

794051413727|COM|11|0|BD|R|28.99

794051415622|COM|23|0|BD|R|28.99

883929004676|COM|0|0|BD|R|28.99
794051400123|MOM|62|0|BD|R|99.98

794051413727|MOM|4|0|BD|R|28.99

794051415622|MOM|80|0|BD|R|28.99

883929004676|MOM|0|0|BD|R|28.99

883929017164|MOM|0|0|BD|R|39.99
794051400123|RNO|73|0|BD|R|99.98

Corrent Out put:-

794051400123,BD,R,99.98,COM=24,MOM=62,SOM=25,RNO=73
794051413727,BD,R,28.99,COM=11,MOM=4,RNO=8
.....

But my out put(Wrong):-

794051400123,BD,R,99.98
,COM=24,MOM=62,SOM=25,RNO=73
794051413727,BD,R,28.99
,COM=11,MOM=4,RNO=8
.....

i.e breaks line from 99.98..Which is wrong.

With the input you posted, I get the following output.

794051400123,BD,R,99.98,COM=24
794051413727,BD,R,28.99,COM=11
794051415622,BD,R,28.99,COM=23
883929004676,BD,R,28.99,COM=0
794051400123,BD,R,99.98,MOM=62
794051413727,BD,R,28.99,MOM=4
794051415622,BD,R,28.99,MOM=80
883929004676,BD,R,28.99,MOM=0
883929017164,BD,R,39.99,MOM=0
794051400123,BD,R,99.98,RNO=73

There is not a SOM in sight in your sample input data, so I don't understand where that would come from.

If I sort the input before feeding it to awk, I get the following result.

794051400123,BD,R,99.98,COM=24,MOM=62,RNO=73
794051413727,BD,R,28.99,COM=11,MOM=4
794051415622,BD,R,28.99,COM=23,MOM=80
883929004676,BD,R,28.99,COM=0,MOM=0
883929017164,BD,R,39.99,MOM=0

If I understand your requirements, this is correct.

In other words, I cannot reproduce your problem. Again, we need input and output which we can reproduce the problem with.

yes...

794051400123,BD,R,99.98,COM=24,MOM=62,RNO=73
794051413727,BD,R,28.99,COM=11,MOM=4
794051415622,BD,R,28.99,COM=23,MOM=80
883929004676,BD,R,28.99,COM=0,MOM=0
883929017164,BD,R,39.99,MOM=0

is correct

but when i run my shell script all lines brakes from 99.98,28.99...etc.

i.e
794051400123,BD,R,99.98
,COM=24,MOM=62,RNO=73
794051413727,BD,R
,28.99,COM=11,MOM=4

i don't understand why it breaks the line? Will there any may to merge 2 line into one line.I have tried it but falied...Can u pls paste your code?

sort inputfile | awk ... your awk script here ...