concatenate all duplicate line in a file.

vaskarbasak · April 29, 2008, 8:25am

Hi All,

i have a zip file like the format

794051400123|COM|24|0|BD|R|99.98

794051413727|COM|11|0|BD|R|28.99

794051415622|COM|23|0|BD|R|28.99

883929004676|COM|0|0|BD|R|28.99
794051400123|MOM|62|0|BD|R|99.98

794051413727|MOM|4|0|BD|R|28.99

794051415622|MOM|80|0|BD|R|28.99

883929004676|MOM|0|0|BD|R|28.99

883929017164|MOM|0|0|BD|R|39.99
794051400123|RNO|73|0|BD|R|99.98

794051413727|RNO|8|0|BD|R|28.99

794051415622|RNO|84|0|BD|R|28.99

883929004676|RNO|0|0|BD|R|28.99
794051400123|SOM|25|0|BD|R|99.98

794051415622|SOM|80|0|BD|R|28.99

883929004676|SOM|0|0|BD|R|28.99

883929017164|SOM|0|0|BD|R|39.99
.................................

i need concate all duplicate line like

794051400123|COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8
............
...

the file size is nearly 30 MB.So it takes lot of time.I have to do it with in 15 min.

please help me.

Thanks!
vaskar

jim_mcnamara · April 29, 2008, 10:43am

What happened to the dollar amount at the end of the line?

ShawnMilo · April 29, 2008, 11:40am

Tested and working.

#!/usr/bin/env python

import sys

input = open(sys.argv[1], 'r')

dict = {}

for line in input:

    line = line.rstrip()

    if line.count("|"):
        line = line.split("|")
        if dict.has_key(line[0]):
            dict[line[0]].extend(line[1:3])
        else:
            dict[line[0]] = line[1:3]
            
input.close()

for key, value in dict.items():

    print("%s|%s" % (key, "|".join(value)))

vaskarbasak · April 30, 2008, 4:00am

I don't know python script.Is it possible write simple shell script? I have done it but it takes lot of time.

era · April 30, 2008, 4:14am

Gift horse and all that ...

Assuming you mean take all values with an identical first field, and paste together fields 2 and 3 from all those lines, something like

sort file |
awk -F '|' '$1 == prev { collect=collect "|" $2 "|" $3; next }
{ if (collect) print collect; prev = $1; collect = $1 "|" $2 "|" $3; next }
END { if (collect) print collect }'

The sort might kill you if it's a very small or old system, but having sorted input makes the awk script very simple.

vaskarbasak · April 30, 2008, 9:20am

Thanks!! It is working....

vaskarbasak · August 25, 2008, 3:22pm

Hi,

I am facing a big problem...It gives me wrong result

like

|COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8

pls help me.

Thanks!
vaskar

manosubsulo · August 26, 2008, 2:27am

Check this, might help u.

for i in `cat filename | cut -d"|" -f1`
do
if [ `grep "$i" filename | wc -l` -gt "1" ]
then
tmp=`grep "$i" filename`
echo $tmp | sed 's/ '"$i"'//g'
fi
done

era · August 26, 2008, 4:16am

manosubsulo: I'm not sure how that's an improvement; certainly it's a lot less efficient.

vaskarbasak: What is the output on the first line supposed to be? Can you provide sample input for which you get this result?

vaskarbasak · August 26, 2008, 6:00am

I think some problem in line
'$1 == prev { collect=collect "|" $2 "|" $3; next }
If i have changed the code
$1 == prev { if(collect) collect=collect "|" $2 "|" $3; next }
it works..

.

palsevlohit_123 · August 26, 2008, 6:32am

for line in `sort FILENAME_HERE`
do
str=`echo $line|cut -d"|" -f1`
if [ "$prev" -eq "$str" ]
then
str2=`echo $line | sed 's/'"$str"'//g'`
echo "$str2\c"
else
echo "\n$line\c"
fi
prev=$str
done

era · August 26, 2008, 7:19am

vaskarbasak: sounds like you have input lines which have an empty first field, in which case the output (as per your definition) is correct.

vaskarbasak · August 26, 2008, 7:55am

Such kind of problem occurs only first line of any files.

Correct

794051400123|COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8

Wrong

COM|24|MOM|62|SOM|25|RNO|73
794051413727|COM|11||MOM|4|RNO|8

aigles · August 26, 2008, 8:13am

I have tested the era's solution, it works fine for me :

sort vaskar.dat |
awk -F '|' '
$1 == prev {
   collect=collect "|" $2 "|" $3;
   next
}
{
   if (collect) print collect;
   prev = $1;
   collect = $1 "|" $2 "|" $3;
   next
}
END {
   if (collect) print collect
}
'

Input file (vaskar.dat) :

794051400123|COM|24|0|BD|R|99.98
794051413727|COM|11|0|BD|R|28.99
794051415622|COM|23|0|BD|R|28.99
883929004676|COM|0|0|BD|R|28.99
794051400123|MOM|62|0|BD|R|99.98
794051413727|MOM|4|0|BD|R|28.99
794051415622|MOM|80|0|BD|R|28.99
883929004676|MOM|0|0|BD|R|28.99
883929017164|MOM|0|0|BD|R|39.99
794051400123|RNO|73|0|BD|R|99.98
794051413727|RNO|8|0|BD|R|28.99
794051415622|RNO|84|0|BD|R|28.99
883929004676|RNO|0|0|BD|R|28.99
794051400123|SOM|25|0|BD|R|99.98
794051415622|SOM|80|0|BD|R|28.99
883929004676|SOM|0|0|BD|R|28.99
883929017164|SOM|0|0|BD|R|39.99

Output:

794051400123|COM|24|MOM|62|RNO|73|SOM|25
794051413727|COM|11|MOM|4|RNO|8
794051415622|COM|23|MOM|80|RNO|84|SOM|80
883929004676|COM|0|MOM|0|RNO|0|SOM|0
883929017164|MOM|0|SOM|0

Empty input lines can be a problem resulting with a line containing only '|'.
If your file contains empty lines, try this modifified version of era's solution :

sort file |
awk -F '|' 'NF==0 { next } $1 == prev { collect=collect "|" $2 "|" $3; next }
{ if (collect) print collect; prev = $1; collect = $1 "|" $2 "|" $3; next }
END { if (collect) print collect }'

Jean-Pierre.

vaskarbasak · August 28, 2008, 8:20am

Thanks All....
it is working....
but i am facing another problem

while i am trying to change the format

like

794051400123|99.98|COM|24|MOM|62|RNO|73|SOM|25

it gives me result like

794051400123|99.98
|COM|24|MOM|62|RNO|73|SOM|25

i.e brake 1 line into 2 line...

era · August 28, 2008, 8:32am

Please post a sample of input lines (preferably less than five lines) which allows us to reproduce this problem. With the input you have posted, the code works. We can't know what the special cases are if you don't provide representative input samples.

vaskarbasak · August 28, 2008, 8:45am

here i past my code:-

awk -F '|' 'NF==0 { next } $1 == prev {collect=collect","$2"="$3; next }
{ if (collect) print collect; prev = $1; collect = $1","$5","$6","$7","$2"="$3; next }
END { if (collect) print collect}'

Input:-

794051400123|COM|24|0|BD|R|99.98

794051413727|COM|11|0|BD|R|28.99

794051415622|COM|23|0|BD|R|28.99

883929004676|COM|0|0|BD|R|28.99
794051400123|MOM|62|0|BD|R|99.98

794051413727|MOM|4|0|BD|R|28.99

794051415622|MOM|80|0|BD|R|28.99

883929004676|MOM|0|0|BD|R|28.99

883929017164|MOM|0|0|BD|R|39.99
794051400123|RNO|73|0|BD|R|99.98

Corrent Out put:-

794051400123,BD,R,99.98,COM=24,MOM=62,SOM=25,RNO=73
794051413727,BD,R,28.99,COM=11,MOM=4,RNO=8
.....

But my out put(Wrong):-

794051400123,BD,R,99.98
,COM=24,MOM=62,SOM=25,RNO=73
794051413727,BD,R,28.99
,COM=11,MOM=4,RNO=8
.....

i.e breaks line from 99.98..Which is wrong.

era · August 28, 2008, 9:02am

With the input you posted, I get the following output.

794051400123,BD,R,99.98,COM=24
794051413727,BD,R,28.99,COM=11
794051415622,BD,R,28.99,COM=23
883929004676,BD,R,28.99,COM=0
794051400123,BD,R,99.98,MOM=62
794051413727,BD,R,28.99,MOM=4
794051415622,BD,R,28.99,MOM=80
883929004676,BD,R,28.99,MOM=0
883929017164,BD,R,39.99,MOM=0
794051400123,BD,R,99.98,RNO=73

There is not a SOM in sight in your sample input data, so I don't understand where that would come from.

If I sort the input before feeding it to awk, I get the following result.

794051400123,BD,R,99.98,COM=24,MOM=62,RNO=73
794051413727,BD,R,28.99,COM=11,MOM=4
794051415622,BD,R,28.99,COM=23,MOM=80
883929004676,BD,R,28.99,COM=0,MOM=0
883929017164,BD,R,39.99,MOM=0

If I understand your requirements, this is correct.

In other words, I cannot reproduce your problem. Again, we need input and output which we can reproduce the problem with.

vaskarbasak · August 28, 2008, 9:14am

yes...

794051400123,BD,R,99.98,COM=24,MOM=62,RNO=73
794051413727,BD,R,28.99,COM=11,MOM=4
794051415622,BD,R,28.99,COM=23,MOM=80
883929004676,BD,R,28.99,COM=0,MOM=0
883929017164,BD,R,39.99,MOM=0

is correct

but when i run my shell script all lines brakes from 99.98,28.99...etc.

i.e
794051400123,BD,R,99.98
,COM=24,MOM=62,RNO=73
794051413727,BD,R
,28.99,COM=11,MOM=4

i don't understand why it breaks the line? Will there any may to merge 2 line into one line.I have tried it but falied...Can u pls paste your code?

era · August 28, 2008, 11:14am

sort inputfile | awk ... your awk script here ...