Merging duplicate lines that differ only in the last column

dhanamurthy · May 5, 2008, 10:43am

Hi
I have the following lines in a file

SANDI108085FRANKLIN WRAP 7285
SANDI109514ZIPLOC STRETCH N SEAL 7285
SANDI110198CHOICE DM 0911
SANDI111144RANDOM WEIGHT BRAND 0704
SANDI111144RANDOM WEIGHT BRAND 0738
SANDI111144RANDOM WEIGHT BRAND 0739
SANDI113951NBL-NO COMPANY LISTED 7285
SANDI115203HOME BASICS 7285

I need the output like

SANDI108085FRANKLIN WRAP 7285
SANDI109514ZIPLOC STRETCH N SEAL 7285
SANDI110198CHOICE DM 0911
SANDI111144RANDOM WEIGHT BRAND 0704 0738 0739
SANDI113951NBL-NO COMPANY LISTED 7285
SANDI115203HOME BASICS 7285

Note: the SANDI111144RANDOM WEIGHT BRAND line is repeated with only the last column different; I want to group those last columns onto one line.

Is there an easy way to do this in sed or awk?

Regards
Dhana

jim_mcnamara · May 5, 2008, 11:22am

awk code --
awk '{
       key=substr($0,1,11)    # first 11 characters identify the record
       if(key in arr)
           {
                 arr[key]=sprintf("%s %s", arr[key], $NF)    # append the last column
           }
       else
            {
                arr[key]=$0    # first occurrence: keep the whole line
            }
    }
    END {for (i in arr) {print arr[i]} } ' filename
# output
csadev:/home/jmcnama> t.awk
SANDI110198CHOICE DM 0911 0911
SANDI108085FRANKLIN WRAP 7285 7285a
SANDI113951NBL-NO COMPANY LISTED 7285 7285b
SANDI115203HOME BASICS 7285 7285b
SANDI111144RANDOM WEIGHT BRAND 0704 0738 0739 0704a 0738b 0739b
SANDI109514ZIPLOC STRETCH N SEAL 7285 7285a


#input file
csadev:/home/jmcnama> cat filename
SANDI108085FRANKLIN WRAP 7285
SANDI109514ZIPLOC STRETCH N SEAL 7285
SANDI110198CHOICE DM 0911
SANDI111144RANDOM WEIGHT BRAND 0704
SANDI111144RANDOM WEIGHT BRAND 0738
SANDI111144RANDOM WEIGHT BRAND 0739
SANDI113951NBL-NO COMPANY LISTED 7285
SANDI115203HOME BASICS 7285
SANDI108085FRANKLIN WRAP 7285a
SANDI109514ZIPLOC STRETCH N SEAL 7285a
SANDI110198CHOICE DM 0911
SANDI111144RANDOM WEIGHT BRAND 0704a
SANDI111144RANDOM WEIGHT BRAND 0738b
SANDI111144RANDOM WEIGHT BRAND 0739b
SANDI113951NBL-NO COMPANY LISTED 7285b
SANDI115203HOME BASICS 7285b
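
A note on output order: `for (i in arr)` visits the keys in an unspecified order, which is why the output above does not follow the input order. A minimal sketch that prints groups in first-seen order, assuming the same 11-character key; the function name group_keep_order and the order/n variables are made up for illustration and are not from the posts above:

```shell
# group_keep_order: hypothetical helper, reads the named files or stdin
group_keep_order() {
    awk '{
        key = substr($0, 1, 11)            # same 11-character key as above
        if (key in arr) {
            arr[key] = arr[key] " " $NF    # append the new last column
        } else {
            arr[key] = $0                  # first occurrence: keep the whole line
            order[++n] = key               # remember the order keys were first seen
        }
    }
    END { for (i = 1; i <= n; i++) print arr[order[i]] }' "$@"
}
# usage: group_keep_order filename
```

Like the original, this still keeps every merged line in memory until END.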

dhanamurthy · May 5, 2008, 11:41am

Hi
Your logic works, but I have a small correction to my requirement.

The input file, as I said, will look like this:
SANDI108085FRANKLIN WRAP 7285
SANDI109514ZIPLOC STRETCH N SEAL 7285
SANDI110198CHOICE DM 0911
SANDI111144RANDOM WEIGHT BRAND 0704
SANDI111144RANDOM WEIGHT BRAND 0738

The output should be in the format below.
I also need to know whether we can use printf '%s %-51s' for formatting in awk.

SANDI FRANKLIN WRAP 108085 7285
SANDI ZIPLOC STRETCH N SEAL 109514 7285
SANDI CHOICE DM 110198 0911
SANDI RANDOM WEIGHT BRAND 111144 0704 0738

Regards

dhanamurthy · May 5, 2008, 11:43am

Hi
I also have one more question.
If we use a statement like this:

arr[key]=sprintf("%s %s", arr[key], $NF)
we are creating a map, or relationship, between the key and the elements.
I would like to process a file of nearly 3GB.
In that case, will there be any memory issues?

Regards
Dhana

matrixmadhan · May 5, 2008, 11:57am

dhanamurthy,

Actually, in awk these are called associative arrays; you are creating an association between a key and a value.

dhanamurthy · May 5, 2008, 12:05pm

Hi
Your logic works, but I have a small correction to my requirement.

The input file, as I said, will look like this:
SANDI108085FRANKLIN WRAP 7285
SANDI109514ZIPLOC STRETCH N SEAL 7285
SANDI110198CHOICE DM 0911
SANDI111144RANDOM WEIGHT BRAND 0704
SANDI111144RANDOM WEIGHT BRAND 0738

The output should be in the format below.

SANDI FRANKLIN WRAP 108085 7285
SANDI ZIPLOC STRETCH N SEAL 109514 7285
SANDI CHOICE DM 110198 0911
SANDI RANDOM WEIGHT BRAND 111144 0704 0738

I need to know whether we can use printf '%s %-51s' for formatting in awk.

Regards

matrixmadhan · May 5, 2008, 12:13pm

you don't have to post the requirement again

dhanamurthy · May 6, 2008, 11:45am

Does anyone have an answer to my question?

Regards
Dhana

jim_mcnamara · May 6, 2008, 12:18pm

printf format strings in awk work the same as printf format strings in C or in /usr/bin/printf.

Check your man page.

And yes, if your system does not have a lot of resources or has virtual memory limits, there will be problems. As with any app.

ulimit -a

look for limitations in the output. unlimited == no limits
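
To make the printf point concrete, here is a rough sketch of the requested layout using a left-justified 51-character column. It assumes the fixed widths seen in the sample (5-character prefix, 6-digit id, then the name) and assumes the `%-51s` is meant for the name column; the function name reformat_line is made up for illustration:

```shell
# reformat_line: hypothetical helper, reads the named files or stdin
reformat_line() {
    awk '{
        code = $NF                         # last column
        line = $0
        sub(/ +[^ ]+ *$/, "", line)        # strip the last column from the copy
        prefix = substr(line, 1, 5)        # e.g. SANDI
        id     = substr(line, 6, 6)        # e.g. 108085
        name   = substr(line, 12)          # e.g. FRANKLIN WRAP
        # %-51s pads the name with spaces on the right to 51 characters
        printf "%s %-51s %s %s\n", prefix, name, id, code
    }' "$@"
}
# usage: reformat_line filename
```

This only reformats each line; it does not merge the repeated ones.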

radoulov · May 6, 2008, 4:18pm

Based on your sample: assuming consecutive repeated lines and some fixed widths:

awk 'END { print s } { 
one = substr($1,1,5)
two = substr($1,6,6)
three = substr($1,12) 
if (one FS three != t) {
  if (NR > 1) print s    # skip the empty line that would otherwise be printed first
  s = ""
  }
$1 = t = one FS three
_ = $NF; $NF = two FS $NF
s = s ? s FS _ : $0 
}' file

Use nawk or /usr/xpg4/bin/awk on Solaris.
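
On the 3GB question: when the repeated lines are consecutive, only the current group needs to be held in memory, so no array is required. A more verbose sketch of that idea, assuming the same fixed widths as the sample and comparing the whole line minus its last column; the function name group_stream and its variable names are made up for illustration:

```shell
# group_stream: hypothetical helper, reads the named files or stdin
group_stream() {
    awk '{
        code = $NF                         # last column
        head = $0
        sub(/ +[^ ]+ *$/, "", head)        # line without its last column
        if (NR == 1 || head != prev) {
            if (NR > 1) print out          # a new group starts: flush the old one
            # prefix, name, id, then the first value of the last column
            out = substr(head, 1, 5) " " substr(head, 12) " " substr(head, 6, 6) " " code
            prev = head
        } else {
            out = out " " code             # same group: append the last column
        }
    }
    END { if (NR > 0) print out }' "$@"
}
# usage: group_stream filename
```

If the repeated lines are not adjacent, sorting first (sort filename | group_stream) should bring them together, at the cost of losing the original line order.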

summer_cherry · May 7, 2008, 5:38am

awk '{
t=$1
for(i=2;i<=NF-1;i++)
{
	t=t" "$i    # key: every field except the last
}
a[t]=a[t]" "$NF    # collect the last column per key
}
END{
for( i in a)
	print i a[i]
}' filename