Beginner: Count & Sort Using Arrays

BigTOE · February 4, 2011, 10:01am

Hi,

I'm new to Linux & bash, so please forgive my ignorance. Just wondering if anyone can help.

I have a file (mainfile.txt) with comma-delimited values, like so:

 
    $1  $2  $3
613212, 36, 57
613212, 36, 10
613212, 36, 10
677774, 36, 57
619900, 10, 10

I need to split this file into two files. Any entries where the first number ($1) occurs more than once need to go in file1.txt. Any entries whose first number occurs only once need to go into file2.txt, like so:

File1.txt

613212, 36, 57
613212, 36, 10
613212, 36, 10

File2.txt

677774, 36, 57
619900, 10, 10

Any solution to this problem would be greatly appreciated, no matter how implemented.

However, I would like to get experience using arrays, so a solution incorporating them would be fantastic. I've experimented with them, like this, but can't get my head around them.

awk ' { arr[$0]++ } END { for( str in arr  ) { if ( arr[str] = 1 ) print str " " arr[str] } } ' /directory/mainfile.txt | >/directory/file1.txt
 
awk ' { arr[$0]++ } END { for( str in arr  ) { if ( arr[str] > 1 ) print str " " arr[str] } } ' /directory/mainfile.txt | >/directory/file2.txt

It's a botch job and, not surprisingly, doesn't work. Even if it did, how do I specify $1 and only $1 as the thing to look at, instead of a string?

Cheers,

Ian

jim_mcnamara · February 4, 2011, 10:37am

There are probably more efficient ways as this involves two passes:

awk ' {arr[$1]++; next} END{for (i in arr) {print arr[i], i }} ' mainfile.txt > t.tmp
awk ' FILENAME=="t.tmp" {arr[$2]=$1; next}
      FILENAME=="mainfile.txt" { if(arr[$1]>1)
                                    {print $0 >"file1.txt" }
                                 else
                                    {print $0 >"file2.txt" } }' t.tmp mainfile.txt
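For comparison, here is a rough sketch of a single-pass variant that keeps every line in an array until the counts are known. It is not taken from any reply in this thread; the scratch directory, the array names (count, line, key) and the space-separated sample data are assumptions for illustration only.

```shell
# Scratch directory so no real files are touched (an assumption for this demo).
cd "$(mktemp -d)"

# Sample data from the first post, space-separated.
printf '%s\n' '613212 36 57' '613212 36 10' '613212 36 10' \
              '677774 36 57' '619900 10 10' > mainfile.txt

# Single pass: count each first field and remember every line in arrays,
# then decide where each buffered line goes once all counts are known.
awk '{ count[$1]++; line[NR] = $0; key[NR] = $1 }
     END {
         for (n = 1; n <= NR; n++)
             print line[n] > (count[key[n]] > 1 ? "file1.txt" : "file2.txt")
     }' mainfile.txt

echo "file1.txt:"; cat file1.txt
echo "file2.txt:"; cat file2.txt
```

This reads the input only once, but it holds the whole file in memory, so it is probably only a good trade for small files.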

BigTOE · February 4, 2011, 12:21pm

Hi Jim,

Thanks for the reply. It's almost there but not quite.

After running, my files are as follows:

$ cat /mainfile.txt
613212 36 57
613212 36 10
613212 36 10
677774 36 57
619900 10 10

$ cat /t.tmp
1 619900
3 613212
1 677774

$ cat /file1.txt  -  *empty*

$ cat /file2.txt
613212 36 57
613212 36 10
613212 36 10
677774 36 57
619900 10 10

The t.tmp file is spot on: it knows there are three instances of 613212 and one each of the others.

However, when it comes to sending them to files, file1.txt contains nothing when it should have three entries (all 613212), and file2.txt should have only the other two entries.

I've tried changing certain things but I'm shooting in the dark here.

Thanks again for helping.

Cheers

Ian

yinyuemi · February 4, 2011, 1:09pm

try:

awk 'NR==FNR && ++a[$1] >1{b[$1]=1} NR!=FNR{if(b[$1]) {print >>"file1"} else {print >> "file2"}}' file file

Scrutinizer · February 5, 2011, 6:46am

awk -F, 'A[$1]++{print>"file2.txt";next}1' mainfile.txt >file1.txt

Franklin52 · February 5, 2011, 7:35am

@Scrutinizer, the first line of each group with more than one occurrence of $1 goes to the wrong file.

Another approach:

awk -F, 'NR==FNR{a[$1]++;next}{print > (a[$1]>1?"File1.txt":"File2.txt")}' file file
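A runnable sketch of this two-pass approach, using the comma-separated sample data from the first post. The scratch directory and the array name count are assumptions for the demo; the awk logic follows the one-liner above, with mainfile.txt named twice as the input.

```shell
# Scratch directory so no real files are touched (an assumption for this demo).
cd "$(mktemp -d)"

# Sample data exactly as shown in the first post (comma-separated).
printf '%s\n' '613212, 36, 57' '613212, 36, 10' '613212, 36, 10' \
              '677774, 36, 57' '619900, 10, 10' > mainfile.txt

# Pass 1 (NR==FNR): count how often each first field occurs.
# Pass 2: print each line to File1.txt if its key occurred more than once,
# otherwise to File2.txt.
awk -F, 'NR==FNR { count[$1]++; next }
         { print > (count[$1] > 1 ? "File1.txt" : "File2.txt") }' mainfile.txt mainfile.txt

echo "File1.txt:"; cat File1.txt
echo "File2.txt:"; cat File2.txt
```

Because the counts are complete before the second pass starts, the first line of a repeated group ends up in File1.txt along with the rest of the group.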

Scrutinizer · February 5, 2011, 7:42am

@Franklin: That's right. I misread the post :o

BigTOE · February 7, 2011, 7:43am

awk 'NR==FNR && ++a[$1] >1{b[$1]=1} NR!=FNR{if(b[$1]) {print >>"file1.txt"} else {print >> "file2.txt"}}' file1.txt file2.txt

That works absolutely perfectly! Thanks guys!

The only thing is I haven't got a clue how it works, specifically how it knows to look in 'mainfile.txt' as it is not specified. There are a few other files being generated immediately before this line of code, and they are in the same directory, so how do NR (number of records?) and FNR (all files?) know which file to look in? (It is definitely looking at mainfile.txt, because I've checked.)

Thanks again for all contributions and the solution.

yinyuemi · February 7, 2011, 12:58pm

When NR==FNR, the code is reading through the mainfile for the first time. It counts how often each item ($1) occurs, and any item seen more than once (++a[$1]>1) gets a tag (b[$1]=1). Next, when NR>FNR (or NR!=FNR), the code reads through the mainfile again and this time, based on your rule (b[$1]==1 goes to file1; otherwise file2), generates the two output files: file1 and file2. Does that make sense?
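To see the two counters in action, a small sketch may help. It assumes a throwaway file called sample.txt (a hypothetical name, filled with the data from the first post) and prints NR and FNR for every line while awk reads that file twice:

```shell
# Scratch directory so no real files are touched (an assumption for this demo).
cd "$(mktemp -d)"

# Sample data from the first post, space-separated.
printf '%s\n' '613212 36 57' '613212 36 10' '613212 36 10' \
              '677774 36 57' '619900 10 10' > sample.txt

# NR counts records across all input files; FNR restarts at 1 for each file.
# So NR==FNR is true only while the first file argument is being read.
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, (NR==FNR ? "pass1" : "pass2") }' \
    sample.txt sample.txt > counters.txt

cat counters.txt
```

awk only reads the files named after the closing quote of the program, in the order given. Neither NR nor FNR chooses a file; they just count lines of whatever was listed there.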

BigTOE · February 8, 2011, 6:58am

Hi Yinyuemi,

I'm afraid not. I understand the mathematics of how it decides whether $1 occurs more than once, and what to do with it in either case.

It's just that at no point do I explicitly state "look in 'mainfile.txt' for this information".

There is a Master file. From this I strip some things out into another file, and from that I strip some more data to create Mainfile.txt.

It seems no matter where I put this:

awk 'NR==FNR && ++a[$1] >1{b[$1]=1} NR!=FNR{if(b[$1]) {print >>"file1.txt"} else {print >> "file2.txt"}}' file1.txt file2.txt

No matter at which stage I place it as all the above files are created (at the beginning, or in between them), it does the same thing and looks in 'mainfile.txt'. How?

I understand the high likelihood that this is just a case of me being a complete idiot and needing something incredibly simple spelled out for me, but with only three months of sporadic Linux/bash programming under my belt that's not too surprising :o

Cheers again for your help.

Ian

rdcwayx · February 8, 2011, 6:20pm

Really, I can't believe it. A typo in it?

BigTOE · February 9, 2011, 6:26am

Haha yeah it works, I already put in the '.txt' and the correct paths.

Just don't know how it works...

---------- Post updated at 11:26 AM ---------- Previous update was at 09:34 AM ----------

OK,

Remember how I said there is a Master file...

and from this I strip data out to produce another file...

and from this I strip more data out to get Mainfile.txt...

and when I ran the array it somehow looked in Mainfile.txt and it all tallied up correctly.

Well, now I've tried using a different Master file. All the changes filter through all subsequent steps until it gets to the array part, and it is still outputting the same data as before, i.e. it is still looking in the same place for its data.

Is there any way I can explicitly state to look in mainfile.txt?
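One way to make the input explicit, sketched with the NR==FNR one-liner from earlier in the thread: name mainfile.txt itself (twice) after the awk program, and clear out old output first. The scratch directory and sample data are assumptions for the demo, and the rm step is an added precaution, not something from the original replies.

```shell
# Scratch directory so no real files are touched (an assumption for this demo).
cd "$(mktemp -d)"

# Sample data from the first post, space-separated.
printf '%s\n' '613212 36 57' '613212 36 10' '613212 36 10' \
              '677774 36 57' '619900 10 10' > mainfile.txt

# Remove output left over from earlier runs: >> inside awk appends to
# existing files, so stale lines would otherwise stay in place.
rm -f file1.txt file2.txt

# The input is named explicitly, twice: once for the counting pass
# (NR==FNR) and once for the pass that writes the output files.
awk 'NR==FNR && ++a[$1] > 1 { b[$1] = 1 }
     NR!=FNR { if (b[$1]) { print >> "file1.txt" } else { print >> "file2.txt" } }' \
    mainfile.txt mainfile.txt

echo "file1.txt:"; cat file1.txt
echo "file2.txt:"; cat file2.txt
```

If the two names after the program are anything other than mainfile.txt, awk reads those files instead, whatever the earlier steps of the script produced.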

BigTOE · February 10, 2011, 7:43am

OK, scratch NR==FNR.

I tweaked jim's:-

awk ' {arr[$1]++; next} END{for (i in arr) {print arr[i], i }} ' mainfile.txt > t.tmp
awk ' FILENAME=="t.tmp" {arr[$2]=$1; next}
      FILENAME=="mainfile.txt" { if(arr[$1]>1)
                                    {print $0 >"file1.txt" }
                                 else
                                    {print $0 >"file2.txt" } }' t.tmp mainfile.txt

This one does work perfectly. Thank you to everyone who contributed.

Ian