File breaking

bond2222 · October 21, 2010, 11:23am

Hey,

I have to take one CSV file and break it into several files. Let's say I have a file prices.csv with data like this:

1,12345
1,34567
1,23456
2,67890
2,77720
2,44556
2,55668
10,44996

I want to create files based on the first column. In this example 1 is repeated three times, so one file should contain only the rows that belong to 1:
groupnumber_1.csv
1,12345
1,34567
1,23456

and one more file I have to create, groupnumber_2.csv:
2,67890
2,77720
2,44556
2,55668

and one more file, groupnumber_10.csv:
10,44996

and so on. This way I have to create as many CSV files as there are distinct values in the first column.

Please help me with the shell script for this.

Tytalus · October 21, 2010, 12:11pm

#  nawk '{f="groupnumber_"$1".csv";print $0>f}' FS="," infile

#  head groupnumber_*
==> groupnumber_1.csv <==
1,12345
1,34567
1,23456

==> groupnumber_10.csv <==
10,44996

==> groupnumber_2.csv <==
2,67890
2,77720
2,44556
2,55668

ctsgnb · October 21, 2010, 12:39pm

awk -F, 'OFS, (++A[$1]) {print $0 > "groupnumber_"$1".csv"}' prices.csv

@Scru1Linizer : pls don't laugh about my solution

---------- Post updated at 06:39 PM ---------- Previous update was at 06:31 PM ----------

awk -F, '{print $0 >"groupnumber_"$1".csv"}' prices.csv

bond2222 · October 21, 2010, 2:57pm

Thanks for your quick responses. This forum is really helpful to me.

alister · October 21, 2010, 3:10pm

A pitfall of these solutions is that they will fail if they reach the open file resource limit. If there are many files to be created, or if the allowed number of open files per process is low and cannot be raised, they may require the addition of a close after the print.

Just something to keep in mind.

Regards,
Alister
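
As a sketch of one way around that limit (not something posted in this thread): if the input is grouped by key first, only one output file needs to be open at a time, and it can be closed when the key changes. This assumes a sort that supports the -s (stable) option, as GNU and BSD sort do, so rows keep their original order within each group. The sample rows are the ones from the first post.

```shell
# Sample data from the first post, written to prices.csv for illustration.
printf '%s\n' 1,12345 1,34567 1,23456 2,67890 2,77720 2,44556 2,55668 10,44996 > prices.csv

# Group identical keys together (stable, so row order within a key is kept),
# then keep at most one output file open at any time.
sort -s -t, -k1,1 prices.csv |
awk -F, '{
    name = "groupnumber_" $1 ".csv"
    if (name != f) {            # key changed: finish the previous file
        if (f != "") close(f)
        f = name
    }
    print > f                   # ">" truncates once, when the file is first opened
}'
```

Because every key's rows are contiguous after the sort, each file is opened exactly once, so ">" never truncates data written earlier in the same run.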

ctsgnb · October 21, 2010, 3:18pm

Dude Alister ... always nitpicking ... but still always true !

Under Solaris:

rlim_fd_max

Description: Specifies the "hard" limit on file descriptors that a single process might have open. Overriding this limit requires superuser privilege.
Data Type: Signed integer
Default: 65,536
Range: 1 to MAXINT
Units: File descriptors
Dynamic? No
Validation: None
When to Change: When the maximum number of open files for a process is not enough. Other limitations in system facilities can mean that a larger number of file descriptors is not as useful as it might be. For example:
  - A 32-bit program using standard I/O is limited to 256 file descriptors. A 64-bit program using standard I/O can use up to 2 billion descriptors. Specifically, standard I/O refers to the stdio(3C) functions in libc(3LIB).
  - select is by default limited to 1024 descriptors per fd_set. For more information, see select(3C). Starting with the Solaris 7 release, 32-bit application code can be recompiled with a larger fd_set size (less than or equal to 65,536). A 64-bit application uses an fd_set size of 65,536, which cannot be changed.

bond2222 · November 24, 2010, 2:21pm

nawk '{f="groupnumber_"$1".csv";print $0>f}' FS="," infile

Thanks

This is good, but I am trying to remove the first column after the files are generated. Can we add anything to the above command? Can you help me with this, please?

Scrutinizer · November 24, 2010, 2:38pm

Try:

nawk '{f="groupnumber_"$1".csv"; print $0>f; print $2}' FS="," infile > outfile
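
If the aim is for the group files themselves to omit the first column, a variant along these lines might do it. This is only a sketch, not the command above: it builds the file name first and then strips everything up to and including the first comma, so it also copes with rows that have more than two columns. It uses plain awk and the sample rows from the first post, and it keeps every output file open, so the open-file limit mentioned earlier still applies.

```shell
# Sample data from the first post, written to prices.csv for illustration.
printf '%s\n' 1,12345 1,34567 1,23456 2,67890 2,77720 2,44556 2,55668 10,44996 > prices.csv

awk -F, '{
    f = "groupnumber_" $1 ".csv"   # build the name before the line is modified
    sub(/^[^,]*,/, "")             # drop the first field and its trailing comma
    print > f                      # write the remaining columns to the group file
}' prices.csv
```

With the sample data, groupnumber_1.csv then holds only the second-column values for key 1.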

bond2222 · November 29, 2010, 9:39am

Thanks, but this command has a limitation. On the test file it works perfectly, but when I run it on the actual file I cannot create more than 25 files. It seems to hit some limit.

Can we write loop logic for this? Please suggest something.

Scrutinizer · November 29, 2010, 11:32am

See if this works. You'd have to remove the files beforehand or new lines will be appended:

nawk '{f="groupnumber_"$1".csv"; print $0>>f; close(f);print $2}' FS="," infile > outfile

bond2222 · November 29, 2010, 2:31pm

Thanks. It is working fine.

bond2222 · December 2, 2010, 9:21pm

Working fine. But I don't understand this.

"You'd have to remove the files beforehand or new lines will be appended:"

I did not do anything; the command works and creates the files.

Scrutinizer · December 3, 2010, 12:23am

What I mean is that when you run the command again, and if output files already exist, it will append new records to these existing files rather than overwrite them.
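
To make a re-run start from a clean slate, one option is to delete the old output files before splitting. The sketch below wraps that in a small shell function; the name split_by_group is made up for illustration, and the rm pattern assumes nothing else in the directory matches groupnumber_*.csv.

```shell
# Hypothetical helper: remove earlier output, then split with append + close.
split_by_group() {
    rm -f groupnumber_*.csv     # assumes only generated files match this pattern
    awk -F, '{
        f = "groupnumber_" $1 ".csv"
        print >> f              # append, because the file is reopened for every row
        close(f)                # stay under the open-file limit
    }' "$1"
}

# Sample data from the first post, written to prices.csv for illustration.
printf '%s\n' 1,12345 1,34567 1,23456 2,67890 2,77720 2,44556 2,55668 10,44996 > prices.csv

split_by_group prices.csv
split_by_group prices.csv       # second run: files are recreated, not doubled
wc -l < groupnumber_2.csv       # still 4 lines, not 8
```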