A command to split a file into two based on a string

madrazzii · June 18, 2013, 10:29pm

Hello

What command can I use to split a tab-delimited txt file into two files based on the occurrence of a string?

my file name is EDIT.txt

The content of file is below

XX 1234 PROCEDURES
XY 1634 PROCEDURES
XM 1245 CODES
XZ 1256 CODES

It has more than a million records.
If there is PROCEDURES in a row, I want to output it to the PROCEDURES.txt file, otherwise to the CODES.txt file.

How would I use an AWK or SPLIT command?

Thanks for your help

Chubler_XL · June 18, 2013, 11:12pm

Using awk:

awk '/PROCEDURES/ { print > "PROCEDURES.txt"; next} { print > "CODES.txt" }' infile

rveri · June 19, 2013, 12:17am

Hi madrazzii,
Check this out:

awk '{if($0~/PROCEDURES/) print >"PROCEDURES.txt" ; if ($0~/CODES/) print >"CODES.txt" }' EDIT.txt
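
Both one-liners above match PROCEDURES anywhere in the line. A minimal sketch of a stricter variant, assuming the record type is always the third tab-separated column (as in the sample rows). The sample data is recreated here only so the snippet runs on its own in a scratch directory:

```shell
# Sample data in the same shape as EDIT.txt (tab-delimited, three columns).
printf 'XX\t1234\tPROCEDURES\nXY\t1634\tPROCEDURES\nXM\t1245\tCODES\nXZ\t1256\tCODES\n' > EDIT.txt

# Remove output left over from earlier experiments.
rm -f PROCEDURES.txt CODES.txt

# Compare only the third field, so PROCEDURES in another column is ignored.
# Everything that is not a PROCEDURES row falls through to CODES.txt.
awk -F'\t' '$3 == "PROCEDURES" { print > "PROCEDURES.txt"; next }
            { print > "CODES.txt" }' EDIT.txt

# Line counts of the two outputs should add up to the input line count.
wc -l PROCEDURES.txt CODES.txt
```

The file is read once, which matters more on a million-record input than on this sample.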

madrazzii · June 19, 2013, 7:52am

Thank you both.

I was using two separate commands to do this

sed '/PROCEDURES/d' EDIT.txt > codes.txt

grep "PROCEDURES" EDIT.txt > Procedures.txt

But I will use awk now

Thanks

J

rbatte1 · June 19, 2013, 10:23am

To stick with the two command approach, you would be better with:-

grep "PROCEDURES" EDIT.txt > procedures.txt
grep -v "PROCEDURES" EDIT.txt > codes.txt
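
If the case of the keyword is not consistent across the file, one hedged variant of the two-command approach is to add -i. The sample rows below, including the lower-case one, are made up for illustration:

```shell
# Hypothetical sample: one row spells the keyword in lower case.
printf 'XX\t1234\tPROCEDURES\nXM\t1245\tCODES\nXY\t1634\tprocedures\n' > EDIT.txt

# -i ignores case; -v inverts the match. Naming the input file keeps grep
# from waiting on standard input.
grep -i  'procedures' EDIT.txt > procedures.txt
grep -iv 'procedures' EDIT.txt > codes.txt

# The two outputs together should account for every input line.
cat procedures.txt codes.txt | wc -l
```

This reads the input twice, so on a very large file the single-pass awk version is likely to be quicker.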

Robin

madrazzii · July 17, 2013, 3:14pm

Another question I had: I have a file without a file extension that can be opened in Notepad. The file is 300 MB in size. It has multiple data sets in one file, and I want to extract each data set into its own txt file. The remarks that identify each data set are T0, P1, P2, P3, P4, P5, P6 and T9, and they appear at the END of each record. I want the file split into 8 different files, where file 1 has only the records ending in T0, file 2 those ending in P1, and so on. T0, P1, etc. might also appear in the middle of a line, but a record should be extracted only on the remark at the end of the row/line/record. The source file name is RAW. Is there a GREP command, or any other command, where I could use an IF-THEN-ELSE or a CASE statement?

tukuyomi · July 17, 2013, 4:11pm

awk '{print > $3".txt"}' EDIT.txt

awk '{print > $NF".txt"}' RAW

Chubler_XL · July 17, 2013, 4:13pm

If you know you only have a few different record types, you could try something like this:

awk '{ print > "file." substr($0,length-1) }' RAW

otherwise try

awk '
/T0$/ { print > "file1" ; next }
/P1$/ { print > "file2" ; next }
/P2$/ { print > "file3" ; next }
/P3$/ { print > "file4" ; next }
{ print > "file.UNKNOWN" }' RAW
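
The two ideas above can be combined. This is only a sketch, assuming the eight remarks listed in the question (T0, P1 to P6, T9) and no trailing carriage return on the lines. The sample rows are invented:

```shell
# Hypothetical sample standing in for RAW; the first row also has P1 mid-line.
printf 'row one has P1 inside T0\nrow two P1\nrow three T9\nrow four XX\n' > RAW

# Remove output left over from earlier experiments.
rm -f file.T0 file.P1 file.T9 file.UNKNOWN

awk '
# Accept only the eight known remarks, and only at the very end of the line.
/(T0|P[1-6]|T9)$/ { print > ("file." substr($0, length($0) - 1)); next }
# Anything else is set aside rather than silently dropped.
{ print > "file.UNKNOWN" }' RAW

ls file.*
```

The regular expression keeps unexpected endings out of the per-remark files, and the parentheses around the file name avoid relying on how a particular awk parses concatenation after ">".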

madrazzii · July 17, 2013, 4:33pm

Thank you. I tried the first method and changed it a little:

awk '{print > "file.txt" substr($0,length-2) }' RAW

I used length-2 to get the T0, P1, etc. appended to the output file names, and also added .txt hoping I would get fileT0.txt, fileP1.txt, but I get file.txtT0, file.txtP1.

Is there a way to have it saved with a .txt extension when it outputs?

Thanks

tukuyomi · July 17, 2013, 4:38pm

awk '{print > "file"$NF".txt"}' RAW
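
Note that $NF is the last whitespace-separated field, which equals the two-character remark only when whitespace comes right before it. A hedged alternative, assuming the remark is always exactly the last two characters of the line (sample rows invented):

```shell
# Hypothetical sample: the second row has no space before its remark.
printf 'alpha beta T0\ngammaP1\n' > RAW
rm -f fileT0.txt fileP1.txt

# Take exactly the last two characters as the tag, then wrap the whole
# file name in parentheses so every awk parses the concatenation the same way.
awk '{ tag = substr($0, length($0) - 1); print > ("file" tag ".txt") }' RAW

ls fileT0.txt fileP1.txt
```

With $NF the second row would have gone to a file named after the whole word gammaP1.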

madrazzii · July 17, 2013, 4:54pm

Works perfect. thanks again

---------- Post updated at 04:54 PM ---------- Previous update was at 04:43 PM ----------

Sorry to bother you again, but when I use the ls command the file is displayed as fileT0?.txt. It opens on my Ubuntu machine, but when I try to copy it to a Windows box it doesn't open, because the file is displayed as fileT0 .txt (there is a space). I am not able to rename it or copy it.

any help?

tukuyomi · July 17, 2013, 5:03pm

I suspect your RAW file to have \r\n at the end of each line (typically all txt files created from MS Windows notepad). You have to remove \r from the original file. Try (not tested)

tr '\r\n' '\n' RAW > RAW.1

and retry the awk script with RAW.1

madrazzii · July 17, 2013, 5:21pm

The file was received from a client and they said it was from z/OS system. i tried the code but it says 'tr - extra operand' and the output file is 0 byte

---------- Post updated at 05:21 PM ---------- Previous update was at 05:15 PM ----------

i opened the raw file in notepad++ and has the [CR][LF] at the end of each line

like

..........................T0[CR][LF]
..........................P1[CR][LF]
..........................P2[CR][LF]
..........................P2[CR][LF]
..........................P2[CR][LF]

Corona688 · July 17, 2013, 5:29pm

tukuyomi:

I suspect your RAW file to have \r\n at the end of each line (typically all txt files created from MS Windows notepad). You have to remove \r from the original file. Try (not tested)
tr '\r\n' '\n' RAW > RAW.1
and retry the awk script with RAW.1

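tr does not work that way. It deals with individual characters, not strings, so '\r\n' is read as a set of two separate characters rather than as the sequence CR LF. It also reads only from standard input, which is why passing RAW as a file name produced the 'extra operand' error.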

This should work:

tr -d '\r' < input > output
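
A small self-contained check of that command, using an invented two-line CRLF sample in place of the real RAW file:

```shell
# Two CRLF-terminated lines standing in for RAW.
printf 'first T0\r\nsecond P1\r\n' > RAW

# tr reads standard input only, so the file is supplied with "<".
tr -d '\r' < RAW > RAW.1

# Byte counts before and after: the difference is one byte per line.
wc -c < RAW
wc -c < RAW.1
```

Keep in mind that RAW.1 then has LF-only line endings, so anything produced from it will too.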

Chubler_XL · July 17, 2013, 5:34pm

You could also just have awk ignore the extra char on the end like this:

awk '{print > "file.txt" substr($0,length-2,2) }' RAW

tukuyomi · July 18, 2013, 2:56am

Thanks Corona688 for your input !
Here is another solution using AWK:

awk '{gsub("\r","");print > "file"$NF".txt"}' RAW

madrazzii · July 18, 2013, 12:18pm

Thank you all for your replies. i will try them and let you know. thanks again

---------- Post updated at 11:58 AM ---------- Previous update was at 10:09 AM ----------

I got it without the space in the file name, but since the "\r" (carriage return) is removed, my rows are all jumbled up. When I view the file in Notepad++ I see only [LF] at the end of each line instead of [CR][LF]. This causes multiple rows to show up on one line. I was going to import the file into SQL Server, but that would cause an error without a proper line terminator like [CR][LF].

Is there a way to change the code so that it keeps the carriage return?

---------- Post updated at 12:18 PM ---------- Previous update was at 11:58 AM ----------

Note: I used Chubler's code and renamed the file extensions. It does take time, but I have what I need now. Thanks

Code:

awk '{print > "file.txt" substr($0,length-2,2) }' RAW
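
On the question of keeping the carriage return, one possible sketch that avoids the renaming step: strip the CR only when working out the file name, and write the untouched line. It assumes every line of RAW ends in CR LF with the remark in the two characters before it. The sample rows are invented:

```shell
# CRLF-terminated sample standing in for RAW.
printf 'first row T0\r\nsecond row P1\r\nthird row T0\r\n' > RAW
rm -f fileT0.txt fileP1.txt

awk '{
    line = $0                          # untouched copy, carriage return included
    sub(/\r$/, "")                     # drop the CR from $0, used for naming only
    tag = substr($0, length($0) - 1)   # last two characters, e.g. T0 or P1
    print line > ("file" tag ".txt")   # output keeps its CR LF line ending
}' RAW

# Byte counts include the CR on every line.
wc -c fileT0.txt fileP1.txt
```

The output files are named fileT0.txt, fileP1.txt and so on, and each record still ends in [CR][LF] for the SQL Server import.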