sort and split file by 2 cols (1 col after the other)

Dear All,

I am a newbie to shell scripting so this one is really over my head.

I have a text file with five fields as below:

76576.867188 6232.454102 2.008904 55.000000 3
76576.867188 6232.454102 3.607231 55.000000 4
76576.867188 6232.454102 1.555146 65.000000 3
76576.867188 6232.454102 3.226928 65.000000 4
76576.867188 6232.454102 2.180096 100.000000 4
66576.867188 4232.454102 2.944000 0.000000 1
66576.867188 4232.454102 3.549000 0.000000 2
66576.867188 4232.454102 4.517000 0.000000 3
.......

I want to a bash script to sort and split the file using the 4th column and then further split each resulting file by the 5th column.

I tried:

#!/bin/bash

sort $1 | awk '{ file=substr($0,35,5)".tmp"; print >> file }'

for i in 'ls *.tmp'
do
awk '{ file=substr($0,44,1)".txt"; print >> file }'
done

but it is not working at all.

Could someone please point me in the right direction?

Thanks.

What is the desired output?

To keep the forums high quality for all users, please take the time to format your posts correctly.

Use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags

```text
 and 
```

by hand.)

You don't need a full script :wink:

sort -nk4 file | awk '{newfile=$NF".txt";print >  newfile}'

To Franklin52 and All,

Sorry for not making myself clear in my earlier post.

I want output files like:

55.000 3.txt
76576.867188 6232.454102 2.008904 55.000000 3
...................................................55.000000 3

55.000 4.txt
76576.867188 6232.454102 3.607231 55.000000 4
...................................................55.000000 4

65.000000 3.txt
76576.867188 6232.454102 1.555146 65.000000 3
...................................................65.000000 3

65.000000 4.txt
76576.867188 6232.454102 3.226928 65.000000 4
...................................................65.000000 4

etc.

I am however trying out danmero's sugggestion.

Thanks

---------- Post updated at 11:20 AM ---------- Previous update was at 09:52 AM ----------

Hi All

If I keep the 4th field at the same length as below:

76576.867188 6232.454102 2.008904 55.00 3
76576.867188 6232.454102 3.607231 55.00 4
76576.867188 6232.454102 1.555146 65.00 3
76576.867188 6232.454102 3.226928 65.00 4
76576.867188 6232.454102 2.180096 100.0 4
66576.867188 4232.454102 2.944000 0.000 1
66576.867188 4232.454102 3.549000 0.000 2
66576.867188 4232.454102 4.517000 0.000 3

then this:

sort -nk4 myfile | awk '{ file=substr($0,35,7)".txt"; print >> file }'

works.

But I cannot figure it out for variable 4th field length.

danmero,

I get this error with your code:

".txt' <No such file or directory>n't redirect to '1"

Avoid spaces in file names, try this:

sort -nk4 myfile | awk '{ print > $4 "_" $5 ".txt" }'

The question is why you need the 4th field length to archive your goal?
Take a closer look at my and Franklin52 solutions.

Franklin
I get an error message similar to danmero's with your code:

danmero

I don't need the length of the 4th field per se.
I focused on it out of fustration I guess.

I am still trying to get your code or franklin52's to work.

Post a data sample using [code] tags and we will try to help you.

danmero,

Here is a data sample:

76576.867188 6232.454102 1.944000 0.000000 1
76576.867188 6232.454102 2.549000 0.000000 2
76576.867188 6232.454102 3.517000 0.000000 3
76576.867188 6232.454102 4.976000 0.000000 4
76576.867188 6232.454102 0.792961 55.000000 2
76576.867188 6232.454102 2.008904 55.000000 3
76576.867188 6232.454102 3.607231 55.000000 4
76576.867188 6232.454102 1.555146 65.000000 3
76576.867188 6232.454102 3.226928 65.000000 4
76576.867188 6232.454102 2.180096 100.000000 4
66576.867188 4232.454102 2.944000 0.000000 1
66576.867188 4232.454102 3.549000 0.000000 2
66576.867188 4232.454102 4.517000 0.000000 3
66576.867188 4232.454102 5.976000 0.000000 4
66576.867188 4232.454102 0.855956 55.000000 2
66576.867188 4232.454102 2.104038 55.000000 3
66576.867188 4232.454102 3.737020 55.000000 4
66576.867188 4232.454102 1.608926 65.000000 3
66576.867188 4232.454102 3.319681 65.000000 4
66576.867188 4232.454102 2.234251 100.000000 4
86576.867188 5232.454102 3.944000 0.000000 1
86576.867188 5232.454102 3.549000 0.000000 2
86576.867188 5232.454102 4.537932 0.000000 3
86576.867188 5232.454102 6.022880 0.000000 4
86576.867188 5232.454102 0.705243 55.000000 2
86576.867188 5232.454102 1.563464 55.000000 3
86576.867188 5232.454102 3.311949 55.000000 4
86576.867188 5232.454102 1.635146 65.000000 3
86576.867188 5232.454102 3.370348 65.000000 4
86576.867188 5232.454102 2.266589 100.000000 4

Thanks

# cat file
76576.867188 6232.454102 1.944000 0.000000 1
76576.867188 6232.454102 2.549000 0.000000 2
76576.867188 6232.454102 3.517000 0.000000 3
76576.867188 6232.454102 4.976000 0.000000 4
76576.867188 6232.454102 0.792961 55.000000 2
76576.867188 6232.454102 2.008904 55.000000 3
76576.867188 6232.454102 3.607231 55.000000 4
76576.867188 6232.454102 1.555146 65.000000 3
76576.867188 6232.454102 3.226928 65.000000 4
76576.867188 6232.454102 2.180096 100.000000 4
66576.867188 4232.454102 2.944000 0.000000 1
66576.867188 4232.454102 3.549000 0.000000 2
66576.867188 4232.454102 4.517000 0.000000 3
66576.867188 4232.454102 5.976000 0.000000 4
66576.867188 4232.454102 0.855956 55.000000 2
66576.867188 4232.454102 2.104038 55.000000 3
66576.867188 4232.454102 3.737020 55.000000 4
66576.867188 4232.454102 1.608926 65.000000 3
66576.867188 4232.454102 3.319681 65.000000 4
66576.867188 4232.454102 2.234251 100.000000 4
86576.867188 5232.454102 3.944000 0.000000 1
86576.867188 5232.454102 3.549000 0.000000 2
86576.867188 5232.454102 4.537932 0.000000 3
86576.867188 5232.454102 6.022880 0.000000 4
86576.867188 5232.454102 0.705243 55.000000 2
86576.867188 5232.454102 1.563464 55.000000 3
86576.867188 5232.454102 3.311949 55.000000 4
86576.867188 5232.454102 1.635146 65.000000 3
86576.867188 5232.454102 3.370348 65.000000 4
86576.867188 5232.454102 2.266589 100.000000 4

# sort -nk4 file | awk '{newfile=$NF".txt";print >  newfile}'

# cat 1.txt
66576.867188 4232.454102 2.944000 0.000000 1
76576.867188 6232.454102 1.944000 0.000000 1
86576.867188 5232.454102 3.944000 0.000000 1

# cat 2.txt
66576.867188 4232.454102 3.549000 0.000000 2
76576.867188 6232.454102 2.549000 0.000000 2
86576.867188 5232.454102 3.549000 0.000000 2
66576.867188 4232.454102 0.855956 55.000000 2
76576.867188 6232.454102 0.792961 55.000000 2
86576.867188 5232.454102 0.705243 55.000000 2

# cat 3.txt
66576.867188 4232.454102 4.517000 0.000000 3
76576.867188 6232.454102 3.517000 0.000000 3
86576.867188 5232.454102 4.537932 0.000000 3
66576.867188 4232.454102 2.104038 55.000000 3
76576.867188 6232.454102 2.008904 55.000000 3
86576.867188 5232.454102 1.563464 55.000000 3
66576.867188 4232.454102 1.608926 65.000000 3
76576.867188 6232.454102 1.555146 65.000000 3
86576.867188 5232.454102 1.635146 65.000000 3

# cat 4.txt
66576.867188 4232.454102 5.976000 0.000000 4
76576.867188 6232.454102 4.976000 0.000000 4
86576.867188 5232.454102 6.022880 0.000000 4
66576.867188 4232.454102 3.737020 55.000000 4
76576.867188 6232.454102 3.607231 55.000000 4
86576.867188 5232.454102 3.311949 55.000000 4
66576.867188 4232.454102 3.319681 65.000000 4
76576.867188 6232.454102 3.226928 65.000000 4
86576.867188 5232.454102 3.370348 65.000000 4
66576.867188 4232.454102 2.234251 100.000000 4
76576.867188 6232.454102 2.180096 100.000000 4
86576.867188 5232.454102 2.266589 100.000000 4

That's what you want :cool:

So will this do...?

awk '{print >> $4 " " $5 ".txt"}' datafile

Or are you literally after that extra line, like this ...?

awk '{ print >> $4 " " $5 ".txt"; f[$4 " " $5]=1 }
     END { for (s in f) print "..................................................." s >> s ".txt" }' datafile

cambridge and danmero

cambridge's code generates the following error message:

awk: syntax error at source line 1
 context is
	{print >> $4 " >>>  " <<< 
awk: illegal statement at source line 1
awk: illegal statement at source line 1

The following are the output files I am after:

0.000000 1.txt
66576.867188 4232.454102 2.944000 0.000000 1
76576.867188 6232.454102 1.944000 0.000000 1
86576.867188 5232.454102 3.944000 0.000000 1

0.000000 2.txt
66576.867188 4232.454102 3.549000 0.000000 2
76576.867188 6232.454102 2.549000 0.000000 2
86576.867188 5232.454102 3.549000 0.000000 2

55.000000 2.txt
66576.867188 4232.454102 0.855956 55.000000 2
76576.867188 6232.454102 0.792961 55.000000 2
86576.867188 5232.454102 0.705243 55.000000 2

0.000000 3.txt
66576.867188 4232.454102 4.517000 0.000000 3
76576.867188 6232.454102 3.517000 0.000000 3
86576.867188 5232.454102 4.537932 0.000000 3

55.000000 3.txt
66576.867188 4232.454102 2.104038 55.000000 3
76576.867188 6232.454102 2.008904 55.000000 3
86576.867188 5232.454102 1.563464 55.000000 3

65.000000 3.txt
66576.867188 4232.454102 1.608926 65.000000 3
76576.867188 6232.454102 1.555146 65.000000 3
86576.867188 5232.454102 1.635146 65.000000 3

0.000000 4.txt
66576.867188 4232.454102 5.976000 0.000000 4
76576.867188 6232.454102 4.976000 0.000000 4
86576.867188 5232.454102 6.022880 0.000000 4

55.000000 4.txt
66576.867188 4232.454102 3.737020 55.000000 4
76576.867188 6232.454102 3.607231 55.000000 4
86576.867188 5232.454102 3.311949 55.000000 4

65.000000 4.txt
66576.867188 4232.454102 3.319681 65.000000 4
76576.867188 6232.454102 3.226928 65.000000 4
86576.867188 5232.454102 3.370348 65.000000 4

100.000000 4.txt
66576.867188 4232.454102 2.234251 100.000000 4
76576.867188 6232.454102 2.180096 100.000000 4
86576.867188 5232.454102 2.266589 100.000000 4

It is important each output file name is made up of the 4th and 5th columns' value.

I hope this clarifies my question.

---------- Post updated at 01:21 AM ---------- Previous update was at 12:58 AM ----------

danmero and cambridge

This seems to work:

sort -nk4 infile | awk '{ file=substr($0,35,11)".txt"; print >> file }'

Thanks for your efforts

Almost :cool:

#sort -nk4 infile | awk '{ file=substr($0,35,11)".txt"; print file }'
0.000000 1.txt
0.000000 2.txt
0.000000 3.txt
0.000000 4.txt
0.000000 1.txt
0.000000 2.txt
0.000000 3.txt
0.000000 4.txt
0.000000 2.txt
0.000000 1.txt
0.000000 3.txt
0.000000 4.txt
55.000000 2.txt
55.000000 3.txt
55.000000 4.txt
55.000000 2.txt
55.000000 3.txt
55.000000 4.txt
55.000000 2.txt
55.000000 3.txt
55.000000 4.txt
65.000000 3.txt
65.000000 4.txt
65.000000 3.txt
65.000000 4.txt
65.000000 3.txt
65.000000 4.txt
100.000000 .txt
100.000000 .txt
100.000000 .txt

Maybe

# sort -nk4 infile | awk '{ file=$4"_"$5".txt"; print file }'
0.000000_1.txt
0.000000_2.txt
0.000000_3.txt
0.000000_4.txt
0.000000_1.txt
0.000000_2.txt
0.000000_3.txt
0.000000_4.txt
0.000000_2.txt
0.000000_1.txt
0.000000_3.txt
0.000000_4.txt
55.000000_2.txt
55.000000_3.txt
55.000000_4.txt
55.000000_2.txt
55.000000_3.txt
55.000000_4.txt
55.000000_2.txt
55.000000_3.txt
55.000000_4.txt
65.000000_3.txt
65.000000_4.txt
65.000000_3.txt
65.000000_4.txt
65.000000_3.txt
65.000000_4.txt
100.000000_4.txt
100.000000_4.txt
100.000000_4.txt

See your previous post/requirement :wink:

PS: Next time post the data sample and required output when you start a new thread, that will save time to everybody :wink:

Thanks danmero, your last solution is even better.

Sorry I didn't make myself clear initially.

Now one last request:

If I enter values for the 4th and 5th columns on the command line, how can I extract the corresponding files? Example if I enter the following:

4th column = 0.000000 , 55.000000, 100.000000
5th column = 1

How do I extract /select the corresponding

0.000000 1.txt, 55.000000 1.txt, 100.000000 1.txt

files?

Thanks again.

"cut" command maybe an alternative to awk way to split any file with fixed or variable lenghth fields and definable field separator.

sort -nk4 file | awk '$4 ~ "^[0|55|100]" && $5==1{ file=$4"_"$5".txt"; print file }'

Is this a "catch 22" ? Base on your data sample will be only one file :wink:

gch
Thanks for you "cut" suggestion.

danmero

Seems like, but not.

The sample file contains "layers" for a visualization program.

For example:

0.000000 1.txt
66576.867188 4232.454102 2.944000 0.000000 1
76576.867188 6232.454102 1.944000 0.000000 1
86576.867188 5232.454102 3.944000 0.000000 1

55.000000 2.txt
66576.867188 4232.454102 0.855956 55.000000 2
76576.867188 6232.454102 0.792961 55.000000 2
86576.867188 5232.454102 0.705243 55.000000 2

represent layer 1 @ 0.000000 and layer 2 @ 55.000000.

I want to select "layer" files for visualization by entering their 4th and 5th column values on the command line.

Hope this outlines what I am trying to do.

Thanks for all the help so far, really appreciated.

I think the below code snippet may help you.
the sample_log contains the sample input.

cat sample_log | while read LINE
do
#$echo $i
fl_name=`echo $LINE | awk '{ print $5 }'`
#echo $fl_name
fl_nm=$fl_name.txt
#echo $fl_nm
var1=`echo $LINE | awk '{print $4 }'`
echo $var1 " " $fl_name >$fl_nm
echo $LINE >>$fl_nm
echo "............................. $var1 $fl_name">>$fl_nm
done

Something like that ?

# fourth="0.000000|55.000000|100.000000"
# fifth=1
# sort -nk4 file | awk '$4 ~ x && $5==y{ file=$4"_"$5".txt"; print > file }' x=$fourth y=$fifth

siba
Thanks, your code works just like danmero's earlier solution.

danmero
Much appreciation, you have answered my last question.