Urgent Need Help! Merging lines in .txt file

I need to write a script that reads an input .txt file and merges consecutive lines whose distance is <=4000: whenever a line's distance is <=4000, the end value of the previous line is replaced with the end value of the current line. The header line shown below is not actually in the input. The distance on each line is the gap between the end of the previous line and the start of that line. In the example below, 3217 is the distance from the end of the first line to the start of the second line, and 14021 is the distance from the end of the previous line (not included) to the start of the first line.

Any help would be greatly appreciated! Thanks!

INPUT:

chrm start end block length distance
chr7 27398704 27399096 ENm010Block536 392 14021
chr7 27402314 27402466 ENm010Block537 152 3217
chr7 27412536 27412726 ENm010Block538 190 10069
chr7 27416032 27416424 ENm010Block539 392 3305
chr7 27420022 27420972 ENm010Block540 950 3597

Desired OUTPUT:

chr7 27398704 27402466
chr7 27412536 27420972

If I understand correctly, the output should be:

chr7 27398704 27402466
chr7 27412536 27416424

Am I missing something?

No, I actually meant to put the original 27420972, because the next distance is <=4000 as well, so those two lines get merged too. When several consecutive distances are <=4000, you keep merging until you reach a distance that is no longer <=4000.
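
To make the chaining rule concrete, here is a minimal sketch (not from the original posts) that labels each input line with the number of the merged block it would fall into. The function name label_groups and the variable max are made up for illustration, and it assumes an awk that supports -v (nawk, gawk, mawk or /usr/xpg4/bin/awk).

```shell
# Hypothetical helper: label each line with the merged block it belongs to.
# Usage: label_groups MAX < input.txt
# MAX is the largest distance that still counts as close enough to merge.
label_groups() {
  awk -v max="$1" '
    # The first line, or a distance larger than max, opens a new block.
    NR == 1 || ($NF + 0) > (max + 0) { group++ }
    # Print block number, chromosome, start, end and distance.
    { print group, $1, $2, $3, $NF }
  '
}
```

Run as label_groups 4000 on the five-line example above, the first two lines get label 1 and the last three get label 2, which is why the second output line ends at 27420972.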

Could you post a bigger input sample with the desired output?

Sure! This may take me a little while since I'm doing it manually, but it should be up in about 15 minutes. Thanks for your interest! :slight_smile:

INPUT: Use distance <=1000 to merge

chr7 27104483 27104633 ENm010Block71 150 0
chr7 27104634 27104812 ENm010Block72 178 0
chr7 27104813 27105154 ENm010Block73 341 0
chr7 27106872 27106977 ENm010Block74 105 1717
chr7 27106978 27107481 ENm010Block75 503 0
chr7 27107482 27108156 ENm010Block76 674 0
chr7 27108157 27108194 ENm010Block77 37 0
chr7 27108422 27108700 ENm010Block78 278 227
chr7 27109258 27109365 ENm010Block79 107 557
chr7 27109366 27109431 ENm010Block80 65 0
chr7 27109432 27110017 ENm010Block81 585 0
chr7 27110018 27110056 ENm010Block82 38 0
chr7 27110057 27110309 ENm010Block83 252 0
chr7 27110310 27110435 ENm010Block84 125 0
chr7 27110436 27110489 ENm010Block85 53 0
chr7 27110490 27110550 ENm010Block86 60 0
chr7 27110551 27110789 ENm010Block87 238 0
chr7 27111956 27112348 ENm010Block88 392 1166
chr7 27112374 27112830 ENm010Block89 456 25
chr7 27114388 27114881 ENm010Block90 493 1557
chr7 27114882 27115338 ENm010Block91 456 0
chr7 27115339 27115870 ENm010Block92 531 0
chr7 27116098 27116173 ENm010Block93 75 227
chr7 27116174 27116705 ENm010Block94 531 0
chr7 27116706 27116755 ENm010Block95 49 0
chr7 27116756 27116781 ENm010Block96 25 0
chr7 27116782 27116945 ENm010Block97 163 0
chr7 27116946 27117276 ENm010Block98 330 0
chr7 27117277 27117960 ENm010Block99 683 0
chr7 27118910 27119137 ENm010Block100 227 949
chr7 27119138 27119213 ENm010Block101 75 0
chr7 27119214 27119365 ENm010Block102 151 0
chr7 27119366 27119783 ENm010Block103 417 0
chr7 27119784 27119822 ENm010Block104 38 0
chr7 27119823 27119948 ENm010Block105 125 0
chr7 27119949 27119985 ENm010Block106 36 0
chr7 27119986 27120353 ENm010Block107 367 0
chr7 27120354 27120430 ENm010Block108 76 0
chr7 27120431 27120734 ENm010Block109 303 0
chr7 27120735 27120784 ENm010Block110 49 0
chr7 27120785 27121113 ENm010Block111 328 0
chr7 27121114 27121886 ENm010Block112 772 0
chr7 27121887 27121912 ENm010Block113 25 0
chr7 27121950 27122139 ENm010Block114 189 37
chr7 27122140 27122368 ENm010Block115 228 0
chr7 27122369 27122596 ENm010Block116 227 0
chr7 27123470 27123811 ENm010Block117 341 873
chr7 27123812 27124306 ENm010Block118 494 0
chr7 27124307 27125180 ENm010Block119 873 0
chr7 27126966 27127320 ENm010Block120 354 1785
chr7 27127612 27127725 ENm010Block121 113 291
chr7 27127726 27128410 ENm010Block122 684 0
chr7 27128411 27129055 ENm010Block123 644 0
chr7 27129056 27129182 ENm010Block124 126 0
chr7 27129183 27129550 ENm010Block125 367 0
chr7 27130006 27130043 ENm010Block126 37 455
chr7 27130044 27130880 ENm010Block127 836 0
chr7 27130881 27131260 ENm010Block128 379 0
chr7 27135440 27135630 ENm010Block129 190 4179
chr7 27136554 27136807 ENm010Block130 253 923
chr7 27136808 27136820 ENm010Block131 12 0
chr7 27136821 27136845 ENm010Block132 24 0
chr7 27136846 27136895 ENm010Block133 49 0
chr7 27136896 27137035 ENm010Block134 139 0
chr7 27137036 27137071 ENm010Block135 35 0
chr7 27137072 27137237 ENm010Block136 165 0
chr7 27137238 27137580 ENm010Block137 342 0
chr7 27137581 27137618 ENm010Block138 37 0
chr7 27137619 27137796 ENm010Block139 177 0

OUTPUT:

chr7 27104483 27105154
chr7 27106872 27110789
chr7 27111956 27112830
chr7 27114388 27125180
chr7 27126966 27131260
chr7 27135440 27137618
chr7 27137619 27137796

Hm,
with this code (use nawk or /usr/xpg4/bin/awk on Solaris):

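awk 'END { print _, __ }        # flush the last merged block
1 == NR || $NF > 1000 {         # first line or distance > 1000: start a new block
  if (c) print _, __            # print the previous merged block, if any
  _ = $1 FS $2                  # remember chromosome and start
  c = 1
  }
{ __ = $3 }' file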

I get this output:

chr7 27104483 27105154
chr7 27106872 27110789
chr7 27111956 27112830
chr7 27114388 27125180
chr7 27126966 27131260
chr7 27135440 27137796

Do you really want to treat the last line as in the example output?

It makes the code a bit ugly:

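awk 'END { print _, ___ RS ____ }   # merged block up to the previous end, then the last line by itself
1 == NR || $NF > 1000 {
  if (c) print _, __
  _ = $1 FS $2
  c = 1
  }
{ ___ = __                          # end of the previous line
  __ = $3                           # end of the current line
  ____ = $1 FS $2 FS $3 }           # current line, printed separately in END
' file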

You may also need to handle the case where the last line itself starts a new block (its distance is above the threshold);
if that matters, I would have to add more code.

And ..., don't blame me for choosing such variable names,
if you don't like them, just change them :slight_smile:

I'm a beginner awk user and not sure how to use nawk. Is there a way to make the first code work with plain awk?
Oh, and your output was correct. I didn't treat the last line correctly!

Yes. This is Old-AWK compatible:

awk '1 == NR || $NF > 1000 {
  if (c) print _, __
  _ = $1 FS $2
  c = 1
  }
{ __ = $3 }
END {
  print _, __
  } ' file

To run the original code under the New AWK just invoke the correct interpreter:

nawk 'END { print _, __ }
1 == NR || $NF > 1000 {
  if (c) print _, __
  _ = $1 FS $2
  c = 1
  }
{ __ = $3 }' file
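
Since the examples here use two different thresholds (4000 and 1000), a version that takes the threshold as a parameter may be handy. The following is only a sketch of the stated rule (a distance <= max merges) with descriptive variable names; merge_blocks, max, block_start and block_end are names chosen here for illustration, and -v assumes nawk, gawk, mawk or a POSIX awk rather than the old Solaris awk.

```shell
# Hypothetical wrapper: merge consecutive blocks whose distance is <= MAX.
# Usage: merge_blocks MAX < input.txt
merge_blocks() {
  awk -v max="$1" '
    # The first line, or a distance larger than max, starts a new merged block.
    NR == 1 || ($NF + 0) > (max + 0) {
      if (NR > 1) print block_start, block_end   # flush the previous block
      block_start = $1 FS $2                     # chromosome and start
    }
    # Every line extends the current block to its own end value.
    { block_end = $3 }
    # Flush the final block, unless the input was empty.
    END { if (NR > 0) print block_start, block_end }
  '
}
```

For example, merge_blocks 4000 < file applies the rule from the first post, and merge_blocks 1000 < file the rule used for the larger sample.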

It WORKED! Thanks so much!! The way you had it was perfect; for some reason it just had problems when I cut and pasted it and used dos2unix. But in the end, it worked. Thank you!

It worked! I was having copy-and-paste problems.

If you're going to put it all on one line, try it like this:

 
awk '1 == NR || $NF > 1000 { if (c) print _, __; _ = $1 FS $2; c = 1 } { __ = $3 } END { print _, __ }' file

Thanks. I didn't realize that copying and pasting into a WordPad document, saving it as .txt, and then running dos2unix file.txt wouldn't work. So I tried pasting it into Notepad instead, ran dos2unix on it, and it worked. Lesson learned! :slight_smile:
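
It isn't clear from the above what exactly WordPad left in the file, but one way to inspect a pasted script is to count the characters that typically cause trouble. The sketch below is an assumption about what might be worth checking, not a diagnosis: it counts carriage returns (which dos2unix removes) and non-ASCII bytes such as curly quotes (which dos2unix leaves alone). The name check_paste is made up for illustration.

```shell
# Hypothetical helper: report characters that commonly break pasted scripts.
# Usage: check_paste script.txt
check_paste() {
  # Count carriage-return bytes (DOS line endings).
  cr=$(tr -cd '\r' < "$1" | wc -c)
  # Count bytes outside 7-bit ASCII (for example curly quotes).
  non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)
  # $(( )) strips any padding that wc may add around the number.
  echo "carriage returns: $((cr)), non-ASCII bytes: $((non_ascii))"
}
```

If both counts are zero after running dos2unix, the pasted text is plain ASCII with Unix line endings, and any remaining problem lies elsewhere.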