Replacing lines between two files with awk

rk4k · April 18, 2009, 10:12am

Hello Masters,

I have two subtitle files in different languages, like below:

First file :

1
00:00:41,136 --> 00:00:43,900
[<i># Underdog theme</i>]

2
00:00:55,383 --> 00:00:58,477
<i> Ladies and gentlemen,</i>
<i>this is Simon Barsinister,</i>

3
00:00:58,553 --> 00:01:00,521
<i>the wickedest man in the world.</i>

4
00:01:00,588 --> 00:01:02,021
<i>He was evil and crazy.</i>

5
00:01:02,090 --> 00:01:06,026
<i>Simon and his wacky henchman, Cad,</i>
<i>schemed to rule the universe.</i>

6
00:01:06,094 --> 00:01:08,289
<i>But each time they were foiled by me,</i>

Second file :

1
00:00:35,060 --> 00:00:37,708
*** UNDERDOG ***

2
00:00:48,714 --> 00:00:51,668
Dame in gospodje,
to je Simon Barsinister,

3
00:00:51,745 --> 00:00:53,625
najzlobnejši človek
na svetu.

4
00:00:53,701 --> 00:00:55,084
Bil je zloben in blazen.

5
00:00:55,160 --> 00:00:58,918
Simon in njegov sluga Cad
sta spletkarila proti univerzi.

6
00:00:58,994 --> 00:01:01,106
Ampak vedno sem jima
načrte prekrižal jaz,

I want to overwrite the lines that contain the time in the first file with the appropriate lines from the second file, so that the final subtitle file looks like this:

1
00:00:35,060 --> 00:00:37,708
[<i># Underdog theme</i>]

2
00:00:48,714 --> 00:00:51,668
<i> Ladies and gentlemen,</i>
<i>this is Simon Barsinister,</i>

3
00:00:51,745 --> 00:00:53,625
<i>the wickedest man in the world.</i>

4
00:00:53,701 --> 00:00:55,084
<i>He was evil and crazy.</i>

5
00:00:55,160 --> 00:00:58,918
<i>Simon and his wacky henchman, Cad,</i>
<i>schemed to rule the universe.</i>

6
00:00:58,994 --> 00:01:01,106
<i>But each time they were foiled by me,</i>

How can I do this with awk/gawk?

TIA.

ghostdog74 · April 18, 2009, 11:42am

Perl alternative

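my %f2;
my ($d, $e) = (0, 0);
# Pass 1: remember every time line of file2, in order.
open(F2,"<","file2") or die "Cannot open file2: $!\n";
while ( <F2> ){chomp; $f2{++$d}=$_ if /-->/;}
close(F2);
# Pass 2: print file1, swapping each time line for the matching one from file2.
open(F1,"<","file1") or die "Cannot open file1: $!\n";
while ( <F1> ){ chomp;  print /-->/ ? $f2{++$e}."\n" : $_."\n"; }
close(F1);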

rubin · April 18, 2009, 4:18pm

Another way with awk:

awk 'FNR==NR{ if( /^[0-9][0-9]:/ ) a[++c]=$0; next }
     /[0-9]:/ && /-->/{print a[++n]; next}1' file2 file1

colemar · April 18, 2009, 5:11pm

The code proposed by rubin should work almost always, but it fails when a dialog line happens to look like a time line.
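To illustrate that failure mode, here is a small sketch with made-up sample files. The dialog line is contrived (an assumption for the demo, not taken from the OP's subtitles) so that it contains both a digit followed by a colon and an arrow.

```shell
# Made-up sample files; the dialog line in file1 is contrived to trip the pattern.
file1=$(mktemp); file2=$(mktemp)
printf '1\n00:00:41,136 --> 00:00:43,900\nMeet at 5: go --> left\n' > "$file1"
printf '1\n00:00:35,060 --> 00:00:37,708\nsome text\n' > "$file2"

# rubin's one-liner, unchanged.
out=$(awk 'FNR==NR{ if( /^[0-9][0-9]:/ ) a[++c]=$0; next }
     /[0-9]:/ && /-->/{print a[++n]; next}1' "$file2" "$file1")

# The dialog line matched both patterns, so it was replaced by a[2], which is empty.
printf '%s\n' "$out"
rm -f "$file1" "$file2"
```

The time line is replaced as intended, but the dialog line is silently dropped because it is also treated as a time line.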

I think the right way to cope with the problem is to realize that the input files are in fact composed of multiline records, where the record separator is one or more blank lines and the field separator is the newline.
In such a structure the time line is always the second field.

To make GNU awk handle multiline records, a few built-in variables have to be set, as explained here: Multiple Line - The GNU Awk User's Guide
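A quick way to see the record and field split in action, as a sketch on made-up sample data (the file contents below are assumptions for the demo):

```shell
# Two made-up records, separated by two blank lines on purpose.
sample=$(mktemp)
printf '1\n00:00:41,136 --> 00:00:43,900\nfirst line\n\n\n2\n00:00:55,383 --> 00:00:58,477\nline a\nline b\n' > "$sample"

# RS="" switches awk to paragraph mode: runs of blank lines separate records.
# FS="\n" then makes every line of a record one field, so $2 is the time line.
fields=$(awk 'BEGIN { RS = ""; FS = "\n" }
              { print NR ": " NF " fields, time = " $2 }' "$sample")
printf '%s\n' "$fields"
rm -f "$sample"
```

Each group is reported as one record, with the time line in field 2 no matter how many dialog lines follow it.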

This is my proposal for a program p.awk

BEGIN { RS = "" ; FS = "\n" ; ORS = "\n\n" ; OFS = "\n" }
FNR == 1 { fn++ } # track the file number (order on the command line)
fn == 1 {     # first file given (second.txt, the timing source)
  a[FNR] = $2 # save its time line (field 2)
  }
fn == 2 {     # second file given (first.txt, the text to keep)
  $2 = a[FNR] # overwrite field 2 with the saved time line of the same record
  print $0
  }

The command line is as follows:

awk -f p.awk second.txt first.txt > result.txt

rubin · April 18, 2009, 7:46pm

Good point, it might happen. Based on the OP's actual files, though, one file seems to hold sentences in one language and the other their respective English translations, so I think the chances of having a double time line are small.

What happens if there are other lines between the first field and the time line (so that it is not always a fixed field)? Maybe this is not the case, but what if the records are not multiline?
The code can always be modified to fit a particular situation. Anyway, I think the OP has a few options to choose from :).

colemar · April 19, 2009, 9:27am

I can't understand your argument... but I believe the input files as given are some standard subtitle format whose name I can't remember.
So the input can be safely characterized by stating that:

there are groups of three or more lines, and the groups are delimited by at least one blank line
the second line of any group always represents the time of the dialog
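These two assumptions can be checked before merging. Below is a minimal sketch; the function name check_srt and the exact time pattern (HH:MM:SS,mmm followed by an arrow, inferred from the samples in this thread) are my own assumptions, not part of any standard tool.

```shell
# Hypothetical helper: succeeds only if every blank-line-delimited group has
# at least three lines and a time line in second position.
check_srt() {
  awk 'BEGIN { RS = ""; FS = "\n" }
       NF < 3 || $2 !~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> / {
         bad++; print "record " NR " looks malformed"
       }
       END { exit (bad > 0) }' "$1"
}

# Made-up sample files for the demo.
good=$(mktemp); bad=$(mktemp)
printf '1\n00:00:41,136 --> 00:00:43,900\ntext\n\n2\n00:00:55,383 --> 00:00:58,477\na\nb\n' > "$good"
printf '1\ntext without a time line\nmore text\n' > "$bad"

check_srt "$good" && echo "good file: ok"
check_srt "$bad"  || echo "bad file: rejected"
rm -f "$good" "$bad"
```

If the check fails for a real file, the fixed-field approach is not safe for it and a pattern-based match may be the better fit.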

rubin · April 19, 2009, 10:12am

If my argument wasn't understood, I'd better wait for the OP's response and let him say whether the time lines are duplicated somewhere in the records, and then modify the code accordingly.

rk4k · April 20, 2009, 12:07am

Hi folks,

Sorry for the late response.

I tried the ghostdog74 and colemar solutions; both work great.
The only problem is that the two subtitle files differ at the end. Up to that point the replacement works great.
I will try rubin's solution too, just give me some time.

It's been great here. I've learned a lot.

Thanks all.

summer_cherry · April 20, 2009, 3:36am

nawk 'BEGIN{RS="";OFS=FS="\n"}
{
  if(NR==FNR)
    _[$1]=$2
  else
  {
    $2=_[$1]
    print
    print ""
  }
}' file2 file1
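Since the OP reports that the two files differ towards the end, note that the keyed lookup above blanks the time line whenever file2 has no record with the same number. A minimal sketch of one way around that, assuming the original time line should simply be kept in that case (the sample files and the array name t are made up for the demo):

```shell
# Made-up samples: file1 has two records, file2 only has timing for record 1.
f1=$(mktemp); f2=$(mktemp)
printf '1\n00:00:41,136 --> 00:00:43,900\nEnglish one\n\n2\n00:00:55,383 --> 00:00:58,477\nEnglish two\n' > "$f1"
printf '1\n00:00:35,060 --> 00:00:37,708\nother language\n' > "$f2"

merged=$(awk 'BEGIN { RS = ""; OFS = FS = "\n" }
NR == FNR { t[$1] = $2; next }   # first file: remember time line per subtitle number
{
  if ($1 in t) $2 = t[$1]        # replace only when a counterpart exists
  print
  print ""
}' "$f2" "$f1")

printf '%s\n' "$merged"
rm -f "$f1" "$f2"
```

Record 1 gets the new timing, while record 2 keeps its original time line instead of ending up with an empty one.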