Combining two perl commands into one

I made a program that extracts quotes while retaining special inner quotes (in this case an 'x' followed by an apostrophe). The original program is far more complicated than this, but I wanted to make it simple to troubleshoot.

I want to take these two perl commands and have the first command's results piped into the second command's input, while running perl only once:

Input: ['Say, x'Hix'']

echo "['Say, x'Hix'']" | perl -lne 'push @a,/\[\N{U+0027}(.*?)(?<!x)\N{U+0027}/g; END{print "@a"};'

Output: Say, x'Hix'

Taking the output of the first command:

echo "Say, x'Hix'" | perl -pe 's/x\N{U+0027}/\N{U+0027}/g; END{print "@a"};'

Result (proper): Say, 'Hi'

I realise that I could easily just run perl twice and pipe one into the other, but running perl twice seems inefficient, especially given that this command is run thousands of times:

echo "['Say, x'Hix'']" | perl -lne 'push @a,/\[\N{U+0027}(.*?)(?<!x)\N{U+0027}/g; END{print "@a"};' |  perl -pe 's/x\N{U+0027}/\N{U+0027}/g; END{print "@a"};'

I've tried combining both commands in the code below, but it doesn't seem to be taking the output of the first command as input for the second:

echo "['Say, x'Hix'']" | perl -lne 'push @a,/\[\N{U+0027}(.*?)(?<!x)\N{U+0027}/g; END{print "@a"};' -pe 's/x\N{U+0027}/\N{U+0027}/g; END{print "@a"};'

Result (not what I want):

['Say, 'Hi'']
Say, x'Hix'
Say, x'Hix'

Any ideas?

Hi

perl -pe 's/\w(?=\x27)|\[\x27|\x27\]//g'

I think that the data still has a more complex form than in the example?

--- Post updated at 14:06 ---

perl -pe 's/\W*(\w+,\s*).((\x27)\w+)\w.*/$1$2$3/'

Thank you for posting that code. I tried it, but it didn't have any effect. The code is more complex. The program takes the output of Google Translate and sifts through a bunch of garbage to find the translation. I almost think Google makes it complex on purpose to discourage ripping it off the net...

Here is that part of the code. I didn't want to post it initially, as it is quite overwhelming, but I think it's also unfair to leave you guys in the dark:

text='Il dit, "Ça va?" Elle dit, "Oui, bien sûr!"'
translation=$(wget -U "Mozilla/5.0" -q -O- "http://translate.googleapis.com/translate_a/single?client=gtx&sl=fr&tl=en&dt=t&q=$text");
echo
echo $translation
echo
echo $translation | perl -lne 'push @a,/(?<!\,\[\[)\[\"(.*?)(?<!\\)\"/g;END{print "@a"}'\
                    | perl -CS -pwe ' s/\N{U+005C}\N{U+0022}/\N{U+0022}/g;'

005C = backslash
0022 = quote

The result is: He says, "Are you okay?" She says, "Yes, of course!"

...

If anyone is really bored and wants to see the full code, it is here (free to use, edit, and distribute... as always :)):

A simple Linux shell script for translating an .srt file into another language and merging both languages into an .ass file . GitHub

The program allows a person to take an .srt file, translate it into another language, and merge both languages into an .ass file.

So it can be understood a little better, here is a picture of my computer playing a video with a subtitle produced by the program:

This project allows me to learn French by watching movies with both French and English subtitles alongside each other. Quite a worthwhile venture for a language learner like myself.


Or maybe just do it through an existing tool?
GitHub - soimort/translate-shell: Command-line translator using Google Translate, Bing Translator, Yandex.Translate, etc.

trans -s fr -t en -b "$text"
He says, "Are you okay?" She says, "Yes, of course!"

--- Post updated at 17:12 ---

      if [ "$task" == "translating" ]; then
         translation=$(trans -s en -t fr -b "$text")
      else
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
  [  0%] Dialogue: 0,0:04:23.80,0:04:26.55,en,,0000,0000,0000,,Good morning, James.
06:52:01 Dialogue: 0,0:04:23.80,0:04:26.55,fr,,0000,0000,0000,,Bonjour James.
  [  0%] Dialogue: 0,0:04:26.55,0:04:27.85,en,,0000,0000,0000,,How are you feeling?
06:51:41 Dialogue: 0,0:04:26.55,0:04:27.85,fr,,0000,0000,0000,,Comment allez-vous ?
  [  0%] Dialogue: 0,0:04:27.85,0:04:29.10,en,,0000,0000,0000,,Wait. What?
06:51:21 Dialogue: 0,0:04:27.85,0:04:29.10,fr,,0000,0000,0000,,Attendez. Quelle ?
  [  0%] Dialogue: 0,0:04:29.10,0:04:32.31,en,,0000,0000,0000,,It's perfectly normal to feel confused.
06:51:01 Dialogue: 0,0:04:29.10,0:04:32.31,fr,,0000,0000,0000,,Il est parfaitement normal de se sentir confus.
  [  0%] Dialogue: 0,0:04:32.31,0:04:35.90,en,,0000,0000,0000,,You just spent 120 years in suspended animation.
06:50:41 Dialogue: 0,0:04:32.31,0:04:35.90,fr,,0000,0000,0000,,Vous venez de passer 120 ans en animation suspendue.
  [  0%] Dialogue: 0,0:04:35.90,0:04:37.52,en,,0000,0000,0000,,What?
06:50:21 Dialogue: 0,0:04:35.90,0:04:37.52,fr,,0000,0000,0000,,Quelle ?
  [  0%] Dialogue: 0,0:04:37.52,0:04:39.32,en,,0000,0000,0000,,It's okay, James.
06:50:01 Dialogue: 0,0:04:37.52,0:04:39.32,fr,,0000,0000,0000,,Ça va, James.
  [  0%] Dialogue: 0,0:04:39.32,0:04:41.86,en,,0000,0000,0000,,-It's Jim. -Jim.
06:49:41 Dialogue: 0,0:04:39.32,0:04:41.86,fr,,0000,0000,0000,,-C'est Jim. -Jim.

But I must admit, this did not increase the speed. You would have to thoroughly rewrite the script: send all of the dialogue for translation at once and then parse the result alongside the original. Only then, I think, could the program run in real time rather than for an entire night.


It's a great program, but I prefer my program to be independent. Most people don't want to have to install a script plus another program. Many are shy of scripts to begin with. My program does it all in one easy script and doesn't require installation. In the end, they both produce the same translation (as far as I can tell, and with non-Asian languages), though mine does it in one line and trans takes an entire installation.

But the biggest thing is that I like to know what's going on. trans is far too advanced for me to follow. My simple script is perfect for me and other learners to build on.

That said, if that part of my code should ever fail due to Google changing its algorithm (which could easily happen), trans would be an awesome drop-in replacement until I patched my own. I'd have no qualms using it in such a case. It's a great idea. Thank you!


On Fedora it is installed like this:

dnf install translate-shell

And thank you
I liked the program, I'll try to work with it. I'm just learning English :slight_smile:


You are the first to run the script. Thank you! :o

Good point. And thank you so much for looking at my script.

Here is the issue with sending Google big chunks to translate: My first program did just that; it would send out 50 lines of text at once. It was very, very fast; a movie of 1000 lines could be translated in under 10 minutes. The problem came with sorting out what text came from what line when it had to be reinserted.

For example:

Pseudo .srt movie file:

1.
How are you today?
2.
I am fine.
3.
Great!
4.
Lets go bird-watching.

To send the text to Google, you must give it all in one line, so you need to insert some kind of marker showing where each line began; that way the lines can be put back in the proper order. Here is how I'd send it:

How are you today?\nI am fine\nGreat!\nLets go bird-watching.

The problem is that Google would turn the '\n' into something mysterious and ruin it. It might work well for 10 lines or 50, but it would never do all 1000 lines. I then tried:

How are you today? \n. I am fine \n. Great! \n. Lets go bird-watching.

The text itself would be fine, but the ' \n. ' would be dismantled just as it was before. I tried dozens and dozens of combinations: ###, ~~~, ~~~~~~, @@@, (TEXT_BREAK) , ... Every single time Google would do something like this:

How are you today? ### I am fine### Great!# ##Lets go bird-watching.

Note the break in the third '#' and the strange inserted spaces? I think Google knows what's going on and purposely wrecks the pattern so people can't do batch translations easily. I've looked around on Reddit and other places and apparently there is no way to tell Google not to touch a certain bit of text (except with a special tag which can only be used when translating websites).

So frustrating.

I have movies that are 1000-2000+ lines of text and take 5-12 hours to translate. I've been leaving it on overnight for the past week, and in the morning it's done. I've never had a problem.

Help yourself. If you have any suggestions, please let me know, here or on GitHub.

Parles-tu français ?


Hello
A good idea. I also started a project of my own: nezabudka / srt_to_ass . GitLab
I tried to work with your script, but since I have slow Internet, it could not translate the file even running all night. I tried sending the requests in a non-blocking loop in the background, but after about 75% of the file had been translated in 5 minutes, Google banned me. :slight_smile:
I had to look for workarounds, and this is what I did: I translate in portions of 50 renumbered lines. The script runs for 1 to 2 minutes. I have just made a release; I still need to write explanations and test it. If there is interest, I will write more a little later.


Google only bans for an hour or two. You'll be up and running again quickly. :wink:

Nice script! You are much better at this than I am. Much of the code is new to me, so it might take a while for me to understand it.

It runs fine for me with file.srt or if I use any .srt file that is English with no strange characters, but if I use a source file that has French in it, it doesn't work:

parameters selected: en[source] ru[target] ru[top screen]
seq: missing operand
Try 'seq --help' for more information.
file .srt is Ok...
file .text is Ok...
Translating...

It will continue to translate but the translation will end up scrambled. This would be so great to get working. I put in a few hours today trying to fix it with no luck.

Hi
I added a description to the repository. Your 'seq' is complaining about either a wrong format or a missing parameter. Try this in the terminal first:

seq -f'(%.f)' 5

If that works, then add the following at line 60 of the script:

echo ${lastn//[!0-9]}
exit

It should output the last record number from the .srt file; for the original file it is 1237.
I tried running it on Debian and also got an error at the first breakpoint.
On Fedora 30:

parameters selected: en[source] ru[target] ru[top screen]
file .srt is Ok...
file .text is Ok...
Translating...
1250/1237
file .trans is Ok...
all files are moved to the fil-ass.bd7 directory
done!

I've gone off to debug.
I apologize for the late reply. Notifications are not reaching me; I need to check my spam filter.

--- Post updated at 23:26 ---

I pushed the output files to the 'develop' branch.

--- Post updated at 23:55 ---

I found out why it didn't work on Debian: 'mawk' is installed by default there. So I did this:

sudo apt install gawk
sudo apt install translate-shell
./to.sh file.srt
parameters selected: en[source] ru[target] ru[top screen]
file .srt is Ok...
file .text is Ok...
Translating...
350/1237[WARNING] Connection timed out. Retrying IPv4 connection.
650/1237[WARNING] Connection timed out. Retrying IPv4 connection.
700/1237[WARNING] Connection timed out. Retrying IPv4 connection.
750/1237[WARNING] Connection timed out. Retrying IPv4 connection.
950/1237[WARNING] Connection timed out. Retrying IPv4 connection.
1200/1237[WARNING] Connection timed out. Retrying IPv4 connection.
1250/1237
file .trans is Ok...
all files are moved to the fil-ass.heA directory
done!

There were small network problems, but everything worked correctly without changes to the script.


I'm going to give it a try tomorrow when I can dedicate more time to it. Thank you so much! :slight_smile:

Updated:

user@localhost:~$ seq -f'(%.f)' 5
(1)
(2)
(3)
(4)
(5)

With:

echo ${lastn//[!0-9]}
exit

The response was 1237.

Your English-to-Russian file seems to work fine. Whenever I've tried to use an English-to-French file, there was nothing inside mysub.trans. mysub.text would look like this:

 1
 I'd been living with Marianne
 almost 3 years
 
 2
 in an apartment belonging to her.
 
 3
 Things were fine between us,
 till she said one day:
 
 4
 Got a second?
 
 5
   Yes, why?

Note the space before the text and the absence of brackets. The mysub.srt file was UTF-8, like yours.

I've attached the files I used to this post. And I'll continue to keep testing as you update it.


Hi

file Mother.srt
Mother.srt: UTF-8 Unicode text, with CRLF line terminators
sed -i 's/\r//g'  Mother.srt
file Mother.srt
Mother.srt: UTF-8 Unicode text

./to.sh Mother.srt
parameters selected: fr[source] ru[target] ru[top screen]
file .srt is Ok...
file .text is Ok...
Translating...
100/100
file .trans is Ok...
all files are moved to the Mot-ass.zyH directory
done!
cd Mot-ass.zyH
tail -6 Mother.ass
Dialogue: 0,0:14:13.14,0:14:14.68,ru,,0000,0000,0000,,Я большой поклонник.
Dialogue: 0,0:14:13.14,0:14:14.68,fr,,0000,0000,0000,,Je suis un grand admirateur.
Dialogue: 0,0:14:15.31,0:14:17.52,ru,,0000,0000,0000,,- Вы читали это? - Много раз.
Dialogue: 0,0:14:15.31,0:14:17.52,fr,,0000,0000,0000,,- Vous l'avez lu ? - Plein de fois.
Dialogue: 0,0:14:17.69,0:14:19.69,ru,,0000,0000,0000,,Твои слова изменили мою жизнь.
Dialogue: 0,0:14:17.69,0:14:19.69,fr,,0000,0000,0000,,Vos mots ont changé ma vie.

This did help a lot. There still seem to be several missing translations, though. I've attached the files for you.

Example:

Dialogue: 0,0:18:22.35,0:18:24.83,en,,0000,0000,0000,,one day he left his car unlocked.
Dialogue: 0,0:18:22.35,0:18:24.83,fr,,0000,0000,0000,,un jour, il a laissé sa voiture déverrouillée.
Dialogue: 0,0:18:26.67,0:18:27.75,en,,0000,0000,0000,,I even...
Dialogue: 0,0:18:26.67,0:18:27.75,fr,,0000,0000,0000,,
Dialogue: 0,0:18:28.35,0:18:29.71,en,,0000,0000,0000,,Wait, what am I saying?
Dialogue: 0,0:18:28.35,0:18:29.71,fr,,0000,0000,0000,,
Dialogue: 0,0:18:29.87,0:18:31.39,en,,0000,0000,0000,,It was before. I was younger.
Dialogue: 0,0:18:29.87,0:18:31.39,fr,,0000,0000,0000,,C'était avant. J'étais plus jeune.
Dialogue: 0,0:18:33.71,0:18:35.87,en,,0000,0000,0000,,Anyway, I opened the door...
Dialogue: 0,0:18:33.71,0:18:35.87,fr,,0000,0000,0000,,
Dialogue: 0,0:18:41.35,0:18:43.47,en,,0000,0000,0000,,and took a picture of myself inside,
Dialogue: 0,0:18:41.35,0:18:43.47,fr,,0000,0000,0000,,et a pris une photo de moi à l'intérieur,

This might be due to the way Google sometimes sends back the information that is between brackets. Often Google will add a space here and there as such: (1 ) instead of (1) and then the program won't recognise it and it'll be skipped. I was able to fix this issue in my new program by accounting for spaces between the brackets.

Google will also sometimes flip the brackets, like this:

Source: (1) Hello
Translation: Hello (1)

The program then attempts to grab everything after the (1), but because the marker has been flipped to the end, there is nothing there to grab. I found that using double brackets '((1))' seemed to stop the flipped numbers and deliver more consistent results than single brackets. The program below will do batch translations of entire .srt files in minutes. I've only tested it on a few files, but so far it works well for English to French and French to English, as well as Russian. The code was rushed, and I'll need to do some optimization:

Subtitle .srt to .ass translator . GitHub

The files from your code are here:

Hi @bedtime
thank you. Indeed, translating into French is much more complicated.
I tagged release v2.0.0 with support for multi-line subtitles.
Maybe this will change the situation? In addition, in the .txt file for the translation,
I put a semicolon (only in the develop branch; you can use a period instead, just change it on line 112).
It produced only 2 errors, because Google translated the numbers in the 6th and 9th lines.

In the .ass file each subtitle will also be on one line, with its parts separated by \N.
In the .txt files, lines without numbers belong to the numbered subtitle above them.
I want to remind you that if the program stops, it is better to exit it,
edit the file according to the prompts, and then restart the program;
it will continue from where it stopped. The same test is then run again,
and as long as uncorrected errors remain, the program stays at that same stage.
Give it a try, it is very convenient.
I can't try the period "." because Google has banned me again :slight_smile: and even 'torsocks' does not help.


I wanted to try to write everything in an AWK script, but it's not so easy to encode utf-8 text in URLs through AWK

awk -F '",["n]' '
BEGIN {
  text = "I%20do%20not%20know%20how%20to%20encode%20a%20text%0Ain%20an%20awk%20for%20a%20url%20request"
  HttpService = "/inet/tcp/0/translate.googleapis.com/80"
  print "GET http://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=ru&dt=t&q=" text |& HttpService
  while ((HttpService |& getline) > 0)
     if($0 ~ /^[,[]+"/ ) {sub(/^[,[]+"/, "", $1); print $1, "==", $2} 
  close(HttpService)
}'

If Google bans me when I'm using method #1 (below), I simply use method #2, and vice-versa.

Method #1 translation=$(wget -U "Mozilla/5.0" -q -O- "http://translate.googleapis.com/translate_a/single?client=gtx&sl=$source_lang&tl=$target_lang&dt=t&q=$line" | perl -lne 'push @a,/(?<!\,\[\[)\[\"(.*?)(?<!\\)\"/g;END{print "@a"}' | perl -CS -pwe 's/\N{U+005C}\N{U+0022}\s?/\N{U+0022}/g;')

Method #2 translation=$(trans -s en -t fr -b "$line")

Perhaps method #2 uses an API and Google sees it as a separate thing? Anyway, it's always worked for me. I must caution you that you cannot safely translate in bulk more often than once every 12 seconds. You may get away with it for several rounds, but you will inevitably be banned. You have this set to 2 seconds, and Google bans me when I follow your default.
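A minimal sketch of spacing the requests out, assuming the 12-second interval mentioned above. translate_chunk is a hypothetical stand-in that only echoes its argument (the real call would be method #1 or #2), and DELAY and translate_all are names invented for this example:

```shell
# Hypothetical stand-in for the real request (wget or trans); it only echoes.
translate_chunk() { printf 'translated: %s\n' "$1"; }

# Pause between requests; 12 seconds follows the interval suggested above.
delay=${DELAY:-12}

translate_all() {
  first=1
  for chunk in "$@"; do
    # Sleep before every request except the first one.
    [ "$first" -eq 1 ] || sleep "$delay"
    first=0
    translate_chunk "$chunk"
  done
}
```

Calling translate_all "chunk one" "chunk two" would then issue the two requests 12 seconds apart; setting DELAY before the script starts changes the pause.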

I wanted to try to write everything in an AWK script, but it's not so easy to encode utf-8 text in URLs through AWK

For something like this, I would highly recommend Perl; it's much more powerful and was made to handle all kinds of text processing that awk and sed struggle with. Standard awk and sed regexes do not even support look-behind.

I've been at the computer for hours, so I'm taking a break. I'll test your fixes when I'm refreshed (tonight or tomorrow).

I've attached the results. None of the subtitles seemed to translate properly. At least with me. I made sure to get rid of the '\r' on each .srt file with your code.

All the above files worked with my new program with about 99.8% accuracy (998/1000 were fine). I'm going to guess that awk is messing things up in your program, but I'm not sure. French, like you said, is harder to translate, but it is essential to be able to use French as it comes up so often.

I will continue to test it as you update. If you need any more information, please just ask.

***EDIT***

So frustrating. In my program, Google Translate really cannot be depended upon to give you back information exactly as you sent it:

 ((1)) I'd been living with Marianne almost 3 years ((2)) in an apartment belonging to her. ((3)) Things were fine between us, till she said one day: ((4)) Got a second? ((5)) Yes, why? ((6)) To tell you something, but it may be a bad time. ((7)) Go on, say it. What's wrong? ((8)) I'm pregnant. ((9)) You're pregnant... ((10)) That's great. ((11)) Is it a problem for you? ((12)) Abel... ((13)) I'm pregnant, but not from you. ((14)) What? ((15)) I'm not pregnant from you. ((16)) So... from who then? ((17)) Paul. ((18)) You had a thing with Paul? ((19)) Yes. ((20)) Since when? ((21)) A little over a year. ((22)) But Paul is someone... ((23)) so... ((24)) So what? ((25)) So different, so... ((26)) Different from who? ((27)) I don't know... you know Paul's family. ((28)) They know. ((29)) If you've seen his family, what does it mean? ((30)) It means we may have to get married. ((31)) It means a lot to his parents. ((32)) Yes, they're a bit conventional, but... ((33)) You haven't chosen a date? ((34)) It's been saved. ((35)) Paul wants you to come. ((36)) Unless it bothers you. It's up to you. ((37)) When is it? ((38)) The 26th. ((39)) Of this month? ((40)) We prefer to act quickly, because of... ((41)) You want me to pack my stuff? ((42)) You still have time, but before the 26th is best. ((43)) What's today? ((44)) The 16th. ((45)) I have to go. I'll be late. ((46)) Maybe it wasn't the right time to tell you. ((47)) I'm glad you're taking it well. ((48)) He thought you'd be angry. ((49)) I kept reassuring him. ((50)) Can he call you today? ((51))

Yields:

((1)) C'était comme si nous vivions ensemble. ((2)) Je me souviens, ((3)) un jour, il a laissé sa voiture déverrouillée. ((4)) J'ai même ... ((5)) Attends, qu'est-ce que je dis? ((6)) C'était avant. J'étais plus jeune. ((7)) Quoi qu'il en soit, j'ai ouvert la porte ... ((8)) et j'ai pris une photo de moi à l'intérieur, ((9)) comme si j'étais à côté de lui et qu'il conduisait. ((10)) Je m'imaginais aller à Venise. ((11)) Lors de notre lune de miel. ((12)) Ce jour-là, j'ai pensé: ((13)) 'Il m'a vu à coup sûr. ((14)) Nous sommes un couple. Nous sommes amoureux. Nous partons ensemble. ((15)) Le Il s'est enfui. Il ne m'avait pas vu. ((16)) Il était en retard au travail. ((17)) Ce jour-là, ((18)) sans argent pour un ticket de métro, ((19)) Je suis rentré à pied de la banlieue. ((20)) Je détestais Marianne de l'avoir pris de moi. ((21)) quand elle a choisi mon frère Paul. '((23)) Mais quand Marianne a quitté Abel, ((24)) il a bougé et je l'ai perdu de vue. ((25)) Pendant les cinq années suivantes, ((26)) je n'ai pas vu Abel. ( (27)) Bien sûr, j'ai eu quelques aventures, ((28)) mais aucune mérite d'être racontée. ((29)) Maintenant que j'y pense, tout ce temps, ((30)) je n'ai bien fait qu'une chose. ((31)) J'ai grandi. ((32)) Puis un jour, par hasard, j'ai vu Abel. ((33)) Je suis Eve, la sœur de Paul. ((34)) Bonjour, Eve. ((35 )) Vous avez grandi. ((36)) Merci. ((37)) - Ça va? - Je vais bien. ((38)) Et vous? N'avez-vous pas fait du théâtre? ((39)) Oui, mais je me suis arrêté au théâtre. ((40)) Je ' m dans l'immobilier maintenant. ((41)) Tu gèle. ((42)) Vos dents claquent. ((43)) Êtes-vous occupé maintenant? ((44)) Je suis en retard. Je dois y aller. ((45)) Je vais déposer ton écharpe. ((46)) Non, c'est un cadeau, d'accord? ((47)) Bonne journée, amoureux. ((48)) À bientôt ... ((49)) Au revoir, Eve. ((50)) Pas même un amant, ((51))

Where is ((22))? Gone. Google Translate ate it. If I decrease the number of chunks sent to Google then ((22)) comes back, but that is not a dependable solution. If I use single brackets around the numbers, several similar problems arise with the text.

brain explodes
