Extract large list of substrings

dcfargo · September 8, 2008, 1:14pm

I have a very long string (millions of characters).

I also have a file, thousands of rows long, with a start location and a length on each row:

Start Length
5 10
16 21
44 100
215 37
...

I'd like to extract the substring that corresponds to the start and length from each row of the list.

I tried using one large awk command:

awk '{print substr($1,5,10), "\n", substr($1,16,21), "\n", substr($1,44,100), "\n", substr($1,215,37)...}' infile > outfile &

But it seems to hang, probably because the command line is too long.

Can you help me find a way to get the substrings out as rows?

cfajohnson · September 8, 2008, 5:34pm

Where do you have it? Is it in a file? In a variable?

Are there any newlines in the string?

I have no problem extracting portions of a multimegabyte string using bash's parameter expansion:

## Assuming the string is in 'infile'
string=$( < infile )
while read start length
do
  printf "%s\n" "${string:$start:$length}"
done < /path/to/file/with/startpoints_and_lengths
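
To see how the start and length are interpreted, here is a small sketch of the same loop on made-up data. The function name extract_substrings, the sample string, and the temporary file are assumptions for illustration only. Bash counts the offset in ${string:offset:length} from 0, and a length that runs past the end of the string simply returns what is left. The check for numeric fields is an addition that skips a header row such as "Start Length".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): print one substring per row of a
# "start length" file, using bash's zero-based ${var:offset:length} expansion.
extract_substrings() {
  local string=$1 ranges=$2 start length
  while read -r start length; do
    # Skip blank lines and a header row such as "Start Length".
    [[ $start =~ ^[0-9]+$ && $length =~ ^[0-9]+$ ]] || continue
    printf '%s\n' "${string:start:length}"
  done < "$ranges"
}

# Tiny demonstration with made-up data.
demo_string=abcdefghijklmnopqrstuvwxyz
demo_ranges=$(mktemp)
printf '%s\n' 'Start Length' '0 3' '5 4' '23 10' > "$demo_ranges"
extract_substrings "$demo_string" "$demo_ranges"
# prints:
#   abc
#   fghi
#   xyz
rm -f "$demo_ranges"
```

If the starts in your file count from 1 rather than 0, subtract 1 from start inside the expansion.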

dcfargo · September 8, 2008, 8:33pm

Thanks. Let me see if I understand.

The string has no spaces or line breaks; it's just millions of characters one after the other. We'll call that file 'filestring'.

The list of numbers is in 'filenumbers' in the same directory.

string=$( <filestring )
while read start length
do
  printf "%s\n" "${string:$start:$length}"
done < filenumbers > outfile

Is that the correct command syntax?

Thanks so much.

cfajohnson · September 8, 2008, 8:42pm

dcfargo:

Thanks. Let me see if I understand.

The string has no spaces or line breaks; it's just millions of characters one after the other. We'll call that file 'filestring'.

The list of numbers is in 'filenumbers' in the same directory.

string=$( <filestring )
while read start length
do
  printf "%s\n" "${string:$start:$length}"
done < filenumbers > outfile

Is that the correct command syntax?

That is correct if the string is in a file called filestring.

If it is already in a variable, use that variable instead of 'string'.

dcfargo · September 9, 2008, 9:51am

I don't know what I'm doing wrong, but that syntax appears to be writing the entire string for each line in filenumbers instead of extracting the substring(s).

cfajohnson · September 9, 2008, 12:44pm

No one else knows what you are doing wrong, either, because you didn't post the code you executed.

Nor did you make it clear whether you already have the string in a variable or whether it has to be read from a file.

dcfargo · September 9, 2008, 1:12pm

Sorry. You know what I did? I had the wrong input file. Your code works great. I really appreciate all your help. My input file had the start and the length of the string instead of the start and the length of the substring of interest.

Thanks again so much.
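
A mix-up like this one can be caught before the extraction runs by checking each row against the length of the string. The following is only a sketch under stated assumptions: the function name check_ranges is made up, the rows are taken to hold a zero-based start and a length, and non-numeric rows (such as a header) are ignored.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check (not from the thread): report rows whose
# start + length runs past the end of the string.
check_ranges() {
  local string_length=$1 ranges=$2 start length line=0 bad=0
  while read -r start length; do
    line=$((line + 1))
    # Ignore blank lines and a header row such as "Start Length".
    [[ $start =~ ^[0-9]+$ && $length =~ ^[0-9]+$ ]] || continue
    # 10# forces base 10 so values with leading zeros are not read as octal.
    if (( 10#$start + 10#$length > string_length )); then
      printf 'row %s: start %s + length %s exceeds string length %s\n' \
        "$line" "$start" "$length" "$string_length" >&2
      bad=$((bad + 1))
    fi
  done < "$ranges"
  # Succeed only if every row fits inside the string.
  (( bad == 0 ))
}

# Possible usage with the names from this thread:
#   string=$( < filestring )
#   check_ranges "${#string}" filenumbers || echo "check filenumbers" >&2
```

Bash itself does not complain when the length overshoots; it returns everything from the start to the end of the string, which is why the output can look like the whole string repeated.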

summer_cherry · September 9, 2008, 10:25pm

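nawk '
# First file (lengthfile): remember each start and length in input order.
NR == FNR { start[NR] = $1; len[NR] = $2; n = NR; next }
# Second file (stringfile): print one substring per remembered row.
# Note that awk substr() positions count from 1.
{
	for (i = 1; i <= n; i++)
		print substr($0, start[i], len[i])
}' lengthfile stringfile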
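
One difference from the bash loop is worth checking on small data: awk's substr() counts positions from 1, while bash's ${string:offset:length} counts from 0, so the same numbers select text shifted by one character. The sketch below assumes the starts in the file are zero-based (the thread does not say which convention the file uses) and adds 1 inside awk so the output matches the bash version. The function name awk_extract and the sample files are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: awk counterpart of the bash loop.
# Arguments: the "start length" file first, then the file holding the string.
awk_extract() {
  awk '
    NR == FNR {                      # first file: rows of "start length"
      if ($1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/) { n++; start[n] = $1; len[n] = $2 }
      next
    }
    {                                # second file: the long string on one line
      for (i = 1; i <= n; i++)
        print substr($0, start[i] + 1, len[i])   # +1: zero-based start to 1-based
    }
  ' "$1" "$2"
}

# Tiny demonstration with made-up data.
tmpdir=$(mktemp -d)
printf '%s\n' 'Start Length' '0 3' '5 4' > "$tmpdir/ranges"
printf '%s\n' 'abcdefghijklmnopqrstuvwxyz' > "$tmpdir/string"
awk_extract "$tmpdir/ranges" "$tmpdir/string"
# prints:
#   abc
#   fghi
rm -rf "$tmpdir"
```

If the starts in the file are already 1-based, drop the + 1.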