Edit a huge one-line file

We have a huge file that consists of just one really long line, about 500 MB. I want to:

  1. Count all the occurrences of a phrase
  2. Replace the phrase with another.

Trying to open it using vi has not helped as it complains that it is too large. Can any script help? Please advise.

Thank you,

To count:

perl -ne '@x=/phrase/g;print $#x+1 . "\n"' file

To replace:

perl -i -pe 's/phrase/replace/g' file

Can you show a sample of this file? Maybe it just uses a different separator than newline, in which case awk could process it via changing the value of RS and ORS...
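If the file does turn out to use some other single-character separator, the RS/ORS idea might look like the sketch below. The separator ";" is purely an assumption for illustration (the real format is unknown), the function names are made up, and "phrase" and "replace" are the placeholders from the earlier answer.

```shell
# Hypothetical: assumes the records in the file are separated by ";".

# count_by_record FILE
# gsub() returns the number of substitutions it made; replacing each match
# with itself ("&") turns it into a counter without changing the record.
count_by_record() {
    awk -v RS=';' '{ n += gsub(/phrase/, "&") } END { print n + 0 }' "$1"
}

# replace_by_record INFILE OUTFILE
# Reads one ";"-terminated record at a time and writes it back with the
# same separator, so only one record is held in memory at once.
replace_by_record() {
    awk -v RS=';' -v ORS=';' '{ gsub(/phrase/, "replace"); print }' "$1" > "$2"
}
```

Note that print appends ORS after every record, so if the input does not end with the separator, the output gains one extra ";" at the end. The phrase is used as a regular expression here, so any special characters in it would need escaping.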

I've not dealt with a file consisting of a single 500 MB line before, so I created one to play with:

$ cat input
a horse is a horse, of course, of course, and no one can talk to a horse of course. That is, of course, unless the horse is the famous Mr. Ed.

I then kept concatenating the file onto itself until it exceeded 500 MB.
Now we play:

[mute@geek ~]$ perl -ne '@x=/horse/g;print $#x+1 . "\n"' input
Killed
[mute@geek ~]$ grep -o horse input | wc -l
grep: input: Cannot allocate memory
0
[mute@geek ~]$ awk -F 'horse' '{print NF}' input
awk: cmd. line:1: fatal: grow_iop_buffer: iop->buf: can't allocate 536870914 bytes of memory (Cannot allocate memory)
[mute@geek ~]$ gawk -vRS="horse" 'END{print NR}' input
20331009

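Well, it seems there aren't many tools that aren't line-based. Traditional awk only honours a single-character record separator; gawk accepts a longer string. Note that NR counts records rather than separators, so the number of occurrences is NR minus 1 unless the file ends exactly with the phrase.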

How much physical memory do you have on that machine, neutronscott?

Well, 256 MB of RAM and 256 MB of swap. The OP has issues opening the file in vi, so I figured I'd run the tests on my low-end VPS.
As Corona pointed out, there may be a different record separator that could be used, so that a program doesn't attempt to load the entire line into memory.
More information is needed about the format of the file. I'm unaware of a standard Unix visual editor that would properly open such a file (though I'm very inexperienced in the subject; I can usually hack together solutions, but I don't have enough information here).

Well, at least the problem with "vi" I can explain:

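A file cannot safely be rewritten in the same place while it is still being read. This is the reason why, e.g.,

sed '<something>' infile > infile.new

works, while

sed '<something>' infile > infile

will not: the shell truncates infile for the redirection before sed has read anything, so the result is an empty file (the same is true for awk, etc.).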

Some versions of sed (GNU sed, for instance) work around this fundamental limitation with in-place editing ("-i"), which works on the same principle as interactive editors like vi: the work is done on a copy, and only upon saving/finishing is the original file replaced by that copy.
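Where sed has no "-i", the same copy-then-replace approach can be done by hand. A minimal sketch, assuming mktemp is installed; the function name edit_in_place is made up for illustration.

```shell
# edit_in_place SED_SCRIPT FILE
# Runs sed into a temporary file next to FILE and only replaces FILE if
# sed succeeded, so a failed run never leaves a truncated original.
edit_in_place() {
    tmp=$(mktemp "$2.XXXXXX") || return 1
    if sed "$1" "$2" > "$tmp"; then
        mv "$tmp" "$2"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example: edit_in_place 's/old_string/new_string/g' file. The temporary file is created in the same directory so that mv is a plain rename, which also means that directory needs enough free space for a second copy. The result keeps the permissions of the temporary file, not those of the original.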

vi typically uses /var/tmp by default, but can be configured to use other places too (to my knowledge, all versions of vi offer such an option via the .exrc file). If this filesystem does not have enough free space to hold the copy, an attempt to edit the file will fail even if there is enough memory to hold it.

Another limitation is the maximum line length: POSIX only requires text utilities to handle lines of up to LINE_MAX bytes, a constant defined in the system header file limits.h, and longer lines are not guaranteed to work.
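The value in effect on a given system can be queried without reading the header. A small sketch using the standard getconf utility (the variable name line_max is arbitrary):

```shell
# Ask the system for its LINE_MAX value (in bytes).
line_max=$(getconf LINE_MAX)
echo "LINE_MAX on this system: $line_max bytes"
```

POSIX sets the minimum for this value at 2048, so a single 500 MB line is far beyond what a utility is obliged to cope with, even though many implementations handle much longer lines in practice.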

I hope this helps.


Do you have the code for the program which wrote this file? It may be easier to write a program in the same programming language to read the file back.

Thanks for the tip. Could you please tell me the command or setting to do this?
I have faced this problem a couple of times but wasn't aware of any option like this.

perl -pi -e 's/old_string/new_string/g' file_pattern

The above command worked.

Thank you all

The file ".exrc" is usually located in your home directory and is read automatically whenever you start an instance of "vi". It holds all sorts of settings which you could also set interactively. Inside vi, enter command mode (press <ESC> several times), press ":" to get to the command line and enter "set all" to see a list of available options. As an example, here is what I use, but that is personal taste (mine is the same in editors as in women: you-asked-for-it-you-get-it and no-bells-no-whistles-just-pure-efficiency):

# cat .exrc
" bakunins .exrc file, strip comments before use
set ai                     # automatic indenting
set sw=5                   # indenting width is 5 characters
set nonovice               # no "are you sure" nonsense
set terse                  # no nothing, i know what i do

set noshowmode             # this and the following is for vim to
set nomodeline             # switch off all the fancy stuff. I want an editor,
syntax off                 # not some Pacman-game with a text option
set nohlsearch
set noruler
set noshowmatch
set matchpairs=
set uc=0

I hope this helps.

bakunin

Thanks for the information. I may not have worded my query clearly.
I am aware of the .exrc file. I was just curious which "set" option needs to be altered to change the directory from /var/tmp to something else.

Having looked at the "set all" list, I think it should be one of the following:

  backupdir=.,~/tmp,~/
  backupskip=/tmp/*,/tmp/*,/tmp/*
  directory=.,~/tmp,/var/tmp,/tmp

I think it's $TMPDIR. See man ex (which covers the core of vi).
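For what it's worth, POSIX ex also documents a "directory" option (short form "dir") as the place where the editor buffer is kept, and in vim the option of the same name controls where the swap file goes. A possible .exrc fragment, with /big/scratch as a made-up path standing in for any filesystem that has enough free space:

```
" hypothetical path; use a filesystem with enough room for the copy
set directory=/big/scratch
```

Whether a given vi honours $TMPDIR, this option, or both varies between implementations, so it is worth checking with "set all" after starting the editor.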

Beware that unix vi has all sorts of limits that make it unsuitable for large files. This includes the number of records and the length of a record.
