Edit a large file in place

Folks,
I have a file with 50 million records having 2 columns. I have to do the below:

  1. Generate some random numbers of a fixed length.
  2. Replace the second column of randomly chosen rows with the random numbers.

I tried using a little bit of perl to generate random numbers and sed to do the replacement manually. The problem is that each substitution writes out all 50 million records, not just the changed one. I'd rather not have the whole file regenerated for every row update; I'd like to get the output once, after all the updates are done ....
I was wondering if I could edit the file in place using sed ... I did look for the in-place option, but I don't have the GNU version of sed ...
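(The closest thing to true in-place editing I have found without GNU sed is ed; here is a sketch of a single substitution, with made-up line number and old/new values, though I doubt it is practical on a 50-million-line file since ed reads the whole file into memory:

printf '%s\n' '1234s/oldvalue/newvalue/' w q | ed -s myfile

and in any case that is still one run per change.)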

Any thoughts ...?

Thanks
V

Here is an initial stab at your requirement:
>>1. Generate some random numbers of a fixed length.

i=00000000                     # zero padding, to guarantee 8 characters
echo $RANDOM$i | cut -c 1-8    # $RANDOM gives 0-32767; pad it and keep the first 8 characters

the above serves to generate random numbers for more than 50 million records; again, I'm not sure how frequently a number repeats!
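One caveat: $RANDOM only takes 32768 distinct values (0 to 32767), so with the zero padding the same 8-character strings will come back very often. If you need a wider spread over all 8 digits, something like this should do (untested sketch, still plain ksh):

printf "%08d\n" $(( (RANDOM * 32768 + RANDOM) % 100000000 ))

The two draws combine to roughly cover the whole 00000000-99999999 range.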

I'm not clear about the second requirement. Please be specific ...

-ilan

Hi ilan,
Thanks for taking this up ... I have the first piece figured out ... I can generate a random number using a small perl script that I downloaded off the net ... but I have a problem with the second part ... I'll try to describe it better.

I have 50 million records with 2 columns; both columns are present in every record.

Step 1: Generate a random value (this is the part I figured out above).
Step 2: Locate a random record among the 50 million.
Step 3: Replace the value in the second column with the value generated in Step 1.
Step 4: Go back to Step 1, generate a new value, look for another random record, replace its second column, and so on, about a million times.

I want to be able to do this in place, because every time I replace a record using awk, it outputs the whole 50 million records including that one change; I have to redirect the output to another file, rename it back to the original, and start over for the next iteration.
What I need is a way to edit the file in place in a loop, identifying random records and changing the second column a million times.
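To make it concrete, each iteration I do today looks roughly like this (simplified; RandomRecord and RandomValue stand in for values from my generator):

awk -F'|' -v rec=$RandomRecord -v val=$RandomValue 'NR == rec { $2 = val } { print $1 "|" $2 }' origfile > tmpfile
mv tmpfile origfile

... and the whole 50-million-line file gets rewritten for that one change.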

The high-level requirement is:
Given a file of 50 million records, I have to produce a file that still has 50 million records, but where 1 million of them have a second column that differs from the original file. Maybe there is an easier way to do this ... but I am stumped right now ....
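I keep wondering whether a single pass with awk could do the whole thing, something like this untested sketch (it assumes the awk at hand has srand/rand and can hold a million entries in an array):

awk -F'|' 'BEGIN {
    srand()
    # pick 1 million distinct record numbers up front
    while (n < 1000000) {
        r = int(rand() * 50000000) + 1
        if (!(r in pick)) { pick[r] = 1; n++ }
    }
}
FNR in pick { printf "%s|%08d\n", $1, int(rand() * 100000000); next }
{ print }' origfile > newfile

But I am not sure awk will behave on a file this size, so any advice is welcome.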

Thanks,
V

Could you please post some sample input and output data so that we can be clearer about the requirement?

12123|12345678
42142|23442253
52315|32250205
....
....
...
....
....
around 50 million

Now I want to choose records at random and change the value of the second column.

For example, if I choose the second record at random, I will change the 2nd column to a random value:

12123|12345678
42142|53988989
52315|32250205
....
....
...
....
....

The same operation is done 1 million times, each time choosing a different record at random.

I tried the following code on AIX, in ksh.
The code is long, but there is nothing complicated in it.
Let's say your original file is origfile.

step 1.

sed s/"|"/" "/g origfile >tempfile

/** if you dont have sed ,you must change "|" with blank with someting */
/after this your original file looks like this 12123 12345678 */

grep -n "^$" tempfile >origfile,rm tempfile

/*after this your original file looks like this ;
1 12123 12345678
2 42142 53988989

step 2.
# produce 1 million random numbers (the replacement values) and save them to RandNumbersFile
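For example, one way to fill it (an untested sketch; it assumes your awk has srand/rand):

awk 'BEGIN { srand(); for (i = 0; i < 1000000; i++) printf "%08d\n", int(rand() * 100000000) }' > RandNumbersFile

The same idea fills RandRecordsFile in step 3, using int(rand() * 50000000) + 1 for the record numbers.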

step 3.
# produce 1 million random record numbers (between 1 and 50000000) and save them to RandRecordsFile
sort -u RandRecordsFile > tempfile
mv tempfile RandRecordsFile

# you can produce 1 million numbers, but after sort -u (unique) fewer than
# 1 million lines may be left; every line in this file must be unique, and
# the command above arranges that

let "NeededLine=1000000-`wc -l RandRecordsFile |awk '{print $1}'`"

/*this line shows you how many new records do you need after sort */

counter=0
while [ $counter -lt $NeededLine ]
do
    # produce a random RandomRecord (a record number); add your own generator here
    grep -x "$RandomRecord" RandRecordsFile > /dev/null   # -x matches the whole line, not a substring
    if [ $? -ne 0 ]
    then
        echo $RandomRecord >> RandRecordsFile
        let "counter=$counter+1"
    fi
done
sort -u RandRecordsFile > tempfile
paste tempfile RandNumbersFile > RandomRecordsFile
rm tempfile

# after this your RandomRecordsFile looks like this:
# 1 12345678
# 27 53988989
# the first field is the record number, the second is the new (random) second field

sort -k1,1 origfile > tempfile2                               # join needs both inputs sorted on the join field
join -v1 tempfile2 RandomRecordsFile > tempfile               # unmatched lines (records that keep their value)
join -o 1.1,1.2,2.2 tempfile2 RandomRecordsFile >> tempfile   # matched lines, with the new second field
sort -n tempfile > origfile                                   # put the records back in numeric order
# if you need the original | format back, add these lines:
cut -d' ' -f2,3 origfile > tempfile
sed 's/ /|/g' tempfile > origfile
rm tempfile tempfile2
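With the sample data from earlier in the thread, the two joins behave like this (illustration only; record 2 happens to be the one chosen for replacement):

# sorted origfile:        RandomRecordsFile:
# 1 12123 12345678        2 53988989
# 2 42142 23442253
# 3 52315 32250205
#
# join -v1 keeps:         1 12123 12345678
#                         3 52315 32250205
# join -o 1.1,1.2,2.2:    2 42142 53988989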

so the whole code is:

# produce 1 million random numbers and save them to RandNumbersFile
# produce 1 million random record numbers and save them to RandRecordsFile

cp yourfile origfile
sed 's/|/ /g' origfile > tempfile
awk '{ print NR, $0 }' tempfile > origfile
sort -u RandRecordsFile > tempfile
mv tempfile RandRecordsFile
let "NeededLine=1000000-$(wc -l < RandRecordsFile)"
counter=0
while [ $counter -lt $NeededLine ]
do
    # produce a random RandomRecord here (add your own generator)
    grep -x "$RandomRecord" RandRecordsFile > /dev/null
    if [ $? -ne 0 ]
    then
        echo $RandomRecord >> RandRecordsFile
        let "counter=$counter+1"
    fi
done
sort -u RandRecordsFile > tempfile
paste tempfile RandNumbersFile > RandomRecordsFile
sort -k1,1 origfile > tempfile2
join -v1 tempfile2 RandomRecordsFile > tempfile
join -o 1.1,1.2,2.2 tempfile2 RandomRecordsFile >> tempfile
sort -n tempfile > origfile
cut -d' ' -f2,3 origfile > tempfile
sed 's/ /|/g' tempfile > origfile
rm tempfile tempfile2
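If you want to check the result, a quick sanity test (just a sketch; it assumes you kept the original as yourfile, restored the | format in origfile, and both files are still line-aligned) is to count how many second fields changed:

paste -d'|' yourfile origfile | awk -F'|' '$2 != $4 { n++ } END { print n + 0 }'

It should print something close to 1000000 (only "close", because now and then a random value can happen to equal the old one).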

Thanks Fazliturk ...
I am going to be trying this pretty soon ... Will let you know how it goes ... - thanks again ...