Error running FORTRAN code

Hi,
I am new to this forum and do not know whether this is the appropriate place to post this question. Anyway, I am trying my luck.

I have a Fortran program, swanhcat.ftn, which is part of a wave modelling system. There is also a file, hcat.nml, which is required to run this program. The program's job is to combine many files (say 250) into a single file. It is not a simple concatenation. The program compiled and ran. It worked fine when the number of input files was up to a certain number (say 180) and failed when there were more input files. I should say that the input files are big. I give below the details of one such file.
-rw-r--r-- 1 osf staff 15208731 Jun 27 12:40 restart_20090101.00-132

Actually the files are like restart_20090101.00-001, restart_20090101.00-002..... restart_20090101.00-180 and are the outputs of a numerical model run using 180 processors. Each processor does some part of the model run and thus the final information is contained in the 180 files together. I am trying to combine the information into a single file using the swanhcat.ftn program. It was a success when I tried with 180 processors (there were 180 input files).
When I tried with 181 input files (or more), the program failed saying 'segmentation fault(core dumped)'. Does it have something to do with the memory used by the program?

How do I find out what is wrong?

Best regards,
s.

been a long time since I played with fortran...but

Immediate questions -:

what version are you running and on what OS ?

are you opening all files concurrently or sequentially; are you closing each file before reading from the next ?

Hi Tytalus,

Details you asked for are here:
Fortran:

IBM XL Fortran Enterprise Edition for AIX, V11.1.

OS:

AIX 5.3.0.0

All the files are read concurrently as:

 DO i=1,numfiles
         tmp = trim(basefile)
         WRITE(tmp(LEN_TRIM(tmp)+1:LEN_TRIM(tmp)+4),33) i
         OPEN(unit=10+i, file=tmp, status='old', iostat=ios)
         IF ( ios /= 0 ) THEN
            PRINT *,'Unable to open',TRIM(tmp)
         ELSE
            IF(verbose) PRINT *,TRIM(tmp),' opened succesfully'
         ENDIF
      ENDDO

All the files are closed at the end of the program as

 CLOSE ( unit = 10 );
      DO i=1,numfiles
         CLOSE (unit=10+i)
      ENDDO

Since it is a very big program of ~ 500 lines (it is part of the numerical model, not written by me) I could write only the relevant part here.

Regards,
s.

It would be very helpful if you could post your code examples using code tags, [code ][/code ], there is a space before the ] that you don't actually put in, but I have added it so you could read the tag text.

Here is an example,

DO i=1,numfiles
  tmp = trim(basefile)
  WRITE(tmp(LEN_TRIM(tmp)+1:LEN_TRIM(tmp)+4),33) i
  OPEN(unit=10+i, file=tmp, status='old', iostat=ios)
  IF ( ios /= 0 ) THEN
    PRINT *,'Unable to open',TRIM(tmp)
  ELSE
    IF(verbose) PRINT *,TRIM(tmp),' opened succesfully'
  ENDIF
ENDDO

C All the files are closed at the end of the program as

CLOSE ( unit = 10 );
DO i=1,numfiles
  CLOSE (unit=10+i)
ENDDO  

I still program in fortran 77, so this syntax is not clear to me. I would set up a do loop with a control line number and not an ENDDO. I would also evaluate with .NE. instead of /=. etc.

The thing to do is to create a small program that just opens your files, closes your files, and prints something to confirm success. Get that to work. You can post that here with a sample input file so others can test it.
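
A minimal sketch of such a test program follows. The base name, the file count, and the "-001" style suffix are assumptions made for illustration; they are not read from hcat.nml, and the real program builds its names with a format statement (label 33) that was not posted.

```fortran
program open_test
  ! Sketch only: basefile and numfiles are assumed values.
  implicit none
  character(len=*), parameter :: basefile = 'restart_20090101.00'
  integer, parameter :: numfiles = 181
  character(len=256) :: fname
  integer :: i, ios, nfail

  nfail = 0
  do i = 1, numfiles
     ! Build names like restart_20090101.00-001 (assumed naming scheme)
     write (fname, '(a,"-",i3.3)') basefile, i
     open (unit=10+i, file=trim(fname), status='old', action='read', iostat=ios)
     if (ios /= 0) then
        print *, 'Unable to open ', trim(fname), ' iostat=', ios
        nfail = nfail + 1
     end if
  end do

  ! Close only after every file has been opened, as the posted code does
  do i = 1, numfiles
     close (unit=10+i, iostat=ios)
  end do

  print *, 'files requested:', numfiles, ' failed to open:', nfail
end program open_test
```

If this runs cleanly with 181 or more files, the open/close logic itself is unlikely to be the cause of the crash.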

Most likely, you are exceeding the capacity of some data structure that stores the contents of the files, but that is hard to know without seeing the declarations. After you get the files opening and closing, you can look at the data structures in the program and add what is necessary to store the data. It can be helpful to have apps like this process one input file at a time if that is possible. Then it shouldn't matter how many input files you process, only how large the largest file is.
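
As a rough illustration of the one-file-at-a-time idea, here is a sketch that reuses a single unit number and closes each file before opening the next. It only counts lines; the file names and count are assumed, and whether swanhcat.ftn can be restructured this way depends on how it merges the data, which is not shown here.

```fortran
program one_at_a_time
  ! Sketch only: basefile, numfiles and the naming scheme are assumptions.
  implicit none
  character(len=*), parameter :: basefile = 'restart_20090101.00'
  integer, parameter :: numfiles = 181
  integer, parameter :: u = 11          ! single unit, reused for every file
  character(len=256) :: fname, line
  integer :: i, ios, nlines

  do i = 1, numfiles
     write (fname, '(a,"-",i3.3)') basefile, i
     open (unit=u, file=trim(fname), status='old', action='read', iostat=ios)
     if (ios /= 0) then
        print *, 'Unable to open ', trim(fname)
        cycle
     end if

     ! Read until end of file or error; only the line count is kept
     nlines = 0
     do
        read (u, '(a)', iostat=ios) line
        if (ios /= 0) exit
        nlines = nlines + 1
     end do

     close (u)                          ! release the file before the next one
     print *, trim(fname), ': ', nlines, ' lines'
  end do
end program one_at_a_time
```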

Fortran 77 doesn't have dynamic memory allocation, so you have to pre-size your data structures to be larger than needed and also trap out the data loading process so you can quit with a handled exception if your data is larger than the available space. If that is a huge issue, you may consider re-writing some of the subroutines in c++, which has containers like vectors and maps that can re-size on the fly. It is generally not a problem to have mixed language code, and calling a c++ function from Fortran code is not that difficult as long as you can compile both languages.

Fortran is a beastly language for file I/O and I'm sure it would be very much easier to write this program in c++, or better yet, an interpreted language like perl, python, or ruby. If the app is 500 lines of f95, it is probably 200 lines of c++, or 15 lines of python.

LMHmedchem

Hi LMHmedchem,

The files are opened and read successfully, as is printed by the program. I guess the problem is with some allocation when the number of input files is large. I could not attach the input files owing to the huge size. I paste below a few lines of one of the input files.

SWAN 1 Swan standard file, version
TIME time-dependent data
1 time coding option
LONLAT locations on the globe
124496 496 251 number of locations
79.749977 11.116000
79.749977 11.118289
79.749977 11.120578
79.749977 11.122867
79.749977 11.125156
79.749977 11.127445
79.749977 11.129734
79.749977 11.132023
79.749977 11.134312
79.749977 11.136601
79.749977 11.138890
79.749977 11.141179
79.749977 11.143467

Since the model uses fortran 95 programs, the syntax will be different from f77.
Regards,
S.

This is probably nothing other than an array, or some other data or control structure, that is not declared as large enough. This is common in Fortran because you have to pre-size everything and hope you leave enough space. Sometimes an allocation will be sufficient when a program is first developed, but may need to be increased later. The use of a parameter.dat file is common, where you hard code constants with sizes for things that may need to change. Then you use the constant as the array size. You keep them all in one place and well documented so it is easy to change. When you have too many input files, the array boundaries are exceeded and there is a seg fault. You could also be flat running out of RAM, especially if this is run on an older machine with limited resources.
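
A sketch of that pattern, using a module rather than an include file: the limit lives in one named constant and the program checks against it before touching the array. The names max_files and counts, and the value 200, are hypothetical; I have not seen how the real program sizes its arrays.

```fortran
module limits
  implicit none
  ! Hypothetical limit; the real program may size things differently
  integer, parameter :: max_files = 200
end module limits

program guard_demo
  use limits, only: max_files
  implicit none
  integer :: counts(max_files)     ! array sized from the shared constant
  integer :: numfiles

  numfiles = 181                   ! assumed value for illustration

  ! Trap the overflow before any array element is written
  if (numfiles > max_files) then
     print *, 'Too many input files:', numfiles, ' limit is', max_files
     stop 1
  end if

  counts(1:numfiles) = 0
  print *, 'OK, room for', numfiles, 'files'
end program guard_demo
```

With the check in place, exceeding the limit gives a readable message instead of a seg fault, and raising the limit means changing one line.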

There are lots of other possibilities, but those are the most likely since the program runs for smaller input files. Another thing to consider is that one of the input files is malformed. Fortran is unbelievably picky about file formats and such, so if one of your models had an issue and produced improperly formatted output, or no output at all, that could also cause an issue if the specific format problem wasn't trapped out. The best way to evaluate that is to try to find where the process fails. If you can find the location in the specific input file where the app crashes, you can look at it to see if there is a visible issue or not.

It is still unclear to me what the final output of the app is supposed to be. If the input files are text files, you can try to compress them, or post them at another location on the web and post a link. If I have a sample of the two input files that you are trying to merge, and a sample of the expected output, it should not be too difficult to fix. The files do not need to be complete, or actual files, just enough to see what the format should be.

A quick look shows that this seems to be set to open up to 1000 files, after that you will exceed the size of the do loop iterator.

This is an important point: it looks like all of the files are closed at once, and if you are keeping all of those large files open, you may indeed be running out of memory. Have you checked your resources while the app is running to see how much RAM is available?

LMHmedchem

Hi LMHmedchem,

I attach here 5 files-
restart_20080515-001.txt,.........restart_20080515-004.txt and restart_20080515.txt which is the result of hcat.exe working on the first 4 files. The actual names of these files should be without the extension of .txt, but I added it just for the ease of uploading them here.
This is a case where the executable hcat.exe of the program swanhcat.ftn was successful in combining the 4 files restart_20080515-00(1...4) to generate restart_20080515. In this case I ran my model with 4 processors to generate the 4 files restart_20080515-00(1..4). When I work with more processors, more such files will be generated. The program swanhcat.ftn (a serial program) fails to combine the files when the number of files is large, or maybe when the total size of all the files exceeds some value.
Even I can see that the program should be able to handle 1000 files, which is much above any practical requirement.
I tried running the program on an HPC (high-performance computing) system, which has 128 GB RAM per node. I was given unlimited memory access to run the program, but it failed. I also tried on my PC with 12 GB RAM, and it still failed.

This doesn't seem to be too bad, but I think the app should be in c++, or an interpreter language, and not Fortran. Unfortunately I am getting ready for a conference and a grant meeting, so I may not be able to work much on this until later in the month.

Please let me know if you need it right away and I will see what I can do.

LMHmedchem

Please take your own time.

Hi.

I looked over the code that you posted. It contains ALLOCATE statements, so I think you can ignore LMHmedchem's comments about array declarations, because that is the Fortran-77 way of doing things. Fortran-90 and later use modern techniques for dynamic storage.

My take is different from LMHmedchem's in that I think a bird in the hand is worth two in the bush -- so you have a Fortran code, you might as well use it, as opposed to writing something new in a different language.

I considered the possible problems from too many files being open, so I wrote two test codes. The first simply opens the files one after another, associating each with "unit=1", but never explicitly closes them. The second opens files concurrently, associating each with unit=n, where n varies. Both read a value from each file, and the second goes back to read the first data file to make sure that the first file is still open and accessible.

I tried these codes with 2000 data files. "data0011 - data2010". The data files have a unique integer number in them (basically the sequence), and then the negative of that (number+1). Both test programs worked as I expected, without error.

I do agree with LMHmedchem that "Most likely, you are exceeding the capacity of some data structure that stores the contents of the files ...".

I will demonstrate a technique for identifying such errors in a subsequent post ... cheers, drl

Hi.

I wrote the following code to demonstrate how ALLOCATE can obtain storage dynamically, but the array is still subject to being over-run:

     program f1

! @(#) f1	Demonstrate Fortran-95.

       real, allocatable :: x(:)

       write (*,*) " Hello, world from Fortran-95."
       allocate (x(1))

       x(1) = 1.0
       write(*,*) " x(1) is ",x(1)

! For the loop below:
! 1)
! both gfortran, g95 -- beyond 6 one gets:
! glibc detected *** ./a.out: free(): invalid next size (fast)
! ifort (compiling name.90 -- beyond 7 yields:
! glibc detected *** ./a.out: double free or corruption

! 2)
! gfortran -fbounds-check:
! At line 19 of file one.f95
! Fortran runtime error: Array reference out of bounds for array
! 'x', upper bound of dimension 1 exceeded (2 > 1)
!
! g95 -fbounds-check:
! At line 25 of file one.f95
! Traceback: not available, compile with -ftrace=frame or -ftrace=full
! Fortran runtime error: Array element out of bounds: 2 in (1:1), dim=1
!
! ifort -check bounds two.f90
! forrtl: severe (408): fort: (2): Subscript #1 of the array X
! has value 2 which is greater than the upper bound of 1

       do i = 2, 6
         x(i) = float(i)
       end do

     end

This was run in the following context:

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
GNU bash 3.2.39
gfortran GNU Fortran (Debian 4.3.2-1.1) 4.3.2
G95 (GCC 4.0.3 (g95 0.92!) Jun 24 2009)
ifort (IFORT) 11.1 20090827

The results are part of the comments above. The first section shows that errors occur when the array is over-run sufficiently. However, if one enables bounds-checking, the error is caught immediately.
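
For comparison, here is a sketch of the same loop with the over-run removed. It assumes the required size (called n here, a name I made up) is known before the array is filled, and it checks the status returned by ALLOCATE so that a failed allocation is reported rather than crashing later.

```fortran
program f1_fixed
  ! Sketch only: n is an assumed size, known before the data is stored.
  implicit none
  integer, parameter :: n = 6
  real, allocatable :: x(:)
  integer :: i, istat

  ! Request exactly the storage needed and check that it was granted
  allocate (x(n), stat=istat)
  if (istat /= 0) then
     write (*,*) ' allocation of ', n, ' elements failed, stat = ', istat
     stop 1
  end if

  ! Every index used below lies inside 1..n
  do i = 1, n
     x(i) = real(i)
  end do
  write (*,*) ' x(n) is ', x(n)

  deallocate (x)
end program f1_fixed
```

In a real program the size would usually come from the input (a count read from a header line, for example), and the same stat= check applies.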

Regrettably, the AIX box to which I have access does not have a Fortran compiler that I could find. However, I think IBM almost certainly will have provided a means to check for such errors, so you'll need to do a bit of reading for that.

Adding the bounds check may degrade performance, but it's such an easy change to make to see if that is the error that your time will be well-spent.

Best wishes, and keep us updated ... cheers, drl