I haven't had the time to do a test on Linux yet, but I just finished a test on my Windows XP desktop machine (NTFS). I'm not sure how valuable this test is, but it's very interesting... Please give your thoughts on this.
Randomly opening and closing one of 100 preselected files, 100,000 times in total, in directories containing different numbers of files (relative times, normalized to the 100-file case):
100 files: 100.0
1000 files: 100.4
10,000 files: 101.3
100,000 files: 109.6
1,000,000 files: 130.9
A performance hit of 30% when going from 100 to 1,000,000 files in a directory!
When I ran the tests again, they were not only faster, but the differences were almost zero:
100 files: 100.0
1000 files: 100.0
10,000 files: 100.6
100,000 files: 100.2
1,000,000 files: 100.3
Obviously, some caching is going on. So, if you open the same files over and over (and the number of files is small enough), it doesn't seem to matter how many files you keep in the directories.
This caching could mean that the performance hit above would have been larger if I had opened more than 100 distinct files. Another way of doing this test would be to read every single file in random order.
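That alternative test is not something I have run; a minimal sketch of what it could look like (reusing the same kind of directory path as my script below, purely as an illustration) is:

import os
import random

def open_all_random_order(directory):
    # Open every file in the directory exactly once, in random order,
    # so the cache cannot help with files it has not seen yet.
    names = os.listdir(directory)
    random.shuffle(names)
    for name in names:
        f = open(os.path.join(directory, name), "rb")
        f.close()

open_all_random_order("c:\\temp\\test1000000")

With 1,000,000 files this would of course take much longer per pass, so the number of passes would have to be reduced.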
Maybe I should have used the same 1,000,000 files in each test case and instead distributed them differently (100 files per directory, 1000 files per directory, etc.). But then other variables would have affected the results, such as how I distributed them: path depth, number of directories, and so on.
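For what it's worth, a rough sketch of that layout (a hypothetical helper I did not actually use) could look like this:

import os
import shutil

def distribute(all_files, target_root, files_per_dir):
    # Copy the same set of files into subdirectories holding a fixed
    # number of files each, e.g. 1000 files per directory.
    for index, path in enumerate(all_files):
        subdir = os.path.join(target_root, "dir%05d" % (index // files_per_dir))
        if not os.path.isdir(subdir):
            os.makedirs(subdir)
        shutil.copy(path, subdir)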
Details
I used a script to create files with random names of 10+3 characters (name plus extension). I copied the 100 files from the 100-file directory to the other directories, then added more files until each directory held its target count. The files were almost empty (72 bytes).
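That creation script isn't included here, but a minimal sketch of how such files could be generated (random 10+3-character names, a small fixed payload; the path and helper name are just for illustration) would be:

import os
import random
import string

def create_files(directory, count, payload_size=72):
    # Create 'count' files with random 10+3-character names and a small
    # fixed-size payload, roughly matching the test files described above.
    if not os.path.isdir(directory):
        os.makedirs(directory)
    for _ in range(count):
        name = "".join(random.choice(string.ascii_lowercase) for _ in range(10))
        ext = "".join(random.choice(string.ascii_lowercase) for _ in range(3))
        f = open(os.path.join(directory, name + "." + ext), "wb")
        f.write(b"x" * payload_size)
        f.close()

create_files("c:\\temp\\test100", 100)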
Then I ran a Python script that opened and closed randomly selected files (from the 100 files above) in each directory. The source code is:
import datetime
import random

def getMS():
    # Milliseconds since midnight, built from the current wall-clock time.
    dt = datetime.datetime.now()
    ms = dt.microsecond / 1000
    ms += dt.second * 1000
    ms += dt.minute * 60000
    ms += dt.hour * 3600000
    return ms

# The 100 file names that are opened in every test directory.
fh = open("files.txt", "r")
filenames = map(lambda fn: fn.strip(), fh.readlines())
fh.close()

random.seed()

NUMBER_OF_OPENS = 100000
TIMES_PER_CASE = 3
testcases = ["1000000", "100000", "10000", "1000", "100"]

for i in range(TIMES_PER_CASE):
    for testcase in testcases:
        starttime = getMS()
        for j in range(NUMBER_OF_OPENS):
            # Pick one of the 100 file names at random and open it in the
            # directory for this test case.
            filename = "c:\\temp\\test" + testcase + "\\" + random.choice(filenames)
            open(filename, "rb").close()
        endtime = getMS()
        print testcase, i, endtime - starttime
And the results (columns: directory size, run number, elapsed time in milliseconds):
C:\Temp>python -OO openfiles.py
1000000 0 16156
100000 0 13531
10000 0 12508
1000 0 12399
100 0 12346
1000000 1 12291
100000 1 12274
10000 1 11886
1000 1 11265
100 1 11117
1000000 2 11199
100000 2 11183
10000 2 11232
1000 2 11166
100 2 11166
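The relative numbers at the top were computed by normalizing each run against its 100-file timing; the first table corresponds to run 0 above (for example, 16156 / 12346 * 100 = 130.9). A quick sketch of that normalization using run 0's raw timings:

# Normalize run 0's raw timings (milliseconds) against its 100-file case.
raw = {"100": 12346, "1000": 12399, "10000": 12508, "100000": 13531, "1000000": 16156}
baseline = raw["100"]
for case in ["100", "1000", "10000", "100000", "1000000"]:
    print("%s files: %.1f" % (case, raw[case] * 100.0 / baseline))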
Machine Specifications
I ran the tests on my old desktop, a DELL Optiplex 280 with a Pentium 4 CPU (2.8 GHz), 2 GB of DDR2 SDRAM, and an 80 GB Serial ATA-150, 7200 rpm hard drive (cache size unknown).
I'm using Windows XP SP3 with NTFS. I shut down all anti-virus, indexing and updating services and most programs before running the tests.
The hard drive was defragmented after creating the small files and before running the tests. I also rebooted before running the tests.