I have a project that requires processing files from a directory containing hundreds of thousands, sometimes millions, of files. That is enough that running ls in the directory is painfully slow, so I've learned to only do specific file lookups. Until now I had just put up with the slow listing, but the other day I finally had enough and decided to look into why ls and other file listing utilities, like Python's os.listdir, are equally slow. Shouldn't it be possible to stream filenames out of a directory as you read the filesystem's index of that directory, rather than waiting until every filename has been scanned and put into an array first?
It turns out that you can list a million files in a directory, but not with ls. The key is the getdents system call, which exists on both Linux and FreeBSD. While a separate command line utility based on the C code in the first or second links will work, what I really wanted was to stream the files in Python. Python can call system libraries directly thanks to ctypes. So with a little more digging I was able to find a simple Python module that uses ctypes to wrap the C library's opendir and readdir functions (which sit on top of getdents on Linux) and stream out the files from my directory o' many files. Since my final module isn't exactly the same as the one I found, I'll post it below:
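import os
from ctypes import (CDLL, POINTER, Structure, c_char, c_char_p, c_int,
                    c_long, c_ubyte, c_ushort, get_errno)
from ctypes.util import find_library


class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""


c_dir_p = POINTER(c_dir)


class c_dirent(Structure):
    """Directory entry"""
    # Assumption: this field layout matches struct dirent as defined by glibc
    # on 64-bit Linux. Other platforms lay the struct out differently.
    _fields_ = (
        ('d_ino', c_long),          # inode number
        ('d_off', c_long),          # offset to the next dirent
        ('d_reclen', c_ushort),     # length of this record
        ('d_type', c_ubyte),        # type of file
        ('d_name', c_char * 4096),  # filename
    )


c_dirent_p = POINTER(c_dirent)

c_lib = CDLL(find_library("c"), use_errno=True)

opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int


def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    # opendir expects a byte string, so encode the path first
    dir_p = opendir(os.fsencode(path))
    if not dir_p:
        # NULL means opendir failed (missing directory, no permission, ...)
        err = get_errno()
        raise OSError(err, os.strerror(err), path)
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break  # NULL marks the end of the directory
            name = p.contents.d_name  # bytes, cut at the first NUL
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        closedir(dir_p)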
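For comparison, on Python 3.6 or later the standard library's os.scandir gives similar streaming behaviour without any ctypes declarations, since it returns a lazy iterator over the directory's entries. The following is only a minimal sketch of that approach, not the module above; the helper names iter_names and first_names are made up for illustration.

```python
import os
from itertools import islice


def iter_names(path):
    """Yield entry names from `path` one at a time, without building a list."""
    # os.scandir returns a lazy iterator; "." and ".." are never included.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name


def first_names(path, limit):
    """Return at most `limit` names, stopping the scan early."""
    names = iter_names(path)
    try:
        return list(islice(names, limit))
    finally:
        # Closing the generator exits the `with` block inside iter_names,
        # which releases the directory handle even if we stopped early.
        names.close()
```

Calling first_names on a huge directory with a small limit lets you peek at a handful of entries without waiting for the full listing, and looping over iter_names processes the files as they are read.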