Streaming filenames from an overpopulated directory


I have a project that requires processing files from a directory containing hundreds of thousands, even millions, of files. That is enough to make running ls in that directory painfully slow, so I’ve learned to only perform specific file lookups. Until now I had just put up with the slow file listing, but the other day I finally had enough and decided to look into why ls and other file-listing utilities, like Python’s os.listdir, are all so slow. Shouldn’t it be possible to stream filenames out of a directory as you read the filesystem’s index of files in that directory, rather than waiting until all of the filenames have been scanned and put into an array first?

It turns out that you can list a million files in a directory, but not with ls. The key is the getdents system call, which exists on both Linux and FreeBSD. While a separate command-line utility based on the C code in the first or second links will work, what I really wanted was to stream the files in Python. Python can interact with system libraries directly thanks to ctypes. So with a little more digging I was able to find a simple Python module that uses ctypes to wrap libc’s opendir, readdir and closedir, which sit on top of getdents, and stream out the files from my directory ‘o many files. Since my final module isn’t exactly the same as the one I found, I’ll post it below:

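from ctypes import CDLL, c_char_p, c_int, c_long, c_ulong, c_ushort, c_ubyte, c_char, Structure, POINTER
from ctypes.util import find_library


class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""
    pass
c_dir_p = POINTER(c_dir)


class c_dirent(Structure):
    """Directory entry"""
    # Field layout assumed to match struct dirent on Linux/glibc;
    # other platforms order and size these fields differently.
    _fields_ = [
        ('d_ino', c_ulong),         # inode number
        ('d_off', c_long),          # offset to the next dirent
        ('d_reclen', c_ushort),     # length of this record
        ('d_type', c_ubyte),        # type of file
        ('d_name', c_char * 4096),  # filename
    ]
c_dirent_p = POINTER(c_dirent)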

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int


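def listdir(path):
    """
    A generator that yields the names of files in the directory passed in.
    Written for Python 3: accepts a str or bytes path and yields str names.
    """
    # c_char_p arguments must be bytes, so encode str paths first
    if isinstance(path, str):
        path = path.encode(errors="surrogateescape")
    dir_p = opendir(path)
    if not dir_p:
        # opendir returns NULL on failure (missing directory, no permission)
        raise OSError("could not open directory: %r" % path)
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            # d_name comes back as bytes, truncated at the first NUL
            name = p.contents.d_name
            if name not in (b".", b".."):
                yield name.decode(errors="surrogateescape")
    finally:
        closedir(dir_p)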
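
For comparison, Python 3.6 and later can do something similar with os.scandir from the standard library, which also hands back directory entries lazily instead of building a full list first. The sketch below is not the module above, just an alternative under that assumption; stream_names and first_names are hypothetical helper names made up for illustration.

```python
import os
from itertools import islice


def stream_names(path):
    """Yield entry names from `path` one at a time (hypothetical helper).

    os.scandir returns a lazy iterator, so no full list is built here.
    It already skips the "." and ".." entries.
    """
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name


def first_names(path, limit):
    """Return at most `limit` names without listing the whole directory."""
    names = stream_names(path)
    try:
        return list(islice(names, limit))
    finally:
        # Close the generator so the scandir handle is released promptly.
        names.close()
```

Looping over stream_names(some_directory) processes each name as it arrives, and first_names(some_directory, 10) is a quick way to peek at a huge directory without waiting for a full listing.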


