Streaming filenames from an overpopulated directory

I have a project that requires me to process files from a directory containing hundreds of thousands, even millions, of files. That is enough that running ls in the directory is painfully slow, so I’ve learned to only perform specific file lookups. Until now I had just put up with the slow file listing, but the other day I finally had enough and decided to look into why ls and other file listing utilities, like Python’s os.listdir, are equally slow. Shouldn’t it be possible to just stream filenames out of a directory as you read the filesystem’s index of files in that directory, rather than waiting until all of the filenames have been scanned and put into an array first?

It turns out that you can list a million files in a directory, but not with ls. The key is to use the getdents system call, which exists on both Linux and FreeBSD. While a separate command line utility based on the C code in the first or second links will work, what I really wanted to do was stream the files in Python. Python has the ability, thanks to ctypes, to interact with system libraries directly. So with a little more digging I was able to find a simple Python module that uses ctypes to call libc’s opendir and readdir (which on Linux read entries through getdents a buffer at a time) and stream out the files from my directory o’ many files. Since my final module isn’t exactly the same as the one I found, I’ll post it below:

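import os
from ctypes import (CDLL, POINTER, Structure, c_char, c_char_p, c_int,
                    c_long, c_ubyte, c_ushort)
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""
c_dir_p = POINTER(c_dir)

class c_dirent(Structure):
    """Directory entry"""
    # Assumed layout: struct dirent on 64-bit Linux with glibc. Other
    # platforms (FreeBSD, 32-bit Linux) size or order these fields differently.
    _fields_ = [
        ('d_ino', c_long),          # inode number
        ('d_off', c_long),          # offset to the next dirent
        ('d_reclen', c_ushort),     # length of this record
        ('d_type', c_ubyte),        # type of file
        ('d_name', c_char * 4096),  # filename
    ]
c_dirent_p = POINTER(c_dirent)

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

def listdir(path):
    """A generator to return the names of files in the directory passed in"""
    dir_p = opendir(os.fsencode(path))  # c_char_p needs bytes on Python 3
    if not dir_p:
        # opendir returns NULL on failure; calling readdir on it would crash
        raise OSError("opendir failed for %r" % (path,))
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break  # NULL pointer: no more entries
            name = os.fsdecode(p.contents.d_name)
            if name not in (".", ".."):
                yield name
    finally:
        # Runs when the generator is exhausted or closed early
        closedir(dir_p)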
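
For comparison, here is a rough sketch of the same streaming idea using only the standard library, assuming Python 3.6 or later, where os.scandir returns a lazy iterator that can be used as a context manager. This is an alternative, not the module above, and the helper names stream_names and first_names are made up for illustration.

```python
import os
from itertools import islice

def stream_names(path):
    """Yield entry names from `path` one at a time (hypothetical helper)."""
    # os.scandir is lazy; the with-block closes the directory handle
    # once the generator is exhausted or closed.
    with os.scandir(path) as entries:
        for entry in entries:
            # scandir never returns "." or "..", so no filtering is needed
            yield entry.name

def first_names(path, n):
    """Return at most `n` names without building a list of every entry."""
    return list(islice(stream_names(path), n))
```

Something like first_names("/some/huge/dir", 10) is a quick way to peek at a directory without waiting for a full listing; the path here is only a placeholder.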

