Posts Tagged parallel processing

Process forking and threading with PHP

I’ve been working on a rather large web application which is responsible for combining data from a variety of sources and presenting the data to the end user in a clean, unified fashion. During this process we sometimes run into cases where multiple related calls are made, each to perform some transformative work on a single set of data. We decided these calls could be made in a more parallel fashion and as such started looking into ways of parallelizing PHP so that relatively expensive operations could be performed at the same time and then the results combined in the end.

We examined a few possible solutions such as Gearman, popen, and multi curl. However, all of these methods seemed to require more overhead than they were worth. What I really wanted was something along the lines of POSIX threads to distribute the workload, with shared memory for passing data between the parent and its children.

After some searching through PHP extensions and the official documentation, I ran across PHP’s Process Control extension (PCNTL), which includes the pcntl_fork function. Combined with PHP’s Shared Memory (shmop) functions, this promises to fit the bill of inexpensive distribution of processing tasks along with low-overhead inter-process communication.

Here is a sample proof-of-concept script. I’ll outline what it does below:

<?php
$data = array();

echo "Parent PID: ".getmypid().PHP_EOL;

function forkTest(array &$data) {
	$pids = array();

	$parent_pid = getmypid();

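	for($i = 0; $i < 10; $i++) {
		/* Only the parent forks; children fall through the remaining iterations */
		if(getmypid() == $parent_pid) {
			$pid = pcntl_fork();
			if($pid == -1) {
				die("Could not fork".PHP_EOL);
			}
			if($pid > 0) {
				/* pcntl_fork() returns the child's pid in the parent and 0 in the child */
				$pids[] = $pid;
				echo "Forked child, \$pids now has ".count($pids)." elements".PHP_EOL;
			}
		}
	}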

	if (getmypid() == $parent_pid) {
		/* Parent process */
		echo "Hello from parent: ".getmypid().PHP_EOL;
		array_push($data, "parent".getmypid()); 		  		 

		/* Process children's results as they exit */
		while(count($pids) > 0) {
			$pid = pcntl_waitpid(-1, $status);
			echo "Attempting to open memory with pid: ".$pid.PHP_EOL;
			$shm_id = shmop_open($pid, "a", 0, 0);

			$shm_data = unserialize(shmop_read($shm_id, 0, shmop_size($shm_id)));
			shmop_delete($shm_id);
			shmop_close($shm_id);

			$data = array_merge($data, $shm_data);

			/* Hunt down and remove pid entry */
			foreach($pids as $key => $tpid) {
				if($pid == $tpid) unset($pids[$key]);
			}
		}

		echo "All children exited, \$data now has:".count($data)." elements".PHP_EOL;
		$pids = array();
	} else {
		/* Child processes */
		$pdata = array();
		echo "Hello from child: ".getmypid().PHP_EOL;
		array_push($pdata, "child".getmypid());
		$data_str = serialize($pdata);

		$shm_id = shmop_open(getmypid(), "c", 0644, strlen($data_str));
		if (!$shm_id) {
			echo "Couldn't create shared memory segment".PHP_EOL;
		} else {
			if(shmop_write($shm_id, $data_str, 0) != strlen($data_str)) {
				echo "Couldn't write shared memory data".PHP_EOL;
			}
		}

		sleep(rand(1,10));
		exit(0);
	}
}

/* Run the test 10 times */
for($f = 0; $f < 10; $f++) {
	echo "Running $f forkTest()".PHP_EOL;
	forkTest($data);
}

echo "Fork test finished, \$data now contains ".count($data)." elements".PHP_EOL;
echo "\$data:".PHP_EOL.json_encode($data);

This code describes a function that forks 10 child worker processes (strictly speaking these are separate processes, not threads). Each child starts with its own copy of the parent’s memory, so it cannot write to the parent’s $data array directly. Instead, each child pushes a string containing its process identifier into a local array, serializes that array, and places it into a shared memory segment, using its process ID as the shared memory key. The parent process waits for each child to exit, reads that child’s data from shared memory, deletes the segment, and merges the result into the master $data array. My test application runs through this function 10 times to demonstrate how forking in PHP can be safe and memory efficient. Each run adds 11 elements (one from the parent plus one from each of the 10 children), so the result should be a $data array with 110 elements in it. I’ve thrown in sleep commands with a random time between 1 and 10 seconds to show how children can exit at different times.
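If I were to pull this fork-and-collect pattern into something reusable, it might look roughly like the sketch below. The helper name runInChildren() and its $tasks parameter are made up for illustration and are not part of the script above. It assumes the pcntl and shmop extensions are loaded, that it runs from the command line on a POSIX system, and that no stale shared memory segment already exists under a child's process ID.

```php
<?php
// Hypothetical helper: run each task in its own forked child process and
// collect the results through shared memory, keyed by the child's pid.
function runInChildren(array $tasks): array
{
    $pending = array();                  // child pid => task key
    foreach ($tasks as $key => $task) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException("fork failed");
        }
        if ($pid === 0) {
            // Child: compute, publish the serialized result, then exit.
            $payload = serialize($task());
            $shm = shmop_open(getmypid(), "c", 0644, strlen($payload));
            if ($shm === false || shmop_write($shm, $payload, 0) !== strlen($payload)) {
                exit(1);                 // Signal failure through the exit status.
            }
            exit(0);
        }
        $pending[$pid] = $key;           // Parent: remember which task this pid runs.
    }

    $results = array();
    while (count($pending) > 0) {
        $pid = pcntl_waitpid(-1, $status);   // Block until any child exits.
        if ($pid <= 0) {
            break;                       // No children left, or waitpid failed.
        }
        if (!isset($pending[$pid])) {
            continue;                    // Not one of ours.
        }
        $key = $pending[$pid];
        unset($pending[$pid]);
        if (!pcntl_wifexited($status) || pcntl_wexitstatus($status) !== 0) {
            continue;                    // Child failed, so there is nothing to read.
        }
        $shm = shmop_open($pid, "a", 0, 0);
        if ($shm === false) {
            continue;
        }
        $results[$key] = unserialize(shmop_read($shm, 0, shmop_size($shm)));
        shmop_delete($shm);              // Mark the segment for removal.
    }
    return $results;
}

$squares = runInChildren(array(
    "a" => function () { return 2 * 2; },
    "b" => function () { return 3 * 3; },
));
ksort($squares);                         // Children may finish in any order.
print_r($squares);
```

Keying the results by task name rather than by pid means the caller does not need to care which child finished first. A failed child simply leaves its key out of the result.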

No doubt optimizations can be made, but this should serve as at least a rudimentary example of true, efficient parallel processing in PHP. That is, provided that the work you are planning on doing is worth the overhead (which, small as it may be, still exists and should be factored in) and provided that you do not mind locking your application down to a POSIX environment (meaning the above code will not work on Windows platforms).


Diskless computing vs distributed computing

A friend of mine recently asked me about cloud computing: what it is, and what it means for where technology is headed in the coming years. His question revealed a common confusion between cloud computing and diskless computing.

Both of these are interesting areas of computer science. They do sometimes overlap, and they are both going to change computing in significant ways as time rolls on, but they are not the same.

Here are the differences to help you tell them apart.

Diskless computing

Diskless computing is best demonstrated in the Linux Terminal Server Project (an excellent project; I’ve used it to deploy over 150 diskless workstations in a company) and Microsoft’s pathetic rival, Windows Terminal Services. Sun has their own solution as well and there are countless 3rd party utilities, but the basic idea behind them all is that you have one big computer (or series of computers) that all these “headless” computers connect to in order to retrieve an operating system, store files, etc. For large networks this network model is absolutely amazing.

Cloud Computing

Cloud computing, however, is the concept that you have a large problem that requires a lot of computing power to solve. Rather than buy bigger and bigger hardware, what we’ve found out (going back to Cray supercomputers) is that it is far better to split the problem down into iterative chunks and push those through multiple processors all at once rather than try to get a single processor to process everything. This is called distributed computing.

You might have heard of one of the major platforms for this type of computing, Beowulf, from the popular internet meme “imagine a beowulf cluster of…” Another very popular distributed computing platform (popular because it is far easier to install, operate, and write code for than the Beowulf project) is Hadoop. Hadoop was inspired by Google’s implementation of the MapReduce design paradigm, and it is written in Java, which makes it a lot more portable.
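To make the “split the problem into chunks” idea a bit more concrete, here is a toy word count in the MapReduce style. It is only a sketch of the paradigm and has nothing to do with Hadoop’s actual API. The function names mapChunk() and reduceCounts() are invented for illustration, and everything runs in one process here, whereas a real cluster would hand each chunk to a different machine.

```php
<?php
// Map step: turn one chunk of text into partial word counts.
function mapChunk(string $chunk): array
{
    $words = str_word_count(strtolower($chunk), 1);
    return array_count_values($words);
}

// Reduce step: merge the partial counts from every chunk into one total.
function reduceCounts(array $partials): array
{
    $totals = array();
    foreach ($partials as $partial) {
        foreach ($partial as $word => $count) {
            $totals[$word] = ($totals[$word] ?? 0) + $count;
        }
    }
    ksort($totals);
    return $totals;
}

$chunks = array("the quick brown fox", "the lazy dog", "the fox");

// Each mapChunk() call is independent, so each could run on a different node.
$partials = array_map('mapChunk', $chunks);
$totals = reduceCounts($partials);
print_r($totals);
```

The important property is that the map step never needs to see more than its own chunk, which is what lets the work spread across many processors.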

Projects using Cloud Computing

Parallel processing is done today in a wide variety of settings including:

  • 3D rendering farms for companies such as Disney’s Pixar
  • indexing the web with Google, Yahoo, Microsoft, etc.
  • data mining of all sorts with companies like Wal-Mart, etc.

Join in!

There are some very popular projects using distributed computing technologies that regular people with CPU cycles to spare are encouraged to join in on like:

  • SETI@home where you can help process data that might help us identify extraterrestrial signals
  • Folding@home where you can help search for cures to various diseases
  • Genome@home where you can help map the human genome (again), this is tied closely to the folding@home project above
  • Shrek@home which was a pioneer project that a few of us got to participate in
  • others, including FightAIDS@Home to help fight AIDS and LHC@home to process the massive amounts of data coming from CERN’s Large Hadron Collider

So while diskless computing and cloud computing can have some areas of overlap (I configured the LTSP network I mentioned earlier to assist with the genome@home project when the systems were idle) they aren’t necessarily tied together.
