
Playing with Gearman

This was written in September 2009 when the current version of Gearman was 0.9. Thanks to Eric Day for answering my dumb questions along the way.

To get started, install Gearman. I am on Debian, so this is what I installed:
% apt-get install gearman gearman-job-server gearman-tools libgearman1 libgearman-dev libdrizzle-dev
Enable Gearman in /etc/default/gearman-server
Set up Gearman to use MySQL for its persistent queue store in /etc/default/gearman-job-server
 PARAMS="-q libdrizzle --libdrizzle-host=127.0.0.1 --libdrizzle-user=gearman \
                       --libdrizzle-password=your_pw --libdrizzle-db=gearman \
                       --libdrizzle-table=gearman_queue --libdrizzle-mysql"

% mysqladmin create gearman

% mysql 
mysql> create USER gearman@localhost identified by 'your_pw';
mysql> GRANT ALL on gearman.* to gearman@localhost;
Careful: if you are running MySQL with --old-passwords, this won't work with libdrizzle. You will need the 41-character password hash, which you can generate with a little PHP snippet that does the double-SHA1 encoding:
% php -r "echo '*'.strtoupper(sha1(sha1('your_pw',true)));"

% mysql
mysql> UPDATE mysql.user set Password='above_output' where User='gearman';

% mysqladmin flush-privileges
Then start the server.
% /etc/init.d/gearman-job-server start
Check to make sure gearmand is running. If it isn't, look for errors in /var/log/gearman-job-server/gearman.log.
Also note that with this setup your password will be visible in the output of 'ps'. The Gearman guys will be addressing this in gearmand-0.10.
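If you would rather check from code than from the shell, gearmand also speaks a simple text-based admin protocol on its listening port (4730 by default). A minimal sketch, assuming the default port and the plain-text 'status' command, which lists each registered function with its total, running and available-worker counts:
<?php
// Quick health check against gearmand's text admin protocol
$fp = fsockopen('127.0.0.1', 4730, $errno, $errstr, 2);
if (!$fp) die("gearmand not reachable: $errstr\n");
fwrite($fp, "status\n");
// The listing ends with a line containing a single "."
while (($line = fgets($fp)) !== false && trim($line) != '.') {
  echo $line;   // function <TAB> total <TAB> running <TAB> available workers
}
fclose($fp);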

Next, let's get the gearman PHP extension installed. Grab it from svn:
% svn co http://svn.php.net/repository/pecl/gearman/trunk gearman
% cd gearman
% phpize
% ./configure --with-php-config=/usr/local/bin/php-config
(If you have autoconf problems, apt-get install autoconf2.59 and set your PHP_AUTOCONF env variable to "autoconf2.59")
% make install
Edit your php.ini file and add: extension=gearman.so
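To confirm the extension is actually being loaded, a quick check from the command line:
% php -r "var_dump(extension_loaded('gearman'));"
If that prints false, double-check which php.ini your PHP binary is reading. (The extension also provides a gearman_version() function if you want to see which libgearman it was built against.)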

Now we can add a worker. A worker is something that will process a Gearman request. We can write it in almost any language, but here is one in PHP (worker.php):
#!/usr/local/bin/php
<?php
// Connect to the job server and register the "reverse" function
$worker= new GearmanWorker();
$worker->addServer('127.0.0.1');

$worker->addFunction("reverse", "reverse_fn");

// Loop forever, handling one job at a time
while (1) {
  print "Waiting for job...\n";
  $ret= $worker->work();
  if ($worker->returnCode() != GEARMAN_SUCCESS) break;
}

function reverse_fn($job) {
  $workload= $job->workload();
  echo "Received job: " . $job->handle() . "\n";
  echo "Workload: $workload\n";
  $result= strrev($workload);

  // Fake 10 seconds of work, reporting progress to the job server as we go
  for($i=1; $i<=10;  $i++) {
    $job->status($i,10);
    sleep(1);
  }

  echo "Result: $result\n";
  return $result;
}
And start it up:
% chmod +x worker.php
% ./worker.php
Waiting for job...
Now, in another terminal we can test it from the command line:
% gearman -f reverse Rasmus
10% Complete
20% Complete
30% Complete
40% Complete
50% Complete
60% Complete
70% Complete
80% Complete
90% Complete
100% Complete
sumsaR
Or, better yet, from our Web application we can call it like this:
<?php
// Set up connection to gearmand
$client= new GearmanClient();
$client->addServer('127.0.0.1');

$task = $client->addTask("reverse", "ABC123");
$task = $client->addTask("reverse", "DEF456");
$result = $client->runTasks();
Here the two addTask() calls simply add the tasks to the queue but don't run them yet, so they return right away. runTasks() is blocking, and since we only have one worker running, it will run the tasks one after the other, so this call will take 20 seconds to return.

So, as you might expect, if you start a second instance of the worker program, these two tasks will run in parallel and runTasks() will return in 10 seconds this time.

If you don't want to hang around waiting for the results, you can do:
$task = $client->addTaskBackground("reverse", "ABC123");
This will cause runTasks() to be non-blocking. It is useful for starting background jobs in one web request and then coming back later to check their status.
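For a single fire-and-forget job there is also GearmanClient::doBackground(), which submits the job and returns the job handle directly, so you can skip the task objects entirely. A quick sketch:
<?php
$client= new GearmanClient();
$client->addServer('127.0.0.1');

// Returns the job handle string for the queued job
$handle = $client->doBackground("reverse", "ABC123");
echo "Queued as $handle\n";
Job handles are explained just below.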

Speaking of checking the status: this isn't completely intuitive at first glance. Because of the way tasks are managed, you have to tell the client library not to free the tasks as soon as they have been sent. So we need to do something like this:
$client= new GearmanClient();
$client->addServer();
$client->setOptions(GEARMAN_CLIENT_FREE_TASKS, 0);
Then we add and run 2 background tasks:
$task1 = $client->addTaskBackground("reverse", "ABC123");
$task2 = $client->addTaskBackground("reverse", "DEF456");
$result = $client->runTasks();
Now that the jobs are running, or in the queue, we can get the job handles with:
$job1 = $task1->jobHandle();
$job2 = $task2->jobHandle();
A job handle is just a string that looks like "H:colo:70", where colo is the name of my machine.

We can sleep(3) and then check the status.
var_dump($client->jobStatus($job1));
var_dump($client->jobStatus($job2));
Each jobStatus() returns an array that looks like this:
array(4) {
  [0]=>
  bool(true)
  [1]=>
  bool(true)
  [2]=>
  int(6)
  [3]=>
  int(10)
}
The first two elements are isKnown and isRunning respectively. These flags do what they sound like. isKnown tells you if gearman knows about the job. This will be true if the job is in the queue or it is running. And isRunning obviously will be true if the job is currently running. The next two fields are the numerator and denominator set by our worker. In this case it is showing 6 out of 10, so 60% complete.

Of course, these job handles, since they are just strings, could also be stored in a user's session and checked from a separate request or from an Ajax call. This is where the power of gearman starts to become apparent. Out-of-band managed processing. And, of course, the job server or servers don't need to run on localhost. You can set up a dedicated farm of backend processing servers.
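To make that concrete, here is a sketch of what such a status-check endpoint might look like (the file name and session key are mine, not anything Gearman requires), assuming an earlier request stored the job handles in the session:
<?php
// status.php -- polled by the browser via Ajax
session_start();

$client= new GearmanClient();
$client->addServer('127.0.0.1');   // or one of your dedicated job servers

$out = array();
foreach ($_SESSION['job_handles'] as $name => $handle) {
  // Each entry is array(isKnown, isRunning, numerator, denominator)
  $out[$name] = $client->jobStatus($handle);
}

header('Content-Type: application/json');
echo json_encode($out);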

In my example so far I started two background tasks, then checked each of them in sequence by calling $client->jobStatus() on each job handle. With just 2 jobs that's not a big deal, but if you have a lot of them and want a consistent snapshot of what they are all doing, say to see which job is running faster, polling them in sequence will throw things off. So you can create a separate task for checking each status and run those status checks just like you ran the jobs initially, with runTasks(). Like this:
$status1 = $client->addTaskStatus($job1);
$status2 = $client->addTaskStatus($job2);
$result = $client->runTasks();
And you can get those same four fields with:
echo $status1->isKnown(), $status1->isRunning(), 
     $status1->taskNumerator(), $status1->taskDenominator();
Obviously if either isKnown() or isRunning() returns false, there is no point in checking the numerator and denominator.

Your client and worker can also send data back and forth between each other, but only for foreground tasks. First, from your worker you can send data like this:
$job->data(serialize(posix_times()));
We can put that in our for-loop in the above example and have it update once a second. In order to get this data from a foreground task, we have to define callback functions because the runTasks() call is blocking. It is done like this:
$client->setCompleteCallback("done");
$client->setDataCallback("data");
$task1 = $client->addTask("reverse", "ABC123");
$result = $client->runTasks();

function done($task) {
  echo "done() called\n";
  print_r($task->data());
  flush();
}

function data($task) {
  echo "data() called\n";
  print_r(unserialize($task->data()));
  flush();
}
The data() function will be called whenever the worker calls $job->data() and the done() function will be called when the task is done. $task->data() inside the done() function will be set to the returned result from the worker. In this example that is the reversed string.
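For reference, the worker's progress loop from the earlier example would then look something like this (assuming the POSIX extension is available for posix_times()):
  for ($i = 1; $i <= 10; $i++) {
    $job->status($i, 10);                      // progress, as before
    $job->data(serialize(posix_times()));      // triggers the client's data() callback
    sleep(1);
  }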

Server-side callbacks in a Web app aren't terribly useful to me, so I don't really see myself using this data-passing feature. Having my frontend code sit around in a blocking runTasks() call isn't very efficient. I'd rather be collecting other data and starting to send out my response to get my first-byte latency down. Therefore I only use non-blocking background tasks, and it currently isn't possible to use this data-passing mechanism from a background task. Right now if you call $job->data() from a background worker, the data is simply dropped.

You will have to write it to memcache or some other data store from your worker manually and have your clients check for data there. I think it would be a nice enhancement to Gearman if it could support a data storage plugin similar to the persistent queue storage plugin that would simplify this case.
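Until something like that exists, here is a rough sketch of the memcache approach (the key prefix is mine): the worker publishes its data under the job handle, and any client that knows the handle can read it back later.
<?php
// Worker side (inside reverse_fn): publish progress keyed by job handle
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);
$memcache->set('gm_data_' . $job->handle(), serialize(posix_times()), 0, 300);

// Client side (a later web request): read it back, given the job handle
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);
if (($data = $memcache->get('gm_data_' . $handle)) !== false) {
  print_r(unserialize($data));
}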


Comments


chris on :

For all other readers who find this post interesting, but have - like me - no idea what you are talking about:

http://gearman.org/

Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages.

Rasmus on :

Yeah, I guess I could have explained that.

DublinFrenchInAsia on :

This is cool

Great job there. A lot of new perspectives for us. With memcache, it will be killer. Persistence and multi-threading in PHP. Very cool

S.L.

Goran Dodig on :

Great article! Thank you for all the missing pieces. :)

Van Tran on :

Just wondering about the multi-threading capacity of the worker library. Let's say I have a worker.php script and all tasks are submitted using addTaskBackground. If I run a single instance of the worker script, does the worker library automatically fork child processes to handle multiple jobs? Or do I have to handle the multi-processing myself with something like pcntl?

I personally prefer Beanstalkd to Gearman because of its priority and delay capabilities, but I have to handle the job execution manually. Gearman can execute a job automatically via a registered function, but I'm not sure about the multi-threading.

Thanks

Rasmus on :

I don't think there is any multi-threading capacity in the worker library. You launch as many workers as you think you will need and if you have more concurrent jobs than that, they will be queued up waiting for an available worker. Adding an external mechanism that could monitor the queue and potentially launch more workers doesn't seem like a very difficult thing to do. Someone has probably done it.
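Something along these lines, a rough and untested sketch using the pcntl extension, is probably where I would start if I wanted several workers out of one script:
<?php
// prefork.php -- run several copies of the worker loop from one script
$num_workers = 4;

function reverse_fn($job) { return strrev($job->workload()); }

for ($i = 0; $i < $num_workers; $i++) {
  $pid = pcntl_fork();
  if ($pid == -1) die("fork failed\n");
  if ($pid == 0) {                       // child: become a worker
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1');
    $worker->addFunction("reverse", "reverse_fn");
    while ($worker->work());
    exit(0);
  }
}
// parent: wait for the children to exit
while (pcntl_wait($status) > 0);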

Urda on :

aptitude show libgearman1
No current or candidate version found for libgearman1
Package: libgearman1
State: not a real package

FYI libgearman1 doesn't seem to exist and will throw errors if you try to install the package.

Rasmus on :

Right, since writing this libgearman1 has been replaced by libgearman2 and then libgearman3.

Urda on :

You should update it for future reference :)

sumitdatta on :

@Van, No automatic multi threading or forking as such if you use a single worker.php. But as Rasmus said, I am sure there are ways and someone has already done it probably.
