Friday, February 26, 2010

Programmer - PCRE: Lazy and Greedy at the same time (Possessive Quantifiers)

Programmer Question

I am trying to match a series of text strings with PCRE in PHP, and am having trouble getting all the matches between the first and the last. Another problem I have is matching pairs of braces.



If anyone wonders why on Earth I would want to do this, it's because of Doc Comments. Oh, how I wish Zend would make native/plugin functions to read Doc Comments from a PHP file...



The following example (plain) text will be used for each problem. It will always be pure PHP code, with only one opening tag at the beginning of the file and no closing tag. You can assume that the syntax will always be correct.



<?php
class someClass extends someExample
{
    function doSomething($someArg = 'someValue')
    {
        // Nested code blocks...
        if($boolTest){}
    }
    private function killFurbies(){}
    protected function runSomething(){}
}

abstract
class anotherClass
{
    public function __construct(){}
    abstract function saveTheWhales();
}

function globalFunc(){}


Problem One



Trying to match all methods in a class; my RegEx does not find the method killFurbies() at all. Letting it be greedy means it only matches the last method in a class, and letting it be lazy means it only matches the first method.



// Two variants were tried, one per run (as written, the second assignment wins):
$part = '.*';  // Greedy
$part = '.*?'; // Lazy

$regex = '%class(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff]*)'
. '.*?\{' . $part .'(?:(public|protected|private)(?:\\n|\\r|\\s)+)?'
. 'function(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff'
. ']*)(?:\\n|\\r|\\s)*\\(%ms';

preg_match_all($regex, file_get_contents(__EXAMPLE__), $matches, PREG_SET_ORDER);
var_dump($matches);


Results in:



// Lazy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(11) "doSomething"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(6) "public"
    [3]=>
    string(11) "__construct"
  }
}

// Greedy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
}


How do I match all? :S
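
One way to see why only one method per class comes back: every match of the pattern has to begin at a class keyword, and preg_match_all() returns non-overlapping matches, so each class can contribute at most one match however $part is quantified. The following is a minimal sketch, not a fix for the original pattern. It assumes the class prefix is simply dropped, and the names $source, $methodRegex and $names are illustrative only. Run on its own, the method part of the pattern finds every method:

```php
<?php
// Sample input: a trimmed copy of the first class from the example text.
$source = <<<'PHP'
class someClass extends someExample
{
    function doSomething($someArg = 'someValue')
    {
        if($boolTest){}
    }
    private function killFurbies(){}
    protected function runSomething(){}
}
PHP;

// Same method sub-pattern as in the question, without the leading class part.
// (\s already covers \n and \r, so the alternation is not needed.)
$methodRegex = '%(?:(public|protected|private)\s+)?function\s+'
             . '([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)\s*\(%';

preg_match_all($methodRegex, $source, $matches, PREG_SET_ORDER);

// Collect just the method names (capture group 2).
$names = array();
foreach ($matches as $match) {
    $names[] = $match[2];
}

print_r($names);
```

A pattern like this has no notion of where a class ends, so on the full example file it would also pick up globalFunc(). That limitation is the subject of Problem Two.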



Problem Two



Of course, searching for Doc Comments on methods in a class means I have to know which character is the closing brace for the class the RegEx is running on. Is there any way to count how many opening braces PCRE encounters ($n) and stop searching when it encounters the $n'th closing brace? At the moment, it keeps searching and finds the methods inside classes defined later in the file, as well as functions in the global scope. I'm guessing this has something to do with recursion or subpatterns.
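
PCRE cannot count braces as such, but a recursive subpattern can match a balanced {...} block. The sketch below is one possible shape, offered under assumptions rather than as a definitive answer. It assumes no braces appear inside strings or comments, which plain brace matching cannot tell apart from real ones. The names $classRegex, $methodRegex and $methodsByClass are hypothetical. It captures each class body with a recursive group and then searches for methods only inside that body:

```php
<?php
// Sample input: one class followed by a global function.
$source = <<<'PHP'
class someClass extends someExample
{
    function doSomething($someArg = 'someValue')
    {
        if($boolTest){}
    }
    private function killFurbies(){}
}

function globalFunc(){}
PHP;

$name = '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*';

// Group 2 matches one balanced {...} block; (?2) recurses into that same
// group whenever a nested opening brace is met. The possessive quantifiers
// (++ and *+) stop the engine from backtracking into text already consumed.
$classRegex  = '%\bclass\s+(' . $name . ')[^{]*+(\{(?:[^{}]++|(?2))*+\})%';
$methodRegex = '%(?:(public|protected|private)\s+)?function\s+(' . $name . ')\s*\(%';

$methodsByClass = array();
preg_match_all($classRegex, $source, $classes, PREG_SET_ORDER);
foreach ($classes as $class) {
    // Search for methods only inside the captured class body.
    preg_match_all($methodRegex, $class[2], $methods, PREG_SET_ORDER);
    $methodsByClass[$class[1]] = array();
    foreach ($methods as $method) {
        $methodsByClass[$class[1]][] = $method[2];
    }
}

print_r($methodsByClass);
```

With the sample input above, doSomething and killFurbies are collected under someClass, and globalFunc() is left out because it lies outside the matched braces.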



Any help would be greatly appreciated, as I already feel this question is ridiculous as I'm typing it out. Anyone attempting to answer a question like this is braver than me!



Thanks, mniz.


