Using regexes to match HTML tags


#1

Regular expressions give me headaches. Heh, I’ve been working on them all day, and I must say that they are both one of the best things ever and one of the worst things ever to use. Anyways, I’m hoping one of you can figure out what I’m doing wrong with this code.

[php]<?php

$string = <<<html

A Sample Page This is a sample page.

I have some paragraphs, with some bold text, and some cool italics too!

html;

$string2 = str_replace("&", "&amp;", $string);
$string2 = str_replace("<", "&lt;", $string2);
$string2 = str_replace(">", "&gt;", $string2);
$string2 = str_replace("\n", "<br />\n", $string2);

print $string2."

";

$offset = 0;

while($offset <= (strlen($string) - 1)){

$string3 = preg_match("/^.*?<([^\n\t]*)>(.*)<\/\\1>.*?$/s", $string, $matches, PREG_OFFSET_CAPTURE, $offset);

print "\n\n";
print_r($matches);
print "\n";

if($matches[1][1]){
$offset = $matches[1][1] + 1;
} else {
$offset = strlen($string);
}

print "\n\n".$offset."";

}

?>[/php]

This script takes the “sample page” I have up top, first displays it on the page, and then should run through the loop and grab the text in between the HTML tags. Because preg_match_all searches for a match and then searches for any further matches from the END of the last match, I decided to use a loop so that nested tags would be picked up as well. Unfortunately, all it picks up is the bold tag, and nothing else. I even tried “anchoring” the regex to the beginning and end of the string with ^.*? and .*?$ to hopefully get it to match the tag first, but no luck. Can anyone figure out what I need to change here?
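
For reference, here is a minimal sketch of the loop-with-offset idea described above, written without the ^ and $ anchors. As far as I can tell from the preg_match documentation, the offset argument only moves where the search starts; ^ still refers to the start of the whole subject string, so an anchored pattern may not be able to match once the offset is greater than zero. The sample markup, the variable names ($html, $pattern, $found) and the tag-name pattern are all assumptions made up for illustration, not the original page.

```php
<?php
// Hypothetical sample input; this markup is an assumption for illustration.
$html = "<p>I have <b>bold and <i>italic</i> text</b> here.</p>";

// No ^ or $ anchors, so the pattern can match starting at any offset.
// Group 1 is the tag name, group 2 the text up to the first matching close tag.
$pattern = '/<([a-z][a-z0-9]*)>(.*?)<\/\1>/is';

$found = array();
$offset = 0;

// Restart the search one byte past the start of each match, so that tags
// nested inside the previous match are examined as well.
while (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
    $found[] = array($m[1][0], $m[2][0]); // [tag name, inner text]
    $offset = $m[0][1] + 1;               // $m[0][1] is the match start
}

foreach ($found as $pair) {
    echo $pair[0], ': ', $pair[1], "\n";
}
```

This only handles attribute-free tags, and a tag nested inside another tag of the same name would be paired with the wrong closing tag, so treat it as a sketch of the offset technique rather than a general HTML parser.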