Little Bear Island

Epic problems require epic solutions


substr in PHP without breaking HTML tags

I've run into this problem twice now and still haven't found a totally satisfactory solution, but I think I have a fairly unusual one.

A blog I was working on had an issue with cut-off HTML tags. Each entry preview is limited to 500 characters, and I naively used substr with a length of 500, without considering that a fixed cut at 500 may split an HTML tag in half or leave a tag unclosed. My previous solution was to find all safe text (that is, text between tags) and pick the safe position closest to the substr length. That method was fairly complex for such a "simple" function, and it only avoided cutting tags in half; it did not avoid leaving tags unclosed. My new method consists of this:

  1. Match all tags and their innerHTML

    preg_match_all('/<.+?>(.+?)<\/[^>]+>/', $str, $matches);

  2. Strip tags from the desired text

    $text = strip_tags($str);

  3. substr to the desired length

    $text = substr($text, 0, 500);

  4. str_replace all matches back into the string

    $text = str_replace($matches[1], $matches[0], $text);
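
The four steps above can be sketched as one small function. This is only an illustration: the name preview_naive and the sample strings are made up for this example, and the tag pattern is the simple one from step 1.

```php
<?php
// Hypothetical helper; the post does not give this first version a name.
function preview_naive(string $str, int $len): string
{
    // 1. Collect every tag pair ($matches[0]) and its inner text ($matches[1]).
    preg_match_all('/<.+?>(.+?)<\/[^>]+>/', $str, $matches);

    // 2-3. Strip the tags, then cut the plain text to the desired length.
    $text = substr(strip_tags($str), 0, $len);

    // 4. Swap each inner text for its tagged original.
    return str_replace($matches[1], $matches[0], $text);
}

echo preview_naive('Hello <b>big</b> world, this is long', 12), "\n";
// Hello <b>big</b> wo

echo preview_naive('HI <strong>HI</strong>', 500), "\n";
// <strong>HI</strong> <strong>HI</strong>
```

The second call shows the catch with step 4: str_replace works on text, not positions, so every "HI" gets wrapped, not just the one that was tagged in the original.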
    

UPDATE: I wrote a slightly more robust version of this using a similar idea. The above method works fine for simple matching of HTML tags, but its biggest weakness is that it replaces too much: str_replace wraps every occurrence of "HI" in <strong>HI</strong>, even if <strong>HI</strong> occurs only once, at a specific position, in the original. Here is a revised method, which finds the tags and then inserts them back into a clean version of the string at their original offsets (using substr and PREG_OFFSET_CAPTURE):

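function substrhtml($str,$start,$len){
    // Cut the plain text first. The tag offsets below refer to the original $str,
    // so they only line up with $str_clean when $start is 0.
    $str_clean = substr(strip_tags($str),$start,$len);
    if(preg_match_all('/<[^>]+>/',$str,$matches,PREG_OFFSET_CAPTURE)){
        // Re-insert each tag at its original byte offset, in document order.
        // An offset past the end of $str_clean simply appends the tag.
        for($i=0;$i<count($matches[0]);$i++){
            $str_clean = substr($str_clean,0,$matches[0][$i][1]) . $matches[0][$i][0] . substr($str_clean,$matches[0][$i][1]);
        }
        return $str_clean;
    }else{
        // No tags at all: a plain substr is safe.
        return substr($str,$start,$len);
    }
}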

This method has the weakness of appending the unused tags to the end of the returned text. Virtually all tags anyone would put in a blog entry are inline, and because nothing ends up inside these leftover tags, they are effectively invisible (except in the source). This way there will never be unclosed tags, and the very small downside is that you'll have some empty tags at the end.
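
To see the trailing tags concretely, here is a small run on a made-up string. The function body repeats the logic of the function above so the snippet runs on its own; the sample input is an assumption for illustration, and $start is kept at 0.

```php
<?php
// Same logic as the function above, repeated so this snippet is self-contained.
function substrhtml($str, $start, $len) {
    $str_clean = substr(strip_tags($str), $start, $len);
    if (preg_match_all('/<[^>]+>/', $str, $matches, PREG_OFFSET_CAPTURE)) {
        // Each match is a [tag text, byte offset in $str] pair.
        foreach ($matches[0] as [$tag, $pos]) {
            $str_clean = substr($str_clean, 0, $pos) . $tag . substr($str_clean, $pos);
        }
        return $str_clean;
    }
    return substr($str, $start, $len);
}

// Made-up sample: the cut at 9 characters falls before the <strong> pair.
$html = 'One <em>two</em> three <strong>four</strong> five';
echo substrhtml($html, 0, 9), "\n";
// One <em>two</em> t<strong></strong>
```

The <em> pair lands where it belongs, while the <strong> pair, which lies entirely past the cut, is appended as an empty element.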

UPDATE2: Here is a version that does not put the empty tags in at the end, but still closes the last tag if it is left open:

function substrhtml($str,$start,$len){
    // Tag offsets refer to the original $str, so this assumes $start is 0.
    $str_clean = substr(strip_tags($str),$start,$len);
    if(preg_match_all('/<[^>]+>/',$str,$matches,PREG_OFFSET_CAPTURE)){
        for($i=0;$i<count($matches[0]);$i++){
            if($matches[0][$i][1] < $len){
                // Tag starts before the cutoff: put it back at its original offset.
                $str_clean = substr($str_clean,0,$matches[0][$i][1]) . $matches[0][$i][0] . substr($str_clean,$matches[0][$i][1]);
            }else if(preg_match('/^<\/[^>]+>$/',$matches[0][$i][0])){
                // First closing tag at or past the cutoff: add it, then stop.
                $str_clean = substr($str_clean,0,$matches[0][$i][1]) . $matches[0][$i][0] . substr($str_clean,$matches[0][$i][1]);
                break;
            }
        }
        return $str_clean;
    }else{
        // No tags at all: a plain substr is safe.
        return substr($str,$start,$len);
    }
}
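
The version above adds only the first closing tag it finds past the cutoff, so nested tags can stay open, and a closing tag can be added even when nothing is open. One possible way around that is to walk the string piece by piece and keep a stack of open tags. The following is a rough sketch, not the method from this post: substrhtml_balanced and the short list of void elements are assumptions made for this example, it uses the same simple tag pattern, and it counts bytes of text between tags (so an entity like &amp;amp; counts as several characters).

```php
<?php
// Hypothetical variant; not part of the original post.
function substrhtml_balanced(string $str, int $len): string
{
    // Elements that never take a closing tag (deliberately incomplete list).
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];

    // Split into alternating text runs and tags, keeping the tags.
    $parts = preg_split('/(<[^>]+>)/', $str, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $out  = '';
    $open = [];    // stack of currently open tag names
    $left = $len;  // bytes of visible text still allowed

    foreach ($parts as $part) {
        if (preg_match('/^<(\/?)([a-z][a-z0-9]*)\b[^>]*>$/i', $part, $m)) {
            $name = strtolower($m[2]);
            if ($m[1] === '/') {
                // Closing tag: keep it only if it matches the innermost open tag.
                if (end($open) === $name) {
                    array_pop($open);
                    $out .= $part;
                }
            } else {
                $out .= $part;
                // Remember the tag unless it is void or self-closing.
                if (!in_array($name, $void, true) && substr($part, -2) !== '/>') {
                    $open[] = $name;
                }
            }
            continue;
        }
        // Text run: copy as much as still fits, then stop at the cutoff.
        if (strlen($part) >= $left) {
            $out .= substr($part, 0, $left);
            break;
        }
        $out  .= $part;
        $left -= strlen($part);
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo substrhtml_balanced('<a><b>xyz</b></a>', 1), "\n";
// <a><b>x</b></a>
echo substrhtml_balanced('abcdef <b>x</b>', 3), "\n";
// abc
```

Because tags past the cutoff are never reached, there are no empty tags at the end, and every tag that was opened before the cutoff gets its own closing tag.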