Oliver Nassar

Convert/remove certain characters within attribute values for a tag

February 21, 2013

I needed to convert certain characters within the attribute values of any tag. For example, I wanted <input value="Oliver<Nassar" to become <input value="Oliver&lt;Nassar". As you can see, I needed to convert the < character to its entity equivalent, &lt;. I wanted to do the same for the > character.

While I am doing a replacement, the regular expression I came up with can just as readily be used for removal. Here's the expression:

(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2

And here it is split up into separate components:

(
    \s{1}
    [a-z\-]+
    \s?
    =
    \s?
    (
        [\'|"]{1}
    )
)
([^\2]*)
\2

The logic is as follows:

  1. Capture a whitespace character, the attribute name, the equal sign and the opening delimiter (eg. ' or ") as the first back-reference (eg. value=")
  2. The \s? marks that between the attribute name (eg. value) and the equal sign, an optional whitespace character is allowed
  3. The same goes for after the equal sign
  4. As the second back-reference, capture the delimiter (eg. ' or "). Note that the | inside the square brackets is matched literally, so a | would also be accepted as a delimiter
  5. The third back-reference is then set up to capture the attribute value (eg. Oliver<Nassar), and will stop once it reaches the delimiter that was captured as the second back-reference (eg. the ' or " character)
  6. Match the end of the attribute equation by searching for the second back-reference. This turns out to be required: inside a character class, \2 is read as an escaped character rather than a back-reference, so [^\2]* alone will not stop at the delimiter. It is the trailing \2, combined with non-greedy matching, that ends the value
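
To make the three back-references concrete, here is a minimal sketch that runs the expression through preg_match on the example from the top of this post and prints each captured group. The variable names ($pattern, $markup, $match) are only illustrative, and the iU flags are assumed to match the PHP version further down.

```php
<?php
// The same expression as above, on one line, with the i and U flags.
$pattern = '/(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2/iU';

// Sample markup taken from the example at the top of the post.
$markup = '<input value="Oliver<Nassar">';

preg_match($pattern, $markup, $match);

echo $match[1], "\n"; // whitespace, attribute name, equal sign, opening delimiter
echo $match[2], "\n"; // the delimiter on its own
echo $match[3], "\n"; // the raw attribute value
```

For this input the groups come out as  value=" (with a leading space), " and Oliver<Nassar respectively.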

In PHP, the full expression, with replacement, turned into this:

echo preg_replace_callback(
    '/' .
        '(' .
            '\s{1}' .
            '[a-z\-]+' .
            '\s?' .
            '=' .
            '\s?' .
            '(' .
                '[\'|"]{1}' .
            ')' .
        ')' .
        '([^\2]*)' .
        '\2' .
    '/iU',
    function($match) {
        $replacedVersion = $match[1] . str_replace(
                array('<', '>'),
                array('&lt;', '&gt;'),
                $match[3]
            ) . $match[2];
        return $replacedVersion;
    },
    $markup
);

As you can see from above, I included the iU flags to make the match case-insensitive and the quantifiers non-greedy. The preg_replace_callback function then takes the contents of the third back-reference (the attribute value) and encodes the < and > symbols to their respective HTML entities, before putting the closing delimiter (the second back-reference) back on the end.
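
The U flag matters more than it might seem. As a rough sketch, the snippet below compares the expression with and without it on a tag that has two attributes. The markup and variable names are made up for illustration. As far as I can tell, PCRE treats \2 inside a character class as an octal escape rather than a back-reference, so without U the value group simply runs on to the last matching delimiter.

```php
<?php
// Two attributes on one tag, so a greedy match has room to overshoot.
$markup = '<input name="a<b" value="c>d">';

$lazy   = '/(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2/iU';
$greedy = '/(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2/i';

preg_match($lazy, $markup, $lazyMatch);
preg_match($greedy, $markup, $greedyMatch);

echo $lazyMatch[3], "\n";   // a<b
echo $greedyMatch[3], "\n"; // a<b" value="c>d
```

With U, each attribute is matched on its own. Without it, the first match swallows both attributes and the delimiters in between would be treated as part of the value.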

Note that this logic is general enough to apply to any tag (eg. a, script, style, etc).
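
Here is one way the conversion and the removal could share the same expression. The transformAttributeValues function is a hypothetical helper, not something from the code above: it takes the markup plus a function to run on each attribute value. The sample markup is also made up, and mixes single and double quotes on an a tag.

```php
<?php
// Hypothetical helper: applies $transform to every attribute value the
// expression finds, leaving the name and delimiters untouched.
function transformAttributeValues($markup, $transform)
{
    return preg_replace_callback(
        '/(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2/iU',
        function ($match) use ($transform) {
            // $match[1]: name and opening delimiter, $match[3]: value,
            // $match[2]: delimiter, reused to close the attribute
            return $match[1] . $transform($match[3]) . $match[2];
        },
        $markup
    );
}

$markup = '<a href=\'/users?q=<oliver>\' title="Oliver<Nassar">profile</a>';

// Conversion: encode < and > as entities
$encoded = transformAttributeValues($markup, function ($value) {
    return str_replace(array('<', '>'), array('&lt;', '&gt;'), $value);
});

// Removal: drop < and > entirely
$stripped = transformAttributeValues($markup, function ($value) {
    return str_replace(array('<', '>'), '', $value);
});

echo $encoded, "\n";
echo $stripped, "\n";
```

Both attributes are rewritten while the tag's own < and > are left alone, since those sit outside anything the expression matches.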

Hope that helps.