I needed to convert certain characters within the attribute values of any tag. For example, I wanted <input value="Oliver<Nassar" to become <input value="Oliver&lt;Nassar". As you can see, I needed to convert the < character to its entity equivalent, &lt;. In fact, I wanted to do the same for the > character (&gt;).
While I am doing a replacement, the regular expression I came up with can just as easily be used for removal. Here's the expression:
(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2
And here it is split up into separate components:
(
\s{1}
[a-z\-]+
\s?
=
\s?
(
[\'|"]{1}
)
)
([^\2]*)
\2
The logic is as follows:
- Capture a whitespace character, the attribute name and the delimiter (eg. ' or ") as the first back-reference (eg. value=")
- The \s? marks that between the attribute name (eg. value) and the equal sign, an optional whitespace character can be defined
- The same goes for after the equal sign
- As the second back-reference, capture the delimiter (eg. ' or ")
- The third back-reference is then set up to capture the attribute value (eg. Oliver<Nassar), and will stop once it reaches the delimiter that was captured as the second back-reference (eg. the ' or " character)
- Match the end of the attribute by searching for the second back-reference (this turns out to be required: inside a character class, PCRE does not treat \2 as a back-reference, so it is this trailing \2, combined with the ungreedy U flag, that actually ends the match at the closing delimiter)
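To make the back-references concrete, here is a minimal sketch that runs the expression against a single tag and prints each captured group. The $sample string and the variable names are illustrative assumptions only; they are not part of the original code.

```php
<?php
// The expression from above, written as one single-quoted PHP string.
$pattern = '/(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2/iU';

// Hypothetical sample input, used only for this illustration.
$sample = '<input value="Oliver<Nassar">';

preg_match($pattern, $sample, $groups);

echo 'Group 1: [' . $groups[1] . "]\n"; // Group 1: [ value="]
echo 'Group 2: [' . $groups[2] . "]\n"; // Group 2: ["]
echo 'Group 3: [' . $groups[3] . "]\n"; // Group 3: [Oliver<Nassar]
```

Group 1 carries everything up to and including the opening delimiter, which is why the callback can rebuild the attribute from groups 1, 3 and 2, in that order.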
In PHP, the full expression, with replacement, turned into this:
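echo preg_replace_callback(
    '/' .
        '(' .
            '\s{1}' .     // whitespace before the attribute name
            '[a-z\-]+' .  // attribute name
            '\s?' .
            '=' .
            '\s?' .
            '(' .
                '[\'|"]{1}' . // opening delimiter (second back-reference)
            ')' .
        ')' .
        '([^\2]*)' .      // attribute value (third back-reference)
        '\2' .            // closing delimiter
    '/iU',
    function($match) {
        // Rebuild the attribute: name and opening delimiter, encoded value, closing delimiter
        $replacedVersion = $match[1] . str_replace(
            array('<', '>'),
            array('&lt;', '&gt;'),
            $match[3]
        ) . $match[2];
        return $replacedVersion;
    },
    $markup // the HTML string to process
);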
As you can see from above, I included the iU flags to make the match case-insensitive and the quantifiers ungreedy. The preg_replace_callback function then converts the contents of the third back-reference so that the < and > symbols are encoded as their respective HTML entities (&lt; and &gt;).
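To show why the U flag matters, here is a small sketch that runs the same expression with and without it. The $sample string and the variable names are assumptions chosen for the demonstration. With two attributes on one tag, the greedy version lets the third back-reference run on to the last matching delimiter.

```php
<?php
// Hypothetical sample with two attributes on the same tag.
$sample = '<input value="a<b" name="c">';

// Body of the expression, without delimiters or flags.
$body = '(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2';

// With U, quantifiers are ungreedy: the value ends at the first closing quote.
preg_match('/' . $body . '/iU', $sample, $ungreedy);

// Without U, the value group swallows everything up to the last quote.
preg_match('/' . $body . '/i', $sample, $greedy);

echo $ungreedy[3], "\n"; // a<b
echo $greedy[3], "\n";   // a<b" name="c
```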
Note that this logic is general enough to apply to any tag (eg. a, script, style, etc).
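As a rough illustration of that generality, the sketch below wraps the replacement in a helper and runs it over a couple of different tags, including a single-quoted attribute. The function name encodeAngleBracketsInAttributes and the sample markup are hypothetical, introduced only for this example.

```php
<?php
// Hypothetical helper: encodes < and > inside attribute values only.
function encodeAngleBracketsInAttributes($markup)
{
    $pattern = '/(\s{1}[a-z\-]+\s?=\s?([\'|"]{1}))([^\2]*)\2/iU';

    return preg_replace_callback(
        $pattern,
        function ($match) {
            // $match[1]: name and opening delimiter, $match[3]: value, $match[2]: delimiter
            $encoded = str_replace(array('<', '>'), array('&lt;', '&gt;'), $match[3]);
            return $match[1] . $encoded . $match[2];
        },
        $markup
    );
}

// Sample markup (an assumption for this demo) mixing tags and quote styles.
$markup = '<a href="/search?q=a>b" title=\'x<y\'>link</a>' . "\n"
        . '<script data-note="1<2"></script>';

echo encodeAngleBracketsInAttributes($markup), "\n";
// <a href="/search?q=a&gt;b" title='x&lt;y'>link</a>
// <script data-note="1&lt;2"></script>
```

The angle brackets that delimit the tags themselves are left alone, since only the third back-reference is passed through str_replace.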
Hope that helps.