View Full Version : Regex help?
lorren.biffin
April 18th, 2009, 05:45 PM
Hey all,
I'm in the process of learning Regex, and trying to put together a real-world example.. Here's my problem:
I need to format strings that have *phrases like this*, and phrases /like this/, where strings surrounded by asterisks would be wrapped with <strong> tags, and strings surrounded by forward-slashes would be wrapped with <em> tags. These strings can contain HTML tags. When formatting the strings, the formatting marks (* and /) should be replaced by their respective tags. Here are my current expressions:
Strong Exp: /\s+[*]+([^*]+)[*]+\s+/i
Em Exp: /\s+[\/]+([^\/]+)[\/]+\s+/i
Obviously, these are not sufficient for my needs.. The primary reasons being that I need to:
Ignore instances where the string begins and/or ends inside of an HTML tag, and
Ignore instances where the formatting marks are part of larger strings where those marks are necessary such as in a URL (http://google.com) or while in a notation (..*Offer valid only on in-store..)
An example string:
*know who, http://dude.com .. *that's* right, /bro/!
Here, the word "that's" would be bold, and the word "bro" would be italic.
I'm at a loss. Any ideas?
Thanks in advance! :D
NeoDreamer
April 19th, 2009, 01:56 AM
I think the question you are asking is whether you can read the writer's mind. The double slash case is easy, but the asterisk case requires knowing what the writer intended. The solution would be quite complex and would require advanced linguistic analysis.
lorren.biffin
April 19th, 2009, 02:57 AM
I think the question you are asking is whether you can read the writer's mind. The double slash case is easy, but the asterisk case requires knowing what the writer intended. The solution would be quite complex and would require advanced linguistic analysis.
I understand that, and that's fine. I'm not looking for an end-all Regex expression to grab the intention of the writer.. But I don't think it's unrealistic, through either a Regex expression or a Regex/script-logic combination to determine if an asterisk that's hanging out off by itself was intended to be formatting markup, or a singular notation.
I'm sure it's entirely possible, and it doesn't have to be one string of Regex. Of course, if that's possible then it would be the desired approach, but if not I'm completely open to using an algorithm to parse the entered data.
And if it's not possible, please feel free to show me why. ;) Locked into learning mode here.
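One way the Regex/script-logic combination could look, as a rough sketch only. It is written in Python rather than PHP, and the names PROTECTED, STRONG, EM and format_marks are made up for illustration. The idea: split the text on HTML tags and URLs first, leave those pieces untouched, and run the formatting expressions only on the plain text in between. The tag and URL patterns are deliberately simplified, and markup that spans across a tag would not be caught.

```python
import re

# Pieces to leave untouched: HTML tags and http(s) URLs (simplified patterns).
PROTECTED = re.compile(r"(<[^>]+>|https?://\S+)")
# A mark only counts when it follows whitespace or the start of the chunk,
# and the phrase must not end in a space right before the closing mark.
STRONG = re.compile(r"(^|\s)\*([^*]+[^ *])\*")
EM = re.compile(r"(^|\s)/([^/]+[^ /])/")

def format_marks(text):
    # With one capturing group, re.split alternates plain text (even
    # indexes) with the protected pieces it split on (odd indexes).
    parts = PROTECTED.split(text)
    for i in range(0, len(parts), 2):
        chunk = EM.sub(r"\1<em>\2</em>", parts[i])
        parts[i] = STRONG.sub(r"\1<strong>\2</strong>", chunk)
    return "".join(parts)

print(format_marks("*know who, http://dude.com .. *that's* right, /bro/!"))
```

The stray leading asterisk and the slashes inside the URL are left alone, because neither ever forms a complete marked phrase within a plain-text chunk.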
TheCanadian
April 19th, 2009, 03:13 AM
Just because I don't know doesn't mean there isn't an answer, but I'd say that the example you gave would be impossible to do with regex while still maintaining usability with other strings, unless you have some other criteria to determine when a regex should ignore an asterisk because it denotes a footnote. The best solution, I think, is to use a double asterisk (or some other unique character sequence) to denote <strong> emphasis.
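A minimal sketch of the double-asterisk idea, in Python for illustration (DOUBLE_STAR and bold_double are hypothetical names, and the exact delimiter is an assumption). A lone asterisk never matches, so a footnote mark passes through unchanged.

```python
import re

# Hypothetical convention: **text** means <strong>, a single * is ignored.
DOUBLE_STAR = re.compile(r"\*\*(.+?)\*\*")

def bold_double(text):
    # The lazy .+? stops at the first closing ** instead of the last one.
    return DOUBLE_STAR.sub(r"<strong>\1</strong>", text)

print(bold_double("*Offer valid in-store only. **Really** good."))
```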
lorren.biffin
April 19th, 2009, 03:36 AM
Thanks for the input guys. I really do appreciate it. The advice to use double characters is also a good one. I'll put some thought into it. Although, thought I'd share this.. Stepping in the right direction:
/(^|\s|\()[*]([^*]+)[^ ]?[*][[:punct:]]*?/i
Basically says (from my basic understanding of Regex thus far): A string that either starts with the beginning of the line ("^"), or whitespace ("\s"), or an opening paren ("\("), followed by an asterisk ("[*]"), followed by one or more of any character that is not the asterisk ("[^*]+"), optionally followed by one character that is not a space ("[^ ]?"), followed by the closing asterisk ("[*]"), followed by zero or more of any punctuation mark - un-greedy ("[[:punct:]]*?"). Whole expression is case-insensitive (the "i" after the expression).
In the contextual PHP code, I strip HTML tags before running this expression, and create links from URLs after.. This all pretty much does what I need it to do.. Although, it still feels like it could break fairly easily. I may consider the double character syntax method.
After reading this all again, I feel I may not need the punctuation syntax, and I may want to add another asterisk negation in where I'm negating the space character..
Still looking for input. :) Regex is fun.
UPDATE: Even closer :) :
/(^|\s)[*]([^*]+[^ *])[*]/i
Esherido
April 19th, 2009, 07:15 PM
UPDATE: Even closer :) :
/(^|\s)[*]([^*]+[^ *])[*]/i
That looks pretty good. The one thing I feel I should add is that you should be able to escape the * with a backslash instead of wrapping it inside a square-bracket character class, which requires extra processing by the regex engine. (And PHP isn't exactly a speed demon when it comes to regex.)
/(^|\s)\*([^*]+[^ *])\*/i
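For anyone wanting to try it out, here is a quick sketch of that expression applied to the example string from the first post. It uses Python's re module rather than PHP's preg functions (an assumption made for illustration only), and STRONG and bold are made-up names.

```python
import re

# The expression above, carried over unchanged apart from the delimiters.
STRONG = re.compile(r"(^|\s)\*([^*]+[^ *])\*", re.IGNORECASE)

def bold(text):
    # \1 puts back the whitespace (or line start) the match consumed.
    return STRONG.sub(r"\1<strong>\2</strong>", text)

print(bold("*know who, http://dude.com .. *that's* right, /bro/!"))
```

The leading asterisk is skipped because the text before the next asterisk ends in a space, which [^ *] refuses to match. Note that the phrase has to be at least two characters long, so something like *a* is left untouched.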
lorren.biffin
April 21st, 2009, 01:37 PM
Ah, great call. :) I'll pop it in tonight.
Thanks.
ajcates
April 26th, 2009, 07:15 AM
w00t! Go regex!
http://www.regular-expressions.info/reference.html
http://www.regular-expressions.info/refadv.html
I believe they need to teach it in middle school. It is so handy when writing anything.