The regexes will work fine (I even voted up Martin Brown's answer), but they are expensive (and personally I find any pattern longer than a couple of characters prohibitively obtuse).
This function
string AddSpacesToSentence(string text, bool preserveAcronyms)
{
    if (string.IsNullOrWhiteSpace(text))
        return string.Empty;
    StringBuilder newText = new StringBuilder(text.Length * 2);
    newText.Append(text[0]);
    for (int i = 1; i < text.Length; i++)
    {
        if (char.IsUpper(text[i]))
            if ((text[i - 1] != ' ' && !char.IsUpper(text[i - 1])) ||
                (preserveAcronyms && char.IsUpper(text[i - 1]) &&
                 i < text.Length - 1 && !char.IsUpper(text[i + 1])))
                newText.Append(' ');
        newText.Append(text[i]);
    }
    return newText.ToString();
}
will do it 100,000 times in 2,968,750 ticks, whereas the regex takes 25,000,000 ticks (and that's with the regex compiled).
It's better, for a given value of better (i.e. faster); however, it's more code to maintain. "Better" is often a compromise between competing requirements.
Update
It's a good long while since I looked at this, and I just realised the timings haven't been updated since the code changed (it only changed a little).
On a string with 'Abbbbbbbbb' repeated 100 times (i.e. 1,000 characters), a run of 100,000 conversions takes the hand-coded function 4,517,177 ticks and the regex below 59,435,719 ticks, so the hand-coded function runs in 7.6% of the time the regex takes.
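If you want to repeat this kind of comparison yourself, here is a rough Python sketch using timeit. It is not the original .NET benchmark: the function names, the regex, and the loop count are my own assumptions, and the ratio you get in CPython need not resemble the tick counts above (it may even go the other way). The point is only the method: check that both versions agree, then time them on the same input.

```python
import re
import timeit


def add_spaces(text):
    """Hand-coded version: space before each capital not already preceded by one."""
    if not text or text.isspace():
        return ""
    out = [text[0]]
    for prev, ch in zip(text, text[1:]):
        if ch.isupper() and prev != " ":
            out.append(" ")
        out.append(ch)
    return "".join(out)


# Assumed regex equivalent (ASCII capitals only): a capital preceded by a non-space.
CAPITAL = re.compile(r"(?<=[^ ])([A-Z])")


def regex_add_spaces(text):
    if not text or text.isspace():
        return ""
    return CAPITAL.sub(r" \1", text)


def compare(loops=1000):
    """Time both versions on the 1,000-character sample described above."""
    sample = "Abbbbbbbbb" * 100
    # Only compare timings if the two versions produce the same result.
    assert add_spaces(sample) == regex_add_spaces(sample)
    hand = timeit.timeit(lambda: add_spaces(sample), number=loops)
    rx = timeit.timeit(lambda: regex_add_spaces(sample), number=loops)
    return hand, rx


if __name__ == "__main__":
    hand, rx = compare()
    print(f"hand-coded: {hand:.3f}s  regex: {rx:.3f}s")
```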
Update 2
Will it take Acronyms into account? It will now!
The logic of the if statement is fairly obscure; as you can see, expanding it to this ...
if (char.IsUpper(text[i]))
    if (char.IsUpper(text[i - 1]))
        if (preserveAcronyms && i < text.Length - 1 && !char.IsUpper(text[i + 1]))
            newText.Append(' ');
        else ;
    else if (text[i - 1] != ' ')
        newText.Append(' ');
... doesn't help at all!
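What does help a little is naming the two halves of the condition. Here is a minimal Python sketch of the same logic; the function name and the `starts_word` / `ends_acronym` labels are mine, not part of the C# above, and it is a port for illustration rather than a replacement.

```python
def add_spaces_to_sentence(text, preserve_acronyms=False):
    """Python port of the acronym-aware logic, with the condition split in two."""
    if not text or text.isspace():
        return ""
    out = [text[0]]
    for i in range(1, len(text)):
        ch, prev = text[i], text[i - 1]
        if ch.isupper():
            # A capital following a lower-case letter (or other non-space) starts a word.
            starts_word = prev != " " and not prev.isupper()
            # A capital following a capital, with a non-capital next, ends an acronym.
            ends_acronym = (
                preserve_acronyms
                and prev.isupper()
                and i < len(text) - 1
                and not text[i + 1].isupper()
            )
            if starts_word or ends_acronym:
                out.append(" ")
        out.append(ch)
    return "".join(out)


print(add_spaces_to_sentence("ParseHTMLDocument", True))   # Parse HTML Document
print(add_spaces_to_sentence("ParseHTMLDocument", False))  # Parse HTMLDocument
```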
Here's the original simple method that doesn't worry about acronyms:
string AddSpacesToSentence(string text)
{
    if (string.IsNullOrWhiteSpace(text))
        return "";
    StringBuilder newText = new StringBuilder(text.Length * 2);
    newText.Append(text[0]);
    for (int i = 1; i < text.Length; i++)
    {
        if (char.IsUpper(text[i]) && text[i - 1] != ' ')
            newText.Append(' ');
        newText.Append(text[i]);
    }
    return newText.ToString();
}
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, the other solutions will do fine. However, note that if it can be done without a regular expression, that's the best way to go about it.
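One caveat worth knowing: str.isalnum is Unicode-aware, so accented letters and non-Latin digits pass the filter too. If you only want ASCII letters and digits, you can add an str.isascii check (available from Python 3.7). A small sketch; the helper name and the `ascii_only` flag are made up for illustration:

```python
def strip_non_alnum(text, ascii_only=False):
    """Keep only alphanumeric characters; optionally restrict to ASCII."""
    if ascii_only:
        return "".join(ch for ch in text if ch.isascii() and ch.isalnum())
    return "".join(ch for ch in text if ch.isalnum())


print(strip_non_alnum("Special $#! characters   spaces 888323"))
# Specialcharactersspaces888323
print(strip_non_alnum("café_1"))                   # café1
print(strip_non_alnum("café_1", ascii_only=True))  # caf1
```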
Best Answer
Assuming these numbers can only be separated by one space or hyphen, here are two ideas:
By use of \G to chain matches: see this demo at regexstorm or your updated sample, and replace with x$1 (the capture of group 1). This will first find a number between 12 and 19 characters long and chain matches from there. The second lookahead checks at each matching digit whether there are at least four digits ahead.
Similar to your current pattern: demo at regexstorm or updated .NET demo, replacing just with x (like your current code). This will do the whole lookaround check at each digit found and is probably more costly.
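For comparison outside .NET, here is a rough Python sketch of the same masking idea. Python's re module has no \G, so this swaps the chaining for a replacement callback: find the whole candidate number first, then mask every digit except the last four. The pattern is my own approximation of the stated assumptions (12 to 19 digits, at most one space or hyphen between digits), not the pattern from the demos, and the names are made up. The two fixed-width lookbehinds are there so that a longer run such as `0 1234567890123456789` is not matched from its second digit.

```python
import re

# Assumed candidate: 12-19 digits, each optionally preceded by one space or hyphen,
# not directly attached to further digits on either side.
CANDIDATE = re.compile(r"(?<!\d)(?<!\d[ -])\d(?:[ -]?\d){11,18}(?![ -]?\d)")


def _mask_match(match):
    """Replace all but the last four digits of one matched number with 'x'."""
    text = match.group(0)
    digits_left = sum(ch.isdigit() for ch in text)
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("x" if digits_left > 4 else ch)
            digits_left -= 1
        else:
            out.append(ch)  # keep separators as they are
    return "".join(out)


def mask_numbers(text):
    return CANDIDATE.sub(_mask_match, text)


print(mask_numbers("card 1234 5678 9012 3456 ok"))  # card xxxx xxxx xxxx 3456 ok
print(mask_numbers("0 1234567890123456789"))        # unchanged: 20 digits is too long
```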
(The atomic group at (?>[ -]?\d)* will prevent matching input such as 0 1234567890123456789.)
The reason your current regex did not work for the sample lies in (?<![\d-*]), whose purpose is to separate the whole number from text, but it just checks for one of the listed characters. Together with [\d*\s]){12,19} that could match the specified amount of digits or whitespace.
Besides, I would not use something like [\d-*\s]. In this case (.NET regex) there is no error, but it still looks ugly. An unescaped hyphen inside a character class is used to denote a character range. To match a literal hyphen, put it at the start or end of the character class or escape it with a backslash.
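The hyphen rule is easy to check with something runnable. Here is a small Python sketch (Python's re treats these simple cases the same way; the `matches` helper is made up for the example):

```python
import re

RANGE = re.compile(r"[a-c]")     # hyphen between two characters: range a..c
LEADING = re.compile(r"[-ac]")   # hyphen first: literal '-'
TRAILING = re.compile(r"[ac-]")  # hyphen last: literal '-'
ESCAPED = re.compile(r"[a\-c]")  # escaped hyphen: literal '-'


def matches(pattern, ch):
    """True if the single character ch is accepted by the character class."""
    return pattern.fullmatch(ch) is not None


print(matches(RANGE, "b"), matches(RANGE, "-"))      # True False
print(matches(LEADING, "b"), matches(LEADING, "-"))  # False True
print(matches(ESCAPED, "b"), matches(ESCAPED, "-"))  # False True
```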