C++11 regex replace is indeed rather slow, as of yet, at least. PCRE performs much better in terms of pattern matching speed, however, PCRECPP provides very limited means for regular expression based substitution, citing the man page:
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert text matching corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text.
This is really poor, compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular expression based substitution in a fashion that is close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP:
Command string syntax
Command syntax follows Perl s/pattern/substitute/[options]
convention. Any character (except the backslash \
) can be used as a
delimiter, not just /
, but make sure that delimiter is escaped with
a backslash (\
) if used in pattern
, substitute
or options
substrings, e.g.:
s/\\/\//g
to replace all backslashes with forward ones
Remember to double backslashes in C++ code, unless using raw string
literal (see string literal):
pcrscpp::replace rx("s/\\\\/\\//g");
Pattern string syntax
Pattern string is passed directly to pcre*_compile
, and thus has to
follow PCRE syntax as described in PCRE documentation.
Substitute string syntax
Substitute string backreferencing syntax is similar to Perl's:
$1
... $n
: nth capturing subpattern matched.
$&
and $0
: the whole match
${label}
: labled subpattern matched. label
is up to 32 alphanumerical +
underscore characters ('A'-'Z'
,'a'-'z'
,'0'-'9'
,'_'
),
first character must be alphabetical
$`
and $'
(backtick and tick) refer to the areas of the subject before
and after the match, respectively. As in Perl, the unmodified
subject is used, even if a global substitution previously matched.
Also, following escape sequences get recognized:
\n
: newline
\r
: carriage return
\t
: horizontal tab
\f
: form feed
\b
: backspace
\a
: alarm, bell
\e
: escape
\0
: binary zero
Any other escape sequence \<char>
, is interpreted as <char>
,
meaning that you have to escape backslashes too
Options string syntax
In Perl-like manner, options string is a sequence of allowed modifier
letters. PCRSCPP recognizes following modifiers:
- Perl-compatible flags
g
: global replace, not just the first match
i
: case insensitive match
(PCRE_CASELESS)
m
: multi-line mode: ^
and $
additionally match positions
after and before newlines, respectively
(PCRE_MULTILINE)
s
: let the scope of the .
metacharacter include newlines
(treat newlines as ordinary characters)
(PCRE_DOTALL)
x
: allow extended regular expression syntax,
enabling whitespace and comments in complex patterns
(PCRE_EXTENDED)
- PHP-compatible flags
A
: "anchor" pattern: look only for "anchored" matches: ones that
start with zero offset. In single-line mode is identical to
prefixing all pattern alternative branches with ^
(PCRE_ANCHORED)
D
: treat dollar $
as subject end assertion only, overriding the default:
end, or immediately before a newline at the end.
Ignored in multi-line mode
(PCRE_DOLLAR_ENDONLY)
U
: invert *
and +
greediness logic: make ungreedy by default,
?
switches back to greedy. (?U)
and (?-U)
in-pattern switches
remain unaffected
(PCRE_UNGREEDY)
u
: Unicode mode. Treat pattern and subject as UTF8/UTF16/UTF32 string.
Unlike in PHP, also affects newlines, \R
, \d
, \w
, etc. matching
((PCRE_UTF8/PCRE_UTF16/PCRE_UTF32) | PCRE_NEWLINE_ANY
| PCRE_BSR_UNICODE | PCRE_UCP)
- PCRSCPP own flags:
N
: skip empty matches
(PCRE_NOTEMPTY)
T
: treat substitute as a trivial string, i.e., make no backreference
and escape sequences interpretation
n
: discard non-matching portions of the string to replace
Note: PCRSCPP does not automatically add newlines,
the replacement result is plain concatenation of matches,
be specifically aware of this in multiline mode
I wrote a simple speed test code, which stores a 10x copy of file "move.sh" and tests regex performance on resulting string:
#include <pcrscpp.h>
#include <string>
#include <iostream>
#include <fstream>
#include <regex>
#include <chrono>
int main (int argc, char *argv[]) {
const std::string file_name("move.sh");
pcrscpp::replace pcrscpp_rx(R"del(s/(?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n)/$1\n$2\n/Dgn)del");
std::regex std_rx (R"del((?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n))del");
std::ifstream file (file_name);
if (!file.is_open ()) {
std::cerr << "Unable to open file " << file_name << std::endl;
return 1;
}
std::string buffer;
{
file.seekg(0, std::ios::end);
size_t size = file.tellg();
file.seekg(0);
if (size > 0) {
buffer.resize(size);
file.read(&buffer[0], size);
buffer.resize(size - 1); // strip '\0'
}
}
file.close();
std::string bigstring;
bigstring.reserve(10*buffer.size());
for (std::string::size_type i = 0; i < 10; i++)
bigstring.append(buffer);
int n = 10;
std::cout << "Running tests " << n << " times: be patient..." << std::endl;
std::chrono::high_resolution_clock::duration std_regex_duration, pcrscpp_duration;
std::chrono::high_resolution_clock::time_point t1, t2;
std::string result1, result2;
for (int i = 0; i < n; i++) {
// clear result
std::string().swap(result1);
t1 = std::chrono::high_resolution_clock::now();
result1 = std::regex_replace (bigstring, std_rx, "$1\\n$2", std::regex_constants::format_no_copy);
t2 = std::chrono::high_resolution_clock::now();
std_regex_duration = (std_regex_duration*i + (t2 - t1)) / (i + 1);
// clear result
std::string().swap(result2);
t1 = std::chrono::high_resolution_clock::now();
result2 = pcrscpp_rx.replace_copy (bigstring);
t2 = std::chrono::high_resolution_clock::now();
pcrscpp_duration = (pcrscpp_duration*i + (t2 - t1)) / (i + 1);
}
std::cout << "Time taken by std::regex_replace: "
<< std_regex_duration.count()
<< " ms" << std::endl
<< "Result size: " << result1.size() << std::endl;
std::cout << "Time taken by pcrscpp::replace: "
<< pcrscpp_duration.count()
<< " ms" << std::endl
<< "Result size: " << result2.size() << std::endl;
return 0;
}
(note that std
and pcrscpp
regular expressions do the same here, the trailing newline in expression for pcrscpp
is due to std::regex_replace
not stripping newlines despite std::regex_constants::format_no_copy
)
and launched it on a large (20.9 MB) shell move script:
Running tests 10 times: be patient...
Time taken by std::regex_replace: 12090771487 ms
Result size: 101087330
Time taken by pcrscpp::replace: 5910315642 ms
Result size: 101087330
As you can see, PCRSCPP is more than 2x faster. And I expect this gap to grow with pattern complexity increase, since PCRE deals with complicated patterns much better. I originally wrote a wrapper for myself, but I think it can be useful for others too.
Regards,
Alex
Best Answer
Notice
See also this answer: https://stackoverflow.com/a/21708215 which was the base for the EDIT 2 at the bottom here.
I've augmented the loop to 1000000 to get a better timing measure.
This is my Python timing:
Here's an equivalent of your code, just a bit prettier:
Timing:
This is an optimization to avoid construction/allocation of vector and string objects:
Timing:
This is near a 100% performance improvement.
The vector is created before the loop, and can grow its memory in the first iteration. Afterwards there's no memory deallocation by
clear()
, the vector maintains the memory and construct strings in-place.Another performance increase would be to avoid construction/destruction
std::string
completely, and hence, allocation/deallocation of its objects.This is a tentative in this direction:
Timing:
An ultimate improvement would be to have a
std::vector
ofconst char *
as return, where each char pointer would point to a substring inside the originals
c string itself. The problem is that, you can't do that because each of them would not be null terminated (for this, see usage of C++1ystring_ref
in a later sample).This last improvement could also be achieved with this:
I've built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.
Boost.Regex
This is the
boost::regex
timing for the c string arguments sample:Same code,
boost::regex
andstd::regex
interface in this sample are identical, just needed to change the namespace and include.Best wishes for it to get better over time, C++ stdlib regex implementations are in their infancy.
EDIT
For sake of completion, I've tried this (the above mentioned "ultimate improvement" suggestion) and it didn't improved performance of the equivalent
std::vector<std::string> &v
version in anything:This has to do with the array_ref and string_ref proposal. Here's a sample code using it:
It will also be cheaper to return a vector of
string_ref
rather thanstring
copies for the case ofsplit
with vector return.EDIT 2
This new solution is able to get output by return. I have used Marshall Clow's
string_view
(string_ref
got renamed) libc++ implementation found at https://github.com/mclow/string_view.Timing:
Note how faster this is compared to previous results. Of course, it's not filling a
vector
inside the loop (nor matching anything in advance probably too), but you get a range anyway, which you can range over with range-basedfor
, or even use it to fill avector
.As ranging over the
iterator_range
createsstring_view
s over an originalstring
(or a null terminated string), this gets very lightweight, never generating unnecessary string allocations.Just to compare using this
split
implementation but actually filling avector
we could do this:This uses boost range copy algorithm to fill the vector in each iteration, the timing is:
As can be seen, no much difference in comparison with the optimized
string_view
output param version.Note also there's a proposal for a
std::split
that would work like this.