C++ – Increase Regex Replace Performance

c++performancepythonregexreplace

I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and remove the XML tags out of them. I've succeeded doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing the regex_replace function over its contents takes around 6 seconds on my PC. I can bring this down to 2 by adding some compiler optimization flags. Using Python, however, I can get it done it less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?

My C++ code:

std::regex xml_tags_regex("<[^>]*>");

for (std::vector<std::string>::iterator it = _files.begin(); it != 
        _files.end(); it++) {

    std::ifstream file(*it);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();

    std::string buffer(size, ' ');

    file.seekg(0);
    file.read(&buffer[0], size);

    buffer = regex_replace(buffer, xml_tags_regex, "");

    file.close();
}

My Python code:

regex = re.compile('<[^>]*>')

for filename in filenames:
    with open(filename) as f:
        content = f.read()
        content = regex.sub('', content)

P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word or character by character slowed it down considerably.

Best Answer

C++11 regex replace is indeed rather slow, as of yet, at least. PCRE performs much better in terms of pattern matching speed, however, PCRECPP provides very limited means for regular expression based substitution, citing the man page:

You can replace the first match of "pattern" in "str" with "rewrite". Within "rewrite", backslash-escaped digits (\1 to \9) can be used to insert text matching corresponding parenthesized group from the pattern. \0 in "rewrite" refers to the entire matching text.

This is really poor, compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular expression based substitution in a fashion that is close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP:

Command string syntax

Command syntax follows Perl s/pattern/substitute/[options] convention. Any character (except the backslash \) can be used as a delimiter, not just /, but make sure that delimiter is escaped with a backslash (\) if used in pattern, substitute or options substrings, e.g.:

s/\\/\//g to replace all backslashes with forward ones

Remember to double backslashes in C++ code, unless using raw string literal (see string literal):

pcrscpp::replace rx("s/\\\\/\\//g");

Pattern string syntax

Pattern string is passed directly to pcre*_compile, and thus has to follow PCRE syntax as described in PCRE documentation.

Substitute string syntax

Substitute string backreferencing syntax is similar to Perl's:

$1 ... $n: nth capturing subpattern matched.

$& and $0: the whole match

${label} : labled subpattern matched. label is up to 32 alphanumerical + underscore characters ('A'-'Z','a'-'z','0'-'9','_'), first character must be alphabetical

$` and $' (backtick and tick) refer to the areas of the subject before and after the match, respectively. As in Perl, the unmodified subject is used, even if a global substitution previously matched.

Also, following escape sequences get recognized:

\n: newline

\r: carriage return

\t: horizontal tab

\f: form feed

\b: backspace

\a: alarm, bell

\e: escape

\0: binary zero

Any other escape sequence \<char>, is interpreted as <char>, meaning that you have to escape backslashes too

Options string syntax

In Perl-like manner, options string is a sequence of allowed modifier letters. PCRSCPP recognizes following modifiers:

Perl-compatible flags

g: global replace, not just the first match

i: case insensitive match
(PCRE_CASELESS)

m: multi-line mode: ^ and $ additionally match positions after and before newlines, respectively
(PCRE_MULTILINE)

s: let the scope of the . metacharacter include newlines (treat newlines as ordinary characters)
(PCRE_DOTALL)

x: allow extended regular expression syntax, enabling whitespace and comments in complex patterns
(PCRE_EXTENDED)

PHP-compatible flags

A: "anchor" pattern: look only for "anchored" matches: ones that start with zero offset. In single-line mode is identical to prefixing all pattern alternative branches with ^
(PCRE_ANCHORED)

D: treat dollar $ as subject end assertion only, overriding the default: end, or immediately before a newline at the end. Ignored in multi-line mode
(PCRE_DOLLAR_ENDONLY)

U: invert * and + greediness logic: make ungreedy by default, ? switches back to greedy. (?U) and (?-U) in-pattern switches remain unaffected
(PCRE_UNGREEDY)

u: Unicode mode. Treat pattern and subject as UTF8/UTF16/UTF32 string. Unlike in PHP, also affects newlines, \R, \d, \w, etc. matching
((PCRE_UTF8/PCRE_UTF16/PCRE_UTF32) | PCRE_NEWLINE_ANY | PCRE_BSR_UNICODE | PCRE_UCP)

PCRSCPP own flags:

N: skip empty matches
(PCRE_NOTEMPTY)

T: treat substitute as a trivial string, i.e., make no backreference and escape sequences interpretation

n: discard non-matching portions of the string to replace
Note: PCRSCPP does not automatically add newlines, the replacement result is plain concatenation of matches, be specifically aware of this in multiline mode

I wrote a simple speed test code, which stores a 10x copy of file "move.sh" and tests regex performance on resulting string:

#include <pcrscpp.h>
#include <string>
#include <iostream>
#include <fstream>
#include <regex>

#include <chrono>

int main (int argc, char *argv[]) {
    const std::string file_name("move.sh");
    pcrscpp::replace pcrscpp_rx(R"del(s/(?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n)/$1\n$2\n/Dgn)del");
    std::regex std_rx          (R"del((?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n))del");

    std::ifstream file (file_name);
    if (!file.is_open ()) {
        std::cerr << "Unable to open file " << file_name << std::endl;
        return 1;
    }
    std::string buffer;
    {
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();
        file.seekg(0);
        if (size > 0) {
            buffer.resize(size);
            file.read(&buffer[0], size);
            buffer.resize(size - 1); // strip '\0'
        }
    }
    file.close();
    std::string bigstring;
    bigstring.reserve(10*buffer.size());
    for (std::string::size_type i = 0; i < 10; i++)
        bigstring.append(buffer);

    int n = 10;

    std::cout << "Running tests " << n << " times: be patient..." << std::endl;

    std::chrono::high_resolution_clock::duration std_regex_duration, pcrscpp_duration;
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::string result1, result2;

    for (int i = 0; i < n; i++) {
        // clear result
        std::string().swap(result1);
        t1 = std::chrono::high_resolution_clock::now();
        result1 = std::regex_replace (bigstring, std_rx, "$1\\n$2", std::regex_constants::format_no_copy);
        t2 = std::chrono::high_resolution_clock::now();

        std_regex_duration = (std_regex_duration*i + (t2 - t1)) / (i + 1);

        // clear result
        std::string().swap(result2);

        t1 = std::chrono::high_resolution_clock::now();
        result2 = pcrscpp_rx.replace_copy (bigstring);
        t2 = std::chrono::high_resolution_clock::now();
        pcrscpp_duration = (pcrscpp_duration*i + (t2 - t1)) / (i + 1);
    }
    std::cout << "Time taken by std::regex_replace: "
              << std_regex_duration.count()
              << " ms" << std::endl
              << "Result size: " << result1.size() << std::endl;

    std::cout << "Time taken by pcrscpp::replace: "
              << pcrscpp_duration.count()
              << " ms" << std::endl
              << "Result size: " << result2.size() << std::endl;

    return 0;
}

(note that std and pcrscpp regular expressions do the same here, the trailing newline in expression for pcrscpp is due to std::regex_replace not stripping newlines despite std::regex_constants::format_no_copy)

and launched it on a large (20.9 MB) shell move script:

Running tests 10 times: be patient...
Time taken by std::regex_replace: 12090771487 ms
Result size: 101087330
Time taken by pcrscpp::replace: 5910315642 ms
Result size: 101087330

As you can see, PCRSCPP is more than 2x faster. And I expect this gap to grow with pattern complexity increase, since PCRE deals with complicated patterns much better. I originally wrote a wrapper for myself, but I think it can be useful for others too.

Regards, Alex

Related Solutions

Python – How to Perform Mass String Replace Efficiently

mydict = {"&y":"\033[0;30m",
          "&c":"\033[0;31m",
          "&b":"\033[0;32m",
          "&Y":"\033[0;33m",
          "&u":"\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

for k, v in mydict.iteritems():
    mystr = mystr.replace(k, v)

print mystr
The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog

I took the liberty of comparing a few solutions:

mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print 'Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr))

# My solution
t = time()
for x in range(rep):
    for k, v in mydict.items():
        mystr.replace(k, v)
    #print(mystr)
print '%-30s' % 'Tor fixed & variable dict', time()-t

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print '%-30s' % 'Peter fixed & variable dict', time()-t

# Claudiu
def multiple_replace(dict, text): 
    # Create a regular expression  from the dictionary keys
    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

t = time()
for x in range(rep):
    multiple_replace(mydict, mystr)
print '%-30s' % 'Claudio variable dict', time()-t

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print '%-30s' % 'Claudio fixed dict', time()-t

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print '%-30s' % 'Andrew Y variable dict', time()-t

# Andrew Y - fixed
def repl(s):
  return mydict["&"+s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print '%-30s' % 'Andrew Y fixed dict', time()-t

Results in Python 2.6

Running 10000 times with string length 490 and random inserts of lengths 0-20
Tor fixed & variable dict      1.04699993134
Peter fixed & variable dict    0.218999862671
Claudio variable dict          2.48400020599
Claudio fixed dict             0.0940001010895
Andrew Y variable dict         0.0309998989105
Andrew Y fixed dict            0.0310001373291

Both claudiu's and andrew's solutions kept going into 0, so I had to increase it to 10 000 runs.

I ran it in Python 3 (because of unicode) with replacements of chars from 39 to 1024 (38 is ampersand, so I didn't wanna include it). String length up to 10.000 including about 980 replacements with variable random inserts of length 0-20. The unicode values from 39 to 1024 causes characters of both 1 and 2 bytes length, which could affect some solutions.

mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr)))

# Tor Valamo - too long
#t = time()
#for x in range(rep):
#    for k, v in mydict.items():
#        mystr.replace(k, v)
#print('%-30s' % 'Tor fixed & variable dict', time()-t)

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time()-t)

# Claudiu - too long
#def multiple_replace(dict, text): 
#    # Create a regular expression  from the dictionary keys
#    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
#
#    # For each match, look-up corresponding value in dictionary
#    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
#
#t = time()
#for x in range(rep):
#    multiple_replace(mydict, mystr)
#print('%-30s' % 'Claudio variable dict', time()-t)

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time()-t)

# Separate setup for Andrew and gnibbler optimized dict
mydict = dict((k[1], v) for k, v in mydict.items())

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

def mysubst2(somestr, somedict):
  subs = somestr.split("&")
  return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict', time()-t)
t = time()
for x in range(rep):
    mysubst2(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict 2', time()-t)

# Andrew Y - fixed
def repl(s):
  return mydict[s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print('%-30s' % 'Andrew Y fixed dict', time()-t)

# gnibbler
t = time()
for x in range(rep):
    myparts = mystr.split("&")
    myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
    "".join(myparts)
print('%-30s' % 'gnibbler fixed & variable dict', time()-t)

Results:

Running 10000 times with string length 9491 and random inserts of lengths 0-20
Tor fixed & variable dict      0.0 # disqualified 329 secs
Peter fixed & variable dict    2.07799983025
Peter fixed dict               1.53100013733 
Claudio variable dict          0.0 # disqualified, 37 secs
Claudio fixed dict             1.5
Andrew Y variable dict         0.578000068665
Andrew Y variable dict 2       0.56299996376
Andrew Y fixed dict            0.56200003624
gnibbler fixed & variable dict 0.530999898911

(** Note that gnibbler's code uses a different dict, where keys don't have the '&' included. Andrew's code also uses this alternate dict, but it didn't make much of a difference, maybe just 0.01x speedup.)

C++11 – Why Regex is Slower Than Python

Notice

See also this answer: https://stackoverflow.com/a/21708215 which was the base for the EDIT 2 at the bottom here.

I've augmented the loop to 1000000 to get a better timing measure.

This is my Python timing:

real    0m2.038s
user    0m2.009s
sys     0m0.024s

Here's an equivalent of your code, just a bit prettier:

#include <regex>
#include <vector>
#include <string>

std::vector<std::string> split(const std::string &s, const std::regex &r)
{
    return {
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator()
    };
}

int main()
{
    const std::regex r(" +");
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r);
    return 0;
}

Timing:

real    0m5.786s
user    0m5.779s
sys     0m0.005s

This is an optimization to avoid construction/allocation of vector and string objects:

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
    auto rend = std::sregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);
    return 0;
}

Timing:

real    0m3.034s
user    0m3.029s
sys     0m0.004s

This is near a 100% performance improvement.

The vector is created before the loop, and can grow its memory in the first iteration. Afterwards there's no memory deallocation by clear(), the vector maintains the memory and construct strings in-place.

Another performance increase would be to avoid construction/destruction std::string completely, and hence, allocation/deallocation of its objects.

This is a tentative in this direction:

#include <regex>
#include <vector>
#include <string>

void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

Timing:

real    0m2.509s
user    0m2.503s
sys     0m0.004s

An ultimate improvement would be to have a std::vector of const char * as return, where each char pointer would point to a substring inside the original s c string itself. The problem is that, you can't do that because each of them would not be null terminated (for this, see usage of C++1y string_ref in a later sample).

This last improvement could also be achieved with this:

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v); // the constant string("a b c") should be optimized
                             // by the compiler. I got the same performance as
                             // if it was an object outside the loop
    return 0;
}

I've built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.

Boost.Regex

This is the boost::regex timing for the c string arguments sample:

real    0m1.284s
user    0m1.278s
sys     0m0.005s

Same code, boost::regex and std::regex interface in this sample are identical, just needed to change the namespace and include.

Best wishes for it to get better over time, C++ stdlib regex implementations are in their infancy.

EDIT

For sake of completion, I've tried this (the above mentioned "ultimate improvement" suggestion) and it didn't improved performance of the equivalent std::vector<std::string> &v version in anything:

#include <regex>
#include <vector>
#include <string>

template<typename Iterator> class intrusive_substring
{
private:
    Iterator begin_, end_;

public:
    intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}

    Iterator begin() {return begin_;}
    Iterator end() {return end_;}
};

using intrusive_char_substring = intrusive_substring<const char *>;

void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear(); // This can potentially be optimized away by the compiler because
               // the intrusive_char_substring destructor does nothing, so
               // resetting the internal size is the only thing to be done.
               // Formerly allocated memory is maintained.
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->second);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<intrusive_char_substring> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);

    return 0;
}

This has to do with the array_ref and string_ref proposal. Here's a sample code using it:

#include <regex>
#include <vector>
#include <string>
#include <string_ref>

void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->length());
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string_ref> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);

    return 0;
}

It will also be cheaper to return a vector of string_ref rather than string copies for the case of split with vector return.

EDIT 2

This new solution is able to get output by return. I have used Marshall Clow's string_view (string_ref got renamed) libc++ implementation found at https://github.com/mclow/string_view.

#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>

using namespace std;
using namespace std::experimental;
using namespace boost;

string_view stringfier(const cregex_token_iterator::value_type &match) {
    return {match.first, static_cast<size_t>(match.length())};
}

using string_view_iterator =
    transform_iterator<decltype(&stringfier), cregex_token_iterator>;

iterator_range<string_view_iterator> split(string_view s, const regex &r) {
    return {
        string_view_iterator(
            cregex_token_iterator(s.begin(), s.end(), r, -1),
            stringfier
        ),
        string_view_iterator()
    };
}

int main() {
    const regex r(" +");
    for (size_t i = 0; i < 1000000; ++i) {
        split("a b c", r);
    }
}

Timing:

real    0m0.385s
user    0m0.385s
sys     0m0.000s

Note how faster this is compared to previous results. Of course, it's not filling a vector inside the loop (nor matching anything in advance probably too), but you get a range anyway, which you can range over with range-based for, or even use it to fill a vector.

As ranging over the iterator_range creates string_views over an original string(or a null terminated string), this gets very lightweight, never generating unnecessary string allocations.

Just to compare using this split implementation but actually filling a vector we could do this:

int main() {
    const regex r(" +");
    vector<string_view> v;
    v.reserve(10);
    for (size_t i = 0; i < 1000000; ++i) {
        copy(split("a b c", r), back_inserter(v));
        v.clear();
    }
}

This uses boost range copy algorithm to fill the vector in each iteration, the timing is:

real    0m1.002s
user    0m0.997s
sys     0m0.004s

As can be seen, no much difference in comparison with the optimized string_view output param version.

Note also there's a proposal for a std::split that would work like this.

Best Answer

Command string syntax

Pattern string syntax

Substitute string syntax

Options string syntax

Related Solutions

Python – How to Perform Mass String Replace Efficiently

C++11 – Why Regex is Slower Than Python

Notice

Boost.Regex

EDIT

EDIT 2

Related Question