c++ - 用于匹配 C++ 字符串常量的正则表达式

coder 2024-02-03 原文

我目前正在开发一个 C++ 预处理器，我需要匹配超过 0 个字母的字符串常量，如 "hey I'm a string .
我目前正在这里使用这个 \"([^\\\"]+|\\.)+\"但它在我的一个测试用例中失败了。

测试用例:

std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";

预期输出:

std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");

在第二个我反而得到

std::cout << String("He said: \")bananas\"String(" << ")...";

简短的重现代码(使用 AR.3 的正则表达式):

std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");

最佳答案

对源文件进行词法分析对于正则表达式来说是一项很好的工作。但是对于这样的任务，让我们使用比 std::regex 更好的正则表达式引擎.让我们首先使用 PCRE(或 boost::regex)。在这篇文章的最后，我将展示您可以使用功能较少的引擎做什么。

我们只需要进行部分词法分析，忽略所有不会影响字符串文字的无法识别的标记。我们需要处理的是:

单行评论

多行注释

字符字面量

字符串文字

我们将使用扩展( x )选项，它忽略模式中的空格。

注释

这是什么 [lex.comment]说:

The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]

# singleline comment
// .* (*SKIP)(*FAIL)

# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

十分简单。如果您在那里匹配任何内容，只需 (*SKIP)(*FAIL) - 意思是你扔掉火柴。 (?s: .*? )适用于 s (单行)修饰符 .元字符，这意味着它可以匹配换行符。

字 rune 字

这是来自 [lex.ccon] 的语法:

 character-literal:  
    encoding-prefix(opt) ’ c-char-sequence ’
  encoding-prefix:
    one of u8 u U L
  c-char-sequence:
    c-char
    c-char-sequence c-char
  c-char:
    any member of the source character set except the single-quote ’, backslash \, or new-line character
    escape-sequence
    universal-character-name
  escape-sequence:
    simple-escape-sequence
    octal-escape-sequence
    hexadecimal-escape-sequence
  simple-escape-sequence: one of \’ \" \? \\ \a \b \f \n \r \t \v
  octal-escape-sequence:
    \ octal-digit
    \ octal-digit octal-digit
    \ octal-digit octal-digit octal-digit
  hexadecimal-escape-sequence:
    \x hexadecimal-digit
    hexadecimal-escape-sequence hexadecimal-digit

让我们先定义一些东西，稍后我们将需要它们:

(?(DEFINE)
  (?<prefix> (?:u8?|U|L)? )
  (?<escape> \\ (?:
    ['"?\\abfnrtv]         # simple escape
    | [0-7]{1,3}           # octal escape
    | x [0-9a-fA-F]{1,2}   # hex escape
    | u [0-9a-fA-F]{4}     # universal character name
    | U [0-9a-fA-F]{8}     # universal character name
  ))
)

prefix被定义为可选 u8 , u , U或 L

escape是按照标准定义的，除了我已经合并了 universal-character-name为简单起见

一旦我们有了这些，字 rune 字就非常简单了:

(?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

我们用 (*SKIP)(*FAIL) 扔掉它

简单的字符串

它们的定义方式几乎与字 rune 字相同。这是[lex.string]的一部分:

  string-literal:
    encoding-prefix(opt) " s-char-sequence(opt) "
    encoding-prefix(opt) R raw-string
  s-char-sequence:
    s-char
    s-char-sequence s-char
  s-char:
    any member of the source character set except the double-quote ", backslash \, or new-line character
    escape-sequence
    universal-character-name

这将反射(reflect)字 rune 字:

(?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

区别在于:

这次的字符序列是可选的( * 而不是 + )

未转义时不允许使用双引号而不是单引号

我们实际上不会扔掉它:)

原始字符串

这是原始字符串部分:

  raw-string:
    " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
  r-char-sequence:
    r-char
    r-char-sequence r-char
  r-char:
    any member of the source character set, except a right parenthesis )
    followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
  d-char-sequence:
    d-char
    d-char-sequence d-char
  d-char:
    any member of the basic source character set except:
    space, the left parenthesis (, the right parenthesis ), the backslash \,
    and the control characters representing horizontal tab,
    vertical tab, form feed, and newline.

这个的正则表达式是:

(?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "

[^ ()\\\t\x0B\r\n]*是分隔符中允许的字符集 ( d-char )

\k<delimiter>指之前匹配的分隔符

完整的图案

完整的模式是:

(?(DEFINE)
  (?<prefix> (?:u8?|U|L)? )
  (?<escape> \\ (?:
    ['"?\\abfnrtv]         # simple escape
    | [0-7]{1,3}           # octal escape
    | x [0-9a-fA-F]{1,2}   # hex escape
    | u [0-9a-fA-F]{4}     # universal character name
    | U [0-9a-fA-F]{8}     # universal character name
  ))
)

# singleline comment
// .* (*SKIP)(*FAIL)

# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

# character literal
| (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

# standard string
| (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

# raw string
| (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "

见 the demo here .
boost::regex
这是一个使用 boost::regex 的简单演示程序:

#include <string>
#include <iostream>
#include <boost/regex.hpp>

static void test()
{
    boost::regex re(R"regex(
        (?(DEFINE)
          (?<prefix> (?:u8?|U|L) )
          (?<escape> \\ (?:
            ['"?\\abfnrtv]         # simple escape
            | [0-7]{1,3}           # octal escape
            | x [0-9a-fA-F]{1,2}   # hex escape
            | u [0-9a-fA-F]{4}     # universal character name
            | U [0-9a-fA-F]{8}     # universal character name
          ))
        )

        # singleline comment
        // .* (*SKIP)(*FAIL)

        # multiline comment
        | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

        # character literal
        | (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

        # standard string
        | (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "

        # raw string
        | (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
    )regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);

    std::string subject(R"subject(
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";

"" // empty string
'"' // character literal

// this is "a string literal" in a comment
/* this is
   "also inside"
   //a comment */

// and this /*
"is not in a comment"
// */

"this is a /* string */ with nested // comments"
    )subject");

    std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
}

int main(int argc, char **argv)
{
    try
    {
        test();
    }
    catch(std::exception ex)
    {
        std::cerr << ex.what() << std::endl;
    }

    return 0;
}

(我禁用了语法高亮，因为它在这段代码上很疯狂)

出于某种原因，我不得不接受 ?量词出 prefix (将 (?<prefix> (?:u8?|U|L)? ) 更改为 (?<prefix> (?:u8?|U|L) ) 并将 (?&prefix) 更改为 (?&prefix)? )以使模式起作用。我相信这是 boost::regex 中的一个错误，因为 PCRE 和 Perl 在原始模式下都可以正常工作。

如果我们手头没有漂亮的正则表达式引擎怎么办？

请注意，虽然此模式在技术上使用递归，但它从不嵌套递归调用。通过将相关的可重用部分内联到主模式中，可以避免递归。

可以以降低性能为代价避免使用其他一些结构。我们可以安全地替换原子组 (?> ... )与正常组(?: ... )如果我们不嵌套量词以避免 catastrophic backtracking .

我们也可以避免(*SKIP)(*FAIL)如果我们在替换函数中添加一行逻辑:所有要跳过的替代项都分组在一个捕获组中。如果捕获组匹配，则忽略匹配。如果不是，那么它是一个字符串文字。

所有这一切都意味着我们可以在 JavaScript 中实现它，它拥有你能找到的最简单的正则表达式引擎之一，代价是打破 DRY 规则并使模式难以辨认。一旦转换，正则表达式就变成了这个怪物:

(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"

这是您可以玩的交互式演示:

function run() {
    var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;
    
    var input = document.getElementById("input").value;
    var output = input.replace(re, function(m, ignore) {
        return ignore ? m : "String(" + m + ")";
    });
    document.getElementById("output").innerText = output;
}

document.getElementById("input").addEventListener("input", run);
run();

<h2>Input:</h2>
<textarea id="input" style="width: 100%; height: 50px;">
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";

"" // empty string
'"' // character literal

// this is "a string literal" in a comment
/* this is
   "also inside"
   //a comment */

// and this /*
"is not in a comment"
// */

"this is a /* string */ with nested // comments"
</textarea>
<h2>Output:</h2>
<pre id="output"></pre>

关于c++ - 用于匹配 C++ 字符串常量的正则表达式，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/41909225/

amp 43 code 34 lt c++regex string c-preprocessor

有关c++ - 用于匹配 C++ 字符串常量的正则表达式的更多相关文章

ruby - 如何从 ruby 中的字符串运行任意对象方法？ - 2
总的来说，我对ruby还比较陌生，我正在为我正在创建的对象编写一些rspec测试用例。许多测试用例都非常基础，我只是想确保正确填充和返回值。我想知道是否有办法使用循环结构来执行此操作。不必为我要测试的每个方法都设置一个assertEquals。例如:describeitem,"TestingtheItem"doit"willhaveanullvaluetostart"doitem=Item.new#HereIcoulddotheitem.name.shouldbe_nil#thenIcoulddoitem.category.shouldbe_nilendend但我想要一些方法来使用
Ruby 解析字符串 - 2
我有一个字符串input="maybe(thisis|thatwas)some((nice|ugly)(day|night)|(strange(weather|time)))"Ruby中解析该字符串的最佳方法是什么？我的意思是脚本应该能够像这样构建句子:maybethisissomeuglynightmaybethatwassomenicenightmaybethiswassomestrangetime等等，你明白了......我应该一个字符一个字符地读取字符串并构建一个带有堆栈的状态机来存储括号值以供以后计算，还是有更好的方法？也许为此目的准备了一个开箱即用的库？
ruby-on-rails - 在 Rails 中将文件大小字符串转换为等效千字节 - 2
我的目标是转换表单输入，例如“100兆字节”或“1GB”，并将其转换为我可以存储在数据库中的文件大小(以千字节为单位)。目前，我有这个:defquota_convert@regex=/([0-9]+)(.*)s/@sizes=%w{kilobytemegabytegigabyte}m=self.quota.match(@regex)if@sizes.include?m[2]eval("self.quota=#{m[1]}.#{m[2]}")endend这有效，但前提是输入是倍数(“gigabytes”，而不是“gigabyte”)并且由于使用了eval看起来疯狂不安全。所以，功能正常，
ruby-on-rails - unicode 字符串的长度 - 2
在我的Rails(2.3，Ruby1.8.7)应用程序中，我需要将字符串截断到一定长度。该字符串是unicode，在控制台中运行测试时，例如'א'.length，我意识到返回了双倍长度。我想要一个与编码无关的长度，以便对unicode字符串或latin1编码字符串进行相同的截断。我已经了解了Ruby的大部分unicode资料，但仍然有些一头雾水。应该如何解决这个问题？最佳答案 Rails有一个返回多字节字符的mb_chars方法。试试unicode_string.mb_chars.slice(0,50)
ruby - 将差异补丁应用于字符串/文件 - 2
对于具有离线功能的智能手机应用程序，我正在为Xml文件创建单向文本同步。我希望我的服务器将增量/差异(例如GNU差异补丁)发送到目标设备。这是计划:Time=0Server:hasversion_1ofXmlfile(~800kiB)Client:hasversion_1ofXmlfile(~800kiB)Time=1Server:hasversion_1andversion_2ofXmlfile(each~800kiB)computesdeltaoftheseversions(=patch)(~10kiB)sendspatchtoClient(~10kiBtransferred)Cl
ruby-on-rails - Rails 常用字符串(用于通知和错误信息等) - 2
大约一年前，我决定确保每个包含非唯一文本的Flash通知都将从模块中的方法中获取文本。我这样做的最初原因是为了避免一遍又一遍地输入相同的字符串。如果我想更改措辞，我可以在一个地方轻松完成，而且一遍又一遍地重复同一件事而出现拼写错误的可能性也会降低。我最终得到的是这样的:moduleMessagesdefformat_error_messages(errors)errors.map{|attribute,message|"Error:#{attribute.to_s.titleize}#{message}."}enddeferror_message_could_not_find(obje
ruby - 如何以所有可能的方式将字符串拆分为长度最多为 3 的连续子字符串？ - 2
我试图获取一个长度在1到10之间的字符串，并输出将字符串分解为大小为1、2或3的连续子字符串的所有可能方式。例如:输入:123456将整数分割成单个字符，然后继续查找组合。该代码将返回以下所有数组。[1,2,3,4,5,6][12,3,4,5,6][1,23,4,5,6][1,2,34,5,6][1,2,3,45,6][1,2,3,4,56][12,34,5,6][12,3,45,6][12,3,4,56][1,23,45,6][1,2,34,56][1,23,4,56][12,34,56][123,4,5,6][1,234,5,6][1,2,345,6][1,2,3,456][123
ruby-on-rails - 未初始化的常量 Psych::Syck (NameError) - 2
在我的gem中，我需要yaml并且在我的本地计算机上运行良好。但是在将我的gem推送到rubygems.org之后，当我尝试使用我的gem时，我收到一条错误消息=>"uninitializedconstantPsych::Syck(NameError)"谁能帮我解决这个问题？附言RubyVersion=>ruby1.9.2,GemVersion=>1.6.2,Bundlerversion=>1.0.15 最佳答案经过几个小时的研究，我发现=>“YAML使用未维护的Syck库，而Psych使用现代的LibYAML”因此，为了解决
ruby - 什么是填充的 Base64 编码字符串以及如何在 ruby 中生成它们？ - 2
我正在使用的第三方API的文档状态:"[O]urAPIonlyacceptspaddedBase64encodedstrings."什么是“填充的Base64编码字符串”以及如何在Ruby中生成它们。下面的代码是我第一次尝试创建转换为Base64的JSON格式数据。xa=Base64.encode64(a.to_json) 最佳答案他们说的padding其实就是Base64本身的一部分。它是末尾的“=”和“==”。Base64将3个字节的数据包编码为4个编码字符。所以如果你的输入数据有长度n和n%3=1=>"=="末尾用于填充n%
ruby - 如何使用文字标量样式在 YAML 中转储字符串？ - 2
我有一大串格式化数据(例如JSON)，我想使用Psychinruby同时保留格式转储到YAML。基本上，我希望JSON使用literalstyle出现在YAML中:---json:|{"page":1,"results":["item","another"],"total_pages":0}但是，当我使用YAML.dump时，它不使用文字样式。我得到这样的东西:---json:!"{\n\"page\":1,\n\"results\":[\n\"item\",\"another\"\n],\n\"total_pages\":0\n}\n"我如何告诉Psych以想要的样式转储标量？解

c++ - 用于匹配 C++ 字符串常量的正则表达式

有关c++ - 用于匹配 C++ 字符串常量的正则表达式的更多相关文章

随机推荐