草庐IT

java - 正则表达式字符类双重否定中的错误?

coder 2023-05-19 原文

更新:在Java 11中,以下所述的错误似乎已修复
(可能它甚至在更早的时候就已修复,但是我不知道确切在哪个版本中使用。Bug report有关nhahtdh's answer中链接的类似问题,建议使用Java 9)。

TL; DR (修复前):
为什么[^\\D2][^[^0-9]2][^2[^0-9]]在Java中得到不同的结果?

用于测试的代码。您现在可以跳过它。

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));

可以说我需要正则表达式,它将接受以下字符
  • 不是数字,
  • ,但2除外。

  • 因此,此类正则表达式应表示除0134,...,9之外的每个字符。我至少可以用两种方式写出来,这将是所有不是数字的数字的总和:2:
  • [[^0-9]2]
  • [\\D2]

  • 这两个正则表达式均按预期工作
    match , [[^0-9]2] ,  [\D2]
    --------------------------
        x ,      true ,   true
        1 ,     false ,  false
        2 ,      true ,   true
        3 ,     false ,  false
        ^ ,      true ,   true
        [ ,      true ,   true
        ] ,      true ,   true
    
    现在让我说我要撤消接受的字符。 (所以我想接受除2以外的所有数字)
    我可以创建正则表达式,其中明确包含所有接受的字符,例如
  • [013-9]

  • 或尝试通过将其包装在另一个[^...]中来否定两个先前描述的正则表达式
  • [^\\D2]
  • [^[^0-9]2]甚至
  • [^2[^0-9]]

  • 但令我惊讶的是,只有前两个版本能按预期工作
    match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
    ------+--------------------+------------------------------------------- 
        x |      true ,   true |   false ,  false ,       true ,       true 
        1 |     false ,  false |    true ,   true ,      false ,       true 
        2 |      true ,   true |   false ,  false ,      false ,      false 
        3 |     false ,  false |    true ,   true ,      false ,       true 
        ^ |      true ,   true |   false ,  false ,       true ,       true 
        [ |      true ,   true |   false ,  false ,       true ,       true 
        ] |      true ,   true |   false ,  false ,       true ,       true 
    
    所以我的问题是,为什么[^[^0-9]2][^2[^0-9]]不像[^\D2]一样起作用?我可以以某种方式更正这些正则表达式,以便能够在其中使用[^0-9]吗?

    最佳答案

    Oracle的Pattern类实现的字符类解析代码中发生了一些奇怪的巫毒,如果您从Oracle网站下载JRE/JDK或使用OpenJDK,则该类随您一起。我还没有检查其他JVM(尤其是GNU Classpath)实现如何解析问题中的正则表达式。

    从这一点出发,对Pattern类及其内部工作的任何引用都严格限于Oracle的实现(引用实现)。

    如问题所示,需要花费一些时间来阅读和理解Pattern类如何解析嵌套否定。但是,我编写了一个program1以从Pattern对象(带有Reflection API)中提取信息,以查看编译结果。下面的输出来自在Java HotSpot Client VM版本1.7.0_51上运行我的程序。

    1:目前,该程序令人尴尬。完成并重构后,我将使用链接更新此帖子。

    [^0-9]
    Start. Start unanchored match (minLength=1)
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    

    这没什么奇怪的。
    [^[^0-9]]
    Start. Start unanchored match (minLength=1)
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    
    [^[^[^0-9]]]
    Start. Start unanchored match (minLength=1)
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    

    上面的后两种情况与[^0-9]编译到相同的程序,即违反直觉的
    [[^0-9]2]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    
    [\D2]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Ctype. Match POSIX character class DIGIT (US-ASCII)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    

    如问题中所述,在上述2种情况下没有什么奇怪的。
    [013-9]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0030][U+0031]
        01
      Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    
    [^\D2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
          Ctype. Match POSIX character class DIGIT (US-ASCII)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    

    如问题中所述,这两个案例按预期工作。但是,请注意引擎如何对第一个字符类(\D)进行补充,并将集差异应用于包含剩余字符的字符类。
    [^[^0-9]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    
    [^[^[^0-9]]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    
    [^[^[^[^0-9]]]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    

    正如Keppil在评论中进行的测试所证实的那样,上面的输出显示上述所有3个正则表达式都编译到了同一程序中!
    [^2[^0-9]]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
          [U+0032]
          2
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    

    我们得到的NOT(UNION(2, NOT(0-9))代替了0-13-9,而不是UNION(NOT(2), NOT(0-9)),即NOT(2)
    [^2[^[^0-9]]]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
          [U+0032]
          2
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    

    由于存在相同的错误,正则表达式[^2[^[^0-9]]][^2[^0-9]]编译到相同的程序。

    有一个尚 Unresolved 错误,似乎具有相同的性质:JDK-6609854

    说明

    初步

    以下是在进一步阅读之前应该知道的Pattern类的实现细节:
  • Pattern类将String编译成一个节点链,每个节点负责一个明确的职责,并将工作委托(delegate)给链中的下一个节点。 Node类是所有节点的基类。
  • CharProperty类是所有与字符类相关的Node的基类。
  • BitClass类是CharProperty类的子类,它使用boolean[]数组加快对Latin-1字符(代码点<=>add方法,该方法允许在编译过程中添加字符。
  • CharProperty.complementPattern.unionPattern.intersection是与设置操作相对应的方法。他们所做的是不言自明的。
  • Pattern.setDifferenceasymmetric set difference

  • 乍一看解析字符类

    在查看负责解析字符类的方法CharProperty clazz(boolean consume)方法的完整代码之前,让我们看一下极其简化的代码版本,以了解代码流程:
    private CharProperty clazz(boolean consume) {
        // [Declaration and initialization of local variables - OMITTED]
        BitClass bits = new BitClass();
        int ch = next();
        for (;;) {
            switch (ch) {
                case '^':
                    // Negates if first char in a class, otherwise literal
                    if (firstInClass) {
                        // [CODE OMITTED]
                        ch = next();
                        continue;
                    } else {
                        // ^ not first in class, treat as literal
                        break;
                    }
                case '[':
                    // [CODE OMITTED]
                    ch = peek();
                    continue;
                case '&':
                    // [CODE OMITTED]
                    continue;
                case 0:
                    // [CODE OMITTED]
                    // Unclosed character class is checked here
                    break;
                case ']':
                    // [CODE OMITTED]
                    // The only return statement in this method
                    // is in this case
                    break;
                default:
                    // [CODE OMITTED]
                    break;
            }
            node = range(bits);
    
            // [CODE OMITTED]
            ch = peek();
        }
    }
    

    该代码基本上会读取输入(将输入的String转换为代码点的以空值终止的int[]),直到命中]或String的结尾(未封闭的字符类)。

    该代码与continue块中的breakswitch混合在一起有点混淆。但是,只要您意识到continue属于外部for循环而break属于switch块,该代码就很容易理解:
  • continue结尾的案例将永远不会在switch语句之后执行代码。
  • break结尾的案例可能会在switch语句之后执行代码(如果还没有return)。

  • 通过上面的观察,我们可以看到,只要发现一个字符是非特殊字符,就应该将其包括在字符类中,我们将在switch语句之后执行代码,其中node = range(bits);是第一个语句。

    如果检查source code,则方法CharProperty range(BitClass bits)解析“字符类中的单个字符或字符范围”。该方法要么返回传入的相同BitClass对象(添加了新字符),要么返回CharProperty类的新实例。

    血腥细节

    接下来,让我们看一下完整的代码版本(省略解析字符类交集&&的部分):
    private CharProperty clazz(boolean consume) {
        CharProperty prev = null;
        CharProperty node = null;
        BitClass bits = new BitClass();
        boolean include = true;
        boolean firstInClass = true;
        int ch = next();
        for (;;) {
            switch (ch) {
                case '^':
                    // Negates if first char in a class, otherwise literal
                    if (firstInClass) {
                        if (temp[cursor-1] != '[')
                            break;
                        ch = next();
                        include = !include;
                        continue;
                    } else {
                        // ^ not first in class, treat as literal
                        break;
                    }
                case '[':
                    firstInClass = false;
                    node = clazz(true);
                    if (prev == null)
                        prev = node;
                    else
                        prev = union(prev, node);
                    ch = peek();
                    continue;
                case '&':
                    // [CODE OMITTED]
                    // There are interesting things (bugs) here,
                    // but it is not relevant to the discussion.
                    continue;
                case 0:
                    firstInClass = false;
                    if (cursor >= patternLength)
                        throw error("Unclosed character class");
                    break;
                case ']':
                    firstInClass = false;
    
                    if (prev != null) {
                        if (consume)
                            next();
    
                        return prev;
                    }
                    break;
                default:
                    firstInClass = false;
                    break;
            }
            node = range(bits);
    
            if (include) {
                if (prev == null) {
                    prev = node;
                } else {
                    if (prev != node)
                        prev = union(prev, node);
                }
            } else {
                if (prev == null) {
                    prev = node.complement();
                } else {
                    if (prev != node)
                        prev = setDifference(prev, node);
                }
            }
            ch = peek();
        }
    }
    

    查看case '[':语句的switch中的代码以及switch语句后的代码:
  • node变量存储解析单元(独立字符,字符范围,速记字符类,POSIX/Unicode字符类或嵌套字符类)的结果。
  • prev变量存储了到目前为止的编译结果,并且总是在编译node中的一个单元后立即进行更新。

  • 由于记录字符类是否被否定的局部变量boolean include从未传递给任何方法调用,因此只能在此方法中对其进行操作。唯一读取和处理include的位置是在switch语句之后。

    正在 build 中

    关于java - 正则表达式字符类双重否定中的错误?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/21934168/

    有关java - 正则表达式字符类双重否定中的错误?的更多相关文章

    1. ruby - 如何从 ruby​​ 中的字符串运行任意对象方法? - 2

      总的来说,我对ruby​​还比较陌生,我正在为我正在创建的对象编写一些rspec测试用例。许多测试用例都非常基础,我只是想确保正确填充和返回值。我想知道是否有办法使用循环结构来执行此操作。不必为我要测试的每个方法都设置一个assertEquals。例如:describeitem,"TestingtheItem"doit"willhaveanullvaluetostart"doitem=Item.new#HereIcoulddotheitem.name.shouldbe_nil#thenIcoulddoitem.category.shouldbe_nilendend但我想要一些方法来使用

    2. Ruby 解析字符串 - 2

      我有一个字符串input="maybe(thisis|thatwas)some((nice|ugly)(day|night)|(strange(weather|time)))"Ruby中解析该字符串的最佳方法是什么?我的意思是脚本应该能够像这样构建句子:maybethisissomeuglynightmaybethatwassomenicenightmaybethiswassomestrangetime等等,你明白了......我应该一个字符一个字符地读取字符串并构建一个带有堆栈的状态机来存储括号值以供以后计算,还是有更好的方法?也许为此目的准备了一个开箱即用的库?

    3. ruby - 其他文件中的 Rake 任务 - 2

      我试图在一个项目中使用rake,如果我把所有东西都放到Rakefile中,它会很大并且很难读取/找到东西,所以我试着将每个命名空间放在lib/rake中它自己的文件中,我添加了这个到我的rake文件的顶部:Dir['#{File.dirname(__FILE__)}/lib/rake/*.rake'].map{|f|requiref}它加载文件没问题,但没有任务。我现在只有一个.rake文件作为测试,名为“servers.rake”,它看起来像这样:namespace:serverdotask:testdoputs"test"endend所以当我运行rakeserver:testid时

    4. ruby-on-rails - 在 Rails 中将文件大小字符串转换为等效千字节 - 2

      我的目标是转换表单输入,例如“100兆字节”或“1GB”,并将其转换为我可以存储在数据库中的文件大小(以千字节为单位)。目前,我有这个:defquota_convert@regex=/([0-9]+)(.*)s/@sizes=%w{kilobytemegabytegigabyte}m=self.quota.match(@regex)if@sizes.include?m[2]eval("self.quota=#{m[1]}.#{m[2]}")endend这有效,但前提是输入是倍数(“gigabytes”,而不是“gigabyte”)并且由于使用了eval看起来疯狂不安全。所以,功能正常,

    5. ruby-on-rails - Ruby net/ldap 模块中的内存泄漏 - 2

      作为我的Rails应用程序的一部分,我编写了一个小导入程序,它从我们的LDAP系统中吸取数据并将其塞入一个用户表中。不幸的是,与LDAP相关的代码在遍历我们的32K用户时泄漏了大量内存,我一直无法弄清楚如何解决这个问题。这个问题似乎在某种程度上与LDAP库有关,因为当我删除对LDAP内容的调用时,内存使用情况会很好地稳定下来。此外,不断增加的对象是Net::BER::BerIdentifiedString和Net::BER::BerIdentifiedArray,它们都是LDAP库的一部分。当我运行导入时,内存使用量最终达到超过1GB的峰值。如果问题存在,我需要找到一些方法来更正我的代

    6. ruby-on-rails - unicode 字符串的长度 - 2

      在我的Rails(2.3,Ruby1.8.7)应用程序中,我需要将字符串截断到一定长度。该字符串是unicode,在控制台中运行测试时,例如'א'.length,我意识到返回了双倍长度。我想要一个与编码无关的长度,以便对unicode字符串或latin1编码字符串进行相同的截断。我已经了解了Ruby的大部分unicode资料,但仍然有些一头雾水。应该如何解决这个问题? 最佳答案 Rails有一个返回多字节字符的mb_chars方法。试试unicode_string.mb_chars.slice(0,50)

    7. ruby-on-rails - Rails 3 中的多个路由文件 - 2

      Rails2.3可以选择随时使用RouteSet#add_configuration_file添加更多路由。是否可以在Rails3项目中做同样的事情? 最佳答案 在config/application.rb中:config.paths.config.routes在Rails3.2(也可能是Rails3.1)中,使用:config.paths["config/routes"] 关于ruby-on-rails-Rails3中的多个路由文件,我们在StackOverflow上找到一个类似的问题

    8. ruby - 将差异补丁应用于字符串/文件 - 2

      对于具有离线功能的智能手机应用程序,我正在为Xml文件创建单向文本同步。我希望我的服务器将增量/差异(例如GNU差异补丁)发送到目标设备。这是计划:Time=0Server:hasversion_1ofXmlfile(~800kiB)Client:hasversion_1ofXmlfile(~800kiB)Time=1Server:hasversion_1andversion_2ofXmlfile(each~800kiB)computesdeltaoftheseversions(=patch)(~10kiB)sendspatchtoClient(~10kiBtransferred)Cl

    9. ruby-on-rails - Rails 常用字符串(用于通知和错误信息等) - 2

      大约一年前,我决定确保每个包含非唯一文本的Flash通知都将从模块中的方法中获取文本。我这样做的最初原因是为了避免一遍又一遍地输入相同的字符串。如果我想更改措辞,我可以在一个地方轻松完成,而且一遍又一遍地重复同一件事而出现拼写错误的可能性也会降低。我最终得到的是这样的:moduleMessagesdefformat_error_messages(errors)errors.map{|attribute,message|"Error:#{attribute.to_s.titleize}#{message}."}enddeferror_message_could_not_find(obje

    10. ruby - 如何以所有可能的方式将字符串拆分为长度最多为 3 的连续子字符串? - 2

      我试图获取一个长度在1到10之间的字符串,并输出将字符串分解为大小为1、2或3的连续子字符串的所有可能方式。例如:输入:123456将整数分割成单个字符,然后继续查找组合。该代码将返回以下所有数组。[1,2,3,4,5,6][12,3,4,5,6][1,23,4,5,6][1,2,34,5,6][1,2,3,45,6][1,2,3,4,56][12,34,5,6][12,3,45,6][12,3,4,56][1,23,45,6][1,2,34,56][1,23,4,56][12,34,56][123,4,5,6][1,234,5,6][1,2,345,6][1,2,3,456][123

    随机推荐