java - 写入 Lucene 索引，一次一个文档，随着时间的推移变慢

coder 2024-03-09 原文

我们有一个程序，它持续运行，做各种事情，并更改我们数据库中的一些记录。这些记录使用 Lucene 编制索引。所以每次我们改变一个实体时，我们都会做类似的事情:

打开数据库事务，打开Lucene IndexWriter
在事务中对数据库进行更改，并使用 indexWriter.deleteDocuments(..) 然后 indexWriter.addDocument(..) 在 Lucene 中更新该实体.
如果一切顺利，提交数据库事务并提交 IndexWriter。

这工作正常，但随着时间的推移，indexWriter.commit() 需要越来越多的时间。最初它需要大约 0.5 秒，但经过数百次此类交易后，它需要超过 3 秒。如果脚本运行时间更长，我相信它会花费更长的时间。

到目前为止，我的解决方案是注释掉 indexWriter.addDocument(..) 和 indexWriter.commit()，并时不时地重新创建整个索引首先使用 indexWriter.deleteAll() 然后在一个 Lucene transction/IndexWriter 中重新添加所有文档(约 14 秒内约 250k 个文档)。但这显然违背了数据库和 Lucene 提供的事务性方法，后者使两者保持同步，并使使用 Lucene 进行搜索的我们工具的用户可以看到数据库的更新。

我14秒可以添加250k个文档，但是添加1个文档需要3秒，这似乎很奇怪。我做错了什么，我该如何改善这种情况？

最佳答案

你做错的是假设 Lucene 的 built-in transactional capabilities具有与典型关系数据库相当的性能和保证，当they really don't .更具体地说，在您的情况下，提交会将所有索引文件与磁盘同步，从而使提交时间与索引大小成正比。这就是为什么随着时间的推移您的 indexWriter.commit() 需要越来越多的时间。 Javadoc IndexWriter.commit() 甚至警告说:

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

您能想象数据库文档会告诉您避免提交吗？

由于您的主要目标似乎是通过 Lucene 搜索及时保持数据库更新可见，为了改善这种情况，请执行以下操作:

让 indexWriter.deleteDocuments(..) 和 indexWriter.addDocument(..) 在数据库提交成功之后触发，而不是之前
定期执行 indexWriter.commit() 而不是每次事务，只是为了确保您的更改最终写入磁盘
使用 SearcherManager用于搜索和调用 maybeRefresh()在合理的时间范围内定期查看更新的文档

下面是一个示例程序，演示了如何通过定期执行 maybeRefresh() 来检索文档更新。它建立了 100000 个文档的索引，使用 ScheduledExecutorService设置 commit() 和 maybeRefresh() 的定期调用，提示您更新单个文档，然后重复搜索直到更新可见。程序终止时会正确清理所有资源。请注意，更新何时可见的控制因素是何时调用 maybeRefresh()，而不是 commit()。

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LucenePeriodicCommitRefreshExample {
    ScheduledExecutorService scheduledExecutor;
    MyIndexer indexer;
    MySearcher searcher;

    void init() throws IOException {
        scheduledExecutor = Executors.newScheduledThreadPool(3);
        indexer = new MyIndexer();
        indexer.init();
        searcher = new MySearcher(indexer.indexWriter);
        searcher.init();
    }

    void destroy() throws IOException {
        searcher.destroy();
        indexer.destroy();
        scheduledExecutor.shutdown();
    }

    class MyIndexer {
        IndexWriter indexWriter;
        Future commitFuture;

        void init() throws IOException {
            indexWriter = new IndexWriter(FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")), new IndexWriterConfig(new StandardAnalyzer()));
            indexWriter.deleteAll();
            for (int i = 1; i <= 100000; i++) {
                add(String.valueOf(i), "whatever " + i);
            }
            indexWriter.commit();
            commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    indexWriter.commit();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void add(String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }

        void update(String id, String text) throws IOException {
            indexWriter.deleteDocuments(new Term("id", id));
            add(id, text);
        }

        void destroy() throws IOException {
            commitFuture.cancel(false);
            indexWriter.close();
        }
    }

    class MySearcher {
        IndexWriter indexWriter;
        SearcherManager searcherManager;
        Future maybeRefreshFuture;

        public MySearcher(IndexWriter indexWriter) {
            this.indexWriter = indexWriter;
        }

        void init() throws IOException {
            searcherManager = new SearcherManager(indexWriter, true, null);
            maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    searcherManager.maybeRefresh();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        String findText(String id) throws IOException {
            IndexSearcher searcher = null;
            try {
                searcher = searcherManager.acquire();
                TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
                return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
            } finally {
                if (searcher != null) {
                    searcherManager.release(searcher);
                }
            }
        }

        void destroy() throws IOException {
            maybeRefreshFuture.cancel(false);
            searcherManager.close();
        }
    }

    public static void main(String[] args) throws IOException {
        LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
        example.init();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    example.destroy();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });

        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("Enter a document id to update (from 1 to 100000): ");
            String id = scanner.nextLine();
            System.out.print("Enter what you want the document text to be: ");
            String text = scanner.nextLine();
            example.indexer.update(id, text);
            long startTime = System.nanoTime();
            String foundText;
            do {
                foundText = example.searcher.findText(id);
            } while (!text.equals(foundText));
            long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
            System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}

此示例已成功使用 Lucene 5.3.1 和 JDK 1.8.0_66 进行测试。

关于java - 写入 Lucene 索引，一次一个文档，随着时间的推移变慢，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/32269632/

推移 Lucene indexWriter code IOException java performance indexing

有关java - 写入 Lucene 索引，一次一个文档，随着时间的推移变慢的更多相关文章

ruby - 使用 Vim Rails，您可以创建一个新的迁移文件并一次性打开它吗？ - 2
使用带有Rails插件的vim，您可以创建一个迁移文件，然后一次性打开该文件吗？textmate也可以这样吗？最佳答案你可以使用rails.vim然后做类似的事情::Rgeneratemigratonadd_foo_to_bar插件将打开迁移生成的文件，这正是您想要的。我不能代表textmate。关于ruby-使用VimRails，您可以创建一个新的迁移文件并一次性打开它吗？，我们在StackOverflow上找到一个类似的问题： https://sta
ruby-on-rails - Rails - 一个 View 中的多个模型 - 2
我需要从一个View访问多个模型。以前，我的links_controller仅用于提供以不同方式排序的链接资源。现在我想包括一个部分(我假设)显示按分数排序的顶级用户(@users=User.all.sort_by(&:score))我知道我可以将此代码插入每个链接操作并从View访问它，但这似乎不是“ruby方式”，我将需要在不久的将来访问更多模型。这可能会变得很脏，是否有针对这种情况的任何技术？注意事项:我认为我的应用程序正朝着单一格式和动态页面内容的方向发展，本质上是一个典型的网络应用程序。我知道before_filter但考虑到我希望应用程序进入的方向，这似乎很麻烦。最终从任何
ruby-on-rails - 渲染另一个 Controller 的 View - 2
我想要做的是有2个不同的Controller，client和test_client。客户端Controller已经构建，我想创建一个test_clientController，我可以使用它来玩弄客户端的UI并根据需要进行调整。我主要是想绕过我在客户端中内置的验证及其对加载数据的管理Controller的依赖。所以我希望test_clientController加载示例数据集，然后呈现客户端Controller的索引View，以便我可以调整客户端UI。就是这样。我在test_clients索引方法中试过这个:classTestClientdefindexrender:template=>
ruby - 如何每月在 Heroku 运行一次 Scheduler 插件？ - 2
在选择我想要运行操作的频率时，唯一的选项是“每天”、“每小时”和“每10分钟”。谢谢!我想为我的Rails3.1应用程序运行调度程序。最佳答案这不是一个优雅的解决方案，但您可以安排它每天运行，并在实际开始工作之前检查日期是否为当月的第一天。关于ruby-如何每月在Heroku运行一次Scheduler插件？，我们在StackOverflow上找到一个类似的问题： https://stackoverflow.com/questions/8692687/
Ruby 写入和读取对象到文件 - 2
好的，所以我的目标是轻松地将一些数据保存到磁盘以备后用。您如何简单地写入然后读取一个对象？所以如果我有一个简单的类classCattr_accessor:a,:bdefinitialize(a,b)@a,@b=a,bendend所以如果我从中非常快地制作一个objobj=C.new("foo","bar")#justgaveitsomerandomvalues然后我可以把它变成一个kindaidstring=obj.to_s#whichreturns""我终于可以将此字符串打印到文件或其他内容中。我的问题是，我该如何再次将这个id变回一个对象？我知道我可以自己挑选信息并制作一个接受该信
ruby-on-rails - 如果 Object::try 被发送到一个 nil 对象，为什么它会起作用？ - 2
如果您尝试在Ruby中的nil对象上调用方法，则会出现NoMethodError异常并显示消息:"undefinedmethod‘...’fornil:NilClass"然而，有一个tryRails中的方法，如果它被发送到一个nil对象，它只返回nil:require'rubygems'require'active_support/all'nil.try(:nonexisting_method)#noNoMethodErrorexceptionanymore那么try如何在内部工作以防止该异常？最佳答案像Ruby中的所有其他对象
ruby - 为什么 SecureRandom.uuid 创建一个唯一的字符串？ - 2
关闭。这个问题需要detailsorclarity.它目前不接受答案。想改进这个问题吗？通过editingthispost添加细节并澄清问题.关闭8年前。Improvethisquestion为什么SecureRandom.uuid创建一个唯一的字符串？SecureRandom.uuid#=>"35cb4e30-54e1-49f9-b5ce-4134799eb2c0"SecureRandom.uuid方法创建的字符串从不重复？
java - 等价于 Java 中的 Ruby Hash - 2
我真的很习惯使用Ruby编写以下代码:my_hash={}my_hash['test']=1Java中对应的数据结构是什么？最佳答案 HashMapmap=newHashMap();map.put("test",1);我假设？关于java-等价于Java中的RubyHash，我们在StackOverflow上找到一个类似的问题： https://stackoverflow.com/questions/22737685/
ruby-on-rails - Ruby 检查日期时间是否为 iso8601 并保存 - 2
我需要检查DateTime是否采用有效的ISO8601格式。喜欢:#iso8601?我检查了ruby是否有特定方法，但没有找到。目前我正在使用date.iso8601==date来检查这个。有什么好的方法吗？编辑解释我的环境，并改变问题的范围。因此，我的项目将使用jsapiFullCalendar，这就是我需要iso8601字符串格式的原因。我想知道更好或正确的方法是什么，以正确的格式将日期保存在数据库中，或者让ActiveRecord完成它们的工作并在我需要时间信息时对其进行操作。最佳答案我不太明白你的问题。我假设您想检查
ruby-on-rails - Rails - 从另一个模型中创建一个模型的实例 - 2
我有一个正在构建的应用程序，我需要一个模型来创建另一个模型的实例。我希望每辆车都有4个轮胎。汽车模型classCar轮胎模型classTire但是，在make_tires内部有一个错误，如果我为Tire尝试它，则没有用于创建或新建的activerecord方法。当我检查轮胎时，它没有这些方法。我该如何补救？错误是这样的:未定义的方法'create'forActiveRecord::AttributeMethods::Serialization::Tire::Module我测试了两个环境:测试和开发，它们都因相同的错误而失败。最佳答案

java - 写入 Lucene 索引，一次一个文档，随着时间的推移变慢

有关java - 写入 Lucene 索引，一次一个文档，随着时间的推移变慢的更多相关文章

随机推荐