
java - Replacing text in a PDF file using iText

coder 2024-04-01

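I'm using the iText (5.5.13) library to read a PDF and replace a pattern in the file. The problem is that the pattern is never found, because some odd characters show up when the library reads the PDF.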

For example, take the sentence:

"This is a test in order to see if the"

When I read it, it turns into this:

[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]

So if I try to find and replace "test", the word "test" is not found in the PDF and nothing is replaced.
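The bracketed line is the operand of a TJ operator: an array in which string fragments alternate with numeric position adjustments (kerning), which is why the literal sequence "test" never occurs in the raw stream. As a minimal sketch in plain Java without iText, concatenating only the string fragments recovers the readable text. The names `TjArrayText` and `extractText` are made up for illustration, and the sketch assumes literal strings in a simple single-byte encoding (no hex strings, no octal escapes):

```java
public class TjArrayText {
    /**
     * Concatenates the literal string operands "(...)" of a TJ array and
     * skips the numeric adjustments between them.
     * Assumption: literal strings only; a backslash escape is reduced to the
     * escaped character (enough for \( \) \\, not for octal codes or \n).
     */
    public static String extractText(String tjArray) {
        StringBuilder text = new StringBuilder();
        int depth = 0; // nesting depth of parentheses inside a string
        for (int i = 0; i < tjArray.length(); i++) {
            char c = tjArray.charAt(i);
            if (depth > 0 && c == '\\' && i + 1 < tjArray.length()) {
                text.append(tjArray.charAt(++i)); // keep the escaped character
            } else if (c == '(') {
                if (depth++ > 0) text.append(c); // balanced nested parenthesis
            } else if (c == ')' && depth > 0) {
                if (--depth > 0) text.append(c);
            } else if (depth > 0) {
                text.append(c);
            }
            // Outside of strings (depth == 0) numbers, brackets and whitespace are ignored.
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String tj = "[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]";
        String text = extractText(tj);
        System.out.println(text);
        System.out.println(text.contains("test"));
    }
}
```

This only shows why a plain string replace on the stream bytes misses the word; it does not help with writing a modified stream back, since fonts may use encodings where the bytes are not readable characters at all.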

This is the code I'm using:

public void processPDF(String src, String dest) {

    try {

      PdfReader reader = new PdfReader(src);
      PdfArray refs = null;
      PRIndirectReference reference = null;

      int nPages = reader.getNumberOfPages();

      for (int i = 1; i <= nPages; i++) {
        PdfDictionary dict = reader.getPageN(i);
        PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
        if (object.isArray()) {
          refs = dict.getAsArray(PdfName.CONTENTS);
          ArrayList<PdfObject> references = refs.getArrayList();

          for (PdfObject r : references) {

            reference = (PRIndirectReference) r;
            PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
            byte[] data = PdfReader.getStreamBytes(stream);
            String dd = new String(data, "UTF-8");

            dd = dd.replaceAll("@pattern_1234", "trueValue");
            dd = dd.replaceAll("test", "tested");

            stream.setData(dd.getBytes());
          }

        }
        if (object instanceof PRStream) {
          PRStream stream = (PRStream) object;

          byte[] data = PdfReader.getStreamBytes(stream);
          String dd = new String(data, "UTF-8");
          System.out.println("content---->" + dd);
          dd = dd.replaceAll("@pattern_1234", "trueValue");
          dd = dd.replaceAll("This", "FIRST");

          stream.setData(dd.getBytes(StandardCharsets.UTF_8));
        }
      }
      PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
      stamper.close();
      reader.close();
    }

    catch (Exception e) {
    }
  }

Best answer

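As already mentioned in comments and answers, PDF is not a format meant for text editing. It is a final format, and information on the flow of text, its layout, and even its mapping to Unicode is optional.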

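So even assuming the optional information on mapping glyphs to Unicode is present, the approach to this task with iText may look a bit unsatisfying: first determine the position of the text in question using a custom text extraction strategy, then remove everything currently at that position using the PdfCleanUpProcessor, and finally draw the replacement text into the gap.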

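In this answer I present a helper class that combines the first two steps, finding and removing the existing text. Its advantage is that only the text is removed, not any background graphics and the like, as happens with PdfCleanUpProcessor redaction. The helper also returns the positions of the removed text, so the replacement can be stamped there.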

The helper class is based on the PdfContentStreamEditor presented in an earlier answer. Please use the version of that class on GitHub, though, as the original class has been enhanced a bit since its conception.

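The SimpleTextRemover helper class illustrates what is necessary to properly remove text from a PDF. It is limited in several respects:

  • It only replaces text in the actual page content streams.

    To also replace text in embedded XObjects, one has to iterate recursively through the XObject resources of the page in question and apply the editor to them as well.

  • It is "simple" in the same way the SimpleTextExtractionStrategy is: it assumes that the text showing instructions appear in the content in reading order.

    To also handle content streams in which the order differs and the instructions must be sorted, all incoming instructions and the relevant render information must be cached until the end of the page, not merely a few instructions at a time. Then the render information can be sorted, the sections to remove can be identified in the sorted render information, the associated instructions can be manipulated, and the instructions can finally be stored.

  • It does not try to identify gaps between glyphs that visually represent white space while there is actually no glyph at all.

    To identify gaps, the code must be extended to check whether two consecutive glyphs follow one another exactly or whether there is a gap or a line jump between them.

  • When calculating the gap to leave where a glyph is removed, it does not yet take character and word spacing into account.

    To improve this, the glyph width calculation must be improved.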

Considering the example excerpt from your content stream, though, these limitations probably won't hinder you.
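For the last limitation, the PDF text positioning rules suggest how spacing could be folded into the adjustment. In horizontal writing mode a glyph of width w0 advances the text position by (w0 · Tfs + Tc + Tw) · Th, with Tw applying only to the single-byte code 32, while a number n in a TJ array advances it by (−n / 1000) · Tfs · Th. Equating the two gives n = −(W + 1000 · (Tc + Tw) / Tfs), where W is the glyph width in thousandths. A minimal sketch of that arithmetic in plain Java follows; the class and method names are hypothetical and not part of SimpleTextRemover:

```java
public class GapAdjustment {
    /**
     * Returns the TJ array number that moves the text position as far as
     * the removed glyph would have (horizontal writing mode assumed).
     *
     * @param glyphWidth  glyph width in thousandths of a text space unit
     * @param fontSize    current font size (Tfs), must be non-zero
     * @param charSpacing current character spacing (Tc)
     * @param wordSpacing current word spacing (Tw)
     * @param isSpace     whether the glyph is the single-byte code 32,
     *                    the only code word spacing applies to
     */
    public static double tjAdjustment(double glyphWidth, double fontSize,
            double charSpacing, double wordSpacing, boolean isSpace) {
        double spacing = charSpacing + (isSpace ? wordSpacing : 0);
        // A TJ number n shifts by -n/1000 * fontSize, hence the sign and the scaling.
        return -(glyphWidth + 1000 * spacing / fontSize);
    }

    public static void main(String[] args) {
        // Hypothetical values: a 500-unit glyph at font size 10 with 0.5 character spacing.
        System.out.println(tjAdjustment(500, 10, 0.5, 0, false));
    }
}
```

With both spacings at zero this reduces to the plain negated glyph width, which is what the class below writes. The horizontal scaling Th cancels out because it applies to both sides.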

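import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

import com.itextpdf.text.ExceptionConverter;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// PdfContentStreamEditor is the class from the GitHub repository mentioned above;
// import it from whatever package you copy it into.
public class SimpleTextRemover extends PdfContentStreamEditor {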
    public SimpleTextRemover() {
        super (new SimpleTextRemoverListener());
        ((SimpleTextRemoverListener)getRenderListener()).simpleTextRemover = this;
    }

    /**
     * <p>Removes the string to remove from the given page of the
     * document in the PDF reader the given PDF stamper works on.</p>
     * <p>The result is a list of glyph lists, each of which represents
     * a match and can be queried for position information.</p>
     */
    public List<List<Glyph>> remove(PdfStamper pdfStamper, int pageNum, String toRemove) throws IOException {
        if (toRemove.isEmpty())
            return Collections.emptyList();

        this.toRemove = toRemove;
        cachedOperations.clear();
        elementNumber = -1;
        // Reset together with elementNumber; otherwise the two counters drift
        // apart when remove() is called for more than one page.
        processedElements = 0;
        pendingMatch.clear();
        matches.clear();
        allMatches.clear();
        editPage(pdfStamper, pageNum);
        return allMatches;
    }

    /**
     * Adds the given operation to the cached operations and checks
     * whether some cached operations can meanwhile be processed and
     * written to the result content stream.
     */
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        cachedOperations.add(new ArrayList<>(operands));

        while (process(processor)) {
            cachedOperations.remove(0);
        }
    }

    /**
     * Removes any started match and sends all remaining cached
     * operations for processing.
     */
    @Override
    public void finalizeContent() {
        pendingMatch.clear();
        try {
            while (!cachedOperations.isEmpty()) {
                if (!process(this)) {
                    // TODO: Should not happen, so warn
                    System.err.printf("Failure flushing operation %s; dropping.\n", cachedOperations.get(0));
                }
                cachedOperations.remove(0);
            }
        } catch (IOException e) {
            throw new ExceptionConverter(e);
        }
    }

    /**
     * Tries to process the first cached operation. Returns whether
     * it could be processed.
     */
    boolean process(PdfContentStreamProcessor processor) throws IOException {
        if (cachedOperations.isEmpty())
            return false;

        List<PdfObject> operands = cachedOperations.get(0);
        PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            return processTextShowingOp(processor, operator, operands);

        super.write(processor, operator, operands);
        return true;
    }

    /**
     * Tries to process a text showing operation. Unless a match
     * is pending and starts before the end of the argument of this
     * instruction, it can be processed. If the instruction contains
     * a part of a match, it is transformed to a TJ operation and
     * the glyphs in question are replaced by text position adjustments.
     * If the original operation had a side effect (jump to next line
     * or spacing adjustment), this side effect is explicitly added.
     */
    boolean processTextShowingOp(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        PdfObject object = operands.get(operands.size() - 2);
        boolean isArray = object instanceof PdfArray;
        PdfArray array = isArray ? (PdfArray) object : new PdfArray(object);
        int elementCount = countStrings(object);

        // Currently pending glyph intersects parameter of this operation -> cannot yet process
        if (!pendingMatch.isEmpty() && pendingMatch.get(0).elementNumber < processedElements + elementCount)
            return false;

        // The parameter of this operation is not subject to a match -> copy as is
        if (matches.size() == 0 || processedElements + elementCount <= matches.get(0).get(0).elementNumber || elementCount == 0) {
            super.write(processor, operator, operands);
            processedElements += elementCount;
            return true;
        }

        // The parameter of this operation contains glyphs of a match -> manipulate 
        PdfArray newArray = new PdfArray();
        for (int arrayIndex = 0; arrayIndex < array.size(); arrayIndex++) {
            PdfObject entry = array.getPdfObject(arrayIndex);
            if (!(entry instanceof PdfString)) {
                newArray.add(entry);
            } else {
                PdfString entryString = (PdfString) entry;
                byte[] entryBytes = entryString.getBytes();
                for (int index = 0; index < entryBytes.length; ) {
                    List<Glyph> match = matches.size() == 0 ? null : matches.get(0);
                    Glyph glyph = match == null ? null : match.get(0);
                    if (glyph == null || processedElements < glyph.elementNumber) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, entryBytes.length)));
                        break;
                    }
                    if (index < glyph.index) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, glyph.index)));
                        index = glyph.index;
                        continue;
                    }
                    newArray.add(new PdfNumber(-glyph.width));
                    index++;
                    match.remove(0);
                    if (match.isEmpty())
                        matches.remove(0);
                }
                processedElements++;
            }
        }
        writeSideEffect(processor, operator, operands);
        writeTJ(processor, newArray);

        return true;
    }

    /**
     * Counts the strings in the given argument, itself a string or
     * an array containing strings and non-strings.
     */
    int countStrings(PdfObject textArgument) {
        if (textArgument instanceof PdfArray) {
            int result = 0;
            for (PdfObject object : (PdfArray)textArgument) {
                if (object instanceof PdfString)
                    result++;
            }
            return result;
        } else 
            return textArgument instanceof PdfString ? 1 : 0;
    }

    /**
     * Writes side effects of a text showing operation which is going to be
     * replaced by a TJ operation. Side effects are line jumps and changes
     * of character or word spacing.
     */
    void writeSideEffect(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        switch (operator.toString()) {
        case "\"":
            // Operands of ": word spacing, character spacing, string.
            super.write(processor, OPERATOR_Tw, Arrays.asList(operands.get(0), OPERATOR_Tw));
            super.write(processor, OPERATOR_Tc, Arrays.asList(operands.get(1), OPERATOR_Tc));
            // Intentional fall-through: " also moves to the next line like '.
        case "'":
            super.write(processor, OPERATOR_Tasterisk, Collections.singletonList(OPERATOR_Tasterisk));
        }
    }

    /**
     * Writes a TJ operation with the given array unless array is empty.
     */
    void writeTJ(PdfContentStreamProcessor processor, PdfArray array) throws IOException {
        if (!array.isEmpty()) {
            List<PdfObject> operands = Arrays.asList(array, OPERATOR_TJ);
            super.write(processor, OPERATOR_TJ, operands);
        }
    }

    /**
     * Analyzes the given text render info whether it starts a new match or
     * finishes / continues / breaks a pending match. This method is called
     * by the {@link SimpleTextRemoverListener} registered as render listener
     * of the underlying content stream processor.
     */
    void renderText(TextRenderInfo renderInfo) {
        elementNumber++;
        int index = 0;
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) {
            int matchPosition = pendingMatch.size();
            pendingMatch.add(new Glyph(info, elementNumber, index));
            if (!toRemove.substring(matchPosition, matchPosition + info.getText().length()).equals(info.getText())) {
                reduceToPartialMatch();
            }
            if (pendingMatch.size() == toRemove.length()) {
                matches.add(new ArrayList<>(pendingMatch));
                allMatches.add(new ArrayList<>(pendingMatch));
                pendingMatch.clear();
            }
            index++;
        }
    }

    /**
     * Reduces the current pending match to an actual (partial) match
     * after the addition of the next glyph has invalidated it as a
     * whole match.
     */
    void reduceToPartialMatch() {
        outer:
        while (!pendingMatch.isEmpty()) {
            pendingMatch.remove(0);
            int index = 0;
            for (Glyph glyph : pendingMatch) {
                if (!toRemove.substring(index, index + glyph.text.length()).equals(glyph.text)) {
                    continue outer;
                }
                index++;
            }
            break;
        }
    }

    String toRemove = null;
    final List<List<PdfObject>> cachedOperations = new LinkedList<>();

    int elementNumber = -1;
    int processedElements = 0;
    final List<Glyph> pendingMatch = new ArrayList<>();
    final List<List<Glyph>> matches = new ArrayList<>();
    final List<List<Glyph>> allMatches = new ArrayList<>();

    /**
     * Render listener class used by {@link SimpleTextRemover} as listener
     * of its content stream processor ancestor. Essentially it forwards
     * {@link TextRenderInfo} events and ignores all else.
     */
    static class SimpleTextRemoverListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            simpleTextRemover.renderText(renderInfo);
        }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        SimpleTextRemover simpleTextRemover = null;
    }

    /**
     * Value class representing a glyph with information on
     * the displayed text and its position, the overall number
     * of the string argument of a text showing instruction
     * it is in and the index at which it can be found therein,
     * and the width to use as text position adjustment when
     * replacing it. Beware, the width does not yet consider
     * character and word spacing!
     */
    public static class Glyph {
        public Glyph(TextRenderInfo info, int elementNumber, int index) {
            text = info.getText();
            ascent = info.getAscentLine();
            base = info.getBaseline();
            descent = info.getDescentLine();
            this.elementNumber = elementNumber;
            this.index = index;
            this.width = info.getFont().getWidth(text);
        }

        public final String text;
        public final LineSegment ascent;
        public final LineSegment base;
        public final LineSegment descent;
        final int elementNumber;
        final int index;
        final float width;
    }

    final PdfLiteral OPERATOR_Tasterisk = new PdfLiteral("T*");
    final PdfLiteral OPERATOR_Tc = new PdfLiteral("Tc");
    final PdfLiteral OPERATOR_Tw = new PdfLiteral("Tw");
    final PdfLiteral OPERATOR_Tj = new PdfLiteral("Tj");
    final PdfLiteral OPERATOR_TJ = new PdfLiteral("TJ");
    final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    final static Glyph[] EMPTY_GLYPH_ARRAY = new Glyph[0];
}

(SimpleTextRemover helper class)

You can use it like this:

PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
SimpleTextRemover remover = new SimpleTextRemover();

System.out.printf("\ntest.pdf - Test\n");
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    System.out.printf("Page %d:\n", i);
    List<List<Glyph>> matches = remover.remove(pdfStamper, i, "Test");
    for (List<Glyph> match : matches) {
        Glyph first = match.get(0);
        Vector baseStart = first.base.getStartPoint();
        Glyph last = match.get(match.size()-1);
        Vector baseEnd = last.base.getEndPoint();
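        // Vector.I1 and Vector.I2 are the x and y indices of iText's parser Vector.
        System.out.printf("  Match from (%3.1f %3.1f) to (%3.1f %3.1f)\n", baseStart.get(Vector.I1), baseStart.get(Vector.I2), baseEnd.get(Vector.I1), baseEnd.get(Vector.I2));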
    }
}

pdfStamper.close();

(RemovePageTextContent test testRemoveTestFromTest)

For my test file, the console output is:

test.pdf - Test
Page 1:
  Match from (134,8 666,9) to (177,8 666,9)
  Match from (134,8 642,0) to (153,4 642,0)
  Match from (172,8 642,0) to (191,4 642,0)

and the occurrences of "Test" are missing at those positions in the output PDF.

Instead of printing the match coordinates, you can use them to draw the replacement text at the positions in question.

Source: the corresponding question on Stack Overflow, https://stackoverflow.com/questions/57308588/
