Nutch_草庐IT

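I am using the Nutch crawler for my application. It needs to crawl a set of URLs that I supply in the urls directory and fetch only the content of those URLs. I am not interested in the content of internal or external links, so I run the crawl command with the depth set to 1:

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the URLs and gives me the content of the given URLs. I read the content with the readseg utility:

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata

I am getting the content of the web pages. The problem I am facing is that if I provide something like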