php - Indexing millions of images with libpuzzle?

coder 2023-06-10 Original post

This is about the libpuzzle library for PHP by Mr. Frank Denis ( http://libpuzzle.pureftpd.org/project/libpuzzle ). I am trying to understand how to index and store the data in my MySQL database. Generating the vectors is no problem at all.

Example:

# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');

# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);

# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
  echo "Pictures are looking similar\n";
} else {
  echo "Pictures are different, distance=$d\n";
}

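That much is clear to me. But how do I work with a large number of pictures (more than 1,000,000)? Do I compute the vector and store it in the database together with the file name? And how do I then find similar pictures? If I store every vector in MySQL, I have to read every record and compute the distance with puzzle_vector_normalized_distance. That takes a lot of time (open each database entry, pass it through the function, ...).

I read the README in the libpuzzle library and found the following: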

Will it work with a database that has millions of pictures?

A typical image signature only requires 182 bytes, using the built-in compression/decompression functions.

Similar signatures share identical “words”, ie. identical sequences of values at the same positions. By using compound indexes (word + position), the set of possible similar vectors is dramatically reduced, and in most cases, no vector distance actually requires to get computed.

Indexing through words and positions also makes it easy to split the data into multiple tables and servers.

So yes, the Puzzle library is certainly not incompatible with projects that need to index millions of pictures.

I also found this description of indexing:

------------------------ INDEXING ------------------------

How to quickly find similar pictures, if they are millions of records?

The original paper has a simple, yet efficient answer.

Cut the vector in fixed-length words. For instance, let's consider the following vector:

[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]

With a word length (K) of 10, you can get the following words:

[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1

Then, index your vector with a compound index of (word + position).

Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.

Here's a very basic sample database schema:

+-----------------------------+
| signatures                  |
+-----------------------------+
| sig_id | signature | pic_id |
+--------+-----------+--------+

+--------------------------+
| words                    |
+--------------------------+
| pos_and_word | fk_sig_id |
+--------------+-----------+

I'd recommend splitting at least the "words" table into multiple tables and/or servers.

By default (lambdas=9) signatures are 544 bytes long. In order to save storage space, they can be compressed to one third of their original size through the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().

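I think compression is the wrong approach, because I would have to uncompress every vector before comparing.

My question now is: what is the way to handle millions of pictures, and how can I compare them quickly and efficiently? I do not understand how "cutting the vector" helps with my problem.

Many thanks. Maybe I can find someone here who works with the libpuzzle library.

Cheers.

Best answer

So, let's look at the example they give and try to expand on it.

Suppose you have a table that stores the information about each image (path, name, description, etc.). In that table you include a field for the compressed signature, which is computed and stored when you initially populate the database. Let's define that table like this: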

CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    signature TEXT NOT NULL
);

When you initially compute the signature, you also compute a number of words from it:

// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is K from the example
$wordcnt = 100; // this is N from the example
// take at most $wordcnt words, and never read past the end of the signature
$limit = min($wordcnt, strlen($cvec) - $wordlen + 1);
for ($i = 0; $i < $limit; $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}
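
To see what this loop produces, the README's alphabet example can be run through the same logic with a plain string. This is only a sketch: signature_words() is a hypothetical helper name, not part of libpuzzle, and the toy string stands in for a real signature returned by puzzle_fill_cvec_from_file().

```php
<?php
// Hypothetical helper (not part of libpuzzle): cut a signature string into
// at most $wordcnt overlapping words of $wordlen bytes, keyed by position.
function signature_words(string $cvec, int $wordlen = 10, int $wordcnt = 100): array
{
    $limit = min($wordcnt, strlen($cvec) - $wordlen + 1);
    $words = [];
    for ($i = 0; $i < $limit; $i++) {
        $words[$i] = substr($cvec, $i, $wordlen);
    }
    return $words;
}

// Toy input: the 26-letter vector from the README instead of a real signature.
$words = signature_words('abcdefghijklmnopqrstuvwxyz');
echo $words[0], "\n";     // abcdefghij
echo $words[1], "\n";     // bcdefghijk
echo count($words), "\n"; // 17 (positions 0 through 16)
```

A 26-byte input with 10-byte words yields only 17 words, so the cap of 100 only comes into play for longer signatures such as the 544-byte default mentioned in the README.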

Now you can put those words into a table defined like this:

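CREATE TABLE img_sig_words (
    image_id INTEGER NOT NULL,
    sig_word TEXT NOT NULL,
    FOREIGN KEY (image_id) REFERENCES images (image_id),
    -- MySQL requires a prefix length to index a TEXT column; sig_word leads
    -- because matches are looked up by word
    INDEX (sig_word(32), image_id)
);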

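Now you insert into that table, prefixing each word with the index of the position where it was found, so that a match on a word is also a match at the same position in the signature: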

// the signature, along with all other data, has already been inserted into the images
// table, and $image_id has been populated with the resulting primary key.
// Assumes $dbobj is a PDO-like object; placeholders keep the raw signature
// bytes out of the SQL string.
$stmt = $dbobj->prepare("INSERT INTO img_sig_words (image_id, sig_word) VALUES (?, ?)");
foreach ($words as $index => $word) {
    $sig_word = $index.'__'.$word;
    $stmt->execute(array($image_id, $sig_word));
}
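
Inserting one row per word means up to 100 round trips per image. One possible variation, assuming a driver that accepts ? placeholders (PDO does), is to build a single multi-row INSERT per image. build_word_insert() is a hypothetical helper introduced here; the sketch only builds the statement and its parameters and leaves execution to your database layer.

```php
<?php
// Hypothetical helper: build one multi-row INSERT for all words of an image.
// Returns [sql, params]. Assumes $words is non-empty and keyed by position.
function build_word_insert(int $image_id, array $words): array
{
    $rows = [];
    $params = [];
    foreach ($words as $index => $word) {
        $rows[] = '(?, ?)';
        $params[] = $image_id;
        $params[] = $index . '__' . $word; // same "position__word" format as above
    }
    $sql = 'INSERT INTO img_sig_words (image_id, sig_word) VALUES ' . implode(', ', $rows);
    return [$sql, $params];
}

// Toy words instead of real signature slices.
[$sql, $params] = build_word_insert(42, ['abcdefghij', 'bcdefghijk']);
echo $sql, "\n";
// INSERT INTO img_sig_words (image_id, sig_word) VALUES (?, ?), (?, ?)
print_r($params); // 42, "0__abcdefghij", 42, "1__bcdefghijk"
```

With PDO, the result could be run as $pdo->prepare($sql)->execute($params).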

With your data initialized this way, you can fetch images with matching words relatively easily:

// $image_id is set to the base image that you are trying to find matches to
$dbobj->query("SELECT i.*, COUNT(isw.sig_word) AS strength
    FROM images i
    JOIN img_sig_words isw ON i.image_id = isw.image_id
    JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word
        AND isw.image_id != isw_search.image_id
    WHERE isw_search.image_id = $image_id
    GROUP BY i.image_id, i.name, i.description, i.file_path, i.url_path, i.signature
    ORDER BY strength DESC");

You can refine the query by adding a HAVING clause that requires a minimum strength, which reduces your match set further.
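
As a sketch of that refinement, the query below adds a HAVING clause on the word count. The threshold of 5 shared words is an arbitrary illustrative value, not a recommendation from libpuzzle, and $min_strength is a name introduced here. The select list is trimmed to image_id to keep the example short.

```php
<?php
// Illustrative values; in practice $image_id identifies the base image.
$image_id = 42;
$min_strength = 5; // arbitrary example threshold: minimum number of shared words

// %d forces both values to integers before they reach the SQL string.
$sql = sprintf(
    "SELECT i.image_id, COUNT(isw.sig_word) AS strength
     FROM images i
     JOIN img_sig_words isw ON i.image_id = isw.image_id
     JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word
         AND isw.image_id != isw_search.image_id
     WHERE isw_search.image_id = %d
     GROUP BY i.image_id
     HAVING COUNT(isw.sig_word) >= %d
     ORDER BY strength DESC",
    $image_id,
    $min_strength
);
echo $sql, "\n";
```

A suitable threshold depends on your data, so it is worth trying a few values against images you already know to be similar.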

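I make no guarantee that this is the most efficient setup, but it should roughly do what you are looking for.

Basically, splitting and storing the words this way lets you do a rough distance check without running a specialized function on the signatures.

Regarding "php - Indexing millions of images with libpuzzle?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9703762/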
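
The idea can be tried without a database. The following sketch builds an in-memory index from (position, word) pairs and counts how many of them a query signature shares with each stored one, which is what the strength column above measures. The signatures are toy strings and all function names are made up for illustration; with real data, the few surviving candidates could then be verified with puzzle_vector_normalized_distance() as in the question.

```php
<?php
// Hypothetical helper: position-prefixed words of a signature string.
function positioned_words(string $sig, int $wordlen, int $wordcnt): array
{
    $out = [];
    $limit = min($wordcnt, strlen($sig) - $wordlen + 1);
    for ($i = 0; $i < $limit; $i++) {
        $out[] = $i . '__' . substr($sig, $i, $wordlen);
    }
    return $out;
}

// Build an inverted index: positioned word => list of image ids containing it.
function build_index(array $signatures, int $wordlen, int $wordcnt): array
{
    $index = [];
    foreach ($signatures as $id => $sig) {
        foreach (positioned_words($sig, $wordlen, $wordcnt) as $key) {
            $index[$key][] = $id;
        }
    }
    return $index;
}

// Count shared positioned words per stored image, strongest first.
function find_candidates(array $index, string $query, int $wordlen, int $wordcnt): array
{
    $strength = [];
    foreach (positioned_words($query, $wordlen, $wordcnt) as $key) {
        foreach ($index[$key] ?? [] as $id) {
            $strength[$id] = ($strength[$id] ?? 0) + 1;
        }
    }
    arsort($strength);
    return $strength;
}

// Toy signatures (real ones are binary strings of a few hundred bytes).
$signatures = [
    'a' => 'abcdefghijklmnop',
    'b' => 'abcdefghijklmnoX', // differs from 'a' only in the last byte
    'c' => 'zzzzzzzzzzzzzzzz', // unrelated
];
$index = build_index($signatures, 4, 100);
// 'a' => 13, 'b' => 12; 'c' shares no words and is absent from the result
print_r(find_candidates($index, 'abcdefghijklmnop', 4, 100));
```

Only images that share at least one word at the same position ever show up as candidates, which is why most stored signatures never need a full distance computation.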
