草庐IT

带有大文件的 Git

coder 2023-06-23 原文

情况

我有两台服务器,生产和开发。在生产服务器上,有两个应用程序和多个 (6) 个数据库 (MySQL),我需要将它们分发给开发人员进行测试。所有源代码都存储在GitLab中在开发服务器上,开发人员只能使用此服务器,无法访问生产服务器。当我们发布应用程序时,master 登录到生产环境并从 Git 中 pull 新版本。数据库很大(每个都超过 500M,并且还在增加),我需要尽可能简单地将它们分发给开发人员进行测试。

可能的解决方案

  • 在将数据库转储到单个文件的备份脚本之后,执行将每个数据库推送到其自己的分支的脚本。如果开发人员想要更新他的本地副本,他可以 pull 其中一个分支。

    发现这个无法正常工作。

  • 生产服务器上的 Cron 每天保存二进制日志并将它们推送到该数据库的分支中。因此,在分支中,有每日更改的文件,开发人员 pull 他没有的文件。当前的 SQL 转储将以另一种方式发送给开发人员。当存储库的大小变得太大时,我们将向开发人员发送完整转储并刷新存储库中的所有数据并从头开始。

问题

  • 解决方案可行吗?
  • 如果 git 正在向/从存储库推送/pull ,它是上传/下载整个文件,还是只是更改它们(即添加新行或编辑当前行)?
  • Git能管理这么大的文件吗?不能。
  • 如何设置存储库中保留的修订数量?与新解决方案无关。
  • 有没有更好的解决方案?我不想强制开发人员通过 FTP 或类似的方式下载这么大的文件。

最佳答案

2017 年更新:

Microsoft 正在为 Microsoft/GVFS 做出贡献: 一个允许 Git 处理“the largest repo on the planet”的 Git 虚拟文件系统
(即:Windows 代码库,大约有 350 万个文件,当 checkin Git 存储库时,会产生大约 300GB 的存储库,除了数千个 pull 请求外,还会在 440 个分支中每天生成 1,760 个“实验室构建”验证构建)

GVFS virtualizes the file system beneath your git repo so that git and all tools see what appears to be a normal repo, but GVFS only downloads objects as they are needed.

GVFS 的某些部分可能会贡献到上游(Git 本身)。
但与此同时,all new Windows development is now (August 2017) on Git .


2015 年 4 月更新:GitHub 建议:Announcing Git Large File Storage (LFS)

使用 git-lfs (参见 git-lfs.github.com )和支持它的服务器:lfs-test-server ,您只能将元数据存储在 git 存储库中,而将大文件存储在其他地方。 每次提交最多 2 Gb。

参见 git-lfs/wiki/Tutorial :

git lfs track '*.bin'
git add .gitattributes "*.bin"
git commit -m "Track .bin files"

原答案:

关于大文件的git限制是什么,你可以考虑 bup (在 GitMinutes #24 中有详细介绍)

design of bup 强调限制 git 存储库的三个问题:

  • 大文件(xdelta for packfile 仅在内存中,不适合大文件)
  • 大量文件,这意味着每个 blob 一个文件,而且 git gc 一次生成一个 packfile 很慢。
  • 巨大的包文件,包文件索引无法从(巨大的)包文件中检索数据。

处理大文件和xdelta

The primary reason git can't handle huge files is that it runs them through xdelta, which generally means it tries to load the entire contents of a file into memory at once.
If it didn't do this, it would have to store the entire contents of every single revision of every single file, even if you only changed a few bytes of that file.
That would be a terribly inefficient use of disk space
, and git is well known for its amazingly efficient repository format.

Unfortunately, xdelta works great for small files and gets amazingly slow and memory-hungry for large files.
For git's main purpose, ie. managing your source code, this isn't a problem.

What bup does instead of xdelta is what we call "hashsplitting."
We wanted a general-purpose way to efficiently back up any large file that might change in small ways, without storing the entire file every time. We read through the file one byte at a time, calculating a rolling checksum of the last 128 bytes.

rollsum seems to do pretty well at its job. You can find it in bupsplit.c.
Basically, it converts the last 128 bytes read into a 32-bit integer. What we then do is take the lowest 13 bits of the rollsum, and if they're all 1's, we consider that to be the end of a chunk.
This happens on average once every 2^13 = 8192 bytes, so the average chunk size is 8192 bytes.
We're dividing up those files into chunks based on the rolling checksum.
Then we store each chunk separately (indexed by its sha1sum) as a git blob.

With hashsplitting, no matter how much data you add, modify, or remove in the middle of the file, all the chunks before and after the affected chunk are absolutely the same.
All that matters to the hashsplitting algorithm is the 32-byte "separator" sequence, and a single change can only affect, at most, one separator sequence or the bytes between two separator sequences.
Like magic, the hashsplit chunking algorithm will chunk your file the same way every time, even without knowing how it had chunked it previously.

The next problem is less obvious: after you store your series of chunks as git blobs, how do you store their sequence? Each blob has a 20-byte sha1 identifier, which means the simple list of blobs is going to be 20/8192 = 0.25% of the file length.
For a 200GB file, that's 488 megs of just sequence data.

We extend the hashsplit algorithm a little further using what we call "fanout." Instead of checking just the last 13 bits of the checksum, we use additional checksum bits to produce additional splits.
What you end up with is an actual tree of blobs - which git 'tree' objects are ideal to represent.

处理大量文件和git gc

git is designed for handling reasonably-sized repositories that change relatively infrequently. You might think you change your source code "frequently" and that git handles much more frequent changes than, say, svn can handle.
But that's not the same kind of "frequently" we're talking about.

The #1 killer is the way it adds new objects to the repository: it creates one file per blob. Then you later run 'git gc' and combine those files into a single file (using highly efficient xdelta compression, and ignoring any files that are no longer relevant).

'git gc' is slow, but for source code repositories, the resulting super-efficient storage (and associated really fast access to the stored files) is worth it.

bup doesn't do that. It just writes packfiles directly.
Luckily, these packfiles are still git-formatted, so git can happily access them once they're written.

处理巨大的存储库(意味着大量巨大的包文件)

Git isn't actually designed to handle super-huge repositories.
Most git repositories are small enough that it's reasonable to merge them all into a single packfile, which 'git gc' usually does eventually.

The problematic part of large packfiles isn't the packfiles themselves - git is designed to expect the total size of all packs to be larger than available memory, and once it can handle that, it can handle virtually any amount of data about equally efficiently.
The problem is the packfile indexes (.idx) files.

each packfile (*.pack) in git has an associated idx (*.idx) that's a sorted list of git object hashes and file offsets.
If you're looking for a particular object based on its sha1, you open the idx, binary search it to find the right hash, then take the associated file offset, seek to that offset in the packfile, and read the object contents.

The performance of the binary search is about O(log n) with the number of hashes in the pack, with an optimized first step (you can read about it elsewhere) that somewhat improves it to O(log(n)-7).
Unfortunately, this breaks down a bit when you have lots of packs.

To improve performance of this sort of operation, bup introduces midx (pronounced "midix" and short for "multi-idx") files.
As the name implies, they index multiple packs at a time.

关于带有大文件的 Git,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17888604/

有关带有大文件的 Git的更多相关文章

  1. ruby - 使用 RubyZip 生成 ZIP 文件时设置压缩级别 - 2

    我有一个Ruby程序,它使用rubyzip压缩XML文件的目录树。gem。我的问题是文件开始变得很重,我想提高压缩级别,因为压缩时间不是问题。我在rubyzipdocumentation中找不到一种为创建的ZIP文件指定压缩级别的方法。有人知道如何更改此设置吗?是否有另一个允许指定压缩级别的Ruby库? 最佳答案 这是我通过查看ruby​​zip内部创建的代码。level=Zlib::BEST_COMPRESSIONZip::ZipOutputStream.open(zip_file)do|zip|Dir.glob("**/*")d

  2. ruby - 其他文件中的 Rake 任务 - 2

    我试图在一个项目中使用rake,如果我把所有东西都放到Rakefile中,它会很大并且很难读取/找到东西,所以我试着将每个命名空间放在lib/rake中它自己的文件中,我添加了这个到我的rake文件的顶部:Dir['#{File.dirname(__FILE__)}/lib/rake/*.rake'].map{|f|requiref}它加载文件没问题,但没有任务。我现在只有一个.rake文件作为测试,名为“servers.rake”,它看起来像这样:namespace:serverdotask:testdoputs"test"endend所以当我运行rakeserver:testid时

  3. ruby-on-rails - 在 Rails 中将文件大小字符串转换为等效千字节 - 2

    我的目标是转换表单输入,例如“100兆字节”或“1GB”,并将其转换为我可以存储在数据库中的文件大小(以千字节为单位)。目前,我有这个:defquota_convert@regex=/([0-9]+)(.*)s/@sizes=%w{kilobytemegabytegigabyte}m=self.quota.match(@regex)if@sizes.include?m[2]eval("self.quota=#{m[1]}.#{m[2]}")endend这有效,但前提是输入是倍数(“gigabytes”,而不是“gigabyte”)并且由于使用了eval看起来疯狂不安全。所以,功能正常,

  4. ruby-on-rails - Rails 3 中的多个路由文件 - 2

    Rails2.3可以选择随时使用RouteSet#add_configuration_file添加更多路由。是否可以在Rails3项目中做同样的事情? 最佳答案 在config/application.rb中:config.paths.config.routes在Rails3.2(也可能是Rails3.1)中,使用:config.paths["config/routes"] 关于ruby-on-rails-Rails3中的多个路由文件,我们在StackOverflow上找到一个类似的问题

  5. ruby - 将差异补丁应用于字符串/文件 - 2

    对于具有离线功能的智能手机应用程序,我正在为Xml文件创建单向文本同步。我希望我的服务器将增量/差异(例如GNU差异补丁)发送到目标设备。这是计划:Time=0Server:hasversion_1ofXmlfile(~800kiB)Client:hasversion_1ofXmlfile(~800kiB)Time=1Server:hasversion_1andversion_2ofXmlfile(each~800kiB)computesdeltaoftheseversions(=patch)(~10kiB)sendspatchtoClient(~10kiBtransferred)Cl

  6. ruby - 如何将脚本文件的末尾读取为数据文件(Perl 或任何其他语言) - 2

    我正在寻找执行以下操作的正确语法(在Perl、Shell或Ruby中):#variabletoaccessthedatalinesappendedasafileEND_OF_SCRIPT_MARKERrawdatastartshereanditcontinues. 最佳答案 Perl用__DATA__做这个:#!/usr/bin/perlusestrict;usewarnings;while(){print;}__DATA__Texttoprintgoeshere 关于ruby-如何将脚

  7. ruby - 使用 Vim Rails,您可以创建一个新的迁移文件并一次性打开它吗? - 2

    使用带有Rails插件的vim,您可以创建一个迁移文件,然后一次性打开该文件吗?textmate也可以这样吗? 最佳答案 你可以使用rails.vim然后做类似的事情::Rgeneratemigratonadd_foo_to_bar插件将打开迁移生成的文件,这正是您想要的。我不能代表textmate。 关于ruby-使用VimRails,您可以创建一个新的迁移文件并一次性打开它吗?,我们在StackOverflow上找到一个类似的问题: https://sta

  8. Ruby 写入和读取对象到文件 - 2

    好的,所以我的目标是轻松地将一些数据保存到磁盘以备后用。您如何简单地写入然后读取一个对象?所以如果我有一个简单的类classCattr_accessor:a,:bdefinitialize(a,b)@a,@b=a,bendend所以如果我从中非常快地制作一个objobj=C.new("foo","bar")#justgaveitsomerandomvalues然后我可以把它变成一个kindaidstring=obj.to_s#whichreturns""我终于可以将此字符串打印到文件或其他内容中。我的问题是,我该如何再次将这个id变回一个对象?我知道我可以自己挑选信息并制作一个接受该信

  9. ruby - 如何使用 Ruby aws/s3 Gem 生成安全 URL 以从 s3 下载文件 - 2

    我正在编写一个小脚本来定位aws存储桶中的特定文件,并创建一个临时验证的url以发送给同事。(理想情况下,这将创建类似于在控制台上右键单击存储桶中的文件并复制链接地址的结果)。我研究过回形针,它似乎不符合这个标准,但我可能只是不知道它的全部功能。我尝试了以下方法:defauthenticated_url(file_name,bucket)AWS::S3::S3Object.url_for(file_name,bucket,:secure=>true,:expires=>20*60)end产生这种类型的结果:...-1.amazonaws.com/file_path/file.zip.A

  10. ruby - rspec 需要 .rspec 文件中的 spec_helper - 2

    我注意到像bundler这样的项目在每个specfile中执行requirespec_helper我还注意到rspec使用选项--require,它允许您在引导rspec时要求一个文件。您还可以将其添加到.rspec文件中,因此只要您运行不带参数的rspec就会添加它。使用上述方法有什么缺点可以解释为什么像bundler这样的项目选择在每个规范文件中都需要spec_helper吗? 最佳答案 我不在Bundler上工作,所以我不能直接谈论他们的做法。并非所有项目都checkin.rspec文件。原因是这个文件,通常按照当前的惯例,只

随机推荐