c++ - 如何用 g++ 向量化我的循环？

coder 2023-06-04 原文

我在搜索时找到的介绍链接:

正如您所见，它们中的大多数都是用于 C 的，但我认为它们也可能适用于 C++。这是我的代码:

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
            size_t end, const std::vector<T> &p2) {
  typename std::vector<T>::const_iterator it2 = p2.begin();
  //#pragma simd
  //#pragma omp parallel for
  //#pragma GCC ivdep Unroll Vector
  for (size_t i = start; i < end; ++i, ++it2) {
    p1[i] = p1[i] - *it2;
    p1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x,y;
    n = 12800000;
    vector<double> v,u;
    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }
    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(v,0,n,u);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}

我使用了上面注释的所有提示，但我没有得到任何加速，如示例输出所示(第一次运行未注释此 #pragma GCC ivdep Unroll Vector:

samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.

还有希望吗？还是优化标志 O3 就可以解决问题？欢迎提出任何加速此代码(foo 函数)的建议!

我的 g++ 版本:

samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

请注意，循环的主体是随机的。我对以其他形式重写它不感兴趣。

编辑

回答说无能为力也是可以接受的!

最佳答案

O3 标志自动开启 -ftree-vectorize。 https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options

所以在这两种情况下，编译器都在尝试进行循环向量化。

使用g++ 4.8.2编译:

# In newer versions of GCC use -fopt-info-vec-missed instead of -ftree-vectorize
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

给出这个:

Analyzing loop at test.cpp:16                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                
Vectorizing loop at test.cpp:16                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                
test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39                                                                                                                                                                                    
test.cpp:16: note: created 1 versioning for alias checks.                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                
test.cpp:16: note: LOOP VECTORIZED.                                                                                                                                                                                                                                         
Analyzing loop at test_old.cpp:29                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                
test.cpp:22: note: vectorized 1 loops in function.                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                
test.cpp:18: note: Unroll loop 7 times                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                
test.cpp:16: note: Unroll loop 7 times                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                
test.cpp:28: note: Unroll loop 1 times

不带 -ftree-vectorize 标志编译:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

只返回这个:

test_old.cpp:16: note: Unroll loop 7 times

test_old.cpp:28: note: Unroll loop 1 times

第 16 行是循环函数的开始，因此编译器肯定会对其进行矢量化处理。检查汇编程序也证实了这一点。

我目前正在使用的笔记本电脑上似乎有一些激进的缓存，这使得准确测量函数运行所需的时间变得非常困难。

但您也可以尝试以下其他一些方法:

使用 __restrict__ 限定符告诉编译器数组之间没有重叠。
告诉编译器数组与 __builtin_assume_aligned 对齐(不可移植)

这是我的结果代码(我删除了模板，因为您希望对不同的数据类型使用不同的对齐方式)

#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

  for (size_t i = start; i < end; ++i)
  {
      pA1[i] = pA1[i] - pA2[i];
      pA1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v,u;

    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(&v[0], &u[0], 0, n );
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    return 0;
}

就像我说的那样，我无法获得一致的时间测量结果，因此无法确认这是否会提高您的性能(甚至可能会降低!)

关于c++ - 如何用 g++ 向量化我的循环？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/29292818/

amp 何用 code test lt c++optimization g++vectorization loop-unrolling

有关c++ - 如何用 g++ 向量化我的循环？的更多相关文章

ruby - 树顶语法无限循环 - 2
我脑子里浮现出一些关于一种新编程语言的想法，所以我想我会尝试实现它。一位friend建议我尝试使用Treetop(Rubygem)来创建一个解析器。Treetop的文档很少，我以前从未做过这种事情。我的解析器表现得好像有一个无限循环，但没有堆栈跟踪；事实证明很难追踪到。有人可以指出入门级解析/AST指南的方向吗？我真的需要一些列出规则、常见用法等的东西来使用像Treetop这样的工具。我的语法分析器在GitHub上，以防有人希望帮助我改进它。class{initialize=lambda(name){receiver.name=name}greet=lambda{IO.puts("He
ruby-on-rails - 在 Ruby 中循环遍历多个数组 - 2
我有多个ActiveRecord子类Item的实例数组，我需要根据最早的事件循环打印。在这种情况下，我需要打印付款和维护日期，如下所示:ItemAmaintenancerequiredin5daysItemBpaymentrequiredin6daysItemApaymentrequiredin7daysItemBmaintenancerequiredin8days我目前有两个查询，用于查找maintenance和payment项目(非排他性查询)，并输出如下内容:paymentrequiredin...maintenancerequiredin...有什么方法可以改善上述(丑陋的)代
ruby-on-rails - Ruby on Rails : . 常量化 : wrong constant name error? - 2
我正在使用这个:4.times{|i|assert_not_equal("content#{i+2}".constantize,object.first_content)}我之前声明过局部变量content1content2content3content4content5我得到的错误NameError:wrongconstantnamecontent2这个错误是什么意思？我很确定我想要content2=\ 最佳答案你必须用一个大字母来调用ruby常量:Content2而不是content2。Aconstantnamestart
ruby-on-rails - 如何优雅地重启 thin + nginx？ - 2
我的瘦服务器配置了nginx，我的ROR应用程序正在它们上运行。在我发布代码更新时运行thinrestart会给我的应用程序带来一些停机时间。我试图弄清楚如何优雅地重启正在运行的Thin实例，但找不到好的解决方案。有没有人能做到这一点？最佳答案 #Restartjustthethinserverdescribedbythatconfigsudothin-C/etc/thin/mysite.ymlrestartNginx将继续运行并代理请求。如果您将Nginx设置为使用多个上游服务器，例如server{listen80;server
ruby-on-rails - 如何在我的 Rails 应用程序 View 中打印 ruby 变量的内容？ - 2
我是一个Rails初学者，但我想从我的RailsView(html.haml文件)中查看Ruby变量的内容。我试图在ruby中打印出变量(认为它会在终端中出现)，但没有得到任何结果。有什么建议吗？我知道Rails调试器，但更喜欢使用inspect来打印我的变量。最佳答案您可以在View中使用puts方法将信息输出到服务器控制台。您应该能够在View中的任何位置使用Haml执行以下操作:-puts@my_variable.inspect 关于ruby-on-rails-如何在我的R
ruby - RuntimeError(自动加载常量 Apps 多线程时检测到循环依赖 - 2
我收到这个错误:RuntimeError(自动加载常量Apps时检测到循环依赖当我使用多线程时。下面是我的代码。为什么会这样？我尝试多线程的原因是因为我正在编写一个HTML抓取应用程序。对Nokogiri::HTML(open())的调用是一个同步阻塞调用，需要1秒才能返回，我有100,000多个页面要访问，所以我试图运行多个线程来解决这个问题。有更好的方法吗？classToolsController0)app.website=array.join(',')putsapp.websiteelseapp.website="NONE"endapp.saveapps=Apps.order("
ruby - 我可以将我的 README.textile 以正确的格式放入我的 RDoc 中吗？ - 2
我喜欢使用Textile或Markdown为我的项目编写自述文件，但是当我生成RDoc时，自述文件被解释为RDoc并且看起来非常糟糕。有没有办法让RDoc通过RedCloth或BlueCloth而不是它自己的格式化程序运行文件？它可以配置为自动检测文件后缀的格式吗？(例如README.textile通过RedCloth运行，但README.mdown通过BlueCloth运行) 最佳答案使用YARD直接代替RDoc将允许您包含Textile或Markdown文件，只要它们的文件后缀是合理的。我经常使用类似于以下Rake任务的东西:
jquery - 我的 jquery AJAX POST 请求无需发送 Authenticity Token (Rails) - 2
rails中是否有任何规定允许站点的所有AJAXPOST请求在没有authenticity_token的情况下通过？我有一个调用Controller方法的JqueryPOSTajax调用，但我没有在其中放置任何真实性代码，但调用成功。我的ApplicationController确实有'request_forgery_protection'并且我已经改变了config.action_controller.consider_all_requests_local在我的environments/development.rb中为false我还搜索了我的代码以确保我没有重载ajaxSend来发送
java - 我的模型类或其他类中应该有逻辑吗 - 2
我只想对我一直在思考的这个问题有其他意见，例如我有classuser_controller和classuserclassUserattr_accessor:name,:usernameendclassUserController//dosomethingaboutanythingaboutusersend问题是我的User类中是否应该有逻辑user=User.newuser.do_something(user1)oritshouldbeuser_controller=UserController.newuser_controller.do_something(user1,user2)我
ruby - 使用 `+=` 和 `send` 方法 - 2
如何将send与+=一起使用？a=20;a.send"+=",10undefinedmethod`+='for20:Fixnuma=20;a+=10=>30 最佳答案恐怕你不能。+=不是方法，而是语法糖。参见http://www.ruby-doc.org/docs/ProgrammingRuby/html/tut_expressions.html它说Incommonwithmanyotherlanguages,Rubyhasasyntacticshortcut:a=a+2maybewrittenasa+=2.你能做的最好的事情是:

c++ - 如何用 g++ 向量化我的循环？

有关c++ - 如何用 g++ 向量化我的循环？的更多相关文章

随机推荐