Sabrou-mal サブロウ丸

今回やること
openMP
並列化
- CMake
実験
- mac book
- machine A
まとめ
コード
参考
他の記事

今回やること

openMPを用いた並列化

openMP

言わずと知れた, スレッド並列ツール. FortranとC/C++に対応. 特徴的なのが, (openMPがうまく動くようなコードになっていれば)並列化のために手を加える箇所が少ないこと. またスレッド並列なので, プロセス間のメッセージパッシングによるデータ共有を行う必要もありません.
要するにopenMPにて比較的お手軽に並列化を行うことができます. 逆に気をつけなければならないのがデータ同期, 全てのスレッドが同じメモリ空間を共有するため, グローバルデータの書き換え時に注意する必要があります.

並列化

f:id:inarizuuuushi:20210701090926p:plain

最も簡単なのは並列化したいforループにopenMPを適用することです. 上記の図のようにコード一覧を分割して, 並行してそのコードに対する推論アルゴリズムの性能を測定します.

下記のように, forループの前に#pragma omp parallel forを追加します. これにより for ループがスレッド数分に分割されます. スレッド数は環境変数 ( OMP_NUM_THREADS ) が使用されます. ( 他の指定方法もあります) ループ内で使用している変数は(デフォルトでは)全てのスレッドで共有扱いになるので, スレッドごとに独立させたい変数については, private や firstprivateにより追加で指定します. ( OpenMP入門: データ共有属性の指定について ) この1行を追加するだけ.

/**
 * @fn void runInteractive(Config &config)
 * @brief 全てのsecret codeに対する計測
 * @param[in] config
 */
void runTest(
      Config &config
      ) noexcept
{
   CodePtrList S;
   allCodeGenerator(config, S);

   std::vector<int> countTable(S.size());
   std::vector<double> timeTable(S.size());
   auto startTotal = omp_get_wtime();

#pragma omp parallel for firstprivate(config)  // ここだけ追加!
   for ( int i = 0; i < static_cast<int>(S.size()); ++i )
   {
      // test code
      config.setSecret(*S[i]);
      CodePtrList testS = copy(S);
      CodeList guessHist;
      CodePtrList G;

      int count = 0;
      auto start = omp_get_wtime();
      while( testS.size() > 1 )
      {
         count++;
         G.clear();
         setGuessCandidates(S, guessHist, config, G);
         Code guess = policy(testS, G, config);
         trial(testS, guess, config);
         guessHist.push_back(guess);
      }

      double elapsed = omp_get_wtime() - start;
      countTable[i] = count;
      timeTable[i] = elapsed;
   }

   auto elapsedTotal = omp_get_wtime() - startTotal;
}

namespace内に用意しているグローバル変数で固定値でないものは thread_local 修飾子をつけて , スレッドごとに用意されるようにします.

extern thread_local std::vector<int> x;  // もともと extern std::vector<int> x;
extern thread_local std::vector<int> y;  // もともと extern std::vector<int> y;

thread_local std::vector<int> x(0, 0);  // もともとstd::vector<int> x(0, 0);
thread_local std::vector<int> y(0, 0);  // もともとstd::vector<int> y(0, 0);

これで目的の並列化は達成できます.

CMake

find_package

find_package(OpenMP REQUIRED)
set(OpenMPLib OpenMP::OpenMP_CXX)

target_compile_options

-fopenmp を追加. ただし mac (AppleClang)の場合は -Xclang -fopenmp とXclangも必要でした.

target_compile_definitions

target_compile_definitions(
  master_mind_lib PRIVATE
  ${OpenMp_CXX_DEFS}
  )

実験

$ OMP_NUM_THREADS=${num_threads} ./bin/mastermind ${num_colors} ${num_pins} --test --policy entropy

にて, 並列数ごとの total_timeを比較します. それぞれ10回実行して, 平均時間を算出します.

mac book

プロセッサ: 2.7 GHz クアッドコア Intel Core i7

num colors	num pins	total_time[s] threads=1	total_time[s] threads=2	total_time[s] threads=4
4	4	0.4374	0.2296	0.1290
5	5	242.9154	131.7104	97.5318
5	4	3.4860	2.2360	1.0285
6	4	31.6422	18.7869	11.9237
4	5	15.2080	8.3673	5.9889
3	5	0.3130	0.1628	0.1014

machine A

プロセッサ: 2.30GHz Intel(R) Xeon(R) CPU E5-2670 v3

num colors	num pins	total time[s] threads=1	total time[s] threads=2	total time[s] threads=4	total time[s] threads=8	total time[s] threads=16
4	4	0.4684	0.2556	0.1430	0.0842	0.0442
5	5	204.3259	105.8567	56.4360	30.7383	16.7211
5	4	2.9121	1.6415	0.8832	0.4671	0.2673
6	4	24.4358	13.0411	7.1945	3.8827	2.1821
4	5	14.1828	7.3915	3.8774	2.2165	1.1551
3	5	0.3103	0.1937	0.0973	0.0521	0.0284

順調に高速化されていますね. ぱっと見, 線形的に実行時間が減少しているのがみて取れます.