White-eye Bright

Saturday, July 17, 2021

Using cudf on Google Colab

https://acro-engineer.hatenablog.com/entry/2020/12/10/120000

is a useful reference, but as of July 17, 2021 the procedure has changed (update_modules.py no longer exists, and the Python version is 3.7 rather than 3.6).

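# Fetch the RAPIDS helper scripts for Colab and run the stable installer.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

# Insert site-packages just before dist-packages on sys.path.
dist_package_index = sys.path.index('/usr/local/lib/python3.7/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.7/site-packages'] + sys.path[dist_package_index:]
# Display the updated search path.
sys.path
# Run the install script from the cloned repository in the notebook's global namespace.
exec(open('rapidsai-csp-utils/colab/install_rapids.py').read(), globals())

This seems to work.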

Friday, October 16, 2020

[Memo] How to Win a Data Science Competition: Learn from Top Kagglers, Week 1, Part 6

● Preprocessing for text data (done before bag of words and similar steps)

1. Convert uppercase to lowercase

2. Normalize variants of the same word (cars ⇒ car, had ⇒ have)

  stemming and lemmatization

  stemming: democracy, democratic, democratization ⇒ democr

  lemmatization: democracy, democratic, democratization ⇒ democracy

3. Remove stopwords

● Bag-of-words pipeline

Preprocessing (see above) ⇒ N-grams ⇒ postprocessing (TF-IDF)
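The pipeline above can be sketched in plain Python as follows. This is a minimal illustration, not the course's code: the stopword list, the helper names (preprocess, ngrams, tfidf), and the unsmoothed TF-IDF formula are all assumptions, and stemming/lemmatization is left out because it normally relies on an NLP library.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration; real lists are much longer.
STOPWORDS = {"the", "a", "is", "on", "and"}

def preprocess(text):
    """Lowercase, keep runs of letters as tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n_max=2):
    """Return all n-grams for n = 1..n_max, joined with spaces."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(docs):
    """Term frequency times log inverse document frequency, per document."""
    counts = [Counter(ngrams(preprocess(d))) for d in docs]
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({t: (k / total) * math.log(n_docs / doc_freq[t])
                       for t, k in c.items()})
    return result

docs = ["The cat sat on the mat", "The dog sat on the log"]
weights = tfidf(docs)
```

A term that appears in every document, such as "sat" here, gets weight 0, while terms unique to one document keep a positive weight. Library implementations usually apply smoothing and normalization, so their numbers will differ from this sketch.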

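Wednesday, October 14, 2020

[Memo] How to Win a Data Science Competition: Learn from Top Kagglers, Week 1, Part 5

● Ways to handle missing values

1. Replace them with -999 or -1

2. Replace them with the mean or median

3. Estimate the value (for time series, linear interpolation and the like)

4. Add a new binary feature (e.g. is_null) that is 1 where the value is NaN

Caution: when a numerical feature with missing values is used to encode a categorical feature, replacing the missing values with -999 before encoding can produce values that are meaninglessly large in magnitude. Do not use the missing values in the encoding.

⇒ Take care with missing values when generating features

  (do not impute missing values before feature generation)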
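The caution above can be illustrated with a small sketch. The data and the helper mean_by_category are hypothetical; the point is only to compare an encoding computed while skipping missing values with one computed after a -999 fill.

```python
import math
from statistics import mean, median

# Hypothetical data: a category per row and a numerical feature with gaps.
category = ["a", "a", "b", "b", "b"]
value = [10.0, float("nan"), 20.0, 30.0, float("nan")]

# Option 4: binary indicator that is 1 where the value is missing.
is_null = [int(math.isnan(v)) for v in value]

# Option 2: fill with the median of the observed values.
observed = [v for v in value if not math.isnan(v)]
filled_median = [median(observed) if math.isnan(v) else v for v in value]

def mean_by_category(cats, vals):
    """Mean of vals per category, skipping NaN entries."""
    groups = {}
    for c, v in zip(cats, vals):
        if not math.isnan(v):
            groups.setdefault(c, []).append(v)
    return {c: mean(vs) for c, vs in groups.items()}

# Encoding computed on the raw feature: missing rows are simply ignored.
encoding_raw = mean_by_category(category, value)

# Encoding computed after filling with -999: the sentinel distorts the means.
filled_sentinel = [-999.0 if math.isnan(v) else v for v in value]
encoding_sentinel = mean_by_category(category, filled_sentinel)
```

Here encoding_raw gives 10.0 for "a" and 25.0 for "b", whereas the sentinel version gives -494.5 for "a", a number that reflects the fill value rather than the data.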

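Tuesday, October 13, 2020

[Memo] How to Win a Data Science Competition: Learn from Top Kagglers, Week 1, Part 4

● Features derived from datetime

Days elapsed since a holiday or some event

Days remaining until a holiday or some event

Number of days between two events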
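The three features above could be computed along these lines. The event calendar and the function names are made up for illustration; the sketch assumes the event list is sorted.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical event calendar (sorted); any list of holidays or events would do.
events = [date(2020, 1, 1), date(2020, 5, 5), date(2020, 12, 25)]

def days_since_last_event(day, events):
    """Days elapsed since the most recent event on or before `day`, else None."""
    i = bisect_right(events, day)
    return (day - events[i - 1]).days if i > 0 else None

def days_until_next_event(day, events):
    """Days remaining until the next event on or after `day`, else None."""
    i = bisect_left(events, day)
    return (events[i] - day).days if i < len(events) else None

day = date(2020, 5, 1)
since = days_since_last_event(day, events)   # counted from 2020-01-01
until = days_until_next_event(day, events)   # counted to 2020-05-05
between = (events[1] - events[0]).days       # days between two events
```

Returning None when there is no earlier or later event leaves the choice of how to treat that gap to the later missing-value handling.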

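● Features derived from coordinates

Extracting interesting locations from the train/test data

Adding external data

Distance to a particular location or facility

Distance to cluster centers obtained by clustering

Statistics of the surrounding area (mean price, etc.)

Rotating the coordinate system can also help

(e.g. when price trends differ on either side of a street)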
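As a rough sketch of the distance and rotation ideas, assuming planar (x, y) coordinates and a hypothetical street running along the line y = x:

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def rotate(p, degrees):
    """Rotate point p counterclockwise about the origin by `degrees`."""
    t = math.radians(degrees)
    x, y = p
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

# Hypothetical coordinates: a property and a facility of interest.
house = (3.0, 4.0)
station = (0.0, 0.0)
dist_to_station = distance(house, station)

# A street along y = x splits the plane diagonally, which takes many
# axis-aligned splits to express. After rotating by -45 degrees the street
# lies on the new x axis, so the sign of the new y tells the two sides apart.
above_street = rotate((1.0, 2.0), -45)
below_street = rotate((2.0, 1.0), -45)
```

With the rotated coordinates, a single threshold on the new y coordinate separates the two sides of the street, which is the kind of split a tree-based model finds easily.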

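[Memo] How to Win a Data Science Competition: Learn from Top Kagglers, Week 1, Part 3

● Handling categorical features

Label encoding: works well with tree-based models, but not with non-tree-based models.

  Replacing strings with numbers in alphabetical order is also label encoding.

  sklearn.preprocessing.LabelEncoder

Order of appearance: encodes values in the order they first appear.

  pandas.factorize

  Passing sort=True gives alphabetical order (equivalent to label encoding)

Frequency encoding: also works with non-tree-based models.

  However, two values with the same frequency cannot be told apart.

One-hot encoding: works with non-tree-based models.

  The minimum is 0 and the maximum is 1, so the features are already scaled.

  With many unique values (most values occurring close to once), one-hot encoding produces many columns that are almost entirely 0, so they need to be handled as a sparse matrix.

  XGBoost and LightGBM can handle sparse matrices directly

  (CatBoost reportedly cannot:

  https://cocon-corporation.com/cocontoco/ensemble-methods_sparse-matrix_memory/#index2)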
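The four encodings can be compared on a toy column. This is a plain-Python sketch with made-up data; in practice the library functions named above (sklearn.preprocessing.LabelEncoder, pandas.factorize) would normally be used instead.

```python
from collections import Counter

# Hypothetical categorical column.
colors = ["red", "blue", "red", "green", "red", "blue"]

# Label encoding: integer codes in alphabetical order.
label_map = {v: i for i, v in enumerate(sorted(set(colors)))}
label_encoded = [label_map[v] for v in colors]

# Order of appearance: integer codes in the order values first occur.
appearance_map = {}
for v in colors:
    appearance_map.setdefault(v, len(appearance_map))
appearance_encoded = [appearance_map[v] for v in colors]

# Frequency encoding: each value becomes its share of the rows.
counts = Counter(colors)
frequency_encoded = [counts[v] / len(colors) for v in colors]

# One-hot encoding: one 0/1 column per unique value (alphabetical columns).
columns = sorted(set(colors))
one_hot = [[int(v == c) for c in columns] for v in colors]
```

If two colors had the same count, frequency_encoded would map both to the same number, which is the collision mentioned in the notes.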

To account for interactions between several categorical features

⇒ build a feature by concatenating the two strings, then one-hot encode it

⇒ useful for linear models and kNN
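A minimal sketch of the concatenate-then-encode idea, using two hypothetical columns (pclass and sex are illustrative names, not taken from the notes):

```python
# Hypothetical rows with two categorical features.
pclass = ["1", "3", "3", "1"]
sex = ["male", "female", "male", "male"]

# Concatenate the two values into a single combined category.
combined = [f"{p}_{s}" for p, s in zip(pclass, sex)]

# One-hot encode the combined category (alphabetical column order).
columns = sorted(set(combined))
one_hot = [[int(v == c) for c in columns] for v in combined]
```

Each combined column lets a linear model assign a separate weight to a specific pair of values, which the two original one-hot encodings cannot express on their own.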

Monday, October 12, 2020

[Memo] How to Win a Data Science Competition: Learn from Top Kagglers, Week 1, Part 2

● Preprocessing of numerical features

Handling outliers:

① Clipping (for example, using percentiles)

import numpy as np
lower, upper = np.percentile(data, [1, 99])
data = np.clip(data, lower, upper)

② Rank transformation (better than MinMaxScaler when there are outliers)

from scipy.stats import rankdata

Treats the rank order of the values as the feature
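To show what a rank transformation does to an outlier, here is a dependency-free sketch. The function average_ranks is a hypothetical helper written for this note; it follows the same tie rule as the default "average" method of scipy.stats.rankdata, which is what would normally be used.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

# Hypothetical feature with one extreme outlier.
data = [3.0, 1.0, 1000.0, 2.0, 2.0]
ranks = average_ranks(data)
```

The outlier 1000.0 ends up only one rank above 3.0, so it no longer stretches the scale the way it would under min-max scaling.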

Preprocessing that can improve models other than decision-tree-based ones (neural networks in particular):

① Log transform

np.log(1 + x)

② Square-root transform

np.sqrt(x + 2/3)
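The effect of the two transforms on a skewed feature can be checked with a short sketch. The sample values are invented, and the standard-library math module is used here in place of NumPy so the example runs without extra packages.

```python
import math

# Hypothetical right-skewed, non-negative feature.
x = [0.0, 1.0, 10.0, 100.0, 1000.0]

# Log transform; log1p(v) computes log(1 + v) and is accurate for small v.
log_x = [math.log1p(v) for v in x]

# Square-root transform with the 2/3 offset used in the notes.
sqrt_x = [math.sqrt(v + 2 / 3) for v in x]

# Both transforms shrink the range spanned by the data.
raw_range = max(x) - min(x)
log_range = max(log_x) - min(log_x)
sqrt_range = max(sqrt_x) - min(sqrt_x)
```

On this sample the log transform compresses the large values more strongly than the square root does, while both preserve the ordering of the values.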

[Memo] How to Win a Data Science Competition: Learn from Top Kagglers, Week 1, Part 1

● Linear model

Linear models are particularly well suited to sparse, high-dimensional data.

● Tree-based model

Keep in mind that for Tree-Based Methods, it's hard to capture linear dependencies since it requires a lot of splits.

● k-NN

k-NN approach heavily relies on how to measure point closeness.
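A small sketch of why the distance measure matters for k-NN: changing the unit of one feature changes which point counts as nearest. The points, units, and the helper nearest are hypothetical.

```python
import math

def nearest(query, points):
    """Index of the point closest to `query` in Euclidean distance."""
    return min(
        range(len(points)),
        key=lambda i: math.hypot(query[0] - points[i][0], query[1] - points[i][1]),
    )

# Hypothetical points: (height in metres, weight in kilograms).
points = [(1.60, 70.0), (1.95, 71.0)]
query = (1.62, 72.0)

# In raw units the weight axis dominates the distance.
raw_neighbor = nearest(query, points)

# Same data with height expressed in centimetres instead of metres.
scaled_points = [(h * 100, w) for h, w in points]
scaled_query = (query[0] * 100, query[1])
scaled_neighbor = nearest(scaled_query, scaled_points)
```

The nearest neighbor flips from the second point to the first once height is measured in centimetres, even though the underlying data are identical.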

● NN (neural networks)

Feed-forward Neural Nets are harder to interpret but they produce smooth non-linear decision boundary.

● Random Forests

Each tree in forest is independent from the others, so two RF with 500 trees is essentially the same as single RF model with 1000 trees.
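For regression, a forest's prediction is the average of its trees' predictions, so the statement can be checked numerically. The sketch below replaces real trees with a stand-in (a true value plus independent noise), which is an illustrative assumption and not an actual random forest.

```python
import random
from statistics import mean

rng = random.Random(0)

def tree_prediction():
    """Stand-in for one tree: the true value 5.0 plus independent noise."""
    return 5.0 + rng.gauss(0.0, 1.0)

forest_a = [tree_prediction() for _ in range(500)]
forest_b = [tree_prediction() for _ in range(500)]

# Average the two 500-tree forests, then compare with one 1000-tree average.
two_forests = (mean(forest_a) + mean(forest_b)) / 2
one_forest = mean(forest_a + forest_b)
```

Because both forests have the same number of trees, averaging their two outputs gives the same number as averaging all 1000 trees at once, up to floating-point rounding.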

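Friday, November 1, 2019

Installing pgplot on Ubuntu 16.04

sudo apt-get install libx11-dev
Run this first, then compile.

Monday, October 28, 2019

Reading data in Fortran from a file with an unknown number of lines

read(unit_number, *, END=999)
Write the read statement like this, and put the handling for reaching the end of the file on the line labeled 999.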
