SETI-BL: Channel-Wise Alignment (TensorFlow & TPU)

In this example, we use data-science skills to help identify anomalous signals in scans of Breakthrough Listen targets. Because there are no confirmed examples of alien signals to train machine-learning algorithms on, a number of simulated signals (called "needles") have been added to the haystack of telescope data. Some of the hidden needles have already been identified so that you can train a model to find more of them. The data consist of two-dimensional arrays, so methods from computer vision, digital signal processing, anomaly detection, and more may prove useful.

Methodology

  • This example demonstrates how to classify the signals with a CNN.
  • Although this looks like a signal-detection problem, we can turn it into an image-classification problem. A 1D signal can be converted into a 2D one (an image) in several ways, for example with a Fourier transform, which essentially moves the data from the time domain to the frequency domain; so from [time, intensity] we get [time, frequency, intensity] (a short illustration follows this list).
  • The example trains from precomputed TFRecords. TFRecords can scale training up by roughly 2x, and with a TPU as the device this saves a lot of time.
  • For augmentation, it uses rotation, flips, shear, zoom, shifts, coarse dropout, and random jitter.
  • The example uses EfficientNet as the core model (classifier).
  • We then compute the Out-of-Fold (OOF) score for the run.
  • Finally, we submit the model's predictions.
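
As a rough illustration of the 1D-to-2D conversion mentioned above (not part of this notebook's pipeline, which starts from precomputed spectrograms), a short-time Fourier transform turns a [time, intensity] series into a [time, frequency, intensity] array. The sampling rate and window sizes below are arbitrary placeholders:

import numpy as np
from scipy.signal import spectrogram

signal_1d = np.random.randn(16384)           # hypothetical [time, intensity] series
freqs, times, sxx = spectrogram(signal_1d,
                                fs=1024,      # assumed sampling rate
                                nperseg=256,  # window length
                                noverlap=128) # window overlap
print(sxx.shape)  # (frequency bins, time steps): a 2D array a CNN can treat as an image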

Packages and Configuration

import os
import shutil
from glob import glob
from kaggle_datasets import KaggleDatasets
import tensorflow as tf, re, math
import tensorflow.keras.backend as K
import efficientnet.tfkeras as efn
import tensorflow_addons as tfa
import pandas as pd
import numpy as np
import random
import sklearn
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # to avoid too many logging messages
# USE VERBOSE=0 for silent, VERBOSE=1 for interactive, VERBOSE=2 for commit
VERBOSE = 0
DISPLAY_PLOT = True

DEVICE = "TPU" #or "GPU"

# USE A DIFFERENT SEED FOR A DIFFERENT KFOLD SPLIT
SEED = 42

# NUMBER OF FOLDS. USE 2, 5, 10
FOLDS = 5

# WHICH IMAGE SIZES TO LOAD EACH FOLD
IMG_SIZES = [[273, 256]]*FOLDS

# BATCH SIZE AND EPOCHS
BATCH_SIZES = [32]*FOLDS
EPOCHS = [12]*FOLDS

# WHICH EFFICIENTNET B? TO USE
EFF_NETS = [6]*FOLDS

# Augmentations
AUGMENT = True
TRANSFORM = True

# AFFINE TRANSFORMATION
ROT_ = 0.0 # ROTATION (DEGREES)
SHR_ = 2.0 # SHEAR (DEGREES)
HZOOM_ = 8.0 # HEIGHT ZOOM
WZOOM_ = 8.0 # WIDTH ZOOM
HSHIFT_ = 8.0 # HEIGHT SHIFT
WSHIFT_ = 8.0 # WIDTH SHIFT

# COARSE DROPOUT - MORE ON THIS LATER
PROBABILITY = 0.50 # PROBABILITY OF APPLYING DROPOUT
CT = 8 # NUMBER OF DROPOUT RECTANGLES
SZ = 0.05 # SIZE OF EACH RECTANGLE, AS A FRACTION OF THE IMAGE DIMENSION

# BRIGHTNESS, SATURATION, CONTRAST RANGES
sat = (0.7, 1.3) # SATURATION
cont = (0.8, 1.2) # CONTRAST
bri = 0.1 # BRIGHTNESS

# WEIGHTS FOR FOLD MODELS WHEN PREDICTING TEST
WGTS = [1/FOLDS]*FOLDS

# TEST TIME AUGMENTATION STEPS
TTA = 11

# Reproducibility
def seeding(SEED):
    np.random.seed(SEED)
    random.seed(SEED)
    os.environ['PYTHONHASHSEED'] = str(SEED)
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'  # request deterministic cuDNN kernels
    tf.random.set_seed(SEED)
    print('seeding done!!!')

seeding(SEED)

The following code automatically detects the hardware (TPU, GPU, or CPU):

  • What is a TPU? TPUs are hardware accelerators specialized for deep learning tasks. They come with 128 GB of high-speed memory, which allows larger batches, larger models, and larger training inputs.
if DEVICE == "TPU":
    print("connecting to TPU...")
    try:
        # detect and init the TPU
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Running on TPU ', tpu.master())
    except ValueError:
        print("Could not connect to TPU")
        tpu = None

    if tpu:
        try:
            print("initializing TPU ...")
            tf.config.experimental_connect_to_cluster(tpu)
            tf.tpu.experimental.initialize_tpu_system(tpu)
            # instantiate a distribution strategy
            strategy = tf.distribute.experimental.TPUStrategy(tpu)
            print("TPU initialized")
        except Exception:
            print("failed to initialize TPU")
    else:
        DEVICE = "GPU"

if DEVICE != "TPU":
    print("Using default strategy for CPU and single GPU")
    # instantiate a distribution strategy
    strategy = tf.distribute.get_strategy()

if DEVICE == "GPU":
    print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

AUTO = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')

# connecting to TPU...
# Running on TPU grpc://10.0.0.2:8470
# initializing TPU ...
# TPU initialized
# REPLICAS: 8

Preprocessing

  • The raw signal has 6 channels; as noted, only 3 of them are taken here (a loading sketch follows the metadata code below).
  • No signal processing has been applied so far; that is left to the model.
  • Simple augmentations are used, some of which may hurt the model.
GCS_PATH = [None]*FOLDS
for i, k in enumerate(IMG_SIZES):
    GCS_PATH[i] = KaggleDatasets().get_gcs_path('setibl-%ix%i-tfrec-dataset' % (k[0], k[1]))
files_train = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')))
files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[0] + '/test*.tfrec')))
num_train_files = len(files_train)
num_test_files = len(files_test)
print('train_files:',num_train_files)
print('test_files:',num_test_files)

# train_files: 20
# test_files: 10

Metadata

  • train/ - the training set of cadence snippet files, stored in numpy float16 format (v1.20.1), one file per cadence snippet id, with the corresponding labels found in train_labels.csv. Each file has dimensions (6, 273, 256): the first dimension is the 6 positions in the cadence, and the second and third dimensions form the 2D spectrogram.
  • test/ - the test set cadence snippet files; you must predict whether or not the cadence contains a "needle", which is the target for this competition.
  • sample_submission.csv - a sample submission file in the correct format.
  • train_labels - targets corresponding (by id) to the cadence snippet files found in the train/ folder.
  • old_leaky_data - the complete pre-relaunch data, including test labels.
train_label_df = pd.read_csv('../input/c/seti-breakthrough-listen/train_labels.csv')
test_label_df = pd.read_csv('../input/c/seti-breakthrough-listen/sample_submission.csv')

train_paths = glob('../input/c/seti-breakthrough-listen/train/**/*.npy')
test_paths = glob('../input/c/seti-breakthrough-listen/test/**/*.npy')

train_df = pd.DataFrame({'filepath':train_paths})
train_df['id'] = train_df.filepath.map(lambda x: x.split('/')[-1].split('.')[0])
train_df['group'] = train_df.filepath.map(lambda x: x.split('/')[-2])
train_df = pd.merge(train_df, train_label_df, on='id', how='left')

print(f'num_train: {len(train_paths)}\nnum_test : {len(test_paths)}')

# num_train: 60000
# num_test : 39995
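
The preprocessing notes above mention keeping only 3 of the 6 cadence channels. Below is a minimal sketch of building a 3-channel image from one cadence file; which 3 channels were actually used is an assumption here (the ON-target observations at cadence positions 0, 2, 4):

# Hypothetical sketch: stack 3 of the 6 observations of one cadence into an image-like array.
cadence = np.load(train_paths[0]).astype(np.float32)           # (6, 273, 256), float16 on disk
img = np.stack([cadence[0], cadence[2], cadence[4]], axis=-1)  # (273, 256, 3), assumed ON channels
print(cadence.shape, '->', img.shape)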

Class Distribution

The TFRecord dataset consists of 20 files. For a proper validation scheme, each file was stratified on the target & group features (a sketch of how such sharding could be produced follows).
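This is only a guess at how the precomputed dataset was built: a composite target/group key with StratifiedKFold can assign each example to one of 20 shards while preserving the class mix.

# Hypothetical sketch: assign each training example to one of 20 shards while
# preserving the overall target/group mix in every shard.
from sklearn.model_selection import StratifiedKFold

df = train_df.copy()
strat_key = df.target.astype(str) + '_' + df.group.astype(str)
shard_splitter = StratifiedKFold(n_splits=20, shuffle=True, random_state=SEED)
df['shard'] = -1
for shard_id, (_, idx) in enumerate(shard_splitter.split(df, strat_key)):
    df.loc[idx, 'shard'] = shard_id
print(df.groupby('shard').target.mean().round(3).head())  # target rate should be similar across shards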

from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
import plotly.express as px
init_notebook_mode(connected=True)
import plotly.graph_objs as go

fig = go.Figure(data=[
    go.Bar(name='haystack',
           y=train_df.target.value_counts().values[0:1],
           x=['haystack'],
           text=train_df.target.value_counts()[0:1],
           orientation='v',
           textposition='outside',),
    go.Bar(name='needle',
           y=train_df.target.value_counts().values[1:],
           x=['needle'],
           text=train_df.target.value_counts()[1:],
           orientation='v',
           textposition='outside',)
])
# Change the bar mode
fig.update_layout(
    width=800,
    height=600,
    title='Class Distribution',
    yaxis_title='Number of Images',
    xaxis_title='Class Name',)
iplot(fig)

Data Augmentation

Simple augmentations are used; some of them may hurt the model:

  • Random flip (left-right)
  • No rotation
  • Random brightness
  • Random contrast
  • Shear
  • Zoom
  • Coarse dropout / cutout

Since this is not typical image data but a signal, standard image augmentations can do some harm.

# Augmentations by @cdeotte
def get_mat(rotation, shear, height_zoom, width_zoom, height_shift, width_shift):
    # returns a 3x3 transformation matrix which transforms indices

    # CONVERT DEGREES TO RADIANS
    rotation = math.pi * rotation / 180.
    shear = math.pi * shear / 180.

    def get_3x3_mat(lst):
        return tf.reshape(tf.concat([lst], axis=0), [3, 3])

    # ROTATION MATRIX
    c1 = tf.math.cos(rotation)
    s1 = tf.math.sin(rotation)
    one = tf.constant([1], dtype='float32')
    zero = tf.constant([0], dtype='float32')

    rotation_matrix = get_3x3_mat([c1, s1, zero,
                                   -s1, c1, zero,
                                   zero, zero, one])
    # SHEAR MATRIX
    c2 = tf.math.cos(shear)
    s2 = tf.math.sin(shear)

    shear_matrix = get_3x3_mat([one, s2, zero,
                                zero, c2, zero,
                                zero, zero, one])
    # ZOOM MATRIX
    zoom_matrix = get_3x3_mat([one/height_zoom, zero, zero,
                               zero, one/width_zoom, zero,
                               zero, zero, one])
    # SHIFT MATRIX
    shift_matrix = get_3x3_mat([one, zero, height_shift,
                                zero, one, width_shift,
                                zero, zero, one])

    return K.dot(K.dot(rotation_matrix, shear_matrix), K.dot(zoom_matrix, shift_matrix))

def transform(image, DIM=IMG_SIZES[0]):
    # input image - is one image of size [dim,dim,3] not a batch of [b,dim,dim,3]
    # output - image randomly rotated, sheared, zoomed, and shifted

    # fixed for non-square images thanks to Chris Deotte

    if DIM[0] != DIM[1]:
        pad = (DIM[0] - DIM[1]) // 2
        image = tf.pad(image, [[0, 0], [pad, pad+1], [0, 0]])

    NEW_DIM = DIM[0]

    XDIM = NEW_DIM % 2  # fix for odd sizes such as 331

    rot = ROT_ * tf.random.normal([1], dtype='float32')
    shr = SHR_ * tf.random.normal([1], dtype='float32')
    h_zoom = 1.0 + tf.random.normal([1], dtype='float32') / HZOOM_
    w_zoom = 1.0 + tf.random.normal([1], dtype='float32') / WZOOM_
    h_shift = HSHIFT_ * tf.random.normal([1], dtype='float32')
    w_shift = WSHIFT_ * tf.random.normal([1], dtype='float32')

    # GET TRANSFORMATION MATRIX
    m = get_mat(rot, shr, h_zoom, w_zoom, h_shift, w_shift)

    # LIST DESTINATION PIXEL INDICES
    x = tf.repeat(tf.range(NEW_DIM//2, -NEW_DIM//2, -1), NEW_DIM)
    y = tf.tile(tf.range(-NEW_DIM//2, NEW_DIM//2), [NEW_DIM])
    z = tf.ones([NEW_DIM*NEW_DIM], dtype='int32')
    idx = tf.stack([x, y, z])

    # ROTATE DESTINATION PIXELS ONTO ORIGIN PIXELS
    idx2 = K.dot(m, tf.cast(idx, dtype='float32'))
    idx2 = K.cast(idx2, dtype='int32')
    idx2 = K.clip(idx2, -NEW_DIM//2+XDIM+1, NEW_DIM//2)

    # FIND ORIGIN PIXEL VALUES
    idx3 = tf.stack([NEW_DIM//2-idx2[0,], NEW_DIM//2-1+idx2[1,]])
    d = tf.gather_nd(image, tf.transpose(idx3))

    image = tf.reshape(d, [NEW_DIM, NEW_DIM, 3])
    if DIM[0] != DIM[1]:
        image = image[:, pad:DIM[1]+pad, :]  # crop the padding back off
    image = tf.reshape(image, [*DIM, 3])

    return image

Dropout

Coarse dropout and cutout augmentations are techniques to prevent overfitting and encourage generalization. They randomly remove rectangles from the training images. By removing part of the image, we encourage the model to attend to the whole image, since it never knows which part will be present. (This is similar to, but distinct from, the dropout layers inside a CNN.)

  • Cutout removes a single rectangle of random size.
  • Coarse dropout removes many small rectangles of similar size.

By changing the parameters below, we can perform either coarse dropout or cutout.

# by @cdeotte
def dropout(image, DIM=IMG_SIZES[0], PROBABILITY=0.6, CT=5, SZ=0.1):
    # input image - is one image of size [dim,dim,3] not a batch of [b,dim,dim,3]
    # output - image with CT squares of side size SZ*DIM removed

    # DO DROPOUT WITH PROBABILITY DEFINED ABOVE
    P = tf.cast(tf.random.uniform([], 0, 1) < PROBABILITY, tf.int32)
    if (P == 0) | (CT == 0) | (SZ == 0):
        return image

    for k in range(CT):
        # CHOOSE RANDOM LOCATION
        x = tf.cast(tf.random.uniform([], 0, DIM[1]), tf.int32)
        y = tf.cast(tf.random.uniform([], 0, DIM[0]), tf.int32)
        # COMPUTE SQUARE
        WIDTH = tf.cast(SZ * min(DIM), tf.int32) * P
        ya = tf.math.maximum(0, y - WIDTH//2)
        yb = tf.math.minimum(DIM[0], y + WIDTH//2)
        xa = tf.math.maximum(0, x - WIDTH//2)
        xb = tf.math.minimum(DIM[1], x + WIDTH//2)
        # DROPOUT IMAGE
        one = image[ya:yb, 0:xa, :]
        two = tf.zeros([yb-ya, xb-xa, 3], dtype=image.dtype)
        three = image[ya:yb, xb:DIM[1], :]
        middle = tf.concat([one, two, three], axis=1)
        image = tf.concat([image[0:ya, :, :], middle, image[yb:DIM[0], :, :]], axis=0)

    image = tf.reshape(image, [*DIM, 3])
    return image

Reading TFRecord Data

So, have you ever wondered why people use TFRecords?

  • The TFRecord (.tfrecord/.tfrec) format is TensorFlow's own format for storing a sequence of binary records.
  • TFRecord data takes up less space on disk and takes less time to read from and write to disk.
  • Using TFRecords has several advantages:
    • More efficient storage
    • Fast input/output
    • Self-contained files
    • TPUs require the data to be fed to them in TFRecord format

To build a data pipeline from tfrecords we need the tf.data API. So, what is tf.data?

  • The tf.data API lets us build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random transformations to each image, and merge randomly selected images into a batch for training.
  • The tf.data API provides the tf.data.Dataset abstraction, which represents a sequence of elements where each element consists of one or more components. In an image pipeline, for example, an element might be a single training example with a pair of tensor components representing the image and its label.
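
For reference, here is a minimal sketch of how examples matching the parsing functions below could be serialized into a TFRecord file. The feature keys mirror what the readers expect; everything else (the file name, the png_bytes/image_id/target variables) is a placeholder, since this notebook uses precomputed tfrecords:

# Illustrative writer sketch, not part of this notebook.
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_example(png_bytes, image_id, target):
    feature = {
        'image'   : _bytes_feature(png_bytes),                 # PNG-encoded spectrogram image
        'image_id': _bytes_feature(image_id.encode('utf-8')),  # cadence id
        'target'  : _int64_feature(int(target)),               # 0 = haystack, 1 = needle
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# with tf.io.TFRecordWriter('train00-3000.tfrec') as writer:   # placeholder file name
#     writer.write(serialize_example(png_bytes, image_id, target))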
def read_labeled_tfrecord(example):
    tfrec_format = {
        'image'    : tf.io.FixedLenFeature([], tf.string),
        'image_id' : tf.io.FixedLenFeature([], tf.string),
        'target'   : tf.io.FixedLenFeature([], tf.int64)
    }
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], example['target']


def read_unlabeled_tfrecord(example, return_image_id):
    tfrec_format = {
        'image'    : tf.io.FixedLenFeature([], tf.string),
        'image_id' : tf.io.FixedLenFeature([], tf.string),
    }
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], example['image_id'] if return_image_id else 0


def prepare_image(img, augment=True, dim=IMG_SIZES[0]):
    img = tf.image.decode_png(img, channels=3)
    img = tf.cast(img, tf.float32)
    img = img / 255.0  # rescale image to [0, 1]

    if augment:
        img = transform(img, DIM=dim) if TRANSFORM else img
        img = tf.image.random_flip_left_right(img)
        #img = tf.image.random_hue(img, 0.01)
        img = tf.image.random_saturation(img, sat[0], sat[1])
        img = tf.image.random_contrast(img, cont[0], cont[1])
        img = tf.image.random_brightness(img, bri)

    img = tf.reshape(img, [*dim, 3])

    return img

def count_data_items(fileids):
    # the item count is encoded in the filename, e.g. 'train00-3000.tfrec' holds 3000 items
    n = [int(re.compile(r"-([0-9]*)\.").search(fileid).group(1))
         for fileid in fileids]
    return np.sum(n)

Data Pipeline

  • Read the TFRecord files.
  • Cache the data to speed up training.
  • Repeat the data stream (for training and test-time augmentation only).
  • Shuffle the data (for training only).
  • Parse the TFRecord data.
  • Decode the data from byte strings into image data.
  • Process the image data (rescale).
  • Apply augmentations.
  • Batch the data.
def get_dataset(files, augment=False, shuffle=False, repeat=False,
                labeled=True, return_image_ids=True, batch_size=16, dim=IMG_SIZES[0]):

    ds = tf.data.TFRecordDataset(files, num_parallel_reads=AUTO)  # read tfrecords
    ds = ds.cache()  # cache data for speedup

    if repeat:
        ds = ds.repeat()  # repeat the data (for training and TTA only)

    if shuffle:
        ds = ds.shuffle(1024*2, seed=SEED)  # shuffle the data (for training only)
        opt = tf.data.Options()
        opt.experimental_deterministic = False  # order won't be maintained when we shuffle
        ds = ds.with_options(opt)

    if labeled:
        ds = ds.map(read_labeled_tfrecord, num_parallel_calls=AUTO)  # parse tfrecord data with labels
    else:
        ds = ds.map(lambda example: read_unlabeled_tfrecord(example, return_image_ids),
                    num_parallel_calls=AUTO)  # parse tfrecord data without labels

    ds = ds.map(lambda img, imgid_or_label: (prepare_image(img, augment=augment, dim=dim),
                                             imgid_or_label),  # decode bytestring to image, apply augmentations
                num_parallel_calls=AUTO)
    if labeled and augment:
        ds = ds.map(lambda img, label: (dropout(img, DIM=dim, PROBABILITY=PROBABILITY, CT=CT, SZ=SZ), label),
                    num_parallel_calls=AUTO)  # use coarse dropout only in training

    ds = ds.batch(batch_size * REPLICAS)  # batch the data
    ds = ds.prefetch(AUTO)  # prefetch data for speedup
    return ds

Visualization

def display_batch(batch, size=3):
    imgs, tars = batch
    for img_idx in range(size):
        plt.figure(figsize=(5*2, 15*2))
        for idx in range(3):
            plt.subplot(size, 3, idx+1)
            plt.title(f'id:{tars[img_idx].numpy().decode("utf-8")}')
            plt.imshow(imgs[img_idx, :, :, idx])
            plt.text(5, 15, str(idx), bbox={'facecolor': 'white'})
            plt.xticks([])
            plt.yticks([])
        plt.tight_layout()
        plt.show()

fold = 0
ds = get_dataset(files_train, augment=True, shuffle=False, repeat=True, labeled=False, return_image_ids=True,
                 dim=IMG_SIZES[fold], batch_size=BATCH_SIZES[fold])
ds = ds.unbatch().batch(20)
batch = next(iter(ds))
display_batch(batch, 3)

Building the Model

You can also try other models, for example (a backbone-swap sketch follows this list):

  • Vision Transformer (ViT)
  • ResNet
  • InceptionNet
  • XceptionNet
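
Below is a minimal sketch of what swapping the backbone could look like, here with the Keras-bundled ResNet50 instead of EfficientNet; this variant is illustrative only and is not used anywhere else in the notebook:

# Hypothetical build_model variant with a ResNet50 backbone.
def build_resnet_model(dim=IMG_SIZES[0]):
    inp = tf.keras.layers.Input(shape=(*dim, 3))
    base = tf.keras.applications.ResNet50(input_shape=(*dim, 3),
                                          weights='imagenet', include_top=False)
    x = base(inp)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(32, activation='relu')(x)
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(inputs=inp, outputs=x)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=0.01),
                  metrics=['AUC'])
    return model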
EFNS = [efn.EfficientNetB0, efn.EfficientNetB1, efn.EfficientNetB2, efn.EfficientNetB3,
        efn.EfficientNetB4, efn.EfficientNetB5, efn.EfficientNetB6, efn.EfficientNetB7]

def build_model(dim=IMG_SIZES[0], ef=0):
    inp = tf.keras.layers.Input(shape=(*dim, 3))  # input layer with the proper image dimensions
    base = EFNS[ef](input_shape=(*dim, 3), weights='imagenet', include_top=False)  # base model (EfficientNet) with imagenet weights
    x = base(inp)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)  # use GAP to pool the conv feature maps
    x = tf.keras.layers.Dense(32, activation='relu')(x)  # dense layer with non-linearity
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)  # sigmoid maps predictions to [0, 1]
    model = tf.keras.Model(inputs=inp, outputs=x)
    opt = tf.keras.optimizers.Adam(learning_rate=0.001)
    loss = tf.keras.losses.BinaryCrossentropy(label_smoothing=0.01)  # label smoothing for robustness
    model.compile(optimizer=opt, loss=loss, metrics=['AUC'])
    return model

Learning Rate Schedule

  • This is a common training schedule for transfer learning.
  • The learning rate starts near zero, increases to a maximum, and then decays over time. Consider changing the schedule and/or the learning rates. Note that as the batch size increases, the maximum learning rate increases as well.
def get_lr_callback(batch_size=8, plot=False):
    lr_start = 0.000005
    lr_max = 0.00000125 * REPLICAS * batch_size
    lr_min = 0.000001
    lr_ramp_ep = 5
    lr_sus_ep = 0
    lr_decay = 0.8

    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start

        elif epoch < lr_ramp_ep + lr_sus_ep:
            lr = lr_max

        else:
            lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min

        return lr

    if plot:
        plt.figure(figsize=(10, 5))
        plt.plot(np.arange(EPOCHS[0]), [lrfn(epoch) for epoch in np.arange(EPOCHS[0])], marker='o')
        plt.xlabel('epoch'); plt.ylabel('learning rate')
        plt.title('Learning Rate Scheduler')
        plt.show()

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=False)
    return lr_callback

_ = get_lr_callback(BATCH_SIZES[0], plot=True)

Training

Our models are trained for the number of FOLDS and EPOCHS chosen in the configuration above. For each fold, the model with the highest validation AUC is saved and used to predict OOF and test (a short note on how the TTA predictions are averaged follows).
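
One detail worth calling out before the loop: the TTA datasets below use repeat=True, so model.predict returns TTA complete passes over the files back to back; reshaping the prediction vector with order='F' lines up the TTA predictions of each sample before averaging. A tiny numpy sketch of that bookkeeping, with made-up numbers:

# 3 samples, 2 TTA passes: predictions arrive as pass 0 for all samples, then pass 1.
demo_pred = np.array([0.1, 0.2, 0.3,   # pass 0: samples 0, 1, 2
                      0.3, 0.4, 0.5])  # pass 1: samples 0, 1, 2
per_sample = demo_pred.reshape((3, 2), order='F')  # row i = all TTA predictions of sample i
print(per_sample.mean(axis=1))                     # [0.2 0.3 0.4] = TTA-averaged predictions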

skf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
oof_pred = []; oof_tar = []; oof_val = []; oof_f1 = []; oof_ids = []; oof_folds = []
preds = np.zeros((count_data_items(files_test), 1))

for fold, (idxT, idxV) in enumerate(skf.split(np.arange(num_train_files))):
    # DISPLAY FOLD INFO
    if DEVICE == 'TPU':
        if tpu: tf.tpu.experimental.initialize_tpu_system(tpu)

    # CREATE TRAIN AND VALIDATION SUBSETS
    files_train = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec' % x for x in idxT])
    np.random.shuffle(files_train)
    files_valid = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec' % x for x in idxV])
    files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[fold] + '/test*.tfrec')))

    print('#'*25); print('#### FOLD', fold+1)
    print('#### Image Size: (%i, %i) | model: %s | batch_size %i' %
          (IMG_SIZES[fold][0], IMG_SIZES[fold][1], EFNS[EFF_NETS[fold]].__name__, BATCH_SIZES[fold]*REPLICAS))
    train_images = count_data_items(files_train)
    val_images = count_data_items(files_valid)
    print('#### Training: %i | Validation: %i' % (train_images, val_images))

    # BUILD MODEL
    K.clear_session()
    with strategy.scope():
        model = build_model(dim=IMG_SIZES[fold], ef=EFF_NETS[fold])
    print('#'*25)

    # SAVE BEST MODEL EACH FOLD
    sv = tf.keras.callbacks.ModelCheckpoint(
        'fold-%i.h5' % fold, monitor='val_auc', verbose=0, save_best_only=True,
        save_weights_only=True, mode='max', save_freq='epoch')

    # TRAIN
    print('Training...')
    history = model.fit(
        get_dataset(files_train, augment=AUGMENT, shuffle=True, repeat=True,
                    dim=IMG_SIZES[fold], batch_size=BATCH_SIZES[fold]),
        epochs=EPOCHS[fold],
        callbacks=[sv, get_lr_callback(BATCH_SIZES[fold])],
        steps_per_epoch=count_data_items(files_train)/BATCH_SIZES[fold]//REPLICAS,
        validation_data=get_dataset(files_valid, augment=False, shuffle=False,
                                    repeat=False, dim=IMG_SIZES[fold]),
        #class_weight = {0:1,1:2},
        verbose=VERBOSE
    )

    # Loading best model for inference
    print('Loading best model...')
    model.load_weights('fold-%i.h5' % fold)

    # PREDICT OOF USING TTA
    print('Predicting OOF with TTA...')
    ds_valid = get_dataset(files_valid, labeled=False, return_image_ids=False, augment=AUGMENT,
                           repeat=True, shuffle=False, dim=IMG_SIZES[fold], batch_size=BATCH_SIZES[fold]*2)
    ct_valid = count_data_items(files_valid); STEPS = TTA * ct_valid/BATCH_SIZES[fold]/2/REPLICAS
    pred = model.predict(ds_valid, steps=STEPS, verbose=VERBOSE)[:TTA*ct_valid, ]
    oof_pred.append(np.mean(pred.reshape((ct_valid, TTA), order='F'), axis=1))

    # GET OOF TARGETS AND IDS
    ds_valid = get_dataset(files_valid, augment=False, repeat=False, dim=IMG_SIZES[fold],
                           labeled=True, return_image_ids=True)
    oof_tar.append(np.array([target.numpy() for img, target in iter(ds_valid.unbatch())]))
    oof_folds.append(np.ones_like(oof_tar[-1], dtype='int8')*fold)
    ds = get_dataset(files_valid, augment=False, repeat=False, dim=IMG_SIZES[fold],
                     labeled=False, return_image_ids=True)
    oof_ids.append(np.array([img_id.numpy().decode("utf-8") for img, img_id in iter(ds.unbatch())]))

    # PREDICT TEST USING TTA
    print('Predicting Test with TTA...')
    ds_test = get_dataset(files_test, labeled=False, return_image_ids=False, augment=AUGMENT,
                          repeat=True, shuffle=False, dim=IMG_SIZES[fold], batch_size=BATCH_SIZES[fold]*2)
    ct_test = count_data_items(files_test); STEPS = TTA * ct_test/BATCH_SIZES[fold]/2/REPLICAS
    pred = model.predict(ds_test, steps=STEPS, verbose=VERBOSE)[:TTA*ct_test, ]
    preds[:, 0] += np.mean(pred.reshape((ct_test, TTA), order='F'), axis=1) * WGTS[fold]

    # REPORT RESULTS
    auc = roc_auc_score(oof_tar[-1], oof_pred[-1])
    oof_val.append(np.max(history.history['val_auc']))
    print('#### FOLD %i OOF AUC without TTA = %.3f, with TTA = %.3f' % (fold+1, oof_val[-1], auc))

    # PLOT TRAINING
    if DISPLAY_PLOT:
        plt.figure(figsize=(15, 5))
        plt.plot(np.arange(len(history.history['auc'])), history.history['auc'], '-o', label='Train auc', color='#ff7f0e')
        plt.plot(np.arange(len(history.history['auc'])), history.history['val_auc'], '-o', label='Val auc', color='#1f77b4')
        x = np.argmax(history.history['val_auc']); y = np.max(history.history['val_auc'])
        xdist = plt.xlim()[1] - plt.xlim()[0]; ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x, y, s=200, color='#1f77b4'); plt.text(x-0.03*xdist, y-0.13*ydist, 'max auc\n%.2f' % y, size=14)
        plt.ylabel('auc', size=14); plt.xlabel('Epoch', size=14)
        plt.legend(loc=2)
        plt2 = plt.gca().twinx()
        plt2.plot(np.arange(len(history.history['auc'])), history.history['loss'], '-o', label='Train Loss', color='#2ca02c')
        plt2.plot(np.arange(len(history.history['auc'])), history.history['val_loss'], '-o', label='Val Loss', color='#d62728')
        x = np.argmin(history.history['val_loss']); y = np.min(history.history['val_loss'])
        ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x, y, s=200, color='#d62728'); plt.text(x-0.03*xdist, y+0.05*ydist, 'min loss', size=14)
        plt.ylabel('Loss', size=14)
        plt.title('FOLD %i - Image Size (%i, %i), %s' %
                  (fold+1, IMG_SIZES[fold][0], IMG_SIZES[fold][1], EFNS[EFF_NETS[fold]].__name__), size=18)
        plt.legend(loc=3)
        plt.savefig(f'fig{fold}.png')
        plt.show()

Computing OOF

The OOF (out-of-fold) predictions are saved to disk. If you wish to ensemble multiple models, use the OOF predictions to determine the best weights for blending: choose the weights that maximize the OOF CV score, then use those same weights to blend your test predictions (a small weight-search sketch follows the notes below).

Note

  • Don't blend just to boost the LB; most of the time that ends up overfitting.
  • Try to improve CV by blending different models, while keeping a close eye on the LB.
  • Since only 20% of the data is used to compute the LB score, relying on CV should be the safer choice.
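
As an example, here is a minimal sketch of such a weight search on OOF predictions, assuming two hypothetical OOF files oof_model_a.csv and oof_model_b.csv with matching image_id, target, and pred columns:

# Hypothetical blend-weight search: keep the weight that maximizes OOF AUC,
# then apply the same weight to the two models' test predictions.
oof_a = pd.read_csv('oof_model_a.csv').sort_values('image_id').reset_index(drop=True)
oof_b = pd.read_csv('oof_model_b.csv').sort_values('image_id').reset_index(drop=True)

best_w, best_auc = 0.0, 0.0
for w in np.linspace(0, 1, 101):
    blend = w * oof_a.pred + (1 - w) * oof_b.pred
    score = roc_auc_score(oof_a.target, blend)
    if score > best_auc:
        best_w, best_auc = w, score
print('best weight %.2f -> OOF AUC %.4f' % (best_w, best_auc))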
# COMPUTE OVERALL OOF AUC
oof = np.concatenate(oof_pred); true = np.concatenate(oof_tar);
ids = np.concatenate(oof_ids); folds = np.concatenate(oof_folds)
auc = roc_auc_score(true,oof)
print('Overall OOF AUC with TTA = %.3f'%auc)

# SAVE OOF TO DISK
df_oof = pd.DataFrame(dict(image_id = ids, target=true, pred = oof, fold=folds))
df_oof.to_csv('oof.csv',index=False)
df_oof.head()

# Overall OOF AUC with TTA = 0.669

# image_id target pred fold
# 0 707ec47b606a3d0 0 0.083118 0
# 1 743d0594f4e2ff7 0 0.059541 0
# 2 747d5a1f5ab25a9 0 0.086644 0
# 3 786909d18574aeb 0 0.040169 0
# 4 7726ef7381e86c7 0 0.091762 0

Post-Processing

There are several ways to modify the predictions using the available metadata to improve CV/LB. You can experiment with these on the OOF predictions.
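
For instance, here is a minimal sketch of one such experiment: rank-normalizing the predictions within each fold before scoring, so folds with differently calibrated models become comparable. Whether it actually helps has to be judged on the OOF score:

# Hypothetical post-processing experiment on the OOF predictions saved above.
df_pp = df_oof.copy()
df_pp['pred_rank'] = df_pp.groupby('fold')['pred'].rank(pct=True)  # per-fold percentile rank
print('raw  OOF AUC = %.4f' % roc_auc_score(df_pp.target, df_pp.pred))
print('rank OOF AUC = %.4f' % roc_auc_score(df_pp.target, df_pp.pred_rank))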

Submission

ds = get_dataset(files_test, augment=False, repeat=False, dim=IMG_SIZES[fold],labeled=False, return_image_ids=True)
image_ids = np.array([img_id.numpy().decode("utf-8") for img, img_id in iter(ds.unbatch())])

submission = pd.DataFrame({'id':image_ids, 'target':preds[:,0]})
submission = submission.sort_values('id')
submission.to_csv('submission.csv', index=False)
submission.head()

# id target
# 27604 000bf832cae9ff1 0.096986
# 29604 000c74cc71a1140 0.081791
# 18754 000f5f9851161d3 0.085843
# 17054 000f7499e95aba6 0.128763
# 20854 00133ce6ec257f9 0.080760

Prediction Distribution

plt.figure(figsize=(10,5))
plt.hist(submission.target,bins=100);