Is It True That Transformers Are Not Effective for Time Series Forecasting?

2023-07-05 00:34 Author: HuggingFace

Introduction

A few months ago we introduced the Informer model (Zhou, Haoyi, et al., 2021), a time series Transformer that won the AAAI 2021 best paper award. We also provided an example of multivariate probabilistic forecasting with Informer. In this post we discuss the question: are Transformers really effective for time series forecasting? Our answer is yes, they are.

First, we provide empirical evidence that Transformers are indeed effective. Our comparison shows that the simple linear model DLinear is not better than Transformers, as has been claimed. When compared against Transformer-based models of equivalent size in the same setting, the Transformer-based models perform better on the test set metrics we consider. Second, we introduce the Autoformer model (Wu, Haixu, et al., 2021), which was published at NeurIPS 2021 after the Informer model. Autoformer is now available in 🤗 Transformers. Finally, we discuss the DLinear model, a simple feedforward network that uses the decomposition layer from Autoformer. DLinear was introduced in the paper Are Transformers Effective for Time Series Forecasting?, which claims that it outperforms Transformer-based models in time series forecasting.

Let's get started!

Evaluating Transformer-based models and DLinear

In the AAAI 2023 paper Are Transformers Effective for Time Series Forecasting?, the authors claim that Transformers are not effective for time series forecasting. They compare Transformer-based models against a simple linear model called DLinear. DLinear uses the decomposition layer from the Autoformer architecture (introduced later in this post), and the authors claim that it outperforms Transformer-based models. Is that really the case? Let's find out.

The results table compares Autoformer and DLinear on three datasets used in the paper. Autoformer outperforms DLinear on all three.

Next, we present the Autoformer and DLinear models, show how we compared them on the Traffic dataset from that table, and offer some explanation of the results.

TL;DR: A simple linear model can have an edge in certain cases, but it may not be able to incorporate covariates the way more complex models such as Transformers can.

Autoformer in detail

Autoformer builds on the traditional approach of decomposing a time series into seasonality and trend-cycle components. It does this by adding a Decomposition Layer, which strengthens the model's ability to capture these components. Autoformer also introduces a novel auto-correlation mechanism that replaces the self-attention of the standard Transformer. This mechanism lets the model exploit period-based dependencies inside the attention computation, which improves overall performance.

Below we take a closer look at these two key contributions of Autoformer: the Decomposition Layer and the Autocorrelation Mechanism, with code for each.

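Decomposition Layer

Decomposition has long been a popular method in time series analysis, but it had not been extensively incorporated into deep learning models before Autoformer. We first give a brief introduction to the concept and then show with PyTorch code how the idea is applied in Autoformer.

Time series decomposition

In time series analysis, decomposition breaks a time series down into three systematic components: trend-cycle, seasonal variation, and random fluctuations. The trend component represents the long-term direction of the series; the seasonal component reflects recurring patterns, such as yearly or quarterly cycles; and the random (irregular) component captures the noise in the data that the other two cannot explain.

There are two main types of decomposition, additive and multiplicative, both of which are implemented in the statsmodels library. Decomposing a time series into these components lets us better understand and model the underlying patterns.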
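As a brief illustration of the two forms, write $y_t$ for the observed series and $T_t$, $S_t$, $R_t$ for the trend-cycle, seasonal, and random components. These symbols are chosen here for illustration and do not come from the original post. Under that notation, the additive and multiplicative models can be sketched as:

```latex
y_t = T_t + S_t + R_t \qquad \text{(additive)}
y_t = T_t \times S_t \times R_t \qquad \text{(multiplicative)}
\log y_t = \log T_t + \log S_t + \log R_t \qquad \text{(multiplicative, after taking logs, for } y_t > 0\text{)}
```

The last line shows why the two forms are closely related: for a strictly positive series, a multiplicative decomposition is an additive decomposition of the log-transformed series.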

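But how can decomposition be incorporated into the Transformer architecture? Let's see how Autoformer does it.

Decomposition in Autoformer

Autoformer incorporates a decomposition block as an inner operation of the model, as shown in the figure above. Both the encoder and the decoder use decomposition blocks to aggregate the trend-cyclical part and progressively extract the seasonal part from the series. The idea of inner decomposition has proven its usefulness since the publication of Autoformer, and several other time series papers have adopted it, such as FEDformer (Zhou, Tian, et al., ICML 2022) and DLinear (Zeng, Ailing, et al., AAAI 2023), which underlines its significance in time series modeling.

Now let's define the decomposition layer formally: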

For an input series $\mathcal{X}$ of length $L$, the decomposition layer returns $\mathcal{X}_\textrm{trend}$ and $\mathcal{X}_\textrm{seasonal}$, defined as:

$\mathcal{X}_\textrm{trend} = \textrm{AvgPool}(\textrm{Padding}(\mathcal{X})) \\ \mathcal{X}_\textrm{seasonal} = \mathcal{X} - \mathcal{X}_\textrm{trend}$

The corresponding PyTorch implementation is:

    import torch
    from torch import nn


    class DecompositionLayer(nn.Module):
        """
        Returns the trend and the seasonal parts of the time series.
        """

        def __init__(self, kernel_size):
            super().__init__()
            self.kernel_size = kernel_size
            self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)  # moving average

        def forward(self, x):
            """Input shape: Batch x Time x EMBED_DIM"""
            # pad both ends of the time series by repeating the first and last values
            num_of_pads = (self.kernel_size - 1) // 2
            front = x[:, 0:1, :].repeat(1, num_of_pads, 1)
            end = x[:, -1:, :].repeat(1, num_of_pads, 1)
            x_padded = torch.cat([front, x, end], dim=1)

            # calculate the trend and seasonal part of the series
            # (AvgPool1d pools over the last dim, hence the permutes)
            x_trend = self.avg(x_padded.permute(0, 2, 1)).permute(0, 2, 1)
            x_seasonal = x - x_trend
            return x_seasonal, x_trend
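To make the padding and averaging steps concrete, here is a minimal dependency-free sketch of the same idea on a plain Python list. The function name `decompose` and the toy series are illustrative assumptions; this is not the layer used by the library.

```python
def decompose(series, kernel_size):
    """Split a 1-D series into (seasonal, trend) with a replicate-padded moving average."""
    pad = (kernel_size - 1) // 2
    # repeat the first and last value so the moving average keeps the original length
    padded = [series[0]] * pad + list(series) + [series[-1]] * pad
    trend = [
        sum(padded[i:i + kernel_size]) / kernel_size
        for i in range(len(padded) - kernel_size + 1)
    ]
    seasonal = [x - t for x, t in zip(series, trend)]
    return seasonal, trend


series = [1.0, 2.0, 3.0, 4.0, 5.0]
seasonal, trend = decompose(series, kernel_size=3)
print(trend)     # moving average of the padded series [1, 1, 2, 3, 4, 5, 5]
print(seasonal)  # what is left after removing the trend
```

With an odd `kernel_size` the trend has the same length as the input. With an even `kernel_size` this padding scheme yields a trend that is one step shorter than the input, so an odd kernel is the safer choice in this sketch.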

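As you can see, the implementation is quite simple and can easily be reused in other models, as DLinear does. Next, we explain the second contribution: the Attention (Autocorrelation) Mechanism.

Attention (Autocorrelation) Mechanism

In addition to the decomposition layer, Autoformer employs a novel auto-correlation mechanism that serves as a drop-in replacement for self-attention. In the vanilla time series Transformer, attention weights are computed in the time domain and aggregated point-wise. In Autoformer, by contrast, as the figure above shows, they are computed in the frequency domain (using the fast Fourier transform) and aggregated by time delay.

In the following sections, we dive into the details and explain them with code.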

Attention in the Frequency Domain

In theory, given a time lag $\tau$, the autocorrelation of a discrete variable $y$ measures the "relationship" (Pearson correlation) between the variable's current value at time $t$ and its past value at time $t-\tau$: $\textrm{Autocorrelation}(\tau) = \textrm{Corr}(y_t, y_{t-\tau})$
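The definition above can be checked on a toy example. The following sketch estimates the Pearson autocorrelation of a pure sine wave at a few lags; the helper names and the synthetic series are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def autocorrelation(y, lag):
    """Estimate Corr(y_t, y_{t-lag}) over the overlapping part of the series."""
    return pearson(y[lag:], y[:-lag])


period = 24  # e.g. a daily cycle in hourly data
y = [math.sin(2 * math.pi * t / period) for t in range(10 * period)]
acf = {lag: autocorrelation(y, lag) for lag in (6, 12, 24)}
print(acf)
```

For this series the autocorrelation is close to 1 at a lag of one full period, close to -1 at half a period, and close to 0 at a quarter period, which is the kind of periodic dependency the mechanism is meant to pick up.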

Using autocorrelation, Autoformer extracts frequency-based dependencies between the queries and keys, instead of the usual pairwise dot product between them. Think of it as a replacement for the $QK^T$ term in self-attention.

In practice, the autocorrelation of the queries and keys for all time delays is computed at once with an FFT. This gives the autocorrelation mechanism a time complexity of $O(L \log L)$ (where $L$ is the input time length), similar to Informer's ProbSparse attention. The theory behind computing autocorrelation with an FFT is the Wiener–Khinchin theorem, which we do not cover here.

Now let's look at the corresponding PyTorch code:

    import torch


    def autocorrelation(query_states, key_states):
        """
        Computes autocorrelation(Q,K) using `torch.fft`.
        Think about it as a replacement for the QK^T in the self-attention.

        Assumption: states are resized to same shape of [batch_size, time_length, embedding_dim].
        """
        # transform along the time dimension
        query_states_fft = torch.fft.rfft(query_states, dim=1)
        key_states_fft = torch.fft.rfft(key_states, dim=1)
        # multiply by the complex conjugate, then transform back to the time domain
        attn_weights = query_states_fft * torch.conj(key_states_fft)
        attn_weights = torch.fft.irfft(attn_weights, dim=1)

        return attn_weights
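To see why multiplying one spectrum by the conjugate of the other yields a correlation over all delays at once, the sketch below compares that route with a direct computation on short lists. It uses a naive $O(L^2)$ discrete Fourier transform from the standard library as a stand-in for an FFT, so it demonstrates the identity rather than the speed-up. All function names are illustrative assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(L^2) discrete Fourier transform (a stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def correlation_via_dft(q, k):
    """Inverse transform of DFT(q) * conj(DFT(k)): correlation for every delay."""
    product = [a * b.conjugate() for a, b in zip(dft(q), dft(k))]
    return [v.real for v in dft(product, inverse=True)]


def correlation_direct(q, k):
    """Circular cross-correlation: sum over t of q[t] * k[t - tau] for each delay tau."""
    n = len(q)
    return [sum(q[t] * k[(t - tau) % n] for t in range(n)) for tau in range(n)]


q = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
k = [0.5, -1.0, 2.0, 1.0, 0.0, 4.0]
via_dft = correlation_via_dft(q, k)
direct = correlation_direct(q, k)
print(max(abs(a - b) for a, b in zip(via_dft, direct)))  # difference is at rounding level
```

Note that this quantity is an unnormalized circular correlation, not the Pearson-normalized version from the definition; the delays with the largest values are the same kind of signal the mechanism ranks.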

Quite simple! Note that this is only a partial implementation of autocorrelation(Q,K); the full implementation can be found in 🤗 Transformers.

Next, we see how to aggregate our attn_weights by time delay, a process called Time Delay Aggregation.

Time Delay Aggregation

Let's denote the autocorrelations (that is, attn_weights) by $\mathcal{R}_{\mathcal{Q},\mathcal{K}}$. The question is: how do we aggregate $\mathcal{R}_{\mathcal{Q},\mathcal{K}}(\tau_1), \mathcal{R}_{\mathcal{Q},\mathcal{K}}(\tau_2), \dots, \mathcal{R}_{\mathcal{Q},\mathcal{K}}(\tau_k)$ with $\mathcal{V}$? In standard self-attention, this aggregation is done with a dot product. Autoformer uses a different approach. First, we align $\mathcal{V}$ to the time delays $\tau_1, \tau_2, \dots, \tau_k$ by computing its value at each delay, an operation called Rolling. Next, we multiply the aligned $\mathcal{V}$ element-wise with the autocorrelations. In the figure above, the left side shows the rolling of $\mathcal{V}$ by time delay, and the right side shows the element-wise multiplication with the autocorrelations.

The whole process can be summarized by the following equations:

$\tau_1, \tau_2, \dots, \tau_k = \textrm{arg Top-k}(\mathcal{R}_{\mathcal{Q},\mathcal{K}}(\tau)) \\ \hat{\mathcal{R}}_{\mathcal{Q},\mathcal{K}}(\tau_1), \hat{\mathcal{R}}_{\mathcal{Q},\mathcal{K}}(\tau_2), \dots, \hat{\mathcal{R}}_{\mathcal{Q},\mathcal{K}}(\tau_k) = \textrm{Softmax}(\mathcal{R}_{\mathcal{Q},\mathcal{K}}(\tau_1), \mathcal{R}_{\mathcal{Q},\mathcal{K}}(\tau_2), \dots, \mathcal{R}_{\mathcal{Q},\mathcal{K}}(\tau_k)) \\ \textrm{Autocorrelation-Attention} = \sum_{i=1}^{k} \textrm{Roll}(\mathcal{V}, \tau_i) \cdot \hat{\mathcal{R}}_{\mathcal{Q},\mathcal{K}}(\tau_i)$

And that's it! Note that $k$ is controlled by a hyperparameter called autocorrelation_factor (similar to sampling_factor in Informer), and that the softmax is applied to the autocorrelations before the multiplication.

Now we are ready to look at the final code:

    import torch
    import math


    def time_delay_aggregation(attn_weights, value_states, autocorrelation_factor=2):
        """
        Computes aggregation as value_states.roll(delay) * top_k_autocorrelations(delay).
        The final result is the autocorrelation-attention output.
        Think about it as a replacement of the dot-product between attn_weights and value states.

        The autocorrelation_factor is used to find top k autocorrelations delays.
        Assumption: value_states and attn_weights shape: [batch_size, time_length, embedding_dim]
        """
        # NOTE: simplified sketch; the shape unpacking is elided here, as in the original post
        bsz, num_heads, tgt_len, channel = ...
        time_length = value_states.size(1)
        autocorrelations = attn_weights.view(bsz, num_heads, tgt_len, channel)

        # find top k autocorrelations delays
        top_k = int(autocorrelation_factor * math.log(time_length))
        autocorrelations_mean = torch.mean(autocorrelations, dim=(1, -1))  # bsz x tgt_len
        top_k_autocorrelations, top_k_delays = torch.topk(autocorrelations_mean, top_k, dim=1)

        # apply softmax over the top k delays
        top_k_autocorrelations = torch.softmax(top_k_autocorrelations, dim=-1)  # bsz x top_k

        # compute aggregation: value_states.roll(delay) * top_k_autocorrelations(delay)
        delays_agg = torch.zeros_like(value_states).float()  # bsz x time_length x channel
        for i in range(top_k):
            value_states_roll_delay = value_states.roll(shifts=-int(top_k_delays[i]), dims=1)
            top_k_at_delay = top_k_autocorrelations[:, i]
            # aggregation
            top_k_resized = top_k_at_delay.view(-1, 1, 1).repeat(num_heads, tgt_len, channel)
            delays_agg += value_states_roll_delay * top_k_resized

        attn_output = delays_agg.contiguous()
        return attn_output
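The three steps (top-k selection, softmax, rolled aggregation) can be traced on a single one-dimensional series with plain Python lists. This is a sketch under simplifying assumptions: one series, no batch or head dimensions, and made-up autocorrelation scores; the helper names are hypothetical.

```python
import math


def roll(seq, shift):
    """Circular shift; a negative shift moves later elements to the front."""
    n = len(seq)
    return [seq[(i - shift) % n] for i in range(n)]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def time_delay_aggregation_1d(autocorr, values, factor=2):
    length = len(values)
    top_k = int(factor * math.log(length))
    # step 1: the delays with the largest autocorrelation
    delays = sorted(range(length), key=lambda d: autocorr[d], reverse=True)[:top_k]
    # step 2: normalize the selected autocorrelations
    weights = softmax([autocorr[d] for d in delays])
    # step 3: weighted sum of the series rolled by each selected delay
    output = [0.0] * length
    for delay, weight in zip(delays, weights):
        rolled = roll(values, -delay)
        output = [o + weight * r for o, r in zip(output, rolled)]
    return delays, weights, output


autocorr = [5.0, 0.1, 0.2, 0.1, 4.0, 0.1, 0.3, 0.1]  # made-up scores per delay
values = [float(t % 4) for t in range(8)]            # a series with period 4
delays, weights, output = time_delay_aggregation_1d(autocorr, values)
print(delays, output)
```

Here the series has length 8, so `top_k = int(2 * ln 8) = 4`; for a length-48 series the same rule gives `int(2 * ln 48) = 7`.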

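That's it! The Autoformer model is now available in 🤗 Transformers under the name AutoformerModel.

Our strategy with this model is to compare the performance of univariate Transformer models against DLinear, which is inherently univariate. Later we also present results for two multivariate Transformer models trained on the same data.

DLinear in detail

DLinear is conceptually simple: it is just fully connected layers on top of Autoformer's DecompositionLayer. It uses the DecompositionLayer to decompose the input time series into a residual (seasonal) part and a trend part. In the forward pass, each part is passed through its own linear layer, which projects it to an output of length prediction_length. The final output is the sum of the two outputs: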

    def forward(self, context):
        # split the context window into its seasonal and trend parts
        seasonal, trend = self.decomposition(context)
        # project each part with its own linear layer
        seasonal_output = self.linear_seasonal(seasonal)
        trend_output = self.linear_trend(trend)
        return seasonal_output + trend_output

In this setting, we first map the input sequence to prediction-length * hidden dimensions (through the two layers linear_seasonal and linear_trend); the results are summed and reshaped to (prediction_length, hidden); finally, the latent representation of dimension hidden is mapped to the parameters of some distribution.
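As a rough sanity check on how small this model is, the sketch below counts the parameters of the two linear branches for the settings used later in this post (prediction_length of 24, a context of twice that length, and a hidden dimension of 2). It assumes each branch is a plain affine layer with a bias term, which is an assumption rather than something stated in the post, and it leaves out the final projection to distribution parameters.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one affine layer."""
    return in_features * out_features + out_features


context_length = 48      # 2 * prediction_length
prediction_length = 24
hidden = 2

# each branch maps the context window to prediction_length * hidden values
per_branch = linear_params(context_length, prediction_length * hidden)
two_branches = 2 * per_branch
print(per_branch, two_branches)
```

Under these assumptions the two branches hold 4,704 parameters, which is in line with the roughly 4.7 K trainable parameters reported in the training log further down.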

In our benchmark, we use the DLinear implementation from GluonTS.

Example: Traffic dataset

We want to show empirically the performance of the Transformer-based models in the library by benchmarking on the Traffic dataset, which contains 862 time series. We train one shared model across all of the time series (the univariate setting). Each time series represents the occupancy value of a sensor and lies between 0 and 1. The following hyperparameters are kept fixed for all models.

    # Traffic prediction_length is 24. Reference:
    # https://github.com/awslabs/gluonts/blob/6605ab1278b6bf92d5e47343efcf0d22bc50b2ec/src/gluonts/dataset/repository/_lstnet.py#L105
    prediction_length = 24
    context_length = prediction_length*2
    batch_size = 128
    num_batches_per_epoch = 100
    epochs = 50
    scaling = "std"

The Transformer models are all relatively small:

    encoder_layers=2
    decoder_layers=2
    d_model=16

Instead of showing how to train a model with Autoformer, we refer readers to the two previous blog posts (TimeSeriesTransformer and Informer); replace the model with Autoformer and the dataset with traffic. We have also trained the models and pushed them to the HuggingFace Hub, and we use those models for the evaluation below.

Load Dataset

Let's first install the necessary libraries:

    !pip install -q transformers datasets evaluate accelerate "gluonts[torch]" ujson tqdm

The traffic dataset (Lai et al. (2017)) contains San Francisco traffic data. It consists of 862 hourly time series of road occupancy rates in the range $[0, 1]$, recorded on San Francisco Bay Area freeways from 2015 to 2016.

    from gluonts.dataset.repository.datasets import get_dataset

    dataset = get_dataset("traffic")
    freq = dataset.metadata.freq
    prediction_length = dataset.metadata.prediction_length

Let's visualize one time series in the dataset and plot the train/test split:

    import matplotlib.pyplot as plt

    train_example = next(iter(dataset.train))
    test_example = next(iter(dataset.test))

    num_of_samples = 4*prediction_length

    figure, axes = plt.subplots()
    axes.plot(train_example["target"][-num_of_samples:], color="blue")
    axes.plot(
        test_example["target"][-num_of_samples - prediction_length :],
        color="red",
        alpha=0.5,
    )

    plt.show()

Let's define the train/test splits:

    train_dataset = dataset.train
    test_dataset = dataset.test

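Define Transformations

Next, we define the transformations for the data, in particular the creation of time features (based on the dataset or on universal ones).

We define a Chain of transformations from GluonTS (similar to torchvision.transforms.Compose for images). It lets us combine several transformations into a single pipeline.

The transformations below are annotated with comments that explain what they do. At a high level, we iterate over the individual time series of the dataset and add or remove fields or features: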

    from transformers import PretrainedConfig
    from gluonts.time_feature import time_features_from_frequency_str

    from gluonts.dataset.field_names import FieldName
    from gluonts.transform import (
        AddAgeFeature,
        AddObservedValuesIndicator,
        AddTimeFeatures,
        AsNumpyArray,
        Chain,
        ExpectedNumInstanceSampler,
        RemoveFields,
        SelectFields,
        SetField,
        TestSplitSampler,
        Transformation,
        ValidationSplitSampler,
        VstackFeatures,
        RenameFields,
    )


    def create_transformation(freq: str, config: PretrainedConfig) -> Transformation:
        # create a list of fields to remove later
        remove_field_names = []
        if config.num_static_real_features == 0:
            remove_field_names.append(FieldName.FEAT_STATIC_REAL)
        if config.num_dynamic_real_features == 0:
            remove_field_names.append(FieldName.FEAT_DYNAMIC_REAL)
        if config.num_static_categorical_features == 0:
            remove_field_names.append(FieldName.FEAT_STATIC_CAT)

        return Chain(
            # step 1: remove static/dynamic fields if not specified
            [RemoveFields(field_names=remove_field_names)]
            # step 2: convert the data to NumPy (potentially not needed)
            + (
                [
                    AsNumpyArray(
                        field=FieldName.FEAT_STATIC_CAT,
                        expected_ndim=1,
                        dtype=int,
                    )
                ]
                if config.num_static_categorical_features > 0
                else []
            )
            + (
                [
                    AsNumpyArray(
                        field=FieldName.FEAT_STATIC_REAL,
                        expected_ndim=1,
                    )
                ]
                if config.num_static_real_features > 0
                else []
            )
            + [
                AsNumpyArray(
                    field=FieldName.TARGET,
                    # we expect an extra dim for the multivariate case:
                    expected_ndim=1 if config.input_size == 1 else 2,
                ),
                # step 3: handle the NaN's by filling in the target with zero
                # and return the mask (which is in the observed values)
                # true for observed values, false for nan's
                # the decoder uses this mask (no loss is incurred for unobserved values)
                # see loss_weights inside the xxxForPrediction model
                AddObservedValuesIndicator(
                    target_field=FieldName.TARGET,
                    output_field=FieldName.OBSERVED_VALUES,
                ),
                # step 4: add temporal features based on freq of the dataset
                # these serve as positional encodings
                AddTimeFeatures(
                    start_field=FieldName.START,
                    target_field=FieldName.TARGET,
                    output_field=FieldName.FEAT_TIME,
                    time_features=time_features_from_frequency_str(freq),
                    pred_length=config.prediction_length,
                ),
                # step 5: add another temporal feature (just a single number)
                # tells the model where in the life the value of the time series is
                # sort of running counter
                AddAgeFeature(
                    target_field=FieldName.TARGET,
                    output_field=FieldName.FEAT_AGE,
                    pred_length=config.prediction_length,
                    log_scale=True,
                ),
                # step 6: vertically stack all the temporal features into the key FEAT_TIME
                VstackFeatures(
                    output_field=FieldName.FEAT_TIME,
                    input_fields=[FieldName.FEAT_TIME, FieldName.FEAT_AGE]
                    + (
                        [FieldName.FEAT_DYNAMIC_REAL]
                        if config.num_dynamic_real_features > 0
                        else []
                    ),
                ),
                # step 7: rename to match HuggingFace names
                RenameFields(
                    mapping={
                        FieldName.FEAT_STATIC_CAT: "static_categorical_features",
                        FieldName.FEAT_STATIC_REAL: "static_real_features",
                        FieldName.FEAT_TIME: "time_features",
                        FieldName.TARGET: "values",
                        FieldName.OBSERVED_VALUES: "observed_mask",
                    }
                ),
            ]
        )

Define `InstanceSplitter`

We need to create an InstanceSplitter, which samples windows from the dataset for training, validation, and testing (we cannot pass the entire history of values to the model because of time and memory constraints).

The instance splitter randomly samples a window of context_length values together with the immediately following window of prediction_length values, and prefixes the corresponding keys with past_ or future_. This ensures that values is split into past_values and the subsequent future_values, which serve as the encoder and decoder inputs, respectively. The same happens for every other key listed in time_series_fields.

    from gluonts.transform import InstanceSplitter
    from gluonts.transform.sampler import InstanceSampler
    from typing import Optional


    def create_instance_splitter(
        config: PretrainedConfig,
        mode: str,
        train_sampler: Optional[InstanceSampler] = None,
        validation_sampler: Optional[InstanceSampler] = None,
    ) -> Transformation:
        assert mode in ["train", "validation", "test"]

        # pick the sampler that matches the requested mode
        instance_sampler = {
            "train": train_sampler
            or ExpectedNumInstanceSampler(
                num_instances=1.0, min_future=config.prediction_length
            ),
            "validation": validation_sampler
            or ValidationSplitSampler(min_future=config.prediction_length),
            "test": TestSplitSampler(),
        }[mode]

        return InstanceSplitter(
            target_field="values",
            is_pad_field=FieldName.IS_PAD,
            start_field=FieldName.START,
            forecast_start_field=FieldName.FORECAST_START,
            instance_sampler=instance_sampler,
            past_length=config.context_length + max(config.lags_sequence),
            future_length=config.prediction_length,
            time_series_fields=["time_features", "observed_mask"],
        )
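The windowing idea behind the splitter can be sketched without GluonTS. The helper below is hypothetical and only mirrors the description above: it picks a random forecast start, takes `context_length + max_lag` past values (left-padded with zeros if the series is too short, which is an assumption of this sketch), and the following `prediction_length` values as the future. The value of `max_lag` is made up for illustration.

```python
import random


def split_instance(values, context_length, prediction_length, max_lag, rng):
    """Hypothetical helper: cut one (past, future) training window out of a series."""
    past_length = context_length + max_lag
    # first index of the future window, chosen so that a full future window fits
    forecast_start = rng.randint(1, len(values) - prediction_length)
    past = values[max(0, forecast_start - past_length):forecast_start]
    # left-pad with zeros when there is not enough history
    past = [0.0] * (past_length - len(past)) + past
    future = values[forecast_start:forecast_start + prediction_length]
    return {
        "past_values": past,
        "future_values": future,
        "forecast_start": forecast_start,
    }


series = [float(t) for t in range(200)]  # toy series where the value equals the time index
instance = split_instance(
    series, context_length=48, prediction_length=24, max_lag=7, rng=random.Random(0)
)
print(len(instance["past_values"]), len(instance["future_values"]))
```

Because the toy series stores the time index as its value, it is easy to see that the last past value sits immediately before the first future value.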

Create PyTorch DataLoaders

Next, it's time to create the PyTorch DataLoaders, which give us batches of (input, output) pairs, in other words (past_values, future_values).

    from typing import Iterable

    import torch
    from gluonts.itertools import Cyclic, Cached
    from gluonts.dataset.loader import as_stacked_batches


    def create_train_dataloader(
        config: PretrainedConfig,
        freq,
        data,
        batch_size: int,
        num_batches_per_epoch: int,
        shuffle_buffer_length: Optional[int] = None,
        cache_data: bool = True,
        **kwargs,
    ) -> Iterable:
        PREDICTION_INPUT_NAMES = [
            "past_time_features",
            "past_values",
            "past_observed_mask",
            "future_time_features",
        ]
        if config.num_static_categorical_features > 0:
            PREDICTION_INPUT_NAMES.append("static_categorical_features")

        if config.num_static_real_features > 0:
            PREDICTION_INPUT_NAMES.append("static_real_features")

        TRAINING_INPUT_NAMES = PREDICTION_INPUT_NAMES + [
            "future_values",
            "future_observed_mask",
        ]

        transformation = create_transformation(freq, config)
        transformed_data = transformation.apply(data, is_train=True)
        if cache_data:
            transformed_data = Cached(transformed_data)

        # we initialize a Training instance
        instance_splitter = create_instance_splitter(config, "train")

        # the instance splitter will sample a window of
        # context length + lags + prediction length (from the 862 transformed time series)
        # randomly from within the target time series and return an iterator.
        stream = Cyclic(transformed_data).stream()
        training_instances = instance_splitter.apply(stream, is_train=True)

        return as_stacked_batches(
            training_instances,
            batch_size=batch_size,
            shuffle_buffer_length=shuffle_buffer_length,
            field_names=TRAINING_INPUT_NAMES,
            output_type=torch.tensor,
            num_batches_per_epoch=num_batches_per_epoch,
        )


    def create_test_dataloader(
        config: PretrainedConfig,
        freq,
        data,
        batch_size: int,
        **kwargs,
    ):
        PREDICTION_INPUT_NAMES = [
            "past_time_features",
            "past_values",
            "past_observed_mask",
            "future_time_features",
        ]
        if config.num_static_categorical_features > 0:
            PREDICTION_INPUT_NAMES.append("static_categorical_features")

        if config.num_static_real_features > 0:
            PREDICTION_INPUT_NAMES.append("static_real_features")

        transformation = create_transformation(freq, config)
        transformed_data = transformation.apply(data, is_train=False)

        # we create a Test Instance splitter which will sample the very last
        # context window seen during training only for the encoder.
        instance_sampler = create_instance_splitter(config, "test")

        # we apply the transformations in test mode
        testing_instances = instance_sampler.apply(transformed_data, is_train=False)

        return as_stacked_batches(
            testing_instances,
            batch_size=batch_size,
            output_type=torch.tensor,
            field_names=PREDICTION_INPUT_NAMES,
        )

Evaluate on Autoformer

We have already pre-trained an Autoformer model on this dataset, so we can simply load it and evaluate it on the test set:

    from transformers import AutoformerConfig, AutoformerForPrediction

    config = AutoformerConfig.from_pretrained("kashif/autoformer-traffic-hourly")
    model = AutoformerForPrediction.from_pretrained("kashif/autoformer-traffic-hourly")

    test_dataloader = create_test_dataloader(
        config=config,
        freq=freq,
        data=test_dataset,
        batch_size=64,
    )

At inference time, we use the model's generate() method to predict prediction_length steps into the future from the most recent context window of each time series.

    from accelerate import Accelerator

    accelerator = Accelerator()
    device = accelerator.device
    model.to(device)
    model.eval()

    forecasts_ = []
    for batch in test_dataloader:
        outputs = model.generate(
            static_categorical_features=batch["static_categorical_features"].to(device)
            if config.num_static_categorical_features > 0
            else None,
            static_real_features=batch["static_real_features"].to(device)
            if config.num_static_real_features > 0
            else None,
            past_time_features=batch["past_time_features"].to(device),
            past_values=batch["past_values"].to(device),
            future_time_features=batch["future_time_features"].to(device),
            past_observed_mask=batch["past_observed_mask"].to(device),
        )
        # collect the sampled forecasts of this batch on the CPU
        forecasts_.append(outputs.sequences.cpu().numpy())

The model outputs a tensor of shape (batch_size, number of samples, prediction length, input_size). In this univariate setting the last dimension is absent, as the output below shows.

In this case, we get 100 possible values for each of the next 24 hours, for every time series in a test dataloader batch of size 64:

    forecasts_[0].shape

    >>> (64, 100, 24)

We stack them vertically (using numpy.vstack) to get forecasts for all time series in the test dataset: there are 7 rolling windows, so we have 7 * 862 = 6034 predictions.

    import numpy as np

    forecasts = np.vstack(forecasts_)
    print(forecasts.shape)

    >>> (6034, 100, 24)

We can evaluate the resulting forecasts against the ground truth out-of-sample values in the test set. For that, we use the 🤗 Evaluate library, which includes the MASE metric.

We compute the metric for each time series and take the average:

    from tqdm.autonotebook import tqdm
    from evaluate import load
    from gluonts.time_feature import get_seasonality

    mase_metric = load("evaluate-metric/mase")

    # point forecast: median over the sampled trajectories
    forecast_median = np.median(forecasts, 1)

    mase_metrics = []
    for item_id, ts in enumerate(tqdm(test_dataset)):
        training_data = ts["target"][:-prediction_length]
        ground_truth = ts["target"][-prediction_length:]
        mase = mase_metric.compute(
            predictions=forecast_median[item_id],
            references=np.array(ground_truth),
            training=np.array(training_data),
            periodicity=get_seasonality(freq))
        mase_metrics.append(mase["mase"])
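For readers unfamiliar with the metric, here is a small hand-computable sketch of MASE, assuming the usual definition: the mean absolute error of the forecast divided by the in-sample mean absolute error of a seasonal naive forecast with the given periodicity. The function and the toy numbers are illustrative and are not the 🤗 Evaluate implementation.

```python
from statistics import median


def mase(predictions, references, training, periodicity):
    """Mean absolute scaled error against an in-sample seasonal naive forecast."""
    mae_forecast = sum(abs(p - r) for p, r in zip(predictions, references)) / len(references)
    naive_errors = [
        abs(training[t] - training[t - periodicity])
        for t in range(periodicity, len(training))
    ]
    mae_naive = sum(naive_errors) / len(naive_errors)
    return mae_forecast / mae_naive


# three sampled trajectories over a horizon of three steps
samples = [[1.0, 2.0, 3.0], [2.0, 2.0, 5.0], [3.0, 5.0, 4.0]]
# point forecast: the per-step median across samples
forecast_median = [median(step) for step in zip(*samples)]

references = [2.0, 3.0, 4.0]
training = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
score = mase(forecast_median, references, training, periodicity=2)
print(forecast_median, score)
```

Under this definition, a value below 1 means the forecast has a smaller average error than the seasonal naive baseline had on the training data.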

So the result for the Autoformer model is:

    print(f"Autoformer univariate MASE: {np.mean(mase_metrics):.3f}")

    >>> Autoformer univariate MASE: 0.910

To plot the forecast for any time series against its ground truth, we define the following helper:

    import matplotlib.dates as mdates
    import pandas as pd

    test_ds = list(test_dataset)


    def plot(ts_index):
        fig, ax = plt.subplots()

        index = pd.period_range(
            start=test_ds[ts_index][FieldName.START],
            periods=len(test_ds[ts_index][FieldName.TARGET]),
            freq=test_ds[ts_index][FieldName.START].freq,
        ).to_timestamp()

        # last five prediction windows of the actual series
        ax.plot(
            index[-5*prediction_length:],
            test_ds[ts_index]["target"][-5*prediction_length:],
            label="actual",
        )

        # median forecast over the final prediction window
        plt.plot(
            index[-prediction_length:],
            np.median(forecasts[ts_index], axis=0),
            label="median",
        )

        plt.gcf().autofmt_xdate()
        plt.legend(loc="best")
        plt.show()

For example, for the time series with index 4 in the test set:

    plot(4)

Evaluate on DLinear

gluonts provides an implementation of DLinear, which we use to train and evaluate the model:

    from gluonts.torch.model.d_linear.estimator import DLinearEstimator

    # Define the DLinear model with the same parameters as the Autoformer model
    estimator = DLinearEstimator(
        prediction_length=dataset.metadata.prediction_length,
        context_length=dataset.metadata.prediction_length*2,
        scaling=scaling,
        hidden_dimension=2,

        batch_size=batch_size,
        num_batches_per_epoch=num_batches_per_epoch,
        trainer_kwargs=dict(max_epochs=epochs)
    )

Train the model:

    predictor = estimator.train(
        training_data=train_dataset,
        cache_data=True,
        shuffle_buffer_length=1024
    )

    >>> INFO:pytorch_lightning.callbacks.model_summary:
          | Name  | Type         | Params
        ---------------------------------------
        0 | model | DLinearModel | 4.7 K
        ---------------------------------------
        4.7 K     Trainable params
        0         Non-trainable params
        4.7 K     Total params
        0.019     Total estimated model params size (MB)

        Training: 0it [00:00, ?it/s]
        ...
        INFO:pytorch_lightning.utilities.rank_zero:Epoch 49, global step 5000: 'train_loss' was not in top 1
        INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=50` reached.

Evaluate on the test set:

    from gluonts.evaluation import make_evaluation_predictions, Evaluator

    forecast_it, ts_it = make_evaluation_predictions(
        dataset=dataset.test,
        predictor=predictor,
    )

    d_linear_forecasts = list(forecast_it)
    d_linear_tss = list(ts_it)

    evaluator = Evaluator()

    agg_metrics, _ = evaluator(iter(d_linear_tss), iter(d_linear_forecasts))

So the result for the DLinear model is:

    dlinear_mase = agg_metrics["MASE"]
    print(f"DLinear MASE: {dlinear_mase:.3f}")

    >>> DLinear MASE: 0.965

As before, we plot the forecasts against the ground truth:

    def plot_gluonts(index):
        plt.plot(d_linear_tss[index][-4 * dataset.metadata.prediction_length:].to_timestamp(), label="target")
        d_linear_forecasts[index].plot(show_label=True, color='g')
        plt.legend()
        plt.gcf().autofmt_xdate()
        plt.show()

    plot_gluonts(4)

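The traffic dataset exhibits a distribution shift in the sensor patterns between weekdays and weekends. So what is going on here? DLinear has no capacity to incorporate covariates, in particular any date-time features, and the context window we give it (48 hours, shorter than the 168-hour weekly cycle) is too short to cover the pattern, so the model does not have enough information to tell whether it is forecasting a weekday or a weekend. It therefore predicts the more common pattern: its forecast distribution leans toward weekdays, and its performance on weekends is worse. Of course, given a large enough context window, a linear model could pick up the weekly pattern, but if the data contains monthly or quarterly patterns, an even larger window would be needed.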

Conclusion

So what is the conclusion of comparing Transformers with linear models? The MASE metrics on the test set for the different models are summarized below:

As the results show, the vanilla Transformer that we introduced last year achieves the best results. Second, multivariate models are typically worse than their univariate counterparts, because cross-series correlations are generally hard to estimate. The additional variance they introduce often harms the forecasts, or the model may learn spurious correlations. Recent papers such as CrossFormer (ICLR 23) and CARD try to address this problem in Transformer models. Multivariate models usually perform well when trained on large amounts of data. On small open datasets, however, univariate models usually do better. Compared with linear models, a univariate Transformer of equivalent size, or indeed any other neural model, usually performs better.

To summarize, Transformers are far from obsolete in time series forecasting. However, large-scale training data is crucial for unlocking their potential, and unlike CV or NLP, time series forecasting lacks large public datasets. Most existing pre-trained models for time series are trained on small collections such as UCR & UEA. Although these benchmark datasets have been a cornerstone of progress in time series forecasting, their small scale and limited generality make large-scale pre-training difficult.

Developing large-scale, general time series datasets (like ImageNet in CV) is therefore the most important task at the moment. This would greatly advance research on pre-trained models for time series analysis and improve their applicability to time series forecasting.

Acknowledgements

We sincerely thank Lysandre Debut and Pedro Cuenca for their insightful comments and help with this project.

Original English article: https://hf.co/blog/autoformer

Authors: Eli Simhayev, Kashif Rasul, Niels Rogge

Translator: Hoi2022

Review/typesetting: zhongdongy (阿东)
