Elasticsearch production operations: lessons learned

2023-03-15 18:32 Author: awldls-sre

1. For high availability, the copies of a shard (the primary and its replicas) are never allocated to the same node.

2. Order in which the local gateway settings take effect (they apply only on a full cluster restart)

gateway.expected_nodes: as soon as this many nodes have joined, recovery starts immediately, provided there is anything to recover.

gateway.recover_after_time: if expected_nodes has not been reached, the cluster waits for recover_after_time, however many nodes have joined so far.

gateway.recover_after_nodes: once recover_after_time has elapsed, recovery still starts only when at least recover_after_nodes nodes have joined.
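The order of the three gateway settings can be sketched as a small decision function. This is only a model of the rules as listed above, not Elasticsearch code; the class name, the function name and the seconds-based `recover_after_time_s` field are assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class GatewaySettings:
    """Hypothetical container for the three gateway.* values."""
    expected_nodes: int
    recover_after_nodes: int
    recover_after_time_s: float  # recover_after_time, expressed in seconds


def may_start_recovery(joined_nodes: int, waited_s: float, cfg: GatewaySettings) -> bool:
    """Return True when recovery may begin, following the order listed above."""
    # 1. expected_nodes reached: recover immediately.
    if joined_nodes >= cfg.expected_nodes:
        return True
    # 2. Otherwise nothing happens until recover_after_time has elapsed.
    if waited_s < cfg.recover_after_time_s:
        return False
    # 3. After the wait, recover_after_nodes must still be satisfied.
    return joined_nodes >= cfg.recover_after_nodes


cfg = GatewaySettings(expected_nodes=5, recover_after_nodes=3, recover_after_time_s=300)
```

With these example values, 4 joined nodes trigger recovery only after the 300s wait, and 2 joined nodes never do.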

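3. Prevent all indices from being deleted by accident

action.destructive_requires_name: true. With this setting, indices can no longer be deleted through wildcard expressions or _all; each index has to be named explicitly. A request like the following is then rejected:

curl -XDELETE 'http://localhost:9200/*/'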

4. Keep Elasticsearch out of swap (pick one of three)

Best: turn off swap at the operating-system level (swapoff -a)

Second choice: vm.swappiness=0 (swap is used only when physical memory runs out)

Last resort: bootstrap.memory_lock: true

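5. Node roles in a cluster

node.master (explicit)

node.data (explicit)

node.ingest (explicit)

coordinating (implicit: there is no setting for it; every node coordinates requests, and a node with the three settings above set to false is coordinating-only)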

6. Run at least 3 master-eligible nodes to avoid split brain

discovery.zen.minimum_master_nodes: 2
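The value 2 is the quorum for 3 master-eligible nodes, that is (master-eligible nodes / 2) + 1 with integer division. A minimal sketch of the calculation follows; the function name is an assumption made for the illustration.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


quorum_for_three = minimum_master_nodes(3)
```

With only 2 master-eligible nodes the quorum is also 2, so losing either node stops elections. That is why 3 is the practical minimum.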

7. Common commands (from Elasticsearch 6.0 on, requests with a body need the Content-Type header)

Change the number of replicas: curl -XPUT 'http://localhost:9200/yunxiaobai/_settings?pretty' -H 'Content-Type: application/json' -d '{"settings":{"index":{"number_of_replicas":"10"}}}'

Create an index: curl -XPUT 'localhost:9200/yunxiaobai?pretty'

Index a document: curl -XPUT 'localhost:9200/yunxiaobai/external/1?pretty' -H 'Content-Type: application/json' -d '{"name":"yunxiaobai"}'

Get a document: curl -XGET 'localhost:9200/yunxiaobai/external/1?pretty'

Delete an index: curl -XDELETE 'localhost:9200/jiaozhenqing?pretty'

Exclude a node from allocation: curl -XPUT '127.0.0.1:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{"transient":{"cluster.routing.allocation.exclude._ip":"10.0.0.1"}}'

Delete a template: curl -XDELETE 'http://127.0.0.1:9200/_template/metricbeat-6.2.4'

Change the refresh interval: curl -XPUT 'http://localhost:9200/metricbeat-6.2.4-2018.05.21/_settings?pretty' -H 'Content-Type: application/json' -d '{"settings":{"index":{"refresh_interval":"30s"}}}'

Upload a template file: curl -XPUT 'localhost:9200/_template/devops-logstore-template' -H 'Content-Type: application/json' -d @devops-logstore.json

Get a template: curl -XGET 'localhost:9200/_template/devops-logstore-template'

Check the bulk thread pool: curl -XGET 'http://localhost:9200/_cat/thread_pool/bulk?v&h=ip,name,active,rejected,completed'
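For scripted use, the index settings calls above (replica count, refresh interval) can be wrapped in a small helper. The sketch below uses only the Python standard library and builds the request without sending it; the helper name and the replica value of 1 are assumptions made for the illustration.

```python
import json
from urllib.request import Request


def index_settings_request(base_url: str, index: str, settings: dict) -> Request:
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"settings": {"index": settings}}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Example: set one replica on the yunxiaobai index used above.
req = index_settings_request("http://localhost:9200", "yunxiaobai", {"number_of_replicas": 1})
# To send it against a reachable cluster: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to log or review a change before it reaches the cluster.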

8. Cluster health status

green: all primary and replica shards are working.

yellow: all primary shards are working, but not all replica shards are.

red: at least one primary shard is not working.

9. Delayed allocation of shards from a failed node

index.unassigned.node_left.delayed_timeout: 1m. With this setting, the cluster waits 1m after a node fails before it starts recovering the shards that were on it. If the failed node held a primary shard, a replica on another node is still promoted to primary right away, delay or not, because the index would otherwise be unusable; only the rebuilding of the missing replicas is delayed.

The value is worth tuning to avoid a bad sequence of events. Suppose a machine crashes and reboots, and the reboot takes longer than one minute. The cluster then recovers that machine's shards elsewhere. When the machine is back and the service is up again, its local data is no longer needed and is deleted. The node is now empty, so the cluster also triggers a rebalance, which adds up to a lot of churn. A value of 0 means shards are reallocated immediately when a node leaves.

This is an index-level setting, so it cannot go into the node configuration. Elasticsearch reports: "Since elasticsearch 5.x index level settings can NOT be set on the nodes configuration like the elasticsearch.yaml, in system properties or command line arguments. In order to upgrade all indices the settings must be updated via the /${index}/_settings API."

cluster.routing.allocation.enable: none. While this setting is in effect, indices can still be created, but they remain unusable.

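10. Terminology

indexes vs. indices

"indices" is used mostly in mathematics, finance and related fields, while "indexes" is in wider general use.

"indexes" is more common in the English of the United States, Canada and similar countries, while "indices" prevails in English outside North America.

index vs. Lucene index

At the cluster level, an index is called an index.

At the node level, each shard is a Lucene index.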

11. Rolling upgrade (expect reduced availability while it is in progress)

Disable shard allocation

curl -XPUT 'http://localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":"none"}}' (indices can still be created at this point, but they are unusable)

Stop non-essential indexing and perform a synced flush

curl -XPOST 'localhost:9200/_flush/synced' (the cluster may not answer client requests while this runs)

Shut down a single node

Upgrade the node you shut down

Upgrade any plugins

Start the upgraded node

Reenable shard allocation

curl -XPUT 'http://localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{"persistent":{"cluster.routing.allocation.enable":"all"}}'

Wait for the node to recover

Repeat
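The per-node loop above can be sketched as a driver that receives the actual operations as callables, so the ordering can be checked without touching a cluster. All names here are assumptions made for the illustration; in practice `set_allocation` and `synced_flush` would issue the curl calls shown above, and `upgrade_node` would stop the node, upgrade it and its plugins, and start it again.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    nodes: Iterable[str],
    set_allocation: Callable[[str], None],
    synced_flush: Callable[[], None],
    upgrade_node: Callable[[str], None],
    wait_for_recovery: Callable[[], None],
) -> None:
    """Run the upgrade steps once per node, in the order listed above."""
    for node in nodes:
        set_allocation("none")  # disable shard allocation
        synced_flush()          # synced flush after stopping non-essential indexing
        upgrade_node(node)      # shut down, upgrade node and plugins, start
        set_allocation("all")   # re-enable shard allocation
        wait_for_recovery()     # wait for the node to recover before repeating


# Dry run that only records the order of operations.
log: List[str] = []
rolling_upgrade(
    ["node-1", "node-2"],
    set_allocation=lambda mode: log.append(f"allocation={mode}"),
    synced_flush=lambda: log.append("flush"),
    upgrade_node=lambda name: log.append(f"upgrade {name}"),
    wait_for_recovery=lambda: log.append("wait"),
)
```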

12. Key metrics (see X-Pack)

indices.search.query_current/indices.search.query_total

indices.search.query_time_in_millis

indices.indexing.index_current/indices.indexing.index_total

indices.indexing.index_time_in_millis

jvm.mem.heap_used_percent

number_of_nodes(_cluster/health)

active_shards_percent_as_number(_cluster/health)

status(_cluster/health)
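The counters above are cumulative, so they are usually turned into ratios before alerting on them. The sketch below derives average query and indexing latency from one node's statistics, assuming the dictionary is shaped like a single node entry of the node stats response; the function name and the sample numbers are made up for the illustration.

```python
def summarize_node(stats: dict) -> dict:
    """Derive average latencies and heap usage from one node's stats."""
    search = stats["indices"]["search"]
    indexing = stats["indices"]["indexing"]

    def avg_ms(total_ms: int, count: int) -> float:
        # Guard against division by zero on an idle node.
        return total_ms / count if count else 0.0

    return {
        "avg_query_ms": avg_ms(search["query_time_in_millis"], search["query_total"]),
        "avg_index_ms": avg_ms(indexing["index_time_in_millis"], indexing["index_total"]),
        "heap_used_percent": stats["jvm"]["mem"]["heap_used_percent"],
    }


# Made-up sample values, not real measurements.
sample = {
    "indices": {
        "search": {"query_total": 200, "query_time_in_millis": 5000},
        "indexing": {"index_total": 1000, "index_time_in_millis": 2000},
    },
    "jvm": {"mem": {"heap_used_percent": 63}},
}
summary = summarize_node(sample)
```

Because the counters only ever grow, a monitoring job would normally apply the same division to the difference between two successive samples rather than to the raw totals.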

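13. How to choose a sensible number of shards for an index

Assumptions:

Keep the number of shards per GB of heap below 20

Each node has a 30GB JVM heap

fielddata takes 9GB

Disk capacity is 1490GB, of which 12.8% is held back, so the usable capacity is 1490GB*(1-12.8%)=1299.28GB, about 1299GB

Keep shard sizes as uniform as possible

Conclusion:

Sizing of primary shards:

Capacity per shard: 1299GB/(30*20)=2.165GB

For a new index with a predicted total size of 10GB, set the number of primary shards to 10GB/2GB=5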
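The arithmetic above can be reproduced in a few lines. This is a worked sketch using the figures from this section; the constant and function names are assumptions made for the illustration, and the target shard size is rounded down to 2GB as in the text.

```python
import math

HEAP_GB = 30                 # JVM heap per node
MAX_SHARDS_PER_HEAP_GB = 20  # upper bound on shards per GB of heap
DISK_GB = 1490               # raw disk capacity per node
RESERVED_FRACTION = 0.128    # share of the disk that is held back

usable_gb = DISK_GB * (1 - RESERVED_FRACTION)          # about 1299GB
max_shards_per_node = HEAP_GB * MAX_SHARDS_PER_HEAP_GB  # 600 shards
shard_size_gb = usable_gb / max_shards_per_node         # about 2.165GB


def primary_shards(index_size_gb: float, target_shard_gb: float) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


shards_for_new_index = primary_shards(10, 2)
```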

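14. How to decide the number of replicas for an index

When the query load on ES is light (fewer than one query per second), one replica is enough for data redundancy (and there are backups as well).

With two replicas, query performance can improve because replicas share the query load on the index; the cost is CPU, memory and file handles.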

15. Backing up and restoring an ES cluster

Official repository types for backing up ES cluster data:

S3 Repository

Azure Repository

HDFS Repository

Google Cloud Storage Repository

The following uses HDFS for backup and restore. The repository-hdfs plugin has to be installed and the cluster restarted before it takes effect. HDFS backups are incremental: within one repository, each snapshot copies only the segments that have changed.

API calls used:

# Configure HDFS and create the repository
# path: repository path; load_defaults: load the Hadoop default configuration; uri: Hadoop address

curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '
{
  "type": "hdfs",
  "settings": {
    "path": "/back/es/",
    "load_defaults": "true",
    "compress": "true",
    "uri": "hdfs://localhost:8020"
  }
}'

# Create a snapshot
# snapshot_1 is the snapshot name; if "indices" is omitted, all indices are backed up

curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true' -H 'Content-Type: application/json' -d '
{
  "indices": "index_1,index_2",
  "include_global_state": false
}'

# Restore ES data
# To restore into the same cluster, look up the name of the snapshot to restore, then run:

curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore'

16. Monitoring

Core monitoring items: cluster health, functional checks, indexing latency, traffic

Monitoring item details:

pending_tasks

Pending tasks are the cluster-level changes (for example creating an index, updating a mapping, allocating shards) that the master node has not yet executed. They are prioritized (IMMEDIATE > URGENT > HIGH > NORMAL > LOW > LANGUID), and a task at one level runs only after the tasks at higher levels have finished. It follows that while tasks at HIGH or above are pending, lower-priority tasks such as snapshots and index creation will not run even though they need few resources, and they fail with a ProcessClusterEventTimeoutException.
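The priority rule can be illustrated by sorting a pending-task list the way the master would work through it: by priority first, then by insertion order. This is a sketch of the ordering only, not the master's actual scheduler. The `priority` and `insert_order` keys follow the pending tasks API response, while the function name and the sample `source` strings are made up for the illustration.

```python
from typing import Dict, List

PRIORITY_ORDER = ["IMMEDIATE", "URGENT", "HIGH", "NORMAL", "LOW", "LANGUID"]


def execution_order(tasks: List[Dict]) -> List[Dict]:
    """Sort tasks by priority level, then by insertion order within a level."""
    rank = {name: position for position, name in enumerate(PRIORITY_ORDER)}
    return sorted(tasks, key=lambda task: (rank[task["priority"]], task["insert_order"]))


# Made-up sample queue.
pending = [
    {"insert_order": 101, "priority": "NORMAL", "source": "create-index"},
    {"insert_order": 102, "priority": "URGENT", "source": "shard-started"},
    {"insert_order": 103, "priority": "HIGH", "source": "put-mapping"},
    {"insert_order": 104, "priority": "URGENT", "source": "shard-failed"},
]
ordered_sources = [task["source"] for task in execution_order(pending)]
```

In this sample the index creation was queued first but runs last, which is how a steady stream of higher-priority tasks can push a NORMAL task past its timeout.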

17. Kibana tuning

Add the following at the top of the startup script: NODE_OPTIONS="--max-old-space-size=4096" (4096 is in MB)

18. Recommended plugins

X-pack

ElasticHQ

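Expired X-Pack license

X-Pack licenses come in two kinds: the trial license and the license issued after registration. Installing a license does not require restarting nodes.

With X-Pack security disabled, install the license as follows:

Register Elasticsearch

Receive the license email and download the license JSON file

Install it through the following API (json is the downloaded license file)

curl -XPUT -u elastic 'http://0.0.0.0:9200/_xpack/license?acknowledge=true&pretty' -H "Content-Type: application/json" -d @json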

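19. Write data straight to ES, or go through ELK with Kafka?

Writing directly to ES: suits non-critical scenarios; simple and direct, with few dependencies

Putting Kafka in front: suits critical scenarios. As a broker layer, Kafka improves the stability of the ELK cluster as a whole and also lets the Hadoop ecosystem be brought in for other forms of data analysis.

Several ES clusters can consume the same data, giving hot standby across clusters and a quick switch to another cluster when one fails

While ES is down, Kafka buffers the data so nothing is lost; after recovery, consumption resumes with the data from the outage

The consumption rate can be throttled so that traffic spikes do not overwhelm ES

Writes can be batched, which improves ES performance; with distributed producers writing directly to ES, batching is hard to achieve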
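The batching point can be sketched as a generator that groups consumed messages into bulk request bodies. It covers only the batching step and assumes the documents have already been read from Kafka as dictionaries; the function name and batch size are assumptions made for the illustration. Each body is newline-delimited JSON for a bulk endpoint whose URL already names the target index, which is why the action line is left empty.

```python
import json
from typing import Iterable, Iterator, List


def bulk_bodies(docs: Iterable[dict], batch_size: int) -> Iterator[str]:
    """Yield NDJSON bulk bodies containing at most batch_size documents each."""
    lines: List[str] = []
    count = 0
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action line: index into the URL's target
        lines.append(json.dumps(doc))            # source line: the document itself
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"        # a bulk body must end with a newline
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"            # flush the final partial batch


bodies = list(bulk_bodies([{"n": 1}, {"n": 2}, {"n": 3}], batch_size=2))
```

A consumer would pause between batches to throttle the write rate, which is the rate control described in the list above.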

20. Advantages of putting a proxy in front of ES

Block high-risk operations, such as changing cluster settings or deleting

Block certain clients, such as unauthorized users or test users

21. ES cluster restart plan and impact analysis

ES cluster restart plan

Every ES node is stateful, and shard settings differ from index to index: a given shard may have several replicas or only one. To avoid affecting the service, upgrade or restart ES nodes one at a time (rolling restart).

#Note: restart only while the cluster status is green, and make sure the cluster is green before and after restarting each node.

#Step 1 - disable shard allocation

#If allocation is left enabled, stopping a node makes the allocator reassign the UNASSIGNED shards (once the cluster meets the recovery requirements and hits the recovery threshold), which causes heavy IO. With allocation disabled, however, creating new indices is effectively blocked.

curl -X PUT 'http://0.0.0.0:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

#Step 2 - verify the changed setting:

curl -X GET 'http://0.0.0.0:9200/_cluster/settings?pretty'

#Step 3 - perform a synced flush

#Reason: shards that receive no updates during the restart can skip the copy comparison on recovery, which speeds up shard recovery

curl -XPOST "http://0.0.0.0:9200/_flush/synced?pretty"

#Step 4 - restart the client nodes

#Restarting client nodes carries a small risk of lost writes (because of the LB's round-robin policy, it takes up to 10s after a node goes offline before requests stop being routed to it; avoid the problem by not restarting the next client node right away)

##Sub-step 1 - restart one node of the client-node group

##Sub-step 2 - confirm the node has rejoined the cluster

##Check the cluster status and node count with:

curl -XGET 'http://0.0.0.0:9200/_cluster/health?pretty'

##Sub-step 3 - repeat sub-steps 1-2 for the remaining client nodes

#Step 5 - restart the master nodes

##Sub-step 1 - restart one master-eligible node that is not the elected master

##Sub-step 2 - confirm the node has rejoined the cluster

##Sub-step 3 - repeat sub-steps 1-2 for the remaining master-eligible nodes that are not the elected master

##Sub-step 4 - repeat sub-steps 1-2 for the elected master

##Sub-step 5 - check that a new master has been elected (the election starts after 30s, because discovery.zen.ping_timeout: 30s)

##During the election, writes and API operations are blocked; searches are not affected

#Step 6 - restart the data nodes

##Sub-step 1 - restart one data node of the data-node group

##Sub-step 2 - confirm the node has rejoined the cluster

##Sub-step 3 - repeat sub-steps 1-2 for the remaining data nodes

#Step 7 - re-enable shard allocation

curl -X PUT 'http://0.0.0.0:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'

22. Common log errors

Cluster name mismatch: java.lang.IllegalStateException: handshake failed, mismatched cluster name

Configuration file error: java.lang.IllegalArgumentException: node settings must not contain any index level settings

org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s

cluster state update task [put-mapping[type-2018-05-20-22-25-1526826326]] took [57.2s] above the warn threshold of 30s

[o.e.a.a.i.m.p.TransportPutMappingAction] failed to put mappings on indices [[[tpmonitor-elasticsearch/Vzr0MlOKRimGGCcMb0wIdA]]], type [type-2018-05-21-22-18-1526912288]

Failed to connect to server: 10.1.1.1/10.1.1.1:9000: try once and fail.

[2018-05-22T16:50:59,473][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [172-0] uncaught exception in thread [main]

org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Unable to access 'path.data' (/data10/elasticsearch)
