This article explains how to run efficient hyperparameter tuning with Kubernetes and Helm, using a worked example.
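During a hyperparameter sweep, the same model is trained once for every hyperparameter combination under test, which costs either a lot of compute or a lot of time.
Training the combinations in parallel requires a large amount of compute.
Training them one after another on a fixed set of resources takes a long time to get through every combination.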
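The trade-off above can be put into numbers with a rough sketch. The grid reuses the values configured later in this article; the 30 minutes per run is a made-up figure for illustration only, and the estimate assumes one GPU per run, as the chart in this article requests.

```python
from math import prod


def sweep_cost(grid, minutes_per_run):
    """Estimate the two extremes of a grid sweep: fully parallel vs. fully sequential."""
    # A full grid trains one model per element of the Cartesian product.
    n_runs = prod(len(values) for values in grid.values())
    return {
        "runs": n_runs,
        "gpus_if_fully_parallel": n_runs,  # assumes one GPU per run
        "minutes_if_sequential": n_runs * minutes_per_run,
    }


grid = {"learning_rate": [0.001, 0.01, 0.1], "hidden_layers": [5, 6, 7]}
# minutes_per_run=30 is a hypothetical duration, not a measurement.
cost = sweep_cost(grid, minutes_per_run=30)
print(cost)
```

Adding a third hyperparameter with 4 values would multiply both figures by 4, which is why exhaustive sweeps are rarely run by hand.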
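As a result, in practice most people tune their hyperparameters by hand over only a handful of runs and settle for the best combination they happened to find.
With Kubernetes and Helm, you can explore a very large hyperparameter space with little effort while keeping cluster utilization high, which keeps costs down.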
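Helm lets us package an application into a chart and parameterize it easily. For a hyperparameter sweep, the chart's templates can read the hyperparameter lists from the chart values and generate one TFJob per combination. The same chart can also deploy a TensorBoard instance that monitors all of these TFJobs, so we can quickly compare the results of every combination. Training jobs for combinations that perform poorly can be deleted early, which frees cluster compute and reduces cost.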
The demo uses the example from Azure/kubeflow-labs/hyperparam-sweep.
First, build the training image from the following Dockerfile:
    FROM tensorflow/tensorflow:1.7.0-gpu
    COPY requirements.txt /app/requirements.txt
    WORKDIR /app
    RUN mkdir ./output
    RUN mkdir ./logs
    RUN mkdir ./checkpoints
    RUN pip install -r requirements.txt
    COPY ./* /app/
    ENTRYPOINT [ "python", "/app/main.py" ]
The training script main.py is as follows:
    import click
    import tensorflow as tf
    import numpy as np
    from skimage.data import astronaut
    from scipy.misc import imresize, imsave, imread

    # Load the target image and shrink it to 100x100 to keep training fast
    img = imread('./starry.jpg')
    img = imresize(img, (100, 100))
    save_dir = 'output'
    epochs = 2000


    def linear_layer(X, layer_size, layer_name):
        # Fully connected layer followed by a ReLU activation
        with tf.variable_scope(layer_name):
            W = tf.Variable(tf.random_uniform([X.get_shape().as_list()[1], layer_size], dtype=tf.float32), name='W')
            b = tf.Variable(tf.zeros([layer_size]), name='b')
            return tf.nn.relu(tf.matmul(X, W) + b)


    @click.command()
    @click.option("--learning-rate", default=0.01)
    @click.option("--hidden-layers", default=7)
    @click.option("--logdir")
    def main(learning_rate, hidden_layers, logdir):
        # click passes None when --logdir is omitted, so fall back to a local directory
        if logdir is None:
            logdir = './logs/1'

        # Inputs are (row, col) pixel coordinates; targets are RGB values
        X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name='X')
        y = tf.placeholder(dtype=tf.float32, shape=(None, 3), name='y')
        current_input = X
        for layer_id in range(hidden_layers):
            h = linear_layer(current_input, 20, 'layer{}'.format(layer_id))
            current_input = h

        y_pred = linear_layer(current_input, 3, 'output')

        # loss will be distance between predicted and true RGB
        loss = tf.reduce_mean(tf.reduce_sum(tf.squared_difference(y, y_pred), 1))
        tf.summary.scalar('loss', loss)

        train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
        merged_summary_op = tf.summary.merge_all()

        # Image summary of the current prediction, written separately from the scalar summaries
        res_img = tf.cast(tf.clip_by_value(tf.reshape(y_pred, (1,) + img.shape), 0, 255), tf.uint8)
        img_summary = tf.summary.image('out', res_img, max_outputs=1)

        xs, ys = get_data(img)

        with tf.Session() as sess:
            tf.global_variables_initializer().run()
            train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph)
            test_writer = tf.summary.FileWriter(logdir + '/test')
            batch_size = 50
            for i in range(epochs):
                # Get a random sampling of the dataset
                idxs = np.random.permutation(range(len(xs)))
                # The number of batches we have to iterate over
                n_batches = len(idxs) // batch_size
                # Now iterate over our stochastic minibatches:
                for batch_i in range(n_batches):
                    batch_idxs = idxs[batch_i * batch_size: (batch_i + 1) * batch_size]
                    sess.run([train_op, loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]})
                    if batch_i % 100 == 0:
                        c, summary = sess.run([loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]})
                        train_writer.add_summary(summary, (i * n_batches * batch_size) + batch_i)
                        print("epoch {}, (l2) loss {}".format(i, c))

                # Every 10 epochs, render the whole image and log it for TensorBoard
                if i % 10 == 0:
                    img_summary_res = sess.run(img_summary, feed_dict={X: xs, y: ys})
                    test_writer.add_summary(img_summary_res, i * n_batches * batch_size)


    def get_data(img):
        # Build one (row, col) -> RGB training pair per pixel
        xs = []
        ys = []
        for row_i in range(img.shape[0]):
            for col_i in range(img.shape[1]):
                xs.append([row_i, col_i])
                ys.append(img[row_i, col_i])

        # Normalize the coordinates with a single global mean and standard deviation
        xs = (xs - np.mean(xs)) / np.std(xs)
        return xs, np.array(ys)


    if __name__ == "__main__":
        main()
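When docker build creates the image, the starry.jpg file in the root directory is packaged into the image so that main.py can read it.
main.py uses a model based on Andrej Karpathy's Image painting demo. The goal of the model is to paint a new image that is as close as possible to the original, Vincent van Gogh's "The Starry Night".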
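To make the training data concrete: the model learns a mapping from a pixel's normalized (row, col) coordinates to its RGB value. The following is a minimal NumPy sketch of that data preparation, written as a vectorized equivalent of the script's get_data. The function name coords_and_colors and the tiny synthetic image are illustrative only and are not part of the original repository.

```python
import numpy as np


def coords_and_colors(img):
    """Return normalized (row, col) inputs and the matching per-pixel color targets."""
    height, width = img.shape[:2]
    # indexing="ij" gives row-major order, matching a nested row/column loop.
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    xs = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(np.float64)
    # One global mean and standard deviation over all coordinates, as in get_data.
    xs = (xs - xs.mean()) / xs.std()
    ys = img.reshape(height * width, -1)
    return xs, ys


# Synthetic 4x5 RGB image standing in for the resized starry.jpg.
img = np.arange(4 * 5 * 3).reshape(4, 5, 3)
xs, ys = coords_and_colors(img)
print(xs.shape, ys.shape)
```

For a 100x100 image this yields 10,000 training pairs, so each epoch in main.py runs 200 batches of 50 pixels.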
The Helm chart's values.yaml is configured as follows:
    image: ritazh/tf-paint:gpu
    useGPU: true
    hyperParamValues:
      learningRate:
        - 0.001
        - 0.01
        - 0.1
      hiddenLayers:
        - 5
        - 6
        - 7
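image: the Docker image used by the training jobs, i.e. the image built above.
useGPU: a boolean. The default, true, trains on GPUs. If it is set to false, build the image from the tensorflow/tensorflow:1.7.0 base image instead.
hyperParamValues: the hyperparameter values to sweep. Here only two hyperparameters are configured, learningRate and hiddenLayers.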
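For a larger sweep, the values file could be generated rather than typed by hand. The sketch below is one possible way to do that with plain string formatting; render_values is a hypothetical helper, not part of Helm or of the example repository, and it only covers the three keys described above.

```python
def render_values(image, use_gpu, learning_rates, hidden_layers):
    """Render a values.yaml text with the same keys as the chart in this article."""
    lines = [
        f"image: {image}",
        f"useGPU: {'true' if use_gpu else 'false'}",
        "hyperParamValues:",
        "  learningRate:",
    ]
    lines += [f"    - {lr}" for lr in learning_rates]
    lines.append("  hiddenLayers:")
    lines += [f"    - {n}" for n in hidden_layers]
    return "\n".join(lines) + "\n"


values_text = render_values(
    image="ritazh/tf-paint:gpu",
    use_gpu=True,
    learning_rates=[0.001, 0.01, 0.1],
    hidden_layers=[5, 6, 7],
)
print(values_text)
```

The printed text reproduces the values.yaml shown above, so it could be written to a file and used in place of the hand-edited one.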
The Helm chart mainly contains the TFJob definition and the Deployment and Service definitions for TensorBoard:
    # First we copy the values of values.yaml in variable to make it easier to access them
    {{- $lrlist := .Values.hyperParamValues.learningRate -}}
    {{- $nblayerslist := .Values.hyperParamValues.hiddenLayers -}}
    {{- $image := .Values.image -}}
    {{- $useGPU := .Values.useGPU -}}
    {{- $chartname := .Chart.Name -}}
    {{- $chartversion := .Chart.Version -}}

    # Then we loop over every value of $lrlist (learning rate) and $nblayerslist (hidden layer depth)
    # This will result in create 1 TFJob for every pair of learning rate and hidden layer depth
    {{- range $i, $lr := $lrlist }}
    {{- range $j, $nblayers := $nblayerslist }}
    apiVersion: kubeflow.org/v1alpha1
    kind: TFJob # Each one of our trainings will be a separate TFJob
    metadata:
      name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training
      labels:
        chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}"
    spec:
      replicaSpecs:
        - template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: tensorflow
                  image: {{ $image }}
                  env:
                  - name: LC_ALL
                    value: C.UTF-8
                  args:
                    # Here we pass a unique learning rate and hidden layer count to each instance.
                    # We also put the values between quotes to avoid potential formatting issues
                    - --learning-rate
                    - {{ $lr | quote }}
                    - --hidden-layers
                    - {{ $nblayers | quote }}
                    - --logdir
                    - /tmp/tensorflow/tf-paint-lr{{ $lr }}-d-{{ $nblayers }} # We save the summaries in a different directory
    {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU
                  resources:
                    limits:
                      nvidia.com/gpu: 1
    {{ end }}
                  volumeMounts:
                  - mountPath: /tmp/tensorflow
                    subPath: module8-tf-paint # As usual we want to save everything in a separate subdirectory
                    name: azurefile
              volumes:
                - name: azurefile
                  persistentVolumeClaim:
                    claimName: azurefile
    ---
    {{- end }}
    {{- end }}
    # We only want one instance running for all our jobs, and not 1 per job.
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: tensorboard
      name: module8-tensorboard
    spec:
      ports:
      - port: 80
        targetPort: 6006
      selector:
        app: tensorboard
      type: LoadBalancer
    ---
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      labels:
        app: tensorboard
      name: module8-tensorboard
    spec:
      template:
        metadata:
          labels:
            app: tensorboard
        spec:
          volumes:
            - name: azurefile
              persistentVolumeClaim:
                claimName: azurefile
          containers:
          - name: tensorboard
            command:
              - /usr/local/bin/tensorboard
              - --logdir=/tmp/tensorflow
              - --host=0.0.0.0
            image: tensorflow/tensorflow
            ports:
            - containerPort: 6006
            volumeMounts:
            - mountPath: /tmp/tensorflow
              subPath: module8-tf-paint
              name: azurefile
With the hyperparameter configuration above, helm install creates 9 TFJobs, one for each combination of the 3 learningRate values and the 3 hiddenLayers values.
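The expansion performed by the two nested range loops in the template can be mirrored in a few lines of Python. This is only an illustration of the naming scheme, not how Helm renders the chart; expand_jobs is a hypothetical helper, and Helm's own formatting of the float values in the log directory may differ from Python's.

```python
def expand_jobs(learning_rates, hidden_layers):
    """Mirror the template's nested loops: one job per (learning rate, depth) pair."""
    jobs = []
    for i, lr in enumerate(learning_rates):
        for j, layers in enumerate(hidden_layers):
            jobs.append({
                # Same pattern as metadata.name in the TFJob template.
                "name": f"module8-tf-paint-{i}-{j}",
                # Same argument order as the container args in the template.
                "args": [
                    "--learning-rate", str(lr),
                    "--hidden-layers", str(layers),
                    "--logdir", f"/tmp/tensorflow/tf-paint-lr{lr}-d-{layers}",
                ],
            })
    return jobs


jobs = expand_jobs([0.001, 0.01, 0.1], [5, 6, 7])
for job in jobs:
    print(job["name"], " ".join(job["args"]))
```

The first index in each job name is the position of the learning rate in its list and the second is the position of the hidden layer count, which makes it possible to map a job name back to its hyperparameters.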
The main.py training script takes 3 arguments:
argument | description | default value |
---|---|---|
--learning-rate | Learning rate value | 0.01 |
--hidden-layers | Number of hidden layers in our network. | 7 |
--logdir | Path to save TensorFlow's summaries | None |
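Running helm install deploys the training jobs for all hyperparameter combinations in one step. Here each job trains on a single node; distributed training is also possible.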
    $ helm install .
    NAME:   telling-buffalo
    LAST DEPLOYED:
    NAMESPACE: tfworkflow
    STATUS: DEPLOYED

    RESOURCES:
    ==> v1/Service
    NAME                 TYPE          CLUSTER-IP    EXTERNAL-IP  PORT(S)       AGE
    module8-tensorboard  LoadBalancer  10.0.142.217  <pending>    80:30896/TCP  1s

    ==> v1beta1/Deployment
    NAME                 DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
    module8-tensorboard  1        1        1           0          1s

    ==> v1alpha1/TFJob
    NAME                  AGE
    module8-tf-paint-0-0  1s
    module8-tf-paint-1-0  1s
    module8-tf-paint-1-1  1s
    module8-tf-paint-2-1  1s
    module8-tf-paint-2-2  1s
    module8-tf-paint-0-1  1s
    module8-tf-paint-0-2  1s
    module8-tf-paint-1-2  1s
    module8-tf-paint-2-0  0s

    ==> v1/Pod(related)
    NAME                                  READY  STATUS             RESTARTS  AGE
    module8-tensorboard-7ccb598cdd-6vg7h  0/1    ContainerCreating  0         1s
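After deploying the chart, list the Pods that were created. You should see one Pod per training job, plus a single TensorBoard instance that monitors all of them: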
    $ kubectl get pods
    NAME                                      READY  STATUS   RESTARTS  AGE
    module8-tensorboard-7ccb598cdd-6vg7h      1/1    Running  0         16s
    module8-tf-paint-0-0-master-juc5-0-hw5cm  0/1    Pending  0         4s
    module8-tf-paint-0-1-master-pu49-0-jp06r  1/1    Running  0         14s
    module8-tf-paint-0-2-master-awhs-0-gfra0  0/1    Pending  0         6s
    module8-tf-paint-1-0-master-5tfm-0-dhhhv  1/1    Running  0         16s
    module8-tf-paint-1-1-master-be91-0-zw4gk  1/1    Running  0         16s
    module8-tf-paint-1-2-master-r2nd-0-zhws1  0/1    Pending  0         7s
    module8-tf-paint-2-0-master-7w37-0-ff0w9  0/1    Pending  0         13s
    module8-tf-paint-2-1-master-260j-0-l4o7r  0/1    Pending  0         10s
    module8-tf-paint-2-2-master-jtjb-0-5l84q  0/1    Pending  0         9s
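Note: some pods are Pending because of the number of GPUs available in the cluster. If the cluster has 3 GPUs, at most 3 TFJobs can train in parallel at any given time, since each TFJob requests one GPU.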
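The effect of this limit can be pictured with a simplified model. The sketch below groups jobs into fixed "waves" of at most one job per GPU, assuming every job takes the same time. That is a deliberate simplification: the real Kubernetes scheduler starts a pending pod as soon as a GPU is freed, and it does not guarantee any particular order. schedule_waves is a hypothetical helper for illustration.

```python
def schedule_waves(job_names, gpus):
    """Split jobs into consecutive groups of at most `gpus` jobs (one GPU per job)."""
    if gpus < 1:
        raise ValueError("at least one GPU is needed to run GPU jobs")
    return [job_names[k:k + gpus] for k in range(0, len(job_names), gpus)]


job_names = [f"module8-tf-paint-{i}-{j}" for i in range(3) for j in range(3)]
waves = schedule_waves(job_names, gpus=3)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With 9 jobs and 3 GPUs this gives 3 waves, so under the equal-duration assumption the whole sweep takes roughly three times as long as a single training run.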
The TensorBoard Service is also created automatically by helm install. Use the Service's External-IP to connect to TensorBoard.
    $ kubectl get service
    NAME                 TYPE          CLUSTER-IP    EXTERNAL-IP  PORT(S)       AGE
    module8-tensorboard  LoadBalancer  10.0.142.217  <PUBLIC IP>  80:30896/TCP  5m
Open the TensorBoard public IP address in a browser to see the results of every training run. (TensorBoard takes a little while to display the images.)
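Here we can see that the models for some hyperparameter combinations perform better than others. For example, every model with a learning rate of 0.1 produces an all-black image, which is a very poor result. After a few minutes, the two best-performing combinations are:
hidden layers = 5, learning rate = 0.01
hidden layers = 7, learning rate = 0.001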
At this point we can immediately kill the training jobs for the poorly performing models and free up valuable GPU resources.
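One way to work out which jobs to remove is to map the two winning combinations back to job names using the chart's naming scheme (learning rate index first, hidden layer index second). The sketch below only builds the command strings and does not run anything. It assumes the TFJob resources can be deleted with kubectl delete tfjob, which should be checked against the cluster before use; jobs_to_delete is a hypothetical helper.

```python
def jobs_to_delete(learning_rates, hidden_layers, keep):
    """Return the names of all sweep jobs whose (learning rate, depth) pair is not in `keep`."""
    names = []
    for i, lr in enumerate(learning_rates):
        for j, layers in enumerate(hidden_layers):
            if (lr, layers) not in keep:
                names.append(f"module8-tf-paint-{i}-{j}")
    return names


learning_rates = [0.001, 0.01, 0.1]
hidden_layers = [5, 6, 7]
# The two best combinations reported above, as (learning rate, hidden layers).
keep = {(0.01, 5), (0.001, 7)}

to_delete = jobs_to_delete(learning_rates, hidden_layers, keep)
# Assumed command form; printed for review rather than executed.
commands = [f"kubectl delete tfjob {name}" for name in to_delete]
print("\n".join(commands))
```

The two jobs that are kept are module8-tf-paint-1-0 (learning rate 0.01, 5 hidden layers) and module8-tf-paint-0-2 (learning rate 0.001, 7 hidden layers); the other 7 are listed for deletion.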