Multi Model Server and SageMaker Multi-Model Endpoint Overview

1. Introduction


MMS (Multi Model Server)

  • https://github.com/awslabs/multi-model-server (first released in early December 2017 with the MXNet 1.0 release; started out as a model server for MXNet)

  • Prerequisites: Java 8, MXNet (MXNet only if you actually use it)

  • MMS is designed to be framework-agnostic, so it is flexible enough to act as the backend engine for any framework.

  • It has a microservice-based architecture and creates its own endpoint to build an inference server, so using MMS alone does not involve SageMaker at all.

    • Frontend: a Java-based web service that exposes a REST API

    • Backend: workers that run the Custom Service code

  • Compared with Flask, it is more convenient for ML serving: it manages multiple model versions and provides logging and metrics.

  • When the inference server has enough CPU and memory, a built-in feature lets multiple models be deployed behind a single endpoint; Multi-Model Endpoint can be used.

  • Brief usage

    • Implement the model handler (a Custom Service class)

        1. See "How to initialize Multi-Model Endpoint on SageMaker?" (Section 2)

    • Package the model with model-archiver → produces an archive that MMS can parse

      • Packages the model artifacts into a single model archive file that MMS can parse

        • [Required] Model artifacts (weights, layers, etc.)

        • [Required] Model signature file (shape of the input data tensor)

        • [Optional] Custom service file: input/output processing, such as model initialization and converting raw data into tensors

        • [Optional] Auxiliary files (extra files and Python modules needed to run inference)

          • Example: for object detection, a file that stores the label string of each class

      • The <model-name>.mar file that is handed to MMS to serve inference requests is created in the directory given by export_path. (See https://github.com/awslabs/multi-model-server/tree/master/model-archiver#creating-a-model-archive)

    • Start MMS

      • On hosts with a lot of compute resources, server startup can take a long time.

      • MMS can load a .mar file from the local file system, or load a model archive stored in AWS S3 through an http:// or https:// URL.

    • Run inference: the MMS process downloads and unpacks the model archive, configures the service with the model artifacts, and then starts listening for requests arriving at the endpoint.

    • Stop the MMS service

  • The official documentation recommends running MMS inside a Docker container for stronger security.
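
The package / start / infer / stop steps above can be sketched as a small Python helper that only builds the command lines, without running them. This is a minimal sketch, not the project's official tooling: the flag names are believed to match the MMS and model-archiver command-line tools but should be checked against the installed version, port 8080 is assumed to be the inference port, and the model name, handler path, and directories are hypothetical placeholders.

```python
import shlex


def archive_cmd(model_name, model_path, handler, export_path):
    """Command that packages a model directory into <model_name>.mar."""
    return [
        "model-archiver",
        "--model-name", model_name,
        "--model-path", model_path,
        "--handler", handler,          # "<module>:<function>" entry point
        "--export-path", export_path,  # directory where the .mar is written
    ]


def start_cmd(model_name, mar_location, model_store=None):
    """Command that starts MMS with one model.

    mar_location may be a local .mar file name or an http(s) URL.
    model_store is the local directory holding .mar files, if any.
    """
    cmd = ["multi-model-server", "--start",
           "--models", f"{model_name}={mar_location}"]
    if model_store is not None:
        cmd += ["--model-store", model_store]
    return cmd


def predict_url(model_name, host="127.0.0.1", port=8080):
    """URL of the prediction API for a model (port 8080 is an assumption)."""
    return f"http://{host}:{port}/predictions/{model_name}"


def stop_cmd():
    """Command that stops the running MMS process."""
    return ["multi-model-server", "--stop"]


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    print(shlex.join(archive_cmd("my_model", "/tmp/my_model",
                                 "model_handler:handle", "/tmp/model_store")))
    print(shlex.join(start_cmd("my_model", "my_model.mar",
                               model_store="/tmp/model_store")))
    print(predict_url("my_model"))
    print(shlex.join(stop_cmd()))
```

Each list can be passed to subprocess.run once the tools are installed; keeping the commands as data makes them easy to log and review first.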

SageMaker Inference Toolkit

  • A toolkit released as a high-level application that makes it easier to deploy MMS on SageMaker

  • It also provides the configuration, settings, and interfaces needed to get a SageMaker Multi-Model Endpoint started easily.

  • It supports Python model handlers only; to implement a handler in another language, use MMS directly.

  • Because it wraps MMS and runs it as a SageMaker inference container, writing a container is easier than using MMS directly.

  • The MXNet and PyTorch inference containers use this toolkit by default, so their inference handler script interfaces are identical.

    The PyTorch inference container is slated to migrate to TorchServe, which is built on MMS but adds PyTorch-specific features; source: https://twitter.com/shshnkp/status/1290801831518433280?s=20

SageMaker Multi-Model Endpoint

  • Released in late November 2019, just before re:Invent 2019

  • A regular endpoint downloads the model from S3 into the inference container when the endpoint is created and keeps it loaded in memory; a Multi-Model Endpoint loads models from S3 dynamically.

    • When the first request for a given model arrives, the model is downloaded from S3 into the inference container and loaded into memory at that point → cold start

    • A model that has been invoked once is already downloaded to the instance and loaded in memory, so later inferences run fast.

    • When cache space runs short, models are unloaded dynamically to make room for new ones.

      • The ModelCacheHit and ModelUnloadingTime metrics can be used to monitor model caching and unloading; if models are unloaded frequently, increase the number of instances or move to a larger instance type.

    • Therefore, a Multi-Model Endpoint is not a suitable solution when low latency and high TPS are required.

    • However, deploying a new model only requires copying it to S3, with no endpoint stop-update-start cycle, which is an advantage when deploying a large number of models or running A/B tests (provided that the framework, network architecture, and input/output formats are identical).

  • Among deep learning frameworks, Multi-Model Endpoint is supported by default only in the MXNet and PyTorch inference containers; for any other framework you need to build a container that starts and calls the MMS service (BYOC).

  • Most SageMaker built-in algorithms do not support Multi-Model Endpoint (this may change over time).

  • To use Multi-Model Endpoint with your own inference container, you must implement the LOAD MODEL, LIST MODEL, GET MODEL, UNLOAD MODEL, and INVOKE MODEL APIs; see https://docs.aws.amazon.com/sagemaker/latest/dg/mms-container-apis.html

  • Model Monitoring for multiple models is planned for a future release.

  • Caveats

    • Cannot be combined with Elastic Inference, and GPU instances are not supported.

    • It is not a multi-container endpoint: there is a single container and a single endpoint.
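
The dynamic load / cache hit / unload behaviour described above can be illustrated with a toy in-memory cache. This is only a sketch under stated assumptions: the notes do not say which eviction policy SageMaker uses or how it measures free space, so least-recently-used eviction with a fixed model count is assumed here, and the class name, the `_load_from_s3` placeholder, and the event labels are all hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Toy stand-in for the per-instance model cache of a Multi-Model Endpoint."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumption: capacity counted in models, not bytes
        self._loaded = OrderedDict()  # model name -> model object, least recently used first
        self.events = []              # (event, model_name) log, useful for inspection

    def _load_from_s3(self, name):
        # Placeholder for "download the artifact from S3 and load it into memory".
        return f"<model {name}>"

    def invoke(self, name):
        """Return the model for a request, loading or evicting as needed."""
        if name in self._loaded:
            # Already in memory: no download, so the request is served quickly.
            self._loaded.move_to_end(name)
            self.events.append(("cache_hit", name))
        else:
            if len(self._loaded) >= self.capacity:
                # Not enough room: unload the least recently used model first.
                evicted, _ = self._loaded.popitem(last=False)
                self.events.append(("unload", evicted))
            # First request for this model: pay the download and load cost.
            self._loaded[name] = self._load_from_s3(name)
            self.events.append(("cold_start", name))
        return self._loaded[name]


if __name__ == "__main__":
    cache = ModelCache(capacity=2)
    for model_name in ["a", "b", "a", "c"]:
        cache.invoke(model_name)
    for event in cache.events:
        print(event)
```

A request pattern that keeps cycling through more models than fit in memory produces a steady stream of unload and cold-start events, which is the situation where adding instances or moving to a larger instance type helps.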

Trade-off between Server Load and Response Latency

Horizontal scaling

  • With AutoScaling, scale-out/in can be driven by InvocationsPerInstance, one of the predefined CloudWatch metrics.
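
As an illustration of the scaling rule above, the sketch below builds the two request payloads that a target-tracking policy on invocations per instance would need. It only constructs dictionaries and makes no AWS call. The field names are believed to match the Application Auto Scaling API (`register_scalable_target` and `put_scaling_policy` in boto3) but should be verified against the current API reference; the endpoint name, variant name, policy name, and target value are hypothetical.

```python
def invocations_scaling_policy(endpoint_name, variant_name,
                               target_invocations, min_instances, max_instances):
    """Build the arguments for registering a scalable target and a scaling policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # Would be passed to register_scalable_target(**register_args).
    register_args = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }

    # Would be passed to put_scaling_policy(**policy_args).
    policy_args = {
        "PolicyName": f"{endpoint_name}-invocations-per-instance",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Instances are added or removed to keep the metric near this value.
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return register_args, policy_args


if __name__ == "__main__":
    register_args, policy_args = invocations_scaling_policy(
        "my-endpoint", "AllTraffic", target_invocations=100,
        min_instances=1, max_instances=4)
    print(register_args)
    print(policy_args)
```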

Vertical scaling

  • Choose an appropriate compute instance; if cost and latency matter, consider Elastic Inference or Inferentia.

  • Note, however, that Multi-Model Endpoint does not support GPU, Elastic Inference, or Inferentia.

2. How to initialize Multi-Model Endpoint on SageMaker?


Implementation overview

  1. Implement the handler

    • Method 1. Implement the Handler and HandlerService of the SageMaker Inference Toolkit

    • Method 2. Implement the Custom Service file from the MMS template

  2. Implement the serving entrypoint that starts the model server

  3. Create the Dockerfile

How to implement the handler

Method 1. Implement the Handler and HandlerService of the SageMaker Inference Toolkit

  • Implement the inference handler: it has the same shape as the input_fn, predict_fn, output_fn, and model_fn functions commonly used for inference with the SageMaker deep learning framework endpoints.

  • This is the reference approach of the SageMaker Inference Toolkit, but only template code is available; there is no complete example code.

  • Implement a HandlerService class that inherits from DefaultHandlerService.
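
The four handler functions named above can be sketched in plain Python as follows. This is a minimal sketch and does not import the toolkit: the `transform` helper is a hypothetical stand-in for the chaining that a HandlerService performs, and the linear "model", the JSON request format, and the model directory are assumptions made purely for illustration.

```python
import json


def model_fn(model_dir):
    """Load the model from model_dir (here: a hypothetical linear model)."""
    return {"weight": 2.0, "bias": 1.0}


def input_fn(request_body, content_type):
    """Deserialize the request payload into model input."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)["inputs"]


def predict_fn(data, model):
    """Run the model on the deserialized input."""
    return [model["weight"] * x + model["bias"] for x in data]


def output_fn(prediction, accept):
    """Serialize the prediction into the response payload."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"outputs": prediction})


def transform(model, request_body,
              content_type="application/json", accept="application/json"):
    """Hypothetical helper: chain input_fn -> predict_fn -> output_fn."""
    data = input_fn(request_body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


if __name__ == "__main__":
    model = model_fn("/opt/ml/model")  # hypothetical model directory
    print(transform(model, '{"inputs": [1, 2]}'))
```

Keeping each stage as a separate function means only the stage that differs from the default behaviour (often just model_fn) needs to be overridden.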

Method 2. Structure of the MMS template Custom Service file
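
A Custom Service file of this kind is usually a handler class plus a module-level `handle(data, context)` function that the model server calls. The sketch below follows that layout under stated assumptions: the method names mirror the commonly used MMS template, the "model" is a placeholder that returns the length of each payload, and the request layout (a list of dicts carrying a `body` or `data` key) is assumed rather than taken from these notes.

```python
class ModelHandler:
    """Skeleton of a custom service: initialize once, then handle each batch."""

    def __init__(self):
        self.initialized = False
        self.model = None

    def initialize(self, context):
        # The server passes a context whose system_properties include the
        # directory of the unpacked model archive (assumed key: "model_dir").
        model_dir = context.system_properties.get("model_dir")
        # Placeholder model; a real handler would load artifacts from model_dir.
        self.model = lambda batch: [len(item) for item in batch]
        self.initialized = True

    def preprocess(self, request):
        # Assumed request layout: one dict per request in the batch.
        return [row.get("body") or row.get("data") for row in request]

    def inference(self, model_input):
        return self.model(model_input)

    def postprocess(self, inference_output):
        # Must return one result per request in the batch.
        return inference_output

    def handle(self, data, context):
        model_input = self.preprocess(data)
        model_output = self.inference(model_input)
        return self.postprocess(model_output)


_service = ModelHandler()


def handle(data, context):
    """Module-level entry point invoked by the model server."""
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    return _service.handle(data, context)


if __name__ == "__main__":
    from types import SimpleNamespace

    # Fake context for a local dry run; the path is hypothetical.
    fake_context = SimpleNamespace(system_properties={"model_dir": "/tmp/model"})
    print(handle([{"body": b"abc"}, {"body": b"hello"}], fake_context))
```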

Example Docker entrypoint definition
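
A container entrypoint for this setup typically checks whether it was started with the `serve` argument and, if so, starts the model server; any other arguments are run as an ordinary command. The sketch below shows only that dispatch logic, with the server start injected as a function so it can be exercised without the toolkit installed. In a real container `start_server` would presumably call the toolkit's model-server start function with the handler path; that wiring, and the script name used in the examples, are assumptions.

```python
import subprocess
import sys


def main(argv, start_server, run_command=subprocess.check_call):
    """Dispatch on the container arguments.

    argv         -- full argument list, argv[0] being the script name
    start_server -- callable that starts the model server (injected)
    run_command  -- callable that runs any other command (injected for testing)
    """
    if len(argv) > 1 and argv[1] == "serve":
        # The hosting platform starts the container with the "serve" argument.
        start_server()
    elif len(argv) > 1:
        # Anything else is treated as a normal command, e.g. for debugging.
        run_command(argv[1:])


if __name__ == "__main__":
    def _start_server():
        # Assumption: replace with the real model-server start call.
        print("model server would start here")

    main(sys.argv, start_server=_start_server)
```

Injecting the two callables keeps the entrypoint testable locally; a real entrypoint would usually also block after starting the server so the container does not exit.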

