Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.
You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks to AWS Glue Interactive Sessions, which run Spark jobs on a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.
Alternatively, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option lets you select the instance type (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, enabling greater flexibility for data processing and model training.
Finally, you can run Spark applications by connecting Studio notebooks to Amazon EMR clusters, or by running your own Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).
All these options let you generate and store Spark event logs and analyze them through the web-based user interface commonly called the Spark UI, which runs a Spark History Server to monitor the progress of Spark applications, track resource usage, and debug errors.
In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, so you can analyze Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.
Solution overview
The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:
- Accessing logs generated by SageMaker Processing Spark jobs
- Accessing logs generated by AWS Glue Spark applications
- Accessing logs generated by self-managed Spark clusters and Amazon EMR
A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli makes it possible to manage Spark History Server without leaving SageMaker Studio.
The solution consists of shell scripts that perform the following actions:
- Install Spark on the Jupyter Server for SageMaker Studio user profiles or for a SageMaker Studio shared space
- Install the sm-spark-cli for a user profile or shared space
Install the Spark UI manually in a SageMaker Studio domain
To host the Spark UI on SageMaker Studio, complete the following steps:
- Choose System terminal from the SageMaker Studio launcher.
- Run the following commands in the system terminal:
The commands will take a few seconds to complete.
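The exact installation commands are published in the companion GitHub repository; the following is only a hedged sketch of what they typically look like (the release URL, archive name, and script name are assumptions, so copy the real commands from the repository):

```shell
# Hypothetical sketch: download the solution archive from the companion
# GitHub repository and run the provided installation script.
# The URL, version, and script name below are placeholders.
curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui.tar.gz
tar -xvzf amazon-sagemaker-spark-ui.tar.gz
cd amazon-sagemaker-spark-ui/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh
```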
- When the installation is complete, you can start the Spark UI by using the provided sm-spark-cli and access it from a web browser by running the following code:
sm-spark-cli begin s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>
The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.
For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set the Spark event log location directly from the notebook by using the sparkmagic kernel.
The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It provides magic commands (%spark, %sql) to run Spark code, perform SQL queries, and configure Spark settings such as executor memory and cores.
For SageMaker Processing jobs, you can configure the Spark event log location directly from the SageMaker Python SDK.
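As a minimal sketch with the SageMaker Python SDK, the spark_event_logs_s3_uri argument of the run call tells SageMaker Processing where to write the event logs. The role, bucket, instance settings, and script name below are placeholders:

```python
# Hedged sketch: run a PySpark job with SageMaker Processing and write the
# Spark event logs to S3. Role ARN, bucket, and script names are placeholders.

def event_log_uri(bucket: str, prefix: str) -> str:
    """Build the S3 URI that the Spark History Server will later read from."""
    return f"s3://{bucket}/{prefix.strip('/')}/"

def run_spark_job(role_arn: str, bucket: str) -> None:
    # Requires the `sagemaker` SDK and valid AWS credentials.
    from sagemaker.spark.processing import PySparkProcessor

    processor = PySparkProcessor(
        base_job_name="sm-spark",
        framework_version="3.1",  # Spark container version
        role=role_arn,
        instance_count=2,
        instance_type="ml.m5.xlarge",
    )
    processor.run(
        submit_app="./preprocess.py",  # your PySpark script
        spark_event_logs_s3_uri=event_log_uri(bucket, "spark-event-logs"),
    )
```

The event log URI passed here is the same location you later give to sm-spark-cli start.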
Refer to the AWS documentation for more information.
You can choose the generated URL to access the Spark UI.
The following screenshot shows an example of the Spark UI.
You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio system terminal.
You can also stop the Spark History Server when needed.
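Putting the commands described above together, a typical session in the Studio system terminal might look like the following (the bucket and prefix are placeholders):

```shell
# Start the History Server against your event log location and print the UI URL
sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/spark-event-logs

# Check whether the History Server is running
sm-spark-cli status

# Stop the History Server when you are done
sm-spark-cli stop
```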
Automate the Spark UI installation for users in a SageMaker Studio domain
As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.
You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation then runs for all the user profiles in the domain.
From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:
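A hedged sketch of the provisioning commands follows, assuming install-history-server.sh is in the current directory; the lifecycle configuration name is an arbitrary choice, and <DOMAIN_ID> and <LCC_ARN> are placeholders you must fill in:

```shell
# Hedged sketch: register the install script as a JupyterServer lifecycle
# configuration, then attach it to an existing Studio domain.
LCC_CONTENT=$(openssl base64 -A -in install-history-server.sh)

aws sagemaker create-studio-lifecycle-config \
  --studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
  --studio-lifecycle-config-content "$LCC_CONTENT" \
  --studio-lifecycle-config-app-type JupyterServer

# Attach the ARN returned by the previous command to the domain defaults.
aws sagemaker update-domain \
  --domain-id <DOMAIN_ID> \
  --default-user-settings \
  '{"JupyterServerAppSettings":{"DefaultResourceSpec":{"InstanceType":"system","LifecycleConfigArn":"<LCC_ARN>"},"LifecycleConfigArns":["<LCC_ARN>"]}}'
```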
After Jupyter Server restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.
Clean up
In this section, we show how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.
Manually uninstall the Spark UI
To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:
- Choose System terminal in the SageMaker Studio launcher.
- Run the following commands in the system terminal:
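The uninstall commands mirror the installation; the following is a hedged sketch assuming the solution was extracted as during installation (the directory and script names are assumptions, so copy the real commands from the GitHub repository):

```shell
# Hypothetical sketch: run the uninstall script shipped with the solution.
cd amazon-sagemaker-spark-ui/install-scripts
chmod +x uninstall-history-server.sh
./uninstall-history-server.sh
```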
Uninstall the Spark UI automatically for all SageMaker Studio user profiles
To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:
- On the SageMaker console, choose Domains in the navigation pane, then choose the SageMaker Studio domain.
- On the domain details page, navigate to the Environment tab.
- Select the lifecycle configuration for the Spark UI on SageMaker Studio.
- Choose Detach.
- Delete and restart the Jupyter Server apps for the SageMaker Studio user profiles.
Conclusion
In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid proliferation of custom development environments for ML projects.
All the code shown as part of this post is available in the GitHub repository.
About the Authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering experience and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them understand their technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His fields of expertise include end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.