Get started with R Server on HDInsight

Azure HDInsight includes an R Server option that can be integrated into your HDInsight cluster. This option allows R scripts to use Spark and MapReduce to run distributed computations. In this document, you learn how to create an R Server on HDInsight cluster and then run an R script that demonstrates using Spark for distributed R computations.

Automated cluster creation

You can automate the creation of HDInsight R Servers using Azure Resource Manager templates, the SDK, and PowerShell.

Prerequisites

An Azure subscription: Before you begin this tutorial, you must have an Azure subscription. For more information, see the article Get Microsoft Azure free trial.

A Secure Shell (SSH) client: An SSH client is used to remotely connect to the HDInsight cluster and run commands directly on the cluster. For more information, see Use SSH with HDInsight.

SSH keys (optional): You can secure the SSH account used to connect to the cluster using either a password or a public key. Using a password is easier and lets you get started without having to create a public/private key pair. However, using a key is more secure.

Note: The steps in this document assume that you are using a password.

Create the cluster using the Azure portal

Sign in to the Azure portal.

Select NEW > Intelligence + Analytics > HDInsight.

In the Quick create experience, enter a name for the cluster in the Cluster Name field. If you have multiple Azure subscriptions, use the Subscription entry to select the one you want to use.

Select Cluster type to open the Cluster configuration blade. On the Cluster configuration blade, select the following options:

Cluster Type: R Server.

Version: Select the version of R Server to install on the cluster. The version currently available is R Server 9 (HDI 3.6).
Release notes are available for each version of R Server.

RStudio community edition for R Server: This browser-based IDE is installed by default on the edge node. If you prefer not to have it installed, clear the check box. If you choose to have it installed, the URL for accessing the RStudio Server login is found on a portal application blade for your cluster once it has been created.

Leave the other options at the default values and use the Select button to save the cluster type.

Enter a Cluster login username and Cluster login password.

Specify an SSH Username. SSH is used to remotely connect to the cluster using a Secure Shell (SSH) client. You can specify the SSH user either in this dialog or, after the cluster has been created, in the Configuration tab for the cluster. R Server is configured to expect an SSH username of remoteuser. If you use a different username, you must perform an additional step after the cluster is created.

Leave the box checked for Use same password as cluster login to use PASSWORD as the authentication type, unless you prefer to use a public key. You need a public/private key pair to access R Server on the cluster from a remote client such as RTVS, RStudio, or another desktop IDE. If you install the RStudio Server community edition, you need to choose an SSH password.

To create and use a public/private key pair, clear Use same password as cluster login, select PUBLIC KEY, and proceed as follows. These instructions assume that you have Cygwin with ssh-keygen or an equivalent installed.

Generate a public/private key pair from the command prompt on your laptop: ssh-keygen -t rsa -b 2.

Follow the prompt to name a key file and then enter a passphrase for added security. This command creates a private key file and a public key file under the name you chose at the prompt.

Then specify the public key file when assigning the HDI cluster credentials, confirm your resource group and region, and select Next.

Change permissions on the private key file on your laptop: chmod 6.

Use the private key file with SSH (ssh -i) for remote login as remoteuser, or as part of the definition of your Hadoop Spark compute context for R Server on the client. See the Using Microsoft R Server as a Hadoop Client subsection in Create a Compute Context for Spark.

The quick create transitions you to the Storage blade, where you select the storage account settings for the primary location of the HDFS file system used by the cluster. Select either a new or existing Azure Storage account, or an existing Data Lake Storage account.

If you select an Azure Storage account, pick an existing storage account by choosing Select a storage account and then selecting the relevant account. To create a new account, use the Create New link in the Select a storage account section.

Note: If you select New, you must enter a name for the new storage account. A green check appears if the name is accepted.

The Default Container defaults to the name of the cluster. Leave this default as the value.

If you selected the new storage account option, you are prompted to select a Location, which is the region in which the storage account is created.

Important: Selecting the location for the default data source also sets the location of the HDInsight cluster. The cluster and default data source must be in the same region.

If you want to use an existing Data Lake Store, select the ADLS storage account to use and add the cluster AAD identity to your cluster to allow access to the store. For more information on this process, see Creating HDInsight cluster with Data Lake Store using Azure portal.

Use the Select button to save the data source configuration.

The Summary blade then displays so that you can validate all your settings.
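The SSH key steps described in this section (generate a key pair, restrict the private key's permissions, log in with the key) can be sketched as a small Python helper that only assembles the command lines. This is a minimal sketch, not tooling from the article: the 2048-bit key size, the 600 file mode, the key file name, and the host name are illustrative assumptions, because the text does not show them in full. Only the remoteuser account name comes from the text.

```python
import shlex


def keygen_command(key_file, bits=2048):
    """Build the ssh-keygen call that creates an RSA key pair.

    bits=2048 is an assumed key size, not a value taken from the article.
    """
    return ["ssh-keygen", "-t", "rsa", "-b", str(bits), "-f", key_file]


def restrict_key_command(key_file, mode="600"):
    """Build the chmod call for the private key (mode 600 is an assumption)."""
    return ["chmod", mode, key_file]


def key_login_command(key_file, host, user="remoteuser"):
    """Build the ssh call that logs in with the private key file."""
    return ["ssh", "-i", key_file, f"{user}@{host}"]


if __name__ == "__main__":
    key_file = "hdi_rserver_key"        # hypothetical key file name
    host = "edge-node.example.invalid"  # hypothetical host name
    for command in (
        keygen_command(key_file),
        restrict_key_command(key_file),
        key_login_command(key_file, host),
    ):
        # Print each command as a shell-quoted line instead of running it.
        print(shlex.join(command))
```

ssh-keygen writes the public key next to the private key with a .pub suffix; that is the file to supply as the SSH public key when you create the cluster.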
On the Summary blade, you can change your Cluster size to modify the number of servers in your cluster and also specify any Script actions you want to run. Unless you know that you need a larger cluster, leave the number of worker nodes at the default of 4. The estimated cost of the cluster is shown within the blade.

Note: If needed, you can resize your cluster later through the portal (Cluster > Settings > Scale Cluster) to increase or decrease the number of worker nodes. This resizing can be useful for idling down the cluster when not in use, or for adding capacity to meet the needs of larger tasks.

Some factors to keep in mind when sizing your cluster, the data nodes, and the edge node include:

The performance of distributed R Server analyses on Spark is proportional to the number of worker nodes when the data is large.

The performance of R Server analyses is linear in the size of data being analyzed. For example:

For small to modest data, performance is best when analyzed in a local compute context on the edge node. For more information on the scenarios under which the local and Spark compute contexts work best, see Compute context options for R Server on HDInsight.

If you log in to the edge node and run your R script, then all but the ScaleR rx functions are executed locally on the edge node, so the memory and number of cores of the edge node should be sized accordingly. The same applies if you use R Server on HDI as a remote compute context from your laptop.

Use the Select button to save the node pricing configuration.

There is also a link for Download template and parameters. Click this link to display scripts that can be used to automate the creation of a cluster with the selected configuration. These scripts are also available from the Azure portal entry for your cluster once it has been created.

Note: It takes some time for the cluster to be created, usually around 2.
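The sizing factors above can be made concrete with a toy cost model in Python. It is only a sketch under stated assumptions: the per-gigabyte processing time and the fixed Spark start-up overhead are hypothetical numbers chosen for illustration, not measurements of R Server. The model keeps the two relationships the text describes: run time grows linearly with data size, and on Spark it shrinks in proportion to the number of worker nodes.

```python
def estimated_runtime(data_gb, context, worker_nodes=4,
                      seconds_per_gb=1.0, spark_overhead_s=60.0):
    """Estimate run time in seconds for a local or Spark compute context.

    seconds_per_gb and spark_overhead_s are hypothetical constants.
    """
    if context == "local":
        # Everything runs on the edge node: linear in the data size.
        return data_gb * seconds_per_gb
    if context == "spark":
        # Work is split across worker nodes, plus an assumed fixed overhead.
        return spark_overhead_s + data_gb * seconds_per_gb / worker_nodes
    raise ValueError(f"unknown compute context: {context!r}")


def pick_context(data_gb, worker_nodes=4):
    """Return the compute context with the lower estimated run time."""
    local = estimated_runtime(data_gb, "local")
    spark = estimated_runtime(data_gb, "spark", worker_nodes)
    return "local" if local <= spark else "spark"


if __name__ == "__main__":
    for size_gb in (1, 50, 1000):
        print(size_gb, "GB ->", pick_context(size_gb))
```

With these assumed constants, small inputs favor the local context on the edge node, while large inputs favor Spark, and adding worker nodes shortens only the Spark estimate.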
Use the tile on the Startboard, or the Notifications entry on the left of the page, to check on the creation process.

Connect to RStudio Server

If you chose to include RStudio Server community edition in your installation, you can access the RStudio login in either of two ways:

Go to the following URL, where CLUSTERNAME is the name of the cluster you created: https://CLUSTERNAME.

Open the entry for your cluster in the Azure portal, select the R Server Dashboards quick link, and then select the R Studio Dashboard.

Important: Regardless of the method used, the first time you log in you need to authenticate twice. At the first prompt, provide the cluster Admin user ID and password. At the second prompt, provide the SSH user ID and password. Subsequent logins require only the SSH user ID and password.

Connect to the R Server edge node

Connect to the R Server edge node of the HDInsight cluster using SSH with the command: ssh USERNAME@CLUSTERNAME-ed-ssh.
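The connection details above can be sketched as a few Python helpers. This is an illustrative sketch, not an official client: the host name in the text is cut off after "ed-ssh.", so the domain suffix is left as a required argument that you supply yourself, and the value used in the demo is a placeholder. The function names are hypothetical.

```python
def edge_node_host(cluster_name, domain_suffix):
    """Build the edge node host name: CLUSTERNAME-ed-ssh.<domain_suffix>.

    domain_suffix must be supplied by the caller; it is not given here.
    """
    return f"{cluster_name}-ed-ssh.{domain_suffix.lstrip('.')}"


def edge_node_ssh_command(username, cluster_name, domain_suffix):
    """Build the ssh command for logging in to the edge node."""
    host = edge_node_host(cluster_name, domain_suffix)
    return ["ssh", f"{username}@{host}"]


def rstudio_credentials_needed(first_login):
    """List the credential prompts for an RStudio Server login, in order."""
    if first_login:
        # First login: cluster admin credentials, then SSH credentials.
        return ["cluster admin", "SSH"]
    return ["SSH"]


if __name__ == "__main__":
    # "example.invalid" is a placeholder suffix, not the real domain.
    print(" ".join(edge_node_ssh_command("remoteuser", "mycluster", "example.invalid")))
    print(rstudio_credentials_needed(first_login=True))
```

Keeping the suffix as a parameter avoids hard-coding a value the text does not show; replace the placeholder with the domain listed for your cluster in the Azure portal.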