Using Google Colab to install Spark and read a CSV file
Search Google for "Google Colab" to create an online development environment (interactive computing) that will not use your own machine's resources.
Click File, then New notebook.
![](https://static.wixstatic.com/media/09d18d_82def6bab02b48649eb581891bdf06d7~mv2.png/v1/fill/w_871,h_422,al_c,q_90,enc_avif,quality_auto/09d18d_82def6bab02b48649eb581891bdf06d7~mv2.png)
Installing the PySpark environment in our Google Colab environment.
![](https://static.wixstatic.com/media/09d18d_29a9ba5af8c94be190ceede3e888236d~mv2.png/v1/fill/w_916,h_193,al_c,q_85,enc_avif,quality_auto/09d18d_29a9ba5af8c94be190ceede3e888236d~mv2.png)
```shell
%%bash
# Install Java
apt-get update && apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Install PySpark
pip install -q pyspark
```
After typing the command, press SHIFT + Enter to run the cell.
We use %%bash on the first line of the cell to indicate that it contains terminal commands.
![](https://static.wixstatic.com/media/09d18d_c2609b90360a451ea0330857a8673887~mv2.png/v1/fill/w_980,h_81,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/09d18d_c2609b90360a451ea0330857a8673887~mv2.png)
```python
# Set an environment variable so Spark can properly locate the Java installation
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
```
![](https://static.wixstatic.com/media/09d18d_a8c77ee7a04a4dcfbddd5d64c9944adb~mv2.png/v1/fill/w_980,h_76,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/09d18d_a8c77ee7a04a4dcfbddd5d64c9944adb~mv2.png)
![](https://static.wixstatic.com/media/09d18d_023e49ca8c064bdb88b5bf14b3de7a1b~mv2.png/v1/fill/w_361,h_291,al_c,q_85,enc_avif,quality_auto/09d18d_023e49ca8c064bdb88b5bf14b3de7a1b~mv2.png)
```shell
%%bash
# Download the data used in this example; create a directory for it first.
# Note: raw.githubusercontent.com URLs do not include the "/blob/" segment
mkdir titanic
curl https://raw.githubusercontent.com/neylsoncrepalde/titanic_data_with_semicolon/main/titanic.csv -o titanic/titanic.csv
```
![](https://static.wixstatic.com/media/09d18d_fa0fd9a4daa84b1e9b1fe96ad0466f1e~mv2.png/v1/fill/w_480,h_110,al_c,q_85,enc_avif,quality_auto/09d18d_fa0fd9a4daa84b1e9b1fe96ad0466f1e~mv2.png)
# Import the required modules
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
![](https://static.wixstatic.com/media/09d18d_75db7441cd364f018d6270ec6a549ec1~mv2.png/v1/fill/w_494,h_70,al_c,q_85,enc_avif,quality_auto/09d18d_75db7441cd364f018d6270ec6a549ec1~mv2.png)
In this example we use "inferSchema" because the dataset is small; for large datasets it is usually better to declare the schema explicitly, since inferring it requires an extra pass over the data.
![](https://static.wixstatic.com/media/09d18d_51de05452fc34e0cb192d091b0275168~mv2.png/v1/fill/w_471,h_359,al_c,q_85,enc_avif,quality_auto/09d18d_51de05452fc34e0cb192d091b0275168~mv2.png)
![](https://static.wixstatic.com/media/09d18d_e624723dbb7246ac8fc39872b36201ac~mv2.png/v1/fill/w_655,h_408,al_c,q_85,enc_avif,quality_auto/09d18d_e624723dbb7246ac8fc39872b36201ac~mv2.png)
![](https://static.wixstatic.com/media/09d18d_b6bd5aa8c48249b8aa874b45848eceef~mv2.png/v1/fill/w_179,h_42,al_c,q_85,enc_avif,quality_auto/09d18d_b6bd5aa8c48249b8aa874b45848eceef~mv2.png)
Run `titanic.show()` to view the data.
![](https://static.wixstatic.com/media/09d18d_9ec3f9c2981d44379ef08ef5dfb151fa~mv2.png/v1/fill/w_980,h_153,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/09d18d_9ec3f9c2981d44379ef08ef5dfb151fa~mv2.png)
![](https://static.wixstatic.com/media/09d18d_68cdf63aa1c142dc92ac9ea64a1aed6d~mv2.png/v1/fill/w_340,h_232,al_c,q_85,enc_avif,quality_auto/09d18d_68cdf63aa1c142dc92ac9ea64a1aed6d~mv2.png)
![](https://static.wixstatic.com/media/09d18d_89e12c82996b411eafbcd6ffa3aa2c3f~mv2.png/v1/fill/w_330,h_234,al_c,q_85,enc_avif,quality_auto/09d18d_89e12c82996b411eafbcd6ffa3aa2c3f~mv2.png)
![](https://static.wixstatic.com/media/09d18d_40bae9fd86a34c26a3bf28c6b0920b72~mv2.png/v1/fill/w_960,h_137,al_c,q_85,enc_avif,quality_auto/09d18d_40bae9fd86a34c26a3bf28c6b0920b72~mv2.png)