Install DataVirtuality JDBC driver
Windows
Download the latest DataVirtuality JDBC driver from here.
Copy the jar file to your spark install folder:
D:\spark-2.3.1-bin-hadoop2.6\jars
Linux
Download the latest DataVirtuality JDBC driver from here.
Copy the jar file to your spark install folder:
~/.local/lib/python3.11/site-packages/pyspark/jars/
Alternative solution using SparkConf
https://stackoverflow.com/questions/46925864/how-to-add-jdbc-drivers-to-classpath-when-using-pyspark
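If you prefer not to copy the jar into PySpark's jars folder, the StackOverflow answer above passes the driver jar through Spark configuration instead. A minimal sketch of that idea, where the jar path is a placeholder you must point at wherever you saved the DataVirtuality driver:

```python
# Sketch: supplying the DataVirtuality driver jar via Spark configuration
# instead of copying it into pyspark's jars folder.
jar_path = "/path/to/datavirtuality-jdbc.jar"  # placeholder, adjust to your download

conf = {
    "spark.jars": jar_path,                   # ship the jar to the executors
    "spark.driver.extraClassPath": jar_path,  # make it visible on the driver classpath
}

# Each key/value pair maps onto a SparkSession.builder.config(key, value) call:
#   spark = SparkSession.builder.config("spark.jars", jar_path).getOrCreate()
# or onto --conf flags when launching with spark-submit:
submit_cmd = "spark-submit " + " ".join(f"--conf {k}={v}" for k, v in conf.items())
print(submit_cmd)
```

Setting both `spark.jars` and `spark.driver.extraClassPath` covers both where the executors and where the driver look for the class.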
Enabling JDBC connection
In your DataVirtuality web interface, go to Preferences, make sure that the JDBC port is open, and note down the port number.
Querying from PySpark in Jupyter Notebook
Now that the JDBC driver is set up, we can connect to DataVirtuality and query data from it.
Open a new Jupyter Notebook session and copy the following code.
Replace the highlighted values with your own credentials, and provide a working SQL query.
NOTE: use mm instead of mms in jdbc_url if you are not using an SSL connection.
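To keep the mm/mms choice explicit, the connection URL can also be assembled from its parts. A small sketch, where the host, port, and credentials are placeholders for your own values:

```python
def build_jdbc_url(host, port, user, password, ssl=True):
    """Assemble a DataVirtuality JDBC URL; mms is the SSL transport, mm is plain."""
    transport = "mms" if ssl else "mm"
    return (f"jdbc:datavirtuality:datavirtuality@{transport}://{host}:{port};"
            f"user={user};password={password}")

# Placeholder values; substitute your own host, JDBC port, and credentials.
print(build_jdbc_url("DVHostname", 31000, "db_username", "db_password", ssl=False))
```

The resulting string is what goes into the jdbc_url variable in the notebook code below.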
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:datavirtuality:datavirtuality@mms://DVHostname:PortNumber;user=db_username;password=db_password"

# Spark's jdbc source expects the query wrapped as an aliased subquery
query = "(select * from hands_on.orders) as table1"

df1 = (spark.read.format('jdbc')
       .options(driver='com.datavirtuality.dv.jdbc.Driver',
                url=jdbc_url,
                dbtable=query)
       .load())
df1.show()