When I started learning data analysis a few years ago, the first things I learned were SQL and Pandas. A strong foundation in both is crucial for a data analyst: they are powerful tools that help us efficiently analyze and manipulate data stored in databases.
Overview of SQL and Pandas
SQL (Structured Query Language) is a programming language used to manage and manipulate relational databases. On the other hand, Pandas is a Python library used for data manipulation and analysis.
Data analysis involves working with large amounts of data, and databases are often used to store this data. SQL and Pandas provide powerful tools for working with databases, allowing data analysts to efficiently extract, manipulate, and analyze data. By leveraging these tools, data analysts can gain valuable insights from data that would otherwise be difficult to obtain.
In this article, we will explore how to use SQL and Pandas to read and write to a database.
Connecting to the DB
Installing the Libraries
Before we can connect to a SQL database with Pandas, we first need to install the necessary libraries. The two main libraries required are Pandas and SQLAlchemy. Pandas, as mentioned in the introduction, is a popular data manipulation library that can hold large data structures in memory, while SQLAlchemy provides an API for connecting to and interacting with SQL databases.
We can install both libraries using the Python package manager, pip, by running the following commands at the command prompt.
$ pip install pandas
$ pip install sqlalchemy
Making the Connection
With the libraries installed, we can now use Pandas to connect to the SQL database.
To begin, we will create a SQLAlchemy engine object with create_engine(). The create_engine() function connects the Python code to the database. It takes as an argument a connection string that specifies the database type and connection details. In this example, we'll use the SQLite database type and the path to the database file.
Create an engine object for a SQLite database using the example below:
import pandas as pd
from sqlalchemy import create_engine
# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')
If the SQLite database file, student.db in our case, is in the same directory as the Python script, we can use the file name directly, as shown below.
engine = create_engine('sqlite:///student.db')
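The same pattern works for other database backends; only the connection string changes, and the matching driver must be installed (for example, psycopg2 for PostgreSQL or PyMySQL for MySQL). Here is a rough sketch; the host, port, user, password, and database name below are placeholders, not values from this article:

from sqlalchemy import create_engine

# PostgreSQL (requires a driver such as psycopg2); placeholder credentials
pg_engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/studentdb')

# MySQL (requires a driver such as PyMySQL); placeholder credentials
mysql_engine = create_engine('mysql+pymysql://user:password@localhost:3306/studentdb')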
Reading SQL Files with Pandas
Let's read some data now that we've established a connection. In this section, we will look at the read_sql(), read_sql_table(), and read_sql_query() functions and how to use them to work with a database.
Executing SQL Queries Using Pandas' read_sql() Function
read_sql() is a Pandas library function that allows us to execute an SQL query and retrieve the results into a Pandas DataFrame. The read_sql() function connects SQL and Python, allowing us to take advantage of the power of both languages. The function wraps read_sql_table() and read_sql_query(): the call is routed internally based on the input provided, so if the input is an SQL query it is routed to read_sql_query(), and if it is a database table name it is routed to read_sql_table().

The read_sql() syntax is as follows:
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
The sql and con parameters are required; the rest are optional. However, we can shape the result using these optional parameters. Let's take a closer look at each parameter.
- sql: SQL query or database table name.
- con: Connection object or connection URL.
- index_col: This parameter allows us to use one or more columns from the SQL query result as a DataFrame index. It can take either a single column or a list of columns.
- coerce_float: This parameter specifies whether non-numerical values should be converted to floating-point numbers or left as strings. It is set to True by default and converts non-numeric values to float types where possible.
- params: The params parameter provides a secure method for passing dynamic values to the SQL query. We can pass a dictionary, tuple, or list. The exact syntax of params varies depending on the database driver.
- parse_dates: This allows us to specify which columns in the resulting DataFrame should be interpreted as dates. It accepts a single column, a list of columns, or a dictionary with the column name as the key and the column format as the value.
- columns: This allows us to fetch only selected columns from the table.
- chunksize: When working with a large data set, chunksize is important. It retrieves the query result in smaller chunks, improving performance.
Here's an example of how to use read_sql():
Code:
import pandas as pd
from sqlalchemy import create_engine
# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')
# Fetch all records from Student table and manipulate the result
df = pd.read_sql("SELECT * FROM Student", engine, index_col='rollNumber', parse_dates=['dateOfBirth'])
print(df)
print("The Data type of dateOfBirth: ", df.dateOfBirth.dtype)
# Close the Database connection
engine.dispose()
Output:
firstName lastName email dateOfBirth
rollNumber
1 Mark Simson [email protected] 2000-02-23
2 Peter Griffen [email protected] 2001-04-15
3 Meg Aniston [email protected] 2001-09-20
The Data type of dateOfBirth:  datetime64[ns]
After connecting to the database, we execute a query that returns all records from the Student table and store them in the DataFrame df. The "rollNumber" column is converted into an index using the index_col parameter, and the "dateOfBirth" datatype is "datetime64[ns]" due to parse_dates. Note that read_sql() is a generic function: we can pass it either a raw SQL query, as we did here, or simply a table name, and it will route the call to read_sql_query() or read_sql_table() accordingly.
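The params parameter deserves a quick demonstration, since it is the safe way to feed dynamic values into a query. Below is a minimal sketch against the same Student table; note that the placeholder style (? here, which the SQLite driver understands) varies by database driver, so other databases may expect %s or named parameters instead.

import pandas as pd
from sqlalchemy import create_engine

# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')

# Pass the dynamic value separately instead of formatting it into the SQL string
df = pd.read_sql("SELECT * FROM Student WHERE rollNumber = ?", engine, params=(1,))
print(df)

# Close the Database connection
engine.dispose()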
Loading Specific Tables or Views from the DB
Loading a specific table or view with Pandas' read_sql_table() is another technique to read data from the database into a Pandas DataFrame.

What is read_sql_table?

The Pandas library provides the read_sql_table() function, which is specifically designed to read an entire SQL table, without writing any SQL, and return the result as a Pandas DataFrame.

The syntax of read_sql_table() is as below:
pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)
Except for table_name and schema, the parameters are explained in the same way as for read_sql().

- table_name: The name of the SQL table in the database.
- schema: This optional parameter is the name of the schema containing the table.
After creating a connection to the database, we will use the read_sql_table() function to load the Student table into a Pandas DataFrame.
import pandas as pd
from sqlalchemy import create_engine
# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')
# Load Student Table from database
df = pd.read_sql_table('Student', engine)
print(df.head())
# Close the Database connection
engine.dispose()
Output:
rollNumber firstName lastName email dateOfBirth
0 1 Mark Simson [email protected] 2000-02-23
1 2 Peter Griffen [email protected] 2001-04-15
2 3 Meg Aniston [email protected] 2001-09-20
Now let's assume the Student table is large enough that loading it all at once is memory-intensive. Let's explore how we can use the chunksize parameter to address this issue.
Code:
import pandas as pd
from sqlalchemy import create_engine
# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')
# Load student table from database
df_iterator = pd.read_sql_table('Student', engine, chunksize=1)
# Iterate the dataframe
for df in df_iterator:
print(df.head())
# Close the Database connection
engine.dispose()
Output:
rollNumber firstName lastName email dateOfBirth
0 1 Mark Simson [email protected] 2000-02-23
0 2 Peter Griffen [email protected] 2001-04-15
0 3 Meg Aniston [email protected] 2001-09-20
Please keep in mind that the chunksize I'm using here is 1 only because I have just 3 records in my table.
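read_sql_table() also accepts the columns parameter described earlier, which is handy when we only need a few fields from a wide table. A minimal sketch, again using the Student table:

import pandas as pd
from sqlalchemy import create_engine

# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')

# Load only the name columns instead of the whole table
df = pd.read_sql_table('Student', engine, columns=['firstName', 'lastName'])
print(df.head())

# Close the Database connection
engine.dispose()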
Querying the DB Directly with Pandas' SQL Syntax
Extracting insights from the database is an important part of the job for data analysts and scientists. To do so, we will leverage the read_sql_query() function.
What is read_sql_query()?
Using Pandas' read_sql_query() function, we can run SQL queries and get the results directly into a DataFrame. The read_sql_query() function is created specifically for SELECT statements. It cannot be used for any other operations, such as DELETE or UPDATE.
Syntax:
pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None, dtype_backend=_NoDefault.no_default)
All parameter descriptions are the same as for the read_sql() function. Here's an example of read_sql_query():
Code:
import pandas as pd
from sqlalchemy import create_engine
# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')
# Query the Student table
df = pd.read_sql_query('SELECT firstName, lastName FROM Student WHERE rollNumber = 1', engine)
print(df)
# Close the Database connection
engine.dispose()
Output:
firstName lastName
0 Mark Simson
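Because read_sql_query() accepts arbitrary SELECT statements, it also lets us push aggregations down to the database and bring only the summary into Pandas. A small sketch, assuming dateOfBirth is stored as an ISO-formatted date string in SQLite:

import pandas as pd
from sqlalchemy import create_engine

# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')

# Let SQLite count students per birth year and return only the summary
df = pd.read_sql_query(
    "SELECT strftime('%Y', dateOfBirth) AS birthYear, COUNT(*) AS students "
    "FROM Student GROUP BY birthYear",
    engine
)
print(df)

# Close the Database connection
engine.dispose()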
Writing SQL Files with Pandas
While analyzing data, suppose we discover that a few entries need to be modified or that a new table or view with the data is required. One way to insert or update records is to write and execute an SQL statement by hand, but that method can be lengthy. For situations like this, Pandas provides a great method called to_sql().

In this section, we will first build a new table in the database and then edit an existing one.
Creating a New Table in the SQL Database
Before we create a new table, let's first discuss to_sql() in detail.
What is to_sql()?
The to_sql() function of the Pandas library allows us to write to or update the database. The to_sql() function can save DataFrame data to a SQL database.

Syntax for to_sql():
DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)
Only the name and con parameters are mandatory to run to_sql(); however, the other parameters provide additional flexibility and customization options. Let's discuss each parameter in detail:
- name: The name of the SQL table to be created or altered.
- con: The connection object of the database.
- schema: The schema of the table (optional).
- if_exists: The default value of this parameter is "fail". This parameter allows us to decide the action to be taken if the table already exists. Options include "fail", "replace", and "append".
- index: The index parameter accepts a boolean value. By default, it is set to True, meaning the index of the DataFrame will be written to the SQL table.
- index_label: This optional parameter allows us to specify a column label for the index columns. By default, the index is written to the table, but a specific name can be given using this parameter.
- chunksize: The number of rows to be written at a time to the SQL database.
- dtype: This parameter accepts a dictionary with column names as keys and their datatypes as values.
- method: The method parameter allows specifying the method used for inserting data into the SQL table. By default, it is set to None, which means Pandas will find the most efficient way based on the database. There are two main options for the method parameter:
  - multi: It allows inserting multiple rows in a single SQL query. However, not all databases support multi-row inserts.
  - Callable function: Here, we can write a custom insert function and pass it via the method parameter.
Here's an example using to_sql():
import pandas as pd
from sqlalchemy import create_engine
# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')
# Create a new Dataframe that will be our new table
data = {'Name': ['Paul', 'Tom', 'Jerry'], 'Age': [9, 8, 7]}
df = pd.DataFrame(data)
# Create a new table called Customer (index=False keeps the DataFrame index out of the table)
df.to_sql('Customer', con=engine, if_exists='fail', index=False)
# Close the Database connection
engine.dispose()
A new table called Customer is created in the database, with two fields called "Name" and "Age."
Database snapshot:
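If we later want to add rows to the Customer table instead of recreating it, we can call to_sql() again with if_exists='append'. A minimal sketch; the extra customer below is made up purely for illustration:

import pandas as pd
from sqlalchemy import create_engine

# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')

# Append one more row to the existing Customer table
new_row = pd.DataFrame({'Name': ['Spike'], 'Age': [6]})
new_row.to_sql('Customer', con=engine, if_exists='append', index=False)

# Close the Database connection
engine.dispose()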
Updating Existing Tables with Pandas Dataframes
Updating data in a database is a complex task, particularly when dealing with large data. However, using the to_sql() function in Pandas can make this task much easier. To update an existing table in the database, the to_sql() function can be used with the if_exists parameter set to "replace". This will overwrite the existing table with the new data.

Here is an example of to_sql() that updates the previously created Customer table. Suppose, in the Customer table, we want to update the age of the customer named Paul from 9 to 10. To do so, first, we modify the corresponding row in the DataFrame, and then use the to_sql() function to update the database.
Code:
import pandas as pd
from sqlalchemy import create_engine
# Create a connection to the SQLite database
engine = create_engine('sqlite:///C:/SQLite/student.db')
# Load Customer Table into a Dataframe
df = pd.read_sql_table('Customer', engine)
# Modify the age of the customer named Paul
df.loc[df['Name'] == 'Paul', 'Age'] = 10
# Update the Customer table with the modified DataFrame
df.to_sql('Customer', con=engine, if_exists='replace', index=False)
# Close the Database connection
engine.dispose()
In the database, Paul's age is updated:
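For larger DataFrames, the chunksize and method parameters described earlier can speed up writes by batching rows and issuing multi-row INSERT statements (where the database supports them). A rough sketch, using a made-up Orders DataFrame as a stand-in for a bigger data set:

import pandas as pd
from sqlalchemy import create_engine

# Create an engine object
engine = create_engine('sqlite:///C:/SQLite/student.db')

# A made-up DataFrame standing in for a larger data set
orders = pd.DataFrame({'orderId': range(1, 1001), 'amount': [19.99] * 1000})

# Write 200 rows per batch, using multi-row INSERT statements
orders.to_sql('Orders', con=engine, if_exists='replace', index=False,
              chunksize=200, method='multi')

# Close the Database connection
engine.dispose()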
Conclusion
In conclusion, Pandas and SQL are both powerful tools for data analysis tasks such as reading data from and writing data to SQL databases. Pandas provides an easy way to connect to a SQL database, read data from the database into a DataFrame, and write DataFrame data back to the database.
The Pandas library makes it easy to manipulate data in a dataframe, whereas SQL provides a powerful language for querying data in a database. Using both Pandas and SQL to read and write the data can save time and effort in data analysis tasks, especially when the data is very large. Overall, leveraging SQL and Pandas together can help data analysts and scientists streamline their workflow.