Your Web News in One Place

Help Webnuz

Referal links:

Sign up for GreenGeeks web hosting
January 12, 2023 11:20 am GMT

Understanding the basics and advanced concepts of RDBMS and SQL for data science and analysis

Day 7 of 100 Days Data Science Bootcamp from noob to expert.

GitHub link: Complete-Data-Science-Bootcamp

Main Post: Complete-Data-Science-Bootcamp

Recap Day 6

Yesterday we have studied in detail about Data Visualization using matplotlib and seaborn in Python.

Let's Start

A DBMS (Database Management System) is a software application that interacts with end users, other applications, and the database itself to capture and analyze the data. It is a tool that allows data scientists to store, manage, and retrieve large amounts of structured and unstructured data.

There are different types of DBMSs, such as relational databases, NoSQL databases, and graph databases.

  • Relational databases, such as MySQL and PostgreSQL, use a structured query language (SQL) to manage data stored in tables with rows and columns. These databases are well-suited for data that can be easily organized into a tabular format and are useful for data warehousing, business intelligence, and other reporting applications.

  • NoSQL databases, such as MongoDB and Cassandra, are designed to handle large amounts of unstructured data and support different data models, such as document, key-value, and graph. These databases are useful for big data and real-time applications, such as real-time analytics, social media, and gaming.

  • Graph databases, such as Neo4j and Amazon Neptune, are used to store and query data in a graph format, where nodes and edges represent entities and relationships. These databases are useful for applications that involve complex relationships, such as fraud detection, recommendation systems, and social network analysis.

In data science, the choice of DBMS depends on the type and volume of data, the complexity of the relationships among data, and the performance and scalability requirements of the application.

In this article we will study in detail Relational databases

Basic Understanding

A relational database is a type of database that organizes data into tables, with each table consisting of rows and columns.

  • Tables: A table is a collection of related data that is organized into rows and columns. Each table has a unique name that identifies it within the database. For example, a table named "customers" might contain information about all the customers of a company, with each row representing a different customer and each column representing a different piece of information about that customer, such as name, address, and phone number.

  • Columns: A column is a vertical set of values in a table and it has a name and a data type. Each column represents a specific attribute of the data in the table, such as "customer_name" or "customer_address". For example, in a table named "customers", the columns could be "customer_id", "customer_name", "customer_address", "customer_phone_number", "customer_email" etc.

  • Rows: A row is a horizontal set of values in a table and it represents a single record or tuple in the table. Each row represents a unique instance of the data in the table, such as a specific customer. For example, in a table named "customers", a row could represent a single customer with values such as "1" for the "customer_id", "John Doe" for the "customer_name", "123 Main St" for the "customer_address", "555-555-5555" for the "customer_phone_number" and "[email protected]" for the "customer_email"

  • Primary keys: A primary key is a column or set of columns in a table that uniquely identify each row in the table. Each table can have only one primary key, and it is used to enforce the integrity of the data and create relationships with other tables. For example, in a table named "customers" the "customer_id" column might be the primary key, because it is unique for each customer and can be used to identify a specific customer in the table.

  • Foreign keys: A foreign key is a column or set of columns in a table that refers to the primary key of another table. It is used to create relationships between tables and enforce referential integrity. For example, in a table named "orders" that contains information about customer orders, the "customer_id" column might be a foreign key that refers to the primary key "customer_id" in the "customers" table.

  • Indexing: Indexing is the process of creating a separate data structure that allows for faster searching and sorting of data in a table. Indexes can be created on one or more columns in a table to improve the performance of queries that search for specific data. For example, if you frequently search for customers by their last name, you might create an index on the "customer_name" column in the "customers" table to speed up those searches.

Here is an example of a "customers" table:

customer_idcustomer_namecustomer_addresscustomer_phone_numbercustomer_email
1John Doe123 Main St555-555-5555[email protected]
2Jane Smith456 Park Ave555-555-5556[email protected]
3Bob Johnson789 Elm St555-555-5557[email protected]

In this example table, the "customer_id" column is the primary key and it is used to uniquely identify each customer. The other columns such as "customer_name", "customer_address", "customer_phone_number" and "customer_email" are non-key attributes and they provide additional information about each customer.

An index can be created on any column to speed up the query performance. In this example, an index can be created on the "customer_name" column if we frequently need to search customer by their name.

It is important to note that the example is a very simple representation of the table and in real world scenario, tables can have multiple columns as primary key, multiple foreign keys and a complex data model with multiple tables, but the basic concept remains the same.

SQL Quries

We will execute all our queries using Python and SQLite

Connecting to a database

To start interacting with the database we first we need to establish a connection.

#!/usr/bin/pythonimport sqlite3conn = sqlite3.connect('100daysofdatascience.db')print("Opened database successfully");

Opened database successfully

Sample data

# Customers table:conn.execute('''CREATE TABLE customers (                        id INT PRIMARY KEY,                        first_name VARCHAR(255),                        last_name VARCHAR(255),                        city VARCHAR(255)                    );''')conn.execute('''INSERT INTO customers (id, first_name, last_name, city)                             VALUES (1, 'John', 'Smith', 'Paris'),                                   (2, 'Mary', 'Johnson', 'London'),                                   (3, 'Michael', 'Williams', 'Berlin'),                                   (4, 'Brad', 'Brown', 'Rome');''')print("Customer table created successfully!")

Customer table created successfully!

# Products table:conn.execute('''CREATE TABLE products (                        id INT PRIMARY KEY,                        product_name VARCHAR(255),                        category VARCHAR(255),                        price DECIMAL(10,2),                        in_stock BOOLEAN                    );''')conn.execute('''INSERT INTO products (id, product_name, category, price, in_stock)                        VALUES (1, 'MacBook Pro', 'electronics', 1500, true),                               (2, 'iPhone', 'electronics', 1000, true),                               (3, 'T-Shirt', 'clothing', 20, true),                               (4, 'Jeans', 'clothing', 50, false);''')print("Products table created successfully!")

Products table created successfully!

# Orders table:conn.execute('''CREATE TABLE orders (                        id INT PRIMARY KEY,                        customer_id INT,                        order_date DATE,                        total DECIMAL(10,2)                    );''')conn.execute('''INSERT INTO orders (id, customer_id, order_date, total)                        VALUES (1, 1, '2021-01-01', 100),                               (2, 2, '2021-01-02', 200),                               (3, 3, '2021-01-03', 150),                               (4, 2, '2021-01-04', 75);''')print("Orders table created successfully!")

Orders table created successfully!

# Create the employee tableconn.execute('''CREATE TABLE employees                (id INT PRIMARY KEY NOT NULL,                name TEXT NOT NULL,                salary REAL);''')# Insert data into the employee tableconn.execute("INSERT INTO employees (id, name, salary) VALUES (1, 'John Smith', 50000)")conn.execute("INSERT INTO employees (id, name, salary) VALUES (2, 'Mary Johnson', 55000)")conn.execute("INSERT INTO employees (id, name, salary) VALUES (3, 'Michael Williams', 60000)")conn.execute("INSERT INTO employees (id, name, salary) VALUES (4, 'Brad Brown', 65000)")print("Employees table created successfully!")

Employees table created successfully!

conn.commit()

Here are some examples of basic-to-advanced SQL queries that a data scientist might use to retrieve and manipulate data:

SELECT:

The SELECT statement is used to retrieve data from one or more tables. For example, the following query retrieves all columns from a table called "customers":

import pandas as pddata = pd.read_sql_query("SELECT * FROM customers", conn)data
idfirst_namelast_namecity
01JohnSmithParis
12MaryJohnsonLondon
23MichaelWilliamsBerlin
34BradBrownRome

WHERE:

The WHERE clause is used to filter the data returned by a SELECT statement. For example, the following query retrieves all columns from the "customers" table for customers whose city is "New York":

data = pd.read_sql_query("SELECT * FROM customers WHERE city = 'London'", conn)data
idfirst_namelast_namecity
02MaryJohnsonLondon

JOIN:

The JOIN clause is used to combine rows from two or more tables based on a related column between them. For example, the following query retrieves all columns from the "orders" table and the "customers" table, where the customer_id in the "orders" table matches the id in the "customers" table:

data = pd.read_sql_query("SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id", conn)data
idcustomer_idorder_datetotalidfirst_namelast_namecity
0112021-01-011001JohnSmithParis
1222021-01-022002MaryJohnsonLondon
2332021-01-031503MichaelWilliamsBerlin
3422021-01-04752MaryJohnsonLondon

UPDATE:

The UPDATE statement is used to modify data in a table. For example, the following query updates the salary of all employees in the "employees" table by 10%:

conn.execute('''UPDATE employees SET salary = salary * 10;''')conn.commit()
data = pd.read_sql_query("SELECT * FROM employees", conn)data
idnamesalary
01John Smith500000.0
12Mary Johnson550000.0
23Michael Williams600000.0
34Brad Brown650000.0

DELETE:

The DELETE statement is used to remove data from a table. For example, the following query deletes all rows from the "employees" table where the salary is less than $50,000:

conn.execute('''DELETE FROM employees WHERE salary < 550000;''')conn.commit()data = pd.read_sql_query("SELECT * FROM employees", conn)data
idnamesalary
02Mary Johnson550000.0
13Michael Williams600000.0
24Brad Brown650000.0

GROUP BY:

The GROUP BY clause is used to group rows in a SELECT statement based on one or more columns. For example, the following query retrieves the total in_stock for each product category:

data = pd.read_sql_query("SELECT category, SUM(in_stock) FROM products GROUP BY category;", conn)data
categorySUM(in_stock)
0clothing1
1electronics2

ORDER BY:

The ORDER BY clause is used to sort the data returned by a SELECT statement. For example, the following query retrieves all columns from the "customers" table, sorted by last name in ascending order:

data = pd.read_sql_query("SELECT * FROM customers ORDER BY last_name;", conn)data
idfirst_namelast_namecity
04BradBrownRome
12MaryJohnsonLondon
21JohnSmithParis
33MichaelWilliamsBerlin

LIMIT:

The LIMIT clause is used to limit the number of rows returned by a SELECT statement. For example, the following query retrieves the top 2 products by in_stocks:

data = pd.read_sql_query("SELECT * FROM products ORDER BY in_stock DESC LIMIT 2;", conn)data
idproduct_namecategorypricein_stock
01MacBook Proelectronics15001
12iPhoneelectronics10001

LIKE:

The LIKE operator is used to search for a specific pattern in a column. For example, the following query retrieves all customers whose last name starts with "S":

data = pd.read_sql_query("SELECT * FROM customers WHERE last_name LIKE 'S%';", conn)data
idfirst_namelast_namecity
01JohnSmithParis

INNER JOIN:

The INNER JOIN keyword is used to combine rows from two or more tables based on a related column between them. This will only return rows when there is at least one match in both tables. For example:

data = pd.read_sql_query('''SELECT orders.id, customers.first_name                            FROM orders                            INNER JOIN customers ON orders.customer_id = customers.id;''', conn)data
idfirst_name
01John
12Mary
23Michael
34Mary

OUTER JOIN:

The OUTER JOIN keyword is used to combine rows from two or more tables based on a related column between them. This will return all rows from one table and the matching rows from the other table. If there is no match, NULL values will be returned. For example:

data = pd.read_sql_query('''SELECT orders.id, customers.first_name                            FROM orders                            LEFT OUTER JOIN customers ON orders.customer_id = customers.id;                            ''', conn)data
idfirst_name
01John
12Mary
23Michael
34Mary

UNION:

The UNION operator is used to combine the result-set of two or more SELECT statements. The UNION operator selects only distinct values by default. For example:

data = pd.read_sql_query('''SELECT * FROM orders UNION SELECT * FROM customers;''', conn)data
idcustomer_idorder_datetotal
0112021-01-01100
11JohnSmithParis
2222021-01-02200
32MaryJohnsonLondon
4332021-01-03150
53MichaelWilliamsBerlin
6422021-01-0475
74BradBrownRome

INDEX:

Indexes are used to improve the performance of a database by allowing the database to find and retrieve specific rows much faster. For example, the following command creates an index on the "last_name" column in the "customers" table:

conn.execute('''CREATE INDEX last_name_indexes ON customers (last_name);''')conn.commit()

Date and Time Functions:

SQL provides a number of functions to work with date and time data types. Some examples include:

  • CURDATE() - returns the current date

  • NOW() - returns the current date and time

  • YEAR() - returns the year of a given date

  • MONTH() - returns the month of a given date

  • DAY() - returns the day of a given date

As we are using SQLite. So, SQLite provides several built-in functions for working with date and time. Here is a list of some of the most commonly used date and time functions in SQLite:

  • date(timestring, modifier, modifier, ...): This function returns the date part of a date-time string.

  • time(timestring, modifier, modifier, ...): This function returns the time part of a date-time string.

  • datetime(timestring, modifier, modifier, ...): This function returns the date and time parts of a date-time string.

  • julianday(timestring, modifier, modifier, ...): This function returns the Julian day - the number of days since noon in Greenwich on November 24, 4714 B.C.

  • strftime(format, timestring, modifier, ...): This function returns a string representation of the date and time based on the specified format. The format string can contain various placeholders for different parts of the date and time, such as %Y for the year, %m for the month, %d for the day, %H for the hour, %M for the minute, and %S for the second.

  • date(timestring, '+' or '-', number, 'days' or 'months' or 'years'): This function allows you to add or subtract days, months, or years from a date.

  • current_date, current_time, current_timestamp: These functions return the current date, time, and timestamp respectively.

  • year(timestring), month(timestring), day(timestring), hour(timestring), minute(timestring), second(timestring): These functions return the year, month, day, hour, minute, and second of the given timestring respectively.

For example, the following query retrieves the total sales for each month of the current year:

data = pd.read_sql_query('''SELECT strftime('%m', order_date) as month, SUM(total) as totals                                FROM orders                                WHERE strftime('%Y', order_date) = strftime('%Y', 'now')                                GROUP BY month;''', conn)

Handling NULL values:

SQL provides several operators and functions to handle NULL values. Some examples include:

  • IS NULL - used to check for NULL values

  • IS NOT NULL - used to check for non-NULL values

  • COALESCE() - returns the first non-NULL value in a list of expressions

  • NULLIF() - returns NULL if two expressions are equal, otherwise it returns the first expression

For example, the following query retrieves all customers who have not placed any orders:

data = pd.read_sql_query('''SELECT * FROM customers                                WHERE id NOT IN (SELECT customer_id FROM orders)                                ''', conn)data
idfirst_namelast_namecity
04BradBrownRome

Other data types:

SQL also supports other data types such as BLOB (binary large object) for storing binary data, BOOLEAN for storing true/false values, and ENUM for storing predefined set of strings.

For example, the following query retrieves all products with image and only if the products are in stock:

data = pd.read_sql_query('''SELECT product_name, in_stock FROM products                                    WHERE product_name IS NOT NULL AND in_stock = TRUE;                                    ''', conn)data
product_namein_stock
0MacBook Pro1
1iPhone1
2T-Shirt1

Summary

This article provides a comprehensive guide to relational database management systems (RDBMS) and SQL (Structured Query Language) for data science and analysis. It covers the basics and advanced concepts of RDBMS and SQL, including data types, queries, and advanced concepts such as joins and subqueries. The article also provides examples and code snippets to help illustrate the concepts and provide a hands-on learning experience. The main goal of this article is to help data scientists, analysts, and developers understand the importance of RDBMS and SQL in managing and analyzing large datasets.

Exercise Question you will find in the exercise notebook of Day 6 on GitHub.

If you liked it then...

Buy Me A Coffee


Original Link: https://dev.to/anurag629/understanding-the-basics-and-advanced-concepts-of-rdbms-and-sql-for-data-science-and-analysis-5h4l

Share this article:    Share on Facebook
View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To