Module 1: Importing Data Sets

Data Science with Python - Key Libraries

Python libraries are collections of functions and methods that let you perform a wide range of actions without writing extensive code. They provide built-in modules covering many different functionalities.

Categories of Python Data Analysis Libraries:

  1. Scientific Computing Libraries
  2. Data Visualization Libraries
  3. Algorithmic Libraries


Scientific Computing Libraries

1. Pandas

2. NumPy

3. SciPy


Data Visualization Libraries

1. Matplotlib

2. Seaborn


Algorithmic Libraries

1. Scikit-learn

2. Statsmodels


Reading Data with Pandas

Data acquisition is the process of loading and reading data into a notebook from various sources. Using Python’s Pandas package, we can efficiently read and manipulate data.

Key Factors:

  1. Format: The way data is encoded (e.g., CSV, JSON, XLSX, HDF).
  2. File Path: The location of the data, either on the local computer or online.

Steps to Read Data with Pandas

1. Import Pandas

import pandas as pd

2. Define File Path

Specify the location of the data file.

file_path = 'path_to_your_file.csv'

3. Read CSV File

Use the read_csv method to load data into a DataFrame.

df = pd.read_csv(file_path)

Special Case: No Headers in CSV

If the data file does not contain headers, set header to None.

df = pd.read_csv(file_path, header=None)

4. Inspect the DataFrame

Use df.head() to view the first few rows of the DataFrame.

print(df.head())

Use df.tail() to view the last few rows.

print(df.tail())

5. Assign Column Names

If the column names are available separately, assign them to the DataFrame.
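For example, assuming the header names are stored in a Python list (the names below are purely illustrative and must match the number of columns in the DataFrame):

headers = ["column_1", "column_2", "column_3"]  # illustrative names only
df.columns = headers                             # assign the list as the column names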

Verify by using df.head() again.

print(df.head())

6. Export DataFrame to CSV

To save the DataFrame as a new CSV file, use the to_csv method.

df.to_csv('output_file.csv', index=False)

Additional Formats

Pandas supports importing and exporting various data formats. The syntax for reading and saving other formats is similar to read_csv and to_csv.
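For example, reading and writing Excel and JSON files follows the same pattern (the file names below are placeholders; reading .xlsx files additionally requires an engine such as openpyxl to be installed):

df_xlsx = pd.read_excel('data.xlsx')     # read an Excel worksheet into a DataFrame
df_json = pd.read_json('data.json')      # read a JSON file into a DataFrame

df.to_excel('output.xlsx', index=False)  # save a DataFrame as an Excel file
df.to_json('output.json')                # save a DataFrame as a JSON file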


Exploring and Understanding Data with Pandas

Exploring a dataset is crucial for data scientists to understand its structure, data types, and statistical distributions. Pandas provides several methods for these tasks.

Data Types in Pandas

Pandas stores data in several types; the most common are object (text or mixed values), int64 (integers), float64 (floating-point numbers), bool (True/False), and datetime64 (dates and times).

Checking Data Types

Use dtypes to view data types of each column:

print(df.dtypes)

Statistical Summary

Use describe to get statistical summary:

print(df.describe())

Output Example:

              0            1            2            3
count  205.000000  205.000000  205.000000  205.000000
mean    13.071707   25.317073  198.313659    3.256098
std      6.153123   26.021249   90.145293    1.125947
min      5.000000    4.000000   68.000000    2.000000
25%      9.000000    8.000000  113.000000    2.000000
50%     12.000000   19.000000  151.000000    3.000000
75%     16.000000   37.000000  248.000000    4.000000
max     35.000000  148.000000  540.000000    8.000000

To include all columns:

print(df.describe(include='all'))

Output Example:

              0           1          2          3       ...      25       26       27
count   205.000000  205.000000  205.000000  205.000000  ...     205      205      205
unique        NaN         NaN         NaN         NaN   ...     25       25       25
top           NaN         NaN         NaN         NaN   ...     value    value    value
freq          NaN         NaN         NaN         NaN   ...     10       10       10
mean     13.071707   25.317073  198.313659    3.256098  ...     NaN      NaN      NaN
std       6.153123   26.021249   90.145293    1.125947  ...     NaN      NaN      NaN
min       5.000000    4.000000   68.000000    2.000000  ...     NaN      NaN      NaN
25%       9.000000    8.000000  113.000000    2.000000  ...     NaN      NaN      NaN
50%      12.000000   19.000000  151.000000    3.000000  ...     NaN      NaN      NaN
75%      16.000000   37.000000  248.000000    4.000000  ...     NaN      NaN      NaN
max      35.000000  148.000000  540.000000    8.000000  ...     NaN      NaN      NaN

For object columns, it shows additional statistics like the number of unique values, the most frequent value (top), and its frequency (freq).

DataFrame Info

Use info for a concise summary:

df.info()

Output Example:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       205 non-null    int64
 1   1       205 non-null    int64
 2   2       205 non-null    int64
 3   3       205 non-null    int64
 4   4       205 non-null    object
 5   5       205 non-null    object
 6   6       205 non-null    object
 7   7       205 non-null    object
 8   8       205 non-null    object
 9   9       205 non-null    object
 10  10      205 non-null    object
dtypes: int64(4), object(7)
memory usage: 17.7+ KB

Accessing Databases with Python: SQL APIs and Python DB APIs

Databases are essential tools for data scientists, and Python provides powerful libraries for connecting to and interacting with databases. This module covers the basics of using Python to access databases through SQL APIs and Python DB APIs.

SQL APIs

Basic Operations of SQL API

  1. Connecting to Database:
    • Use API calls to establish a connection between the Python program and the DBMS.
  2. Executing SQL Statements:
    • Build SQL statements as text strings and pass them to the DBMS using API calls.
  3. Handling Errors and Status:
    • Use API calls to check the status of DBMS requests and handle errors during database operations.
  4. Disconnecting from Database:
    • End database access with an API call that disconnects the Python program from the database.
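A minimal sketch of these four operations, using Python's built-in sqlite3 module as the DBMS driver (the database file name and table name are placeholders):

import sqlite3

conn = sqlite3.connect('example.db')            # 1. connect to the database
try:
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM my_table')    # 2. pass an SQL statement as a text string
    rows = cursor.fetchall()
    print(rows)
except sqlite3.Error as err:                    # 3. check status and handle errors
    print('Database error:', err)
finally:
    conn.close()                                # 4. disconnect from the database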


Python DB API

Key Concepts in Python DB API

  1. Connection Objects:
    • Used to connect to a database and manage transactions.
    • Created using the connect function from the database module.
  2. Cursor Objects:
    • Used to execute queries and fetch results from the database.
    • Similar to a cursor in text processing, used to navigate through query results.

Methods with Connection Objects

  • cursor(): returns a new cursor object for running queries against the database.
  • commit(): commits (saves) the current transaction.
  • rollback(): rolls the database back to the start of the pending transaction.
  • close(): closes the database connection.

Methods with Cursor Objects

  • execute(): runs a SQL statement, optionally with parameters.
  • executemany(): runs the same SQL statement for each item in a sequence of parameters.
  • fetchone(): retrieves the next row of the query result.
  • fetchmany(): retrieves the next set of rows of the query result.
  • fetchall(): retrieves all remaining rows of the query result.
  • close(): closes the cursor and releases its resources.

Python Application Example

  1. Import Database Module:
    • Import the database module that provides the connect function.
  2. Connect to Database:
    • Call connect with the database name, username, and password to obtain a connection object.
  3. Create Cursor:
    • Create a cursor object on the connection to execute queries and fetch results.
  4. Execute Queries:
    • Use the cursor's execute() method to run queries and fetchall() to retrieve the results.
  5. Close Connection:
    • Use the close() method on the connection object to release resources once the queries are complete, as sketched in the example below.
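Putting these steps together, a minimal sketch of the pattern (dbmodule, the connection parameters, and the table name are placeholders; substitute the DB-API-compliant driver for your database):

from dbmodule import connect                    # hypothetical DB-API module

# 1-2. Import the module and connect to the database
connection = connect('databasename', 'username', 'password')

# 3. Create a cursor object on the connection
cursor = connection.cursor()

# 4. Execute a query and fetch all results
cursor.execute('SELECT * FROM mytable')
results = cursor.fetchall()

# 5. Free resources and close the connection
cursor.close()
connection.close()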

Conclusion

Understanding SQL APIs and Python DB APIs allows data scientists to effectively manage and analyze data stored in relational databases using Python. Always remember to manage connections properly to optimize resource usage.


Cheat Sheet:  Data Wrangling

Read the CSV

Read a CSV file containing a data set into a pandas data frame

df = pd.read_csv(<CSV_path>, header=None) # load without header
df = pd.read_csv(<CSV_path>, header=0)    # load using first row as header

Print first few entries

Print the first few entries (default 5) of the pandas data frame

df.head(n)  # n=number of entries; default 5

Print last few entries

Print the last few entries (default 5) of the pandas data frame

df.tail(n)  # n=number of entries; default 5

Assign header names

Assign appropriate header names to the data frame

df.columns = headers

Replace "?" with NaN

Replace the "?" entries with NaN from the NumPy library

import numpy as np            # NaN is provided by NumPy
df = df.replace("?", np.nan)

Retrieve data types

Retrieve the data types of the data frame columns

df.dtypes

Retrieve statistical description

Retrieve the statistical description of the data set. By default, only numerical data types are summarized. Use include="all" to create a summary of all variables.

df.describe()               # numerical columns only (default)
df.describe(include="all")  # summary of all columns

Retrieve data set summary

Retrieve the summary of the data set being used, from the data frame

df.info()

Save data frame to CSV

Save the processed data frame to a CSV file with a specified path

df.to_csv(<output CSV path>)