Tags: Data Cleaning, SQL
Domain: Transportation
Tools: PostgreSQL, DBeaver
Goal: Ensure the dataset is well-structured and consistent, reducing the need for additional cleaning during the analysis phase.
Data quality is crucial, especially when data feeds analysis and reporting. This case study explores the process of cleaning and standardizing data using PostgreSQL. The dataset consists of 5,000 rows of New York City parking violation records, including vehicle details, the time and location of each infraction, and associated metadata. The data originates from a transactional system and contains inconsistencies, missing values, and formatting issues that must be addressed before further analysis.
➜ CTEs and ROW_NUMBER() OVER (PARTITION BY ...) to identify and eliminate duplicate records.
➜ CASE WHEN and DIFFERENCE() (from the fuzzystrmatch extension) to standardize data and correct values by phonetic similarity (Soundex).
➜ COALESCE() to handle missing values.
➜ TRIM() and regular expressions (using SIMILAR TO and ~) to identify and update invalid values.
➜ NULLIF(), LEFT(), SUBSTR(), TO_TIMESTAMP() for data transformation.
➜ UPDATE, DELETE, ALTER TABLE to apply changes to the table.
Initial Setup
Created a staging table as a working copy of the original table, keeping the raw data intact.
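A minimal sketch of this step; both table names are assumptions, as the original project may name them differently:

```sql
-- Work on a copy so the raw import stays untouched.
-- Table names are illustrative assumptions.
DROP TABLE IF EXISTS parking_violations_staging;

CREATE TABLE parking_violations_staging AS
SELECT *
FROM parking_violations;
```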
Handling Missing and Invalid Values
Converted empty strings and certain invalid values to NULL.
Replaced certain NULL values with a placeholder.
Removed invalid characters and trimmed whitespace.
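A hedged sketch of these three operations; the column names (vehicle_color, street_name) and the 'Unknown' placeholder are illustrative assumptions, not necessarily the project's actual choices:

```sql
-- Turn empty or whitespace-only strings into NULL.
UPDATE parking_violations_staging
SET vehicle_color = NULLIF(TRIM(vehicle_color), '');

-- Fill remaining NULLs with an explicit placeholder.
UPDATE parking_violations_staging
SET vehicle_color = COALESCE(vehicle_color, 'Unknown');

-- Strip characters that are not letters, digits, or spaces, then trim whitespace.
UPDATE parking_violations_staging
SET street_name = TRIM(REGEXP_REPLACE(street_name, '[^A-Za-z0-9 ]', '', 'g'));
```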
Removing Duplicates
Identified and deleted exact duplicate rows to ensure data integrity.
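One way to implement this in PostgreSQL, assuming the columns in the PARTITION BY clause define what counts as a duplicate (the actual key columns may differ):

```sql
-- Rank identical rows; everything past the first occurrence is a duplicate.
-- ctid (Postgres's physical row locator) stands in for a missing primary key.
-- No ORDER BY is needed: the rows are exact copies, so any one can be kept.
WITH ranked AS (
    SELECT ctid,
           ROW_NUMBER() OVER (
               PARTITION BY plate_id, issue_date, violation_time, street_name
           ) AS rn
    FROM parking_violations_staging
)
DELETE FROM parking_violations_staging
WHERE ctid IN (SELECT ctid FROM ranked WHERE rn > 1);
```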
Standardizing Text Fields
Corrected common typos, misspellings, and non-standard abbreviations.
Corrected values based on phonetic similarity; ambiguous values were left unchanged.
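A sketch of both techniques; the vehicle makes and mappings are illustrative assumptions:

```sql
-- fuzzystrmatch provides SOUNDEX() and DIFFERENCE()
-- (0-4 scale; 4 = identical Soundex codes).
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Map known typos and abbreviations explicitly.
UPDATE parking_violations_staging
SET vehicle_make = CASE vehicle_make
                       WHEN 'TOYOT' THEN 'TOYOTA'
                       WHEN 'CHEVR' THEN 'CHEVROLET'
                       ELSE vehicle_make
                   END;

-- Phonetic correction: only rows whose Soundex code fully matches are updated,
-- so ambiguous values (lower DIFFERENCE scores) stay unchanged.
UPDATE parking_violations_staging
SET vehicle_make = 'HONDA'
WHERE DIFFERENCE(vehicle_make, 'HONDA') = 4
  AND vehicle_make <> 'HONDA';
```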
Formatting Dates and Times
Corrected invalid time values (wrong formats and other inconsistencies).
Converted date and time fields into standardized formats to facilitate time-based analysis.
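A possible implementation, assuming times are stored as text like '0430A' and dates as 'MM/DD/YYYY'; the raw formats in the project may differ:

```sql
-- Normalize times such as '0430A' to '04:30' by appending 'M'
-- and parsing the value as a 12-hour time.
UPDATE parking_violations_staging
SET violation_time = TO_CHAR(
        TO_TIMESTAMP(violation_time || 'M', 'HH12MIAM'),
        'HH24:MI')
WHERE violation_time ~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$';

-- Convert the issue date from text to a proper DATE column
-- to support time-based analysis.
ALTER TABLE parking_violations_staging
    ALTER COLUMN issue_date TYPE date
    USING TO_DATE(issue_date, 'MM/DD/YYYY');
```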
Ensuring Data Consistency
Checked vehicle attributes for inconsistencies and aligned conflicting values.
Removed non-standard characters from key fields.
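An illustrative audit query plus cleanup; the column names are assumptions:

```sql
-- Audit: surface rare registration_state / plate_type combinations
-- so inconsistent pairs can be reviewed and aligned.
SELECT registration_state, plate_type, COUNT(*) AS n
FROM parking_violations_staging
GROUP BY registration_state, plate_type
ORDER BY n;

-- Normalize a key field: uppercase it and strip non-alphanumeric characters.
UPDATE parking_violations_staging
SET plate_id = UPPER(REGEXP_REPLACE(plate_id, '[^A-Za-z0-9]', '', 'g'));
```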
After applying the SQL cleaning steps, the dataset became structured and standardized, making it significantly easier to work with for reporting. While some minor adjustments may still be needed for specific analyses, analysts can focus on generating insights rather than extensive data cleaning. The cleaned data will support reports on violation frequency, trends by location, and correlations with vehicle attributes.
By proactively addressing data quality issues at this stage, the project achieves its primary goal: reducing future cleaning efforts and improving report accuracy.
To see the code, please visit my GitHub.