Comparing Python to SQL isn’t really a fair fight, but recently I’ve been evaluating my options for getting the best performance out of my local system resources when doing routine data engineering/ETL tasks. Usually I load my data into a local MS SQL Express database and run the native data export tasks. The performance wasn’t great, so I asked myself: “can I use Python to get a flat file faster?” At first glance it’s an odd question, but it was worth asking if it could save me time. Naturally, I ran some tests.
First, I should say that I know these tests are anecdotal. System resources and file sizes will vary, but for my purposes I’m looking at files in the few-million-row range. The other thing to consider is my hardware: I’m testing on my local PC with 20 GB of RAM and a Phenom II X4.
The next step is to simulate some data. I’ll be creating a table that is 4 million rows long and 5 columns wide, with a session ID and four randomly generated values. Code and screenshots below:
--Create Table
Create Table PythonvsSQL
(Col1 varchar(120)
, Col2 decimal(32,6)
, Col3 decimal(32,6)
, Col4 decimal(32,6)
, Col5 decimal(32,6))

--Add Records
Declare @Idx int, @SessionID varchar(120)
Set @Idx = 0
While @Idx < 4000000
Begin
    Set @SessionID = NewID()
    Insert Into PythonvsSQL
    Select @SessionID
    , (CHECKSUM(NewID())) % 14 * Rand()
    , (CHECKSUM(NewID())) % 14 * Rand()
    , (CHECKSUM(NewID())) % 14 * Rand()
    , (CHECKSUM(NewID())) % 14 * Rand()
    Set @Idx = @Idx + 1
End
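As an aside, if you’d rather simulate the same shape of data from Python instead of T-SQL, a quick sketch with numpy and pandas might look like the below. This is just an illustration; the column names match the table above, and the uniform range only loosely mimics CHECKSUM(NewID()) % 14 * Rand().

import uuid
import numpy as np
import pandas as pd

N = 4_000_000  # match the row count from the T-SQL loop

df = pd.DataFrame({
    # One unique session ID per row, mirroring NewID()
    "Col1": [str(uuid.uuid4()) for _ in range(N)],
    # Roughly the same spread as CHECKSUM(NewID()) % 14 * Rand()
    "Col2": np.random.uniform(-13, 13, N),
    "Col3": np.random.uniform(-13, 13, N),
    "Col4": np.random.uniform(-13, 13, N),
    "Col5": np.random.uniform(-13, 13, N),
})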
The table results look like this:
Now let’s test how long it takes to export using the native MS SQL option. I pulled out my stopwatch and ran the “export” flat file task.
The 4-million-row export ran in 25.5 seconds.
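As an aside, the native engine can also be driven without the wizard via the bcp bulk-copy utility. I haven’t timed it here, but a rough sketch (the output path and instance name are placeholders), shelled out from Python for consistency:

import subprocess

# Call SQL Server's bcp utility: character mode (-c),
# pipe-delimited fields (-t), trusted Windows auth (-T)
subprocess.run([
    "bcp", "Select * from BIBlog.dbo.PythonvsSQL", "queryout",
    r"C:\Temp\File.txt",    # placeholder output path
    "-c",
    "-t", "|",
    "-S", r".\SQLEXPRESS",  # placeholder instance name
    "-T",
], check=True)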
Next, let’s test Python using the pyodbc library and the code below. Image below for reference, too.
import pyodbc
import pandas as pd

conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=*******\SQLEXPRESS;'
                      'Database=BIBlog;'
                      'Trusted_Connection=yes;')

sqldata = "Select * from PythonvsSQL"

# Read the full result set into a DataFrame, then write it out pipe-delimited
pd.read_sql(sqldata, conn).to_csv(r"C:\Users\******\Desktop\File.txt",
                                  header=True, index=False, sep="|")
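Since I timed with a stopwatch, it’s worth noting Python can keep time for you too. A minimal sketch wrapping the same export call in time.perf_counter (just an option, not how I measured above):

import time

start = time.perf_counter()
pd.read_sql(sqldata, conn).to_csv(r"C:\Users\******\Desktop\File.txt",
                                  header=True, index=False, sep="|")
print(f"Export took {time.perf_counter() - start:.1f} seconds")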
Running this flat file extract took longer than I thought: the file was produced in 2 minutes, 8.6 seconds. Yikes.
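My best guess at the overhead: pandas pulls the entire 4-million-row result set into a DataFrame before a single byte hits disk. One alternative I haven’t benchmarked (treat it as a sketch; File_stream.txt is just a placeholder name) is to stream rows from the pyodbc cursor straight to csv.writer in chunks:

import csv
import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=*******\SQLEXPRESS;'
                      'Database=BIBlog;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.execute("Select * from PythonvsSQL")

with open(r"C:\Users\******\Desktop\File_stream.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    # Header row comes from the cursor metadata
    writer.writerow([col[0] for col in cursor.description])
    # Fetch in chunks so all 4 million rows never sit in memory at once
    while True:
        rows = cursor.fetchmany(50_000)
        if not rows:
            break
        writer.writerows(rows)

If anyone has timed this against the pandas route, I’d be curious how it compares.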
I didn’t see the results I was hoping for this time, but it was a fun exercise. Hope you had fun too 😁. What are your thoughts? Do you have a faster way to export flat files? I’d love to hear about it in the comments below.
Thanks for checking out this blog post. If you found this post helpful, please consider donating via PayPal. Any contribution is appreciated!