Omar Vidaure

Jun 23, 2021 · 2 min

Python VS MS SQL: Race to the Flat File!

Comparing Python to SQL isn’t really a fair fight, but recently I’ve been evaluating my options for getting the best performance out of my local system resources when doing routine data engineering/ETL tasks. In the end, I usually load my data into a local MS SQL Express database and run its native data export tasks. The performance wasn’t great, so I asked myself: “Can I use Python to get a flat file faster?” At first glance it’s an odd question, but it was worth asking if it could save me time. Naturally, I ran some tests.

First, I should say that I know these tests are anecdotal. System resources and file sizes will vary, but for my purposes I’m looking at files of generally a few million rows. The next thing to consider is my system: I’m testing on my local PC with 20 GB of RAM and a Phenom II X4.

The next step is to simulate some data. I’ll create a table that is 4 million rows long and 5 columns wide, with a session ID and four randomly generated values. Code below:

--Create Table
Create table PythonvsSQL
(Col1 varchar(120)
, Col2 decimal(32,6)
, Col3 decimal(32,6)
, Col4 decimal(32,6)
, Col5 decimal(32,6))

--Add Records
Declare @Idx int, @SessionID varchar(120)
Set @Idx = 0
While @Idx < 4000000
Begin
    Set @SessionID = NewID()
    Insert Into PythonvsSQL
    Select @SessionID
    , (CHECKSUM(NewID())) % 14 * Rand()
    , (CHECKSUM(NewID())) % 14 * Rand()
    , (CHECKSUM(NewID())) % 14 * Rand()
    , (CHECKSUM(NewID())) % 14 * Rand()
    Set @Idx = @Idx + 1
End
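For reference, the same shape of test data can be simulated on the Python side too. This is just a sketch of mine, not part of the timing test: the function name is my own, and the uniform range is only a rough stand-in for the CHECKSUM(NewID()) % 14 * Rand() trick above.

```python
import uuid

import numpy as np
import pandas as pd


def simulate_rows(n, seed=None):
    """Simulate the PythonvsSQL table: one session-ID string column
    plus four random decimal columns roughly in the (-14, 14) range."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Col1": [str(uuid.uuid4()) for _ in range(n)],  # NewID() per row
        "Col2": rng.uniform(-14, 14, n),
        "Col3": rng.uniform(-14, 14, n),
        "Col4": rng.uniform(-14, 14, n),
        "Col5": rng.uniform(-14, 14, n),
    })
```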

The resulting table holds a session ID in Col1 and a random decimal value in each of Col2 through Col5.

Now let’s test how long it takes with the native MS SQL export option. I pulled out my stopwatch and used the “export” flat file task.

The 4 million row export ran in 25.5 seconds

Next, let’s test Python using the pyodbc library and the code below.

import pyodbc
import pandas as pd

conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=*******\SQLEXPRESS;'
                      'Database=BIBlog;'
                      'Trusted_Connection=yes;')

sqldata = "Select * from PythonvsSQL"

pd.read_sql(sqldata, conn).to_csv(r"C:\Users\******\Desktop\File.txt",
                                  header=True, index=False, sep="|", mode="a")

Running this flat file extract took longer than I thought: the file was produced in 2 minutes, 8.6 seconds. Yikes.
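One guess at the bottleneck: pd.read_sql materializes all 4 million rows into a DataFrame before to_csv writes anything. A streaming approach with the csv module skips the DataFrame entirely. A minimal sketch (stream_to_flat_file is my own helper name, and I haven’t timed it against the test above):

```python
import csv


def stream_to_flat_file(rows, path, columns, sep="|"):
    """Write rows to a delimited flat file without first building a
    DataFrame. `rows` can be any iterable of tuples, including a
    pyodbc cursor that has already executed a SELECT."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=sep)
        writer.writerow(columns)  # header row
        writer.writerows(rows)    # streams row by row
```

Hypothetical usage against the table above would be `cursor.execute("Select * from PythonvsSQL")` followed by `stream_to_flat_file(cursor, "File.txt", ["Col1", "Col2", "Col3", "Col4", "Col5"])`.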

I didn’t see the results I was hoping for this time, but it was a fun exercise. I hope you had fun too 😁. What are your thoughts? Do you have a faster way to export flat files? I’d love to hear about it in the comments below.


Thanks for checking out this blog post. If you found this post helpful please consider donating via PayPal. Any contribution is appreciated!
