There are relationships between them. This is an export from a previous CRM. Three of the sets are too big to open in Excel, so I couldn't do a lookup to create the proper values and clean up the data that way.
There are a few columns that I'd like to merge into just one column based on an if statement (if the legacy user created_by exists, use that as created_by; if it doesn't, use the created_by user). Same with created time.
I've messed around using bash, python, and postgres, but it's been a hassle. Whenever I pull new data I end up deleting and purging all the info and rerunning a bunch of segmented scripts. (I spun up a rails app connected to my new db that displayed the data and could do basic search/pagination) -- but here's the thing: I realized I'm wasting my time building that out. It was pulling in all the data, which has so much excess junk, and the ID structure is terrible. I want to keep the associations but use a new primary key based on another column. I'd rather just start with a reasonable, simple database that has the data properly set up so I can plug it into anything (or do simple sql queries/exports!)
Not sure if anyone has dealt with something similar and could share what they did, but I'd appreciate any guidance. I've looked into pgfutter, http://easymorph.com/, http://tadviewer.com/, and csvkit. I'm open to paying someone for their time if they could do this, since I'm at my wit's end. (It didn't take super long to do things in bash/python, but I'm not an expert, so I was hacking my way through it.) Truth be told, the data isn't complex, but doing the few associations and setting up new IDs is foreign to me. No pun intended.
Loading the data is simply:
df = pd.read_csv(...)
http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-...
where pd is an alias for pandas (import pandas as pd), df is the variable in which the data frame is stored, and the ... is the set of parameters required to load and parse your csv (see the docs)
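In practice the call might look something like this (untested sketch; the file name is just a placeholder, and reading everything as strings avoids type-inference surprises with IDs):

    import pandas as pd

    # Placeholder file name; dtype=str keeps IDs as text,
    # keep_default_na=False leaves missing cells as '' instead of NaN
    df = pd.read_csv('legacy_contacts.csv', dtype=str, keep_default_na=False)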
Then, for example, if you need to drop all rows where a certain column contains an empty string, something like:
df_without_empty_my_columns = df[df['my_column'] != '']
http://pandas.pydata.org/pandas-docs/stable/comparison_with_...
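And for the created_by merge you described, a rough, untested sketch (column names are guesses at your export):

    # Keep only rows that have some creator at all
    has_creator = df[(df['legacy_created_by'] != '') | (df['created_by'] != '')]

    # Coalesce: prefer the legacy value, fall back to the regular one
    df['created_by_final'] = df['legacy_created_by'].where(
        df['legacy_created_by'] != '', df['created_by']
    )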
Finally, as an example, to export the refined data frame to some sql database supported by sqlalchemy:
refined_df.to_sql(...)
http://pandas.pydata.org/pandas-docs/stable/generated/pandas...
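With a sqlalchemy engine that would look roughly like this (connection string and table name are placeholders):

    from sqlalchemy import create_engine

    # Placeholder connection string -- point it at your actual Postgres db
    engine = create_engine('postgresql://user:password@localhost/crm_clean')
    refined_df.to_sql('contacts', engine, if_exists='replace', index=False)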
Background: My team is currently doing ETL through pandas for a data analytics platform and we kinda love it.
Check out http://pandas.pydata.org/pandas-docs/stable/ for the entire set of docs if you are interested.
Have fun!
Edit: just saw your other post. You shouldn't have any problem using plain pandas with a 750 MB dataset on any modern computer with a few GB of memory.
Disclaimer: Contributed performance improvements to Pandas around two years ago.
Get familiar with the COPY command (https://www.postgresql.org/docs/current/static/sql-copy.html) for importing your data as it will save you a ton of time compared to doing single or even bulk inserts from a Python script. You can use COPY to either import directly from a CSV file, or you can use it to load Python tuples in memory w/ the Psycopg2 client and copy_expert(). For example:
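Something like this, roughly (untested sketch; table and file names are placeholders):

    import psycopg2

    conn = psycopg2.connect('dbname=crm_clean')  # placeholder connection params
    with conn, conn.cursor() as cur, open('legacy_contacts.csv') as f:
        # Stream the CSV straight into a staging table via COPY
        cur.copy_expert(
            "COPY contacts_staging FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )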
A few more tips:
- For data too big for memory, transform it to a secondary CSV file, then feed a file handle to copy_expert().
- Import your data into tables with no indexes, and create the indexes after the import.
- In your postgresql.conf, consider tuning options that help w/ write throughput.
Once your initial data is loaded into PG, consider doing some transformations on your data with materialized views (https://www.postgresql.org/docs/10/static/sql-creatematerial...) as opposed to doing the transforms with a Python script. PG will be able to create the views much faster than you can import data into a new table.
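For instance, a materialized view that coalesces the legacy columns might look like this (table and column names are guesses at your schema; staging columns assumed to be text):

    import psycopg2

    conn = psycopg2.connect('dbname=crm_clean')  # placeholder connection params
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE MATERIALIZED VIEW contacts_clean AS
            SELECT
                COALESCE(NULLIF(legacy_created_by, ''), created_by) AS created_by,
                COALESCE(NULLIF(legacy_created_at, ''), created_at) AS created_at
            FROM contacts_staging
        """)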
http://search.cpan.org/search?query=csv&mode=all
I've done a bit of work with some of these modules. If you still need some help, feel free to contact me...
I would consider going one of two ways:
A: Do all of the data transformation in Postgres. Load the CSVs into a staging table using COPY, then transform and load the data into a target table (rough sketch below).
B: Do the data transformation in Python and just load it into a target table in Postgres. I have had some pretty good success using pandas and DataFrame.to_sql to do this sort of thing with files >1 MB but <1 GB.
It basically comes down to whether you are more comfortable manipulating the data in SQL or Python
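If you go with A, the second half might look something like this (untested sketch; the schema is only a guess -- the SERIAL column gives you fresh primary keys while the old identifier is kept so you can rebuild associations):

    import psycopg2

    conn = psycopg2.connect('dbname=crm_clean')  # placeholder connection params
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE contacts (
                id          SERIAL PRIMARY KEY,  -- new, clean primary key
                legacy_id   text UNIQUE,         -- old ID, kept to re-link associations
                created_by  text,
                created_at  text
            )
        """)
        cur.execute("""
            INSERT INTO contacts (legacy_id, created_by, created_at)
            SELECT id, created_by, created_at
            FROM contacts_staging
        """)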
Had rails grab it and manually built out the models and controllers to handle what I need. Working great!
Now onto having fun with search, sort, filter. ;)
Load what you can, spew the rubbish into an error log and parse separately?
https://stackoverflow.com/questions/22537852/creating-a-post...
Once in your database, establish queries to populate your new tables.
You can go with pandas (python) to programmatically read the csv.
Advice: avoid CSVs for huge amounts of data.
1) Create a new schema reflecting the data normalization you want, meaning creating all the tables, foreign key relationships, etc.
2) Use a programming language of your choice (my quiver would be rails), just for the ORM, to read in the csvs and then write the data out to the tables.
sqlite might actually work fairly well.
I would love to help you out, as this is what I do on a daily basis. Maybe I am misunderstanding the issue, but couldn't you just load all the CSVs and then build queries according to your preferred schema?
Altogether the CSVs totaled 750 MB.