Ask HN: Need help creating a database from five sets of CSVs

I have 5 sets of CSVs (multiple CSVs in each set, but the columns within a set are the same). Only 2 of the sets have more than 2-3 required columns; the rest can be dropped.

There are relationships between them. This is an export from a previous CRM. 3 of the sets are too big to open in Excel so I couldn't do a lookup to create the proper values and clean up the data this way.

There are a few columns that I'd like to morph into just one column based on an if statement (if the legacy user created_by exists, use that as created_by; if it doesn't, use the created_by user). Same with created time.

I've messed around using bash, python, and postgres but it's been a hassle. When I pull new data I end up deleting and purging all the info and redoing a bunch of segmented scripts. (I spun up a rails app connected to my new db that displayed the data and could do basic search/pagination) -- but here's the thing: I realized I'm wasting my time building that out. It was pulling in all the data, which has so much excess junk, and the ID structure is terrible. I want to keep the associations but use a new primary key based on another column. I'd rather just start with a really reasonable, simple database that has the data properly set up so I can plug it into anything (or do simple sql queries/exports!)

Not sure if anyone has had to deal with something similar and could share what they did, but I'd appreciate any guidance. I've looked into pgfutter, http://easymorph.com/, http://tadviewer.com/, and csvkit. I'm open to paying someone for their time if they could do this, since I'm at my wits' end. (It didn't take super long to do things in bash/python, but I'm not an expert so I was hacking my way through it.) Truth be told, the data isn't complex, but doing the few associations and setting up new IDs is foreign to me. No pun intended.

10 points | by SnowingXIV 2380 days ago

12 comments

  • mindcrash 2378 days ago
    You can do a shitload of data cleanup and transformation stuff with python and pandas (or more precisely: using data frames), which not only gives you the feeling of having superpowers but also makes the whole process actually fun.

    Loading the data is simply:

    df = pd.read_csv(...)

    http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-...

    where pd is an alias for pandas (import pandas as pd), df is the variable in which the data frame is stored, and the ... is the set of parameters required to load and parse your csv (see the docs)

    Then if you, for example, need to drop all rows where a certain column contains an empty string, something like:

    df_without_empty_my_columns = df[df['my_column'] != '']

    http://pandas.pydata.org/pandas-docs/stable/comparison_with_...

    Finally, as an example, to export the refined data frame to some sql database supported by sqlalchemy:

    refined_df.to_sql(...)

    http://pandas.pydata.org/pandas-docs/stable/generated/pandas...
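
    Put together for your case, it might look roughly like this (column names like legacy_created_by are just guesses at what your export uses):

      import pandas as pd
      from sqlalchemy import create_engine

      # only pull the columns you actually need
      df = pd.read_csv('contacts.csv',
                       usecols=['crm_id', 'legacy_created_by', 'created_by', 'created_at'])

      # prefer the legacy value when it is present, otherwise fall back
      # (same trick works for the created-time columns; if "missing" comes
      # through as empty strings rather than NaN, replace those first)
      df['created_by'] = df['legacy_created_by'].fillna(df['created_by'])
      df = df.drop('legacy_created_by', axis=1)

      # write the cleaned frame into postgres via sqlalchemy
      engine = create_engine('postgresql://localhost/crm')
      df.to_sql('contacts', engine, if_exists='replace', index=False)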

    Background: My team is currently doing ETL through pandas for a data analytics platform and we kinda love it.

    Check out http://pandas.pydata.org/pandas-docs/stable/ for the entire set of docs if you are interested.

    Have fun!

    • auxym 2377 days ago
      Pandas is however limited to what can fit in memory, and can hit that limit pretty quick, unfortunately. If your CSVs are pretty big, or your machine is limited in memory, you could consider dask dataframes. It is supposed to have a similar api to Pandas, but it can keep most data on disk if needed, I believe.

      Edit: just saw your other post. You shouldn't have any problems using plain pandas with a 750 MB dataset on any sort of modern computer with a few GB of memory.
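
      If you do end up needing it, the dask version reads almost the same as pandas (file/column names made up):

        import dask.dataframe as dd

        # reads all matching CSVs lazily, in memory-sized chunks
        ddf = dd.read_csv('contacts_*.csv')
        ddf = ddf[ddf['email'] != '']

        # compute() materializes the result as a regular pandas DataFrame
        df = ddf.compute()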

    • BrandonBradley 2378 days ago
      I have recently finished (today) some ETL work with Pandas. Pandas touts time series resampling features that are hard to match anywhere else.

      Disclaimer: Contributed performance improvements to Pandas around two years ago.

  • lastofus 2379 days ago
    Since you are using Postgres: https://www.postgresql.org/docs/current/static/populate.html

    Get familiar with the COPY command (https://www.postgresql.org/docs/current/static/sql-copy.html) for importing your data as it will save you a ton of time compared to doing single or even bulk inserts from a Python script. You can use COPY to either import directly from a CSV file, or you can use it to load Python tuples in memory w/ the Psycopg2 client and copy_expert(). For example:

      import csv
      import cStringIO  # Python 2; on Python 3 use io.StringIO instead

      buffer = cStringIO.StringIO()
      csv_writer = csv.writer(buffer)
      csv_writer.writerow((col1_data_field, col2_data_field, col3_data_field))
      # [write more rows to the buffer, probably in a loop]
      buffer.seek(0)
      # curr is a psycopg2 cursor
      curr.copy_expert('COPY table_name (col1, col2, col3) FROM STDIN CSV', buffer)
    
    For things too big for memory, transform your data into a secondary CSV file, then feed a file object to copy_expert():

      csv_file = open(csv_path, 'r')
      curr.copy_expert("COPY %s from STDIN DELIMITER ',' CSV" % table_name, csv_file)
    
    Import your data into tables with no indexes, and then create the indexes after the import.

    In your postgresql.conf consider the following option to help w/ write throughput:

      synchronous_commit = off
    
    Once your initial data is loaded into PG, consider doing some transformations on your data with materialized views (https://www.postgresql.org/docs/10/static/sql-creatematerial...) as opposed to doing the transforms with a Python script. PG will be able to create the views much faster than you can import data into a new table.
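
    As a rough sketch of that last step (assuming your raw import landed in a contacts_raw table with both the legacy and new created_by columns -- names are made up), using the same psycopg2 cursor as above:

      curr.execute("""
          CREATE MATERIALIZED VIEW contacts_clean AS
          SELECT
              crm_id,
              COALESCE(legacy_created_by, created_by) AS created_by,
              COALESCE(legacy_created_at, created_at) AS created_at
          FROM contacts_raw
      """)
      conn.commit()
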
    • SnowingXIV 2379 days ago
      COPY does work really well; that's actually where I've begun again. Scrapped what I had. My current method is taking all my raw data (a folder of CSVs) and doing the following.

        1. Create psql tables exactly as they are in the CSVs.
        2. COPY all the data into each corresponding table.
        3. Create NEW database/tables with the schema I want (only including the columns needed and proper names).
           I'm thinking this is where your materialized view comes into play (rough idea sketched below).
        4. Magic?
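
      For step 3 I'm picturing something roughly like this for the new IDs (table/column names are made up, and I could be way off), either straight in psql or via psycopg2 (conn/curr being the connection and cursor):

        curr.execute("""
            CREATE TABLE contacts (
                id serial PRIMARY KEY,
                crm_id text UNIQUE,  -- keep the legacy id around for re-imports
                created_by text
            )
        """)
        curr.execute("""
            INSERT INTO contacts (crm_id, created_by)
            SELECT crm_id, COALESCE(legacy_created_by, created_by)
            FROM contacts_raw
        """)
        # child tables pick up the new id by joining on the legacy one
        curr.execute("""
            CREATE TABLE notes AS
            SELECT c.id AS contact_id, n.body, n.created_at
            FROM notes_raw n
            JOIN contacts c ON c.crm_id = n.contact_crm_id
        """)
        conn.commit()
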
  • oblib 2380 days ago
    Kind of sounds like something that Perl might handle well.

    http://search.cpan.org/search?query=csv&mode=all

    I've done a bit of work with some of these modules. If you still need some help feel free to contact me...

  • m00s3 2380 days ago
    Not sure if it would fit your needs, but I've used this with success when trying to do useful things with CSVs: http://csvkit.readthedocs.io/en/1.0.2/tutorial.html
    • SnowingXIV 2380 days ago
      I've definitely used csvkit quite a bit when I was trying to understand the data quickly. Grabbed the headers + first couple of rows of each set.
  • jklein11 2379 days ago
    What is the issue with the ID structure? I would think generating your own IDs would make the process of updating rows when you get new data much more difficult. Also, is there a reason why dropping the database and adding these records is an issue? Updating records could certainly add some complexity if the data in the files changes.

    I would consider going one of two ways:

    A: Do all of the data transformation in Postgres. Load it into a staging table using COPY, then transform and load the data into a target table.

    B: Do the data transformation in Python and just load it into a target table in Postgres. I have had some pretty good success using pandas and its to_sql method to do this sort of thing with files >1MB but <1GB.

    It basically comes down to whether you are more comfortable manipulating the data in SQL or Python
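
    A minimal sketch of option A, assuming a contacts_staging table that mirrors the CSV and a contacts target table (all names made up):

      import psycopg2

      conn = psycopg2.connect(dbname='crm')
      curr = conn.cursor()

      # land the raw CSV in the staging table first
      with open('contacts.csv') as f:
          curr.copy_expert("COPY contacts_staging FROM STDIN CSV HEADER", f)

      # the column if/else logic (COALESCE, CASE, ...) can live in this SELECT
      curr.execute("""
          INSERT INTO contacts (crm_id, created_by, created_at)
          SELECT crm_id, COALESCE(legacy_created_by, created_by), created_at
          FROM contacts_staging
      """)
      conn.commit()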

  • SnowingXIV 2376 days ago
    Thanks everyone for the suggestions! I actually finally got it done. I primarily stuck to writing commands in psql as that was rather efficient. Once that was done I moved the clean tables over to a new database with proper names, etc.

    Had rails grab it and manually built out the models and controllers to handle what I need. Working great!

    Now onto having fun with search, sort, filter. ;)

  • fiftyacorn 2377 days ago
    Can you not use an external table in your database?

    Load what you can, spew the rubbish into an error log and parse separately?

    https://stackoverflow.com/questions/22537852/creating-a-post...

    Once in your database, establish queries to populate your new tables.

  • wirddin 2379 days ago
    If I were you, I would have exported these CSVs along with the header line to mongodb. Excel and other tools have a limit on the number of rows and columns they can display. This has worked for me because I eventually had to process the data and my go-to stack had mongodb. You can try that if you like, or:

    You can go with pandas (python) to programmatically read the csv.

    Advice: Avoid CSVs for huge amounts of data.

  • palidanx 2380 days ago
    It kind of sounds like you need to normalize the data? What I would do is

    1) Create a new schema reflecting the data normalization you want. Meaning creating all the tables foreign key relationships, etc.

    2) Use a programming language of your choice (my quiver would be rails), just for the ORM, to read in the csvs and then write the data out to the tables.

  • framebit 2379 days ago
    With the data sizes you listed, I would consider using Spark and Spark SQL in local mode (--master local[*]). You can grab Spark from the Apache website, install it wherever you like on your laptop, and start using spark-shell immediately with Scala, Java, or Python.
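
    For example, in PySpark (file/column names are made up):

      from pyspark.sql import SparkSession

      # local[*] uses every core on the laptop; no cluster needed
      spark = (SparkSession.builder
               .master("local[*]")
               .appName("crm-import")
               .getOrCreate())

      df = spark.read.csv("exports/contacts_*.csv", header=True, inferSchema=True)
      df.createOrReplaceTempView("contacts_raw")

      spark.sql("""
          SELECT crm_id, COALESCE(legacy_created_by, created_by) AS created_by
          FROM contacts_raw
      """).show()
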
  • jtchang 2380 days ago
    Can you actually post the size of the CSVs? Are we talking 1-2GB, 10GB, 100GB, 1TB?

    sqlite might actually work fairly well.
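
    If it's on the smaller end, the stdlib csv + sqlite3 modules would probably be enough; a rough sketch with made-up file/column names:

      import csv
      import sqlite3

      conn = sqlite3.connect('crm.db')
      conn.execute('CREATE TABLE contacts (crm_id TEXT, created_by TEXT)')

      with open('contacts.csv') as f:
          rows = csv.DictReader(f)
          # an empty legacy value falls back to the regular created_by column
          conn.executemany(
              'INSERT INTO contacts VALUES (?, ?)',
              ((r['crm_id'], r['legacy_created_by'] or r['created_by'])
               for r in rows))
      conn.commit()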

    • SnowingXIV 2380 days ago
      The biggest individual CSV file is 48.5MB. Together uncompressed is ~750MB.
      • tixocloud 2379 days ago
        Are we talking about stitching 5*x CSVs that are each close to 1GB?

        I would love to help you out as this is what I do on a daily basis. Maybe I am misunderstanding the issue, but couldn't you just load all the CSVs and then build queries according to your preferred schema?

        • SnowingXIV 2379 days ago
          I actually managed to load all CSVs into a postgres database. I have one database now with 5 tables that contain all the information. Now the tricky part is doing the restructure: stripping out unneeded columns, some if statements to choose the correct column to use, connecting the relationships, and redoing the IDs, all as painlessly as possible (hopefully a query I can repeat) since I'll likely have a fresh batch of data coming in a month.

          Altogether the CSVs totaled 750MB.

          • tixocloud 2379 days ago
            Gotcha. Sure, I can help you out. Send me an email and we can connect.
    • assafmo 2379 days ago
      Yeah, awk + sqlite will do the trick. 750MB is a fairly small dataset.
  • shyn3 2378 days ago
    Try powershell. Import-CSV. I'd love to get access to this file.