I have what I think is an interesting question. I have a text file that contains data that needs to be uploaded into 3 tables:
1. the first contains all the data
2. the second contains 80% of the data
3. the third contains 20% of the data
Now, to make things interesting, one of these fields can take one of two values, and I need to keep the same ratio on this field after the split (i.e., if one of the values accounts for 90% of the rows before the split, then after the 80/20 split, 90% of the rows in the 80% table and 90% of the rows in the 20% table should have that value).
So it looks like I need to separate the data into the two types, take 80% of each type at random and upload it into the 80% table, and do the same for the 20% table.
The problem is doing this in the most efficient way, since I am using Java and MySQL. I have no idea how big this file is going to get. It's currently a couple hundred rows but could conceivably grow to thousands or hundreds of thousands of rows. This is actually a school assignment, and our professor won't tell us.
Loading all the data into Java and doing the processing there would be easy for me, since I am better in Java than I am in SQL, but that's not efficient: I could end up using a lot of RAM if the file is big.
So my basic idea is to get the data into a MySQL database ASAP. I thought about parsing the input file and converting it into an SQL file that I can immediately load into the database (essentially a duplicate of the input file, rewritten as SQL statements). Then I can use MySQL statements for my processing.
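A minimal sketch of that conversion, assuming a comma-separated input file and a hypothetical table name mytable (both assumptions of mine, not from the assignment):

    import java.io.*;

    // Streams the input file and writes an equivalent .sql file of INSERT
    // statements; only one line is ever held in memory.
    public class ToSql {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                 PrintWriter out = new PrintWriter(new FileWriter("load.sql"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // assumes comma-separated fields with no embedded quotes or commas
                    out.println("INSERT INTO mytable VALUES ('"
                                + line.replace(",", "','") + "');");
                }
            }
        }
    }

(MySQL's LOAD DATA INFILE can also bulk-load a delimited text file directly, which might let you skip the conversion step entirely.)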
I also thought about adding all the tuples to all three tables and then deleting the tuples I don't need. I know how to get a count of the total number of tuples and a count of the number in each category, so I would know the ratios. But I don't know how to delete the correct number at random. Can I do a query where I essentially say SELECT * FROM table_name WHERE cat='A' and then drop one of these rows at random? If so, then it seems like that's an efficient and easy solution.
I guess my basic problem is that I can easily implement this brute force, so to speak, in Java, but my SQL is pretty weak and I don't know of an elegant solution. Any advice or thoughts would be appreciated.
If anyone is curious, our end goal is to implement the PRISM Classification algorithm.
Intriguing problem; the fact that it’s in MySQL hampers me slightly, since I do most of my work in SQL Server, which includes a very useful ETL tool that has a percentage sampling command.
A couple of questions: when you do this 80/20 split, do you need the rows in the table to be unique, or can a row appear twice in the two tables?
As you suggested, I think the best way might be to load three tables and then selectively delete what you don’t need. I’m not sure of the exact syntax, but some searching leads me to believe that
SELECT * FROM table1 WHERE RAND()<0.2 ORDER BY RAND()
will give you roughly your 20% random sample (each row is kept with probability 0.2; the ORDER BY RAND() just shuffles the result), and then it's just a matter of deleting these rows from the 80% table, which will make the two sets disjoint. This may not be the exact solution you need, but it should give you a pointer in the right direction.
Some further reading suggests that this might not be the best possible approach; it could possibly be better if you were to generate the delete statement from within your application, where there is more control over number generation.
If your table has a consecutive RowID or similar, you can generate a statement along the lines of
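(The statement itself appears to be missing from the post; presumably it was something along these lines, generated from Java, with the table name and RowID as placeholders:)

    import java.util.*;

    // Builds a DELETE for k randomly chosen rows, assuming RowIDs run
    // 1..n with no gaps (an assumption; adjust for your actual keys).
    static String randomDelete(String table, int n, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= n; i++) ids.add(i);
        Collections.shuffle(ids);                      // random order
        StringBuilder sql =
            new StringBuilder("DELETE FROM " + table + " WHERE RowID IN (");
        for (int i = 0; i < k; i++) {
            if (i > 0) sql.append(',');
            sql.append(ids.get(i));
        }
        return sql.append(')').toString();
    }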
I am not completely clear on this; I really don't know, and I need to ask. The algorithm I plan on implementing hinges on being able to allow duplicates, but I can make it work if I can't have them.
That’s it! I don’t know what I was thinking last night, but that would be an easy and efficient solution. Basically my algorithm is going to be like this:
1. The first line of every file is the number of categories, which basically gives me the schema for my tables; all the lines that follow are the tuples I want to insert.
2. I will insert every tuple into every table.
3. I will then get my ratios by querying one of the tables.
4. Now I can operate in Java by using random numbers to build a list of the IDs I need to delete, the appropriate amount from each category (see the sketch after this list).
5. Delete the entries by ID.
6. Now I will have all three tables loaded efficiently with a minimal amount of RAM.
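A rough sketch of steps 3-5 for one category value; the table and column names (split80, id, cat) are placeholders of mine, and it assumes an open java.sql.Connection:

    import java.sql.*;
    import java.util.*;

    // Collects the IDs of one category, shuffles them, and deletes a
    // random 20% so the remaining rows form the 80% split for that
    // category. Only the integer IDs are held in RAM, not the tuples.
    static void trimCategory(Connection con, String catValue) throws SQLException {
        List<Integer> ids = new ArrayList<>();
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT id FROM split80 WHERE cat = ?")) {
            ps.setString(1, catValue);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getInt(1));
            }
        }
        Collections.shuffle(ids);          // random selection
        int toDelete = ids.size() / 5;     // remove 20% of this category
        try (PreparedStatement ps =
                 con.prepareStatement("DELETE FROM split80 WHERE id = ?")) {
            for (int i = 0; i < toDelete; i++) {
                ps.setInt(1, ids.get(i));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

Running it once per field value keeps the ratio intact, and because only IDs are loaded, memory use stays small even for hundreds of thousands of rows.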
My problem was that, looking at the text files, there wasn't an ID I could key on, but your comment reminded me that the schema actually has an ID as the primary key, so I can use that to generate random numbers and delete against.
Why do you say it wouldn’t be efficient to do it in Java? Hundreds of thousands of rows of text would not take long (seconds?).
Java method:
    // Streamed split: one row at a time, so only the current row is in RAM.
    // Note: the original "% 2" test would produce a 50/50 split; sending
    // every 5th row of each value to the 20% file gives the 80/20 split
    // and keeps the value ratio the same in both files.
    int value1Counter = 0;
    int value2Counter = 0;
    String row;
    while ((row = reader.readLine()) != null) {  // reader: a BufferedReader over the input
        if (fieldOf(row).equals(VALUE1)) {       // fieldOf/VALUE1: your parsing and first value
            value1Counter++;
            if (value1Counter % 5 == 0)
                file20.println(row);             // the 20% file
            else
                file80.println(row);             // the 80% file
        } else {
            value2Counter++;
            if (value2Counter % 5 == 0)
                file20.println(row);
            else
                file80.println(row);
        }
    }

Lather, rinse, repeat.
Load complete table
Alter table to have additional column “DestFileNum”
Create a stats table with count of value 1, count of value 2 and total rows
note: I don't use MySQL, so the CASE syntax may differ; MySQL does allow joins in an UPDATE, though with its own multi-table syntax
    UPDATE CompleteTable
    SET DestFileNum =
        (CASE WHEN Field = 'Value1' AND RAND() < 0.8 THEN 0  -- 80% group
              WHEN Field = 'Value2' AND RAND() < 0.8 THEN 0
              ELSE 1                                         -- 20% group
         END);
    -- 0.8 is the split fraction, applied within each value group, so the
    -- value ratio is preserved in expectation. (Comparing RAND() against
    -- Value1Count/TotalRows, as originally drafted, would skew the split;
    -- the stats table is then just for sanity-checking the final counts.)
Then select rows into split tables based on DestFileNum
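For completeness, a sketch of that last step from Java, assuming the split tables have the same columns as CompleteTable (including DestFileNum) and that con is an open java.sql.Connection; all table names here are placeholders:

    import java.sql.*;

    // Copies each row into the split table its DestFileNum flag selected.
    static void copySplits(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate(
                "INSERT INTO Split80 SELECT * FROM CompleteTable WHERE DestFileNum = 0");
            st.executeUpdate(
                "INSERT INTO Split20 SELECT * FROM CompleteTable WHERE DestFileNum = 1");
        }
    }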
I meant it wouldn't be efficient insofar as it would load the entire data set into some kind of Java structure (Vector<Vector<String>>, for example). There could conceivably be hundreds of thousands or millions of rows of data. From an efficiency standpoint, I was hoping not to have to load all the data into RAM to process it. If I didn't NEED to, then I would prefer not to.
OK. I posted code that would do it row by row, with no need to load it all into memory. If you don't know Value1 and Value2 in advance, you just have to add a couple of lines to detect them on first encounter.
Load the full table including an identity column. Move every row where the identity column mod 5 equals 0 to one table, and every row where it doesn't to the other.
I missed this part. Assuming MySQL lets you sort the data as you load it, before the identity values are assigned, just sort on this particular column first; the identity values then run through each value group in order, and the mod-5 split gives you the proper ratio.
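One way to read that in MySQL terms, with FullTable, Split80, Split20 and an AUTO_INCREMENT column id all as hypothetical names, and con an open java.sql.Connection:

    import java.sql.*;

    // Every 5th id (20% of each value group, since ids were assigned in
    // sorted order) goes to the 20% table; the rest go to the 80% table.
    static void modFiveSplit(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate(
                "INSERT INTO Split20 SELECT * FROM FullTable WHERE id % 5 = 0");
            st.executeUpdate(
                "INSERT INTO Split80 SELECT * FROM FullTable WHERE id % 5 <> 0");
        }
    }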
You are loading the same data into 3 tables, and you ask about optimizing this redundancy? Optimize your design first!
Create one table, with only one copy of the data, and add a flag to indicate whether each row is part of the 80% group or the 20% group. But do not make duplicate copies of the data.
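A sketch of that single-copy design, reusing the mod-5 idea from above; mytable, id and split_group are placeholder names of mine:

    import java.sql.*;

    // One copy of the data; a TINYINT flag marks each row's group
    // (0 = the 80% group, 1 = the 20% group).
    static void flagGroups(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate("ALTER TABLE mytable ADD COLUMN split_group TINYINT");
            st.executeUpdate("UPDATE mytable SET split_group = IF(id % 5 = 0, 1, 0)");
        }
    }

Queries that need the 80% set just filter on split_group = 0, so no data is ever duplicated.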