I have what I think is an interesting question. I have a text file that contains data that needs to be uploaded into 3 tables:
1. the first contains all the data
2. the second contains 80% of the data
3. the third contains 20% of the data
Now, to make things interesting, one of these fields can take one of two values, and I need to keep the same ratio on this field after the split (i.e., if one of the values accounts for 90% of the rows before the split, then after the 80/20 split, 90% of the rows in the 80% table and 90% of the rows in the 20% table should have that value).
So it looks like I need to separate the data into the two types, take 80% of each type at random and upload it into the 80% table, and do the same for the 20% table.
The problem is doing this in the most efficient way, since I am using Java and MySQL. I have no idea how big this file is going to get. It's currently a couple hundred rows but could conceivably grow to thousands or hundreds of thousands of rows. This is actually a school assignment, and our professor won't tell us.
Loading all the data into Java and doing the processing there would be easy for me, since I am better in Java than I am in SQL, but that's not efficient: I could end up using a lot of RAM if the file is big.
So my basic idea is to get the data into a MySQL database ASAP. I thought about parsing the input file and converting it into an SQL file that I can immediately load into the database (essentially a duplicate of the input file, rewritten as SQL statements). Then I can use MySQL statements for my processing.
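A minimal sketch of that conversion, assuming a comma-separated input file and a hypothetical table name mytable (both assumptions of mine, not from the assignment):

    import java.io.*;

    // Streams the input file and writes an equivalent .sql file of INSERT
    // statements; only one line is ever held in memory.
    public class ToSql {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
                 PrintWriter out = new PrintWriter(new FileWriter("load.sql"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // assumes comma-separated fields with no embedded quotes or commas
                    out.println("INSERT INTO mytable VALUES ('"
                                + line.replace(",", "','") + "');");
                }
            }
        }
    }

(MySQL's LOAD DATA INFILE can also bulk-load a delimited text file directly, which might let you skip the conversion step entirely.)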
I also thought about adding all the tuples to all three tables and then deleting the tuples I don't need. I know how to get a count of the total number of tuples and a count of the number in each category, so I would know the ratios. But I don't know how to delete the correct number at random. Can I do a query where I essentially say SELECT * FROM table_name WHERE cat='A' and then drop one of these rows at random? If so, then it seems like that's an efficient and easy solution.
I guess my basic problem is that I can easily implement this brute force, so to speak, in Java, but my SQL is pretty weak and I don't know of an elegant solution. Any advice or thoughts would be appreciated.
If anyone is curious, our end goal is to implement the PRISM Classification algorithm.
Intriguing problem; the fact that it’s in MySQL hampers me slightly, since I do most of my work in SQL Server, which includes a very useful ETL tool that has a percentage sampling command.
A couple of questions: when you do this 80/20 split, do you need the rows in the table to be unique, or can a row appear twice in the two tables?
As you suggested, I think the best way might be to load three tables and then selectively delete what you don’t need. I’m not sure of the exact syntax, but some searching leads me to believe that
SELECT * FROM table1 WHERE RAND()<0.2 ORDER BY RAND()
will give you roughly your 20% random sample (each row is kept with probability 0.2; the ORDER BY RAND() just shuffles the result), and then it's just a matter of deleting these rows from the 80% table, which will make the two sets disjoint. This may not be the exact solution you need, but it should give you a pointer in the right direction.
Some further reading suggests that this might not be the best possible approach; it could possibly be better if you were to generate the delete statement from within your application, where there is more control over number generation.
If your table has a consecutive RowID or similar, you can generate a statement along the lines of
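(The statement itself appears to be missing from the post; presumably it was something along these lines, generated from Java, with the table name and RowID as placeholders:)

    import java.util.*;

    // Builds a DELETE for k randomly chosen rows, assuming RowIDs run
    // 1..n with no gaps (an assumption; adjust for your actual keys).
    static String randomDelete(String table, int n, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= n; i++) ids.add(i);
        Collections.shuffle(ids);                      // random order
        StringBuilder sql =
            new StringBuilder("DELETE FROM " + table + " WHERE RowID IN (");
        for (int i = 0; i < k; i++) {
            if (i > 0) sql.append(',');
            sql.append(ids.get(i));
        }
        return sql.append(')').toString();
    }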
I am not completely clear on this; I really don't know, and I need to ask. The algorithm I plan on implementing hinges on being able to allow duplicates, but I can make it work if I can't have them.
That’s it! I don’t know what I was thinking last night, but that would be an easy and efficient solution. Basically my algorithm is going to be like this:
1. The first line of every file is the number of categories, which basically gives me the schema for my tables; all the lines that follow are the tuples I want to insert.
2. I will insert every tuple into every table.
3. I will then get my ratios by querying one of the tables.
4. Now I can operate in Java by using random numbers to build a list of the IDs I need to delete, the appropriate amount from each category (see the sketch after this list).
5. Delete the entries by ID.
6. Now I will have all three tables loaded efficiently with a minimal amount of RAM.
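A rough sketch of steps 3-5 for one category value; the table and column names (split80, id, cat) are placeholders of mine, and it assumes an open java.sql.Connection:

    import java.sql.*;
    import java.util.*;

    // Collects the IDs of one category, shuffles them, and deletes a
    // random 20% so the remaining rows form the 80% split for that
    // category. Only the integer IDs are held in RAM, not the tuples.
    static void trimCategory(Connection con, String catValue) throws SQLException {
        List<Integer> ids = new ArrayList<>();
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT id FROM split80 WHERE cat = ?")) {
            ps.setString(1, catValue);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getInt(1));
            }
        }
        Collections.shuffle(ids);          // random selection
        int toDelete = ids.size() / 5;     // remove 20% of this category
        try (PreparedStatement ps =
                 con.prepareStatement("DELETE FROM split80 WHERE id = ?")) {
            for (int i = 0; i < toDelete; i++) {
                ps.setInt(1, ids.get(i));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

Running it once per field value keeps the ratio intact, and because only IDs are loaded, memory use stays small even for hundreds of thousands of rows.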
My problem was that, looking at the text files, there wasn't an ID I could key on, but your comment reminded me that the schema actually has an ID as the primary key, so I can use that to generate random numbers and delete against.
Why do you say it wouldn’t be efficient to do it in Java? Hundreds of thousands of rows of text would not take long (seconds?).
Java method:
    // Streamed split: one row at a time, so only the current row is in RAM.
    // Note: the original "% 2" test would produce a 50/50 split; sending
    // every 5th row of each value to the 20% file gives the 80/20 split
    // and keeps the value ratio the same in both files.
    int value1Counter = 0;
    int value2Counter = 0;
    String row;
    while ((row = reader.readLine()) != null) {  // reader: a BufferedReader over the input
        if (fieldOf(row).equals(VALUE1)) {       // fieldOf/VALUE1: your parsing and first value
            value1Counter++;
            if (value1Counter % 5 == 0)
                file20.println(row);             // the 20% file
            else
                file80.println(row);             // the 80% file
        } else {
            value2Counter++;
            if (value2Counter % 5 == 0)
                file20.println(row);
            else
                file80.println(row);
        }
    }

Lather, rinse, repeat.
Load complete table
Alter table to have additional column “DestFileNum”
Create a stats table with count of value 1, count of value 2 and total rows
note: I don't use MySQL, so the CASE syntax may differ; MySQL does allow joins in an UPDATE, though with its own multi-table syntax
    UPDATE CompleteTable
    SET DestFileNum =
        (CASE WHEN Field = 'Value1' AND RAND() < 0.8 THEN 0  -- 80% group
              WHEN Field = 'Value2' AND RAND() < 0.8 THEN 0
              ELSE 1                                         -- 20% group
         END);
    -- 0.8 is the split fraction, applied within each value group, so the
    -- value ratio is preserved in expectation. (Comparing RAND() against
    -- Value1Count/TotalRows, as originally drafted, would skew the split;
    -- the stats table is then just for sanity-checking the final counts.)
Then select rows into split tables based on DestFileNum
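For completeness, a sketch of that last step from Java, assuming the split tables have the same columns as CompleteTable (including DestFileNum) and that con is an open java.sql.Connection; all table names here are placeholders:

    import java.sql.*;

    // Copies each row into the split table its DestFileNum flag selected.
    static void copySplits(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate(
                "INSERT INTO Split80 SELECT * FROM CompleteTable WHERE DestFileNum = 0");
            st.executeUpdate(
                "INSERT INTO Split20 SELECT * FROM CompleteTable WHERE DestFileNum = 1");
        }
    }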
I meant it wouldn't be efficient insofar as it would load the entire data set into some kind of Java structure (Vector<Vector<String>>, for example). There could conceivably be hundreds of thousands or millions of rows of data. From an efficiency standpoint, I was hoping not to have to load all the data into RAM to process it. If I didn't NEED to, then I would prefer not to.
OK. I posted code that would do it row by row, with no need to load it all into memory. If you don't know Value1 and Value2 in advance, you just have to add a couple of lines to detect them on first encounter.
Load the full table including an identity column. Move every row where the identity column mod 5 equals 0 to one table, and every row where it doesn't to the other.
I missed this part. Assuming MySQL lets you sort the data as you load it, before the identity values are assigned, just sort on this particular column first; the identity values then run through each value group in order, and the mod-5 split gives you the proper ratio.
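One way to read that in MySQL terms, with FullTable, Split80, Split20 and an AUTO_INCREMENT column id all as hypothetical names, and con an open java.sql.Connection:

    import java.sql.*;

    // Every 5th id (20% of each value group, since ids were assigned in
    // sorted order) goes to the 20% table; the rest go to the 80% table.
    static void modFiveSplit(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate(
                "INSERT INTO Split20 SELECT * FROM FullTable WHERE id % 5 = 0");
            st.executeUpdate(
                "INSERT INTO Split80 SELECT * FROM FullTable WHERE id % 5 <> 0");
        }
    }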
You are loading the same data into 3 tables, and you ask about optimizing this redundancy? Optimize your design first!
Create one table, with only one copy of the data, and add a flag to indicate whether each row is part of the 80% group or the 20% group. But do not make duplicate copies of the data.
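A sketch of that single-copy design, reusing the mod-5 idea from above; mytable, id and split_group are placeholder names of mine:

    import java.sql.*;

    // One copy of the data; a TINYINT flag marks each row's group
    // (0 = the 80% group, 1 = the 20% group).
    static void flagGroups(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate("ALTER TABLE mytable ADD COLUMN split_group TINYINT");
            st.executeUpdate("UPDATE mytable SET split_group = IF(id % 5 = 0, 1, 0)");
        }
    }

Queries that need the 80% set just filter on split_group = 0, so no data is ever duplicated.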