SQL Question

I don’t know how many SQL gurus there are here so if anyone doesn’t know the answer but knows a site where I can post SQL questions, that would be appreciated too.

Anyway, the question:

I have a table with a few hundred thousand rows, each of which has an identifier of C1, C2 etc up to C21. The identifier is known as the segment.

To illustrate, if I execute

SELECT segment, count (*)
FROM my_table
GROUP BY segment

I get:



C1		12000
C10		10542
C11		10389
C12		10737
C13		12000
C14		12000
C15		12000
C16		12000
C17		48253
C18		24195
C19		39708
C2		12000
C20		24039
C21		24352
C3		12000
C4		12000
C5		13121
C6		16870
C7		41653
C8		16852
C9		16819


What I need to do is set 5% of each of the segments as a control. ie I want to set the “segment” column to be set to ‘CTRL’ for 5% of each segment.

Does anyone know a way to do this with a single command? It can be as complex as necessary, but must be a single statement.

Thanks in advance for any help.

Is there some other column that you could use in the Where clause that has a regular value range, like between 1 and 100 evenly spaced, then you could say Where columnX < 20.

Or use a stored procedure.

Typo: I meant < 5.

Whenever I have to do something like this, I write something quick in whatever language I’m developing on, and use a loop and counter.

Unfortunately there is no column like that.

I can’t do that either. The data is on a system to which I have no access except trough some archaic third party software which will only let me send one statement at a time. And I can’t load up stored procedures on the remote system.

Yeah, I’d happily use PL/SQL, but I can’t send it to the remote system. Things would be so much easier if they were reasonable about the access privileges around here. :mad:

If they won’t let you use a stored procedure, could it be done by writing an external program that makes ODBC calls to the database? That would probably be a big pain in the ass, but I can’t think of any way to do this using just a single SQL statement.

The only way I’m aware of to return a random selection of rows that is some percentage of your whole is with the SAMPLE clause. For instance:

SELECT *
FROM my_table SAMPLE (5)
WHERE segment = ‘C1’;

will give you 5% of the rows that have an identifier of ‘C1’. However, SAMPLE is pretty picky (for reasons I don’t really understand) about where it can be used. I don’t think you can ever use it in the WHERE clause of an UPDATE statement, as that would involve joining the table you want to update with some other table (even if it’s just a different version of the same table). Oracle says (and I think this is true of any SQL database):

Even if each row had a primary key that you could reference, a statement like:

UPDATE my_table mt1
SET mt1.segment = ‘CTRL’
WHERE mt1.id IN (SELECT m2.id
FROM my_table mt2 SAMPLE (5)
WHERE mt2.segment = ‘C1’);

wouldn’t work. Running it produces a syntax error.

Wait a minute, if I’m reading you right, you can issue multiple statements, it’s just that it has to be one at a time. If each row does have an unique identifer, then I think something like this would work:

CREATE TABLE my_new_table
AS SELECT *
FROM my_table SAMPLE (5)
WHERE segment = ‘C1’;

UPDATE my_table mt
SET mt.segment = ‘CTRL’
WHERE mt.id IN (SELECT mnt.id
from my_new_table);

DROP my_new_table;

Of course, that only works if there’s some column in the table that you can use to join it with a table containing a random subset. If that’s not the case, and you can’t add a column like that, then I’m out of ideas.

How important is it that you get exactly 20% for each set? Could you tolerate a little variation?

If you can, then you might try a modulus function on some nearly random column, i.e. UPDATE TBL SET COL = ‘CTRL’ WHERE (OTHERCOL MOD 5) = 0. I’m not sure of the modulus operator but that’s the basic idea.