Excel text database formula/function HELP!

My data is set up rougly like this (there are columns in between, but these are the columns I’m dealing with, say the first is B and the second is F)
B…F
1 blue………. blah@blah
2 home………. blah@blah
3 tree………. blah@blah
4 poof………. dakfo[ksdokfaodk@diajf[oidjf[a
5 tree………. dapofj[sidfj@doiafj[oi
6 home………. dinkywinky@someplace
7 blue………. buttinsky@mars.tv
8 sign………. buttinsky@mars.tv
9 bean………. buttinsky@mars.tv
10 cars………. buttinsky@mars.tv
11 tree………. dajdijf@doijafij
12 poof………. difjadoi8u37y2@diajfs
13 sign………. dijfaudhfa@idjf[ja
14 bean………. idjfqe4df@ijdufha
15 cars………. dofkadajf@ojdfaj8e
16 blue………. dope@straightdope.com
17 home………. dope@straightdope.com
18 sign………. dope@straightdope.com
19 bean………. dope@straightdope.com
20 cars………. dope@straightdope.com
21 blue………. dope@straightdope.com
22 home………. dope@straightdope.com
23 tree………. dope@straightdope.com
24 poof………. aidfj[oudhuh@daohfoudhsf
25 tree………. idfjaoudhjao@'adijfiaj
26 home………. dijaf[oij@apdijfaidj
27 blue………. glooberton@blurggy.com
28 sign………. glooberton@blurggy.com
29 bean………. glooberton@blurggy.com
30 cars………. glooberton@blurggy.com
31 tree………. glooberton@blurggy.com
32 poof………. diajfij@soiadjf[j
33 tree………. diafj[dosjf@sdiojfaosdj
34 home………. diajf[oidjoj@aodjfoaij
35 tree………. asfpdhf@isdf[ohd
If I could create the formula in English, it would read as follows:

If B1 contains the word “blue”, look at the email address in F2. If the email address is F2 is the same address in F1, enter “blue” (or blue 2 or whatever) in B2. Then look at F3. If it is the same email address that is in F2 and F1, enter “blue” in b3. Keep going until the F cell you check does NOT match the previous F cells as I just described.

The results of the successful implementation of this would be:

1 blue………. blah@blah
2 blue………. blah@blah
3 blue………. blah@blah
4 poof………. dakfo[ksdokfaodk@diajf[oidjf[a
5 tree………. dapofj[sidfj@doiafj[oi
6 home………. dinkywinky@someplace
7 blue………. buttinsky@mars.tv
8 blue………. buttinsky@mars.tv
9 blue………. buttinsky@mars.tv
10 blue………. buttinsky@mars.tv
11 tree………. dajdijf@doijafij
12 poof………. difjadoi8u37y2@diajfs
13 blue………. dijfaudhfa@idjf[ja
14 home………. idjfqe4df@ijdufha
15 tree………. dofkadajf@ojdfaj8e
16 blue dope@straightdope.com
17 blue………. dope@straightdope.com
18 blue………. dope@straightdope.com
19 blue………. dope@straightdope.com
20 blue………. dope@straightdope.com
21 blue………. dope@straightdope.com
22 blue………. dope@straightdope.com
23 blue………. dope@straightdope.com
24 poof aidfj[oudhuh@daohfoudhsf
25 blue………. idfjaoudhjao@'adijfiaj
26 home………. dijaf[oij@apdijfaidj
27 blue glooberton@blurggy.com
28 blue glooberton@blurggy.com
29 blue glooberton@blurggy.com
30 blue glooberton@blurggy.com
31 blue glooberton@blurggy.com
32 poof………. diajfij@soiadjf[j
33 blue………. diafj[dosjf@sdiojfaosdj
34 home………. diajf[oidjoj@aodjfoaij
35 tree………. asfpdhf@isdf[ohd
AN ALTERNATIVE that would be just fine would be this:

If F1 is the same as F2, make B2 the same as B1. If f3 matches f2 and f1, make B3 match b1 and b2.

The results would look like this:

1 poof………. blah@blah
2 poof………. blah@blah
3 poof………. blah@blah
4 blue………. dakfo[ksdokfaodk@diajf[oidjf[a
5 sign………. dapofj[sidfj@doiafj[oi
6 bean………. dinkywinky@someplace
7 cars………. buttinsky@mars.tv
8 cars………. buttinsky@mars.tv
9 cars………. buttinsky@mars.tv
10 cars………. buttinsky@mars.tv
11 cars………. dajdijf@doijafij
12 blue………. difjadoi8u37y2@diajfs
13 home………. dijfaudhfa@idjf[ja
14 tree………. idjfqe4df@ijdufha
15 poof………. dofkadajf@ojdfaj8e
16 bean………. dope@straightdope.com
17 bean………. dope@straightdope.com
18 bean………. dope@straightdope.com
19 bean………. dope@straightdope.com
20 bean………. dope@straightdope.com
21 bean………. dope@straightdope.com
22 bean………. dope@straightdope.com
23 bean………. dope@straightdope.com
24 blue………. aidfj[oudhuh@daohfoudhsf
25 sign………. idfjaoudhjao@'adijfiaj
26 bean………. dijaf[oij@apdijfaidj
27 poof………. glooberton@blurggy.com
28 poof………. glooberton@blurggy.com
29 poof………. glooberton@blurggy.com
30 poof………. glooberton@blurggy.com
31 poof………. glooberton@blurggy.com
32 tree………. diajfij@soiadjf[j
33 poof………. diafj[dosjf@sdiojfaosdj
34 tree………. diajf[oidjoj@aodjfoaij
35 home………. asfpdhf@isdf[ohd

THE GOAL;

I have consolidated 3 separately compiled lists of customers, which have tons of repeats. I need to make sure that a certain category of customer, call it blue, is deleted from the list. If I just delete the ones that are already marked BLUE, I will miss the repeats that are marked “tree” or something else.

How do you decide whether a customer falls into the category? Is there an identifying thing about it that would enable you to identify it rather than evaluating each line?

I’m trying to remove the active customers.

I have compiled the list from three sources, only one of which indicates whether the customer is currently active.

So I brought them all together. In the “status” column, I entered buzzzz for the lines that didn’t say one way or another. Then I sorted on the email column. I get what I show above, several lines, anywhere from three to ten, that have exactly the same the email address. But only two of the lines show as “active” in the status column, the rest say “buzzz” So if I simply sort on “status” and delete all the “active”, the repeats of the active will still be in there, which is bad. So I need to change all the repeats to say “active” as well, sort on active, and delete all the actives, which will leave a net of only INactive customers, and those are the ones I want to reach.

I hacked up the emails for customer privacy but it looks like this:

ACTIVE 332000
null 332000
null 332000
null 372005
null 372005
null 372005
null 585858
null 585858
null 585858
null _699@h
null _699@h
null _699@h
ACTIVE _75@ma
null _75@ma
null _75@ma
null _9999@
null _9999@
null _army@
null _army@
null _army@
ACTIVE _bam@y
null _bam@y
null _bam@y
null _barro
null _barro
null _barro
null _eb@ya
null _eb@ya
null _eb@ya
null _eb@ya
ACTIVE _egus@
null _egus@
null _egus@
null _fan77
null _fan77
null _fan77
null _g@yah
null _g@yah
null _g@yah
null punk
null punk
ACTIVE _q@yah
null _q@yah
null _q@yah
null _wckr@
null _wckr@
null _wckr@
null _yanni
null _yanni
null _yanni
null -1@msn
null -1@msn
null -1@msn
ACTIVE .a.hay
null .a.hay
null .a.hay
null .adria
null .adria
null .adria
null .almas
null .almas
null .almas
null .aro@l
null .aro@l
null .aro@l
null .astor
null .astor
null .astor
ACTIVE .bard@
null .bard@
null .bard@
null .bard@
null .bard@
null .bard@
ACTIVE .bickl
null .bickl
null .bickl
null .bodyc
null .bodyc
null .bodyc
null .boyki
null .chuck
null .chuck
ACTIVE .cifel
null .cifel
null .cifel
null .cleme
null .diamo
null .diamo
So wherever it says “active” and the next however many emails are the same as the one on that line, I need to get rid of all of them.

Any ideas?

I don’t quite understand the difficulty. Why don’t you first identify all the email addresses of the active accounts, then eliminate all records with those email addresses? Doing it via a spreadsheet is possible – first create a lookup table for the email addresses to be eliminated, then do the lookups (using Excel’s VLOOKUP function or the like) for each record and flag those records that match one of the emails in the lookup table. Sort by this flag and eliminate the flagged records. Instead of all this finagling in a spreadsheet, a simple database query would be more direct.

Thank you! I have never before done this or anything close, but I just tried and it’s very intriguing.

However, excel doesn’t seem to like to look for email addresses, either singly or in large groups. It won’t accept the formula and highlights the email as the error zone.

using jsut one email address, my formula looks right to my virgin eyes:

=VLOOKUP(someguy@pbs.org,d300:d1000,7,false)

I’ll keep noodling, but any help would be great.

Try quotation marks around the email address.

I can’t believe this… and I can’t believe the help files and instructions available aren’t more clear and specific for people who are not accustomed to these things.

So far, in trying to get it right, I’ve triggered a #value!, a #name!, and now a #ref! - when I look up what they all mean, I cannot tie them to the instructions for this!

Seconded. This sounds like the sort of thing that would be way easier in SQL.

IMO, the help files and instructions actually are pretty good. But there are some things that require futzing around with to figure out how they work. Realistically speaking, this would probably be about a half-hour project for somebody who knows what they’re doing; unless one of your goals is to figure out how to do it yourself, it might be more efficient for you to throw a few bucks at somebody who’s done this sort of thing before and have them do it for you.

If you can delete the duplicate rows and you have no specific duplicated row that you need to keep, you can use Excel’s AdvancedFilter function to “filter for uniques” to accomplish this.

If you are keeping all the data and want to overwrite just column B or F, the cleanest way I can think of is to write a macro in VBA using if/then statements. It will have no code/space needed on your spreadsheet. I can write a sample later if you prefer that method.
You can do it all on an Excel spreadsheet too. One quick but messy method I thought up is to use a count if function that looks only at the rows below the activecell for duplicates. Create columnG with:

if(countif(F2:F[LASTROW],F2)>1,“DUPLICATE BELOW”,“UNIQUE”)
if(countif(F3:F[LASTROW],F3)>1,“DUPLICATE BELOW”,“UNIQUE”)
if(countif(F4:F[LASTROW],F4)>1,“DUPLICATE BELOW”,“UNIQUE”)

if you used a fixed position for the last row i.e. F$10000, and you drag the boxes down, it will propagate so that each row only looks below for another duplicate of it. Sooner or later you will have one “UNIQUE” for each email address in column F. It will mark the last unique email address as UNIQUE.
We will use this UNIQUE/DUPLICATE column to pick the columnB value we are gonna keep. Whenever an email in columnF is a duplicate, it will use the value of the columnB of the LAST instance of that email address (what we labeled as UNIQUE).
To do this we write in ColumnH something like:
if(G2=“UNIQUE”,B2","")
if(G3=“UNIQUE”,B3","")

and in ColumnI something like:
if(G2=“UNIQUE”,F2","")
if(G3=“UNIQUE”,F3","")

It finds the unique rows and copies over the columnB and columnF values into an array to use a vlookup.
in columnJ you write:
if(G2=“DUPLICATE”,vlookup(F2,H2:I[LASTROW],1),B2)

*** it’s 1am, and I’m not writing the correct vlookup code. You basically tell Excel to look in column I, and find the email address in F2. When it finds it, it takes the corresponding value in column H, which is columnB value for a UNIQUE we decided to keep.

If it is the UNIQUE row, you it will just pull the value from B2 ***

in the end you will have email addresses of F2 that is left untouched, everytime there is a duplicate email in F2, columnJ will now have the same piece of data for all those email addresses.

A bit messy unfortunately. I’m sure someone can find a cleaner method in Excel only, or you can consider using a macro in VBA

Reading the post again, can you clarify…

  1. Is columnB your “status” column that has either ACTIVE or BUZZZZ?

I am going to assume yes:

  1. If there are 10 duplicate email address entries (123@yahoo.com 10 times) and the “status” column of all of these say “BUZZZZ” do you still want to have these deleted?

I realize the above solution does not discern whether an email address is ever marked ACTIVE before it tries to overwrite all the columnB values to make them the same. You may end up with an ACTIVE status email address that the above method will overwrite the ACTIVE with a BUZZZZ entry.

I just realized if your “status” column is just ACTIVE/BUZZZZ… it sounds like you can do:

if B2 = status, and is either ACTIVE or BUZZZZ
if F2 = email address

In columnG (or some blank space)
if(B2=“ACTIVE”,F2,"")

This will locate all of the email addresses of the ACTIVE customers you know you want to clear off this list. You will have many blanks for any “status” that is BUZZZZ right now.
In the next columnH over try:

if(countif(G$2:G$9999,F2)>1,“DELETE”,“KEEP”)
if(countif(G$2:G$9999,F3)>1,“DELETE”,“KEEP”)
if(countif(G$2:G$9999,F4)>1,“DELETE”,“KEEP”)
if(countif(G$2:G$9999,F5)>1,“DELETE”,“KEEP”)

I think this logic works for what you want… if a line is labeled “ACTIVE” in the status column, it will copy the email address over to columnG, otherwise it leaves it blank.

Then it looks at each row of email addresses in columnF, and compares to the entire list of ACTIVE customer emails we identified in columnG. If it is there on that list, it puts “DELETE” in there for you, otherwise “KEEP”.

One thing about this is that you must delete all of the BUZZZZ lines you need to delete before you delete any lines labeled ACTIVE. If you delete an ACTIVE line, it also removes that email from the list of ACTIVE customer emails in columnG that is used to compare whether it is needed to be deleted or not.

Funnily enough the easiest way to do what you want is basically your own suggestion in the OP.

To get the answer you list after this comment:

*If I could create the formula in English, it would read as follows:

If B1 contains the word “blue”, look at the email address in F2. If the email address is F2 is the same address in F1, enter “blue” (or blue 2 or whatever) in B2. Then look at F3. If it is the same email address that is in F2 and F1, enter “blue” in b3. Keep going until the F cell you check does NOT match the previous F cells as I just described.

The results of the successful implementation of this would be:*

just do this:

in cell G1 enter blue
in cell G2 enter =if(F2=F1,G1,B2)
copy to the bottom of G
copy the cells in G and paste special as values over column B

See it’s just like your “English formula”, most formulas are.