
WO2006060967A2 - System and method for extending an antiphishing aggregator - Google Patents

System and method for extending an antiphishing aggregator Download PDF

Info

Publication number
WO2006060967A2
WO2006060967A2 (PCT/CN2005/002154)
Authority
WO
WIPO (PCT)
Prior art keywords
aggregator
alpha
plug
message
messages
Prior art date
Application number
PCT/CN2005/002154
Other languages
French (fr)
Inventor
Marvin Shannon
Wesley Boudeville
Original Assignee
Metaswarm (Hong Kong) Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Metaswarm (Hong Kong) Ltd. filed Critical Metaswarm (Hong Kong) Ltd.
Publication of WO2006060967A2 publication Critical patent/WO2006060967A2/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Definitions

  • This invention relates generally to information delivery and management in a computer network. More particularly, the invention relates to techniques for automatically classifying electronic communications and web pages as phishing or non-phishing.
  • Some phishers introduce, by various means, viruses that take over computers. These viruses can then issue phishing messages without those messages coming directly from a computer owned by the phisher. A virus might also act as a web server, serving as the destination of links in phishing messages; it would then forward received information to another computer controlled by the phisher. In both cases, the aim is for the phisher to conceal her presence.
  • Such phishing networks are called "bot nets", where "bot" stands for "robot".
  • The viruses are often called "malware" (a.k.a. "malicious software"). In this respect, phishing is merging into the general problem of malware.
  • The message might urge this, saying that the patch fixes a security bug.
  • Another example is where the sender pretends to be a computer gaming company.
  • The patch will supposedly speed up the response time on your computer in a multiplayer game played across the Internet. Or you might be asked to download something but be told that it is not a program, so you do not have to explicitly install it. Instead, it might supposedly be data.
  • Some bugs might exist that enable the installation of malware without the user having to explicitly install a downloaded entity.
  • The entity to be downloaded might not be in the message itself. Instead, the message might have a link to some location on the network containing that entity. Many other variants are possible. Note that, as in these examples, the purported companies need not have anything to do with the financial sector.
  • A general purpose search engine can be manipulated by fraudsters, who set up websites purporting to offer goods and services. Then, via widely known Search Engine Optimization methods, they can pump up their unpaid rankings when a search engine returns answers to a query. These methods might involve the use of link farms. Or the fraudsters might buy ad space on the engines, where the ads might be associated with particular key words, so that when a user queries with those words, the fraudsters' websites are shown as clickable ads. In both cases, the intent is to persuade the user to go to a fraudster's website.
  • The blacklist is updated regularly (possibly daily or hourly) from an Aggregator website that uses our Bulk Message Envelope and clustering methods on new groups of electronic messages to find clusters of pornographic websites. This method can also filter other types of electronic interactions, like SMS, IM and junk faxes. Plus, our website can also offer other types of domains for the browser to block. These might be hate websites or financial fraud websites. Using watermarking methods, the Aggregator can detect if competitors copy its blacklists. The plug-in can also upload anonymized reporting information that lets the Aggregator improve its analysis of undesired websites.
  • VSE Validated Search Engine
  • The VSE restricts its search to websites of the Aggregator's clients, which are typically well-known, large companies. These clients can also sponsor or validate their business partners or franchisees, to build up the scope of the Aggregator's VSE.
  • The VSE can also optionally pass search queries down to its clients' search engines, which confine themselves to the clients' websites and inventory.
  • The Aggregator can act as a UDDI by only registering services from its clients. (It can also furnish such information to other UDDIs.)
  • The Aggregator can be a Trusted UDDI, to help enable commercial Web Services.
  • Fig. 1 shows the general configuration on a computer network of various elements in our Invention - the Aggregator, its client companies and a browser with a plug-in that gets information from the Aggregator.
  • The first relates to attacking malware in electronic messages.
  • The second relates to blocking pornographic websites and content.
  • The third relates to making a validated search engine.
  • The common theme is to expand the abilities of a centralized Aggregator and of a browser plug-in that gets data from, and sends data to, the Aggregator.
  • The intent is that the end user, who uses a browser (or equivalent program), is less likely to encounter fraudulent or offensive messages or websites.
  • Our methods can be used in a plug-in for a browser or any other program that is used to view messages on the recipient's machine. They can be used to validate any part or parts of the message that the plug-in can programmatically find. We use the concept of the <notphish> tag introduced in "2458".
  • The plug-in might have a default policy that it will ask the Aggregator whether authentic messages from Someco will have any attachments. Or the notphish tag might have an attribute that says this. Hence, the Aggregator might have information, given to it by Someco at some earlier time, that says "no"; that is, currently, Someco will not issue any messages containing attachments. In this case, the plug-in knows that the message is not valid, without having to perform any analysis on the attachment. It can mark the message as invalid. Plus, if the user tries to download the attachment, the plug-in can issue a warning, or, in some browsers, even disable the download.
  • A list of hashes for valid attachments can be issued to the Aggregator.
  • The plug-in then has slightly more work to do. It can compute a hash of the attachment and see if it is in Someco's hash list, given to it by the Aggregator. Or it could send that hash to the Aggregator, which then makes the comparison, instead of the plug-in having to download the hash list. In either case, if the computed hash is not in Someco's hash list, the plug-in marks the message as invalid.
  • The plug-in needs to download the attachment to the browser. But this attachment can be held in memory, without being written to disk. Nor does program control have to be passed to the attachment; if it is, this can be done only after the hash has been found and verified as a valid hash for that company. This is an important advantage of our method, because an attachment might be written in a formatting language that is, in some respects, a programming language in its own right. An example is Adobe Corp.'s PDF language for describing images.
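  • The attachment check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the hash list contents and function name are assumed, and SHA-256 stands in for whatever hash the Aggregator and Someco agree on.

```python
import hashlib

# Hypothetical hash list for "Someco", as supplied by the Aggregator.
# In practice the plug-in would fetch (and cache) this list.
VALID_HASHES = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",  # sha256 of b"test"
}

def attachment_is_valid(attachment_bytes: bytes, valid_hashes: set) -> bool:
    """Hash the attachment held in memory (never written to disk,
    never executed) and check it against the company's hash list."""
    digest = hashlib.sha256(attachment_bytes).hexdigest()
    return digest in valid_hashes

# The plug-in would call this before allowing a download or any execution:
print(attachment_is_valid(b"test", VALID_HASHES))      # True
print(attachment_is_valid(b"tampered", VALID_HASHES))  # False
```

Because only a digest is compared, the attachment never has to be executed or even written to disk to be judged valid or invalid.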
  • The plug-in may have received the information about Someco from the Aggregator at some earlier time, and cached it locally.
  • Another advantage is with respect to traditional antivirus methods. These take a given set of bits, and apply many analytical methods to discern if it is a virus. Sometimes, these methods might compute signatures (analogous to hashes) from the data, and then compare those against a database of signatures of known viruses.
  • The database is analogous to a blacklist of spammer or phisher domains.
  • The plug-in could optionally forward it, or the entire message, to the Aggregator or the company or a regulatory authority, for more intensive analysis, which could include the standard antivirus methods.
  • The plug-in, or the browser itself, might have a policy that it will only run a script if it has been hashed and verified against a list from a reputable company. If this policy exists, the user might be able to enable or disable it.
  • Tags can also check arbitrary sections of a message body. For example, Someco might put a tag "<dohash>" at the start of a portion of the body, and a tag "</dohash>" at the end of that portion, for a given message that it actually sends out. Then, if the plug-in sees such tags, it can find the hash of the delimited area and compare it against Someco's hash list, as given by the Aggregator. There could be several such regions in the body from which to find hashes.
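  • The delimited-region hashing just described can be sketched as below. This is an assumption-laden illustration: the tag names come from the text, but the regex extraction and SHA-256 choice are ours.

```python
import hashlib
import re

# Extract every <dohash>...</dohash> region, as described in the text.
DOHASH_RE = re.compile(r"<dohash>(.*?)</dohash>", re.DOTALL)

def hashes_of_delimited_regions(body: str) -> list:
    """Return the hash of each delimited region of a message body,
    for comparison against the company's hash list from the Aggregator."""
    return [
        hashlib.sha256(region.encode("utf-8")).hexdigest()
        for region in DOHASH_RE.findall(body)
    ]

body = "Hello.<dohash>Pay $10 to renew.</dohash> Bye.<dohash>Ref 77.</dohash>"
print(hashes_of_delimited_regions(body))  # two hashes, one per region
```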
  • The plug-in can also validate or invalidate a website that the user is viewing, now also using the above methods. This is useful if Someco has another website, perhaps of a subsidiary, and it wants to programmatically reassure the user that web pages, and parts thereof, of that website are valid.
  • Tag delimiters can be used by the plug-in to find hashes of parts of a page, which are then compared to a list of valid hashes from the Aggregator. Plus, if material is offered for download, it could be downloaded to memory and hashed, and the hash then checked against the list of valid hashes.
  • WS-Security (or variations of it) can be used to authenticate a BPEL message. It is designed for heavyweight encryption.
  • Our methods here can be used to validate portions of the message, via hashing and comparing to hash lists from an Aggregator. This extends our Web Services method of "2640", which discussed validating links in a WS document.
  • The Aggregator can also use blacklists found by having spiders crawl the web and analyze websites for porn content, possibly correlating this data with that in another Electronic Communication Modality, like messages, in order to aid classification of the websites.
  • The Aggregator can also use blacklists found by querying search engines and crawling the domains of the results.
  • The queries might be vernacular sexual terms, in various languages. It is well known that many commercial websites, of whatever …
  • The Aggregator can also use blacklists found from antivirus methods.
  • The Aggregator can also use blacklists found from antiphishing methods, which might include the methods of our Antiphishing Provisionals.
  • The Aggregator can then disseminate this to a plug-in that runs in the browser on a network node.
  • The plug-in is produced by the Aggregator, and quite possibly freely given out to users to install on their browsers.
  • This dissemination of the blacklist can be via an unencrypted communication (e.g. http instead of https), which reduces the computational load on both the Aggregator and the browser. Plus, because the blacklist, or updates to it, are read-only as far as the clients are concerned, it is possible to have a distributed group of Aggregator mirror sites that publish the data. Thus, the Aggregator can scale globally.
  • While we use the browser as an example on the client side, in general it could also be any program that can display a message in a markup language that has hyperlinks, and where the program can follow a hyperlink if the user selects it. In this event, there could be another version of our plug-in that runs within a given such program.
  • When the plug-in is installed, it can register itself with the Aggregator in order to receive timely updates of a blacklist. During this installation, the Aggregator may require the client to pay a license fee. Typically, the person furnishing the fee controls various parameters of the plug-in, via a password. Specifically, this can include being the only person who can modify the blacklist. Suppose the person is a parent. Then she can prevent her children from adjusting the blacklist.
  • The plug-in would obtain a full blacklist from the Aggregator. Then, on some regular basis, it might obtain updates to the blacklist, and occasionally another full blacklist.
  • A network address, e.g. a URL.
  • The plug-in would take that address and reduce it to a base domain. For example, "http://www.trythisnow.co.uk" would produce "trythisnow.co.uk".
  • The plug-in would then compare the base domain against those in its blacklist. If there is no match, then the browser goes to that URL as usual. But if the domain is in the blacklist, then the plug-in prevents the browser from doing so. Possibly, the plug-in might display some message to this effect in the browser or in a popup window.
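  • The base-domain reduction and blacklist check can be sketched as follows. This is a simplification: a real implementation would need the full Public Suffix List to reduce arbitrary hostnames correctly; here a few multi-part suffixes are hard-coded for illustration, and the blacklist contents are assumed.

```python
from urllib.parse import urlparse

# A toy blacklist; the plug-in would download the real one from the Aggregator.
BLACKLIST = {"trythisnow.co.uk"}

# Hard-coded stand-in for the Public Suffix List.
TWO_PART_SUFFIXES = {"co.uk", "com.cn", "com.au"}

def base_domain(url: str) -> str:
    """Reduce a URL to its base domain, e.g. the example from the text."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    if len(parts) >= 3 and ".".join(parts[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(parts[-3:])
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def allow(url: str) -> bool:
    """Return False (block) if the URL's base domain is blacklisted."""
    return base_domain(url) not in BLACKLIST

print(base_domain("http://www.trythisnow.co.uk"))  # trythisnow.co.uk
print(allow("http://www.trythisnow.co.uk"))        # False: blocked
print(allow("http://example.com/page"))            # True
```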
  • The plug-in gets this list of hashes as its blacklist.
  • When it gets a user-chosen base domain, it hashes the domain and then compares the hash with the blacklist, as before. The point is that the user cannot distinguish the random hashes from the non-random hashes. Thus, if we detect any of our random hashes in a competitor's blacklist hash table, it is a strong indication of unauthorized usage.
  • Given a user-defined base domain, the plug-in would see if it is in the hash table for unhashed domains. If so, it blocks the domain. Else, it applies any such watermarking step as was described above, and hashes the base domain. Then it checks this against the hash table for hashed domains. If present, it blocks the domain. Else it lets the browser go to the domain.
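  • The two-table lookup just described can be sketched as below. The table contents, including the random "watermark" entry, are assumed for illustration; the hash choice is ours.

```python
import hashlib

# Toy tables; the Aggregator would supply these. Some entries in the
# hashed table may be random "watermark" hashes, indistinguishable from
# real ones, used to detect copying by competitors.
UNHASHED_TABLE = {"badwords.com"}
HASHED_TABLE = {
    hashlib.sha256(b"trythisnow.co.uk").hexdigest(),
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",  # watermark
}

def is_blocked(base: str) -> bool:
    """First consult the plain-text table, then the hashed table."""
    if base in UNHASHED_TABLE:
        return True
    return hashlib.sha256(base.encode()).hexdigest() in HASHED_TABLE

print(is_blocked("badwords.com"))      # True (plain-text table)
print(is_blocked("trythisnow.co.uk"))  # True (hashed table)
print(is_blocked("example.com"))       # False
```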
  • There can be separate hash tables, whether the data is hashed domains or unhashed domains.
  • The Aggregator could also offer other types of blacklists found by this means, or possibly also by other means.
  • One type would be for porn, as already discussed.
  • The plug-in might merge all such blacklists into one group. Or, preferably, it maintains these as different blacklists, so that if a user tried to access a phishing website, for example, the plug-in might indicate what type of website is being blocked.
  • The plug-in can also optionally run other tests on a network address that the user wants to go to. These might include scanning the text of the address for "bad words". Or perhaps letting the person who registered the plug-in with the Aggregator add certain addresses as forbidden, or provide a whitelist of addresses that should be accessible even if they appear on a blacklist from the Aggregator.
  • The plug-in can preferably record certain types of information and then periodically upload these to the Aggregator. This could aid in the further analysis and detection of undesired websites.
  • The plug-in can record certain information, like the forbidden domain and the current URL (or its base domain) that the browser is accessing, if any. (The browser might not currently be at any URL.) Plus maybe the time when this was done.
  • The plug-in could also record whether the forbidden address was in a link that the user picked, or whether the user typed it into the address bar. If the former, then the plug-in might also record any other links (or the base domains of these links) in the page.
  • The plug-in could analyze the current page being shown by the browser, and apply the canonical steps of "0046" and "1174" to find the page's styles. These styles can be collectively and compactly held as one long word (64 bits). Or, if the page is recognized by the plug-in as belonging to a known set of message providers, with known boundaries for a currently displayed message ("2528"), then the styles might be found for that message, as opposed to the entire page.
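  • Packing styles into one 64-bit word might look like the sketch below. The actual style set of "0046" and "1174" is not specified here, so these flag names are purely illustrative assumptions; only the bit-packing idea comes from the text.

```python
# Illustrative page-style flags (assumed names, not from the patent).
STYLE_FLAGS = ["has_form", "has_password_field", "has_external_images",
               "has_hidden_text", "uses_https", "has_iframe"]

def styles_to_word(styles: set) -> int:
    """Pack the set of detected styles into one compact integer:
    one bit per flag, so up to 64 flags fit in a 64-bit word."""
    word = 0
    for bit, name in enumerate(STYLE_FLAGS):
        if name in styles:
            word |= 1 << bit
    return word

print(hex(styles_to_word({"has_form", "has_iframe"})))  # 0x21
```

Such a word is cheap to cache locally and to upload to the Aggregator in bulk.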
  • These cached results could be uploaded to the Aggregator.
  • This uploading might be initiated by either party.
  • The person who registered the plug-in has the ability to turn off such uploading.
  • The Aggregator can get results across a wide range of its clients. Such information can be very useful.
  • Suppose badwords.com is in a blacklist.
  • The Aggregator finds that for 30% of the time that access is desired for badwords.com, the current URL's base domain is someSearchEngine.com; 50% of the time, the base domain is someMailProvider.com; 10% of the time, the base domain is anotherMailProvider.com; and the remaining 10% of the time, there were various other domains. This suggests that badwords.com is advertising itself with spam sent to the two mail providers, and that many users are using someSearchEngine to find it.
  • badwords.com is probably either a paid advertiser or it is high in the results for some search query. If the latter is true, badwords.com might be manipulating the search engine's methods, perhaps via the use of link farms. Typically, a search engine company wants to know this, because such activities degrade the efficacy of its results.
  • badwords.com also appears to be sending spam to users at the two mail providers. This information may have value to the providers. Note that, in general, those mail providers are different from any that are using the methods of our Antispam Provisionals to analyze their messages, and from whom we derived the blacklist. So they might not be blocking badwords.com, whereas our mail providers already are, or will shortly be. Thus, we could offer such information to those other mail providers to suggest how they could improve their services to their customers.
  • The Aggregator can surveil the distribution methods that a website is using. It may want to increase the percentage of times that badwords.com is accessed by typing into the address bar, because this manual mode is not as convenient for the user, so it might act to lower the absolute number of attempts to reach the website. Plus, suppose that for the 50% of the time that users were at someMailProvider.com when trying to reach badwords.com, 80% of these instances had users typing it, instead of clicking on a link. This suggests that they are reading a message containing "badwords.com" in the text, or in an image in the message. Hence, we can add extra heuristics to detect such techniques.
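  • The kind of aggregate analysis described above can be sketched as follows. The record format (blocked domain, referrer base domain, whether the address was typed) is an assumption; the domains echo the example in the text.

```python
from collections import Counter

# Hypothetical anonymized upload records from many plug-ins:
# (blocked_domain, referrer_base_domain, was_typed)
records = [
    ("badwords.com", "someSearchEngine.com", False),
    ("badwords.com", "someMailProvider.com", True),
    ("badwords.com", "someMailProvider.com", True),
    ("badwords.com", "anotherMailProvider.com", False),
]

# Tally where users were when they tried to reach the blocked domain.
by_referrer = Counter(r[1] for r in records if r[0] == "badwords.com")
total = sum(by_referrer.values())
for ref, n in by_referrer.most_common():
    print(f"{ref}: {100 * n // total}%")
# someMailProvider.com: 50%
# someSearchEngine.com: 25%
# anotherMailProvider.com: 25%
```

The same tally, split on the was_typed flag, would surface the "80% typed at someMailProvider.com" pattern discussed in the text.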
  • Suppose the plug-in also were to record the URL, or its base domain, for where a user went after she tried to reach a blocked URL, assuming that this new base domain was not also in a blacklist. If the blocked domain was a porn domain, then perhaps she is still intent on reaching similar domains. So the new unblocked domain might also be a porn domain. Hence, at the Aggregator, by analyzing such unblocked domains, we might get extra coverage of new, hitherto unknown porn domains.
  • If the Aggregator gets styles and domains uploaded, then it can construct style and domain clusters using the methods of "1745". Hence, our earlier methods that were applied to analyze clusters can also be used here.
  • When the Aggregator gets such information from its clients, it may, as a matter of policy or regulation, apply various anonymizing steps to the information, to protect the privacy of its customers. For example, it can discard any specific items in the uploaded information that can uniquely identify the computer from which it came. As can be seen from the above example, much of its analysis is useful only in an aggregate sense anyway.
  • A general purpose search engine can be manipulated by fraudsters, who set up websites purporting to offer goods and services. Then, via widely known Search Engine Optimization methods, they can pump up their unpaid rankings when a search engine returns answers to a query. These methods might involve the use of link farms. Or the fraudsters might buy ad space on the engines, where the ads might be associated with particular key words, so that when a user queries with those words, the fraudsters' websites are shown as clickable ads. In both cases, the intent is to persuade the user to go to a fraudster's website. Here, the user is typically induced into entering her credit card information, or other personal data, in order to buy an item. Then, several things might happen:
  • The fraudster might also make unauthorized purchases against her card. Worse, if the fraudster got enough personal information, she might …
  • Search engines try to automate this as much as possible, to reduce their personnel costs. Basically, virtually any entity can buy ad space, so long as it pays the search engine, with this transaction often fully automated.
  • Another partial answer is to join a website open to the public, where members institute feedback on each other regarding transactions.
  • The best known examples are eBay, Amazon Marketplace and Yahoo Auctions. There are still problems here, with a certain level of fraudulent transactions.
  • Each of the VSE's surveyed websites can maintain its own internal search engine.
  • Our VSE is validated because we also integrate it with the methods of our Antiphishing Provisionals.
  • By an Aggregator, we also include the possibility of a subAggregator.
  • The scope of the VSE can be enhanced in several ways.
  • Consider a company that is a client of the Aggregator, so its pages validate (or at least do not invalidate).
  • Suppose it has franchisees. Typically, it has extensive knowledge of their corporate history.
  • The company can, in effect, sponsor some of its financially stronger franchisees to be clients of the Aggregator, if they are not already so.
  • This is useful because a franchisee might be confined to one locality, and the Aggregator might have no a priori substantive knowledge of it.
  • An optional but preferred method would be for the sponsoring company, or the affiliates that it sponsors, to post a bond or take out an insurance policy against such an eventuality.
  • The VSE could also refer to a b2b portal or trade association, and its main corporate members, where a similar mechanism of requiring a bond or insurance policy could be applied. It can be seen that such a bond or insurance ultimately benefits those who pay it, because it helps act as a barrier against fraudsters.
  • The VSE might explicitly abjure ever having the broad scope of a general search engine, which means that a user searching for an obscure or rare item might be less likely to find it in the VSE. But if, say, major retailers join the VSE, then the most commonly sold items could be found via the VSE, with strong antifraud confidence on the user's part.
  • The VSE and the associated companies can be reinforced with financial incentives given by credit card issuers to users who purchase online at the VSE's companies (as contrasted to online purchases elsewhere). Such a purchase might be made by the actual cardholder. Or she might have shopped at a fake website, whose phisher then used that information to make purchases at a VSE company. For a single purchase, the merchant may not be able to immediately distinguish between the two cases. But, over time, users should learn that shopping at a VSE company involves much less chance of fraud than at other websites. It is to a card issuer's benefit to reinforce this behavior, since the issuer also suffers fewer losses.
  • The search results can also include advertising, preferably clearly indicated as such and distinct from the regular search results.
  • The advertisers should be restricted to only those companies that have been validated by the Aggregator. Outside companies should not be allowed to advertise, as this allows an attack by phishers.
  • The Aggregator can associate an email address with that plug-in. Over time, the Aggregator might build up a history for each registered user, which could include a record of what types of items the user searches the VSE for, as well as the network address (IP address), or range of network addresses, of the user's computer when the user contacts the VSE.
  • A VSE search by a registered user might have contextual information, external to the specific query.
  • The VSE can also permit searches by users without plug-ins. But it could also ask these users to register with it.
  • Advertisers might bid against each other for key words, so that when a user types in such a word or phrase, the ads that appear are from those advertisers bidding the highest for the word. As is well known, this is already being performed on general purpose search engines like Google and Yahoo.
  • Here, the advertisers that can bid are restricted to the Aggregator's validated clients, where these have been subject to high scrutiny prior to being accepted as clients. This is in contrast to another search engine that might accept ads from virtually any organization or person that can pay it. Note that typically such a search engine makes no claims about validation or credentials of its advertisers.
  • Advertisers can also bid against each other for which type of user is making the query.
  • An advertiser might consider that a registered user with a plug-in is of higher value to it than other types of users, possibly because the user might be more confident about entering into a real transaction, since she has our plug-in to detect scam websites and messages. So the Aggregator's user data can offer more value to advertisers, and thus extract another revenue source.
  • Advertisers can also bid against each other for which domains, IP ranges, or geographic or political regions a user is, or is not, from. This location might be that where a user is currently at when making the query to the VSE. Or, for registered users, it might be the domain (or the corresponding IP address of the domain) in the user's email address, or the geographic or political region given in the user's information about herself when she registered. For a registered user, where there might be a difference between a region in her information and the one where she is currently communicating from, the VSE might have some policy to distinguish between the two.
  • The bidding mechanism can take into account factors other than just a simple comparison of monetary amounts. For example, an advertiser, Kappa, might be the "favored" advertiser when the user comes from a *.ca domain. Prior to the auction, Kappa "bought" this right from the VSE. All this occurs on the VSE's computers, and in general the VSE has leeway to impose such conditions. So an implementation of this example might be that when Kappa bids, others must exceed her bid by some amount or percentage. This is just one example of how bids might be weighted in order to implement some policy. Clearly, there could be an arbitrarily complex mathematical implementation of a policy.
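  • The "favored advertiser" rule can be sketched as below. The premium value (20%) and tie-breaking details are assumptions for illustration; only the exceed-by-some-percentage idea comes from the text.

```python
# Sketch: a challenger must exceed the favored bidder's bid by a
# premium (here 20%, an assumed figure) to win the ad.
FAVORED = "Kappa"
PREMIUM = 1.20

def winner(bids: dict) -> str:
    """Return the winning advertiser under the favored-bidder policy."""
    favored_bid = bids.get(FAVORED, 0.0)
    challengers = {n: b for n, b in bids.items()
                   if n != FAVORED and b >= PREMIUM * favored_bid}
    if challengers:
        return max(challengers, key=challengers.get)
    return FAVORED if FAVORED in bids else max(bids, key=bids.get)

print(winner({"Kappa": 10, "Omega": 11}))  # Kappa: 11 is below the 12 threshold
print(winner({"Kappa": 10, "Omega": 13}))  # Omega
```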
  • Kappa might be the favored advertiser when there is a registered user with a plug-in who is also coming from a *.ca domain.
  • Suppose Kappa bids 10 on a user coming from a *.ca domain, and Omega bids 5 and Psi bids 4. Then Kappa's probability of winning the bid is defined to be 10/(10+5+4) = 10/19; that is, the relative weighting of Kappa's bid over all the bids. And if Kappa wins, then Kappa pays 10. Analogous statements hold for the others. Qualitatively, a mechanism like this lets advertisers with small budgets still have some chance of buying ads.
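  • This proportional ("lottery") auction can be sketched as below, using the bids from the example. The simulation confirms that Kappa's empirical win rate approaches 10/19.

```python
import random

# Proportional auction: each advertiser's chance of winning equals
# its bid divided by the sum of all bids. Bids are from the example.
bids = {"Kappa": 10, "Omega": 5, "Psi": 4}

def pick_winner(bids: dict, rng: random.Random) -> str:
    names = list(bids)
    return rng.choices(names, weights=[bids[n] for n in names], k=1)[0]

rng = random.Random(0)
trials = 100_000
kappa_wins = sum(pick_winner(bids, rng) == "Kappa" for _ in range(trials))
print(kappa_wins / trials)  # close to 10/19 ≈ 0.526
```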
  • Suppose the VSE sends a query to one of its clients' search engines, where the returned results are not advertising.
  • The VSE can offer extra value to its clients by also sending suitably anonymized data about the user, like the domain where the user is coming from, or the user's preferences.
  • A registered user can opt out, by informing the VSE not to pass even such anonymized information about herself to a client search engine.
  • B may feel a need to be defensive, given that the user got a message from B, or is at a website of B's.
  • C and D are equally "vulnerable" when their websites or messages validate.
  • the VSE might require that if B bids, then C or D need to bid at least some minimum amount or percentage above B's value, in order to win the ad. Whereas between C and D, if both meet this condition, then to decide between them as to who wins can just be a simple condition of who has the higher bid.
  • if B wins the ad, it may not actually want to place an ad, because the item that was validated is either a message from it, or one of its pages, so an ad might be redundant.
  • WS Web Services
  • API Application Programming Interface
  • WSDL Web Services Description Language
  • BPEL Business Process Execution Language
  • WS may become popular, as their proponents hope, so that there may be millions of computers offering these, just as there currently are millions of websites.
  • Alpha wants to find another WS satisfying certain criteria. To do so, it sends these criteria (or some subset) to a Universal Description, Discovery and Integration [UDDI] service that is usually on another computer.
  • UDDI Universal Description, Discovery and Integration
  • other WS that wish to be found register themselves with the UDDI.
  • the UDDI sees if there is a match between Alpha's query and its database of registered services. If so, then it sends a list of these to Alpha, who then communicates directly with one or more of them.
  • a UDDI might wish to somehow decide whether a WS that presents itself to registration should be accepted. Because a fraudster might be running one of those WS. Imposing a requirement that the WS carry a strong certification from some widely recognized certificate issuing authority may be insufficient. Those authorities might issue certificates after a payment and perhaps a cursory identity check. Typically, such a certificate in a message merely proves that a message can be traced back to a given party. It may not offer much validation about the background of that party.
  • the Aggregator can act as a trusted UDDI. It only registers WS from its clients. There are two benefits to another WS using the Aggregator as a UDDI. Firstly, it gets WS that are highly unlikely to be fraudulent. Secondly, it can communicate with those WS using the lightweight validation described in our Antiphishing Provisionals, if it uses our plug-in. This latter point is optional, though preferred. Suppose it does not have our plug-in. So long as it can access our Aggregator UDDI and receive its replies unchanged, then it could use this context as an implicit validation of subsequent messages from those validated WS. There is still a risk of man-in-the-middle attacks on those later messages. So it should use our plug-in and possibly other measures.
  • the WS Alpha asks another UDDI that is not the Aggregator.
  • the UDDI might check with our Aggregator, in order to only allow registration from the Aggregator's clients. Hence, the Aggregator can obtain another revenue stream.
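A Trusted UDDI of this kind can be sketched minimally as below. The class, its keyword-based matching, and the method names are hypothetical simplifications of a real UDDI inquiry/publish API, shown only to illustrate the restriction to the Aggregator's clients.

```python
class TrustedUDDI:
    """Minimal sketch of a UDDI that registers only the Aggregator's clients."""

    def __init__(self, aggregator_clients):
        self.clients = set(aggregator_clients)  # vetted client companies
        self.registry = {}  # criteria keyword -> set of service endpoints

    def register(self, service, owner, keywords):
        """Accept a Web Service only if its owner is an Aggregator client."""
        if owner not in self.clients:
            return False  # possible fraudster: refuse registration
        for kw in keywords:
            self.registry.setdefault(kw, set()).add(service)
        return True

    def query(self, keywords):
        """Return registered services matching all of the given criteria."""
        hits = [self.registry.get(kw, set()) for kw in keywords]
        return set.intersection(*hits) if hits else set()
```

A WS like Alpha would send its criteria to `query`, and only services from vetted clients can ever appear in the reply.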
  • This method is essentially a whitelist, akin to what has been used in the handling of email for many years. Where here, the whitelist is of known, good URLs, instead of email addresses.
  • the method lacks our Partner Lists and notphish tags and their usages. The method suffers from the following disadvantages, compared to our method:
  • a plug-in might cache Partner Lists for various companies, because these are short, so the bandwidth and storage requirements are minimal. Which helps reduce the overall demands on the server. But suppose there is a whitelist of URLs, and the plug-in caches the URLs that the user has visited, and which the server has said are in the whitelist. The very specific nature of a URL suggests that it is of limited use in a cache, with respect to the user going back to that URL. Essentially, the whitelist method means that the plug-in may have to keep going back to the server for whichever new URL the user is visiting.
  • the whitelist is a list of known, good addresses. It says nothing about the contents of the pages at those addresses.
  • a large company with many pages and many departments (sales, marketing, human relations, accounting etc). Often, different departments would maintain different parts of the website. And this would be done by different people.
  • a phisher could try a network attack against some particular department or its computers. Such that she can modify an existing page, which is listed in the whitelist. By changing that to have a form where the user can submit personal information, and with a link to the phisher's external computer, she can try to redirect such information to herself.
  • Rho whitelists its URLs. There is nothing to prevent another company, who is a client of Gamma, from setting up pages at its whitelisted URLs, that purport to be from Rho, or a marketing partner of Rho. Where this might be done with actual links to Rho, to add verisimilitude.
  • Rho a company
  • Rho whitelist
  • Rho has a veto over the use of its name in other Partner Lists.
  • Less automated. Based on publicly available information, current implementations of the whitelist use a manual scrutiny of some or all of the URLs received at Gamma from the companies. This adds to the cost of running Gamma, and it increases the delay before an URL can enter Gamma's database. Our method can use an automated verification of the Partner Lists that the Aggregator receives, which is cheaper and faster.
  • Less advertising control. The previous item has a negative implication for a company that wants to start a surprise ad campaign, for example, using web pages that it builds in isolation from the network.

Description

SYSTEM AND METHOD FOR EXTENDING AN ANTIPHISHING AGGREGATOR
TECHNICAL FIELD
This invention relates generally to information delivery and management in a computer network. More particularly, the invention relates to techniques for automatically classifying electronic communications and web pages as phishing or non-phishing.
This application claims the benefit of the filing date of U.S. Provisional Application, Number 60/593115, "System and Method for Attacking Malware in Electronic Messages", filed December 12, 2004, which is incorporated by reference in its entirety. It also incorporates by reference in its entirety the U.S. Provisional Application, Number 60/593114, "System and Method of Blocking Pornographic Websites and Content", filed December 12, 2004, and the U.S. Provisional Application, Number 60/593186, "System and Method for Making a Validated Search Engine", filed December 18, 2004.
BACKGROUND OF THE INVENTION
Some phishers introduce by various means viruses that take over computers. These viruses can then issue phishing messages, without these messages coming directly from a computer owned by the phisher. Also, a virus might act as a web server, to be the destination of links in phishing messages. The virus would then forward received information to another computer controlled by the phisher. In both cases, the aim is for the phisher to conceal her presence. Such phishing networks are called "bot nets", where "bot" stands for "robot". The viruses are often called "malware" [aka. "malicious software"]. In this aspect, phishing is merging into the general problem of malware.
In general, computer viruses can spread by various means. A common way is via electronic messages, like email. In these messages, there could be an enclosed attachment. The sender field of the email might suggest a reputable company. The text might say that the attachment is a useful program that you should download to your disk and then run, to install it. For example, the sender might claim to be from the company that wrote the operating system for your computer. But, as with any operating system, periodic patches are needed. So you are urged to install the enclosed patch. Possibly, the message might urge this, saying that the patch fixes a security bug.
Another example is where the sender pretends to be a computer gaming company. The patch will supposedly speed up the response time on your computer, in a multiplayer game, that is played across the Internet. Or, you might be asked to download something, but are told that it is not a program, so you do not have to explicitly install it. Instead, it might be data, supposedly. In some operating systems, possibly in combination with some applications, some bugs might exist that enable the installation of malware without the user having to explicitly install a downloaded entity. Another possibility is that the entity to be downloaded might not be in the message itself. Instead, the message might have a link, say, to some location on the network, containing that entity. Many other variants are possible. Note that, as in these examples, the purported companies need not have anything to do with the financial sector.
Another problem that arose soon after the dissemination of the first browser was that of pornography on the Web. While it had existed on the Internet, prior to browsers and the Web, the ease of use of browsers and the mass uptake led to a proliferation of such websites, and the massive amounts of spam sent by those sites, to induce readers to click on links to the sites.
As the Web has grown, search engines have become indispensable for users. Along with the rise of the Web has been the concomitant rise of e-commerce. This has led to a worsening problem. A general purpose search engine can be manipulated by fraudsters, who set up websites purporting to offer goods and services. Then, via widely known Search Engine Optimization methods, they can pump up their unpaid rankings, when a search engine returns answers to a query. These methods might involve the use of link farms. Or, the fraudsters might buy ad space on the engines, where the ads might be associated with particular key words. So that when a user queries with those words, the fraudsters' websites would be shown as clickable ads. In both cases, the intent is to persuade the user to go to a fraudster's website.
A reference is cited herein:
Antiphishing Working Group, antiphishing.org. http://www.theregister.co.uk/2004/12/03/phishing_survey_towergroup
SUMMARY OF THE INVENTION
The foregoing has outlined some of the more pertinent objects and features of the present invention. These objects and features should be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be achieved by using the disclosed invention in a different manner or changing the invention as will be described. Thus, other objects and a fuller understanding of the invention may be had by referring to the following detailed description of the Preferred Embodiment.
In our earlier Provisionals [see below], we discussed various means of attacking phishing messages and web sites that misrepresent themselves as other companies, where the latter companies are often in the financial sector. From the use of those Provisionals, we can detect most phishing messages with links to those companies. Here, we extend the scope to detecting how phishers might try to take over computers, or to run programs or scripts in a way that might mislead the user.
We describe how to block pornographic websites at a browser, using a blacklist and a plug-in. The blacklist is updated regularly (possibly daily or hourly) from an Aggregator website that uses our Bulk Message Envelope and clustering methods on new groups of electronic messages to find clusters of pornographic websites. This method can also filter other types of electronic interactions, like SMS, IM and junk faxes. Plus, our website can also offer other types of domains for the browser to block. These might be hate websites or financial fraud websites. Using watermarking methods, the Aggregator can detect if competitors copy its blacklists. The plug-in can also upload anonymized reporting information that lets the Aggregator improve its analysis of undesired websites.
We attack fraudsters establishing websites that purport to perform e-commerce, and then registering these with search engines. We show how to make a Validated Search Engine [VSE] by using our antiphishing methods and an Aggregator. The VSE restricts its search to websites of the Aggregator's clients, which are typically well known, large companies. These clients can also sponsor or validate their business partners or franchisees, to build up the scope of the Aggregator's VSE. The VSE can also optionally pass search queries down to its clients' search engines, which confine themselves to the clients' websites and inventory. For Web Services, the Aggregator can act as a UDDI by only registering services from its clients. (It can also furnish such information to other UDDIs.) Thus the Aggregator can be a Trusted UDDI, to help enable commercial Web Services.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 shows the general configuration on a computer network of various elements in our Invention - the Aggregator, its client companies and a browser with a plug-in that gets information from the Aggregator.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
What we claim as new and desire to secure by letters patent is set forth in the following claims.
We described a lightweight means of detecting phishing in electronic messages, or detecting fraudulent web sites in these earlier U.S. Provisionals: Number 60522245 ("2245"), "System and Method to Detect Phishing and Verify Electronic Advertising", filed September 7, 2004; Number 60522458 ("2458"), "System and Method for Enhanced Detection of Phishing", filed October 4, 2004; Number 60552528 ("2528"), "System and Method for Finding Message Bodies in Web-Displayed Messaging", filed October 11, 2004; Number 60552640 ("2640"), "System and Method for Investigating Phishing Websites", filed October 22, 2004; Number 60552644 ("2644"), "System and Method for Detecting Phishing Messages in Sparse Data Communications", filed October 24, 2004.
We will refer to these collectively as the "Antiphishing Provisionals".
Below, we will also refer to the following U.S. Provisionals submitted by us, where these concern primarily antispam methods: Number 60320046 ("0046"), "System and Method for the Classification of Electronic Communications", filed March 24, 2003; Number 60481745 ("1745"), "System and Method for the Algorithmic Categorization and Grouping of Electronic Communications, filed December 5, 2003; Number 60481789, "System and Method for the Algorithmic Disposition of Electronic Communications", filed December 14, 2003; Number 60481899, "Systems and Method for Advanced Statistical Categorization of Electronic Communications", filed January 15, 2004; Number 60521014 ("1014"), "Systems and Method for the Correlations of Electronic Communications", filed February 5, 2004; Number 60521174 ("1174"), "System and Method for Finding and Using Styles in Electronic Communications", filed March 3, 2004; Number 60521622 ("1622"), "System and Method for Using a Domain Cloaking to Correlate the Various Domains Related to Electronic Messages", filed June 7, 2004; Number 60521698 ("1698"), "System and Method Relating to Dynamically Constructed Addresses in Electronic Messages", filed June 20, 2004; Number 60521942 ("1942"), "System and Method to Categorize Electronic Messages by Graphical Analysis", filed July 23, 2004; Number 60522113 ("2113"), "System and Method to Detect Spammer Probe Accounts", filed August 17, 2004; Number 60522244 ("2244"), "System and Method to Rank Electronic Messages", filed September 7, 2004.
We will refer to these collectively as the "Antispam Provisionals".
Our methods apply in general to any modality of electronic messaging. We will refer to email below, as one specific modality.
Our methods are divided into three sections. The first relates to attacking malware in electronic messages. The second relates to blocking pornographic websites and content. The third relates to making a validated search engine. The common theme is to expand the ability of a centralized Aggregator and a browser plug-in that gets and sends data to the Aggregator.
The intent is that the end user, who uses a browser (or equivalent program), is less likely to encounter fraudulent or offensive messages or websites.
1. Attacking Malware in Electronic Messages
Our methods can be used in a plug-in for a browser or any other program that is used to view messages on the recipient's machine. They can be used to validate any part or parts of the message that the plug-in can programmatically find. We use the concept of the <notphish> tag introduced in "2458".
We illustrate by showing two types of parts in email. One is attachments. The other is any scripting. These are important and common cases. Consider an email with an attachment. The sender field says support@someco.com, where Someco is assumed to be some reputable company. The body has a notphish tag that refers to someco.com. We apply the methods in "2245", "2458" and "2528" to the links extracted from the body, and to the sender field. This of course assumes that someco.com is a company that is a client of the Aggregator that the plug-in uses. If the methods say the message is invalid, then the plug-in marks it as such.
But if the methods of the previous 3 Provisionals say the message is valid, we can do the extra steps here. The plug-in might have a default policy that it will ask the Aggregator if authentic messages from Someco will have any attachments. Or, the notphish tag might have an attribute that says this. Hence, the Aggregator might have information given to it by Someco at some earlier time, that says "no". That is, currently, Someco will not issue any messages containing attachments. In this case, the plug-in knows that the message is not valid, without having to perform any analysis on the attachment. It can mark the message as invalid. Plus, if the user tries to download the attachment, the plug-in can issue a warning. Or, in some browsers, even disable the download.
What if Someco does allow attachments in its messages? As with its Partner Lists ("2245"), it can issue a list of hashes for valid attachments, to the Aggregator. In this case, the plug-in has slightly more work to do. It can compute a hash of the attachment, and then see if it is in Someco's hash list, given to it by the Aggregator. Or it could send that hash to the Aggregator, who then makes the comparison, instead of the plug-in having to download the hash list. In either case, if the computed hash is not in Someco's hash list, the plug-in marks the message as invalid.
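The hash comparison above can be sketched as follows. SHA-256 is an assumption (the text does not mandate a particular hash function), and the hash list is assumed to have already been obtained from the Aggregator. The same check serves both attachments and scripts, since a single hash list can hold hashes for both.

```python
import hashlib

def part_is_valid(part_bytes, company_hash_list):
    """Hash a message part held in memory (attachment or script) and
    check it against the company's list of authentic hashes, as supplied
    by the Aggregator. The part is never written to disk or executed."""
    digest = hashlib.sha256(part_bytes).hexdigest()
    return digest in company_hash_list
```

If the digest is not in the list, the plug-in marks the message as invalid without ever running or saving the attachment.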
Regarding the hash computation, the plug-in needs to download the attachment to the browser. But this attachment can be held in memory, without being written to disk. Plus, program control does not have to be passed to the attachment; if it ever is, that can be done only after the hash has been computed and verified as a valid hash for that company. This is an important advantage of our method. Because an attachment might be written in a formatting language that is, in some respects, a programming language in its own right. An example is Adobe Corp.'s PDF language for describing images. If a file in one of these languages is downloaded and run, in some program that understands the file's language (this might also be done in the browser), then there is always the possibility that the file may have "rogue" instructions which subvert the program and thence the computer. Such viewing programs are written to guard against this possibility in files that they run. But it is usually very difficult to mathematically prove that this has been completely done.
Note that our method guards against more than just the possibility of rogue executables. Suppose Someco sends out a message with images. These might be specified as hyperlinks or be inside the message as attachments. If a spammer got this message, she might replace an image with one of her own. If this image were given as a hyperlink, to a non-Someco domain, and that is not also a domain in Someco's Partner List, then our methods of the Antiphishing Provisionals can detect this. But suppose she puts the image into an attachment. Then our earlier methods would not detect it. In this image, she might have, for example, "For a great deal, call 1-800-555-1212", where she would put her phone number, in place of our example number. This text is viewable by a human, but it is just a bitmap in the image, as opposed to actual text. Standard antispam methods cannot extract such text from images. While optical character recognition (OCR) methods could be used, these are typically too computationally intensive to run on every image in a set of messages. And there might be no basis for deciding programmatically which particular images to run these OCR tests on. It has been noticed that some spammers are in fact using such images to conceal text from antispam methods. Our method gives a means of easily detecting if a spammer or phisher does this, without recourse to OCR tests.
In a similar way, if an attachment is video or audio, and a spammer inserts her own, we can also detect this.
Trivially, the plug-in may have received the information about Someco from the Aggregator at some earlier time, and cached it locally.
Our method is lightweight. The hash computation does not require the more computationally intensive public key infrastructure. The same advantage as with our usage of Partner Lists.
Another advantage is with respect to traditional antivirus methods. These take a given set of bits, and apply many analytical methods to discern if it is a virus. Sometimes, these methods might compute signatures (analogous to hashes) from the data, and then compare those against a database of signatures of known viruses. The database is analogous to a blacklist of spammer or phisher domains. There are several problems with this signature comparison approach. Firstly, it can fail against leading edge viruses. That is, viruses that have not yet been found [by whatever means], and so whose signatures are not yet in the database. Secondly, some viruses are polymorphic, to try to defeat such signature extraction.
But, as with our Partner Lists ("2245"), we do not have to perform such intensive analysis. We simply have to compare against a small set of hashes of authentic attachments. This renders such polymorphic countermeasures moot.
Of course, if an attachment were to fail this hash test, the plug-in could optionally forward it or the entire message, to the Aggregator or the company or a regulatory authority, for more intensive analysis, which could include the standard antivirus methods.
Thus far, we have discussed attachments. Our remarks also apply to any scripts that are in the message. Someco might have a flag that says that no scripts should be present in its messages. So if we find scripts, we mark the message as invalid. Or suppose Someco allows scripts. Such scripts have to be able to be run by the browser. Hence, our plug-in can also find these programmatically, and compute a hash for them. That hash can be checked against Someco's hash list. This hash list can have hashes for both scripts and attachments. While it is possible for Someco to maintain 2 hash lists, the hash space is typically so large that there is no need for this. This makes it easier for Someco, the Aggregator and the plug-in. Clearly, our method of hashing the scripts is independent of whatever languages the scripts might be in.
Related to this, the plug-in, or the browser itself, might have a policy that it will only run a script if it has been hashed and verified against a list from a reputable company. If this policy exists, the user might be able to enable or disable it.
Using extra tags, we can also check arbitrary sections of a message body. For example, Someco might put a tag called "<dohash>" at the start of a portion of the body, and a tag "</dohash>" at the end of that portion, for a given message that it actually sends out. Then, if the plug-in sees such tags, it can find the hash of the delimited area and compare it against Someco's hash list, as given by the Aggregator. There could be several such regions in the body, each delimited by such tags.
But why would a phisher put in such tags? Because Someco can inform the Aggregator that in its messages, certain hashes must always be present. In turn, this could be conveyed to the plug-in. Such necessary hashes might come from these delimited regions, or from the attachments or scripts discussed earlier.
If a phisher omits the <notphish> tag, then the message will not validate. If she puts in the tag, due to the nature of hashing, there is very little chance that a fake entry she adds will yield the same hash as a portion of an authentic message. Even if Someco's hashes are public knowledge. So the widespread use of our method may train most users to expect that downloading from a major, reputable company's message should always be accompanied by a valid sign from the plug-in. Making it harder for viruses to propagate through the user community.
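The tag-delimited check, together with the rule that certain hashes must always be present, can be sketched as below. SHA-256 and the exact regular-expression handling of the tags are assumptions for illustration.

```python
import hashlib
import re

DOHASH_RE = re.compile(r"<dohash>(.*?)</dohash>", re.DOTALL)

def required_hashes_present(body, required, valid):
    """Hash every <dohash>-delimited region of the message body.

    Every region's hash must be in `valid` (the company's hash list from
    the Aggregator), and every hash in `required` must actually appear,
    so a phisher cannot simply omit the delimited regions.
    """
    found = {hashlib.sha256(region.encode("utf-8")).hexdigest()
             for region in DOHASH_RE.findall(body)}
    return found <= valid and required <= found
```

A message with the tags stripped fails the `required` test; a message with altered delimited text fails the `valid` test.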
There is also a role for an ISP. It can apply the methods discussed here, prior to the recipients ever viewing their messages. In doing so, it can offer more protection to them.
In our Provisional Patent #60320046 ("0046"), "System and Method for the Classification of Electronic Communications", filed March 24, 2003, we described our antispam methods. These act on a message, reducing it to a canonical form and then finding several hashes from it. Here, we only need to find one hash per data segment, instead of several. Plus, the data segment that is hashed does not need the many canonical steps that we perform in "0046". Most of those were motivated by eliminating spurious invisible randomness introduced by a spammer, to try to make unique instances of a message. Because the author of an authentic message is the company, and it has no need to introduce random variations like spammers do with their messages.
Along the lines of "2458", the plug-in can also validate or invalidate a website that the user is viewing, but now also using the above methods. This is useful if Someco has another website, perhaps of a subsidiary, and it wants to programmatically reassure the user that web pages, and parts thereof, of that website are valid. Tag delimiters can be used by the plug-in to find hashes of parts of a page, which are then compared to a list of valid hashes from the Aggregator. Plus, if material is offered for download, it could be downloaded to memory and hashed, and then the hash is checked against the list of valid hashes.
Our methods can also be extended to Web Services. As these have gotten built out, a language has arisen to describe the business logic, Business Process Execution Language [BPEL]. It is written in XML, and is a programming language in its own right. That is, it can describe complex operations. It is expected that messages written in BPEL will be exchanged between servers involved with Web Services. These messages, or portions thereof, can then be executed by the recipient servers. This can create dangers similar to those described above.
An existing security methodology, WS-Security (or variations of it), can be used to authenticate a BPEL message. It is designed for heavyweight encryption. By contrast, our methods here can be used to validate portions of the message, via hashing and comparing to hash lists from an Aggregator. This extends our Web Services method of "2640", which discussed validating links in a WS document.
Our remarks here about BPEL also apply to any other implementation of Web Services that might use another language to describe business logic.
2. Blocking Pornographic Websites and Content
A problem that arose soon after the dissemination of the first browser was that of pornography on the Web. While it had existed on the Internet, prior to browsers and the Web, the ease of use of browsers and the mass uptake led to a proliferation of such websites, and the massive amounts of spam sent by those sites, to induce readers to click on links to the sites.
One specific problem was that many parents who had home computers connected to the net wanted to prevent their children from accessing such websites, whether by accident or not. Related to this are computers in schools. Plus also computers in public libraries that might be reserved for primarily child usage. Another related usage is in companies that want to prohibit their employees accessing those websites.
This led to an entire niche of products to find and block the websites. Like ContentProtect, CyberSitter, NetNanny, Cyber Patrol, FilterPak, Cyber Sentinel, Cyber Snoop and Child Safe. These use various methods to block a website. While these are proprietary, a common idea appears to be an analysis of an URL's text. So that if this text contains "bad words", then the URL might be blocked. Another method involves a blacklist, where this might be periodically updated across the net. In this case, a problem is how comprehensive the list is, and how up to date it is.
We improve on this. Our methods address both issues.
In our Antispam Provisionals, we described how an ISP or other message provider can apply our methods on electronic messages to find Bulk Message Envelopes (BMEs). From these, we can find the most common domains and classify various domains as pornographic or as other types. An enhancement is to find clusters of metadata using "1745". Specifically, we can find clusters of domains or websites, that are pointed to (i.e. have links to them) in the messages. We have observed that clusters, or subclusters, tend to be domains concerning a common theme, where these clusters or subclusters are found programmatically, not manually. Empirically, a purveyor of home mortgage refinancing or a travel website is unlikely to be in a subcluster with porn domains, for example. From this, we can easily find clusters of porn websites. Even if done manually, the clustering makes this method very efficient. Including the fact that we can order clusters or subclusters by the number of underlying messages that point to them. So we can direct our attention to the clusters that are either largest in number of domains or in number of messages, and then to smaller clusters, in either dimension.
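A minimal sketch of such clustering follows. Here, co-occurrence of linked domains within the same message is used as the merge criterion, and clusters are ordered by the number of underlying messages pointing into them; this is an illustrative simplification, not the actual clustering method of "1745".

```python
from collections import defaultdict

def cluster_linked_domains(messages):
    """Cluster domains that co-occur as links in the same messages.

    messages: a list of sets of domains linked to by each message.
    Returns clusters ordered by how many messages point into them.
    """
    parent = {}  # union-find forest over domains

    def find(d):
        parent.setdefault(d, d)
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for doms in messages:
        doms = sorted(doms)
        for d in doms[1:]:
            parent[find(doms[0])] = find(d)  # merge co-occurring domains
        if doms:
            find(doms[0])  # register singleton domains too

    clusters = defaultdict(set)
    for d in list(parent):
        clusters[find(d)].add(d)

    counts = defaultdict(int)  # messages pointing into each cluster
    for doms in messages:
        for root in {find(d) for d in doms}:
            counts[root] += 1

    return sorted(clusters.values(), key=lambda c: -counts[find(min(c))])
```

The returned ordering lets an analyst direct attention to the clusters backed by the most messages first.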
In a reduction to practice, based on a sample set of 800 000 messages, and given the BMEs derived from these, we found over 1 000 porn domains in thirty minutes of manual analysis. We fully expect that based on more data, and further refinements in our methods, we will be able to find more such domains [if they exist] and in less time.
So we can have a website, call it an Aggregator, that takes BMEs or blacklists found by our method on messages from one or more ISPs or other message providers. The Aggregator would obtain these on a regular basis; perhaps daily. Note that the Aggregator does not need to directly analyze the messages. This could or should be the purview of the message providers.
Optionally, the Aggregator can also use blacklists found by having spiders crawl the web and analyzing websites for porn content. Possibly correlating this data with that in another Electronic Communication Modality, like messages, in order to aid classification of the websites.
Optionally, the Aggregator can also use blacklists found by querying search engines and crawling the domains of the results. For example, the queries might be vernacular sexual terms, in various languages. It is well known that many commercial websites, of whatever type, strive to be returned high in the results of search engines for terms descriptive of their products, and this applies to many porn websites.
Optionally, the Aggregator can also use blacklists found from antivirus methods.
Optionally, the Aggregator can also use blacklists found from antiphishing methods, which might include the methods of our Antiphishing Provisionals.
Given a blacklist, made by combining results from the ways described above, the Aggregator can then disseminate this to a plug-in that runs in the browser on a network node. The plug-in is produced by the Aggregator and quite possibly freely given out to users to install on their browsers.
This dissemination of the blacklist can be via an unencrypted communication (e.g. http instead of https), which reduces the computational load on both the Aggregator and the browser. Plus, because the blacklist, or updates to it, are read-only as far as the clients are concerned, it is possible to have a distributed group of Aggregator mirror sites that publish the data. Thus, the Aggregator can scale globally.
While we use the browser as an example on the client side, in general, it could also be any program that can display a message in a markup language that has hyperlinks, and where the program can follow a hyperlink if the user selects it. In this event, there could be another version of our plug-in, that runs within a given such program.
When the plug-in is installed, it can register itself with the Aggregator in order to receive timely updates of a blacklist. During this installation, the Aggregator may require the client to pay a license fee. Typically, the person furnishing the fee controls various parameters of the plug-in, via a password. Specifically, this can include being the only person that can modify the blacklist. Suppose the person is a parent. Then she can prevent her children from adjusting the blacklist.
Initially, the plug-in would obtain a full blacklist from the Aggregator. Then, on some regular basis, it might obtain updates to the blacklist and occasionally another full blacklist.
When a user types in a network address (e.g. an URL) into the address bar of the browser, or when the user clicks on a hyperlink, the plug-in would take that address and reduce it to a base domain. For example, "http://www.trythisnow.co.uk" would produce "trythisnow.co.uk". The plug-in would then compare the base domain against those in its blacklist. If there is no match, then the browser goes to that URL, as usual. But if the domain is in the blacklist, then the plug-in prevents the browser from doing so. Possibly, the plug-in might display some message to this effect in the browser or in a popup window.
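The reduction of an address to a base domain, and the subsequent blacklist check, can be sketched as follows. This is a minimal illustration in Python; the function names and the tiny suffix table are our own, and a real plug-in would need a full public-suffix list to handle all multi-label suffixes like "co.uk":

```python
from urllib.parse import urlparse

# Stub: a real plug-in would carry a full public-suffix table.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "edu.au"}

def base_domain(url: str) -> str:
    """Reduce a network address to its base domain,
    e.g. http://www.trythisnow.co.uk -> trythisnow.co.uk"""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Keep one extra label when the public suffix itself has two labels.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def may_proceed(url: str, blacklist: set) -> bool:
    """True if the browser may go to the address."""
    return base_domain(url) not in blacklist

blacklist = {"trythisnow.co.uk"}
print(base_domain("http://www.trythisnow.co.uk"))            # trythisnow.co.uk
print(may_proceed("http://www.trythisnow.co.uk", blacklist)) # False
print(may_proceed("http://www.example.com", blacklist))      # True
```

Subdomains such as "www" are stripped, so all hosts under a blacklisted base domain are blocked together.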
It might be seen that thus far, the unique elements we introduce are in the production of the blacklist. But our methods can give more comprehensive results, and more quickly, than alternative means.
We also introduce new features here.
2.1 Preventing Unauthorized Copying
One problem we [the Aggregator] face is the potential misappropriation of our blacklist by a competitor. We expect that our competitors will register with us, to obtain our latest blacklists. Likewise, we can register with any of our competitors that use a blacklist. We do so in order to find any such misappropriation.
With our blacklist, we can do one of two things. Firstly, we can hash each entry at the Aggregator. Plus, we introduce random hashes into the blacklist, and we record which hashes these are. We can use a publicly available, common and reliable hashing method, like SHA-1. Nothing we describe here depends on the precise choice of hashing method.
Then the plug-in gets this list of hashes as its blacklist. When it gets a user chosen base domain, it hashes the domain and then compares the hash with the blacklist, as before. The point is that the user cannot distinguish the random hashes from the non-random hashes. Thus, if we detect any of our random hashes in a competitor's blacklist hash table, then it is a strong indication of unauthorized usage.
We can strengthen this with another optional but preferred step. Suppose we have determined that a domain is a porn domain. We hash it and put it into our blacklist. But a competitor who also uses a blacklist of hashes might also have the same hash. Perhaps because it has, independently of us, determined that the domain is a porn domain. And it and we have chosen the same common hash method, so we both got the same hash. Or, it could have copied the hash from our blacklist.
These two cases can be distinguished in the following way. On every domain that we want to hash, we can apply a simple mathematical step that is quick to perform, and which leaves two different strings still different from each other afterwards. This step has no other significance, and acts as a watermark. For example, we might do a one's complement of the first three bytes, or reverse the order of the first two bytes, or add a fixed value (nonce) to the string. Then, the plug-in would do the same step to its base domain, before hashing. Of course, a competitor who was protecting its own blacklist might do similarly. But there is no reason for it to choose the same watermarking steps as us.
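One of the watermarking steps named above, swapping the first two bytes before hashing, can be sketched as follows. The transform chosen is illustrative; any cheap step that keeps distinct strings distinct would do:

```python
import hashlib

def watermark(domain: str) -> bytes:
    """Illustrative watermark: swap the first two bytes.
    This is a permutation of the input, so distinct strings
    stay distinct after the step."""
    b = domain.encode()
    return bytes([b[1], b[0]]) + b[2:] if len(b) >= 2 else b

def watermarked_hash(domain: str) -> str:
    return hashlib.sha1(watermark(domain)).hexdigest()

# A competitor hashing the same domain without our watermark gets a
# different hash, so identical hashes now strongly imply copying.
plain = hashlib.sha1(b"badwords.com").hexdigest()
print(plain != watermarked_hash("badwords.com"))  # True
```

The plug-in applies the same `watermark` step to a user-chosen base domain before hashing, so lookups against the published table still match.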
Thus we can, to high confidence, ensure that our hashes are unique.
Alternatively, we might not hash our domains at the Aggregator. Instead, we transmit these as clear text to the plug-in. We can then put fake domains and IP addresses into the list. And then search for these in a competitor's list to test for misuse.
We also include the case where the Aggregator transmits two lists to the plug-in. One of hashed domains, the other of unhashed domains. The plug-in would then have two hash tables.
Given a user defined base domain, the plug-in would see if it is in the hash table for unhashed domains. If so, then it blocks the domain. Else it applies any such watermarking step as was described above, and hashes the base domain. Then it checks this against the hash table for hashed domains. If present, then it blocks the domain. Else it lets the browser go to the domain.
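The two-table check just described can be condensed into a few lines. The helper names are ours; `hash_fn` is assumed to already fold in any watermarking step:

```python
import hashlib

def blocked(base: str, plain_table: set, hashed_table: set, hash_fn) -> bool:
    """Two-step check: the clear-text table first, then the hashed one.
    hash_fn is assumed to apply any watermark step before hashing."""
    return base in plain_table or hash_fn(base) in hashed_table

sha1_hex = lambda d: hashlib.sha1(d.encode()).hexdigest()
plain_table = {"badwords.com"}
hashed_table = {sha1_hex("trythisnow.co.uk")}

print(blocked("badwords.com", plain_table, hashed_table, sha1_hex))      # True
print(blocked("trythisnow.co.uk", plain_table, hashed_table, sha1_hex))  # True
print(blocked("example.com", plain_table, hashed_table, sha1_hex))       # False
```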
Notice that at the plug-in, we use hash tables, whether the data is hashed domains or unhashed domains. This gives fast lookup, effectively constant time on average, regardless of the number of entries.
2.2 Other Types of Blacklists
Above, we described how our methods of earlier Provisionals could find clusters and subclusters of domains that tend to focus on one type of business or topic. Hence, the Aggregator could also offer other types of blacklists found by this means, or possibly also by other means. One type would be for porn, as already discussed. Another might be for phishing or financial fraud websites. Another might be for hate websites. There could be other types.
The plug-in might merge all such blacklists into one group. Or, preferably, maintain these as different blacklists. So that if a user tried to access a phishing website, for example, the plug-in might indicate what type of website is being blocked.
2.3 Other Tests
The plug-in can also optionally have other tests on a network address that the user wants to go to. These might include scanning the text of the address for "bad words". Or perhaps letting the person who registered the plug-in with the Aggregator be able to add certain addresses as forbidden. Or to provide a whitelist of addresses that should be accessible, even if they appear on a blacklist from the Aggregator.
2.4 Reporting Services
It is possible for the plug-in to record certain types of information and then periodically upload these to the Aggregator. This could aid in the further analysis and detection of undesired websites.
For example, when the user clicks on a link that is in a blacklist, or types an address with a base domain in a blacklist, then the plug-in can record certain information like the forbidden domain, and the current URL (or its base domain) that the browser is accessing, if any. (The browser might not currently be at any URL.) Plus maybe the time when this was done. The plug-in could also record whether the forbidden address was in a link that the user picked or if the user typed it into the address bar. If the former, then the plug-in might also record any other links (or the base domains of these links) in the page.
Another possibility is that the plug-in could analyze the current page being shown by the browser, and apply the canonical steps of "0046" and "1174" to find the page's styles. These styles can be collectively and compactly held as one long word (64 bits). Or, if the page is recognized by the plug-in as belonging to a known set of message providers, with known boundaries for a currently displayed message ("2528"), then the styles might be found for that message, as opposed to the entire page.
Then, at some future time, these cached results could be uploaded to the Aggregator. This uploading might be initiated by either party. Optionally, but preferably, the person who registered the plug-in has the ability to turn off such uploading.
The Aggregator can get results, across a wide range of its clients. Such information can be very useful. Suppose badwords.com is in a blacklist. The Aggregator finds that for 30% of the time that access is desired for badwords.com, the current URL's base domain is someSearchEngine.com. While 50% of the time, the base domain is someMailProvider.com and for 10% of the time, the base domain is anotherMailProvider.com. And the remaining 10% of the time, there were various other domains. This suggests that badwords.com is advertising itself with spam sent to the two mail providers, and that many users are using someSearchEngine to find it.
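The aggregation above amounts to a frequency count over uploaded reports. A minimal sketch, with synthetic reports matching the percentages in the example (the report tuple shape is our own assumption):

```python
from collections import Counter

def referrer_distribution(reports, forbidden):
    """Fraction of blocked-access reports for `forbidden`, grouped by
    the base domain the browser was at when access was attempted.
    Each report is a (forbidden_domain, current_base_domain) pair."""
    hits = [cur for dom, cur in reports if dom == forbidden]
    return {dom: n / len(hits) for dom, n in Counter(hits).items()}

# Synthetic reports matching the percentages in the example above.
reports = ([("badwords.com", "someSearchEngine.com")] * 3 +
           [("badwords.com", "someMailProvider.com")] * 5 +
           [("badwords.com", "anotherMailProvider.com")] * 2)
dist = referrer_distribution(reports, "badwords.com")
print(dist["someMailProvider.com"])   # 0.5
print(dist["someSearchEngine.com"])   # 0.3
```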
This information could have value to someSearchEngine. Badwords.com is probably either a paid advertiser or it is high in the results for some search query. If the latter is true, badwords.com might be manipulating the search engine's methods, perhaps via the use of link farms. Typically, a search engine company wants to know this, because such activities degrade the efficacy of its results.
Now consider that badwords.com also appears to be sending spam to users at the two mail providers. This information may have value to the providers. Note that in general, those mail providers are different from any that are using our methods of our Antispam Provisionals to analyze their messages, and from whom we derived the blacklist. So they might not be blocking badwords.com, whereas our mail providers already are, or will shortly be. Thus, we could offer such information to those other mail providers to suggest how they could improve their services to their customers.
More broadly, the Aggregator can surveil the distribution methods that a website is using. It may want to increase the percentage of times that badwords.com is accessed by typing into the address bar. Because this manual mode is not as convenient for the user, it might act to lower the absolute number of attempts to reach the website. Plus, suppose that for the 50% of the time that the users were at someMailProvider.com when trying to reach badwords.com, 80% of these instances had users typing it, instead of clicking on a link. This suggests that they are reading a message containing "badwords.com" in the text, or in an image in the message. Hence, we can add extra heuristics to detect such techniques.
Suppose now the plug-in also were to record the URL or its base domain, for where a user went, after she tried to reach a blocked URL, assuming that this new base domain was not also in a blacklist. If the blocked domain was a porn domain, then perhaps she is still intent on reaching similar domains. So the new unblocked domain might also be a porn domain. Hence, at the Aggregator, by analyzing such unblocked domains, we might get extra coverage of new, hitherto unknown porn domains.
If the times at which users attempt to reach blocked sites are recorded, this can also be of use. It is well known that the clock on a computer can differ considerably from the actual time. But when the plug-in uploads information, it could also upload what it considers the current time to be, so that the Aggregator can use this to correct any time information.
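The correction is simple arithmetic: the skew between the client's self-reported "now" and the server's clock at upload time is applied to every event timestamp. A sketch, with illustrative values:

```python
from datetime import datetime, timedelta, timezone

def correct_timestamps(event_times, client_now, server_now):
    """Shift client-reported event times by the estimated clock skew,
    derived from the client's self-reported 'now' at upload time."""
    skew = server_now - client_now
    return [t + skew for t in event_times]

server_now = datetime(2005, 12, 1, 12, 0, tzinfo=timezone.utc)
client_now = server_now - timedelta(minutes=7)   # client clock runs 7 min slow
events = [client_now - timedelta(hours=1)]       # as recorded by the client
corrected = correct_timestamps(events, client_now, server_now)
print(corrected[0] == server_now - timedelta(hours=1))  # True
```

This assumes the skew is roughly constant between the event and the upload, which is adequate for the aggregate timing analysis described below.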
Now, on an aggregate basis, if we look at a given website, especially one involved in phishing, then such websites often have a transient existence. If the website's activity is fraudulent, it may have only a brief uptime, before its ISP might take it down. Or before the computer's owner does so, if the computer was hijacked by a virus sent by the phisher. Hence the phisher wants as many responses as possible, before her website is removed. By looking at the time distribution of attempts to reach it, we may be able to infer how many attempts were successful, at computers that are not using our plug-in described here, or not using our antiphishing plug-in of our Antiphishing Provisionals. Or where those users were reading phishing messages sent to message providers that were not implementing our server side antiphishing methods. Plus, from the time distribution, we may also be able to infer if our blacklist generation or our antiphishing methods need to respond quicker, and to get a quantitative estimate of how much quicker.
If the Aggregator gets styles and domains uploaded, then it can construct style and domain clusters using the methods of "1745". Hence, our earlier methods that were applied to analyze clusters can also be used here.
Purveyors of websites that we are attempting to block might try countermeasures, like uploading false data. But to do that, they first have to register as clients of the Aggregator, and make payments. Costly to them. Plus, we can implement routines on the Aggregator that take uploaded data and apply simple tests, to weed out false data. And then, by relying on bulk results, we gain further statistical protection.
When the Aggregator gets such information from its clients, it may as a matter of policy or regulation apply various anonymizing steps to the information, to protect the privacy of its customers. For example, it can discard any specific items in the uploaded information that can uniquely identify the computer from which it came. As can be seen from the above example, much of its analysis is useful only in an aggregate sense anyway.
3. Making a Validated Search Engine
In our previous Antiphishing Provisionals, we discussed various means of attacking phishing messages and websites that misrepresent themselves as other companies, where the latter companies are often in the financial sector. From the use of those Provisionals, we can detect most phishing messages with links to those companies. Here, we extend the scope to showing how we can build a trusted search engine to facilitate e-commerce.
As the Web has grown, search engines have become indispensable for users. Along with the rise of the Web has been the concomitant rise of e-commerce. This has led to a worsening problem. A general purpose search engine can be manipulated by fraudsters, who set up websites purporting to offer goods and services. Then, via widely known Search Engine Optimization methods, they can pump up their unpaid rankings when a search engine returns answers to a query. These methods might involve the use of link farms. Or, the fraudsters might buy ad space on the engines, where the ads might be associated with particular key words. So that when a user queries with those words, the fraudsters' websites would be shown as clickable ads. In both cases, the intent is to persuade the user to go to a fraudster's website. Here, the user is typically induced to enter her credit card information, or other personal data, in order to buy an item. Then, several things might happen:
1. The user does not get anything.
2. The good or service she gets is not what she expected, or its value is less than what she paid.
3. She gets what she paid for.
For any of the above, and especially for #3, the fraudster might also make unauthorized purchases against her card. Worse, if the fraudster got enough personal information, she might then impersonate the user, by applying for credit in the user's name, for example.
Ironically, for unpaid search results, the more comprehensively a search engine spiders the Web, the more vulnerable it might be to such manipulations. While for paid results, search engines try to automate this as much as possible, to reduce their personnel costs. Basically, virtually any entity can buy ad space, so long as it pays the search engine, with this transaction often fully automated.
But as e-commerce rises, a means of combating such fraud, while still letting users search for a wide range of desired items, would be highly desirable. One partial answer is for a user to go to a particular website and sign up as a member, where the website institutes some type of vetting of its members. Various business-to-business [b2b] portals might fall in this category. But such portals might not be easily accessible (if at all) to the general public. As their name suggests, membership may be restricted to companies. Plus, the items or quantities offered by the portal's members might be of little use to an individual. (One tonne of soybeans? One kilotonne of wheat?)
Another partial answer is to join a website open to the public, where members institute feedback on each other regarding transactions. The best known examples are eBay, Amazon Marketplace and Yahoo Auctions. There are still problems here, with a certain level of fraudulent transactions.
We propose the construction of a Validated Search Engine [VSE]. Firstly, it combines aspects of a search directory and a search engine. It is a search directory, because it has a collection of websites, whose structure can be optionally but preferably set by manual classification of its websites. There are many such search directories on the Web. But most do not scan the contents of pages at the websites they cover. Our VSE does so. In this respect it is both a directory and an engine. It can have multiple concurrent hierarchies, based on various criteria, like region or industry.
Also, optionally, but preferably, each of the VSE's surveyed websites can maintain its own internal search engine. There might be a programmatic interface offered by the VSE, for those to hook to, so that a query to the VSE could be passed down to each, and the results collated by the VSE and shown to the user. Our VSE is validated because we also integrate it with the methods of our Antiphishing Provisionals. In what follows, when we refer to an Aggregator, we also include the possibility of a subAggregator. Also, when we refer to a browser that a user uses, we also include any other program that can display web pages, and follow links in those pages.
Consider an Aggregator. At its website, it can have a page which shows a top level view of the VSE, with widgets where the user can type and submit a search query. But the websites and companies that the VSE searches are those that are validated by the Aggregator. It should never return a web page (or a link to such a page) that is invalidated by our plug-in at the user's browser. This is a simple but crucial restriction: when a user searches for something, she can have high confidence that she won't end up at a fraudster's website.
In our Antiphishing Provisionals, we described the use of our plug-in to validate messages or websites that a user reads or visits. But it turns out that when the Aggregator has a VSE, she can use the VSE, even without a plug-in on her browser, and still be confident that the results she gets are not fraudulent. (Though we still strongly recommend the use of our plug-in, to protect against arbitrary websites and messages.) This confidence is well founded so long as her browser can access the Aggregator's website (and not an imposter's), and the results returned to her browser are displayed unaltered by any virus or other malware.
The scope of the VSE can be enhanced in several ways. Consider a company that is a client of the Aggregator, so its pages validate (or at least do not invalidate). Imagine, for example, that it has franchisees. Typically, it has extensive knowledge of their corporate history. The company can, in effect, sponsor some of its financially stronger franchisees to be clients of the Aggregator, if they are not already so. This is useful because a franchisee might be confined to one locality, and the Aggregator might have no a priori substantive knowledge of it.
This mechanism lets the VSE expand its scope down to many localities, instead of just being confined to large national or international companies. There are advantages to a franchisee joining up, because now in its advertising, it can promulgate that it is a validated company, and that it has the protection of our methods of the Antiphishing Provisionals against fraudsters trying to impersonate it electronically.
While the above discussed franchisees as an example, it can be generalized to other companies with a strong business relationship with a company that is already a client of the Aggregator.
It is necessary to discourage promiscuous sponsoring, which would otherwise make it easy for a fraudster to enter the system as an Aggregator client. Hence, an optional but preferred method would be for the sponsoring company, or the affiliates that it sponsors, to post a bond or take out an insurance policy against such an eventuality.
The VSE could also refer to a b2b portal or trade association, and its main corporate members. Where a similar mechanism of requiring a bond or insurance policy could be applied. It can be seen that such a bond or insurance ultimately benefits those who pay it, because it helps act as a barrier against fraudsters.
The VSE might explicitly abjure ever having the broad scope of a general search engine. Which means that a user searching for an obscure or rare item might be less likely to find it in the VSE. But if, say, major retailers join the VSE, then the most commonly sold items could be found via the VSE, with strong antifraud confidence on the user's part.
The antifraud nature of the VSE and the associated companies can be reinforced with financial incentives given by credit card issuers to users who purchase online at the VSE's companies (as contrasted to online purchases elsewhere). Such a purchase might be made by the actual cardholder. Or she might have shopped at a fake website, whose phisher then used that information to make purchases at a VSE company. For a single purchase, the merchant may not be able to immediately distinguish between the two cases. But, over time, users should learn that shopping at a VSE company involves much less chance of fraud than at other websites. It is to a card issuer's benefit to reinforce this behavior, since the issuer also suffers fewer losses.
3.1 Advertising
The search results can also include advertising, preferably clearly indicated as such, and distinct from the regular search results. The advertisers should be restricted to only those companies that have been validated by the Aggregator. Outside companies should not be allowed to advertise, as this allows an attack by phishers.
Consider the VSE plus the plug-in for the user's desktop. Users have to register their plug-ins with the Aggregator. Hence, when a plug-in contacts the Aggregator, or is contacted by it, the Aggregator can associate an email address for that plug-in. Over time, the Aggregator might build up a history for each registered user. Which could include a record of what types of items the user searches the VSE for. As well as the network address (IP address) or range of network addresses, of the user's computer, when the user contacts the VSE.
So a VSE search by a registered user might have contextual information, external to the specific query. As discussed earlier, the VSE can also permit searches by users without plug-ins. But it could also ask these users to register with it. Hence, we have these possible types of users:
1. Registered user with plug-in.
2. Registered user without plug-in.
3. Unregistered user (without plug-in). The default case at a general search engine.
Advertisers might bid against each other for key words. So that when a user types in such a word or phrase, the ads that appear are from those advertisers bidding the highest for the word. As is well known, this is already being performed on general purpose search engines like Google and Yahoo.
Our method has several differences.
The advertisers that can bid are restricted to the Aggregator's validated clients. Where these have been subject to high scrutiny prior to being accepted as clients. In contrast to another search engine that might accept ads from virtually any organization or person that can pay it. Note that typically such a search engine makes no claims about validation or credentials of its advertisers.
In our method, advertisers can also bid against each other for which type of user is making the query. An advertiser might consider that a registered user with a plug-in is of higher value to it than other types of users. Possibly because the user might be more confident about entering into a real transaction, since she has our plug-in to detect scam websites and messages. So the Aggregator's user data can offer more value to advertisers, and thus extract another revenue source.
Advertisers can also bid against each other for what domains or IP ranges or geographic or political regions that a user is from or not from. This location might be that where a user is currently at, when making the query to the VSE. Or, for registered users, it might be the domain (or the corresponding IP address of the domain) in the user's email address. Or the geographic or political region given in the user's information about herself, when she registered. For a registered user, where there might be a difference between a region in her information and the one where she is currently communicating from, the VSE might have some policy to distinguish between the two.
Another difference involves the bidding mechanism, as distinct from what the advertisers are bidding for. The bidding can take into account factors other than just a simple comparison of monetary amounts. For example, an advertiser, Kappa, might be the "favored" advertiser when the user comes from a *.ca domain. Prior to the auction, Kappa "bought" this right from the VSE. All this occurs on the VSE's computers, and in general the VSE has leeway to impose such conditions. So an implementation of this example might be that when Kappa bids, others must exceed her bid by some amount or percentage. This is just one example of how bids might be weighted in order to implement some policy. Clearly, there could be an arbitrarily complex mathematical implementation of a policy.
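The favored-advertiser handicap can be sketched as a weighted comparison. This is one minimal interpretation, with an illustrative 10% premium; the advertiser names and amounts are from the surrounding examples, not a prescribed policy:

```python
def auction_with_favored(bids, favored=None, premium=0.10):
    """Pick the winner when `favored` has bought a handicap: any other
    bidder must exceed the favored bid by `premium` (10% here) to win.
    bids: dict of advertiser -> amount."""
    best, best_score = None, float("-inf")
    for name, amount in bids.items():
        # Discounting non-favored bids is equivalent to requiring them
        # to exceed the favored bid by the premium.
        score = amount if name == favored else amount / (1.0 + premium)
        if score > best_score:
            best, best_score = name, score
    return best

print(auction_with_favored({"Kappa": 10, "Omega": 10.5}, favored="Kappa"))  # Kappa
print(auction_with_favored({"Kappa": 10, "Omega": 11.5}, favored="Kappa"))  # Omega
```

Here Omega's 10.5 fails because it does not exceed Kappa's 10 by 10%, while 11.5 succeeds.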
Another example is that Kappa might be the favored advertiser when there is a registered user with a plug-in, and who is also coming from a *.ca domain. The implementation of which might use the prior approach of weighted bids. This example illustrates how the bidding could involve several conditions.
Another example involves a stochastic element in the bidding. If Kappa bids 10 on a user coming from a .ca domain, and Omega bids 5 and Psi bids 4, then Kappa's probability of winning the bid is defined to be 10/(10+5+4)=10/19. That is, by the relative weighting of Kappa's bid, over all the bids. And if Kappa wins, then Kappa pays 10. With analogous statements for the others. Qualitatively, a mechanism like this lets advertisers with small budgets still have some chance of buying ads.
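The stochastic mechanism maps directly onto weighted random selection. A sketch using the bids from the example; the helper name is ours:

```python
import random

def stochastic_winner(bids, rng=random):
    """Each advertiser wins with probability bid / sum(bids);
    the winner pays its own bid, per the example in the text."""
    names = list(bids)
    winner = rng.choices(names, weights=[bids[n] for n in names], k=1)[0]
    return winner, bids[winner]

bids = {"Kappa": 10, "Omega": 5, "Psi": 4}
# Kappa's chance is 10 / (10 + 5 + 4) = 10/19, about 0.53.
rng = random.Random(0)
wins = sum(1 for _ in range(10000) if stochastic_winner(bids, rng)[0] == "Kappa")
print(abs(wins / 10000 - 10 / 19) < 0.05)  # True, to sampling error
```

Even the smallest bidder, Psi, wins about 4/19 of the time, which is the point of the mechanism.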
Why might it be useful for an advertiser to know what domain or IP range a user is at or not at? Suppose the advertiser is offering loans to students at an Australian university. Those have domains that end in .edu.au. So the advertiser might be willing to pay more in order to get prominent placing for its ad, when the user is making a query from such a domain. Of course, an Australian student might be using a redirector located at an Asian address, say. Or she might be registered with the Aggregator, and furnished an email address outside Australia. So the advertiser might want to advertise under other conditions, to try to cover such cases. But clearly, a query from a .edu.au domain would be more attractive to it than from other domains.
Similarly, consider when the VSE sends a query to one of its client's search engines. Where the returned results are not advertising. The VSE can offer extra value to its clients by also sending suitably anonymized data about the user, like the domain where the user is coming from, or the user's preferences. Some current corporate search engines, that trawl a company's inventory, might not be able to take advantage of such extra data. But with the VSE being able to offer these, it gives incentive for those search engines to be enhanced to use them, to provide more relevant results to the user.
Under such circumstances, a registered user can opt out, by informing the VSE not to even pass such anonymized information about herself to a client search engine.
There are also other circumstances in which the advertisers can bid against each other. These concern the situation when the user's plug-in validates a page or message. Suppose the validation is for company B, which is a client of the Aggregator, and which has its Partner Lists. When the plug-in validates, it might display an ad. But whose ad? Suppose B has competitors C and D, which are also clients of the Aggregator. It could let B, C and D bid for the right to send an ad to the plug-in. C and D have incentive to do so, because the user presumably is interested in what B offers, and they have similar products. A high value context for them.
Note the difference between this and when such actions happen in the context of searching. Here, there is no search. The Aggregator does not know the reason by which the user got to the page or message that was validated.
Concerning the competition between B, C and D, it can be expected that B may feel a need to be defensive, given that the user got a message from B, or is at a website of B's. Of course, C and D are equally "vulnerable" when their websites or messages validate. Optionally, the VSE might require that if B bids, then C or D need to bid at least some minimum amount or percentage above B's value, in order to win the ad. Whereas between C and D, if both meet this condition, then to decide between them as to who wins can just be a simple condition of who has the higher bid.
If B wins the ad, it may not actually want to place an ad. Because the item that was validated is either a message from it, or one of its pages, so an ad might be redundant.
3.2 Web Services
Our methods can also be extended to Web Services [WS]. The basic idea of WS is to enable the programmatic composition of a compound service, by joining together other services, located at various arbitrary locations on the Internet. Each service has an Application Programming Interface [API], whereby other services can access it. WS are programmed in the Web Services Description Language [WSDL] and the Business Process Execution Language [BPEL]. (Others may arise in the future.) Currently, the technical aspects of programmatically describing a WS and composing these are still being resolved. There have been some simple services deployed, notably by eBay, Google and Amazon. So that third parties can, perhaps for a nominal fee or for free, access these services and build other services using them.
In our Antiphishing Provisionals, we described how WS could use our plug-in and Aggregator, for a lightweight validation of a message received from another WS, if the message contains our validation tag. In many situations, this could be used in preference to a heavyweight PKI authentication.
Suppose WS become popular, as their proponents hope. So that there may be millions of computers offering these, just as there currently are millions of websites. Consider a particular WS, Alpha. It wants to find another WS satisfying certain criteria. To do so, it sends these criteria (or some subset) to a Universal Description, Discovery and Integration [UDDI] service that is usually on another computer. At earlier times, other WS that wish to be found register themselves with the UDDI. You can consider a UDDI as analogous to a conventional search engine, except that it gets programmatic queries, not manual ones. Here, the UDDI sees if there is a match between Alpha's query and its database of registered services. If so, then it sends a list of these to Alpha, who then communicates directly with one or more of them.
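The matching step a UDDI performs can be reduced to a criteria-subset test. A toy sketch; the service names, criteria keys and the simple all-criteria-must-match rule are our own illustrative assumptions, not the UDDI specification:

```python
def uddi_match(query, registry):
    """Toy UDDI lookup: return the registered services whose advertised
    criteria contain every criterion in the query."""
    return [name for name, criteria in registry.items()
            if query.items() <= criteria.items()]

registry = {
    "PaymentsWS":  {"industry": "finance",   "region": "AU", "protocol": "SOAP"},
    "LogisticsWS": {"industry": "logistics", "region": "AU", "protocol": "SOAP"},
}
print(uddi_match({"industry": "finance"}, registry))  # ['PaymentsWS']
print(uddi_match({"protocol": "SOAP"}, registry))     # ['PaymentsWS', 'LogisticsWS']
```

The UDDI would return such a list to Alpha, which then contacts the matched services directly.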
If there are many WS, then a UDDI might wish to somehow decide whether a WS that presents itself for registration should be accepted, because a fraudster might be running one of those WS. Imposing a requirement that the WS carry a strong certification from some widely recognized certificate-issuing authority may be insufficient. Those authorities might issue certificates after a payment and perhaps a cursory identity check. Typically, such a certificate in a message merely proves that the message can be traced back to a given party; it may not offer much validation about the background of that party.
One answer is that the Aggregator can act as a trusted UDDI that only registers WS from its clients. There are two benefits to another WS using the Aggregator as a UDDI. Firstly, it gets WS that are highly unlikely to be fraudulent. Secondly, it can communicate with those WS using the lightweight validation described in our Antiphishing Provisionals, if it uses our plug-in. This latter point is optional, though preferred. Suppose it does not have our plug-in. So long as it can access our Aggregator UDDI and receive its replies unchanged, it could use this context as an implicit validation of subsequent messages from those validated WS. There is still a risk of man-in-the-middle attacks on those later messages, so it should use our plug-in and possibly other measures.
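The key property of the Aggregator acting as a trusted UDDI, namely that it accepts registrations only from its own already-validated clients, can be sketched as follows. The client identifiers and endpoints are illustrative assumptions, not part of the described system.

```python
# Sketch of the Aggregator acting as a trusted UDDI: it accepts
# Web Service registrations only from its own (already-validated)
# clients. Client ids and endpoints below are illustrative.

class AggregatorUddi:
    def __init__(self, client_ids):
        self._clients = set(client_ids)   # validated Aggregator clients
        self._registered = {}             # client_id -> service endpoint

    def register(self, client_id, endpoint):
        """Reject any registration that is not from a known client."""
        if client_id not in self._clients:
            return False
        self._registered[client_id] = endpoint
        return True

    def lookup(self, client_id):
        return self._registered.get(client_id)

uddi = AggregatorUddi({"bank1", "store2"})
accepted = uddi.register("bank1", "https://ws.bank1.example/api")
rejected = uddi.register("phisher", "https://evil.example/api")
```

A WS that trusts the Aggregator thus gets a registry in which every entry is tied to a validated client, rather than to anyone willing to register.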
Suppose that the WS Alpha asks another UDDI that is not the Aggregator. The UDDI might check with our Aggregator, in order to only allow registration from the Aggregator's clients. Hence, the Aggregator can obtain another revenue stream.
Earlier, we discussed how Alpha uses a UDDI to find other WS. We avoided the issue of how Alpha finds a UDDI in the first place: a bootstrap problem. Somehow, by one means or another, Alpha needs to find at least one initial UDDI, which could then perhaps direct it to other UDDIs. But if WS become popular, the problem of finding a UDDI will diminish, because many groups can then set up their own UDDIs and advertise them in various ways.
This gives rise to another problem: how does Alpha choose between UDDIs? Some may be run by fraudsters. Others may simply accept for registration any WS that asks. These concerns give value to the Aggregator acting as a trusted UDDI on the Internet.
We have described UDDI above as the preferred example. However, our method is also applicable to other functionally similar or equivalent registry services, including a CORBA-type service, or any other proprietary or open-source service that provides similar capabilities in a programmatic fashion.
3.3 Comparison with Whitelist
We now compare our method with an apparently simpler alternative, in which there is a central server, call it Gamma, with a list of network addresses that are reputable. For example, let those addresses be URLs. These URLs could be owned by companies that have been validated by Gamma, using some means different from those in our Antiphishing Provisionals. Gamma might then distribute a plug-in or toolbar for browsers, so that when a user goes to a website, the plug-in contacts Gamma to see whether the URL that the browser is viewing is in Gamma's list. If so, the plug-in might display something to indicate this; otherwise, it might show an "invalid" image.
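The Gamma whitelist check described above amounts to an exact-match lookup, which can be sketched as follows. The sample URLs are illustrative assumptions; note how an exact-match lookup already hints at the dynamic-page problem discussed below.

```python
# Minimal sketch of the Gamma whitelist check: the plug-in asks
# whether the URL currently viewed is in Gamma's list of validated
# URLs. The sample URLs are illustrative assumptions.

GAMMA_WHITELIST = {
    "https://www.bank1.example/login",
    "https://www.bank1.example/home",
}

def plugin_check(url, whitelist=GAMMA_WHITELIST):
    """Return the indicator the plug-in would display for this URL."""
    return "valid" if url in whitelist else "invalid"

# A dynamically generated variant of a whitelisted page fails the
# exact-match test, illustrating one of the drawbacks listed below.
dynamic_result = plugin_check("https://www.bank1.example/login?session=abc123")
```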
This method is essentially a whitelist, akin to what has been used in the handling of email for many years, except that here the whitelist contains known, good URLs instead of email addresses. The method lacks our Partner Lists and notphish tags and their usages. Compared to our method, it suffers from the following disadvantages:
1. It does not work against phishing messages sent to users at an ISP, where the users then run a browser (or equivalent program) to read them. The ISP's URLs might be put into the whitelist, but a phishing message is being read at a URL at the ISP, so the whitelist method cannot distinguish it from any other non-phishing message received by the user and read at a URL of that ISP. In contrast, our method handles websites and messages in a unified fashion.
2. It does not work for dynamic webpages, whose URLs contain unique elements. This is a severe drawback: many large companies now dynamically generate pages, as a result of a customer searching for something, or to put something into a page that is specific to a customer, such as a salutation with the customer's name or account number with the company.
3. It needs much greater storage. Recording all of the valid URLs for a company may often be much lengthier than recording a set of Partner Lists for that company.
4. It is more cumbersome for the companies, since they have to send Gamma a list of their URLs and then, in a timely fashion, tell Gamma whenever new URLs are added and old URLs are deleted. With our method, the Partner Lists are far smaller.
5. Slower lookup. Even when the URLs are represented in a hashtable, the lookup times may often be longer than when searching through Partner Lists, where the latter would also be held in a hashtable.
6. Heavier network traffic. In our method, a plug-in might cache Partner Lists for various companies; because these are short, the bandwidth and storage requirements are minimal, which helps reduce the overall demands on the server. But suppose there is a whitelist of URLs, and the plug-in caches the URLs that the user has visited and that the server has said are in the whitelist. The very specific nature of a URL suggests that it is of limited use in a cache, with respect to the user returning to that URL. Essentially, the whitelist method means that the plug-in may have to keep going back to the server for each new URL the user visits.
7. Vulnerable to an overload attack. The previous item means that a phisher could attack the whitelist method by running a botnet (robot network) that mimics browsers with the plug-in, whose members query the server with URLs, similar to a DDoS attack. Our method, by contrast, can be implemented such that our plug-ins get a large portion or all of the Partner Lists for our companies, since current desktop machines have enough memory and disk space to hold the data. By combining this with keeping a record of the network addresses of computers that have recently asked for data, and hence restricting repeated queries from the same computer within a given time interval, we can defend against an overload attack.
8. Vulnerable to subverted content. The whitelist is a list of known, good addresses; it says nothing about the contents of the pages at those addresses. Consider a large company, with many pages and many departments (sales, marketing, human relations, accounting, etc.). Often, different departments maintain different parts of the website, and this is done by different people. There could be thousands of static pages. A phisher could mount a network attack against some particular department or its computers, such that she can modify an existing page that is listed in the whitelist. By changing it to contain a form where the user can submit personal information, with a link to the phisher's external computer, she can try to redirect such information to herself. Or she can try a human-factors approach, and attempt to suborn or infiltrate the personnel in order to make such changes. The larger the company, the more computers and people it has to guard against these attacks. Plus, if the company has other marketing partners, it may be vulnerable through such attacks on them. But with our method, the key objects to guard against unauthorized changes are the Partner Lists, along with ensuring that the Lists get transmitted to the Aggregator. These are far fewer objects to guard, and the number of people involved is far smaller. It also means that the company can more easily add two-person manual approvals of the Lists, as contrasted with having to do so for the much larger set of URLs.
9. Vulnerable to an external URL. Suppose a company ("Rho") whitelists its URLs. There is nothing to prevent another company, which is a client of Gamma, from setting up pages at its own whitelisted URLs that purport to be from Rho, or from a marketing partner of Rho. This might be done with actual links to Rho, to add verisimilitude. The previous item described how Rho can be vulnerable across its computers and personnel. This item is worse, because now Rho is vulnerable across all of Gamma's clients' computers and personnel; the weakest of Gamma's clients endangers all the clients. Our method blocks this, because Rho has to permit another Aggregator client to have Rho in its Partner List. That is, Rho has a veto over the use of its name in other Partner Lists.
10. Less automated. Based on publicly available information, current implementations of the whitelist use manual scrutiny of some or all of the URLs received at Gamma from the companies. This adds to the cost of running Gamma, and it increases the delay before a URL can enter Gamma's database. Our method can use automated verification of the Partner Lists that the Aggregator receives: cheaper and faster.
11. Less advertising control. The previous item has a negative implication for a company that wants to start a surprise ad campaign, for example, using web pages that it builds in isolation from the network, which it then wants to make generally accessible at a time of its choosing, and which it also wants Gamma to validate. If Gamma has to manually peruse these beforehand, it is a potential giveaway of the campaign. The company could let personnel at Gamma have private access to the new pages, using some kind of password protection, but that is cumbersome for both parties, and a Gamma employee could inadvertently (or deliberately) reveal the intent of the new pages to others. Our method does not involve perusal of the pages by the Aggregator.
Plus, a company might want even the name of a new domain to remain secret until it is announced. In our method, the company can send the Aggregator a Partner List with the new domain name hashed.
12. Harder to disseminate. By the very nature of webpages, anyone using a browser to view a page can copy the source and host it at a different website, and of course some phishing attacks include unauthorized copying of a company's pages. But sometimes a company might want to encourage the copying of some of its pages. If this is done to destinations outside the whitelist, the new copies will not validate. But suppose the pages include our tags, and suppose the new copies do not violate the conditions of those tags, which may often mean that no links to domains outside the relevant Partner List are put into the new pages. Then our plug-in can still validate the new copies.
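The overload defense described in item 7 above, restricting repeated queries from the same network address within a given time interval, can be sketched as follows. The class name, query limit, and time window are illustrative assumptions.

```python
import time

# Sketch of the overload defense from item 7: restrict repeated
# queries from the same network address within a time window.
# The limit and window values are illustrative assumptions.

class QueryThrottle:
    def __init__(self, limit=5, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self._seen = {}  # address -> timestamps of recent queries

    def allow(self, address, now=None):
        """Return True if this address may query the server now."""
        now = time.monotonic() if now is None else now
        # Keep only timestamps still inside the window.
        recent = [t for t in self._seen.get(address, []) if now - t < self.window]
        if len(recent) >= self.limit:
            self._seen[address] = recent
            return False
        recent.append(now)
        self._seen[address] = recent
        return True
```

A botnet member that repeatedly queries from the same address is quickly refused, while ordinary plug-ins, which mostly serve from their local cache of Partner Lists, rarely hit the limit.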
Given the above drawbacks of the whitelist, an improvement might be that instead of Gamma storing full URLs, it stores domains and base domains, and the plug-in knows how to use these. A company might then not need to explicitly whitelist all its URLs. But items 1 and 8-12 in the above list are still pertinent, while the other items remain germane for the server and for those companies still using full URLs.
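The base-domain improvement just described reduces each viewed URL to its base domain before lookup, which can be sketched as follows. The whitelisted domains are illustrative assumptions, and the two-label heuristic is a simplification; a real implementation would handle multi-part public suffixes such as ".co.uk".

```python
from urllib.parse import urlparse

# Sketch of the improvement above: Gamma stores base domains rather
# than full URLs, and the plug-in reduces each viewed URL to its
# base domain before lookup. Domains below are illustrative, and
# the last-two-labels heuristic is a simplification.

WHITELISTED_BASE_DOMAINS = {"bank1.example", "store2.example"}

def base_domain(url):
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def domain_check(url):
    return base_domain(url) in WHITELISTED_BASE_DOMAINS
```

With this scheme, dynamically generated URLs under a whitelisted base domain validate without Gamma ever seeing them, though items 1 and 8-12 above still apply.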
Also, suppose Gamma and its clients want to improve the whitelist to address item 8. The clients might then also upload some type of fingerprint of the valid content at a valid URL, such as a hash of part or all of the page. But if the hash covers all of the page, this fails in the general case for dynamic pages, because those might be unique for each customer. Suppose then that it covers subsets of pages, so that for dynamic pages the subsets might be the invariant portions. But how are these to be demarcated, so that the plug-in knows what to compute when viewing such a page? This brings us to the use of our tags to perform such demarcations, as discussed in the first section of this Invention.

Claims

WHAT IS CLAIMED IS:
1. A method, given a company ("Alpha") that is a client of another company ("Aggregator"), where the latter furnishes certain information from its clients to browser plug-ins, where this information relates to electronic messages sent by the clients, of Alpha indicating whether its messages have attachments, and if so, Alpha can optionally upload signatures of those attachments to the Aggregator.
2. A method of using claim 1, where the browser plug-in downloads that information about Alpha, from the Aggregator, to ascertain whether a message purporting to come from Alpha, and which contains an attachment, is actually from Alpha, by comparing that information with a signature or signatures that the plug-in directly computes from the message.
3. A method of using claim 1, where Alpha indicates to the Aggregator whether Alpha's messages will have images, and if so, Alpha can optionally upload signatures of those images to the Aggregator.
4. A method of using claim 3, where the browser plug-in downloads that information about Alpha, from the Aggregator, to ascertain whether a message purporting to come from Alpha, and which contains images, is actually from Alpha, by comparing that information with a signature or signatures that the plug-in directly computes from the message.
5. A method of using claim 1, where Alpha indicates to the Aggregator whether Alpha's messages will have scripts, and if so, Alpha can optionally upload signatures of those scripts to the Aggregator.
6. A method of using claim 5, where the browser plug-in downloads that information about Alpha, from the Aggregator, to ascertain whether a message purporting to come from Alpha, and which contains scripts, is actually from Alpha, by comparing that information with a signature or signatures that the plug-in directly computes from the message.
7. A method of the Aggregator compiling a blacklist of domains from various sources, including antivirus analysis, antispam analysis of electronic messages using Bulk Message Envelopes (BMEs) and the clustering and classification or categorization of metadata derived from those BMEs.
8. A method of using claim 7, where the blacklist is downloaded at regular intervals to a browser plug-in, which can then apply it to one or both of electronic messages viewed in the browser and URLs that the browser is sent to, by possibly indicating that a message has a link to a domain in the blacklist, or by not downloading and displaying a web page with an address containing a domain in the blacklist.
9. A method of the Aggregator publishing a website where a visitor can search the pages of its clients and of the Aggregator itself, in a mode we term a "Validated Search Engine" (VSE).
10. A method of using claim 9, where a client of the Aggregator might sponsor (i.e. vouch for the bona fides of) other companies that are not clients of the Aggregator, to be in the search scope of the VSE.
11. A method of the Aggregator acting as a UDDI service, where it only accepts Web Services from its clients.
12. A method of using claim 11, where a client of the Aggregator might sponsor (i.e. vouch for the bona fides of) other companies that are not clients of the Aggregator, so that those companies can provide Web Services that will be accepted by the Aggregator acting as a UDDI service.
PCT/CN2005/002154 2004-12-12 2005-12-12 System and method for extending an antiphishing aggregator WO2006060967A2 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US59311504P 2004-12-12 2004-12-12
US59311404P 2004-12-12 2004-12-12
US60/593,115 2004-12-12
US60/593,114 2004-12-12
US59318604P 2004-12-18 2004-12-18
US60/593,186 2004-12-18
US16492005A 2005-12-11 2005-12-11
US11/164,920 2005-12-11

Publications (1)

Publication Number Publication Date
WO2006060967A2 true WO2006060967A2 (en) 2006-06-15

Family

ID=36578262

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2005/002154 WO2006060967A2 (en) 2004-12-12 2005-12-12 System and method for extending an antiphishing aggregator

Country Status (1)

Country Link
WO (1) WO2006060967A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011500395A (en) * 2007-10-31 2011-01-06 センニア・ホーランド・ビー.ブイ. Printhead mechanism and method of depositing material
US8557340B2 (en) 2007-10-31 2013-10-15 Xennia Holland B.V. Print head arrangement and method of depositing a substance
US9217224B2 (en) 2007-10-31 2015-12-22 Xennia Holland B.V. Print head arrangement and method of depositing a substance

Similar Documents

Publication Publication Date Title
Stafford et al. Spyware: The ghost in the machine
US7822620B2 (en) Determining website reputations using automatic testing
US9384345B2 (en) Providing alternative web content based on website reputation assessment
US8516377B2 (en) Indicating Website reputations during Website manipulation of user information
US7765481B2 (en) Indicating website reputations during an electronic commerce transaction
US8826155B2 (en) System, method, and computer program product for presenting an indicia of risk reflecting an analysis associated with search results within a graphical user interface
US8566726B2 (en) Indicating website reputations based on website handling of personal information
Gandhi et al. Badvertisements: Stealthy click-fraud with unwitting accessories
US20070094500A1 (en) System and Method for Investigating Phishing Web Sites
US20090210937A1 (en) Captcha advertising
US20060253584A1 (en) Reputation of an entity associated with a content item
US20060253582A1 (en) Indicating website reputations within search results
Jakobsson The death of the Internet
Klein Defending Against the Wily Surfer {-Web-Based} Attacks and Defenses
Harding et al. Cookies and Web bugs: What they are and how they work together
WO2007016868A2 (en) System and method for verifying links and electronic addresses in web pages and messages
WO2007076715A1 (en) System and method of approving web pages and electronic messages
Huzairin et al. Google’s Legal Responsibility in Displaying Phishing Ads Through Google AdWords
Zhu et al. Ad fraud categorization and detection methods
Medlin et al. The cost of electronic retailing: Prevalent security threats and their results
WO2006060967A2 (en) System and method for extending an antiphishing aggregator
WO2006026921A2 (en) System and method to detect phishing and verify electronic advertising
WO2006042480A2 (en) System and method for investigating phishing web sites
Hoofnagle et al. Online pharmacies and technology crime
Smith Informational exchanges and dynamics of internet pornography in an e-commerce environment

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KN KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05818721

Country of ref document: EP

Kind code of ref document: A2

WWW Wipo information: withdrawn in national office

Ref document number: 5818721

Country of ref document: EP
