With the continued growth of the Internet, spam has increased enormously. It is no longer the minor nuisance it once was; today it can fairly be called overwhelming. At first, spam consisted mainly of unsolicited commercial advertising, but spam related to pornography and politics is now increasing, reaching about 40% of all spam, and the trend is still upward. Spam has also become a new channel through which computer viruses spread rapidly.
At present, about 50% of the world's email is spam, and only a small number of organizations are responsible for it. Many anti-spam measures have been proposed, but only a few have been implemented. Unfortunately, none of these solutions can stop spam completely, and they also affect normal mail communication.
1.1 What is spam?
To some extent, spam can be defined simply as email that people do not want to receive. For example:
* Commercial advertisements. Many companies promote new products and events by email.
* Political messages. Many such emails now arrive from other countries or from reactionary organizations; like commercial junk mail, they promote and peddle their views.
* Worm mail. More and more viruses spread through email, which is a fast and effective propagation channel.
* Malicious mail. Threatening or deceptive emails. Phishing, for example, uses an email that leads to a fake web page, a complete scheme for tricking users out of personal information, account numbers and even credit card details.
How does an ordinary individual's email address become a target for spam? There are many ways: registering the address on websites, forums and similar places; viruses harvesting it from friends' mail; spammers enumerating the users of a mail provider; and so on. Generally speaking, the less a mailbox is exposed and the shorter the time it has been in use, the less spam it receives. Some users, out of options, abandon their mailboxes and switch to new ones.
1.2 Security issues
Spam has had a great impact on the Internet and its users. It not only costs people time and consumes system resources, but also brings many security problems.
Spam obviously takes up a lot of network resources, which slows down normal business operations. Some poorly secured mail servers are used as spam relays, and warnings and IP blocking happen from time to time. As anti-spam efforts expand worldwide and institutions share blacklists, innocent servers are being blocked on a wider scale, which undoubtedly causes serious problems for normal users.
Spam is increasingly combined with hacker attacks and viruses. For example, the SoBig worm installed an open proxy that could be used to relay mail. As spam has developed, the use of malicious code or monitoring software to support it has increased significantly. In December 2003, a hacker group in Brazil sent spam containing malicious JavaScript to millions of users; those who viewed the spam through Hotmail unknowingly exposed their accounts. Another example is the recent URL display problem in IE: adding "%01" before the host name hides the real host address, and the trick appeared in spam within a few weeks of being published.
Increasingly deceptive virus mail has caused great losses for many enterprises, and even a good network protection strategy cannot entirely prevent it. More and more security incidents start with an email carrying a virus, a Trojan or some other malicious program. Ordinary users find it genuinely hard to recognize phishing tricks, yet the resulting losses are very direct.
2. Anti-spam technology
Existing and proposed anti-spam methods try to reduce the spam problem and to address security requirements. Correctly identifying spam also reduces email-borne viruses and email-based attacks. These solutions use a variety of security techniques to stop spam.
In the article "Anti-spam Solutions and Security", Dr. Neal Krawetz gives a good classification of anti-spam technologies. Current technologies fall into four categories: filtering, reverse lookup, challenge and encryption. These solutions can reduce the spam problem, but each has its limitations. The following sections discuss these technologies and the implementation of some of the main ones.
2.1 Filtering
Filtering is a relatively simple and direct way to deal with spam. It is applied mainly on the receiving side, in an MUA (such as Outlook Express) or an MTA (such as sendmail), to identify and handle spam. It is also the most widely deployed technology: anti-spam plug-ins on mail servers, anti-spam gateways and the anti-spam functions of mail clients are all filtering technologies.
2.1.1 Keyword filtering
Keyword filtering builds simple or complex lists of words associated with spam and uses them to identify and handle spam. For example, certain keywords recur in spam, such as the subject lines used by some viruses (for instance "test"). The method resembles the virus signatures used by antivirus software. It is a simple form of content filtering, and it rests on building a huge list of filter keywords.
The shortcomings are obvious: filtering ability depends entirely on the keyword list, and the list can also cause false positives. The technique also consumes more system resources when processing mail. In addition, common keyword-avoidance tricks, such as splitting words, easily bypass the filter.
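As a rough illustration of both the technique and its weakness, the following Python sketch matches a message against a keyword list. The keyword list and the function name `is_spam_by_keyword` are hypothetical and chosen only for this example; a real filter would use a far larger, regularly updated list.

```python
import re

# Hypothetical keyword list; real filters maintain much larger ones.
SPAM_KEYWORDS = {"lottery", "winner", "free money"}

def is_spam_by_keyword(subject: str, body: str) -> bool:
    """Return True if any listed keyword occurs in the subject or body."""
    text = f"{subject}\n{body}".lower()
    # Collapse whitespace so multi-word keywords match across line breaks.
    text = re.sub(r"\s+", " ", text)
    return any(keyword in text for keyword in SPAM_KEYWORDS)

# A plain keyword is caught...
print(is_spam_by_keyword("You are a WINNER", "Claim your prize"))
# ...but the same word split with hyphens slips through.
print(is_spam_by_keyword("You are a w-i-n-n-e-r", "Claim your prize"))
```

The second call shows the word-splitting evasion mentioned above: the filter only sees literal substrings, so a trivially altered spelling is not matched.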
2.1.2 Black and white lists
A blacklist and a whitelist hold the IP addresses or email addresses of known spammers and of trusted senders, respectively. Many organizations now maintain block lists (BLs): they collect IP addresses (or whole IP address ranges) that frequently send spam and publish them as a list that can be shared widely, such as the SBL (Spamhaus Block List) from Spamhaus. Many ISPs use the block lists of such organizations to refuse spam. A whitelist is the opposite of a blacklist: mail from the trusted email addresses or IP addresses on it is accepted unconditionally.
Many mail recipients currently use black and white lists to deal with spam, in both MUAs and MTAs. They are more widely used in MTAs, where they can effectively reduce the load on the server.
Block lists also have obvious shortcomings: a list cannot include all (or even most) offending IP addresses, and spammers can easily send spam from different IP addresses.
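A minimal sketch of a list check, assuming the lists are held locally as IP ranges and email addresses. The list contents, the function name `check_lists` and the three result strings are assumptions made for illustration; shared block lists are normally queried over the network rather than stored like this.

```python
import ipaddress

# Hypothetical local lists (documentation-only address ranges).
BLOCK_LIST = [ipaddress.ip_network("203.0.113.0/24")]
ALLOW_LIST = [ipaddress.ip_network("203.0.113.25/32")]
TRUSTED_SENDERS = {"colleague@example.org"}

def check_lists(client_ip: str, sender: str) -> str:
    """Return 'accept', 'reject' or 'unknown' for a connecting client."""
    addr = ipaddress.ip_address(client_ip)
    # The whitelist is consulted first so trusted senders are never blocked.
    if sender.lower() in TRUSTED_SENDERS or any(addr in net for net in ALLOW_LIST):
        return "accept"
    if any(addr in net for net in BLOCK_LIST):
        return "reject"
    # Not on either list: other filters have to decide.
    return "unknown"

print(check_lists("203.0.113.7", "someone@example.com"))
print(check_lists("198.51.100.1", "someone@example.com"))
```

The "unknown" result reflects the limitation described above: an address that is simply not on the list yet passes straight through.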
2.1.3 Hash technology
With hash technology, the mail system describes a message by a hash: it takes, for example, the content and the sender of the message as input and computes a hash value that represents the message. If two hashes are the same, the messages have the same content and sender. Some ISPs use this approach: if hash values repeat, the message can be suspected of being bulk mail.
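The idea can be sketched as follows, assuming SHA-256 as the hash function and a simple in-memory counter. The text does not say which hash or threshold ISPs use, so `mail_hash`, `looks_like_bulk` and `BULK_THRESHOLD` are illustrative choices only.

```python
import hashlib
from collections import Counter

BULK_THRESHOLD = 3   # hypothetical: repeats needed before mail looks like bulk
seen = Counter()     # hash value -> number of times it has been seen

def mail_hash(sender: str, body: str) -> str:
    """Hash the sender together with the whitespace-normalized body."""
    normalized = " ".join(body.split()).lower()
    data = f"{sender.lower()}\n{normalized}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def looks_like_bulk(sender: str, body: str) -> bool:
    """Record the message and report whether its hash has repeated enough."""
    digest = mail_hash(sender, body)
    seen[digest] += 1
    return seen[digest] >= BULK_THRESHOLD

for _ in range(3):
    suspected = looks_like_bulk("promo@example.com", "Buy now, limited offer")
print(suspected)
```

Because any change to the body yields a different hash, a sender who varies each copy slightly will not be counted as sending duplicates.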
2.1.4 Rule-based filtering
This kind of filtering builds rules from certain characteristics (such as words, phrases, positions, sizes and attachments) and describes spam through these rules, much as intrusion events are described in an IDS. For the filter to be effective, administrators must maintain a huge rule base.
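A small sketch of what such a rule base might look like, with each rule expressed as a named predicate over a message. The `Message` and `Rule` classes and the three example rules are invented for illustration and are not taken from any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Message:
    subject: str
    body: str
    attachment_names: List[str] = field(default_factory=list)

@dataclass
class Rule:
    name: str
    matches: Callable[[Message], bool]

# Hypothetical rules; a production rule base would hold many more.
RULES = [
    Rule("executable attachment",
         lambda m: any(n.lower().endswith((".exe", ".scr", ".pif"))
                       for n in m.attachment_names)),
    Rule("shouting subject",
         lambda m: len(m.subject) > 10 and m.subject.isupper()),
    Rule("tiny body with attachment",
         lambda m: len(m.body) < 20 and bool(m.attachment_names)),
]

def matched_rules(msg: Message) -> List[str]:
    """Return the names of all rules that the message triggers."""
    return [rule.name for rule in RULES if rule.matches(msg)]

print(matched_rules(Message("BUY NOW TODAY!!!", "see file", ["offer.exe"])))
```

Every new spam pattern needs a new rule here, which is the maintenance burden the paragraph above refers to.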
2.1.5 Intelligent and probabilistic systems
The Bayesian algorithm is widely used. It learns the frequency and patterns of words and judges a message by how strongly those words are associated with spam and with normal mail. This is a more complex and more intelligent content filtering technology than keyword matching. It is described in detail below, as it is the most widely used technology on both clients and servers.
2.1.5.1 Bayesian algorithm
Among filters, score-based filters are probably the best available today, because it is easy to see that simple methods such as black and white lists, keyword libraries or hash filters are too crude to cope with cunning spam. A scoring filter is the most basic algorithmic filter and the prototype of the Bayesian approach. It examines the words or characters found in spam and assigns each characteristic element a score (a positive score); it likewise examines the characteristic elements of normal mail, which lower the score (negative scores). The whole message ends up with a total spam score, which is used to decide whether it is spam.
A scoring filter of this kind tries to identify spam automatically, but it still has some problems:
* The list of feature elements is derived from spam and from ordinary mail. To identify spam well, the filter must therefore learn from hundreds of emails, and this reduces its efficiency, because the characteristic elements of normal mail differ from person to person.
* The number of emails analyzed to obtain the feature elements is critical. If spammers adapt to these features, they can make spam look more like ordinary mail, and the filtering features then have to change.
* The score given to each word should rest on a sound evaluation, but it remains somewhat arbitrary. For example, the features may not keep up with changes in the vocabulary of spam or with the needs of individual users.
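A minimal sketch of a scoring filter of the kind described above. The word scores and the threshold are made-up values, which also illustrates the third problem: someone has to pick them, and the choice is fairly arbitrary.

```python
import re

# Hypothetical hand-assigned scores: positive for spam features,
# negative for features typical of normal mail.
SPAM_SCORES = {"free": 2.0, "winner": 3.0, "click": 1.5}
HAM_SCORES = {"meeting": -2.0, "invoice": -1.0, "regards": -1.5}
SPAM_THRESHOLD = 3.0   # assumed cut-off

def spam_score(text: str) -> float:
    """Sum the scores of all known words in the message."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(SPAM_SCORES.get(t, 0.0) + HAM_SCORES.get(t, 0.0) for t in tokens)

def is_spam(text: str) -> bool:
    """A message is treated as spam once its total reaches the threshold."""
    return spam_score(text) >= SPAM_THRESHOLD

print(spam_score("Click here, winner! Free prize"))   # 6.5
print(is_spam("Meeting notes attached, regards"))     # False
```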
Bayesian theory, which describes events under uncertainty, is now widely used in the computer industry; Google, for example, uses it in its computations. A Bayesian filter calculates the probability that a message is spam from its content. It must first learn from a large number of spam and normal messages, so it works better than ordinary content filters and produces fewer false positives. A Bayesian filter is also a score-based filter, but rather than simply adding up scores it identifies spam on a more fundamental basis: it builds its feature table automatically. In principle, a large number of spam messages and a large number of normal messages are analyzed first, and the algorithm computes the probability associated with each feature in the mail.
The features used by the Bayesian algorithm usually come from:
* Words in the body of the message
* Message headers (sender, delivery path, etc.)
* Other representations, such as HTML coding (colors, etc.)
* Word pairs and phrases
* Meta-information, such as where a particular phrase appears
For example, if the word AAA often appears in normal mail but almost never in spam, the probability that AAA indicates spam is close to zero, and vice versa.
The steps of Bayesian algorithm are as follows:
1. Collect a large number of spam and non-spam, and establish spam and non-spam sets.
2. Extract independent character strings, such as AAA, from the feature sources as token strings, and count how often each token string occurs. Process all messages in the spam set and in the non-spam set separately in this way.
3. Each mail set corresponds to a hash table: hashtable_good for the non-spam set and hashtable_bad for the spam set. Each table stores the mapping from token strings to word frequencies.
4. Calculate the probability of each token string in each hash table: P = (word frequency of the token string) / (length of the corresponding hash table).
5. Considering hashtable_good and hashtable_bad together, derive the probability that a new email is spam when a given token string appears in it. In mathematical terms:
Event A: the mail is spam;
t1, t2, ..., tn are the token strings.
P(A|ti) is the probability that an email is spam when the token string ti appears in it. Let
P1(ti) = the value of ti in hashtable_good,
P2(ti) = the value of ti in hashtable_bad.
Then P(A|ti) = P2(ti) / [P1(ti) + P2(ti)].
6. Establish a new hash table hashtable_probability to store the mapping of token string ti to P(A|ti).
7. According to the established hash table hashtable_probability, the possibility that the newly arrived mail is spam can be estimated.
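Steps 1 through 6 can be sketched in Python as follows. The table names hashtable_good, hashtable_bad and hashtable_probability come from the steps above; the tokenizer, the function names and the tiny training sets are assumptions for illustration, and both mail sets are assumed to be non-empty.

```python
import re
from collections import Counter

def tokenize(text):
    """Step 2: split a message into token strings (a simplistic tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def build_table(messages):
    """Steps 2-3: map each token string to its frequency in one mail set."""
    table = Counter()
    for message in messages:
        table.update(tokenize(message))
    return table

def build_probability_table(spam_messages, good_messages):
    """Steps 4-6: compute P(A|ti) for every token string seen in training."""
    hashtable_bad = build_table(spam_messages)
    hashtable_good = build_table(good_messages)
    hashtable_probability = {}
    for token in set(hashtable_bad) | set(hashtable_good):
        # Step 4: frequency divided by the length of the hash table.
        p1 = hashtable_good[token] / len(hashtable_good)
        p2 = hashtable_bad[token] / len(hashtable_bad)
        # Step 5: P(A|ti) = P2(ti) / [P1(ti) + P2(ti)]
        hashtable_probability[token] = p2 / (p1 + p2)
    return hashtable_probability

spam = ["buy cheap pills", "cheap pills online"]
good = ["project meeting today", "meeting notes online"]
table = build_probability_table(spam, good)
print(table["cheap"], table["meeting"], round(table["online"], 3))
```

A token seen only in spam gets probability 1, one seen only in normal mail gets 0, and a token seen in both falls in between.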
When a new email arrives, generate its token strings as in step 2 and look up the value of each one in hashtable_probability. Suppose N token strings t1, t2, ..., tN are obtained from the email, and their corresponding values in hashtable_probability are P1, P2, ..., PN. P(A|t1, t2, ..., tN) denotes the probability that the email is spam when the token strings t1, t2, ..., tN all appear in it.
According to the compound probability formula:
P(A|t1, t2, ..., tN) = (P1 * P2 * ... * PN) / [P1 * P2 * ... * PN + (1 - P1) * (1 - P2) * ... * (1 - PN)]
When P(A|t1, t2, ..., tN) exceeds a predetermined threshold, the email can be judged to be spam.
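The combination rule can be sketched as below. Two details are assumptions added for this example and are not stated in the text: per-token probabilities are clamped to the range 0.01 to 0.99 so that the denominator cannot become zero, and tokens missing from the table are given a neutral 0.5. The threshold of 0.9 is likewise only a placeholder.

```python
from math import prod

def combined_spam_probability(probabilities):
    """Combine per-token values P1..PN into P(A|t1, ..., tN)."""
    spam_side = prod(probabilities)
    good_side = prod(1.0 - p for p in probabilities)
    return spam_side / (spam_side + good_side)

def classify(tokens, hashtable_probability, threshold=0.9, unknown=0.5):
    """Judge a tokenized message as spam when the combined value passes the threshold."""
    # Assumption: clamp values so extreme 0.0 / 1.0 entries cannot zero both products.
    values = [min(0.99, max(0.01, hashtable_probability.get(t, unknown)))
              for t in tokens]
    return combined_spam_probability(values) > threshold

print(round(combined_spam_probability([0.9, 0.8]), 3))
print(classify(["cheap", "pills"], {"cheap": 0.99, "pills": 0.95}))
```

Note how two moderately suspicious tokens (0.9 and 0.8) combine into a stronger overall indication than either alone.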
When new mail arrives, the Bayesian filter analyzes it and uses its various features to calculate the probability that it is spam. The filter is updated continuously through this ongoing analysis. For example, if a message containing the word AAA is judged to be spam on the strength of its other features, the probability that AAA is a spam feature increases.
In this way a Bayesian filter adapts, either automatically or with manual input from the user, and it adapts best to an individual user. Spammers find it hard to adapt in the same way, so escaping the filter is more difficult for them, although they can of course try to disguise their mail as normal mail. Only if spammers can probe someone's filter, for example by using read receipts to learn which messages the user opened, can they adapt to it.
Although a Bayesian filter still has the shortcomings of a scoring filter, it is better optimized. Practice has shown that Bayesian filters are very effective on both clients and servers, and an excellent Bayesian filter can identify more than 99.9% of spam. Most anti-spam products currently in use adopt this technology; the Bayesian filtering in Foxmail is one example.
2.1.6 Limitations and shortcomings
Many anti-spam products that rely on filtering combine several filtering techniques to be more effective. Filters are ranked by their false positives and false negatives. A false negative means that spam got past the filter; a false positive means that normal mail was judged to be spam. A perfect filtering system would have neither, but that is an ideal.
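To make the two error types concrete, the sketch below computes both rates from a list of labeled results. The function name `error_rates` and the sample data are invented for illustration.

```python
def error_rates(results):
    """results: pairs of (actually_spam, flagged_as_spam).

    Returns (false_positive_rate, false_negative_rate).
    """
    results = list(results)
    normal_flags = [flagged for actual, flagged in results if not actual]
    spam_flags = [flagged for actual, flagged in results if actual]
    # False positive: normal mail that the filter flagged as spam.
    fp_rate = sum(normal_flags) / len(normal_flags) if normal_flags else 0.0
    # False negative: spam that the filter let through.
    fn_rate = sum(not f for f in spam_flags) / len(spam_flags) if spam_flags else 0.0
    return fp_rate, fn_rate

sample = [
    (False, False), (False, False), (False, False), (False, True),  # normal mail
    (True, True), (True, True), (True, False), (True, False),       # spam
]
print(error_rates(sample))   # (0.25, 0.5)
```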
Anti-spam systems based on filtering usually have the following three limitations:
* They can be bypassed. Spammers and their sending tools are not static, and they adapt quickly to filters. Against a keyword list, for example, they can randomly vary the spelling of a word, using deliberately misspelled or punctuated variants. Hash busters (which produce a different hash for every message) are designed to defeat hash filters. The Bayesian filters in common use today can be bypassed by inserting random words or sentences. Most filters remain effective for a few weeks at most. To keep an anti-spam system useful, its filter rules must be updated constantly, for example daily or weekly.
* False positives. The biggest headache is normal mail being judged to be spam. For example, an ordinary email containing the word "sample" may be classified as spam. Unfortunately, some legitimate servers end up blacklisted not because they send spam but because irresponsible organizations block the entire network segment they belong to (the xfocus server is one such example). Reducing false positives, however, may in turn cause serious false negatives.
* Filter review. Because of false positives, messages marked as spam are generally not deleted immediately but placed in a spam folder for later inspection. Unfortunately, this means that users still have to spend time checking spam, even if they only scan the subject lines.
A more serious problem at present is that people still believe filters can effectively stop spam. In fact, they cannot. In most cases the spam still exists and still travels across the network. Unless users are willing to accept false positives, they still end up looking through the spam. Filters help us sort mail into spam and normal mail, but filtering does not stop spam; it merely "processes" it.
Despite its limitations, filtering is currently the most widely used anti-spam technology.
2.2 Verification queries
SMTP was not designed with security in mind. In 1973, computer security was not a real concern, and simply having a workable mail protocol was a great thing. For example, RFC 524 describes an early form of SMTP as an independent protocol:
"Although people can, or may be able to, design software based on this document, please comment on it as appropriate. Suggestions and questions are welcome. I firmly believe that there are still problems in the protocol, and I hope readers will point them out when reading this RFC."
Although the SMTP command set has evolved over a long time, implementations are still based on RFC 524, on the assumption that problems (such as security problems) would be solved later. As a result, flaws that originated in RFC 524 still existed in 2004. By then SMTP had become so widespread that it was hard to simply replace. Spam is one example of abuse of the SMTP protocol: most spam tools can forge message headers, forge the sender or hide the source.
Spam generally uses a forged sender address; very little spam uses a real one. Spammers forge mail for the following reasons:
* Because it is illegal. Sending spam is illegal in many countries, and forging the sending address helps the sender avoid prosecution.
* Because it is unpopular. Spammers know that spam is unwelcome, and forging the sender's address can reduce the backlash.
* Because of ISP restrictions. Most ISPs have terms of service that prohibit spam, and forging the sender's address reduces the chance of being cut off by the ISP.
Therefore, if we could tell forged mail from legitimate mail more intelligently, in the manner of a black and white list, the spam problem could be solved to a large extent; verification query technology starts from this idea. The sections below also analyze some major anti-spam technologies, such as those advocated and led by Yahoo!, Microsoft and IBM. Classifying them as reverse verification queries is not entirely accurate, but in a sense they are more elaborate verification queries.
2.2.1 Reverse query technology
Since spam relies on forgery, preventing forgery would eliminate a great deal of spam. To limit forged sender addresses, some systems require the sender's email address to be verified. These systems include:
* Reverse Mail Exchange (RMX)
* Sender Policy Framework (SPF)
* Designated Mailers Protocol (DMP)
These technologies are similar. DNS is a global Internet service that handles the translation between IP addresses and domain names. In 1986, DNS was extended with the mail exchange (MX) record. When sending mail, a mail server finds the server responsible for the recipient's domain by querying the domain's MX records.
By analogy with MX records, the reverse query solution defines reverse MX records (RMX for RMX, SPF for SPF and DMP for DMP) that are used to check whether the domain name claimed by a message and the IP address it came from actually correspond. The underlying idea is that forged mail does not really come from an address listed in the reverse MX record, so forgery can be detected.
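The check can be sketched as follows, with a plain dictionary standing in for the DNS lookup. The `REVERSE_MX` data, the function name `sender_is_authorized` and the three-way result are assumptions for illustration; a real system would query DNS and follow the record format of RMX, SPF or DMP.

```python
import ipaddress

# Stand-in for DNS: hypothetical reverse MX data published by each domain.
REVERSE_MX = {
    "example.com": ["192.0.2.0/24"],
    "example.org": ["198.51.100.10/32", "198.51.100.11/32"],
}

def sender_is_authorized(mail_from: str, client_ip: str):
    """Return True/False if the domain publishes a record, None if it does not."""
    domain = mail_from.rpartition("@")[2].lower()
    networks = REVERSE_MX.get(domain)
    if networks is None:
        return None   # nothing published: the check cannot decide
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)

print(sender_is_authorized("alice@example.com", "192.0.2.55"))    # True
print(sender_is_authorized("alice@example.com", "203.0.113.9"))   # False: likely forged
print(sender_is_authorized("bob@example.net", "203.0.113.9"))     # None
```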
2.2.2 DKIM technology
DKIM (DomainKeys Identified Mail) is based on Yahoo's DomainKeys authentication technology and Cisco's Identified Internet Mail.
Yahoo's DomainKeys uses public-key cryptography to verify the sender of an email. The sending system generates a signature and inserts it into the message header, and the receiving system verifies the signature with a public key published in DNS. Cisco's technology also uses cryptography, but it ties the signature to the message itself: the sending server signs the message and inserts both the signature and the public key used to generate it into a new header, and the receiving system verifies that the public key used to sign the message is authorized for the sender's address.
DKIM combines the two verification systems. It verifies signatures with a public key published in DNS, in the same way as DomainKeys, and it uses Cisco's header signing technology to ensure consistency.
DKIM gives mail a mechanism for verifying both the sending domain and the integrity of each message. Once the domain has been verified, it is compared with the sender address in the message to detect forgery. If the address is forged, the message is probably spam or fraudulent and can be discarded. If it is not forged and the domain is known, the domain can build up a reputation, which can be tied into an anti-spam policy system, shared among service providers, or even presented directly to users.
Well-known companies, banks and similar organizations routinely send business email to their customers, so being able to confirm that such mail is genuine is very important; it helps resist phishing attacks.
The DKIM specification has now been submitted to the IETF, and the draft document is available for reference.
Implementation process of DomainKeys
The sending server goes through two steps:
1. Setup. The domain owner generates a public/private key pair used to sign all outgoing messages (multiple key pairs are allowed). The public key is published in DNS, and the private key is kept on the mail server that uses DomainKeys.
2. Signing. Whenever a user sends a message, the mail system automatically generates a signature using the stored private key. The signature becomes part of the message header, and the message is then delivered to the receiving server.
The receiving server verifies the signed mail in three steps:
1. Preparation. The receiving server extracts the signature and the sending domain (From:) from the message header, and then retrieves the corresponding public key from DNS.
2. Verification. The receiving server uses the public key obtained from DNS to verify the signature generated with the private key. This confirms that the mail really was sent by that domain and has not been modified.
3. Delivery. The receiving server applies its local policy to produce the final result. If the domain is verified and the other anti-spam tests do not flag the message, it is delivered to the user's inbox. Otherwise, the message is discarded or quarantined.
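The sign-and-verify flow can be illustrated with a toy example. This sketch uses textbook RSA built from two small known primes and a dictionary in place of DNS; it is far too weak for real use, it is not the actual DomainKeys header format or canonicalization, and all names (`PUBLISHED_KEYS`, `sign`, `verify`) are assumptions for illustration. It requires Python 3.8 or later for the modular inverse.

```python
import hashlib

# Setup: toy RSA key pair from two Mersenne primes (NOT secure).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent, kept on the mail server

# Stand-in for DNS: the domain owner publishes the public key here.
PUBLISHED_KEYS = {"example.com": (N, E)}

def _digest(from_addr: str, body: str, modulus: int) -> int:
    """Hash the From address and body, reduced to fit the toy modulus."""
    data = f"{from_addr}\n{body}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % modulus

def sign(from_addr: str, body: str) -> int:
    """Signing: the sending server signs with the stored private key."""
    return pow(_digest(from_addr, body, N), D, N)

def verify(from_addr: str, body: str, signature: int) -> bool:
    """Preparation and verification: fetch the public key, check the signature."""
    domain = from_addr.rpartition("@")[2].lower()
    key = PUBLISHED_KEYS.get(domain)
    if key is None:
        return False
    n, e = key
    return pow(signature, e, n) == _digest(from_addr, body, n)

sig = sign("news@example.com", "Your statement is ready.")
print(verify("news@example.com", "Your statement is ready.", sig))   # True
print(verify("news@example.com", "Send us your password.", sig))     # False
```

The delivery step is then a local policy decision based on the result of `verify` together with any other anti-spam tests.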
2.2.3 SenderID technology
In 2004, Gates confidently predicted that Microsoft would eliminate spam in the future. What he was counting on was Sender ID technology, but he has recently retracted the prediction. A standards dispute is also involved. Microsoft hoped that the IETF would adopt Sender ID as a standard and won a great deal of support, from Cisco, Comcast, IBM, Port25, Sendmail, Symantec, VeriSign and others, and also from AOL, which later defected. In the open source community, however, Microsoft did not get enough support, and the IETF finally rejected Microsoft's proposal.
Sender ID involves two sides: support by the sender and support by the receiver. On the sender's side there are mainly three parts. The sender must modify the DNS records for its mail server and add a specific Sender Policy Framework (SPF) record to declare its identity, for example "v=spf1 ip4:192.0.2.0/24 -all", which states that SPF version 1 is in use and that mail for the domain is sent from the 192.0.2.0/24 address range. Optionally, the sender's MTA supports adding an extension such as SUBMITTER to its outgoing SMTP dialogue, and adds headers such as Resent-Sender, Resent-From and Sender to its mail.
On the receiver's side, the receiving mail server must use Sender ID checking to examine the PRA or MAIL FROM of each received message, query the Sender Policy Framework record in the sender's DNS, and thereby verify the sender's identity.
With Sender ID in place, the whole process is as follows:
Step 1: the sender writes an email and sends it;
Step 2: the email is transferred to the receiving mail server;
Step 3: the receiving mail server checks the identity claimed by the sender using Sender ID (through a specific DNS query);
Step 4: if the identity claimed by the sender matches the sending address, the email is accepted; otherwise a specific action is taken, such as rejecting the email outright or treating it as spam.
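Steps 3 and 4 can be sketched for a record such as "v=spf1 ip4:192.0.2.0/24 -all". The sketch below handles only the `ip4:` and `-all` terms and takes the record text as a parameter instead of querying DNS; the function name and the simplified result strings are assumptions for illustration, not a complete SPF evaluator.

```python
import ipaddress

def check_spf_record(record: str, client_ip: str) -> str:
    """Evaluate a very small subset of an SPF record against a client address."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"                     # not an SPF version 1 record
    addr = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        lowered = term.lower()
        if lowered.startswith("ip4:"):
            # The sender is authorized if it falls inside a listed range.
            if addr in ipaddress.ip_network(term[4:], strict=False):
                return "pass"
        elif lowered == "-all":
            return "fail"                 # everything not matched so far is rejected
    return "neutral"                      # the record made no statement

record = "v=spf1 ip4:192.0.2.0/24 -all"
print(check_spf_record(record, "192.0.2.10"))     # pass: accept the mail
print(check_spf_record(record, "198.51.100.5"))   # fail: reject or treat as spam
```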
In fact, Sender ID is not a cure-all for spam. It only addresses the question of where mail comes from and cannot, in essence, tell whether a message is spam. For example, spammers can register cheap domain names to send spam, and technically everything conforms to the specifications. Spammers can also relay spam through vulnerabilities in other people's mail servers, which Sender ID cannot prevent either.
2.2.4 FairUCE technology
FairUCE (Fair Use of Unsolicited Commercial Email) was developed by IBM. It uses identity management tools built in at the network level and filters and blocks spam by analyzing email domain names.
FairUCE links a received email to the IP address of its source: it tries to establish a relationship between the email address, the email domain and the computer that sent the message in order to determine whether the message is legitimate, for example by using SPF or other methods. If a relationship can be found, it checks the recipient's black and white lists and the reputation of the domain to decide what to do with the mail, such as accepting or rejecting it.
Another function of FairUCE is to trace spam back to its source and return the delivered spam to that source, thereby striking back at spammers. This has both advantages and disadvantages. The advantage is that it can degrade the performance of the spam source; the disadvantage is that it may disrupt the normal work of legitimate servers (such as servers that have been exploited), and it also duplicates a large amount of spam traffic.
2.2.5 Limitations and shortcomings
These solutions are useful to some extent, but they also have shortcomings:
* Domains without a host, or empty domains
The reverse query method requires that mail come from a known, trusted mail server with a matching IP address (the reverse MX record). However, most domain names do not actually correspond to completely static IP addresses. Individuals and small companies often want their own domain names, but they do not have enough IP addresses to meet this requirement. Domain registrars such as GoDaddy provide free mail forwarding for customers who have no host or only an empty domain name, but this forwarding service handles only incoming mail and does not provide a way to send mail.
The reverse query solution causes problems for users who have no host or only an empty domain name:
No reverse MX record. These users can currently configure their mail clients to send mail using their registered domain names. However, a reverse query for the IP address of the sender's domain will find nothing, especially for mobile or dial-up users whose IP addresses change frequently.
Unable to send mail. One way around this problem is to relay mail through the ISP's server, which can provide a reverse MX record. However, ISPs do not currently allow relaying when the sender's domain name differs from the ISP's own.
In both cases, mail from these users will be intercepted by the reverse query system.
* Legitimate domain names
An identity that can be verified is not necessarily a legitimate one. For example, spammers can register cheap domain names to send spam, and technically everything conforms to the specifications. In addition, many spammers currently get into legitimate mail systems through vulnerabilities in other people's mail servers and relay their spam from there. Verification queries cannot solve these problems.
2.3 Challenges
Spammers can generate millions of emails a day with automated sending software. Challenge techniques hinder bulk senders by slowing down mail processing, while normal users who send only a small amount of mail are not significantly affected. However, challenge techniques succeed only while few people use them. If they became more popular, people would probably worry more about their effect on mail delivery than about how well they hinder spam.
There are two main types of challenge: challenge-response, and proposed computational challenges.
2.3.1 Challenge-response
A challenge-response (CR) system keeps a list of authorized senders. Messages from new senders are held temporarily and are not delivered immediately.
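A minimal sketch of such a system is shown below. The text above only states that an authorized list is kept and that mail from new senders is held; how a sender becomes authorized is assumed here to be by returning a one-time token, and the class and method names are invented for illustration.

```python
import secrets
from collections import defaultdict

class ChallengeResponseGate:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, authorized):
        self.authorized = {a.lower() for a in authorized}
        self.pending = defaultdict(list)   # sender -> held messages
        self.challenges = {}               # token -> sender awaiting a reply

    def receive(self, sender, message):
        """Return ('delivered', None) or ('held', token_or_None)."""
        sender = sender.lower()
        if sender in self.authorized:
            return "delivered", None
        first_contact = sender not in self.pending
        self.pending[sender].append(message)
        if not first_contact:
            return "held", None            # a challenge is already outstanding
        token = secrets.token_hex(8)       # assumed: sent back to the sender
        self.challenges[token] = sender
        return "held", token

    def answer(self, token):
        """Authorize the sender who answered and release their held mail."""
        sender = self.challenges.pop(token, None)
        if sender is None:
            return []
        self.authorized.add(sender)
        return self.pending.pop(sender, [])

gate = ChallengeResponseGate(["friend@example.org"])
status, token = gate.receive("new@example.com", "Hello, first message")
print(status)                 # held
print(gate.answer(token))     # ['Hello, first message']
print(gate.receive("new@example.com", "Second message")[0])   # delivered
```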