How to Set Up Anti-Spam Filters Using Regular Expressions the Smart Way

To fish out spammers’ email addresses and other identifying information, you can use your mail clients’ log files, junk email, or both. Here is how to get started.

Collect historical data on mail processing

If you trust your users to catch at least some of the spam, you can collect spam activity logs, legitimate mail, and discarded junk mail from their mail clients to try to set up server-side mail processing filters with high accuracy.

Collecting spam activity logs from mail clients

Spam activity logs can be useful in designing blacklists to allow your mail client to identify and appropriately process unwanted mail.

For Postbox, the spam activity log is a file named junklog.html. It is conveniently located in the root of the directory that hosts the user’s Postbox profile. Use the Postbox Profile Manager to figure out the exact location of each profile and the corresponding logs. On Windows, users of Postbox can simply open Run and enter:

postbox.exe -ProfileManager

In the window that follows, Postbox will show the location of the profile in a pop-up tooltip.

Postbox Profile Manager: the path containing the profile leads directly to the Spam log

Collecting discarded junk messages from email clients

Alternatively, you can collect junk messages in a temporary directory on your local computer in order to extract spammers’ identifying information from it. (Just make sure your internet security software is up and running.)

Collecting legitimate mail to set up whitelists

Mail clients allow you to save individual messages (usually as separate .eml files). This way, you can extract and whitelist legitimate mail addresses to prevent Postfix from classifying them as spam.

Concatenate files

In order to concatenate files on Windows, start Command Prompt and enter:

copy /a *.html concatenated-logs.txt

The system will concatenate all files with the file extension .html (this includes Postbox’s log files in .html) in the current directory. For .eml files (individual junk messages) use:

copy /a *.eml concatenated-messages.txt

Depending on how many files you are dealing with, it can take some time. The resulting file will be saved in the same directory.
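On systems without Command Prompt, the same concatenation can be sketched in Python. This is a minimal illustration, not part of the original workflow: the function name concatenate and its parameters are assumptions made for the example. Files are copied byte for byte so that the original line ends survive.

```python
from pathlib import Path


def concatenate(source_dir, pattern, dest_name):
    """Append every file in source_dir that matches pattern to one output file."""
    source = Path(source_dir)
    dest = source / dest_name
    # Collect the inputs first so the output file is never read as an input.
    inputs = sorted(p for p in source.glob(pattern) if p.is_file() and p != dest)
    with dest.open("wb") as out:
        for path in inputs:
            data = path.read_bytes()  # bytes in, bytes out: line ends are untouched
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")  # keep the next file from starting mid-line
    return dest, len(inputs)
```

Calling concatenate(".", "*.eml", "concatenated-messages.txt") would mirror the second copy command above; the return value reports the output path and the number of files read.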

Extract spammers’ email addresses

There are several ways you can teach Postfix to recognize and handle spam (read about types of smtpd access restrictions in Anti-Spam Defense: Using Postfix With smtpd Access Restrictions).

Valid email addresses should conform, at least in theory, to a standardized syntax that can be verified using a regular expression. Don’t bother verifying standards compliance, though. Why should spammers conform to anything if all they want is to spam you? What you really want is a RegEx pattern that will tolerate unusual characters such as forward slashes and underscores.

In order to simplify dealing with line-end encoding, you may want to use the command “Edit > EOL Conversion” in Notepad++ to conform it to a standard of your choice. LF is represented by \n. CR/LF is represented by \r\n. In any case, be mindful of the appropriate line-end encoding.
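If you prefer to normalize line ends outside the editor, a short Python helper can do the same job. This is a sketch under assumptions: the name normalize_eol is made up for the example, and the file has to be read with newline="" so that Python does not translate line ends on its own.

```python
import re


def normalize_eol(text, eol="\r\n"):
    """Rewrite CR/LF, lone CR, and lone LF line ends as the chosen eol."""
    # \r\n is listed first so a CR/LF pair counts as one line end, not two.
    return re.sub(r"\r\n|\r|\n", lambda match: eol, text)


print(repr(normalize_eol("one\r\ntwo\nthree")))
```

Pass eol="\n" to convert to LF instead of CR/LF.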

Make sure you are using appropriate line ending for your system

To strip the file down to the relevant information, place each email address on its own line, bookmark the relevant lines with a regex, and then remove every line that has not been bookmarked.

Step 1. Mark email addresses using a regular expression and inspect the accuracy of the results

First, you need to figure out how to capture spammers’ email addresses using a regex that best matches the characteristics of your input file. When performing replacements in order to separate email addresses from the remaining text, a more tolerant regular expression means a lower accuracy and a higher need for manual post-processing.

WARNING: A regular expression that works with the Mark feature in Notepad++ can fail when used in a replacement. In order to find email addresses in concatenated mail messages containing a pattern like this one:

<SportUtilityVehicles-@whatever.net.in>; Sun, 15 May ... envelope-from <yadayada@whatever.tld>)

you could use roughly the following regular expression (positive lookbehind, followed by the actual email pattern, followed by a positive lookahead):

(?<=[<])(\S*?(?=[@]))@(.+?(?=[>]))

Using a positive lookbehind and a positive lookahead lets you correctly mark anything resembling an email address enclosed in <> with the Search > Mark… feature in Notepad++. You could refine the first and second capture group for more accuracy, but unfortunately, this regular expression will fail when used in a Replace operation in Notepad++. For replacements, you need a different expression.
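To see what the lookbehind/lookahead pattern captures, you can try it on the sample line in Python. Treat this as an approximation: Python’s re module is a different regex engine from the one in Notepad++, and the helper name find_bracketed is an assumption made for the example.

```python
import re

# The Mark pattern from the text: lookbehind for "<", lookahead for ">".
BRACKETED = re.compile(r"(?<=[<])(\S*?(?=[@]))@(.+?(?=[>]))")

sample = ("<SportUtilityVehicles-@whatever.net.in>; Sun, 15 May ... "
          "envelope-from <yadayada@whatever.tld>)")


def find_bracketed(text):
    """Return every match; the angle brackets themselves are not included."""
    return [match.group(0) for match in BRACKETED.finditer(text)]


print(find_bracketed(sample))
# ['SportUtilityVehicles-@whatever.net.in', 'yadayada@whatever.tld']
```

Because the brackets are matched by lookarounds only, they stay out of the result, which is what makes the pattern convenient for marking.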

Begin by capturing From: email addresses:

^(From:.+)(<)(([A-Za-z0-9\-\_\/\.\=\:\%\!\'\–\&\+]*)@(([A-Za-z0-9\-\_\/\=\.)*([A-Za-z0-9\-\_\/\=]+)\.?))(>)

with this replacement (in a CR/LF file):

$1$2\r\n$3\r\n$7

in order to split From: and the sender’s mail account name from the actual email address. When you run this replacement, Notepad++ should place each email address on its own line.

Next, you want to remove all contents of the file except for the sender’s (spammer’s) email that’s now sitting on its own separate line. Here is how to do it.

Step 2. Find and bookmark lines that contain spammers’ email addresses

Enter this expression in Find what (in a CR/LF file):

(?<=(<\r\n))(([A-Za-z0-9\-\_\/\.\=\:\%\!\'\–\&\+]*)@(([A-Za-z0-9\-\_\/\=\.)*([A-Za-z0-9\-\_\/\=]+)\.?))$

Switch over to the Mark tab, activate the Bookmark line option, click Clear all marks if you haven’t already, and hit Mark All.

Step 3. Inspect the result

Inspect a couple of random bookmarks for accuracy; the number of bookmarked lines should equal the number of replacement operations you made when adding newline characters to put the email addresses on their separate lines.

Step 4. Remove everything except for bookmarked lines

Select Search > Bookmark > Remove Unmarked Lines.

Spammers’ emails in Notepad++: you may be tempted to block some tlds (such as .pro) completely; on the other hand, make sure you don’t capture legitimate addresses (for example your own), generic user names (info@), or domain names of legitimate email services (ymail.com)

Make sure that all lines in the file containing the email addresses you collected conform to this regex (line end CR/LF):

^([A-Za-z0-9\-\_\/\.\=\:\%\!\'\–\&\+]*)@(([A-Za-z0-9\-\_\/\=\.)*([A-Za-z0-9\-\_\/\=]+)\.?)(.+)\r\n
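The four steps above can also be approximated in a few lines of Python. Note that the pattern below is a deliberately simpler stand-in written for this sketch, not the Notepad++ expression from Step 1, and the names FROM_ADDRESS and extract_from_addresses are assumptions.

```python
import re

# Hypothetical simplified pattern: a From: header, an optional display name,
# then an address in angle brackets. Only the address is captured.
FROM_ADDRESS = re.compile(r"^From:[^\r\n<]*<([^<>@\s]+@[^<>@\s]+)>", re.MULTILINE)


def extract_from_addresses(text):
    """Return the bracketed address of every From: header, in file order."""
    return FROM_ADDRESS.findall(text)


messages = (
    "Received: by mx.example\r\n"
    "From: Cheap Cars <SportUtilityVehicles-@whatever.net.in>\r\n"
    "To: <me@example.org>\r\n"
    "From: yada <yadayada@whatever.tld>\r\n"
)
print("\n".join(extract_from_addresses(messages)))
```

Recipients in To: headers are skipped because the pattern is anchored to lines that start with From:. As with the editor workflow, spot-check the output before trusting it.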

Find email addresses in spam logs based on a regular expression and separate spammers’ addresses by placing each on its own line

In order to extract email addresses from the html log of Postbox that looks like this:

&lt;SportUtility%Vehicles-@whatever.net.in&gt;

you could use this regular expression:

([A-Za-z0-9\-\_\/\.\=\:\%\!\'\–\&\+]+)@(([A-Za-z0-9\-\_\/\=)([A-Za-z0-9\-\_\/\=]+)\.?)(.+)(?=&)

If you were to insist on a positive lookbehind (not supported in JavaScript engines that predate ES2018), you might try this:

(?<=&lt;)([A-Za-z0-9\-\_\/\.\=\:\%\!\'\–\&\+]+)@(([A-Za-z0-9\-\_\/\=)([A-Za-z0-9\-\_\/\=]+)\.?)(.+)(?=&)
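An alternative to matching the &lt; and &gt; entities directly is to decode them first and then look for addresses in angle brackets. The following Python sketch takes that route; the pattern is a simplified assumption for this example rather than one of the expressions above, and addresses_from_html_log is a made-up name.

```python
import html
import re

# Hypothetical simplified pattern for an address wrapped in angle brackets.
ANGLE_ADDRESS = re.compile(r"<([^<>@\s]+@[^<>@\s]+)>")


def addresses_from_html_log(log_text):
    """Decode entities such as &lt; and &gt;, then collect bracketed addresses."""
    return ANGLE_ADDRESS.findall(html.unescape(log_text))


print(addresses_from_html_log("&lt;SportUtility%Vehicles-@whatever.net.in&gt;"))
# ['SportUtility%Vehicles-@whatever.net.in']
```

Decoding first means one pattern can serve both the HTML log and plain-text messages, at the cost of also exposing real HTML tags to the search.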

Mark email addresses and inspect the accuracy of the regular expression

To verify that the regular expression of your choice is doing what you intend it to do, open the log file in Notepad++ (or another similarly capable code editor) and use the command Search > Mark… to open the Mark dialog. In Find what:, enter:

(([A-Za-z0-9\-\_\/\.\=\:\%\!\'\–\&\+]+)@(([A-Za-z0-9\-\_\/\=)([A-Za-z0-9\-\_\/\=]+)\.?)(.+))(?=&)

activate the Bookmark line option, then hit Mark All. Notepad++ will highlight all occurrences of a text string that conforms to the pattern you specified. Inspect the marks as thoroughly as seems reasonable to ensure that the regular expression is doing a decent job of finding spammers’ addresses without capturing the recipients’.

Separate spammers’ addresses by placing each on its own line

In order to separate spammers’ email addresses from the rest of the document so that each sits on its own line, switch to the Replace tab in Notepad++ and keep the expression from the previous step in Find what. In Replace with, enter:

\n$1\n

and hit Replace All in order to add a line break both before and after each occurrence of the first capture group. (Clear all marks, if necessary.)

Mark spammers’ email addresses with the Bookmark line option activated to define lines to retain

In Notepad++, use the Mark feature with the Bookmark line option activated to highlight the following pattern:

^([A-Za-z0-9\-\_\/\.\=\:\%\!\'\–\&\+]+)@(([A-Za-z0-9\-\_\/\=)([A-Za-z0-9\-\_\/\=]+)\.?)(.+)\n

(in a file with an LF line end, \n will cause line break as it represents the newline character; in files with CRLF line end encoding, use \r\n instead, where \r represents carriage return and \n a new line).

Remove unmarked lines and clean up the results

Using the command Search > Bookmark > Remove Unmarked lines, clear the document of all noise, leaving (hopefully) only spammers’ emails in place. Save the file, then inspect it again carefully. When you notice something amiss, make the necessary corrections. For example, you may want to search for a double forward slash (as in imap://), a mailto:, @font-face, @-ms-viewport or a space, and see if any unusual remnants of the original log file come up. If they do, see whether the “offending” lines contain duplicates of information you already have; if so, you could simply mark these lines and remove them (it’s usually easier than cleaning them up or further tweaking your original regex). It all depends on how much data you are dealing with.

The art of whitelisting: permit legitimate email addresses

Probably the easiest way to do it involves concatenating emails that are known to be legitimate, capturing domain names from email addresses, and whitelisting them with the OK directive, with regex:

/.*\.*client1\.tld$/     OK

This will permit any email address and any hostname associated with the domain client1.tld to pass the checks.
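Before relying on a whitelist pattern, it helps to check what it lets through. The Python sketch below tests the pattern from the text against a few made-up senders. Python’s re module only approximates how Postfix evaluates regexp tables, and the stricter variant STRICT is a hypothetical alternative, not something the original rule set contains.

```python
import re

# The whitelist pattern from the text, without the surrounding slashes.
# Postfix regexp tables match case-insensitively by default, hence IGNORECASE.
WHITELIST = re.compile(r".*\.*client1\.tld$", re.IGNORECASE)

# Hypothetical stricter variant: the domain must start the string
# or directly follow "@" or ".".
STRICT = re.compile(r"(^|[@.])client1\.tld$", re.IGNORECASE)

for sender in ("joe@client1.tld", "joe@mail.client1.tld", "joe@notclient1.tld"):
    print(sender, bool(WHITELIST.search(sender)), bool(STRICT.search(sender)))
```

The last sender shows the trade-off: the original pattern accepts any string that ends in client1.tld, including a look-alike domain, while the stricter variant does not.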

Block the biggest offenders on a top-level domain basis (yes, it’s cruel)

This may sound controversial and it is indeed a sign of desperation, but after sifting through millions of lines containing email addresses of spammers you may come to the same conclusion: spammers like some top level domains more than others, and they seem to be alone in their passion for them. We have yet to see a legitimate user of a domain that ends in .xyz or .date. Depending on your business, your mileage may vary, but if your particular situation allows you to block spammy senders on a tld level, then by all means, do it. We would caution you, however, against blocking country-specific tlds such as .ru or .co, regardless of how spammy they are, as a matter of fairness towards legitimate users in those geographic locations.

In order to block a tld using a regex, use this syntax in your access maps with either REJECT (and an optional message that gets delivered with the bounce) or DISCARD (and an optional note to yourself that gets logged by your mail server):

/\.usa\.cc$/     REJECT
/\.xyz$/     REJECT

Once this general block is in place, you no longer need to bother sifting through email addresses that use these domain names; when you discard them from your collection of spammers’ emails you get to focus on those that cannot be blocked this way (such as, most notably, .com).

Bookmark all lines that you no longer need using a regular expression like this one:

\.(click|bid|club|date|download|faith|link|pro|racing|review|science|space|top|usa\.cc|website|work|xyz)$

and remove all bookmarked lines from the file. This leaves you with plenty of other tlds to deal with.
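The same clean-up can be sketched in Python for large files. The suffix list is taken from the expression above; the names BLOCKED_TLDS and drop_blocked are assumptions for the example. The sketch requires a literal dot before each suffix, so an address ending in .network is not caught by work.

```python
import re

# Suffixes from the text that are already blocked wholesale on the server.
BLOCKED_TLDS = [
    "click", "bid", "club", "date", "download", "faith", "link", "pro",
    "racing", "review", "science", "space", "top", "usa.cc", "website",
    "work", "xyz",
]

# re.escape turns the dot in "usa.cc" into a literal dot.
BLOCKED = re.compile(
    r"\.(" + "|".join(re.escape(tld) for tld in BLOCKED_TLDS) + r")$",
    re.IGNORECASE,
)


def drop_blocked(addresses):
    """Keep only the addresses whose domain is not covered by a tld block."""
    return [a for a in addresses if not BLOCKED.search(a.strip())]


print(drop_blocked(["a@spam.xyz", "b@x.usa.cc", "c@example.com", "d@isp.network"]))
# ['c@example.com', 'd@isp.network']
```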

Filter for spammy email usernames and remove duplicates

Sort the collected email addresses (Edit > Line Operations > Sort Lines […]) and write regexes for the user names. Once you have entered offensive user names into your configuration file, you can strip them from the email addresses (username@) in your work file in Notepad++.

To remove user names from email addresses leaving only hostnames in place, search for:

(.+)@(.+)

and replace with:

$2

Extract domains with known two-part top level domain extensions, for example: in.net, co.za or co.uk, and save them in a separate file to process manually; otherwise you would block all domains with these extensions! Mark all lines containing the pattern:

(.+)\.(.+)\.(.+)

and remove host names with a tld that contains a dot (co.uk or com.br or co.in). In the remaining file, strip the host names using this replacement pattern:

$2\.$3

Remove duplicates by selecting the contents of your file (Ctrl-A) and using the command “TextFX > TextFX Tools > Sort lines case insensitive (at column)” with option “Sort outputs only UNIQUE (at column)” activated (TextFX is an extension to Notepad++).
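If TextFX is not available, stripping user names and removing duplicates can be sketched in Python. The function names and the TWO_PART_SUFFIXES tuple are assumptions for this example; the tuple only holds the two-part extensions mentioned above and would need to be extended for real data.

```python
# Two-part extensions named in the text; extend this for your own data.
TWO_PART_SUFFIXES = ("in.net", "co.za", "co.uk", "com.br", "co.in")


def unique_hosts(addresses):
    """Drop the user name, lowercase the host, and return sorted unique hosts."""
    hosts = set()
    for line in addresses:
        # Like the greedy (.+)@(.+), split at the last "@" on the line.
        user, at, host = line.strip().rpartition("@")
        if at and user and host:
            hosts.add(host.lower())
    return sorted(hosts)


def registered_domain(host):
    """Reduce a host name to its domain, keeping three labels for two-part tlds."""
    labels = host.lower().split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_PART_SUFFIXES else 2
    return ".".join(labels[-keep:])


hosts = unique_hosts(["a@Mail.Spam.co.uk", "b@mail.spam.co.uk", "c@host.somedomain.tld"])
print(sorted({registered_domain(h) for h in hosts}))
# ['somedomain.tld', 'spam.co.uk']
```

Keeping three labels for known two-part extensions avoids the trap described above, where reducing mail.spam.co.uk to co.uk would block every domain under that extension.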

It is time for you to turn your attention to the domain names.

Step 6. Filter offending domain names

In order to capture future spam from domains that have spammed you before, allow for any hostname on those domains to be rejected. This way, whoever wants to spam you again needs a new domain name and that means additional expenses.

In Find what, enter:

^([A-Za-z0-9\-]*\.)*([A-Za-z0-9\-]+)\.([a-z]+)$

In Replace with, enter:

/.*\\.*$2\\.$3\$/     DISCARD sender rejected

This turns host.somedomain.tld into

/.*\.*somedomain\.tld$/    DISCARD sender rejected

Perform another search and replace to escape every hyphen (-) as \- (use the Normal mode for this operation).
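The rewrite from host name to access rule, including the hyphen escaping, can be sketched in Python as follows. The host pattern is the Find what expression above; the function name to_rule and the lowercasing of the input are assumptions added for the example.

```python
import re

# The Find what expression from the text: optional host labels, domain, tld.
HOST = re.compile(r"^([A-Za-z0-9\-]*\.)*([A-Za-z0-9\-]+)\.([a-z]+)$")


def to_rule(host, action="DISCARD sender rejected"):
    """Turn a host name into a regexp access rule, or None if it does not parse."""
    match = HOST.match(host.strip().lower())
    if match is None:
        return None
    domain = match.group(2).replace("-", "\\-")  # escape hyphens, as in the text
    tld = match.group(3)
    return "/.*\\.*" + domain + "\\." + tld + "$/     " + action


print(to_rule("host.somedomain.tld"))
# /.*\.*somedomain\.tld$/     DISCARD sender rejected
```

Lines that return None are worth reviewing by hand, since they did not look like a plain host name.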

Step 7. Trust but verify

To verify how your rules are being applied, you can run this test on the command line of your server (with the path to the list of access restrictions you want to test):

postmap -q "teststring" regexp:/etc/postfix/access_maps/regex_access_sender
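For a rough dry run before the file reaches the server, the lookup can be imitated in Python. This is only an approximation under stated assumptions: Python’s re module is not the regex library Postfix uses, flags after the closing slash are ignored, and load_rules and lookup are made-up names. The result from postmap -q on the server remains the authoritative check.

```python
import re

# One rule per line: /pattern/flags, whitespace, then the action.
RULE = re.compile(r"^/(.*)/([a-z]*)\s+(.*)$")


def load_rules(lines):
    """Parse rule lines, skipping blank lines and # comments."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match:
            pattern, _flags, action = match.groups()
            # Case-insensitive matching, as regexp tables do by default.
            rules.append((re.compile(pattern, re.IGNORECASE), action))
    return rules


def lookup(rules, sender):
    """Return the action of the first rule that matches, or None."""
    for pattern, action in rules:
        if pattern.search(sender):
            return action
    return None


rules = load_rules([
    "# whitelist first",
    "/.*\\.*client1\\.tld$/     OK",
    "/\\.xyz$/     REJECT",
])
print(lookup(rules, "joe@client1.tld"), lookup(rules, "bob@spam.xyz"))
```

Because the first matching rule decides the outcome, whitelist entries belong above the blocking rules in the file.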

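Now you also have to make sure that Postfix actually reads this list. To do so, reference the file from one of the smtpd restriction lists in main.cf, for example by adding check_sender_access regexp:/etc/postfix/access_maps/regex_access_sender to smtpd_sender_restrictions, and then reload Postfix.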

When you are all done, have a quick look at current log activity:

tail -f /var/log/maillog