Lookalike Modeling

Finding groups of people (audiences) who look and act like your other readers


Lookalike Modeling is a Cxense feature that helps you extend segments in the data management platform (DMP) to larger, higher-value audiences. Your advertisers can buy higher-quality segments for more effective ad spend on your inventory and, as a result, your readers are happier.

Using lookalike modeling, you can find readers who are similar to the members of a particular segment and extend your reach. For example, you may wish to target readers similar to those who already bought a subscription. Or you may want to start from a small set of known readers based on 1st party data and add a larger set of anonymous but similar readers.

Example - Extending Reach Using Lookalike Modeling

Say you have had 24 unique visitors in the last 31 days:


Just 3 of these 24 visitors signed up for your product updates. These readers form a specific audience segment. You can base the segment on 1st party data from your CRM system or on DMP events that identify the readers who visited a particular URL after signing up:


You want to run a targeted campaign to sell more of this product, but 3 readers are not enough to get meaningful results from an advertising campaign. To extend the reach of the campaign, you enable lookalike modeling on the original segment, with a fraction setting of 21%. This instructs the Cxense DMP to find the 21% of the total audience that is most similar to the readers in your original segment.

Our machine learning models are trained to detect patterns in the behavior and interests of those 3 readers and apply them to the rest of the audience to find similar readers. The output is a segment of 5 readers that are similar to the original 3, with no overlap.

As a result of the lookalike modeling, you get 8 readers to target with your campaign. You get to increase both your reach and sales.
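The arithmetic behind this example can be checked with a short sketch. The function name `lookalike_size` is hypothetical, and rounding to the nearest whole reader is an assumption; the article does not say how the DMP rounds.

```python
def lookalike_size(total_audience: int, fraction_percent: float) -> int:
    """Number of readers tagged as lookalikes for a given fraction setting.

    Assumption: the result is rounded to the nearest whole reader.
    """
    return round(total_audience * fraction_percent / 100)


total_audience = 24     # unique visitors in the last 31 days
original_segment = 3    # readers who signed up for product updates
fraction_percent = 21   # fraction setting used in the example

lookalikes = lookalike_size(total_audience, fraction_percent)
# The two segments do not overlap, so their sizes simply add up.
campaign_reach = original_segment + lookalikes

print(lookalikes, campaign_reach)  # prints: 5 8
```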


To target your audience with the ad campaign, you need to use both the original segment and the lookalike segment in an ad server. If you want to run a targeted campaign without involving external ad servers, use the Cxense recommendation technology. Deliver tailored recommendations on your site(s) to users in those segments.

Key Features

✅ Instant feedback on segment modeling based on events from the last month, filling the lookalike segment to the requested size in one go

✅ Segment modeling using a broad set of data sources (e.g., content consumption, site behavior, and offline data)

✅ No overlap between original segments and lookalike segments

✅ Define specific inclusion criteria for each segment. This helps find the right balance between segment quality (lower inclusion %) and reach (higher inclusion %)

✅ Constant monitoring of the lookalike modeling precision to guarantee the highest possible quality

Lookalike Model Accuracy

Cxense regularly and automatically evaluates the lookalike model precision. The evaluation process can be described as follows:

First, the evaluation algorithm splits the set of segment members into two parts:

  • A subset for training the models
  • A test set, together with a sample of non-members

The models predict lookalikes from the users in the test set. Model precision is the share of predicted lookalikes that are actual segment members. All precision scores are compared to a baseline of random sampling. The precision scores are tracked continuously to ensure the high quality of the lookalike modeling.

To balance quantity against quality, set an appropriate percentage (1-20% is fine): the higher the percentage, the less similar the added users are to the original segment.
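As an illustration of this evaluation, here is a minimal sketch of a precision score and its random-sampling baseline. The user IDs, the predicted set, and the function names are made up for the example, and the "lift" ratio at the end is one common way to compare against a baseline, not necessarily the comparison Cxense uses.

```python
def precision(predicted_lookalikes: set, actual_members: set) -> float:
    """Share of predicted lookalikes that are actual (held-out) segment members."""
    if not predicted_lookalikes:
        return 0.0
    return len(predicted_lookalikes & actual_members) / len(predicted_lookalikes)


def random_baseline(test_pool: set, actual_members: set) -> float:
    """Expected precision when users are picked from the test pool at random."""
    if not test_pool:
        return 0.0
    return len(test_pool & actual_members) / len(test_pool)


# Hypothetical test pool: 2 held-out segment members plus 8 sampled non-members.
held_out_members = {"u1", "u2"}
non_members = {f"n{i}" for i in range(8)}
test_pool = held_out_members | non_members

# Hypothetical model output: four users predicted as lookalikes.
predicted = {"u1", "u2", "n0", "n1"}

model_precision = precision(predicted, held_out_members)  # 2 of 4 predictions are members
baseline = random_baseline(test_pool, held_out_members)   # 2 of 10 users are members
lift = model_precision / baseline                         # how much better than random

print(model_precision, baseline)
```

A lift above 1 means the model finds held-out members more often than random sampling would.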

Specify "negative segments" whenever it makes sense, for example:

  • A lookalike for women: Use men as the negative segment
  • A lookalike for age: Use other age segments as negative segments

Getting Started With Lookalike Modeling

Lookalike Modeling for audience segments can be enabled directly from the Cxense DMP UI.

To create the lookalike segments with the highest performance, keep in mind the following:

The best candidates for the lookalike modeling are the segments that are:

  • Narrowly defined (matching a small subset of users)
  • Based on 1st party data or directly observable event characteristics

For the ML models to have enough data to train on, a segment must contain at least 100 active users who, in the last 31 days, accessed URLs that are classified as article (not frontpage).

Tips for improving quality

Choose the right segments for Lookalike Modeling

The best segments to enable Lookalike Modeling for in the DMP are narrowly defined segments that match a small part of the audience. The segment filters should be based on directly observable event characteristics (for example, specific URLs) or on 1st party data you upload from your CRM systems. It is perfectly fine if the segment only matches a small part of the total audience. The behavior of a small, narrowly defined set of users is more likely to contain interesting and valuable patterns that the ML models can extract and apply to find similar users. Conversely, the bigger the original segment is, the more the behavior of its users resembles the average behavior of the total audience. So keep the definition of the original segment as narrow as you can while still including the properties that you are after.

Set an appropriate fraction (1-20%), not too high

The fraction setting is a percentage that determines how many users are tagged as lookalikes. Setting the fraction is a quality/quantity tradeoff. The ML models produce ranked lists of users, with the most similar users at the top of the list. The backend system includes similar users from the top of the lists until the fraction setting is satisfied. If you increase the fraction setting, the quantity of users increases at the expense of the overall quality of the results (as you then include users that are less similar).

An appropriate fraction setting is anything between 1% and 20%. Set the fraction based on how many users you need in the lookalike segment and how many users you have in your total audience. Anything above 20% is a high fraction setting and should be reserved for very specific use cases only. Check with Cxense if you are unsure about the appropriate fraction (percentage) to set.
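The tradeoff can be sketched as picking from the top of a ranked list. The similarity scores below are invented, `select_lookalikes` is a hypothetical helper, and the sketch assumes the fraction is taken from the total audience and rounded to a whole number of users.

```python
def select_lookalikes(scores: dict, members: set, fraction_percent: float) -> list:
    """Return the most similar non-members, up to the size set by the fraction.

    Assumption: the target size is fraction_percent of the total audience,
    rounded to the nearest whole user.
    """
    target = round(len(scores) * fraction_percent / 100)
    # Rank non-members from most to least similar; members are never lookalikes.
    ranked = sorted(
        (user for user in scores if user not in members),
        key=lambda user: scores[user],
        reverse=True,
    )
    return ranked[:target]


# Hypothetical similarity scores for a total audience of 10 users.
scores = {"a": 1.0, "b": 0.9, "c": 0.8, "d": 0.6, "e": 0.5,
          "f": 0.4, "g": 0.3, "h": 0.2, "i": 0.1, "j": 0.05}
members = {"a"}  # the original segment

narrow = select_lookalikes(scores, members, 20)  # ['b', 'c']
wide = select_lookalikes(scores, members, 40)    # ['b', 'c', 'd', 'e']

# The least similar user admitted gets less similar as the fraction grows.
print(scores[narrow[-1]], scores[wide[-1]])  # prints: 0.8 0.5
```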

Specify a "Negative Segment" where possible

A "negative segment" is a segment of users that are as dissimilar to the original segment as possible. For example, suppose you have 1st party data on gender from users that have logged into your site. With an original segment of women, the negative segment is the segment of men. Machine learning models can use this information to give better quality results, so if you have two segments that are natural opposites, use the /segment/lookalike/update API to set the "negative" segment. The ML models then look for users who are similar to the original segment and, at the same time, not similar to the users in the negative segment. In machine learning, these are called the positive and negative classes of training data. Training the ML models this way improves quality compared to the common case of having a positive class (the original segment) and an unknown class (the rest of the audience). If you don't have a negative segment, don't despair: our ML models still generate good quality results.
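To make the difference between the two training setups concrete, here is a small sketch of how training labels could be assembled. It is only an illustration: `build_training_labels` is a hypothetical helper, and treating the unknown class as label 0 when no negative segment exists is an assumption, not a description of the Cxense implementation.

```python
from typing import Optional


def build_training_labels(audience: set, positive: set,
                          negative: Optional[set] = None) -> dict:
    """Assign 1 to members of the original segment and 0 to the contrast class.

    With a negative segment, only its members serve as the contrast class.
    Without one, the rest of the audience (the unknown class) is used instead,
    which is an assumption made for this sketch.
    """
    labels = {}
    for user in audience:
        if user in positive:
            labels[user] = 1
        elif negative is None or user in negative:
            labels[user] = 0
        # Users outside both segments are skipped when a negative segment exists.
    return labels


# Hypothetical audience: two logged-in women, two logged-in men, two anonymous users.
audience = {"w1", "w2", "m1", "m2", "a1", "a2"}
women = {"w1", "w2"}
men = {"m1", "m2"}

with_negative = build_training_labels(audience, women, men)
without_negative = build_training_labels(audience, women)

print(len(with_negative), len(without_negative))  # prints: 4 6
```

With the negative segment, the contrast class contains only users known to be different. Without it, the contrast class also contains anonymous users who may in fact resemble the original segment.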

Bonus: Incorporating new patterns in the data

As long as Lookalike Modeling is enabled for a segment in the Cxense DMP, Cxense periodically retrains the machine learning models on your data and events. The set of similar users is recomputed every 24 hours on average, so the models automatically take new patterns in your dataset into account. There is nothing you need to do to ensure new patterns are included. Cxense R&D continuously monitors and improves the model performance to keep the quality high over time.

The Inner Workings

Advanced machine learning / AI models power the lookalike modeling in the Cxense DMP. The data fed to the models is obtained from:

  • Pageview events statistics
  • 1st party data
  • Content profiles

When lookalike modeling is enabled for a segment in the DMP, several machine learning models are trained on your data and events from the last 31 days. The trained models are then used to select the users most similar to those in the original segment.

We make sure there is no overlap between the original segment and the set of similar users. The number of similar users to annotate as lookalikes is limited to a percentage of the audience, the fraction setting, which you set when enabling lookalike modeling for a segment.

The machine learning models rank all users in your total audience (visitors to your site(s) in the last 31 days) according to how similar they are to the members of the original segment. The system then selects the most similar users as lookalikes and fills the lookalike segment up to the required volume (defined by your fraction setting). The number of lookalike users is shown in the "Lookalike Modelling" tab of the Segment Editor of the original segment.

Machine Learning Models

The machine learning models currently used in the Cxense Lookalike modeling are based on cosine similarity and logistic regression:

  • Represent each user as a word vector, based on consumed content
  • Use logistic regression to find the set of most significant words
  • Compute centroid (average vector) for all segment members
  • Compute similarity between non-members and the centroid
  • Output a ranked list of non-members, from most to least similar
  • Special cases: demographic properties (e.g., gender, age)
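The centroid and similarity steps above can be sketched in a few lines. This is a simplified illustration, not the Cxense implementation: the word counts are invented, the function names are hypothetical, and the logistic regression step that picks the most significant words is left out, so every word is used.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse word vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: list) -> dict:
    """Average vector of a non-empty list of sparse word vectors."""
    if not vectors:
        raise ValueError("at least one segment member is required")
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rank_non_members(user_vectors: dict, members: set) -> list:
    """Rank non-members from most to least similar to the segment centroid."""
    center = centroid([user_vectors[user] for user in members])
    candidates = [user for user in user_vectors if user not in members]
    return sorted(candidates,
                  key=lambda user: cosine(user_vectors[user], center),
                  reverse=True)


# Hypothetical word vectors built from the content each user consumed.
user_vectors = {
    "m1": {"football": 3, "transfer": 1},
    "m2": {"football": 2, "league": 2},
    "x": {"football": 4, "league": 1},
    "y": {"recipes": 5},
    "z": {"football": 1, "recipes": 3},
}
members = {"m1", "m2"}

ranking = rank_non_members(user_vectors, members)
print(ranking)  # prints: ['x', 'z', 'y']
```

User "x" reads mostly the same topics as the segment members and ranks first. User "y" shares no words with them, gets a similarity of 0, and ranks last.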

To learn more about how you can maximize the value of your hard-earned data, get in touch.
