Nick Little

Detecting PII with Amazon Macie

What is Amazon Macie?

In a nutshell, Amazon Macie is a managed AWS service that uses AI/ML and configured pattern-matching logic to help you detect personally identifiable information (PII) stored in Amazon S3 buckets. These can be publicly accessible S3 buckets, or private S3 buckets in your AWS account(s), provided the correct IAM service role is configured.

Common tasks include checking uploaded files for PII such as credit card numbers, NBNs, tax numbers and more.

Step 1 - Configure your data source

First of all you’ll need an existing S3 bucket to test with, or you’ll need to create an S3 bucket. My example is using most of the default settings: MacieS3

Now upload the sample files to your bucket. There is a mixture of files and file types here to simulate more real-world scenarios.
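If you'd rather script the upload than use the console, here is a minimal sketch. It assumes you pass in a boto3 S3 client (for example `boto3.client("s3")`); the bucket name and folder in the usage comment are hypothetical placeholders, not the ones from my screenshots.

```python
from pathlib import Path


def upload_samples(s3_client, bucket: str, folder: str) -> list[str]:
    """Upload every file directly inside `folder` to `bucket`.

    `s3_client` is assumed to be a boto3 S3 client. Returns the object
    keys that were used (the plain file names), in sorted order.
    """
    keys = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():  # skip sub-folders
            # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
            s3_client.upload_file(str(path), bucket, path.name)
            keys.append(path.name)
    return keys


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   upload_samples(boto3.client("s3"), "my-macie-test-bucket", "./samples")
```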

Step 2 - Enable Macie

Type “Macie” in the search bar of the AWS Management Console, open the service, and click “Get started”.

Note: You’ll need to ensure the S3 bucket is in the same region as the one where you’re configuring Macie.

Follow the default prompts, and click to create the service-linked role, which gives Macie the permissions it needs to scan the S3 bucket.

Now click “Enable Macie” and wait a few minutes until it’s complete. You’ll then see the Macie console:
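The same step can be sketched with boto3, which also makes the region note above concrete: look up the bucket's region first, then enable Macie there. This is a rough sketch, assuming `session` is a `boto3.Session`; the function names are my own and the bucket name in the usage comment is hypothetical.

```python
def bucket_region(s3_client, bucket: str) -> str:
    """Return the region a bucket lives in."""
    location = s3_client.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    if location is None:  # S3 reports None for buckets in us-east-1
        return "us-east-1"
    if location == "EU":  # legacy value for eu-west-1
        return "eu-west-1"
    return location


def enable_macie_for_bucket(session, bucket: str) -> str:
    """Enable Macie in the same region as `bucket` and return that region."""
    region = bucket_region(session.client("s3"), bucket)
    macie = session.client("macie2", region_name=region)
    macie.enable_macie(status="ENABLED")
    return region


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   enable_macie_for_bucket(boto3.Session(), "my-macie-test-bucket")
```

Note that the call may fail if Macie is already enabled in that region, so treat this as a first-time setup helper.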

Macieconsole

Step 3 - Create a Macie Job

On the top right, click “Create job”, and browse to and select the S3 bucket you created earlier.

Macieconsole1

Macieconsole2

Continue to review the S3 bucket, and you’re ready for the next step.

Now we configure the scope. For the purpose of this tutorial we’ll do a one-off job, but you can easily set up a scheduled job that runs each day. In the additional settings, you’ll note more granular options, such as including or excluding specific file name extensions, but we’ll leave these as-is for now.

Macieconsole3

Configure the managed data identifiers. This is where you can use the built-in identifiers supplied by AWS, which include regular expression (regex) checks. We’ll leave “All” selected as the default, so the job scans the S3 bucket with every AWS managed data identifier. For a full list of the data identifiers available please see here

Macieconsole4

For now, we’ll leave the Custom Data Identifiers unset. These let you define your own regular expressions to use as identifiers for anything not covered by the AWS managed data identifiers we just configured, for example country-specific or application-specific identifiers.
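To give a feel for what a custom data identifier pattern might look like, here is a small sketch that tests a regex locally before you would paste it into Macie. The "EMP-" employee ID format is entirely hypothetical. Also keep in mind that Macie supports only a subset of regex syntax, so a pattern that works in Python's `re` module should still be checked in the Macie console.

```python
import re

# Hypothetical application-specific identifier: "EMP-" followed by exactly
# six digits, bounded by word boundaries so longer IDs are not half-matched.
EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")


def find_employee_ids(text: str) -> list[str]:
    """Return every hypothetical employee ID found in `text`."""
    return EMPLOYEE_ID.findall(text)


sample = "Contact EMP-123456 or EMP-12345, not XEMP-000001"
print(find_employee_ids(sample))  # only the first ID has the right shape
```

In boto3, the corresponding call is `create_custom_data_identifier`, which takes a `name` and the `regex` string.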

We’ll also leave the Allow Lists off for now. An allow list specifies patterns or specific text that you want Macie to ignore when it scans files (similar to the way the file extension exclusions work).

Give the job a name and description, and click “Next”:

Macieconsole5

Click “Submit” after you’ve validated the job configuration.
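For reference, the job we just built in the console can be sketched as a request for boto3's `create_classification_job`. The helper below only builds the request dictionary, so you can inspect it before sending anything. The helper name and its `daily` flag are my own, and the account ID, bucket, and job name in the usage comment are placeholders.

```python
import uuid


def build_job_request(account_id: str, bucket: str, name: str,
                      description: str = "", daily: bool = False) -> dict:
    """Build keyword arguments for macie2 create_classification_job."""
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "name": name,
        "jobType": "SCHEDULED" if daily else "ONE_TIME",
        # "ALL" mirrors leaving every managed data identifier selected.
        "managedDataIdentifierSelector": "ALL",
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket]}
            ]
        },
    }
    if description:
        request["description"] = description
    if daily:
        # A scheduled job needs a frequency; this one runs once a day.
        request["scheduleFrequency"] = {"dailySchedule": {}}
    return request


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   request = build_job_request("111122223333", "my-macie-test-bucket", "pii-scan")
#   job_id = boto3.client("macie2").create_classification_job(**request)["jobId"]
```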

Step 4 - Analyse the findings

Now you’re back at the Macie console. Click on your job to see it running. It can take around 20 minutes or so to run, so go grab a coffee (or beverage of your choice)!
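If you'd rather not keep refreshing the console, a polling loop is one option. This is a rough sketch, assuming `macie_client` is a boto3 `macie2` client and `job_id` is the ID of your job; the function name and the 60-second default interval are my own choices.

```python
import time


def wait_for_job(macie_client, job_id: str, poll_seconds: int = 60,
                 sleep=time.sleep) -> str:
    """Poll a classification job until it is no longer RUNNING.

    Returns the first status other than "RUNNING", for example "COMPLETE"
    or "CANCELLED". `sleep` is injectable so the loop can be tested
    without actually waiting.
    """
    while True:
        job = macie_client.describe_classification_job(jobId=job_id)
        status = job["jobStatus"]
        if status != "RUNNING":
            return status
        sleep(poll_seconds)


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   print(wait_for_job(boto3.client("macie2"), "your-job-id"))
```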

Once your job is complete, click “Show Results > Show Findings”:

Macieconsole6

You’ll see the PII identified in my file cc.txt, which is a list of plain-text credit card numbers. So the job has run, and Macie has identified the PII as expected.
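The same findings can also be pulled programmatically. Here is a minimal sketch, assuming `macie_client` is a boto3 `macie2` client; the function name is my own. It only reads the first page of results, which is plenty for a small test bucket like this one.

```python
def findings_for_job(macie_client, job_id: str) -> list[tuple[str, str]]:
    """Return (object key, finding type) pairs for one classification job."""
    # Filter findings down to the ones produced by this job.
    criteria = {"criterion": {"classificationDetails.jobId": {"eq": [job_id]}}}
    finding_ids = macie_client.list_findings(findingCriteria=criteria)["findingIds"]
    if not finding_ids:
        return []
    findings = macie_client.get_findings(findingIds=finding_ids)["findings"]
    return [
        (f["resourcesAffected"]["s3Object"]["key"], f["type"])
        for f in findings
    ]


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   for key, finding_type in findings_for_job(boto3.client("macie2"), "your-job-id"):
#       print(key, finding_type)
```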

Optional Extra

As you’ll note, the findings for Macie are available via the Macie console. At the top of the console, Macie will prompt you to set up an S3 bucket to store the job findings in. Make sure you configure a KMS key for the S3 bucket that Macie has permission to use:

Macieconsole7

Click to follow through the prompts to create a new S3 bucket to store your findings in.
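If you already have a bucket and KMS key, the equivalent configuration can be sketched for boto3's `put_classification_export_configuration`. The helper below only builds the `configuration` value; its name is my own, and the bucket name, prefix, and key ARN in the usage comment are placeholders.

```python
def build_export_configuration(bucket_name: str, kms_key_arn: str,
                               key_prefix: str = "") -> dict:
    """Build the `configuration` argument for the export-configuration call."""
    destination = {"bucketName": bucket_name, "kmsKeyArn": kms_key_arn}
    if key_prefix:
        # Optional folder-style prefix for the stored results.
        destination["keyPrefix"] = key_prefix
    return {"s3Destination": destination}


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   config = build_export_configuration(
#       "my-macie-results-bucket",
#       "arn:aws:kms:us-east-1:111122223333:key/your-key-id",
#       key_prefix="macie/",
#   )
#   boto3.client("macie2").put_classification_export_configuration(configuration=config)
```

The KMS key policy still has to allow Macie to use the key, as noted above; this sketch doesn't set that up.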

Limitations

As it stands, we’ve got a one-off job running in Macie. You could edit the job to run daily, but you’d still need to manually log in to the AWS Management Console and open the Macie service to view any findings.

We’re also relying on the AWS managed data identifiers to pick up PII data. So if you’re worried about storing any data not covered by these, you might want to look into creating your own Custom Data Identifiers.

Stay tuned for part 2 where I’ll configure an SNS topic and use EventBridge to send out email alerts when Macie finds any PII data leaks, which is a much better solution for the longer term.