About This Book
What Does This Book Cover?
This book was written to provide readers with an introduction to the vast world that is unstructured data analysis. I wanted to ensure that SAS programmers of many different levels could approach the subject matter here, and come away with a robust set of tools to enable sophisticated analysis in the future.
I focus on the regular expression functionality that is available in SAS, and on presenting some basic data manipulation tools with the capabilities that SAS has to offer. I also spend significant time developing capabilities the reader can apply to the subject of entity resolution from end to end.
This book does not cover enterprise tools available from SAS that make some of the topics discussed herein much easier to use or more efficient. The goal here is to educate programmers, and help them understand the methods available to tackle these things for problems of reasonable scale. And for this reason, I don't tackle things like entity resolution in a big data context. It's just too much to do in one book, and that would not be a good place for a beginner or intermediate programmer to start.
Performing an array of unstructured data analysis techniques, culminating in the development of an entity resolution analytics framework with SAS code, is the central focus of this book. Therefore, I have generally arranged the chapters around that process. There is foundational information that must be covered in order to enable some of the later activities. So, the opening chapters cover regular expressions, and that material is very useful for later chapters.
Chapter 1: Getting Started with Regular Expressions
To do advanced unstructured data analysis, you need the fundamental tools to tackle it with SAS code. So, in this chapter, I introduce regular expressions.
Chapter 2: Using Regular Expressions in SAS
In this chapter, I will begin using regular expressions via SAS code by introducing the SAS functions and call routines that allow us to accomplish fairly sophisticated tasks. And I wrap up the chapter with some practical examples that should help you tackle real-world unstructured data analysis problems.
Chapter 3: Entity Resolution Analytics
I will introduce entity resolution analytics as a framework for applying what was learned in the preceding chapters in combination with techniques introduced in the subsequent chapters of this book. This framework will be the guiding force through the remaining chapters of this book, providing you with an approach to begin tackling entity resolution in your environment.
Chapter 4: Entity Extraction
Leveraging the foundation established in the earlier chapters, this chapter covers entity extraction, with a particular focus on preparing for entity resolution.
Chapter 5: Extract, Transform, Load
I will cover some key ETL elements needed for effective data preparation of entity references, and demonstrate how they can be used with SAS code.
Chapter 6: Entity Resolution
In this chapter, I will walk you through the process of actually resolving entities, and acquaint you with some of the challenges of that process. I will again have examples in SAS code.
Chapter 7: Entity Network Mapping and Analysis
This chapter is focused on the steps taken to construct entity networks and analyze them. After the entity networks have been defined, I will walk through a variety of analyses that might be performed at this point (this is not an exhaustive list).
Chapter 8: Entity Management
In this chapter, I will discuss the challenges and best practices for managing entities effectively. I try to keep these guidelines general enough to fit within whatever management process your organization uses.
Appendix A: Additional Resources
I have included a few sections for random entity generation, regular expression references, Perl version notes, and binary/hexadecimal/ASCII code cross-references. I hope they prove useful references even after you have mastered the material.
Is This Book for You?
I wrote this book for ambitious SAS programmers who have practical problems to solve in their day-to-day tasks. I hope that it provides enough introductory information to get you started, motivational examples to keep you excited about these topics, and sufficient reference material to keep you referring back to it.
To make the best use of this book, you should have a solid understanding of Base SAS programming principles like the DATA step. While it is not required, exposure to PROC SQL and macros will be helpful in following some of the later code examples.
This book has been created with a fairly wide audience in mind: students, new SAS programmers, experienced analytics professionals, and expert data scientists. Therefore, I have provided information about both the business and technical aspects of performing unstructured data analysis throughout the book. Even if you are not a very experienced analytics professional, I expect you will gain an understanding of the business process and implications of unstructured data analysis techniques.
At a minimum, I want everyone reading this book to walk away with the following:
A sound understanding of what both regular expressions and entity resolution are (and aren't)
An appreciation for the real-world challenges involved in executing complex unstructured data analysis
The ability to implement (or manage an implementation of) the entity resolution analytics methodology discussed later in this book
An understanding of how to leverage SAS software to perform unstructured data analysis for their desired applications
The SAS Platform is quite broad in scope and therefore provides professionals and organizations many different ways to execute the techniques that we will cover in this book. As such, I can't hope to cover every conceivable path or platform configuration to meet an organization's needs. Each situation is just different enough that the SAS software required to meet that organization's scale, user skill level(s), financial parameters, and business goals will vary greatly.
Therefore, I am presenting an approach to the subject matter that enables individuals and organizations to get started with the unstructured data analysis topics of regular expressions and entity resolution. The code and concepts developed in this book can be applied with solutions such as SAS Viya to yield an incredible level of flexibility and scale. But I am limiting the goals to those that can yield achievable results on a small scale in order for the process and techniques to be well understood. Also, the process for implementation is general enough to be applied to virtually any scale of project. And it is my sincere hope that this book provides you with the foundational knowledge to pursue unstructured data analysis projects well beyond my humble aim.
What Should You Know about the Examples?
This book includes tutorials for you to follow to gain hands-on experience with SAS.
Software Used to Develop the Book's Content
SAS Studio (the same programming environment as SAS University Edition) was used to write and test all the code shown in this book. The functions and call routines demonstrated are from Base SAS, SAS/STAT, SAS/GRAPH, and SAS/OR.
Example Code and Data
You can access the example code and data for this book from the author page at https://support.sas.com/authors. Look for the cover thumbnail of this book and select Example Code and Data.