Field discovery does most of the hard work for you. It finds system metrics, email addresses, IP addresses and all sorts of things that you never realised were filling up your logs. Log analysis has never been so powerful.
Logscape builds on the already popular auto-field discovery by letting users add their own 'auto-patterns'. The system is called grokIt.
Implementations:
With Key-Value pattern extraction the idea is simple: whenever a recognised Key-Value pattern is found, we index the pair and make its key and value searchable terms. For example:
CPU:99 hostname:travisio
and
{ "user":"john barness","ip":"128.10.8.150","action":"login" }
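To make the idea concrete, here is a minimal Python sketch of Key-Value extraction for the two examples above. It is not Logscape's implementation: the `KV_PATTERN` expression and the `extract_pairs` helper are assumptions made for illustration, and the recogniser Logscape actually uses may accept different separators and formats.

```python
import json
import re

# Hypothetical pattern: a key starting with a letter, then ':' or '=',
# then a value that runs up to the next whitespace, comma or semicolon.
KV_PATTERN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)[:=]([^\s,;]+)")


def extract_pairs(line):
    """Return the key/value pairs found in one log line as a dict of strings."""
    stripped = line.strip()
    if stripped.startswith("{"):
        try:
            obj = json.loads(stripped)
            if isinstance(obj, dict):
                # JSON objects already are key/value pairs.
                return {str(k): str(v) for k, v in obj.items()}
        except json.JSONDecodeError:
            pass  # not valid JSON, fall back to the pattern scan
    return {key: value for key, value in KV_PATTERN.findall(line)}
```

Calling `extract_pairs("CPU:99 hostname:travisio")` yields the pairs `CPU`/`99` and `hostname`/`travisio`, each of which could then be indexed as a searchable term.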
Included is the ability to extract known patterns such as email addresses, hostnames, log levels, paths, etc. Every time john@jj-pennies.com is seen, the value is extracted and indexed against the key (_email). The standard config file is logscape/downloads/grokit.properties:
#field-name and regular expression matchers that extract a single group for the value
_email:.*?([_A-Za-z0-9-\.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}).*?
_ipAddress:.*?([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*?
_exception:.*?([_A-Za-z0-9-\.]+Exception).*?
_url:.*?([A-Za-z]{4,4}://[A-Za-z.0-9]+[:0-9]{0,6}[A-Za-z/]+).*?
_level:.*?(INFO|ERROR|WARN|DEBUG|FATAL|TRACE|SEVERE|DEBUG).*?
_hour:.*?[,.\s-]([0-9]{2,2}):[0-9]{2,2}:[0-9]{2,2}[,.\s-].*?
_minute:.*?[,.\s-][0-9]{2,2}:([0-9]{2,2}):[0-9]{2,2}[,.\s-].*?
_gpath:.*?(\/[A-Za-z0-9]+\/[\/A-Za-z0-9]+).*?
Each of these patterns was chosen as the most practical for either a) surfacing useful information or b) slicing your data by time (hour of day).
Each entry has the form FieldName (lhs) : Expression (rhs). The regular expression must contain a single capture group that returns the value (the parenthesised part of each expression above). At the bottom we reference some of the awesome regular expression tools we used for these.
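As a rough illustration of how an entry turns a log line into a field, the sketch below applies three of the expressions from the config to one line using Python's `re` module. Python stands in for whatever regex engine Logscape runs, and the `PATTERNS` dict and `discover_fields` helper are hypothetical names used only for this example.

```python
import re

# Three entries copied from the config above; holding them in a dict is
# just an illustration, not how Logscape stores them.
PATTERNS = {
    "_email": r".*?([_A-Za-z0-9-\.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}).*?",
    "_ipAddress": r".*?([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*?",
    "_level": r".*?(INFO|ERROR|WARN|DEBUG|FATAL|TRACE|SEVERE|DEBUG).*?",
}
COMPILED = {name: re.compile(expr) for name, expr in PATTERNS.items()}


def discover_fields(line):
    """Return {field_name: value} for every pattern whose group matches the line."""
    fields = {}
    for name, pattern in COMPILED.items():
        match = pattern.match(line)
        if match:
            # group(1) is the single capture group holding the value.
            fields[name] = match.group(1)
    return fields


line = "10:15:00,123 ERROR login failed for john@jj-pennies.com from 128.10.8.150"
fields = discover_fields(line)
```

Here `fields` maps `_email` to `john@jj-pennies.com`, `_ipAddress` to `128.10.8.150` and `_level` to `ERROR`. A pattern that finds nothing simply contributes no field.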
To make changes, add or remove entries. Open your favourite text editor (vim?), make the changes, test them and save the file. Once saved, upload the file via the deployments page, from where it is replicated to all agents on the network.
Any new files being monitored will pick up the configuration change (note: it won't be applied part-way through a file). To have the change applied retrospectively you will need to re-index the Datasource.
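One way to test an edited file before uploading it is a small script that checks every entry compiles and has exactly one capture group. The sketch below is an assumption-laden helper, not a Logscape tool: it assumes one `FieldName:Expression` entry per line with `#` comment lines, and it uses Python's regex dialect, which can differ in places from the engine Logscape uses.

```python
import re


def check_grokit_entries(text):
    """Return a list of (field_name, problem) tuples; an empty list means no problems found."""
    problems = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first ':' only, since expressions such as _url contain ':' too.
        name, sep, expr = line.partition(":")
        if not sep or not expr:
            problems.append((line, "expected FieldName:Expression"))
            continue
        try:
            compiled = re.compile(expr)
        except re.error as exc:
            problems.append((name, f"invalid regex: {exc}"))
            continue
        if compiled.groups != 1:
            problems.append((name, f"expected 1 group, found {compiled.groups}"))
    return problems
```

Reading the file with `open(path).read()` and passing the text to `check_grokit_entries` gives a quick pre-upload check. It does not replace trying the patterns against real log lines.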
As with anything, we have tried to make both discovery systems as fast as possible. Key-Value extraction runs at 17-20MB/s per pattern; unfortunately, the 8 supported rules cumulatively slow things down. GrokIt (regular expression parsing) runs at about 14MB/s per compiled pattern. Again, this is too slow once the patterns are combined: as you can see above, there are 8 of them.
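To put a rough number on the cumulative slow-down, the sketch below assumes each pattern scans the same data one after another, so the time per megabyte adds up across patterns. That sequential model is an assumption for illustration, not a measured Logscape figure.

```python
def combined_rate_mb_s(per_pattern_rates):
    """Effective MB/s when every pattern scans the same data in turn."""
    # Time to process 1 MB is the sum of each pattern's time for that MB.
    seconds_per_mb = sum(1.0 / rate for rate in per_pattern_rates)
    return 1.0 / seconds_per_mb


# Per-pattern rates quoted in the text, with 8 patterns each.
grokit_rate = combined_rate_mb_s([14.0] * 8)
kv_rate_low = combined_rate_mb_s([17.0] * 8)
kv_rate_high = combined_rate_mb_s([20.0] * 8)
```

Under this model the 8 GrokIt patterns together come out at roughly 1.75MB/s, and the 8 Key-Value rules at roughly 2.1 to 2.5MB/s, which is why running discovery while the user waits is unattractive.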
IndexTime: The easiest way to remove the performance penalty is to do the work once, at index time, rather than while the user is waiting. When either discovery system is enabled, a Field Database stores the data in its most efficient form (dictionary oriented maps). This decouples the processing from the search and provides reasonable search performance on attributes that are unlikely to change.
SearchTime: At search time the executor will pull in any discovered fields and make them available for that event. This provides decent performance and better system scalability.
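The index-time and search-time split can be pictured with a toy dictionary-encoded store. The class below is a sketch under stated assumptions: `FieldDatabase`, `index` and `fields_for` are hypothetical names, and Logscape's real Field Database format is not described here.

```python
from collections import defaultdict


class FieldDatabase:
    """Toy dictionary-encoded field store: each distinct value is kept once per field."""

    def __init__(self):
        self._value_ids = defaultdict(dict)  # field -> {value: id}
        self._values = defaultdict(list)     # field -> [value indexed by id]
        self._events = defaultdict(dict)     # event id -> {field: value id}

    def index(self, event_id, fields):
        """Index time: store the discovered fields for one event."""
        for field, value in fields.items():
            ids = self._value_ids[field]
            if value not in ids:
                ids[value] = len(self._values[field])
                self._values[field].append(value)
            self._events[event_id][field] = ids[value]

    def fields_for(self, event_id):
        """Search time: return stored fields without re-running any pattern."""
        stored = self._events.get(event_id, {})
        return {field: self._values[field][vid] for field, vid in stored.items()}


db = FieldDatabase()
db.index(1, {"_level": "ERROR", "_ipAddress": "128.10.8.150"})
db.index(2, {"_level": "ERROR"})
```

The expensive pattern matching happens only in `index`. A search for event 2 calls `fields_for(2)` and gets `_level` back from the dictionary, and the repeated value `ERROR` is stored a single time.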
To allow better performance, we have exposed FieldDiscovery flags on the DataSource/Advanced tab. Standard Logscape sources have discovery disabled.