Sql And Data Mining Discussion

Discussion continued from BagSetImpedanceMismatchDiscussion.

Any larger-scale endeavor is typically going recognize/form domain patterns over time such that for efficiency's sake, the organization packages low-level info into medium and high-level abstractions so that domain analysts are working with medium and high-level abstractions instead of low-level abstractions. I see no reason why this usual pattern wouldn't exist except in rare or unusual cases. (Sure, there will be some edge cases in most domains that require custom low-level fiddling, but not the majority of info.) Maybe in a new field the domain analysts are dealing with low-level info and wearing multiple hats (tech AND domain hats). But I don't expect that would last, for it's usually not economical. Patterns WILL emerge, and companies that don't mine those patterns for abstraction-based efficiency will probably dwindle or parish. Typical experienced domain analysts are not equipped to be byte diddlers. There may be rare exceptions, but I've yet to actually see one. Bring me an actual movie of this claimed unicorn. -t

Typical data scientists and BigData analysts are more than capable of using regexes. They're hardly "byte diddlers", but people who know how to extract information from more than just SQL. SQL is fine if you're building applications to store and retrieve business transactional data, but that's what SQL is limited to doing. Increasingly, nuts-n-bolts transaction processing is considered finished but for maintenance. It's no longer an area that facilitates business growth. Instead, it's an area for straightforward business maintenance. The growth is in areas like BigData. The future of custom business applications isn't in record keeping and shuttling reports around, but in creating applications that (for example) predict future sales by correlating buying activity with sentiment in customers' social media postings and current weather forecasts in order to optimise just-in-time purchasing of raw materials.

Like I mentioned before, for new fields (such as BigData), the practitioners typically wear multiple hats and straddle across the low-level technical, mid-level technical, and domain sides. I have worked with multiple "statisticians", who typically had Bachelors or Masters degrees in math or statistics. They knew some basic programming, but probably not regex's. I'm sure they could learn with some time, but it was typically expected the IT staff would help them with things like non-trivial parsing. Patterns were gradually identified and databases created with predigested info so that they could directly use SQL, MS-Access, TOAD, or a similar to tool(s), with a general plan to put the most common requests behind a QueryByExample-like interface(s) and parametrized online reports. Patterns gradually emerged and info was "tamed", with a kind of division between techies, power-users, and end-users. -t

Changes, tool building, and specialized requests may still come through the techies, but power-users would work on the semi-routine stuff, and canned on-line reports and query front-ends were built for non-power users for routine requests. I have no reason to believe "big" data will be any different. Hardware and experience will catch up and query interfaces will be designed to be human-friendly instead of byte-diddly-level. And I've had similar experience with marketing analysts, and something akin to an email content "reader", per example in parent topic. I truly doubt there are no or too few patterns to be mined and factored behind friendlier interfaces/layouts/packaging ABOVE the byte-level. Perhaps YOU have not found them yet, but I'd Vegas-bet they will or can be found with time. -t

Granted, for JobSecurity reasons, some techies in some shops will try to hang onto control and silo their knowledge. But general market forces will push away from such for larger user bases and/or the longer term. Personally, I get board doing repetitious queries and WANT to hand off as much as possible to power-users and end-users by abstracting out the common patterns. -t

I think you must regard regexes as much more technical than I do, and I suppose that's reasonable if you work in an environment that's primarily focused on conventional SQL-driven business applications. People who do text analysis write regexes in their sleep, but might not be able to write a lick of imperative code or SQL.

The half-dozen or so data analysts I've worked with didn't appear to be regex experts. They knew some rudimentary programming, but were not otherwise "coders" (except for one guy who became a self-taught SQL whiz). "Data mining" should not primarily be driven by textual issues, but rather "data" issues. To use the prior email sifting scenario, a data analyst (or statistician) counting complaints about leaking packages being delivered (among other topics) shouldn't have to concern themselves with the parsing and syntax of such emails. A "parsing expert", for lack of a better title, would perform that task in a larger operation. I suppose the DA may do inspections of the results to make sure the parser is doing an adequate job, but the technical tuning of the parser itself wouldn't be their job. I agree that a startup may employ the same person to parse and do statistical analysis, but again, I'm talking about medium-term non-start-ups; and a data-centric database is the better tool for the job. It's not logical parsimony of queries that every query or tool that wants to count leaking package complaints use regex's to do it unless a data-centric database is somehow not technically feasible. "Leakativity" should be abstracted into a flag or category marker.-t

It sounds like the data analysts you worked with were of the "old school" reporting and/or statistics variety, roughly isomorphic to COBOL programmers. Modern data scientists are computer scientists who mix statistics, machine learning and computer science. They can use regexes.

COBOL? I have no fricken idea what you are talking about. You are being vague. Statistics hasn't changed much. If machine learning is used to determine the topic of customer emails, that's fine, but the RESULTS of the bot's guess still need to be data-tized for smooth statistical analysis.

Note: when using "guessing machines", perhaps categories should have weights attached to them:

    categories (table)
    ---------
    email_ref   // ID of email
    category    // Examples: "Complaint", "Leakage", "Packaging"
    weight      // AI bot's "certainty" weight (0 to 100)
    human_reviewed (Boolean) // optional idea
    (Primary key: email_ref and category)

Maybe this rather detailed, specific example belongs somewhere else? It seems a bit out of place in the discussion here.
We need something specific as a reference example for better communication. The generality approach is failing badly.
I'm not clear what you're being specific about, or what it's a reference example of. What is a "guessing machine"? What does the above have to do with it?
I thought "guessing machine" was pretty clear. In the Amazon example below, we want automated guessing about what the email is about and guessing about common details, such as what are phone numbers.
A concrete example of the above table usage: Let's say email number 1234 is a complaint about a late-arriving tablet computer ordered from Amazon. An AI bot does a pre-classification ("guessing") about the content of email 1234 based on key-words and/or English meaning parsing, giving the following records:

email_ref: 1234 category: Complaint weight: 73 // bot is 73% sure it's a complaint human_reviewed: False -------- email_ref: 1234 category: Computer weight: 93 human_reviewed: False -------- email_ref: 1234 category: Tardiness weight: 72 human_reviewed: False -------- Etc... (Low-weighted potential categories are typically excluded.)

An example heuristic would resemble: "if the word 'ipad' appears in the message, then increment the weight of the category "computer" by (such and such value)."
Yes, I suppose if you need something like that, I'm sure it would be fine. I'm still not clear how it's relevant.
Why is it not relevant? I cannot read your mind.
I don't see how it relates to the SQL-vs-regex discussion that occupies the rest of this page.
Obviously there is a communications breakdown somewhere.

I meant "COBOL" as in, your data analysts were to statistics what COBOL programmers are to modern application developers. The mathematics of statistics hasn't changed, but the application of statistics has changed greatly. In a nutshell: Static data in spreadsheets and databases analysed by non-technical statisticians is rapidly giving way to data streams analysed by machines and programmers. Google for "data science".

How about we continue with the customer email scenario. Walk us through in sufficient detail to illustrate what you see as the "proper" staffing divisions and work-flow. -t

What customer email scenario?

The example began near PageAnchor text-scenario-01 in the parent topic, and was gradually developed as the discussion went on. The actual example I started talking about was not actually customer email analysis, but customer email analysis is a scenario that's easy to explain to readers and is relatively close to the original. (I cannot give actual domain details of the original, per org policy.) Search also for "leak" in these topics. -t

Do you mean the example of extracting telephone numbers (or whatever) from a 60gb/sec text stream?

Yes, but also extracting other domain-specific info, such as perhaps an estimated purpose of the message (such as "leaking product box" in one example).

Let's say your are Amazon.com. You will get large volumes of email from customers, vendors, etc. You probably try to herd most users through guided forms you set up, but many will still use plain email regardless. Amazon.com would want to extract info from such emails to route them properly (best guess) and to perform statistics on them, such as to identify emerging problems or trends. Possible info extracted quite possibly includes:

Names
Phone numbers
Country & location of origin
Language
Product numbers
Product category(s)
Contact type (customer, supplier, vendor, etc.)
Nature of the request (complaint, lost password, compliment, question, etc.)
Keywords that various departments have set a "watch" on to generate title flags, routing overrides, and/or highlighting.
Cuss-words and hints of aggression or extreme emotion

An "extraction engine" would typically put such into an RDBMS-like database.

Yes, if you need to extract a fixed set of attributes from a relatively low-volume data source like emails, sure, that would be fine. That's a conventional problem. It doesn't come close to 60gb/second, and it's trivially parallelisable.

I don't see how the volume changes anything. Even if a "traditional" RDBMS is not scalable enough, an RDBMS-like tool would be a better fit than always querying with regex.

How would SQL be better than a regex for finding every instance of a phone number in a stream of Twitter posts or web forum posts or Wiki pages or emails?

Like I have said before, for a one-time project, you are right. Regex is probably the simplest. But a one-time project is not what we are talking about.

How would it make a difference whether the project is one-time or done many times? Does finding every instance of a phone number in a stream of Twitter posts become easier to do in SQL the more you need to do it?

No, I didn't say it did.

Then why is a "one-time project" vs not-a-"one-time project" an issue?

More on this below.

Note WikiPedia's entry for "Data Science":

(begin quote)

Criticism

Although use of the term "data science" has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced “business analytics” in contexts such as graduate degree programs.[14] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of American Statistical Association, noted applied statistician Nate Silver said, “I think data-scientist is a sexed up term for a statistician....Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”[15] {Emphasis added}

(end quote)

That's a predictable criticism from "old school" statisticians who fear being made redundant, but it ignores the significance of data acquisition -- from environmental sources, social media, traditional repositories like databases, non-traditional repositories like document stores, data streams, etc. -- and the application of machine learning. These, plus statistics, characterise data science. Statistics alone is not enough.

That's an unfair criticism that smells of agism without sufficient details to lodge such a harsh accusation. I see no reason why social media etc. would change the general nature of data analysis. One would still want "slotified" data as the better medium to process REGARDLESS of the source. You seem to be confusing syntax and/or format with semantics. Content is content. For example, in the old days you had page numbers and/or journal names/dates as cross-references. Today you have URL's. Somebody analyzing a graph of links (references) should NOT have to change their general technique and tools because the format of the "link key" has changed. That's low-level info. A graph is still a graph, a kiss is still a kiss, a sigh is still a sigh...oops, slipped into fogey-ville there. -t

Social media (for example) data doesn't change the nature of statistics, but it certainly changes the practical nature of doing statistics. Analysis that was once a spreadsheet's worth of data that you could analyse with spreadsheet formulae, that grew into a table's worth that could be extracted with a SQL 'SELECT', is now a continuous stream of data from diverse sources in multiple formats including free-form text. Does it make sense to use SQL to query data without columns, from multiple sources, that's mostly or all free-form text, where the query never finishes running?

If you want to merely find something in text, one would use something like a Google Search Appliance (search engine). If you want to do statistics and/or "meaning processing", then some kind of pre-digestion into slots/attributes/flags/categories is usually more convenient for non-trivial-sized staff to work with.

Certainly a search engine like a Google Search Appliance or Apache Lucene is more suitable than SQL for finding specific text in a (usually un-annotated) text corpus. However, note that it's a continuous stream of data, not a finite corpus. It's not that we want to search for some text in a body of text, it's that we want to be searching, always searching, in a stream of text.

I don't know what you mean. Without something more concrete domain-wise, I cannot comment further. The closest domain I can think of to your (frustratingly) vague description would be the NSA looking for threats in emails and/or discussion forums and/or general Internet traffic. They would STILL want pre-digestion for routing to humans (for further analysis) and for statistics and for forensics (such as human relationships between suspicious messages). Minus the forensics, it's similar to the Amazon scenario above.

It could be three-letter-agency searches, but it could also be searching Twitter feeds for every mention of your company's products, or making sure outbound emails don't include any company telephone numbers except those in an approved list.

As I mentioned before, for a ONE TIME PROJECT, you are right, but we are NOT TALKING ABOUT A ONE-TIME PROJECT. We are talking about a system that facilitates a medium or large organization in doing multiple or volume analysis. I don't think it's likely an org would build a giant database for a piddly or simple project. If it was piddly or simple, then you probably don't need a database of ANY kind. It's overkill. Use a file system.

Again, how would it make a difference whether the project is one-time or done many times? Does finding every instance of a phone number in a stream of Twitter posts become easier to do in SQL the more you need to do it? And how do you "use a file system" to scan Twitter posts or outbound emails?

If you don't need a database, then you are comparing apples to staples. To "find every instance of a phone number in a stream...", you don't need a database, period. I thought we were comparing different kinds of databases. I thought "find every instance of a phone number in a stream" was an EXAMPLE of something you wanted to do with the data, not the ONLY THING. Typically one only uses a database if they want to do a variety of things with the same data. Thus, your mention of a NoSql database that triggered this discussion was assumed to be for "doing a variety of things with the same data", not JUST a phone number search. -t

Is a database defined in terms of whether relations are finite or not? Isn't a database a collection of (related) data?

I'm not sure what you mean. A file system also typically puts files together by some relationship. If you only need to group by one or two facets (factors) and that's not likely to change, a file system is probably sufficient. However, if you have 3 or more facets, and/or the needed facets change often, then a database is probably a better tool.

My example of finding every instance of a phone number is one of many possible examples. We might want to find every instance of a phone number, and every instance of the name of a company employee, and every mention of a company product name, and count how many words are misspelled in an hour, and so on, all at the same time.

The devil is in the details. How many different requests?, how often?, how much turn-around time is acceptable?, what if the techie is on vacation? etc. etc. etc. Paying techies to write routine and repetitive queries all day is usually not a good use of their time. Old fashioned SystemsAnalysis is in order before selecting a fitting system. This should go without saying.

That will depend on individual situations and requirements. Some of them appear to require dedicated regex parsers, which I understand are selling quite well.

See below about "mutually exclusive".

If you are merely sifting an active stream for occasional text patterns (few patterns and/or infrequently-changing patterns), I don't see a need for a database or even a file system. Write a script to filter the incoming stream, replicate it to the multiple servers involved, and make the script send "finds" to a central file repository and/or send an email. No database and no file system needed for such (for the data being filtered). K.I.S.S. You haven't even identified a need for persistence.

Yes, that seems reasonable. I'm sure there are situations and requirements that would make that a reasonable approach. When the data is coming at you at the rate of many gb/second, it seems the "write a script to filter the incoming stream" requires some special handling, hence the dedicated regex parsers.

Regex and RDBMS (or similar) are not necessarily mutually exclusive. Typically Regex converts low-abstraction (raw) info into medium-abstraction info (or domain-perspective info), which is what is typically put into databases for medium and large orgs when they want to analyze, process, summarize, use, and/or archive the medium-level info. If your boss asks you to build something to find UK phone numbers in a byte stream, then your organization typically wants to do something with that info. Finding it is only Step 1.

Sure. So?

So, are we reaching some agreement here? Or is that like dividing by zero on this wiki?

CategoryDataMining