I doubt that is true. Perl is not the language you would use to build a large enterprise system or do complex numerical calculations (although it does have the Perl Data Language extension that is being integrated into the core Perl 6 functionality), but then, it isn’t intended to be that. Perl is often referred to as “the glue that holds the Internet together” because of how widely it is used for Common Gateway Interface (CGI) programming, as well as its use by sysadmins for a multitude of basic scripting roles like filtering data or integrating a bunch of scripts and applications into a single tool. I would agree that Python is the better choice as an all-purpose language, particularly for enterprise web frameworks and the like, but Perl has its own set of niches and isn’t going away any time soon.
I’ve never used it but from their site it looks more like a data integration and workflow tool. I would assume it can perform text manipulations, too—it would be kind of useless as a data integration tool if it couldn’t—but the control block functionality is explicitly something the o.p. doesn’t need any more than he needs OOP.
I’ve used event-driven languages like Visual Basic, Visual Pascal, and ToolBook, and found that while they do allow you to create visual interfaces pretty quickly, doing any significant data manipulation requires sitting down and writing some structured functional code, which they are generally less than great at. If you want something to run an interactive kiosk they’re fine (although often not very robust), but they lack in ease of use and performance when it comes to actually handling data.
Big programs are a collection of subroutines and libraries. OO is good since it hides the details of these from other code that uses them, which keeps that code from relying on their internals - a habit that bites you when someone changes the subroutine.
The disadvantage is that you have all this conceptual overhead even when you are programming in the small, which is what you are doing.
I did an OO class in grad school in the late '70s, and my dissertation was a kind of OO language - using the principles for portable microprograms. But you don’t need it, and you are right to concentrate on Perl.
If your goal is to get Easytrieve back up and running, then go with what DavidwithanR said: find the right person and pay them to do it.
If your goal is to solve one problem right now, I would lean toward perl. If your background is entirely procedural languages, you can leap into it easily without learning OOP. I’ve used a lot of programming languages over the years, and perl is one of the best for handling text strings quickly and easily.
If your goal is to set up a new system for moving forward, that’s where Python comes in. Schools are pumping out massive numbers of new Python programmers. It’s a fairly easy language for new programmers to pick up, even though it was a bit frustrating for an old-timer like me. Unlike the languages I’ve worked in before (c, perl, FORTRAN, BASIC, APL, PL/1, Pascal, VB…) there’s a mandatory indent system instead of an end-of-block character or command. A slightly misplaced cursor on copy (or cut) & paste leads to a line being indented one space too few or one space too many and the program won’t run. You don’t HAVE to use OOP to use Python, but you lose most of its advantages if you don’t. I picked it up in the equivalent of about two full-time weeks, because I just code part-time these days. A full-timer that wasn’t simultaneously learning a whole new set of tools could have done it faster.
IMO you should hire a programmer to write what you need, using a fast, compiled language (because you are processing up to about 100,000 records), and then learn to modify the code. It will be easier to modify existing code than to start writing something new in a new language from scratch.
It looks like you don’t need a database system at all. You simply need to process flat files, and many languages will do this.
I’ve done this kind of thing often using Lazarus (Object Pascal). It’s lightning fast, and has a complete range of database tools if necessary. It can do simple things in a simple and straightforward way, as well as handling any level of complexity. It’s a very mature language and IDE, constantly maintained and updated, free and open source, and cross-platform. There is any amount of help information and code samples available for anything you may want to do.
But reading fewer than 100000 short records should not be a problem speed-wise; he even mentioned the slowest operation was converting all text to upper-case, and how long can that take?
I am generally all for learning techniques like object-oriented programming and parallel processing, but the end result may be that the optimized program runs in 1 second instead of 10 seconds.
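To put a number on it: the upper-casing he mentioned is a single pass with a single built-in call in Perl. A throwaway sketch like the one below (file names are made up) should chew through 100k short records in a fraction of a second on any modern machine.

use strict;
use warnings;

# Hypothetical file names; one pass, upper-casing each line as it goes by.
open my $in,  '<', 'records.txt'    or die "can't open input: $!";
open my $out, '>', 'records_uc.txt' or die "can't open output: $!";
while (my $line = <$in>) {
    print {$out} uc($line);   # uc() upper-cases the whole line in one call
}
close $in;
close $out;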
I would respectfully request that we consider the ability of the OP and the complexity of the task when replying. I feel people are greatly simplifying the effort that would be required. I have trouble believing that the functions of an enterprise-level report generator can be accomplished by a newbie programmer as their first project. The support cost alone is over $2k. If it were simple to create such functionality, companies with their own programming staff would just write their own report generator rather than spend thousands on this one.
It’s like if your transmission went out and you decided to replace it on your own. If you’re an experienced mechanic, you could probably pull a used one and swap it in over a weekend. But if you’ve never worked on cars, it’ll take a lot longer, be much harder, and you may never get it working.
For this issue right now, the solution is to get the program working again. You can also learn to program, but not to solve today’s problem. Long term you can get to the point of writing programs to do what you need. But when talking to your boss about this problem, only mention getting Easytrieve working again. Don’t say you’ll learn to program so that Easytrieve isn’t needed.
If by “database” you mean array, sure. But there is no need here for a relational database or any kind of complicated data structure unless he wants to make the records repeatedly searchable or sortable by keyed indices, and from what I’m reading he is just processing the file once per record to match the addresses and then writing out to a new file. Unless the o.p. wants to actually build a database for future reference it’s just a lot of extra overhead, although this is certainly something that could be done in SQL if he were familiar with it.
The size of these files is nothing to even a basic modern desktop productivity machine. He should be able to load 91k records into memory in seconds, and process them about as fast as he can write them out. What the o.p. describes is just basic text processing, which is exactly the thing Perl was created to do, and if he doesn’t want to learn Perl, finding someone online or at a local school to implement some scripts in Perl (or Python, or Ruby, or whatever) should be trivial.
I do happen to know the difference between a database and an array, after 30+ years of programming.
An array won’t work. If you are matching 100k records to 100k records by reading sequentially through them each time, that’s about 50k comparisons per record. (It may be less if the original data is ordered.) So that’s 50k * 100k comparisons in total, which is 5 billion. That will take days, even on a very fast machine with a fast, compiled language.
So you need a database. In Lazarus/Delphi there is a proper, full, in-memory database component with extremely fast indexing and searching, which you can just plug in and use.
Believe me, I’ve DONE this kind of thing before, and I know the issues. I’ve done datasets of the order of a few hundred k records, and complicated matching, sorting and processing that was too slow (several hours) even with an SQL database on disk and a fast machine.
The OP’s requirements seem far simpler than the processing I was doing of financial records, but you won’t do it with arrays. And you won’t do it with Python, Perl, etc. when you get beyond simple formatting or totals. You can do it in a database, as the OP has been doing in Easytrieve, but it will be slow.
He doesn’t need to match every record against every other record; he just has to search one set of records for the matching account record in the other, and assuming that the records are unique (as accounts tend to be) the computational complexity reduces as a factorial. Assuming the records are sequential and can be sorted and separated into groups you could make it even easier, but both Perl and Python have vectorized functions to do this automagically with large arrays.
I’m sure this could be done using Delphi, or some compiled C++ based database but for what the o.p. is doing that is overkill, particularly since he doesn’t appear to have a need to search and sort these records more than once.
I just grabbed an old laptop and tried, in Perl, reading a file of 100000 random records (total file size approximately 100 Mbytes), sorting them by the first field (a random 6-digit number), and writing them to output (no further processing, but this is just an example). The total time was less than 1 second.
NB Perl of course has built-in support for “associative arrays” (hashes); you are not going to search through records one by one to find a matching key. ETA I mean the data structure %somedata as opposed to a conventional array @somedata.
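To make that concrete, here is a bare-bones sketch of what I mean (file names and field layout are invented): load one file into a hash keyed on the account number, and then every lookup from the other file is a single hash access rather than a scan through 100k records.

use strict;
use warnings;

# Build a hash of "previous" records keyed on the account number
# (assumed here to be the first whitespace-delimited field).
my %prev;
open my $pfh, '<', 'previous.txt' or die "previous.txt: $!";
while (my $rec = <$pfh>) {
    chomp $rec;
    my ($acct) = split ' ', $rec, 2;
    $prev{$acct} = $rec;
}
close $pfh;

# Each current record is then matched with one hash lookup, not a linear search.
open my $cfh, '<', 'current.txt' or die "current.txt: $!";
while (my $rec = <$cfh>) {
    chomp $rec;
    my ($acct) = split ' ', $rec, 2;
    print "matched $acct\n" if exists $prev{$acct};
}
close $cfh;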
It doesn’t reduce as a factorial without indexing.
If you are creating an indexed array that allows fast searching, then for practical purposes it functions as an inefficient database. But if you need more than a single index on a single field, arrays won’t be sufficient.
In the OP’s case, it seems relatively simple, with one key field. If the input files are ordered, matching may be simple. I would still never do it in Perl or Python.
Does the o.p. need more than one key? My understanding of what he is doing is searching account numbers, matching names and addresses to those account numbers, then processing the result into a couple of specific formats, presumably to be read by some other existing application. It’s true that this isn’t an efficient database, but my understanding of what the o.p. is doing is a one-time match of records with unique identifiers; he doesn’t need powerful database functionality to perform complex searches and matches. This can be done in Perl or Python just fine.
I use Python (with NumPy and SciPy) to search records of hundreds of time histories with millions of data points each, looking for patterns between them using complex predictive filters and frequency transforms. It is a numerically intensive task, and it can run a complete scan in a few minutes. The software that did this previously (written in Fortran 90) would take a few hours on the old Unix system it ran on, but modern computing hardware and the efficiencies in NumPy functions are so good that it’s just not worth it to try to write computationally optimized code. BTW, I also created the same tool in Matlab and that took several times longer to run (nearly an hour in one case), which illustrates just how good NumPy has become at running close to bare-metal speeds. I haven’t tried to create the same tool in C or C++ because I don’t have that kind of time or enthusiasm, but I doubt it would result in more than a factor of 3 improvement, so at best I’d save a couple of minutes for the pain of writing a compiled application (although I would do so if I wanted to make it a commercial app).
I don’t use Perl Data Language (or use Perl much any more in general) but I’m morally certain it could perform this simple search and match task without taking days to do so, and the regex functionality in Perl could do all of the formatting without creating a bunch of different subroutines or elaborate data structures. Anyway, the point is that the o.p.'s application is nowhere near this computationally intensive, and unless he needs a database to reference for future purposes, he just doesn’t need the overhead of building one.
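For instance (the record format here is completely made up, just to illustrate), a single substitution can reorder fields, strip whitespace, and upper-case the name without any supporting subroutines:

use strict;
use warnings;

# Made-up input format: "lastname, firstname, account" -> "account|FIRSTNAME LASTNAME"
my $rec = "smith, john, 123456";
$rec =~ s/^\s*(\w+),\s*(\w+),\s*(\d+)\s*$/$3|\U$2 $1/;   # reorder and upper-case in one s///
print "$rec\n";   # prints "123456|JOHN SMITH"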
100k random matches (I didn’t bother to do any sorting) take about 0.5 seconds. This is a ThinkPad from 2011. ETA I mean the entire script, including reading the entire input file, performing the matches, and writing appropriate output, took 0.5 seconds.
The o.p. is talking about sorting, and from what I can tell, all records are unique, with the single key being the account number.
I don’t doubt your experience with relational databases far exceeds my own, and that for a more complicated dataset and application your criticisms would be on point, but I think you are interpreting the o.p.'s problem to be far more complicated than it really is. I wouldn’t use Perl to try to manage a relational database, but for basic searching, matching, and simple text processing and transforms, it’s a lightweight language with a pretty gentle learning curve and not a lot of extra functionality and overhead to learn for a one-time sort and match task.
Easytrieve can only match files when both files are sorted by the key (a/r#). The first step (after creating a fixed-position text file) is to sort the Current and Previous files. Most of my programs only require sorting on the a/r#, but there are a few that require sorting on the a/r#, then another field, and possibly more than one field.
Once the Current and Previous files are sorted, another Easytrieve matches them:
IF MATCHED
    OUT-NAME = PREV-NAME
    OUT-NAME2 = PREV-NAME2
    OUT-ADDR = PREV-ADDR
    OUT-ADDR2 = PREV-ADDR2
    (and so on)
    PUT FILEC
ELSE
    NOT-REC = IN-REC
    PUT NONMATCH
END-IF
IF NOT MATCHED FILEA
    NOT2-REC = PREV-REC
    PUT NONMATCH2
END-IF
There’s more going on depending on whether the records are matched or not, but that’s the gist of it. The output file names I use are [member number]_CURRENT_OUTPUT.txt, [member number]_NONMATCH.txt, and [member number]_NONMATCH2.txt. I combine these three files and import them into Excel, where I colour-code the records that are not matched in the previous file and the records that are not matched in the current file. Then I ‘clean up’ the new records, sort by a/r# (we need to send the data sorted by a/r#), turn the .csv file into a text file, and then run that text file through the reformatting Easytrieve, an example of which I posted earlier.
I know how that sounds to people familiar with relational databases and newer programming languages, but there used to be five people cleaning data manually. Now there’s just me and my boss. One file might take a week to clean up manually (and the one I’m thinking of doesn’t need to be reformatted), and even with monkeying about with importing a .csv file into Access, using Access to write a fixed-position text file, putting the output together and fixing the new records, and then doing the text conversion again, it takes under an hour and it doesn’t wind up in the recipient’s ‘suspense’ file because the data’s consistent from month to month.
I don’t know what to add to my and others’ previous comments, but I believe my experiments show that, considering the relatively small size of your data (I never understood how big exactly ‘big data’ was supposed to be, but this isn’t it), reading the entire thing into RAM and sorting/searching/matching/formatting should take a few seconds at most even in an interpreted language like Perl, if that was a lingering concern. (And if Easytrieve can do it, why should other scripting languages fail?)
ETA I’m sure you could write your scripts to access the database server directly, so none of the steps need be done manually. The robots will have reduced five man-weeks to a couple of seconds; that’s what it’s all about.
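Just to make that concrete, below is a rough sketch of what the whole match step might look like in Perl. Every specific in it (file names, column offsets, field widths) is an invented placeholder; the real values would come straight from your Easytrieve field definitions.

use strict;
use warnings;

my $KEY_POS = 0;    # assumed: a/r# starts in the first column
my $KEY_LEN = 10;   # assumed: a/r# is 10 characters wide

# Load the Previous file into a hash keyed on a/r#.
my (%prev, %matched);
open my $pfh, '<', 'previous.txt' or die "previous.txt: $!";
while (my $rec = <$pfh>) {
    chomp $rec;
    $prev{ substr($rec, $KEY_POS, $KEY_LEN) } = $rec;
}
close $pfh;

open my $cfh, '<', 'current.txt'        or die "current.txt: $!";
open my $out, '>', 'current_output.txt' or die $!;
open my $nm1, '>', 'nonmatch.txt'       or die $!;
open my $nm2, '>', 'nonmatch2.txt'      or die $!;

while (my $rec = <$cfh>) {
    chomp $rec;
    my $key = substr($rec, $KEY_POS, $KEY_LEN);
    if (exists $prev{$key}) {
        # IF MATCHED: carry the name/address fields over from the Previous record
        # (the offsets and widths here are invented placeholders).
        substr($rec, 10, 30) = substr($prev{$key}, 10, 30);   # name
        substr($rec, 40, 30) = substr($prev{$key}, 40, 30);   # address
        print {$out} "$rec\n";
        $matched{$key} = 1;
    }
    else {
        print {$nm1} "$rec\n";   # current record with no match in Previous
    }
}
close $cfh;

# IF NOT MATCHED: Previous records that never matched a Current record.
for my $key (keys %prev) {
    print {$nm2} "$prev{$key}\n" unless $matched{$key};
}

close $_ for $out, $nm1, $nm2;

Unlike Easytrieve, the hash approach doesn’t even require pre-sorting either file; the outputs can be sorted by a/r# at the end with a single sort call.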
“Big Data” is just data that is large and complex enough that you can’t easily reduce or classify it. It isn’t a specific size, but it is large enough and has enough different parameters that you can’t tease out information by simple regression or binning methods, and therefore you have to use methods like non-parametric estimators to try to get some picture of what the data is telling you. An example of “Big Data” is using stock market performance as a predictor of future trends, where the traditional parametric methods give inaccurate trending and fail to predict “long tail” events (because the models do not reflect reality); in theory, predictive analytics using non-parametric statistics will let you anticipate the anomalous behavior that leads to market crashes because you can see similar patterns of behavior.
91k records of discrete information in text format is not “Big Data”. It isn’t even all that challenging for interpreted scripting languages. I’ve handled comparable amounts of data using CLisp on Unix server machines which didn’t have enough memory to hold the entire dataset, and searching and sorting didn’t take days.
My personal definition of Big Data is “a dataset so large that there is a significant probability that some number of the hard drives it is stored on will fail during your analysis”.