Abstract
A transactional SQL database may be vertically scaled and either shared-everything or shared-disk, while an OLAP DBMS should be shared-nothing. Both should be more tightly coupled than a NoSQL database.
References and sources are specific to each section.
Basic things
Chip multiprocessor with data-driven multithreading, rather than superscalar. In the case of multiple-issue, have static placement like VLIW but dynamic issue like superscalar.
Direct execution of operating system code
Stored in flash memory. At least the kernel should be stored in NOR flash to allow for direct code execution (execute-in-place).
Data would still be stored in hard disk storage and loaded into RAM.
This could be Harvard architecture, see “Asymmetric multiprocessing” below.
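A minimal startup sketch of the execute-in-place idea, assuming hypothetical linker symbols: the kernel's code stays in NOR flash, and only the writable data is set up in RAM at boot.

```c
/* Hedged sketch: typical execute-in-place (XIP) startup, assuming hypothetical
 * linker symbols. Kernel code runs directly from NOR flash; only .data is
 * copied to RAM and .bss is zeroed, as in many embedded boot sequences. */
#include <stdint.h>
#include <string.h>

extern uint8_t _data_load_start[];  /* .data image in NOR flash (load address) */
extern uint8_t _data_start[];       /* .data run address in RAM                */
extern uint8_t _data_end[];
extern uint8_t _bss_start[];
extern uint8_t _bss_end[];

void kernel_main(void);             /* kernel entry, defined elsewhere         */

/* Reset entry point, itself executed in place from NOR flash. */
void _start(void)
{
    memcpy(_data_start, _data_load_start, (size_t)(_data_end - _data_start));
    memset(_bss_start, 0, (size_t)(_bss_end - _bss_start));
    kernel_main();                  /* kernel text stays in flash (XIP)        */
}
```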
Isolated I/O
Multi-path, multi-stage interconnection network to connect processors to memory, and a bus to connect processors to I/O. The former would consist of semiconductor crosspoint matrix switches.
Channel I/O
A backplane bus, to which are connected at least one system bus and four I/O expansion buses. One or two CPU sockets and multiple memory module slots per system bus. One I/O processor (IOP) connects the backplane bus to each I/O bus. While the CPU and its respective system bus may be on the same board as the backplane bus, each I/O processor and its respective I/O bus is on its own expansion card, so there are at least four expansion cards. May mix and match expansion cards for storage and networking.
Have DMA controllers for synchronous and isochronous devices and I/O processors for asynchronous devices. For the latter, longer-running I/O and endpoint devices with flexible latency requirements would go directly to memory, whereas shorter-running I/O and endpoint devices with more stringent latency requirements would go straight to the CPU; at least some of the I/O controllers, namely those used for terminals, would have built-in multiplexing. Use DMA controllers for FireWire cameras, recorders, and scanners and for ATA optical drives, and I/O channels for SCSI RAID and printers. May mix and match networking and storage processors: the former may be application-specific integrated circuits (ASICs) or reconfigurable arrays that offload the network stack, while the latter would be I/O processors (IOPs).
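A small sketch of the device-to-mechanism mapping just described, in C; the table entries and names are illustrative, not a real driver framework.

```c
/* Hedged sketch: which transfer mechanism each device class would use. */
#include <stdio.h>

enum io_mechanism { IO_DMA, IO_CHANNEL };   /* DMA controller vs. I/O processor (channel) */

struct device_class {
    const char *name;
    enum io_mechanism mech;
};

static const struct device_class device_map[] = {
    { "FireWire camera",   IO_DMA     },    /* isochronous: DMA controller        */
    { "FireWire recorder", IO_DMA     },
    { "FireWire scanner",  IO_DMA     },
    { "ATA optical drive", IO_DMA     },    /* synchronous block transfers        */
    { "SCSI RAID array",   IO_CHANNEL },    /* long-running, asynchronous         */
    { "SCSI printer",      IO_CHANNEL },
    { "Terminal mux",      IO_CHANNEL },    /* channel with built-in multiplexing */
};

int main(void)
{
    for (size_t i = 0; i < sizeof device_map / sizeof device_map[0]; i++)
        printf("%-18s -> %s\n", device_map[i].name,
               device_map[i].mech == IO_DMA ? "DMA controller" : "I/O channel (IOP)");
    return 0;
}
```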
Another idea is to make a mainframe from commodity hardware, which will be discussed in a later section.
Asymmetric multiprocessing
The host CPU accesses I/O and runs the operating system, which is stored in flash memory and written in Assembly, C, and/or Modula-3. Therefore, a modified Harvard architecture or super Harvard architecture might be good. Have one read-only data bus and one write-only data bus for I/O and one read/write data bus for memory access.
Additional processors share memory with the host CPU. These additional processors run applications and would be called the application processors.
Also have vector floating-point accelerators.
Could have separate I/O processors for networking and storage, but do have one host CPU for memory management, task scheduling, and process management. Maybe a coarse-grained reconfigurable array for protocol stack offload and a mainframe-style I/O processor (IOP) for I/O operations such as storage and printers. Save the rest for Alternative and Progressive computing.
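A hedged sketch of the host-to-application-processor hand-off over the shared memory, assuming a hypothetical single-producer/single-consumer task ring; the names and sizes are illustrative only.

```c
/* Hedged sketch: the host CPU schedules tasks, application processors run them. */
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 64                       /* power of two for cheap wrap-around */

struct task {
    void (*entry)(void *);                  /* application function to run        */
    void *arg;
};

struct task_ring {
    _Atomic uint32_t head;                  /* written by the host CPU (producer) */
    _Atomic uint32_t tail;                  /* written by the application CPU     */
    struct task slot[RING_SLOTS];
};

/* Host CPU side: schedule a task onto an application processor. */
static int host_submit(struct task_ring *r, struct task t)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return -1;                          /* ring full                          */
    r->slot[head % RING_SLOTS] = t;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Application processor side: pull and run the next task. */
static void app_run_next(struct task_ring *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&r->head, memory_order_acquire))
        return;                             /* nothing pending                    */
    struct task t = r->slot[tail % RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    t.entry(t.arg);
}
```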
Composing a mainframe from commodity hardware
Switched fabric like InfiniBand connecting compute nodes together and to Fibre Channel storage, NVMe storage-class memory, and PCI expansion. Also Ethernet gateways, which may be load balancers. Also Myrinet clustering, with a role equivalent to the IBM Integrated Coupling Facility. The compute nodes would have host channel adapters (HCAs) and the storage targets would have target channel adapters (TCAs). The HCAs would be field-programmable gate arrays (FPGAs) or, better yet, coarse-grained reconfigurable arrays (CGRAs); the expansion TCAs would be application-specific integrated circuits; and the storage TCAs would be Systems-on-Chip. Compute nodes should be in modular chassis, while the storage servers would be 4U. The storage TCA Systems-on-Chip would be a Data Center Infrastructure Processing Unit (DCPU, DCIPU, or just DPU) and have a role equivalent to the I/O processors mentioned earlier (under “Channel I/O” and “Asymmetric multiprocessing”). If printers are used, they would be associated with the storage TCAs, which would be equivalent to I/O processors. File servers would also run on the storage target DCPUs, and a clustered filesystem would be implemented on them.
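As a small illustration, a compute node could enumerate its HCAs through the standard libibverbs API; this is only a sketch with abbreviated error handling.

```c
/* Minimal sketch: list the InfiniBand HCAs visible on a compute node.
 * Link with -libverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < n; i++)
        printf("HCA %d: %s\n", i, ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}
```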
A point-to-point interconnect like HyperTransport would be great for the compute nodes. Also have I/O virtualization like 3Leaf Aqua.
Unisys ClearPath is a great example. 3Leaf Dynamic Data Center was near-perfect.
Accelerated processing
Scalar fixed-precision central processing unit (CPU) issuing instructions straight to vector floating-point coprocessors, but both CPU and coprocessors load and store data straight from and to memory. Von Neumann architecture for CPU, Harvard architecture for coprocessors.
Also attached dataflow processor which may be a coarse-grained reconfigurable array.
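A hedged sketch of that issue path, assuming a hypothetical coprocessor register layout and MMIO address: the scalar CPU writes a command descriptor and rings a doorbell, while the vector coprocessor loads its operands directly from memory.

```c
/* Hedged sketch: CPU issues a vector command; both sides access memory directly. */
#include <stdint.h>

struct vec_cmd {
    uint8_t  opcode;              /* e.g. VADD, VMUL (illustrative)            */
    uint32_t length;              /* number of double-precision elements       */
    uint64_t src_a, src_b;        /* addresses of the input vectors            */
    uint64_t dst;                 /* address of the result vector              */
};

struct vec_regs {
    volatile uint64_t cmd_addr;   /* where the descriptor lives in memory      */
    volatile uint32_t doorbell;   /* write 1 to start                          */
    volatile uint32_t status;     /* 0 = busy, 1 = done                        */
};

#define VEC_COP ((struct vec_regs *)(uintptr_t)0x40001000u)  /* hypothetical MMIO base */

static void issue_vector_op(struct vec_cmd *cmd)
{
    VEC_COP->cmd_addr = (uint64_t)(uintptr_t)cmd;
    VEC_COP->doorbell = 1;        /* the CPU keeps executing scalar code here ...      */
}

static void wait_vector_done(void)
{
    while (VEC_COP->status == 0)
        ;                         /* ... and only synchronizes when the result is needed */
}
```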
Aggregated network
Two hubs connected to a bridge. Or multiple switches connected to a router, possibly with hubs connected to each switch.
Edge switches and core routers.
Bus for hubs, semiconductor matrix for switches.
Virtualized operating system
Multi-server microkernel, second- or third-generation, hosting modular monolithic kernels like GNU/Linux. Separate VMM for each virtual machine. Inspired by NOVA hypervisor and its derivatives like Bedrock and Cyberus Hedron SVP.
Another idea is Ubuntu for Xen Dom0 and Debian for DomU. The display server and window manager should be on Dom0. Desire offload of the Builder to DomB, which should run NanOS. Also offload XenStore to MirageOS, or alternatively OSv. Use Xen PVH for most virtual machines, for which prefer paravirtualization for disk and network, paravirtualization or better yet hardware acceleration for interrupts and timers, software virtualization for the emulated motherboard, and hardware virtualization for privileged instructions and page tables.
Hybrid dataflow and Von Neumann processor array
Dynamic dataflow scheduling between hyperblocks or epochs, control flow scheduling within hyperblocks or epochs and between basic blocks or clusters, and dataflow scheduling within basic blocks or clusters. Therefore, three levels, rather than the two levels of MT.Monsoon, data-driven multithreading, scheduled dataflow, Task Superscalar[2], TRIPS[0], or DySER[1].
Each hyperblock or epoch may have dataflow accelerators like DySER[1] rather than unified dataflow like TRIPS[0], as mentioned in Chapter 2.3.2 of [2].
That would be something like Tartan, Conservation Cores, or DySER[1] nested inside Data-Driven Multithreading or Task Superscalar.
This could be used as a data center processing unit (DPU) for infrastructure offload.
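A hedged sketch of the outer, dataflow level, with illustrative names: a hyperblock fires only once all of its inputs have arrived, in the style of data-driven multithreading, while execution inside the block follows ordinary control flow.

```c
/* Hedged sketch of the dataflow firing rule at hyperblock granularity. */
#include <stdatomic.h>
#include <stddef.h>

struct hyperblock {
    _Atomic int pending_inputs;            /* dataflow firing rule: run at zero  */
    void (*body)(void *frame);             /* control-flow code of the block     */
    void *frame;                           /* operand frame filled by tokens     */
    struct hyperblock **consumers;         /* blocks fed by this block's outputs */
    size_t n_consumers;
};

extern void ready_enqueue(struct hyperblock *hb);   /* scheduler queue (assumed) */

/* A producer delivers a token (an operand) to a consumer hyperblock. */
static void deliver_token(struct hyperblock *hb)
{
    if (atomic_fetch_sub(&hb->pending_inputs, 1) == 1)
        ready_enqueue(hb);                 /* last input arrived: block may fire  */
}

/* Executed by a worker when the scheduler dequeues a ready hyperblock. */
static void fire(struct hyperblock *hb)
{
    hb->body(hb->frame);                   /* von Neumann execution within block  */
    for (size_t i = 0; i < hb->n_consumers; i++)
        deliver_token(hb->consumers[i]);   /* outputs become tokens downstream    */
}
```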
[0]https://www.cs.utexas.edu/~cart/publications/dissertations/ramdas.pdf
[1]https://ieeexplore.ieee.org/document/6235947/
[2]https://www.tdx.cat/bitstream/handle/10803/277376/TYF1de1.pdf
MultiProcessor System-on-Chip
Inspired by HyperProcessor, MLCA, and Task Superscalar[0,1,2]. But it would be a massively parallel processor array.
The host processor issues instructions to the worker processors in the array. The workers pass data to one another as messages. The worker processors may have dataflow accelerators like Tartan or Conservation Cores.
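A hedged sketch of that command path, assuming a hypothetical per-worker mailbox layout: the host posts a task descriptor to a worker, and the worker forwards its result to the next worker as a message.

```c
/* Hedged sketch: host-to-worker mailboxes on the processor array. */
#include <stdint.h>

#define N_WORKERS 16

struct mailbox {
    volatile uint32_t full;        /* 1 while the worker has not consumed it    */
    volatile uint32_t opcode;      /* which kernel the worker should run        */
    volatile uint64_t src, dst;    /* message buffers in the array's address space */
    volatile uint32_t next_worker; /* where to send the result message          */
};

#define MBOX(w) ((struct mailbox *)(uintptr_t)(0x50000000u + (w) * 0x100u))  /* hypothetical */

/* Host processor: issue one stage of a pipeline to worker `w`. */
static void host_issue(unsigned w, uint32_t opcode,
                       uint64_t src, uint64_t dst, unsigned next_worker)
{
    struct mailbox *m = MBOX(w);
    while (m->full)
        ;                          /* wait for the worker to drain its mailbox  */
    m->opcode = opcode;
    m->src = src;
    m->dst = dst;
    m->next_worker = next_worker;  /* worker passes its output on as a message  */
    m->full = 1;
}
```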
[0]http://www.eecg.toronto.edu/~tsa/papers/wasp03.pdf
[1]https://tspace.library.utoronto.ca/bitstream/1807/11134/1/Capalija_Davor_200806_MASc_thesis.pdf
[2]https://www.tdx.cat/bitstream/handle/10803/277376/TYF1de1.pdf
Distributed timesharing system
One or more workstations, each with an X server and/or a Wayland compositor. Multiple remote devices, each for different application software. The command-line terminal certainly should be on the display-server workstations.
A multi-server microkernel would be great, ideally distributed more like Amoeba than Sprite. Desire a combination of object-based and page-based distributed shared memory.
The terminals or workstations may have an instant-on operating system stored in flash memory.
LightDM would be better than SDDM (Simple Desktop Display Manager) or LXDM, either of which would be better than XDM (X Display Manager). The distributed shared memory's coherence should be directory-based rather than snooping.
Have two or more additional devices for file servers.
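A hedged sketch of the directory-based, page-granularity coherence mentioned above; the message-sending helpers are assumed, not real.

```c
/* Hedged sketch: one directory entry per shared page, so coherence traffic
 * goes point-to-point to the nodes listed in the entry instead of being
 * broadcast and snooped. */
#include <stdint.h>

enum page_state { PAGE_INVALID, PAGE_SHARED, PAGE_EXCLUSIVE };

struct dir_entry {
    enum page_state state;
    uint16_t owner;               /* node holding the writable copy            */
    uint64_t sharers;             /* bitmask of nodes with read-only copies    */
};

extern void send_invalidate(uint16_t node, uint64_t page);   /* assumed helper */
extern void send_copy(uint16_t node, uint64_t page);         /* assumed helper */

/* Home-node handler for a write fault on `page` raised by node `writer`. */
static void handle_write_fault(struct dir_entry *d, uint64_t page, uint16_t writer)
{
    for (uint16_t n = 0; n < 64; n++)
        if (((d->sharers >> n) & 1) && n != writer)
            send_invalidate(n, page);       /* point-to-point, no broadcast     */
    send_copy(writer, page);
    d->state = PAGE_EXCLUSIVE;
    d->owner = writer;
    d->sharers = 1ull << writer;
}
```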
MIMD distributed parallel computer
Shared-disk: SCSI HDD arrays would be shared while NVMe flash would be local.
Message passing between nodes. Each node would have a couple or several processor sockets sharing memory. Desire interleaved memory in a dancehall configuration for each node. Desire hybrid CUDA/OpenMP/MPI, funneled, with full load balancing. A RISC processor like DLX/MIPS/Gullwing would be desired; otherwise AMD Opteron or EPYC would be great, or alternatively an ARM processor. Desire interleaved multithreading to avoid stalls.
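A minimal sketch of the funneled hybrid style mentioned above, using standard MPI and OpenMP: MPI between nodes, OpenMP threads within a node, and only the main thread making MPI calls.

```c
/* Hedged sketch: funneled MPI + OpenMP. Compile with e.g. `mpicc -fopenmp`. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* threads within the node   */
    for (int i = rank; i < 1000000; i += nranks)
        local += 1.0 / (1.0 + i);

    double total = 0.0;                           /* only the main thread calls MPI */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f across %d ranks\n", total, nranks);

    MPI_Finalize();
    return 0;
}
```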
Desire byte-addressable NVMe flash and block-addressable SCSI HDDs. Prefer RAID4, RAID5, RAID6, and RAID10.
Dedicated servers for distributed task scheduling, issuing tasks to the compute nodes. Compute nodes pass data to one another as messages. Have one server to deploy and manage system resources and infrastructure services, partition them into domains, grant users and groups access to them, and delegate them to particular applications.
Could have a separate control network of star, snowflake, Clos, or fat-tree topology for task scheduling, while the data network, a structured partial mesh, would be for passing data.
Transaction processing, business intelligence, and Big Data analytics
Two tiers: the upper one an OLTP or HTAP database for processing refined, structured data, backed by a lower tier which would be a data lake or lakehouse for big data analytics of raw, unstructured, and semi-structured data.
Desire a database like LeanXcale; another idea is Greenplum. Hadoop and NoSQL for the backing store. Desire a dual-model design with a key-value engine written in Go for transaction processing and a relational columnar engine written in C, C++, Modula-3, or Rust for analytics.
Massively parallel processing (MPP) for data processing, ideally with a coarse-grained reconfigurable array for storage offload. Better yet, elastic parallel processing like Snowflake. Desire hybrid transactional/analytical processing (HTAP).
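A hedged sketch of the dual-model idea, written in C for illustration (the note above suggests Go for the key-value side): the same logical table kept as keyed rows for transactional point access and as column arrays for analytical scans.

```c
/* Hedged sketch: row layout for OLTP, columnar layout for OLAP, same data. */
#include <stdint.h>
#include <stddef.h>

/* Transactional side: one record per key, as a key-value engine would store it. */
struct order_row {
    uint64_t order_id;            /* key */
    uint32_t customer_id;
    uint32_t item_id;
    double   amount;
};

/* Analytical side: the same data as parallel column arrays, friendly to
 * scans, vectorized filters, and aggregates over one attribute at a time. */
struct order_columns {
    size_t    n;
    uint64_t *order_id;
    uint32_t *customer_id;
    uint32_t *item_id;
    double   *amount;
};

/* Example analytical query: total revenue, touching only the amount column. */
static double total_revenue(const struct order_columns *c)
{
    double sum = 0.0;
    for (size_t i = 0; i < c->n; i++)
        sum += c->amount[i];
    return sum;
}
```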
Only an OLTP engine like VoltDB or CockroachDB should run in a Docker or Kubernetes container. OLAP should be native and bare-metal.
Rather than HBase on HDFS, have a Java wide-column store on top of a C/C++ distributed filesystem.
Snowflake schema would be better than star schema. Perhaps better yet would be a fact constellation.
Ideals
Interrupt-driven, port-mapped I/O for USB mouse and keyboard. Direct memory access for FireWire cameras and recorders. Printers and scanners may use either programmed or interrupt-driven I/O, in which case they would be USB; direct memory access, in which case they would be FireWire; or channel I/O, in which case they would be SCSI.
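A hedged sketch of the underlying access primitives on x86: port-mapped I/O goes through the separate I/O address space with in/out instructions, while memory-mapped I/O is an ordinary volatile load or store. A real driver would get its port numbers and MMIO addresses from bus enumeration.

```c
/* Hedged sketch: port-mapped vs. memory-mapped register access on x86. */
#include <stdint.h>

/* Port-mapped: reads and writes in the dedicated I/O address space. */
static inline uint8_t inb(uint16_t port)
{
    uint8_t v;
    __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

static inline void outb(uint16_t port, uint8_t v)
{
    __asm__ volatile ("outb %0, %1" : : "a"(v), "Nd"(port));
}

/* Memory-mapped: a device register that is simply a volatile memory access. */
static inline uint32_t mmio_read32(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

static inline void mmio_write32(uintptr_t addr, uint32_t v)
{
    *(volatile uint32_t *)addr = v;
}
```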
Stack machines may have a dedicated return stack in addition to the data stack, and should have top-of-stack and next-on-stack registers, a couple of 3-port register banks each with two read ports and one write port, and register-style load/store instructions. Stack machines should have a power-of-two number of stacks, i.e. two, four, or (less likely) eight.
Only 0- and 1-operand machines should use stacks for expression evaluation. Even general-purpose register machines certainly should use stacks to store return addresses and possibly subroutine parameters and local variables. No general-purpose register machine, load/store or register-memory, should use stacks for expression evaluation.
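A hedged sketch of a tiny 0-operand stack machine with those features: separate data and return stacks, and the top of stack (TOS) and next on stack (NOS) held in registers rather than in memory. The instruction set is illustrative.

```c
/* Hedged sketch: a minimal 0-operand stack machine interpreter. */
#include <stdint.h>

enum op { OP_LIT, OP_ADD, OP_CALL, OP_RET, OP_HALT };

struct insn { enum op op; int32_t operand; };

static int32_t run(const struct insn *code)
{
    int32_t tos = 0, nos = 0;          /* top-of-stack and next-on-stack registers */
    int32_t dstack[64]; int dsp = 0;   /* rest of the data stack, in memory        */
    int32_t rstack[64]; int rsp = 0;   /* dedicated return stack                   */
    int32_t pc = 0;

    for (;;) {
        struct insn i = code[pc++];
        switch (i.op) {
        case OP_LIT:  dstack[dsp++] = nos; nos = tos; tos = i.operand; break;
        case OP_ADD:  tos = nos + tos; nos = dstack[--dsp];            break;
        case OP_CALL: rstack[rsp++] = pc; pc = i.operand;              break;
        case OP_RET:  pc = rstack[--rsp];                              break;
        case OP_HALT: return tos;
        }
    }
}
```

For example, the program {OP_LIT 2, OP_LIT 3, OP_ADD, OP_HALT} leaves 5 in TOS, and calls and returns never touch the data stack because return addresses live on the separate return stack.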
USB should be port-mapped, especially keyboard and mouse. While flash memory should have memory-mapped read and port-mapped write, HDD storage should be channel I/O.